Legal

Contract Q&A (RAG grounded)

Answer contract questions grounded in the actual contract text.

task.rag_answer_with_citations, task.kb_navigation
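As an illustration of what "grounded in the actual contract text" can mean in practice, here is a minimal sketch of a citation-grounded flow: retrieve candidate clauses, build a prompt that asks for bracketed clause references, then check that every cited reference was actually retrieved. Everything in it is an assumption made for illustration (the `Clause` type, the word-overlap retriever, the `[ref]` citation format, and all function names); it is not how any benchmark or model listed on this page works.

```python
import re
from dataclasses import dataclass

# Small illustrative stopword list; a real system would use a proper retriever.
STOPWORDS = {"a", "an", "and", "by", "how", "is", "much", "of", "the", "this", "to", "with"}


@dataclass(frozen=True)
class Clause:
    ref: str   # hypothetical clause reference, e.g. "4.2"
    text: str  # clause text as it appears in the contract


def tokens(text: str) -> set[str]:
    """Lowercased words with the illustrative stopwords removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def retrieve(question: str, clauses: list[Clause], k: int = 2) -> list[Clause]:
    """Return up to k clauses sharing at least one content word with the question."""
    q = tokens(question)
    scored = [(len(q & tokens(c.text)), c) for c in clauses]
    matching = [pair for pair in scored if pair[0] > 0]
    matching.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in matching[:k]]


def build_prompt(question: str, evidence: list[Clause]) -> str:
    """Assemble a prompt that asks the model to cite clause references in brackets."""
    context = "\n".join(f"[{c.ref}] {c.text}" for c in evidence)
    return (
        "Answer using only the clauses below and cite each one you rely on "
        "as [ref].\n\n" + context + "\n\nQuestion: " + question
    )


def citations_are_grounded(answer: str, evidence: list[Clause]) -> bool:
    """True if the answer cites at least one clause and every cited ref was retrieved."""
    cited = set(re.findall(r"\[([^\]]+)\]", answer))
    return bool(cited) and cited <= {c.ref for c in evidence}


if __name__ == "__main__":
    contract = [
        Clause("4.2", "Either party may terminate this agreement with thirty days written notice."),
        Clause("7.1", "The supplier shall indemnify the customer against third party claims."),
    ]
    question = "How much notice is required to terminate the agreement?"
    evidence = retrieve(question, contract)
    print(build_prompt(question, evidence))
```

The final check is deliberately strict: an answer with no citations, or with a citation to a clause that was never retrieved, is treated as ungrounded. Whether a cited clause actually supports the answer is a separate question that this sketch does not address.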

Evidence quality is currently limited for this use case. The rankings below are useful for exploration but do not support a strong claim about which model is best.

Provisional leader

gemini-3.1-pro-preview

Current leader based on limited benchmark evidence. Treat this ranking as directional until coverage improves.

Best benchmark score: 32.4%

Confidence: 37.1%

All ranked models: top 3

🥇 gemini-3.1-pro-preview: 32.4%

🥈 gemini-2.5-pro: 31.6%

🥉 gpt-5-mini-2025-08-07: 30.5%

Ranked Models

Evidence Quality

84%

Evidence Points

Top Signal

SimpleQA Verified: simpleqa_verified_score_pct

All Ranked Models

30 of 30 models

Rank	Model	Strong on (source: metric)	Score	Confidence	Price / 1M	Evidence sources
🥇	gemini-3.1-pro-preview	SimpleQA Verified: simpleqa_verified_score_pct; FACTS Benchmark Suite: facts_grounding_score_pct	32.4%	37%	$4.50	SimpleQA Verified, FACTS Benchmark Suite
🥈	gemini-2.5-pro	FACTS Benchmark Suite: facts_grounding_score_pct; LEXam Leaderboard: average_score_pct	31.6%	50%	$3.44	FACTS Benchmark Suite, LEXam Leaderboard
🥉	gpt-5-mini-2025-08-07	Vals Case Law v2: overall_accuracy_pct; FACTS Benchmark Suite: facts_grounding_score_pct	30.5%	45%	n/a	Vals Case Law v2, FACTS Benchmark Suite
#4	gpt-5-2025-08-07	FACTS Benchmark Suite: facts_grounding_score_pct; LEXam Leaderboard: average_score_pct	30.1%	39%	n/a	FACTS Benchmark Suite, LEXam Leaderboard
#5	claude-sonnet-4	Galileo Agent Leaderboard v2: Avg TSQ; Vals Legal Bench: overall_accuracy_pct	27.5%	39%	$6.00	Galileo Agent Leaderboard v2, Vals Legal Bench
#6	gemini-3-pro-preview	SimpleQA Verified: simpleqa_verified_score_pct; Vals Legal Bench: overall_accuracy_pct	27.2%	38%	$4.50	SimpleQA Verified, Vals Legal Bench
#7	gemini-2.5-flash	FACTS Benchmark Suite: facts_grounding_score_pct; Galileo Agent Leaderboard v2: Avg TSQ	24.7%	36%	$0.17	FACTS Benchmark Suite, Galileo Agent Leaderboard v2
#8	gpt-4.1-20250414	Vals Case Law v2: overall_accuracy_pct; Vectara HHEM Leaderboard: overall_hallucination_error_pct	24.6%	35%	n/a	Vals Case Law v2, Vectara HHEM Leaderboard
#9	Grok-4-0709	Vals Legal Bench: overall_accuracy_pct; Vals Case Law v2: overall_accuracy_pct	24.4%	35%	n/a	Vals Legal Bench, Vals Case Law v2
#10	gemini-3-flash-preview	Vals Legal Bench: overall_accuracy_pct; FACTS Benchmark Suite: facts_grounding_score_pct	24.2%	33%	$1.13	Vals Legal Bench, FACTS Benchmark Suite
#11	claude-sonnet-4.6	Vals Finance Agent: overall_accuracy_pct; Vals Legal Bench: overall_accuracy_pct	23.6%	30%	$6.00	Vals Finance Agent, Vals Legal Bench
#12	gemini-3.1-flash-lite-preview	FACTS Benchmark Suite: facts_grounding_score_pct; Vals Legal Bench: overall_accuracy_pct	22.9%	33%	$0.56	FACTS Benchmark Suite, Vals Legal Bench
#13	gpt-5.4-2026-03-05	Vals Legal Bench: overall_accuracy_pct; Vectara HHEM Leaderboard: overall_hallucination_error_pct	22.6%	28%	n/a	Vals Legal Bench, Vectara HHEM Leaderboard
#14	gpt-5.2-2025-12-11	FACTS Benchmark Suite: facts_grounding_score_pct; Vals Legal Bench: overall_accuracy_pct	22.4%	27%	n/a	FACTS Benchmark Suite, Vals Legal Bench
#15	claude-opus-4-5-20251101	FACTS Benchmark Suite: facts_grounding_score_pct; Vals Legal Bench: overall_accuracy_pct	21.1%	31%	n/a	FACTS Benchmark Suite, Vals Legal Bench
#16	grok-4-fast-reasoning	Vals CorpFin v2: overall_accuracy_pct; Vals Legal Bench: overall_accuracy_pct	20.6%	37%	$0.28	Vals CorpFin v2, Vals Legal Bench
#17	gpt-5.1-2025-11-13	Vals Case Law v2: overall_accuracy_pct; Vals Legal Bench: overall_accuracy_pct	19.8%	29%	n/a	Vals Case Law v2, Vals Legal Bench
#18	grok-4-1-fast-reasoning	Vals Legal Bench: overall_accuracy_pct; Vals CorpFin v2: overall_accuracy_pct	18.2%	27%	$0.28	Vals Legal Bench, Vals CorpFin v2
#19	o3-20250416	Vals Legal Bench: overall_accuracy_pct; SimpleQA Verified: simpleqa_verified_score_pct	16.7%	27%	$3.50	Vals Legal Bench, SimpleQA Verified
#20	grok-3	Vectara HHEM Leaderboard: overall_hallucination_error_pct; Vals Legal Bench: overall_accuracy_pct	15.9%	22%	$6.00	Vectara HHEM Leaderboard, Vals Legal Bench
#21	deepseek-r1	SYCON Bench (Table 2): sycon_unethical_tof_pct; LEXam Leaderboard: average_score_pct	15.3%	25%	$0.27	SYCON Bench (Table 2), LEXam Leaderboard
#22	claude-opus-4-6-thinking	Vals Legal Bench: overall_accuracy_pct; Vals CorpFin v2: overall_accuracy_pct	15.1%	17%	n/a	Vals Legal Bench, Vals CorpFin v2
#23	mistral-large-2512	Vals Legal Bench: overall_accuracy_pct; Vals CorpFin v2: overall_accuracy_pct	14.7%	24%	n/a	Vals Legal Bench, Vals CorpFin v2
#24	claude-opus-4-1-20250805	Vals Legal Bench: overall_accuracy_pct; FACTS Benchmark Suite: facts_grounding_score_pct	14.5%	25%	n/a	Vals Legal Bench, FACTS Benchmark Suite
#25	claude-opus-4-5-20251101-thinking	Vals Legal Bench: overall_accuracy_pct; Vals Finance Agent: overall_accuracy_pct	14.3%	17%	n/a	Vals Legal Bench, Vals Finance Agent
#26	gpt-4.1	LEXam Leaderboard: average_score_pct; LanguageBench: translation_to:bleu	13.7%	17%	$3.50	LEXam Leaderboard, LanguageBench
#27	claude-sonnet-4-5-20250929-thinking	Vals Legal Bench: overall_accuracy_pct; Vals Finance Agent: overall_accuracy_pct	13.5%	17%	n/a	Vals Legal Bench, Vals Finance Agent
#28	grok-4-1-fast-non-reasoning	Vals Legal Bench: overall_accuracy_pct; Vals Finance Agent: overall_accuracy_pct	13.5%	23%	$0.28	Vals Legal Bench, Vals Finance Agent
#31	glm-5-thinking	Vals Legal Bench: overall_accuracy_pct; Vals CorpFin v2: overall_accuracy_pct	12.8%	20%	n/a	Vals Legal Bench, Vals CorpFin v2
#32	deepseek-v3	Vectara HHEM Leaderboard: overall_hallucination_error_pct; SYCON Bench (Table 2): sycon_unethical_tof_pct	12.6%	19%	n/a	Vectara HHEM Leaderboard, SYCON Bench (Table 2)

Ranking diagnostics & missing models

Source lift

Ranked

Sources

Quality

Low

Vals CorpFin v2: 44 rows · 1.2% avg lift

Vals Legal Bench: 44 rows · 1.8% avg lift

Vals Finance Agent: 31 rows · 1.1% avg lift

Vals Case Law v2: 30 rows · 1.3% avg lift

Missing frontier models

No obvious gaps right now.

Taxonomy & task details

Core tasks

task.rag_answer_with_citations, task.kb_navigation

Required modes

mode.citations

Domains

domain.legal_contracts
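The taxonomy entries above can be read as a small record: a use case names its core tasks, the modes a model must support, and its domains. The sketch below is one hypothetical way to hold that record and check a model against the required modes. The `UseCase` class, its field names, the `supports` helper, and the `mode.json` identifier are assumptions for illustration only; just the `task.`, `mode.`, and `domain.` identifiers for this use case come from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseCase:
    # Field names are hypothetical; values mirror the taxonomy listed on this page.
    name: str
    core_tasks: tuple[str, ...]
    required_modes: tuple[str, ...]
    domains: tuple[str, ...]


CONTRACT_QA = UseCase(
    name="Contract Q&A (RAG grounded)",
    core_tasks=("task.rag_answer_with_citations", "task.kb_navigation"),
    required_modes=("mode.citations",),
    domains=("domain.legal_contracts",),
)


def supports(use_case: UseCase, model_modes: set[str]) -> bool:
    """True if the model offers every mode the use case requires."""
    return set(use_case.required_modes) <= model_modes


if __name__ == "__main__":
    # "mode.json" is a made-up mode identifier used only to show the subset check.
    print(supports(CONTRACT_QA, {"mode.citations", "mode.json"}))
    print(supports(CONTRACT_QA, {"mode.json"}))
```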

Related in Legal

Contract Drafting & Redlining

Drafting, reviewing, and suggesting edits to legal contracts and agreements.

Regulatory summary

Summarize and compare regulatory text with conservative interpretation.

Contract redline summary

Summarize material changes between contract versions with clause refs.

Clause playbook check

Check extracted terms against a playbook and flag deviations.