Legal translation

Translate legal text with terminology consistency and format safety.

task.translate_technical, task.glossary_terminology_consistency

Evidence quality is currently limited for this use case. Treat the rankings below as exploratory, not as a strong claim about which model wins.

Provisional leader

claude-sonnet-4

Current leader based on limited benchmark evidence. Treat this ranking as directional until coverage improves.

Best benchmark score: 33.3%

Confidence: 40.6%

All ranked models — top 3

🥇 claude-sonnet-4: 33.3%

🥈 gemini-2.5-flash: 30.4%

🥉 gemini-2.5-pro: 25.3%

Ranked Models

Evidence Quality

81%

Evidence Points

Top Signal

LanguageBench Translation Official (Split): translation_to:bleu

All Ranked Models

30 of 30 models

Rank	Model	Strong on	Score	Confidence	Price / 1M	Evidence sources
🥇	claude-sonnet-4	LanguageBench Translation Official (Split) translation_to:bleu and LanguageBench overall:mean	33.3%	41%	$6.00	LanguageBench Translation Official (Split), LanguageBench
🥈	gemini-2.5-flash	LanguageBench Translation Official (Split) translation_to:bleu and LanguageBench overall:mean	30.4%	36%	$0.17	LanguageBench Translation Official (Split), LanguageBench
#4	gemini-2.5-pro	LEXam Leaderboard average_score_pct and Galileo Agent Leaderboard v2 Avg TSQ	25.3%	50%	$3.44	LEXam Leaderboard, Galileo Agent Leaderboard v2
#5	gpt-4.1	LanguageBench Translation Official (Split) translation_to:bleu and LanguageBench overall:mean	24.7%	29%	$3.50	LanguageBench Translation Official (Split), LanguageBench
#6	gpt-4.1-20250414	Vals Case Law v2 overall_accuracy_pct and Vals Legal Bench overall_accuracy_pct	23.8%	33%	—	Vals Case Law v2, Vals Legal Bench
#7	gpt-5-mini-2025-08-07	Vals Case Law v2 overall_accuracy_pct and Vals Legal Bench overall_accuracy_pct	23.6%	31%	—	Vals Case Law v2, Vals Legal Bench
#8	gpt-5-2025-08-07	LEXam Leaderboard average_score_pct and Vals Legal Bench overall_accuracy_pct	23.4%	28%	—	LEXam Leaderboard, Vals Legal Bench
#10	gemini-2.0-flash-001	LanguageBench Translation Official (Split) translation_to:bleu and LanguageBench overall:mean	22.6%	26%	—	LanguageBench Translation Official (Split), LanguageBench
#11	Claude-3.5-Sonnet	LanguageBench Translation Official (Split) translation_to:bleu and LanguageBench overall:mean	21.8%	28%	$6.00	LanguageBench Translation Official (Split), LanguageBench
#13	deepseek-r1	LanguageBench Translation Official (Split) translation_to:bleu and LEXam Leaderboard average_score_pct	19.3%	36%	$0.27	LanguageBench Translation Official (Split), LEXam Leaderboard
#15	gemini-3.1-pro-preview	Vals Legal Bench overall_accuracy_pct and Vals Case Law v2 overall_accuracy_pct	18.8%	22%	$4.50	Vals Legal Bench, Vals Case Law v2
#17	gemini-3-pro-preview	Vals Legal Bench overall_accuracy_pct and LEXam Leaderboard average_score_pct	17.4%	24%	$4.50	Vals Legal Bench, LEXam Leaderboard
#21	Grok-4-0709	Vals Legal Bench overall_accuracy_pct and Vals Case Law v2 overall_accuracy_pct	15.7%	22%	—	Vals Legal Bench, Vals Case Law v2
#24	gpt-5.4-2026-03-05	Vals Legal Bench overall_accuracy_pct and Vals Case Law v2 overall_accuracy_pct	14.9%	18%	—	Vals Legal Bench, Vals Case Law v2
#26	claude-sonnet-4.6	Vals Legal Bench overall_accuracy_pct and Vals Case Law v2 overall_accuracy_pct	14.6%	18%	$6.00	Vals Legal Bench, Vals Case Law v2
#27	gpt-4.1-mini-20250414	Vals Legal Bench overall_accuracy_pct and OpenVLM OCRBench Official ocrbench_score_pct	14.4%	21%	—	Vals Legal Bench, OpenVLM OCRBench Official
#28	gemini-3-flash-preview	Vals Legal Bench overall_accuracy_pct and Vectara HHEM Leaderboard law_hallucination_error_pct	14.1%	19%	$1.13	Vals Legal Bench, Vectara HHEM Leaderboard
#29	grok-4-fast-reasoning	Vals Legal Bench overall_accuracy_pct and Vals Case Law v2 overall_accuracy_pct	13.2%	21%	$0.28	Vals Legal Bench, Vals Case Law v2
#30	Llama-3.3-70B-Instruct	LanguageBench Translation Official (Split) translation_to:bleu and LanguageBench overall:mean	13.2%	20%	—	LanguageBench Translation Official (Split), LanguageBench
#31	gpt-5.1-2025-11-13	Vals Case Law v2 overall_accuracy_pct and Vals Legal Bench overall_accuracy_pct	13.1%	16%	—	Vals Case Law v2, Vals Legal Bench
#32	gemini-3.1-flash-lite-preview	Vals Legal Bench overall_accuracy_pct and Vectara HHEM Leaderboard law_hallucination_error_pct	13.1%	19%	$0.56	Vals Legal Bench, Vectara HHEM Leaderboard
#33	Llama-3.1-70B-Instruct	LanguageBench Translation Official (Split) translation_to:bleu and LanguageBench overall:mean	13.1%	23%	—	LanguageBench Translation Official (Split), LanguageBench
#35	claude-opus-4-5-20251101	Vals Legal Bench overall_accuracy_pct and Vectara HHEM Leaderboard law_hallucination_error_pct	12.8%	16%	—	Vals Legal Bench, Vectara HHEM Leaderboard
#36	gpt-5.2-2025-12-11	Vals Legal Bench overall_accuracy_pct and Vals Case Law v2 overall_accuracy_pct	12.6%	15%	—	Vals Legal Bench, Vals Case Law v2
#37	grok-4-1-fast-reasoning	Vals Legal Bench overall_accuracy_pct and Vals Case Law v2 overall_accuracy_pct	11.5%	17%	$0.28	Vals Legal Bench, Vals Case Law v2
#38	gpt-4o	LEXam Leaderboard average_score_pct and OpenVLM OCRBench Official ocrbench_score_pct	11.4%	17%	$0.26	LEXam Leaderboard, OpenVLM OCRBench Official
#39	phi-4	LanguageBench overall:mean and Vectara HHEM Leaderboard law_hallucination_error_pct	11.2%	25%	—	LanguageBench, Vectara HHEM Leaderboard
#43	claude-opus-4-1-20250805	Vals Legal Bench overall_accuracy_pct and Vectara HHEM Leaderboard law_hallucination_error_pct	10.8%	15%	—	Vals Legal Bench, Vectara HHEM Leaderboard
#44	o3-20250416	Vals Legal Bench overall_accuracy_pct and SimpleQA Verified simpleqa_verified_score_pct	10.5%	15%	$3.50	Vals Legal Bench, SimpleQA Verified
#45	mistral-large-2512	Vals Legal Bench overall_accuracy_pct and Vals Case Law v2 overall_accuracy_pct	10.4%	16%	—	Vals Legal Bench, Vals Case Law v2

Ranking diagnostics & missing models

Source lift

Ranked

Sources

Quality

Low

Source	Rows	Avg lift
Vals Legal Bench	44	2.2%
Vals LiveCodeBench	40	0.3%
Vals Tax Eval v2	38	0.3%
Vals MedQA	37	0.3%

Missing frontier models

No obvious gaps right now.

Taxonomy & task details

Core tasks

task.translate_technical, task.glossary_terminology_consistency

Required modes

mode.multilingual, mode.format_preservation

Domains

domain.legal_regulatory, domain.legal_contracts

Related in Legal

Contract Drafting & Redlining

Drafting, reviewing, and suggesting edits to legal contracts and agreements.

Contract Q&A (RAG grounded)

Answer contract questions grounded in the actual contract text.

Regulatory summary

Summarize and compare regulatory text with conservative interpretation.

Contract redline summary

Summarize material changes between contract versions with clause refs.