BasedAGI
Rankings live

history_linguistics

Archaic and historical translation

Translate older or domain-specific language into modern equivalents.

#1 Recommendation

gemini-2.5-flash

Strong on LanguageBench Translation Official (Split) translation_to:bleu (92%) and LanguageBench overall:mean (100%)

external/google/gemini-2-5-flash

33.4%

Score

38.2%

Confidence

Limited benchmark evidence for this use case.

33 ranked models with average evidence of 14.1 points. Rankings may shift as more benchmark data is ingested.

Ranked Models

30

Evidence Quality

82%

Scoring

Benchmark-backed

Top Signal

LanguageBench Translation Official (Split): translation_to:bleu

All Ranked Models

30 of 30

Rank  Model                                 Score
#1    gemini-2.5-flash                      33.4%
#5    google/gemini-2.0-flash-001           27.6%
#15   gpt-4.1-20250414                      20.0%
#17   Llama-3.1-70B-Instruct                19.0%
#18   Llama-3.3-70B-Instruct                18.9%
#20   gpt-4.1-mini-20250414                 16.4%
#22   gemini-2.5-pro                        14.7%
#26   deepseek/deepseek-r1                  14.0%
#30   gpt-5-2025-08-07                      13.0%
#33   gemini-3-pro-preview                  12.3%
#34   gpt-5-mini-2025-08-07                 12.1%
#36   google/gemini-3.1-pro-preview         11.4%
#38   phi-4                                 11.3%
#43   claude-sonnet-4-20250514              10.8%
#48   openai/gpt-5.4-2026-03-05             10.4%
#54   gpt-5.1-2025-11-13                    10.1%
#56   claude-opus-4-5-20251101              9.9%
#57   gemini-3-flash-preview                9.7%
#59   Grok-4-0709                           9.6%
#71   kimi/kimi-k2.5-thinking               8.6%
#73   gpt-4o                                8.5%
#80   google/gemini-3.1-flash-lite-preview  8.3%
#86   GPT-4.1-nano-2025-04-14               8.0%
#98   Qwen-VL-Chat                          7.3%
#109  gpt-4o-2024-08-06                     6.7%
#112  o4-mini-20250416                      6.6%
#115  openai/gpt-4o-mini-2024-07-18         6.4%
#144  qwen-2.5-72b-instruct                 5.1%
#154  Phi-4-multimodal-instruct             3.7%
#155  Meta-Llama-3-8B-Instruct              3.6%

Compare Models

Model A leads by +5.8 percentage points


Model A

gemini-2.5-flash

external/google/gemini-2-5-flash

33.4%

Rank #1

Confidence 38.2% · 20 evidence pts

LanguageBench Translation Official (Split): translation_to:bleu

Value 92.0% · Conf 100.0% · Weight 7.2%

languagebench_translation_official.translation_to_bleu (Mar 12, 2026)

LanguageBench: overall:mean

Value 100.0% · Conf 100.0% · Weight 5.4%

languagebench.overall_mean (Mar 12, 2026)

LanguageBench: translation_to:bleu

Value 92.0% · Conf 100.0% · Weight 2.7%

languagebench.translation_to_bleu (Mar 12, 2026)

LanguageBench Grammar/Clarity Official (Split): grammar_clarity_score_pct

Value 100.0% · Conf 100.0% · Weight 2.2%

languagebench_grammar_clarity_official.grammar_clarity_score_pct (Mar 12, 2026)

Model B

google/gemini-2.0-flash-001

external/google/gemini-2-0-flash-001

27.6%

Rank #5

Confidence 32.1% · 15 evidence pts

LanguageBench Translation Official (Split): translation_to:bleu

Value 88.0% · Conf 100.0% · Weight 6.9%

languagebench_translation_official.translation_to_bleu (Mar 12, 2026)

LanguageBench: overall:mean

Value 99.9% · Conf 100.0% · Weight 5.4%

languagebench.overall_mean (Mar 12, 2026)

LanguageBench: translation_to:bleu

Value 88.0% · Conf 100.0% · Weight 2.6%

languagebench.translation_to_bleu (Mar 12, 2026)

LanguageBench Grammar/Clarity Official (Split): grammar_clarity_score_pct

Value 95.9% · Conf 100.0% · Weight 2.1%

languagebench_grammar_clarity_official.grammar_clarity_score_pct (Mar 12, 2026)
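
The Value, Conf, and Weight figures listed for each model can be combined into a rough per-signal contribution. The sketch below is illustrative only: it assumes each signal contributes value × confidence × weight, which this page does not confirm, and the names Signal and partial_contribution are hypothetical. Only the four signals shown for each model are included, out of 20 and 15 evidence points respectively, so the sums are not expected to match the displayed scores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    name: str
    value: float       # benchmark value as a fraction (0.92 means 92.0%)
    confidence: float  # signal confidence as a fraction
    weight: float      # signal weight as a fraction


def partial_contribution(signals):
    """Sum value * confidence * weight over the listed signals.

    ASSUMPTION: this multiplicative formula is a guess for illustration,
    not the site's documented scoring rule.
    """
    return sum(s.value * s.confidence * s.weight for s in signals)


# Figures copied from the Model A and Model B panels above.
model_a = [
    Signal("languagebench_translation_official.translation_to_bleu", 0.920, 1.0, 0.072),
    Signal("languagebench.overall_mean", 1.000, 1.0, 0.054),
    Signal("languagebench.translation_to_bleu", 0.920, 1.0, 0.027),
    Signal("languagebench_grammar_clarity_official.grammar_clarity_score_pct", 1.000, 1.0, 0.022),
]
model_b = [
    Signal("languagebench_translation_official.translation_to_bleu", 0.880, 1.0, 0.069),
    Signal("languagebench.overall_mean", 0.999, 1.0, 0.054),
    Signal("languagebench.translation_to_bleu", 0.880, 1.0, 0.026),
    Signal("languagebench_grammar_clarity_official.grammar_clarity_score_pct", 0.959, 1.0, 0.021),
]

contrib_a = partial_contribution(model_a)
contrib_b = partial_contribution(model_b)

# The headline lead is the difference of the two displayed scores.
score_lead = round(33.4 - 27.6, 1)

print(f"Model A, listed signals: {contrib_a:.1%}")
print(f"Model B, listed signals: {contrib_b:.1%}")
print(f"Lead on displayed score: {score_lead} points")
```

Under that assumption the four listed signals account for roughly 16.7 points of Model A's 33.4% and 15.8 points of Model B's 27.6%. The remainder would have to come from evidence points not shown in the comparison panel.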

Ranking Diagnostics & Missing Models

Source Lift

Ranked

33

Sources

8

Quality

Insufficient

Icelandic LLM Leaderboard

icelandic_llm_leaderboard

22 rows

1.3% avg lift

Vals Tax Eval v2

vals_tax_eval_v2

19 rows

0.5% avg lift

Vals Mortgage Tax

vals_mortgage_tax

18 rows

0.5% avg lift

Vals MedQA

vals_medqa

18 rows

0.5% avg lift
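
One way to read the per-source figures together is a row-weighted average. The sketch below is a minimal illustration, assuming "avg lift" is a per-row mean so that sources can be pooled by row count; the page does not define the metric, and row_weighted_lift is a hypothetical helper. Only the four sources listed here are included, although the panel reports 8 sources.

```python
# (source id, rows, avg lift in percent), copied from the Source Lift panel.
sources = [
    ("icelandic_llm_leaderboard", 22, 1.3),
    ("vals_tax_eval_v2", 19, 0.5),
    ("vals_mortgage_tax", 18, 0.5),
    ("vals_medqa", 18, 0.5),
]


def row_weighted_lift(rows):
    """Pool per-source average lift, weighting each source by its row count.

    ASSUMPTION: treats each source's avg lift as a mean over its rows.
    """
    total_rows = sum(n for _, n, _ in rows)
    return sum(n * lift for _, n, lift in rows) / total_rows


total_rows = sum(n for _, n, _ in sources)
overall = row_weighted_lift(sources)
print(f"{total_rows} rows, pooled avg lift {overall:.2f}%")
```

Weighting by rows keeps a small source from counting as much as a large one. With these four sources the pooled figure sits between the 0.5% and 1.3% per-source values.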

Missing Strong Models

anthropic/claude-sonnet-4.6

external/anthropic/claude-sonnet-4-6

Rank #4

21.1%

Thin evidence after weighting

gpt-5.2-2025-12-11

external/openai/gpt-5-2-2025-12-11

Rank #16

16.2%

Thin evidence after weighting

anthropic/claude-opus-4-6-thinking

external/anthropic/claude-opus-4-6-thinking

Rank #17

16.1%

Thin evidence after weighting

xai-org/grok-4-fast-reasoning

external/xai-org/grok-4-fast-reasoning

Rank #18

15.7%

Thin evidence after weighting
Taxonomy Details

Core Tasks

task.translate_technical
task.glossary_terminology_consistency

Required Modes

mode.multilingual
mode.format_preservation

Domains

domain.history_linguistics
