BasedAGI

history_linguistics

Historical document summarization

Summarize historical documents into timelines and key entities.

#1 Recommendation

gemini-2.5-flash

Strong on LanguageBench overall:mean (100%) and LanguageBench Translation Official (Split) translation_to:bleu (92%)

external/google/gemini-2-5-flash

Score: 25.1%

Confidence: 29.1%

Limited benchmark evidence for this use case.

28 ranked models with average evidence of 13.9 points. Rankings may shift as more benchmark data is ingested.

Ranked Models

28

Evidence Quality

80%

Scoring

Benchmark-backed

Top Signal

LanguageBench: overall:mean

All Ranked Models

Rank  Model                           Score
#1    gemini-2.5-flash                25.1%
      Strong on LanguageBench overall:mean (100%) and LanguageBench Translation Official (Split) translation_to:bleu (92%)
#3    google/gemini-2.0-flash-001     20.8%
      Strong on LanguageBench overall:mean (100%) and LanguageBench Translation Official (Split) translation_to:bleu (88%)
#11   gpt-4.1-20250414                15.6%
#15   Llama-3.3-70B-Instruct          14.8%
#18   Llama-3.1-70B-Instruct          14.1%
#19   deepseek/deepseek-r1            12.2%
#23   gemini-3-pro-preview            10.8%
#28   gemini-2.5-pro                  10.1%
#31   google/gemini-3.1-pro-preview   10.1%
#33   gpt-4o-20241120                 9.9%
#38   claude-sonnet-4-20250514        9.6%
#40   gpt-5-2025-08-07                9.2%
#41   openai/gpt-5.4-2026-03-05       9.1%
#42   phi-4                           8.9%
#43   gpt-5.1-2025-11-13              8.9%
#44   claude-opus-4-5-20251101        8.7%
#45   gemini-3-flash-preview          8.5%
#46   Grok-4-0709                     8.4%
#47   gpt-5-mini-2025-08-07           8.3%
#48   qwen-2.5-72b-instruct           8.3%
#50   kimi/kimi-k2.5-thinking         7.6%
#52   gpt-4o-2024-08-06               7.2%
#54   openai/gpt-4o-mini-2024-07-18   6.8%
#61   Meta-Llama-3-8B-Instruct        3.7%
#62   Qwen2-72B-Instruct              3.3%
#63   Qwen3-30B-A3B                   2.7%
#64   Llama-3.1-8B-Instruct           2.4%
#65   Phi-4-multimodal-instruct       2.2%

Compare Models

Model A leads by +4.2%

Model A

gemini-2.5-flash

external/google/gemini-2-5-flash

25.1%

Rank #1

Confidence 29.1% · 19 evidence pts

LanguageBench: overall:mean

Value 100.0% · Conf 100.0% · Weight 4.5%

languagebench.overall_mean (Mar 12, 2026)

LanguageBench Translation Official (Split): translation_to:bleu

Value 92.0% · Conf 100.0% · Weight 4.3%

languagebench_translation_official.translation_to_bleu (Mar 12, 2026)

LanguageBench: translation_to:bleu

Value 92.0% · Conf 100.0% · Weight 2.3%

languagebench.translation_to_bleu (Mar 12, 2026)

LanguageBench Translation Official (Split): translation_to:chrf

Value 97.5% · Conf 100.0% · Weight 1.8%

languagebench_translation_official.translation_to_chrf (Mar 12, 2026)

Model B

google/gemini-2.0-flash-001

external/google/gemini-2-0-flash-001

20.8%

Rank #3

Confidence 25.2% · 14 evidence pts

LanguageBench: overall:mean

Value 99.9% · Conf 100.0% · Weight 4.5%

languagebench.overall_mean (Mar 12, 2026)

LanguageBench Translation Official (Split): translation_to:bleu

Value 88.0% · Conf 100.0% · Weight 4.1%

languagebench_translation_official.translation_to_bleu (Mar 12, 2026)

LanguageBench: translation_to:bleu

Value 88.0% · Conf 100.0% · Weight 2.2%

languagebench.translation_to_bleu (Mar 12, 2026)

LanguageBench Translation Official (Split): translation_to:chrf

Value 93.3% · Conf 100.0% · Weight 1.7%

languagebench_translation_official.translation_to_chrf (Mar 12, 2026)
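
The Value, Conf, and Weight figures listed for each model can be combined into a per-signal contribution. The page does not state how it aggregates them, so the sketch below assumes the simplest reading, contribution = value × conf × weight, and the `Signal` class and `contribution` function are hypothetical names introduced only for illustration. Only the four signals shown for each model are included (out of 19 and 14 evidence points), so the totals are partial sums and are not expected to reproduce the 25.1% and 20.8% scores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    """One benchmark signal as displayed in the comparison panel."""
    key: str
    value: float   # benchmark value as a fraction (100.0% -> 1.0)
    conf: float    # confidence as a fraction (100.0% -> 1.0)
    weight: float  # weight in percentage points, as displayed


def contribution(signal: Signal) -> float:
    # ASSUMPTION: the page does not document its formula; this multiplies
    # the three displayed figures as one plausible aggregation.
    return signal.value * signal.conf * signal.weight


# Figures copied from the Model A panel (gemini-2.5-flash).
MODEL_A = [
    Signal("languagebench.overall_mean", 1.000, 1.0, 4.5),
    Signal("languagebench_translation_official.translation_to_bleu", 0.920, 1.0, 4.3),
    Signal("languagebench.translation_to_bleu", 0.920, 1.0, 2.3),
    Signal("languagebench_translation_official.translation_to_chrf", 0.975, 1.0, 1.8),
]

# Figures copied from the Model B panel (google/gemini-2.0-flash-001).
MODEL_B = [
    Signal("languagebench.overall_mean", 0.999, 1.0, 4.5),
    Signal("languagebench_translation_official.translation_to_bleu", 0.880, 1.0, 4.1),
    Signal("languagebench.translation_to_bleu", 0.880, 1.0, 2.2),
    Signal("languagebench_translation_official.translation_to_chrf", 0.933, 1.0, 1.7),
]

total_a = sum(contribution(s) for s in MODEL_A)
total_b = sum(contribution(s) for s in MODEL_B)

print(f"Model A partial total: {total_a:.3f} points")
print(f"Model B partial total: {total_b:.3f} points")
```

Under this assumption the four listed signals contribute roughly 12.3 points for Model A and 11.6 points for Model B, so the remaining evidence points would have to account for the rest of each score.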

Ranking Diagnostics & Missing Models

Source Lift

Ranked

28

Sources

8

Quality

Insufficient

Source                     ID                         Rows  Avg lift
Icelandic LLM Leaderboard  icelandic_llm_leaderboard  17    1.2%
Vals Tax Eval v2           vals_tax_eval_v2           16    0.4%
Vals CorpFin v2            vals_corp_fin_v2           16    0.3%
Vals MedQA                 vals_medqa                 15    0.4%

Missing Strong Models

anthropic/claude-sonnet-4.6

external/anthropic/claude-sonnet-4-6

Rank #4

21.1%

Thin evidence after weighting

gpt-5.2-2025-12-11

external/openai/gpt-5-2-2025-12-11

Rank #16

16.2%

Thin evidence after weighting

anthropic/claude-opus-4-6-thinking

external/anthropic/claude-opus-4-6-thinking

Rank #17

16.1%

Thin evidence after weighting

xai-org/grok-4-fast-reasoning

external/xai-org/grok-4-fast-reasoning

Rank #18

15.7%

Thin evidence after weighting
Taxonomy Details

Core Tasks

task.summarize_doc, task.timeline_extraction

Required Modes

mode.long_context

Domains

domain.history_linguistics
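
The taxonomy entries above can be read as a small configuration record for this use case. The following is a minimal sketch of one way to represent it; the `UseCaseTaxonomy` class and the `missing_modes` helper are hypothetical and are not part of any BasedAGI API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseCaseTaxonomy:
    """Hypothetical container for the taxonomy fields shown on the page."""
    core_tasks: tuple[str, ...]
    required_modes: tuple[str, ...]
    domains: tuple[str, ...]


# Identifiers copied from the Taxonomy Details section.
HISTORICAL_DOC_SUMMARIZATION = UseCaseTaxonomy(
    core_tasks=("task.summarize_doc", "task.timeline_extraction"),
    required_modes=("mode.long_context",),
    domains=("domain.history_linguistics",),
)


def missing_modes(use_case: UseCaseTaxonomy, model_modes: set[str]) -> list[str]:
    # ASSUMPTION: a model qualifies only if it supports every required mode.
    # Returns the required modes the model lacks, in declaration order.
    return [m for m in use_case.required_modes if m not in model_modes]
```

For example, `missing_modes(HISTORICAL_DOC_SUMMARIZATION, {"mode.long_context"})` returns an empty list, while passing an empty set reports `mode.long_context` as missing.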

Related Use Cases