BasedAGI
Category: education

Language conversation partner

Conversational practice with gentle corrections and explanations.

#1 Recommendation

gemini-2.5-flash

Strongest signals: LanguageBench Translation Official (Split) translation_to:bleu at 92% and LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct at 100%.

external/google/gemini-2-5-flash

Score: 19.0%

Confidence: 21.7%

Limited benchmark evidence for this use case.

26 ranked models with average evidence of 12.6 points. Rankings may shift as more benchmark data is ingested.

Ranked Models: 26

Evidence Quality: 80%

Scoring: Benchmark-backed

Top Signal: LanguageBench Translation Official (Split): translation_to:bleu

All Ranked Models

Showing 26 of 26 models
Rank   Model                          Score
#4     gemini-2.5-flash               19.0%
#5     gpt-4.1-20250414               18.2%
#6     google/gemini-2.0-flash-001    17.3%
#10    gpt-4.1-mini-20250414          16.2%
#20    Llama-3.1-70B-Instruct         13.7%
#54    Llama-3.3-70B-Instruct         11.3%
#68    gemini-2.5-pro                 10.7%
#80    gpt-5-2025-08-07               10.0%
#81    google/gemini-3.1-pro-preview  10.0%
#92    Qwen-VL-Chat                    9.6%
#96    gpt-5-mini-2025-08-07           9.5%
#121   Arch-Agent-32B                  8.6%
#129   gpt-4o                          8.2%
#135   gemini-3-pro-preview            8.0%
#141   phi-4                           7.8%
#148   Grok-4-0709                     7.6%
#149   GPT-4.1-nano-2025-04-14         7.5%
#153   deepseek/deepseek-r1            7.4%
#170   kimi/kimi-k2.5-thinking         7.1%
#176   claude-sonnet-4-20250514        6.9%
#200   Arch-Agent-3B                   6.2%
#206   Arch-Agent-1.5B                 6.0%
#229   qwen-2.5-72b-instruct           4.9%
#254   Meta-Llama-3-8B-Instruct        3.8%
#256   Phi-4-multimodal-instruct       3.5%
#271   Qwen3-30B-A3B                   1.0%

Compare Models

Model A leads by +0.8%


Model A

gemini-2.5-flash

external/google/gemini-2-5-flash

19.0%

Rank #4

Confidence 21.7% · 18 evidence pts

LanguageBench Translation Official (Split): translation_to:bleu

Value 92.0% · Conf 100.0% · Weight 4.7%

languagebench_translation_official.translation_to_bleu (Mar 12, 2026)

LanguageBench Grammar/Clarity Official (Split): grammar_clarity_score_pct

Value 100.0% · Conf 100.0% · Weight 3.3%

languagebench_grammar_clarity_official.grammar_clarity_score_pct (Mar 12, 2026)

LanguageBench: overall:mean

Value 100.0% · Conf 100.0% · Weight 1.9%

languagebench.overall_mean (Mar 12, 2026)

LanguageBench: mmlu:accuracy

Value 94.1% · Conf 100.0% · Weight 1.7%

languagebench.mmlu_accuracy (Mar 12, 2026)

Model B

gpt-4.1-20250414

external/openai/gpt-4-1-20250414

18.2%

Rank #5

Confidence 28.5% · 23 evidence pts

OpenVLM TextVQA Official: textvqa_score_pct

Value 76.8% · Conf 100.0% · Weight 2.9%

openvlm_textvqa_official.textvqa_score_pct (Mar 12, 2026)

OpenVLM OCRBench Official: ocrbench_score_pct

Value 87.7% · Conf 100.0% · Weight 2.9%

openvlm_ocrbench_official.ocrbench_score_pct (Mar 12, 2026)

OpenVLM MTVQA Official: mtvqa_score_pct

Value 92.4% · Conf 100.0% · Weight 2.4%

openvlm_mtvqa_official.mtvqa_score_pct (Mar 12, 2026)

Galileo Agent Leaderboard v2: Avg AC

Value 100.0% · Conf 100.0% · Weight 1.3%

galileo_agent_v2.avg_ac (Mar 12, 2026)
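Each evidence entry above reports a Value, a Conf, and a Weight per benchmark. The exact aggregation BasedAGI uses is not published, but a plausible sketch is a confidence-discounted weighted mean of benchmark values; `aggregate_score` and its field layout below are assumptions, not the site's actual formula. Note that this yields a raw weighted mean (~96% for Model A's top four entries), whereas the displayed 19.0% score evidently applies further normalization across all 18 evidence points.

```python
# Hypothetical sketch of benchmark-backed score aggregation.
# Field names mirror the evidence panel above; the real formula is unknown.

def aggregate_score(evidence):
    """Weighted mean of benchmark values, discounted by confidence.

    evidence: list of (value_pct, confidence_pct, weight_pct) tuples.
    """
    num = sum(v * (c / 100.0) * w for v, c, w in evidence)
    den = sum((c / 100.0) * w for _, c, w in evidence)
    return num / den if den else 0.0

# Top four evidence entries for gemini-2.5-flash, as shown above:
evidence_a = [
    (92.0, 100.0, 4.7),   # translation_to:bleu
    (100.0, 100.0, 3.3),  # grammar_clarity_score_pct
    (100.0, 100.0, 1.9),  # overall:mean
    (94.1, 100.0, 1.7),   # mmlu:accuracy
]
print(round(aggregate_score(evidence_a), 1))  # → 95.9
```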

Ranking Diagnostics & Missing Models

Source Lift

Ranked: 26

Sources: 8

Quality: Insufficient

Vals GPQA (vals_gpqa): 11 rows, 1.2% avg lift

Vals Mortgage Tax (vals_mortgage_tax): 11 rows, 0.3% avg lift

Vals Tax Eval v2 (vals_tax_eval_v2): 11 rows, 0.3% avg lift

Vals MedQA (vals_medqa): 10 rows, 0.3% avg lift
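"Avg lift" per source plausibly means the average change in a model's score when that source's rows are included in the ranking. The page only reports the resulting numbers, so the function below is an assumption, and the with/without score lists are illustrative values chosen to reproduce the 1.2% figure for vals_gpqa, not data from the page.

```python
# Hypothetical sketch of per-source "avg lift": the mean score delta
# across models when one benchmark source's rows are included.

def avg_lift(scores_with, scores_without):
    """Mean score delta (percentage points) across covered models."""
    deltas = [w - wo for w, wo in zip(scores_with, scores_without)]
    return sum(deltas) / len(deltas)

# Illustrative with/without scores (assumed, not from the diagnostics panel):
print(round(avg_lift([19.0, 18.2, 17.3], [17.5, 17.4, 16.0]), 2))  # → 1.2
```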

Missing Strong Models

anthropic/claude-sonnet-4.6 (external/anthropic/claude-sonnet-4-6): Rank #4, 21.1%. Thin evidence after weighting.

openai/gpt-5.4-2026-03-05 (external/openai/gpt-5-4-2026-03-05): Rank #10, 18.9%. Thin evidence after weighting.

claude-opus-4-5-20251101 (external/anthropic/claude-opus-4-5-20251101): Rank #13, 17.0%. Thin evidence after weighting.

gpt-5.1-2025-11-13 (external/openai/gpt-5-1-2025-11-13): Rank #14, 17.0%. Thin evidence after weighting.

Taxonomy Details

Core Tasks

task.casual_conversation, task.translate_general

Required Modes

mode.multilingual

Domains

domain.language_learning

Related Use Cases