BasedAGI

education

Grammar and writing coach

Correct grammar and explain fixes at the learner's level.

#1 Recommendation

gemini-2.5-flash

Strong on LanguageBench Translation Official (Split) translation_to:bleu (92%) and LanguageBench Grammar/Clarity Official (Split) grammar_clarity_score_pct (100%)

external/google/gemini-2-5-flash

Score: 20.5%

Confidence: 23.3%

Limited benchmark evidence for this use case.

23 models are ranked, with an average of 13.7 evidence points. Rankings may shift as more benchmark data is ingested.

Ranked Models: 23

Evidence Quality: 81%

Scoring: Benchmark-backed

Top Signal: LanguageBench Translation Official (Split): translation_to:bleu

All Ranked Models

Rank   Model                            Score
#4     gemini-2.5-flash                 20.5%
#5     gpt-4.1-20250414                 19.6%
#6     google/gemini-2.0-flash-001      18.6%
#10    gpt-4.1-mini-20250414            17.4%
#19    Llama-3.1-70B-Instruct           14.7%
#46    Llama-3.3-70B-Instruct           12.1%
#56    gemini-2.5-pro                   11.5%
#66    gpt-5-2025-08-07                 10.7%
#67    google/gemini-3.1-pro-preview    10.7%
#75    Qwen-VL-Chat                     10.3%
#77    gpt-5-mini-2025-08-07            10.2%
#96    gpt-4o                           8.8%
#100   gemini-3-pro-preview             8.6%
#104   phi-4                            8.4%
#108   Grok-4-0709                      8.1%
#109   GPT-4.1-nano-2025-04-14          8.1%
#111   deepseek/deepseek-r1             8.0%
#122   kimi/kimi-k2.5-thinking          7.6%
#127   claude-sonnet-4-20250514         7.5%
#151   qwen-2.5-72b-instruct            5.3%
#158   Meta-Llama-3-8B-Instruct         4.1%
#159   Phi-4-multimodal-instruct        3.8%
#168   Qwen3-30B-A3B                    1.0%

Compare Models

Model A leads by +0.9%

Model A

gemini-2.5-flash

external/google/gemini-2-5-flash

20.5%

Rank #4

Confidence 23.3% · 18 evidence pts

LanguageBench Translation Official (Split): translation_to:bleu

Value 92.0% · Conf 100.0% · Weight 5.0%

languagebench_translation_official.translation_to_bleu (Mar 12, 2026)

LanguageBench Grammar/Clarity Official (Split): grammar_clarity_score_pct

Value 100.0% · Conf 100.0% · Weight 3.5%

languagebench_grammar_clarity_official.grammar_clarity_score_pct (Mar 12, 2026)

LanguageBench: overall:mean

Value 100.0% · Conf 100.0% · Weight 2.0%

languagebench.overall_mean (Mar 12, 2026)

LanguageBench: mmlu:accuracy

Value 94.1% · Conf 100.0% · Weight 1.8%

languagebench.mmlu_accuracy (Mar 12, 2026)

Model B

gpt-4.1-20250414

external/openai/gpt-4-1-20250414

19.6%

Rank #5

Confidence 30.6% · 23 evidence pts

OpenVLM TextVQA Official: textvqa_score_pct

Value 76.8% · Conf 100.0% · Weight 3.1%

openvlm_textvqa_official.textvqa_score_pct (Mar 12, 2026)

OpenVLM OCRBench Official: ocrbench_score_pct

Value 87.7% · Conf 100.0% · Weight 3.1%

openvlm_ocrbench_official.ocrbench_score_pct (Mar 12, 2026)

OpenVLM MTVQA Official: mtvqa_score_pct

Value 92.4% · Conf 100.0% · Weight 2.5%

openvlm_mtvqa_official.mtvqa_score_pct (Mar 12, 2026)

Galileo Agent Leaderboard v2: Avg AC

Value 100.0% · Conf 100.0% · Weight 1.4%

galileo_agent_v2.avg_ac (Mar 12, 2026)
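
The comparison lists a value, a confidence, and a weight for each signal, but the page does not state how these combine into the overall score. The sketch below is only an illustration: it assumes each signal contributes value × confidence × weight and that contributions are summed. That formula and the names `partial_weighted_sum` and `lead` are hypothetical, not the site's method. The lead between the two models is plain subtraction of the displayed scores.

```python
# Top listed signals per model, copied from the comparison above.
# Each tuple is (value, confidence, weight), expressed as fractions of 1.
MODEL_A_SIGNALS = [
    (0.920, 1.0, 0.050),  # LanguageBench Translation: translation_to:bleu
    (1.000, 1.0, 0.035),  # LanguageBench Grammar/Clarity: grammar_clarity_score_pct
    (1.000, 1.0, 0.020),  # LanguageBench: overall:mean
    (0.941, 1.0, 0.018),  # LanguageBench: mmlu:accuracy
]
MODEL_B_SIGNALS = [
    (0.768, 1.0, 0.031),  # OpenVLM TextVQA: textvqa_score_pct
    (0.877, 1.0, 0.031),  # OpenVLM OCRBench: ocrbench_score_pct
    (0.924, 1.0, 0.025),  # OpenVLM MTVQA: mtvqa_score_pct
    (1.000, 1.0, 0.014),  # Galileo Agent Leaderboard v2: Avg AC
]


def partial_weighted_sum(signals):
    """ASSUMPTION: contribution = value * confidence * weight, summed.

    This is a hypothetical aggregation, not the site's documented formula.
    """
    return sum(value * conf * weight for value, conf, weight in signals)


def lead(score_a, score_b):
    """Difference between two displayed scores, in percentage points."""
    return round(score_a - score_b, 1)


partial_a = partial_weighted_sum(MODEL_A_SIGNALS)
partial_b = partial_weighted_sum(MODEL_B_SIGNALS)

print(f"Model A, listed signals only: {partial_a:.1%}")
print(f"Model B, listed signals only: {partial_b:.1%}")
print(f"Model A leads by {lead(20.5, 19.6):+.1f}%")
```

Under this assumption the four listed signals sum to roughly 11.8% for Model A and 8.8% for Model B, which does not reproduce the displayed scores of 20.5% and 19.6%. Only four signals are shown per model against 18 and 23 evidence points, and the real aggregation may differ, so treat the partial sums as a reading aid only.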

Ranking Diagnostics & Missing Models

Source Lift

Ranked: 23

Sources: 8

Quality: Insufficient

Vals GPQA (vals_gpqa): 11 rows · 1.3% avg lift

Vals Mortgage Tax (vals_mortgage_tax): 11 rows · 0.4% avg lift

Vals Tax Eval v2 (vals_tax_eval_v2): 11 rows · 0.3% avg lift

Vals MedQA (vals_medqa): 10 rows · 0.4% avg lift

Missing Strong Models

anthropic/claude-sonnet-4.6

external/anthropic/claude-sonnet-4-6

Rank #4

21.1%

Thin evidence after weighting

openai/gpt-5.4-2026-03-05

external/openai/gpt-5-4-2026-03-05

Rank #10

18.9%

Thin evidence after weighting

claude-opus-4-5-20251101

external/anthropic/claude-opus-4-5-20251101

Rank #13

17.0%

Thin evidence after weighting

gpt-5.1-2025-11-13

external/openai/gpt-5-1-2025-11-13

Rank #14

17.0%

Thin evidence after weighting

Taxonomy Details

Core Tasks

task.rewrite_clarity
task.translate_general

Required Modes

mode.multilingual

Domains

domain.language_learning
