education
Grammar and writing coach
Correct grammar and explain fixes at the learner's level.
#1 Recommendation
gemini-2.5-flash
Strong on LanguageBench Translation Official (Split): translation_to:bleu (92%) and LanguageBench Grammar/Clarity Official (Split): grammar_clarity_score_pct (100%).
external/google/gemini-2-5-flash
Score: 20.5%
Confidence: 23.3%
Limited benchmark evidence for this use case.
23 ranked models with average evidence of 13.7 points. Rankings may shift as more benchmark data is ingested.
Ranked Models: 23
Evidence Quality: 81%
Scoring: Benchmark-backed
Top Signal: LanguageBench Translation Official (Split): translation_to:bleu
All Ranked Models
| Rank | Model | Score |
|---|---|---|
| #4 | gemini-2.5-flash | 20.5% |
| #5 | gpt-4.1-20250414 | 19.6% |
| #6 | google/gemini-2.0-flash-001 | 18.6% |
| #10 | gpt-4.1-mini-20250414 | 17.4% |
| #19 | Llama-3.1-70B-Instruct | 14.7% |
| #46 | Llama-3.3-70B-Instruct | 12.1% |
| #56 | gemini-2.5-pro | 11.5% |
| #66 | gpt-5-2025-08-07 | 10.7% |
| #67 | google/gemini-3.1-pro-preview | 10.7% |
| #75 | Qwen-VL-Chat | 10.3% |
| #77 | gpt-5-mini-2025-08-07 | 10.2% |
| #96 | gpt-4o | 8.8% |
| #100 | gemini-3-pro-preview | 8.6% |
| #104 | phi-4 | 8.4% |
| #108 | Grok-4-0709 | 8.1% |
| #109 | GPT-4.1-nano-2025-04-14 | 8.1% |
| #111 | deepseek/deepseek-r1 | 8.0% |
| #122 | kimi/kimi-k2.5-thinking | 7.6% |
| #127 | claude-sonnet-4-20250514 | 7.5% |
| #151 | qwen-2.5-72b-instruct | 5.3% |
| #158 | Meta-Llama-3-8B-Instruct | 4.1% |
| #159 | Phi-4-multimodal-instruct | 3.8% |
| #168 | Qwen3-30B-A3B | 1.0% |
Compare Models
Model A leads by +0.9%
Model A
gemini-2.5-flash
external/google/gemini-2-5-flash
Rank #4
LanguageBench Translation Official (Split): translation_to:bleu
Value 92.0% · Conf 100.0% · Weight 5.0%
languagebench_translation_official.translation_to_bleu (Mar 12, 2026)
LanguageBench Grammar/Clarity Official (Split): grammar_clarity_score_pct
Value 100.0% · Conf 100.0% · Weight 3.5%
languagebench_grammar_clarity_official.grammar_clarity_score_pct (Mar 12, 2026)
LanguageBench: overall:mean
Value 100.0% · Conf 100.0% · Weight 2.0%
languagebench.overall_mean (Mar 12, 2026)
LanguageBench: mmlu:accuracy
Value 94.1% · Conf 100.0% · Weight 1.8%
languagebench.mmlu_accuracy (Mar 12, 2026)
Model B
gpt-4.1-20250414
external/openai/gpt-4-1-20250414
Rank #5
OpenVLM TextVQA Official: textvqa_score_pct
Value 76.8% · Conf 100.0% · Weight 3.1%
openvlm_textvqa_official.textvqa_score_pct (Mar 12, 2026)
OpenVLM OCRBench Official: ocrbench_score_pct
Value 87.7% · Conf 100.0% · Weight 3.1%
openvlm_ocrbench_official.ocrbench_score_pct (Mar 12, 2026)
OpenVLM MTVQA Official: mtvqa_score_pct
Value 92.4% · Conf 100.0% · Weight 2.5%
openvlm_mtvqa_official.mtvqa_score_pct (Mar 12, 2026)
Galileo Agent Leaderboard v2: Avg AC
Value 100.0% · Conf 100.0% · Weight 1.4%
galileo_agent_v2.avg_ac (Mar 12, 2026)
Ranking Diagnostics & Missing Models
Source Lift
Ranked: 23
Sources: 8
Quality: Insufficient
| Source | ID | Rows | Avg Lift |
|---|---|---|---|
| Vals GPQA | vals_gpqa | 11 | 1.3% |
| Vals Mortgage Tax | vals_mortgage_tax | 11 | 0.4% |
| Vals Tax Eval v2 | vals_tax_eval_v2 | 11 | 0.3% |
| Vals MedQA | vals_medqa | 10 | 0.4% |
Missing Strong Models
| Model | ID | Rank | Score |
|---|---|---|---|
| anthropic/claude-sonnet-4.6 | external/anthropic/claude-sonnet-4-6 | #4 | 21.1% |
| openai/gpt-5.4-2026-03-05 | external/openai/gpt-5-4-2026-03-05 | #10 | 18.9% |
| claude-opus-4-5-20251101 | external/anthropic/claude-opus-4-5-20251101 | #13 | 17.0% |
| gpt-5.1-2025-11-13 | external/openai/gpt-5-1-2025-11-13 | #14 | 17.0% |
Related Use Cases
| Category | Use Case | Description | Top Model |
|---|---|---|---|
| education | Lesson plan generator | Generate lesson plans with objectives, activities, and assessments. | gpt-4.1-20250414 |
| education | Socratic tutor | Teach concepts by guiding with questions and stepwise hints. | gpt-4.1-20250414 |
| education | Language conversation partner | Conversational practice with gentle corrections and explanations. | gemini-2.5-flash |
| education | Grading and feedback assistant | Provide rubric-tagged feedback drafts for educator review. | gpt-4.1-20250414 |