BasedAGI

Education

Grading and feedback assistant

Provide rubric-tagged feedback drafts for educator review.

#1 Recommendation

gpt-4.1-20250414

Strong on OpenVLM TextVQA Official (textvqa_score_pct: 77%) and OpenVLM OCRBench Official (ocrbench_score_pct: 88%)

external/openai/gpt-4-1-20250414

Score: 21.2%

Confidence: 32.8%

Limited benchmark evidence for this use case.

24 ranked models with average evidence of 13.6 points. Rankings may shift as more benchmark data is ingested.

Ranked Models: 24

Evidence Quality: 80%

Scoring: Benchmark-backed

Top Signal: OpenVLM TextVQA Official (textvqa_score_pct)

All Ranked Models

Showing 24 of 24 models.

Rank   Model                           Score
#1     gpt-4.1-20250414                21.2%
#7     gpt-4.1-mini-20250414           17.6%
#10    gpt-4o                          16.6%
#16    gemini-2.5-flash                14.7%
#26    google/gemini-2.0-flash-001     13.6%
#38    qwen-2.5-72b-instruct           12.1%
#53    Llama-3.1-70B-Instruct          11.2%
#54    gemini-2.5-pro                  11.2%
#57    gpt-5-2025-08-07                10.8%
#58    google/gemini-3.1-pro-preview   10.8%
#66    Qwen-VL-Chat                    10.4%
#68    gpt-5-mini-2025-08-07           10.3%
#83    Llama-3.3-70B-Instruct           9.3%
#94    gemini-3-pro-preview             8.7%
#101   Grok-4-0709                      8.3%
#102   GPT-4.1-nano-2025-04-14          8.2%
#115   kimi/kimi-k2.5-thinking          7.7%
#120   claude-sonnet-4-20250514         7.5%
#135   phi-4                            6.9%
#138   deepseek/deepseek-r1             6.7%
#152   openai/gpt-4o-mini-2024-07-18    5.0%
#154   Meta-Llama-3-8B-Instruct         4.6%
#158   Phi-4-multimodal-instruct        3.1%
#163   Qwen3-30B-A3B                    1.7%

Compare Models

Model A leads by +3.6%


Model A

gpt-4.1-20250414

external/openai/gpt-4-1-20250414

Score: 21.2%

Rank #1

Confidence: 32.8% · 23 evidence pts

OpenVLM TextVQA Official: textvqa_score_pct

Value 76.8% · Conf 100.0% · Weight 2.9%

openvlm_textvqa_official.textvqa_score_pct (Mar 12, 2026)

OpenVLM OCRBench Official: ocrbench_score_pct

Value 87.7% · Conf 100.0% · Weight 2.9%

openvlm_ocrbench_official.ocrbench_score_pct (Mar 12, 2026)

OpenVLM MTVQA Official: mtvqa_score_pct

Value 92.4% · Conf 100.0% · Weight 2.3%

openvlm_mtvqa_official.mtvqa_score_pct (Mar 12, 2026)

MMLongBench-Doc Leaderboard: acc_score_pct

Value 74.6% · Conf 100.0% · Weight 1.3%

mmlongbench_doc_leaderboard.acc_score_pct (Mar 12, 2026)

Model B

gpt-4.1-mini-20250414

external/openai/gpt-4-1-mini-20250414

Score: 17.6%

Rank #7

Confidence: 27.5% · 15 evidence pts

OpenVLM OCRBench Official: ocrbench_score_pct

Value 88.4% · Conf 100.0% · Weight 2.9%

openvlm_ocrbench_official.ocrbench_score_pct (Mar 12, 2026)

OpenVLM TextVQA Official: textvqa_score_pct

Value 70.2% · Conf 100.0% · Weight 2.6%

openvlm_textvqa_official.textvqa_score_pct (Mar 12, 2026)

OpenVLM MTVQA Official: mtvqa_score_pct

Value 100.0% · Conf 100.0% · Weight 2.5%

openvlm_mtvqa_official.mtvqa_score_pct (Mar 12, 2026)

OpenVLM ChartQA Human Official: chartqa_human_score_pct

Value 46.9% · Conf 100.0% · Weight 1.2%

openvlm_chartqa_human_official.chartqa_human_score_pct (Mar 12, 2026)
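Each evidence line above pairs a benchmark value with a confidence and a weight. The exact aggregation BasedAGI uses is not published; a minimal sketch, assuming each item contributes value × confidence × weight and a model's score is the sum over all of its evidence items, looks like this. Only Model A's four top signals appear in the panel above, so the partial sum falls well short of the full 21.2% score.

```python
# Sketch of a benchmark-backed score aggregation. The exact BasedAGI
# formula is not published; this ASSUMES each evidence item contributes
# value * confidence * weight and that a model's score is the sum over
# all of its evidence items. All quantities are fractions in [0, 1].

def aggregate_score(evidence):
    """Sum confidence-scaled, weighted benchmark values."""
    return sum(value * conf * weight for value, conf, weight in evidence)

# (value, confidence, weight) for Model A's four top signals, read off
# the comparison panel. Only the top signals are listed there, so this
# partial sum falls short of the full 21.2% score.
model_a_top_signals = [
    (0.768, 1.0, 0.029),  # OpenVLM TextVQA Official: textvqa_score_pct
    (0.877, 1.0, 0.029),  # OpenVLM OCRBench Official: ocrbench_score_pct
    (0.924, 1.0, 0.023),  # OpenVLM MTVQA Official: mtvqa_score_pct
    (0.746, 1.0, 0.013),  # MMLongBench-Doc Leaderboard: acc_score_pct
]

partial = aggregate_score(model_a_top_signals)
print(f"Partial score from top signals: {partial:.1%}")
```

Under this assumption, ranking differences come from both the breadth of evidence (more items, more summed weight) and per-benchmark values, which is consistent with high-scoring models being flagged for "thin evidence after weighting" below.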

Ranking Diagnostics & Missing Models

Source Lift

Ranked: 24

Sources: 8

Quality: Insufficient

Source             ID                 Rows  Avg Lift
Vals GPQA          vals_gpqa          12    1.1%
Vals Mortgage Tax  vals_mortgage_tax  12    0.3%
Vals Tax Eval v2   vals_tax_eval_v2   12    0.3%
Vals MedQA         vals_medqa         11    0.3%
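The page does not define "avg lift". One plausible reading, sketched below under that assumption, is the mean per-model score change when a source's rows are added to the evidence pool; the model names and scores in the example are illustrative placeholders, not values from this page.

```python
# One plausible reading of "avg lift" in the Source Lift table: the mean
# per-model score change when a source's rows are added to the evidence
# pool. The real BasedAGI definition is not published, and the scores
# below are ILLUSTRATIVE placeholders, not values from the page.

def avg_lift(scores_without, scores_with):
    """Mean score delta across models attributable to one source."""
    assert scores_without.keys() == scores_with.keys()
    deltas = [scores_with[m] - scores_without[m] for m in scores_with]
    return sum(deltas) / len(deltas)

before = {"model-a": 0.201, "model-b": 0.160, "model-c": 0.068}
after = {"model-a": 0.212, "model-b": 0.166, "model-c": 0.069}

print(f"avg lift: {avg_lift(before, after):.1%}")
```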

Missing Strong Models

Model                        External ID                                  Rank  Score  Note
anthropic/claude-sonnet-4.6  external/anthropic/claude-sonnet-4-6         #4    21.1%  Thin evidence after weighting
openai/gpt-5.4-2026-03-05    external/openai/gpt-5-4-2026-03-05           #10   18.9%  Thin evidence after weighting
claude-opus-4-5-20251101     external/anthropic/claude-opus-4-5-20251101  #13   17.0%  Thin evidence after weighting
gpt-5.1-2025-11-13           external/openai/gpt-5-1-2025-11-13           #14   17.0%  Thin evidence after weighting

Taxonomy Details

Core Tasks

task.topic_tagging · task.write_report

Required Modes

mode.json_schema

Domains

domain.education_tutoring

Related Use Cases