BasedAGI
Rankings live

education

Lesson plan generator

Generate lesson plans with objectives, activities, and assessments.

#1 Recommendation

gpt-4.1-20250414

Strong on OpenVLM TextVQA Official textvqa_score_pct (77%) and OpenVLM OCRBench Official ocrbench_score_pct (88%)

external/openai/gpt-4-1-20250414

Score: 23.3%

Confidence: 36.1%

Limited benchmark evidence for this use case.

24 ranked models with average evidence of 13.3 points. Rankings may shift as more benchmark data is ingested.

Ranked Models

24

Evidence Quality

80%

Scoring

Benchmark-backed

Top Signal

OpenVLM TextVQA Official: textvqa_score_pct

All Ranked Models

Rank · Model · Score
#1 · gpt-4.1-20250414 · 23.3%
#5 · gpt-4.1-mini-20250414 · 19.4%
#15 · gemini-2.5-flash · 16.2%
#30 · google/gemini-2.0-flash-001 · 14.3%
#50 · gemini-2.5-pro · 12.3%
#53 · gpt-5-2025-08-07 · 11.9%
#54 · google/gemini-3.1-pro-preview · 11.9%
#57 · Llama-3.1-70B-Instruct · 11.8%
#63 · Qwen-VL-Chat · 11.4%
#65 · gpt-5-mini-2025-08-07 · 11.3%
#83 · gpt-4o · 9.8%
#89 · gemini-3-pro-preview · 9.6%
#97 · Grok-4-0709 · 9.1%
#98 · Llama-3.3-70B-Instruct · 9.0%
#99 · GPT-4.1-nano-2025-04-14 · 9.0%
#113 · kimi/kimi-k2.5-thinking · 8.5%
#118 · claude-sonnet-4-20250514 · 8.3%
#141 · phi-4 · 6.7%
#147 · deepseek/deepseek-r1 · 6.0%
#148 · qwen-2.5-72b-instruct · 5.9%
#155 · Meta-Llama-3-8B-Instruct · 4.6%
#157 · openai/gpt-4o-mini-2024-07-18 · 4.4%
#160 · Phi-4-multimodal-instruct · 3.4%
#169 · Qwen3-30B-A3B · 0.9%

Compare Models

Model A leads by +3.9%
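The lead shown here appears to be the difference between the two overall scores (23.3% for Model A, 19.4% for Model B), in percentage points. A minimal sketch of that arithmetic follows. The function name `lead_margin` is hypothetical and not part of the site, and `Decimal` is used only to avoid floating-point noise in the subtraction.

```python
from decimal import Decimal


def lead_margin(score_a: str, score_b: str) -> Decimal:
    """Return score_a minus score_b, in percentage points.

    Scores are passed as strings (e.g. "23.3") so Decimal parses them exactly.
    """
    return Decimal(score_a) - Decimal(score_b)


# Overall scores for Model A and Model B as listed on this page.
margin = lead_margin("23.3", "19.4")
print(f"Model A leads by {margin:+}%")
```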

Model A

gpt-4.1-20250414

external/openai/gpt-4-1-20250414

23.3%

Rank #1

Confidence 36.1% · 23 evidence pts

OpenVLM TextVQA Official: textvqa_score_pct

Value 76.8% · Conf 100.0% · Weight 3.2%

openvlm_textvqa_official.textvqa_score_pct (Mar 12, 2026)

OpenVLM OCRBench Official: ocrbench_score_pct

Value 87.7% · Conf 100.0% · Weight 3.2%

openvlm_ocrbench_official.ocrbench_score_pct (Mar 12, 2026)

OpenVLM MTVQA Official: mtvqa_score_pct

Value 92.4% · Conf 100.0% · Weight 2.6%

openvlm_mtvqa_official.mtvqa_score_pct (Mar 12, 2026)

MMLongBench-Doc Leaderboard: acc_score_pct

Value 74.6% · Conf 100.0% · Weight 1.5%

mmlongbench_doc_leaderboard.acc_score_pct (Mar 12, 2026)

Model B

gpt-4.1-mini-20250414

external/openai/gpt-4-1-mini-20250414

19.4%

Rank #5

Confidence 30.3% · 15 evidence pts

OpenVLM OCRBench Official: ocrbench_score_pct

Value 88.4% · Conf 100.0% · Weight 3.2%

openvlm_ocrbench_official.ocrbench_score_pct (Mar 12, 2026)

OpenVLM TextVQA Official: textvqa_score_pct

Value 70.2% · Conf 100.0% · Weight 3.0%

openvlm_textvqa_official.textvqa_score_pct (Mar 12, 2026)

OpenVLM MTVQA Official: mtvqa_score_pct

Value 100.0% · Conf 100.0% · Weight 2.8%

openvlm_mtvqa_official.mtvqa_score_pct (Mar 12, 2026)

OpenVLM ChartQA Human Official: chartqa_human_score_pct

Value 46.9% · Conf 100.0% · Weight 1.3%

openvlm_chartqa_human_official.chartqa_human_score_pct (Mar 12, 2026)
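Each signal above lists a Value, a Conf, and a Weight, but the page does not say how they combine into the overall score. The sketch below assumes, purely for illustration, that a signal contributes value × confidence × weight. The names `Signal` and `partial_contribution` are hypothetical. Only the four signals shown per model are included, out of 23 and 15 evidence points respectively, so the totals are partial and are not expected to reproduce the 23.3% and 19.4% scores.

```python
from typing import Iterable, NamedTuple


class Signal(NamedTuple):
    name: str
    value: float       # benchmark value as a fraction (76.8% -> 0.768)
    confidence: float  # "Conf" as a fraction
    weight: float      # "Weight" as a fraction


def partial_contribution(signals: Iterable[Signal]) -> float:
    """Sum value * confidence * weight over the given signals.

    ASSUMPTION: this combination rule is a guess, not the site's documented formula.
    """
    return sum(s.value * s.confidence * s.weight for s in signals)


# Signals copied from the Model A and Model B panels above.
MODEL_A = [
    Signal("textvqa_score_pct", 0.768, 1.0, 0.032),
    Signal("ocrbench_score_pct", 0.877, 1.0, 0.032),
    Signal("mtvqa_score_pct", 0.924, 1.0, 0.026),
    Signal("acc_score_pct", 0.746, 1.0, 0.015),
]
MODEL_B = [
    Signal("ocrbench_score_pct", 0.884, 1.0, 0.032),
    Signal("textvqa_score_pct", 0.702, 1.0, 0.030),
    Signal("mtvqa_score_pct", 1.000, 1.0, 0.028),
    Signal("chartqa_human_score_pct", 0.469, 1.0, 0.013),
]

for label, signals in (("Model A", MODEL_A), ("Model B", MODEL_B)):
    print(f"{label}: {partial_contribution(signals):.1%} from listed signals")
```

Under this assumed rule, the listed signals account for well under half of either model's overall score, which is consistent with the remaining evidence points carrying the rest.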

Ranking Diagnostics & Missing Models

Source Lift

Ranked

24

Sources

8

Quality

Insufficient

Vals GPQA

vals_gpqa

12 rows

1.2% avg lift

Vals Mortgage Tax

vals_mortgage_tax

12 rows

0.4% avg lift

Vals Tax Eval v2

vals_tax_eval_v2

12 rows

0.3% avg lift

Vals MedQA

vals_medqa

11 rows

0.4% avg lift

Missing Strong Models

anthropic/claude-sonnet-4.6

external/anthropic/claude-sonnet-4-6

Rank #4

21.1%

Thin evidence after weighting

openai/gpt-5.4-2026-03-05

external/openai/gpt-5-4-2026-03-05

Rank #10

18.9%

Thin evidence after weighting

claude-opus-4-5-20251101

external/anthropic/claude-opus-4-5-20251101

Rank #13

17.0%

Thin evidence after weighting

gpt-5.1-2025-11-13

external/openai/gpt-5-1-2025-11-13

Rank #14

17.0%

Thin evidence after weighting
Taxonomy Details

Core Tasks

task.outline_generation, task.tutoring_socratic

Required Modes

none

Domains

domain.education_tutoring

Related Use Cases