benchmark evidence

MMMU

Massive Multidiscipline Multimodal Understanding: about 11.5K expert-level multimodal questions across 30 subjects.

winner on MMMU

direct benchmark result, not a broad vertical composite | source row dated 2026-05-15

scored on 2026-05-15 · stale source data (60d)

latest mapped results | top 20

#	Model	Score	Evidence	Tested
1	OpenAI: GPT-5.1	85.4	model-only independent_benchmark	2026-05-15
2	OpenAI: o3	82.9	model-only independent_benchmark	2000-01-01
3	OpenAI: o4 Mini	81.6	model-only independent_benchmark	2000-01-01
4	Google: Gemini 2.5 Flash	79.7	model-only independent_benchmark	2000-01-01
5	Google: Gemini 2.5 Pro	79.6	model-only independent_benchmark	2000-01-01
6	Anthropic: Claude Sonnet 4.5	77.8	model-only independent_benchmark	2026-05-15
7	OpenAI: o1	77.6	model-only independent_benchmark	2000-01-01
8	Anthropic: Claude Opus 4.7	76.5	model-only independent_benchmark	2026-05-08
9	Anthropic: Claude Opus 4.5	76.5	model-only independent_benchmark	2026-05-15
10	OpenAI: GPT-4.1	74.8	model-only independent_benchmark	2000-01-01
11	Anthropic: Claude Sonnet 4	74.4	model-only independent_benchmark	2026-05-15
12	Anthropic: Claude Sonnet 4.6	74.4	model-only independent_benchmark	2026-05-08
13	Meta: Llama 4 Maverick	73.4	model-only independent_benchmark	2026-05-15
14	Google: Gemini 2.0 Flash Lite	71.7	model-only independent_benchmark	2026-05-15
15	Google: Gemini 2.0 Flash	70.7	model-only independent_benchmark	2000-01-01
16	Meta: Llama 4 Scout	69.4	model-only independent_benchmark	2026-05-15
17	Google: Gemma 3 27B	64.9	model-only independent_benchmark	2026-05-15
18	Amazon: Nova Pro 1.0	62.0	model-only independent_benchmark	2026-05-15
19	OpenAI: GPT-5	56.8	model-only independent_benchmark	2026-05-15
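
For readers who want to process the table above, here is a minimal Python sketch of reading its rows. It assumes the five columns are tab-separated and that a Tested value of 2000-01-01 is a placeholder for an unknown test date. The source does not state either point; the function name `parse_row` and the sample rows are illustrative only.

```python
from datetime import date

# Assumption: 2000-01-01 marks a missing test date; the page does not say so.
PLACEHOLDER = date(2000, 1, 1)


def parse_row(line: str) -> dict:
    """Split one tab-separated leaderboard row into named fields."""
    rank, model, score, evidence, tested = line.split("\t")
    tested_date = date.fromisoformat(tested)
    return {
        "rank": int(rank),
        "model": model,
        "score": float(score),
        "evidence": evidence,
        # Treat the placeholder as "unknown" instead of a real date.
        "tested": None if tested_date == PLACEHOLDER else tested_date,
    }


# Two sample rows modeled on the table.
rows = [
    "1\tOpenAI: GPT-5.1\t85.4\tmodel-only independent_benchmark\t2026-05-15",
    "2\tOpenAI: o3\t82.9\tmodel-only independent_benchmark\t2000-01-01",
]
parsed = [parse_row(r) for r in rows]

# Models whose test date would be treated as unknown under the assumption.
undated = [p["model"] for p in parsed if p["tested"] is None]
print(undated)
```

Keeping unknown dates as `None` rather than as a real calendar date avoids mixing them into any freshness comparison by accident.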

what this result means

This benchmark contributes direct public evidence. Read its scope before generalizing the result.

A win here is a win on MMMU. Broad task pages require independent corroboration before naming a general winner.

source record

category: reasoning

metric: accuracy

matched models: 19

latest source date: 2026-05-15

direction: higher is better
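
The metric and direction fields in the source record are enough to reproduce an ordering. The sketch below is one possible way to rank accuracy scores when higher is better; the function name `rank` and its `higher_is_better` parameter are hypothetical, not part of the source. It gives tied scores the same place, which differs from the table, where tied models receive consecutive positions.

```python
def rank(scores: dict[str, float], higher_is_better: bool = True) -> list[tuple[int, str, float]]:
    """Order models by score and assign places, sharing a place on ties."""
    # sorted() is stable, so tied models keep their input order.
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=higher_is_better)
    ranked: list[tuple[int, str, float]] = []
    for position, (model, score) in enumerate(ordered, start=1):
        if ranked and ranked[-1][2] == score:
            place = ranked[-1][0]  # tie: reuse the previous place
        else:
            place = position
        ranked.append((place, model, score))
    return ranked


# Scores taken from the table; model names shortened for illustration.
sample = {
    "Claude Sonnet 4.5": 77.8,
    "o1": 77.6,
    "Claude Opus 4.7": 76.5,
    "Claude Opus 4.5": 76.5,
    "GPT-4.1": 74.8,
}
ranked = rank(sample)
for place, model, score in ranked:
    print(place, model, score)
```

Passing `higher_is_better=False` would reverse the order for a metric such as error rate, where lower is better.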