benchmark evidence
MMLU-Pro
Massive Multitask Language Understanding — Professional. 14 subjects, 5-shot CoT.
saturating
winner on MMLU-Pro
OpenAI: GPT-5.1 | 86.4
direct benchmark result, not a broad vertical composite | source row dated 2025-05-01
scored on 2025-05-01 · stale source data (394d)
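The staleness figure above is a plain date difference. A minimal sketch of that arithmetic follows. The `as_of` date is an assumption, back-computed from the 394-day figure because the page does not state when it was rendered. `STALE_AFTER_DAYS` is a hypothetical threshold, not one the page documents.

```python
from datetime import date

# Hypothetical cutoff for flagging a row as stale; the page does not state its rule.
STALE_AFTER_DAYS = 180


def staleness_days(source_date: date, as_of: date) -> int:
    """Whole days elapsed between the source row date and the reference date."""
    return (as_of - source_date).days


source_date = date(2025, 5, 1)  # "source row dated 2025-05-01" (from the page)
as_of = date(2026, 5, 30)       # assumption: inferred from the "(394d)" label

age = staleness_days(source_date, as_of)
is_stale = age > STALE_AFTER_DAYS
print(age, is_stale)  # 394 True
```

Under these assumptions the row is well past any plausible freshness window, which matches the "stale source data" flag.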
latest mapped results | top 20
| # | Model | Score | Evidence | Tested |
|---|---|---|---|---|
| 1 | OpenAI: GPT-5.1 | 86.4 | model-only independent_benchmark | 2025-05-01 |
| 2 | OpenAI: o3 | 85.0 | model-only independent_benchmark | 2025-05-01 |
| 3 | Google: Gemini 2.5 Pro | 84.5 | model-only independent_benchmark | 2025-05-01 |
| 4 | DeepSeek: R1 | 84.0 | model-only independent_benchmark | 2025-05-01 |
| 5 | OpenAI: GPT-4.1 | 81.8 | model-only independent_benchmark | 2025-05-01 |
| 6 | DeepSeek: DeepSeek V3.2 | 81.3 | model-only independent_benchmark | 2025-05-01 |
| 7 | OpenAI: o1 | 80.3 | model-only independent_benchmark | 2025-05-01 |
| 8 | OpenAI: o3 Mini | 79.4 | model-only independent_benchmark | 2025-05-01 |
| 9 | Google: Gemini 2.0 Flash Lite | 76.2 | model-only independent_benchmark | 2025-05-01 |
| 10 | Google: Gemini 2.0 Flash | 76.2 | model-only independent_benchmark | 2025-05-01 |
| 11 | OpenAI: GPT-5 | 72.5 | model-only independent_benchmark | 2025-05-01 |
| 12 | Google: Gemma 3 27B | 67.5 | model-only independent_benchmark | 2025-05-01 |
| 13 | Meta: Llama 3.3 70B Instruct | 65.9 | model-only independent_benchmark | 2025-05-01 |
| 14 | Mistral Large | 65.9 | model-only independent_benchmark | 2025-05-01 |
| 15 | Mistral: Mistral Small 4 | 48.4 | model-only independent_benchmark | 2025-05-01 |
| 16 | Mistral: Mistral Nemo | 44.8 | model-only independent_benchmark | 2025-05-01 |
what this result means
Frontier scores are compressing: the top four models sit within 2.4 points of each other (86.4 to 84.0). Use this benchmark as one input, not a complete answer.
A win here is a win on MMLU-Pro. Broad task pages require independent corroboration before naming a general winner.
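One way to put a number on the compression is to compare the score spread among the top few models with the spread across the whole table. The sketch below uses the 16 scores from the table. The helper name `spread` and the choice of a top-4 window are illustrative assumptions, not something the page defines.

```python
# Scores copied from the results table, in rank order.
scores = [86.4, 85.0, 84.5, 84.0, 81.8, 81.3, 80.3, 79.4,
          76.2, 76.2, 72.5, 67.5, 65.9, 65.9, 48.4, 44.8]


def spread(values: list[float], top_n: int) -> float:
    """Gap between the best score and the top_n-th best score, to one decimal."""
    top = sorted(values, reverse=True)[:top_n]
    return round(top[0] - top[-1], 1)


top4_spread = spread(scores, 4)            # gap across ranks 1 to 4
full_spread = spread(scores, len(scores))  # gap across all 16 rows
print(top4_spread, full_spread)  # 2.4 41.6
```

A top-4 gap of 2.4 points against a full-table gap of 41.6 suggests the benchmark separates weaker models far better than it separates leaders, which is why a win here should not be read as a general win.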
source record
category: reasoning
metric: accuracy
matched models: 16
latest source date: 2025-05-01
direction: higher is better
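The `direction` field determines how a winner is chosen from the matched rows. The sketch below shows that selection under assumed names: `Result` and `pick_winner` are hypothetical and are not part of any documented API for this page. Only the top three rows from the table are included, for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    model: str
    score: float


def pick_winner(results: list[Result], higher_is_better: bool = True) -> Result:
    """Return the best row according to the metric's direction."""
    chooser = max if higher_is_better else min
    return chooser(results, key=lambda r: r.score)


# First three rows of the table; metric is accuracy, direction is higher is better.
results = [
    Result("OpenAI: GPT-5.1", 86.4),
    Result("OpenAI: o3", 85.0),
    Result("Google: Gemini 2.5 Pro", 84.5),
]

winner = pick_winner(results, higher_is_better=True)
print(winner.model, winner.score)  # OpenAI: GPT-5.1 86.4
```

For a metric such as error rate, passing `higher_is_better=False` would flip the selection without changing the rest of the code.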