basedagi.org
benchmark evidence

Humanity's Last Exam

A frontier multi-domain benchmark of 2,500 expert-crafted questions across math, science, and the humanities. Published in Nature, 2026.

winner on Humanity's Last Exam
direct benchmark result, not a broad vertical composite | source row dated 2000-01-01
scored on 2025-04-03 · stale source data (422d)
latest mapped results | top 20
#  | Model                        | Provider   | Score | Evidence                           | Tested
1  | Anthropic: Claude Opus 4.7   | Anthropic  | 54.7  | model-only · independent_benchmark | 2000-01-01
2  | Anthropic: Claude Opus 4.6   | Anthropic  | 53.1  | model-only · independent_benchmark | 2000-01-01
3  | OpenAI: GPT-5.5              | OpenAI     | 52.2  | model-only · independent_benchmark | 2000-01-01
4  | MoonshotAI: Kimi K2 Thinking | MoonshotAI | 51.0  | model-only · independent_benchmark | 2000-01-01
5  | Anthropic: Claude Sonnet 4.6 | Anthropic  | 49.0  | model-only · independent_benchmark | 2000-01-01
6  | DeepSeek: DeepSeek V4 Pro    | DeepSeek   | 48.2  | model-only · independent_benchmark | 2000-01-01
7  | DeepSeek: DeepSeek V4 Flash  | DeepSeek   | 45.1  | model-only · independent_benchmark | 2000-01-01
8  | DeepSeek: DeepSeek V3.2      | DeepSeek   | 40.8  | model-only · independent_benchmark | 2000-01-01
9  | OpenAI: GPT-5.4              | OpenAI     | 39.8  | model-only · independent_benchmark | 2000-01-01
10 | OpenAI: GPT-5.2              | OpenAI     | 34.5  | model-only · independent_benchmark | 2000-01-01
11 | OpenAI: GPT-5.4 Mini         | OpenAI     | 28.2  | model-only · independent_benchmark | 2000-01-01
12 | OpenAI: GPT-5.1              | OpenAI     | 25.3  | model-only · independent_benchmark | 2025-04-03
13 | Google: Gemini 2.5 Pro       | Google     | 21.6  | model-only · independent_benchmark | 2025-04-03
14 | Anthropic: Claude Sonnet 4.5 | Anthropic  | 13.7  | model-only · independent_benchmark | 2025-04-03
15 | Google: Gemini 2.5 Flash     | Google     | 12.1  | model-only · independent_benchmark | 2025-04-03
16 | DeepSeek: R1                 | DeepSeek   | 8.5   | model-only · independent_benchmark | 2025-04-03
17 | OpenAI: GPT-5                | OpenAI     | 2.7   | model-only · independent_benchmark | 2025-04-03
what this result means

This benchmark contributes direct public evidence. Read its scope before generalizing the result.

A win here is a win on Humanity's Last Exam. Broad task pages require independent corroboration before naming a general winner.

source record
category: reasoning
metric: accuracy
matched models: 17
latest source date: 2025-04-03
direction: higher is better
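
The source record fields can be read as a small procedure: rank rows by the metric in the stated direction, take the latest source date, and measure staleness in days. The sketch below is illustrative only. The tuple layout, the function names, and the treatment of 2000-01-01 as a placeholder for an unknown test date are assumptions, not something this page documents. The four rows are copied from the results table.

```python
from datetime import date

# Hypothetical record layout: (model, score, tested). Values copied from the table.
ROWS = [
    ("OpenAI: GPT-5.1", 25.3, date(2025, 4, 3)),
    ("Anthropic: Claude Opus 4.7", 54.7, date(2000, 1, 1)),
    ("Google: Gemini 2.5 Pro", 21.6, date(2025, 4, 3)),
    ("OpenAI: GPT-5.5", 52.2, date(2000, 1, 1)),
]

# Assumption: 2000-01-01 marks a missing date rather than a real test date.
PLACEHOLDER = date(2000, 1, 1)


def rank(rows, higher_is_better=True):
    """Order rows by score, following the record's 'direction' field."""
    return sorted(rows, key=lambda row: row[1], reverse=higher_is_better)


def latest_source_date(rows, placeholder=PLACEHOLDER):
    """Return the most recent tested date, ignoring placeholder dates."""
    real_dates = [row[2] for row in rows if row[2] != placeholder]
    return max(real_dates) if real_dates else None


def staleness_days(source_date, as_of):
    """Number of days between the source date and an as-of date."""
    return (as_of - source_date).days


if __name__ == "__main__":
    for position, (model, score, tested) in enumerate(rank(ROWS), start=1):
        print(position, model, score, tested.isoformat())
    print("latest source date:", latest_source_date(ROWS))
```

Under these assumptions, skipping placeholder dates gives a latest source date of 2025-04-03, which agrees with the source record. The as-of date behind the "422d" label is not stated on the page, so `staleness_days` takes it as an explicit argument.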