benchmark evidence
GPQA Diamond
Google-proof PhD-level science QA. 198 expert-crafted multiple-choice questions in biology, chemistry, physics.
winner on GPQA Diamond | Anthropic: Claude Opus 4.7 | 94.2
direct benchmark result, not a broad vertical composite | source row dated 2000-01-01
scored on 2025-07-01 · stale source data (333 days old)
latest mapped results | top 20
| # | Model | Score | Evidence | Tested |
|---|---|---|---|---|
| 1 | Anthropic: Claude Opus 4.7 | 94.2 | model-only independent_benchmark | 2000-01-01 |
| 2 | OpenAI: GPT-5.5 | 93.6 | model-only independent_benchmark | 2000-01-01 |
| 3 | OpenAI: GPT-5.4 | 92.8 | model-only independent_benchmark | 2000-01-01 |
| 4 | OpenAI: GPT-5.2 | 92.4 | model-only independent_benchmark | 2000-01-01 |
| 5 | Anthropic: Claude Opus 4.6 | 91.3 | model-only independent_benchmark | 2000-01-01 |
| 6 | DeepSeek: DeepSeek V4 Pro | 90.1 | model-only independent_benchmark | 2000-01-01 |
| 7 | Anthropic: Claude Sonnet 4.6 | 89.9 | model-only independent_benchmark | 2000-01-01 |
| 8 | OpenAI: GPT-5.1 | 88.1 | model-only independent_benchmark | 2000-01-01 |
| 9 | DeepSeek: DeepSeek V4 Flash | 88.1 | model-only independent_benchmark | 2000-01-01 |
| 10 | OpenAI: GPT-5.4 Mini | 88.0 | model-only independent_benchmark | 2000-01-01 |
| 11 | Anthropic: Claude Opus 4.5 | 87.0 | model-only independent_benchmark | 2000-01-01 |
| 12 | MoonshotAI: Kimi K2 Thinking | 84.5 | model-only independent_benchmark | 2000-01-01 |
| 13 | OpenAI: o3 Mini | 77.2 | model-only independent_benchmark | 2025-07-01 |
| 14 | OpenAI: o1 | 73.3 | model-only independent_benchmark | 2025-07-01 |
| 15 | OpenAI: GPT-4.1 | 66.3 | model-only independent_benchmark | 2025-07-01 |
| 16 | OpenAI: GPT-5 | 46.0 | model-only independent_benchmark | 2025-07-01 |
what this result means
This benchmark contributes direct public evidence. Read its scope before generalizing the result.
A win here is a win on GPQA Diamond. Broad task pages require independent corroboration before naming a general winner.
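One way to read the scope of a score gap: with 198 questions, a single question is worth roughly half an accuracy point. The sketch below works through that arithmetic for the top two rows of the table. It assumes each score is the plain percentage of questions answered correctly in a single pass; the source does not say whether scores are averaged over several runs. The function names are hypothetical and not part of any upstream tooling.

```python
QUESTIONS = 198  # question count stated in the benchmark description


def points_per_question(n_questions: int = QUESTIONS) -> float:
    """Accuracy points (out of 100) contributed by one question."""
    return 100.0 / n_questions


def gap_in_questions(score_a: float, score_b: float,
                     n_questions: int = QUESTIONS) -> float:
    """Approximate number of questions separating two accuracy scores.

    Assumption: both scores are single-pass percentages on the same
    n_questions items.
    """
    return abs(score_a - score_b) / points_per_question(n_questions)


# Scores copied from the table: rank 1 (94.2) and rank 2 (93.6).
print(round(points_per_question(), 2))         # 0.51
print(round(gap_in_questions(94.2, 93.6), 1))  # 1.2
```

Under that assumption, the gap between rank 1 and rank 2 amounts to about one question, which is part of why a win here should not be generalized without corroboration.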
source record
category: reasoning
metric: accuracy
matched models: 16
latest source date: 2025-07-01
direction: higher is better
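The source record fields can be applied mechanically. The following is a minimal sketch of how `direction` might drive ranking and how a staleness figure could be derived from `latest source date`. The helper names and the as-of date are assumptions: the as-of date was picked only because it reproduces the 333-day figure shown on this page, not because the source states it.

```python
from datetime import date


def rank(rows, higher_is_better=True):
    """Sort (model, score) pairs so the best score comes first."""
    return sorted(rows, key=lambda row: row[1], reverse=higher_is_better)


def staleness_days(latest_source_date: date, as_of: date) -> int:
    """Days elapsed between the latest source date and an as-of date."""
    return (as_of - latest_source_date).days


# Three rows copied from the table, deliberately out of order.
rows = [
    ("OpenAI: o1", 73.3),
    ("Anthropic: Claude Opus 4.7", 94.2),
    ("OpenAI: GPT-5", 46.0),
]

ranked = rank(rows, higher_is_better=True)  # direction: higher is better
print(ranked[0][0])  # Anthropic: Claude Opus 4.7

# Hypothetical as-of date (assumption, see lead-in).
print(staleness_days(date(2025, 7, 1), date(2026, 5, 30)))  # 333
```

For a metric where lower is better (error rate, latency), passing `higher_is_better=False` flips the order without touching the rows.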