benchmark evidence
GAIA
GAIA is a general assistant benchmark. Results here come from the public Hugging Face results dataset.
winner on GAIA
OpenAI: GPT-5.4 | 84.7
direct benchmark result, not a broad vertical composite | source row dated 2026-04-27 | agent: SB-Agent-4
scored on 2026-05-15
latest mapped results | top 20
| # | Model | Score | Evidence | Agent | Tested |
|---|---|---|---|---|---|
| 1 | OpenAI: GPT-5.4 | 84.7 | agent-dependent independent_benchmark | SB-Agent-4 | 2026-04-27 |
| 2 | OpenAI: GPT-4.1 | 83.1 | agent-dependent independent_benchmark | Agent_v0.1.4 | 2025-08-11 |
| 3 | OpenAI: GPT-5 | 78.4 | agent-dependent independent_benchmark | GenAgent_v0.0.3 | 2025-12-03 |
| 4 | OpenAI: GPT-5.1 | 75.8 | agent-dependent independent_benchmark | XXP Agent | 2025-11-25 |
| 5 | DeepSeek: DeepSeek V4 Pro | 74.8 | agent-dependent independent_benchmark | Corint v1.1 | 2026-05-15 |
| 6 | Anthropic: Claude Opus 4.5 | 74.1 | agent-dependent independent_benchmark | Clawdbot | 2026-01-29 |
| 7 | Anthropic: Claude Sonnet 4.5 | 71.4 | agent-dependent independent_benchmark | Nexus test 1 | 2026-03-11 |
| 8 | Google: Gemini 2.5 Pro | 66.1 | agent-dependent independent_benchmark | ktc-agent-v2.0.2 | 2025-09-16 |
| 9 | OpenAI: o3 | 62.1 | agent-dependent independent_benchmark | MetaAgentv0.5.11 | 2025-10-19 |
| 10 | Anthropic: Claude Sonnet 4.6 | 53.8 | agent-dependent independent_benchmark | mt_agent_2.0 | 2025-08-22 |
| 11 | Anthropic: Claude Sonnet 4 | 51.2 | agent-dependent independent_benchmark | OpenHands-Versa | 2025-06-09 |
| 12 | OpenAI: o1 | 49.8 | agent-dependent independent_benchmark | open Deep Research \| pass@1 | 2025-02-10 |
| 13 | OpenAI: o4 Mini | 42.5 | agent-dependent independent_benchmark | Magentic-UI | 2025-05-24 |
| 14 | DeepSeek: DeepSeek V3.2 | 37.2 | agent-dependent independent_benchmark | meta-agent | 2026-01-09 |
| 15 | Google: Gemini 2.5 Flash | 30.2 | agent-dependent independent_benchmark | zzzzzzzz | 2025-07-19 |
| 16 | Qwen: Qwen3 32B | 21.6 | agent-dependent independent_benchmark | Qwen-3-Memory | 2025-06-19 |
| 17 | Google: Gemini 2.0 Flash | 6.3 | agent-dependent independent_benchmark | gemini-cot | 2025-08-17 |
what this result means
This benchmark contributes direct public evidence. Read its scope before generalizing the result.
A win here is a win on GAIA. Broad task pages require independent corroboration before naming a general winner.
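The corroboration rule above can be sketched in code. This is a minimal illustration, not the page's actual logic: the function name `general_winner`, the `min_benchmarks` threshold of 2, and the second benchmark name are all hypothetical assumptions. The source only states that a single benchmark win is not enough to name a general winner.

```python
from collections import Counter
from typing import Optional


def general_winner(
    benchmark_winners: dict[str, str],
    min_benchmarks: int = 2,  # assumed threshold; the source does not give one
) -> Optional[str]:
    """Return a model only if it wins on enough independent benchmarks.

    `benchmark_winners` maps a benchmark name to the model that won it.
    Returns None when no model is corroborated or when the top spot is tied.
    """
    counts = Counter(benchmark_winners.values())
    top = counts.most_common(2)
    if not top:
        return None
    # A tie between the two most frequent winners means no clear general winner.
    if len(top) == 2 and top[0][1] == top[1][1]:
        return None
    model, wins = top[0]
    return model if wins >= min_benchmarks else None


# A GAIA win alone is not corroborated under this sketch.
gaia_only = general_winner({"GAIA": "OpenAI: GPT-5.4"})

# With a second (hypothetical) benchmark agreeing, the model qualifies.
corroborated = general_winner(
    {"GAIA": "OpenAI: GPT-5.4", "hypothetical_benchmark_b": "OpenAI: GPT-5.4"}
)
```

Under these assumptions, `gaia_only` is `None` and `corroborated` is the model name, which mirrors the distinction between a GAIA win and a general win.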
source record
category: agentic
metric: accuracy
matched models: 17
latest source date: 2026-05-15
direction: higher is better
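As an illustration of how the source record fields relate to the table, the sketch below picks a winner according to the `direction` field and derives the latest source date from the rows. The `Result` class and `pick_winner` function are hypothetical names introduced here. Only three of the 17 rows are reproduced, so the output reflects that subset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    model: str
    score: float  # accuracy, per the "metric" field
    agent: str    # scores are agent-dependent, so the agent is kept with each row
    tested: str   # ISO date string, which sorts chronologically as text


def pick_winner(results: list[Result], higher_is_better: bool = True) -> Result:
    """Select the best row according to the metric direction."""
    if not results:
        raise ValueError("no results to rank")
    chooser = max if higher_is_better else min
    return chooser(results, key=lambda r: r.score)


# Three rows copied from the table above.
rows = [
    Result("OpenAI: GPT-5.4", 84.7, "SB-Agent-4", "2026-04-27"),
    Result("OpenAI: GPT-4.1", 83.1, "Agent_v0.1.4", "2025-08-11"),
    Result("DeepSeek: DeepSeek V4 Pro", 74.8, "Corint v1.1", "2026-05-15"),
]

winner = pick_winner(rows, higher_is_better=True)
latest_source_date = max(r.tested for r in rows)
```

Here `winner` is the GPT-5.4 row and `latest_source_date` is `2026-05-15`. The latest date comes from a different row than the winner's, which is why the page reports the two dates separately.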