live
weekly refresh
basedagi.org
benchmark evidence

GAIA

GAIA is a general assistant benchmark. Results here are drawn from the public Hugging Face results dataset.

winner on GAIA: GPT-5.4 (OpenAI), score 84.7
direct benchmark result, not a broad vertical composite | source row dated 2026-04-27 | agent: SB-Agent-4
scored on 2026-05-15
latest mapped results | top 20
# | Model | Provider | Score | Evidence | Agent | Tested
1 | OpenAI: GPT-5.4 | OpenAI | 84.7 | agent-dependent, independent_benchmark | SB-Agent-4 | 2026-04-27
2 | OpenAI: GPT-4.1 | OpenAI | 83.1 | agent-dependent, independent_benchmark | Agent_v0.1.4 | 2025-08-11
3 | OpenAI: GPT-5 | OpenAI | 78.4 | agent-dependent, independent_benchmark | GenAgent_v0.0.3 | 2025-12-03
4 | OpenAI: GPT-5.1 | OpenAI | 75.8 | agent-dependent, independent_benchmark | XXP Agent | 2025-11-25
5 | DeepSeek: DeepSeek V4 Pro | DeepSeek | 74.8 | agent-dependent, independent_benchmark | Corint v1.1 | 2026-05-15
6 | Anthropic: Claude Opus 4.5 | Anthropic | 74.1 | agent-dependent, independent_benchmark | Clawdbot | 2026-01-29
7 | Anthropic: Claude Sonnet 4.5 | Anthropic | 71.4 | agent-dependent, independent_benchmark | Nexus test 1 | 2026-03-11
8 | Google: Gemini 2.5 Pro | Google | 66.1 | agent-dependent, independent_benchmark | ktc-agent-v2.0.2 | 2025-09-16
9 | OpenAI: o3 | OpenAI | 62.1 | agent-dependent, independent_benchmark | MetaAgentv0.5.11 | 2025-10-19
10 | Anthropic: Claude Sonnet 4.6 | Anthropic | 53.8 | agent-dependent, independent_benchmark | mt_agent_2.0 | 2025-08-22
11 | Anthropic: Claude Sonnet 4 | Anthropic | 51.2 | agent-dependent, independent_benchmark | OpenHands-Versa | 2025-06-09
12 | OpenAI: o1 | OpenAI | 49.8 | agent-dependent, independent_benchmark | open Deep Research (pass@1) | 2025-02-10
13 | OpenAI: o4 Mini | OpenAI | 42.5 | agent-dependent, independent_benchmark | Magentic-UI | 2025-05-24
14 | DeepSeek: DeepSeek V3.2 | DeepSeek | 37.2 | agent-dependent, independent_benchmark | meta-agent | 2026-01-09
15 | Google: Gemini 2.5 Flash | Google | 30.2 | agent-dependent, independent_benchmark | zzzzzzzz | 2025-07-19
16 | Qwen: Qwen3 32B | Qwen | 21.6 | agent-dependent, independent_benchmark | Qwen-3-Memory | 2025-06-19
17 | Google: Gemini 2.0 Flash | Google | 6.3 | agent-dependent, independent_benchmark | gemini-cot | 2025-08-17

what this result means

This benchmark contributes direct public evidence. Read its scope before generalizing the result.

A win here is a win on GAIA. Broad task pages require independent corroboration before naming a general winner.

source record
category: agentic
metric: accuracy
matched models: 17
latest source date: 2026-05-15
direction: higher is better
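
The source record fields (metric: accuracy, direction: higher is better) are enough to reproduce the ordering of the table. The following is a minimal sketch, not the site's actual pipeline: the `Result` class and the `rank` helper are hypothetical names introduced for illustration, and only three rows from the table are included as sample data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One leaderboard row (hypothetical structure, for illustration only)."""
    model: str
    score: float  # accuracy, in percent
    tested: str   # ISO date of the source row


def rank(results, higher_is_better=True):
    """Return results ordered best-first for the given metric direction."""
    return sorted(results, key=lambda r: r.score, reverse=higher_is_better)


# Three rows copied from the table above, deliberately out of order.
rows = [
    Result("OpenAI: GPT-4.1", 83.1, "2025-08-11"),
    Result("OpenAI: GPT-5.4", 84.7, "2026-04-27"),
    Result("DeepSeek: DeepSeek V4 Pro", 74.8, "2026-05-15"),
]

ranked = rank(rows, higher_is_better=True)
winner = ranked[0]

# ISO dates sort correctly as plain strings, so max() gives the latest one.
latest_source_date = max(r.tested for r in rows)

print(winner.model, winner.score)
print(latest_source_date)
```

Note that the winner's own row date (2026-04-27) and the latest source date (2026-05-15) differ because they come from different rows: the first is the test date of the top-scoring entry, the second is the most recent test date across all matched models.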