live
weekly refresh
basedagi.org
benchmark evidence

Toolathlon

Toolathlon pass@1 on multi-tool agent tasks.

winner on Toolathlon
direct benchmark result, not a broad vertical composite | source row dated 2026-04-25
scored on 2026-04-25 · stale source data (35d)
latest mapped results | top 20
#  Model                          Provider   Score  Evidence                           Tested
1  DeepSeek: DeepSeek V4 Pro      Deepseek   52.8   model-only, independent_benchmark  2026-04-25
2  DeepSeek: DeepSeek V4 Flash    Deepseek   48.2   model-only, independent_benchmark  2026-04-25
3  Anthropic: Claude Sonnet 4.6   Anthropic  44.8   model-only, independent_benchmark  2026-02-23
4  Anthropic: Claude Opus 4.5     Anthropic  43.5   model-only, independent_benchmark  2025-11-27
5  Anthropic: Claude Sonnet 4.5   Anthropic  38.9   model-only, independent_benchmark  2025-10-28
6  Anthropic: Claude Haiku 4.5    Anthropic  26.2   model-only, independent_benchmark  2025-10-28
7  Google: Gemini 2.5 Pro         Google     10.5   model-only, independent_benchmark  2025-10-28
8  Google: Gemini 2.5 Flash       Google     3.7    model-only, independent_benchmark  2025-10-28
what this result means

This benchmark contributes direct public evidence. Read its scope before generalizing the result.

A win here is a win on Toolathlon. Broad task pages require independent corroboration before naming a general winner.
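
For readers unfamiliar with the metric, pass@1 is commonly read as the share of tasks a model solves on its first attempt. The sketch below illustrates that reading only. The function name `pass_at_1` and the five sample outcomes are hypothetical and are not Toolathlon data, and the benchmark's own scoring harness may differ in its details.

```python
from typing import Sequence


def pass_at_1(first_attempt_passed: Sequence[bool]) -> float:
    """Return the percentage of tasks solved on the first attempt."""
    if not first_attempt_passed:
        raise ValueError("need at least one task outcome")
    # True counts as 1 and False as 0, so sum() is the number of solved tasks.
    return 100.0 * sum(first_attempt_passed) / len(first_attempt_passed)


# Hypothetical outcomes for five tasks (illustrative, not Toolathlon data).
outcomes = [True, False, True, False, False]
score = pass_at_1(outcomes)
print(f"pass@1 = {score:.1f}")
```

Under this reading, a score of 52.8 would mean that a little over half of the tasks were solved on the first try.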

source record
category: tool_use
metric: accuracy
matched models: 8
latest source date: 2026-04-25
direction: higher is better
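
The source-record fields can be checked with a few lines of code. The sketch below computes the age of the source data and orders rows by the stated direction (higher is better). The `as_of` date is an assumption inferred from the "35d" staleness label, not a value stated on the page, and the helper names `age_in_days` and `rank` are hypothetical.

```python
from datetime import date
from typing import List, Tuple


def age_in_days(source_date: date, as_of: date) -> int:
    """Number of whole days between the source date and a reference date."""
    return (as_of - source_date).days


def rank(rows: List[Tuple[str, float]], higher_is_better: bool = True) -> List[Tuple[str, float]]:
    """Order (model, score) rows so the best score comes first."""
    return sorted(rows, key=lambda row: row[1], reverse=higher_is_better)


source_date = date(2026, 4, 25)   # "latest source date" from the source record
as_of = date(2026, 5, 30)         # ASSUMPTION: chosen to match the "35d" label
age = age_in_days(source_date, as_of)

# Three rows copied from the results table, deliberately out of order.
rows = [
    ("Google: Gemini 2.5 Pro", 10.5),
    ("DeepSeek: DeepSeek V4 Pro", 52.8),
    ("Anthropic: Claude Sonnet 4.6", 44.8),
]
ranked = rank(rows)
print(age, ranked[0][0])
```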