benchmark evidence

Toolathlon

Toolathlon pass@1 on multi-tool agent tasks.

winner on Toolathlon: DeepSeek V4 Pro (55.9)

direct benchmark result, not a broad vertical composite | source row dated 2026-06-30

scored on 2026-06-30

latest mapped results | top 20 (8 models matched)

#	Model	Score (pass@1)	Evidence	Tested
1	DeepSeek: DeepSeek V4 Pro	55.9	model-only independent_benchmark	2026-06-30
2	DeepSeek: DeepSeek V4 Flash	48.2	model-only independent_benchmark	2026-04-25
3	Anthropic: Claude Sonnet 4.6	44.8	model-only independent_benchmark	2026-02-23
4	Anthropic: Claude Opus 4.5	43.5	model-only independent_benchmark	2025-11-27
5	Anthropic: Claude Sonnet 4.5	38.9	model-only independent_benchmark	2025-10-28
6	Anthropic: Claude Haiku 4.5	26.2	model-only independent_benchmark	2025-10-28
7	Google: Gemini 2.5 Pro	10.5	model-only independent_benchmark	2025-10-28
8	Google: Gemini 2.5 Flash	3.7	model-only independent_benchmark	2025-10-28

what this result means

This benchmark contributes direct public evidence. Read its scope before generalizing the result.

A win here is a win on Toolathlon. Broad task pages require independent corroboration before naming a general winner.
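
To make the metric concrete: pass@1 is commonly read as the share of tasks solved on a single attempt. The sketch below is a minimal illustration of that reading, not Toolathlon's actual scoring harness, which this page does not describe. It assumes each task yields one boolean outcome and that scores are reported as percentages; the function name `pass_at_1` and the sample outcomes are hypothetical.

```python
from typing import Sequence


def pass_at_1(outcomes: Sequence[bool]) -> float:
    """Return the percentage of tasks solved on their single attempt.

    Assumption: one attempt per task, one boolean per task.
    """
    if not outcomes:
        raise ValueError("need at least one task outcome")
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical run: 5 tasks, 2 solved on the first try.
example_outcomes = [True, False, False, True, False]
example_score = pass_at_1(example_outcomes)  # 40.0
```

Under this reading, a score such as 55.9 would mean that a little over half of the tasks were completed on the first attempt.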

source record

category: tool_use

metric: accuracy

matched models: 8

latest source date: 2026-06-30

direction: higher is better
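
The source record fields can be checked against the table with a short script. This is a rough sketch: the `Row` type, the `rank` helper, and the hard-coded rows are illustrative assumptions copied from the table above, not an export format offered by this page.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    model: str
    score: float
    tested: str  # ISO date, so string comparison orders dates correctly


# Rows transcribed from the leaderboard table above.
ROWS = [
    Row("DeepSeek V4 Pro", 55.9, "2026-06-30"),
    Row("DeepSeek V4 Flash", 48.2, "2026-04-25"),
    Row("Claude Sonnet 4.6", 44.8, "2026-02-23"),
    Row("Claude Opus 4.5", 43.5, "2025-11-27"),
    Row("Claude Sonnet 4.5", 38.9, "2025-10-28"),
    Row("Claude Haiku 4.5", 26.2, "2025-10-28"),
    Row("Gemini 2.5 Pro", 10.5, "2025-10-28"),
    Row("Gemini 2.5 Flash", 3.7, "2025-10-28"),
]


def rank(rows: List[Row], higher_is_better: bool = True) -> List[Row]:
    """Order rows best-first according to the metric's direction."""
    return sorted(rows, key=lambda r: r.score, reverse=higher_is_better)


ranked = rank(ROWS)                               # direction: higher is better
winner = ranked[0]                                # best-scoring row
matched_models = len(ROWS)                        # matched models
latest_source_date = max(r.tested for r in ROWS)  # latest source date
```

With the rows as listed, `winner` is the DeepSeek V4 Pro row, `matched_models` is 8, and `latest_source_date` is 2026-06-30, which agrees with the record above.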