Do agent benchmarks measure the model alone?

No. GAIA and Terminal-Bench publish results for an agent or scaffold using an underlying model. BasedAGI maps only attributable model rows and labels this evidence as agent-dependent.

▸ vertical

What is the best LLM for agents?

Ranked from public autonomous-task results. Agent benchmarks test a model inside a harness, so this ranking names the underlying model in the strongest attributable published runs. A named winner requires two current independent source publishers across at least 20 callable models; the scaffold is never treated as irrelevant.


▸ no current agents winner published

No current winner is published: qualifying independent evidence is older than 30 days. The table below shows available corroborated evidence, not a publishable current winner.
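The publication gate described above can be sketched roughly as follows. This is an illustration, not BasedAGI's actual implementation: the `EvidenceRow` fields, the function name, and the reading of "across at least 20 callable models" as a count of distinct models in current evidence are all assumptions. Only the three thresholds (30 days, two publishers, 20 models) come from the text.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Thresholds taken from the page text.
MAX_AGE_DAYS = 30      # evidence older than this is not "current"
MIN_PUBLISHERS = 2     # independent source publishers required
MIN_MODELS = 20        # callable models the current evidence must cover


@dataclass(frozen=True)
class EvidenceRow:
    """Hypothetical shape of one attributable result row."""
    publisher: str
    model: str
    published: date


def can_publish_winner(rows, today):
    """Return True only if current evidence meets both thresholds."""
    cutoff = today - timedelta(days=MAX_AGE_DAYS)
    current = [r for r in rows if r.published >= cutoff]
    publishers = {r.publisher for r in current}
    models = {r.model for r in current}
    return len(publishers) >= MIN_PUBLISHERS and len(models) >= MIN_MODELS
```

Under this sketch, a table full of corroborated but stale rows fails the gate, which matches the state shown on this page: evidence is listed, but no winner is named.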

▸ corroborated evidence · agentic score

#	Model	Agents	Tools	Computer	Price/M
1	OpenAI: GPT-5.5	76.3	—	—	$5.00
2	Anthropic: Claude Opus 4.5	66.3	43.0	—	$5.00
3	OpenAI: GPT-5.1	60.9	—	—	$1.25
4	OpenAI: GPT-5.4	60.8	—	—	$2.50
5	Anthropic: Claude Opus 4.7	59.2	78.6	—	$5.00
6	OpenAI: GPT-5	54.5	—	0.3	$1.25
7	Anthropic: Claude Sonnet 4.5	50.3	38.4	62.4	$3.00
8	Anthropic: Claude Opus 4.6	49.7	76.3	—	$5.00
9	Google: Gemini 2.5 Pro	48.8	10.0	—	$1.25
10	Anthropic: Claude Sonnet 4.6	47.9	58.9	71.6	$3.00
11	DeepSeek: DeepSeek V3.2	34.7	—	—	$0.21
12	Google: Gemini 2.5 Flash	21.3	3.2	—	$0.30

▸ evidence used

GAIA: general assistant tasks · public test set · agent-dependent

Terminal-Bench 2.0: shell task completion · model plus harness

tau2-bench: customer-support domains · verified standard submissions

SWE Atlas QnA: codebase investigation tasks

HiL-Bench: human-in-the-loop escalation tasks

Remote Labor Index: economically scoped remote work tasks

▸ read this before trusting the rank

Agent performance is not model performance with a different label. A harness, tools, prompts, retry policy, and budget can move the score.

Rows from GAIA and Terminal-Bench are stored as agent-dependent evidence. Rows that cannot be attributed to one underlying callable model are skipped.
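As a rough sketch of that mapping step: the row fields (`source`, `underlying_models`, `score`) and the function name below are hypothetical, chosen only to illustrate the rule. The two behaviours shown, skipping rows without exactly one underlying model and flagging GAIA and Terminal-Bench rows as agent-dependent, are the ones stated above.

```python
# Sources whose rows are stored as agent-dependent evidence (from the text).
AGENT_DEPENDENT_SOURCES = {"GAIA", "Terminal-Bench"}


def map_rows(raw_rows):
    """Keep rows attributable to exactly one underlying callable model."""
    mapped = []
    for row in raw_rows:
        models = row.get("underlying_models", [])
        if len(models) != 1:
            # Multi-model or unattributed agents cannot be mapped; skip them.
            continue
        mapped.append({
            "source": row["source"],
            "model": models[0],
            "score": row["score"],
            "agent_dependent": row["source"] in AGENT_DEPENDENT_SOURCES,
        })
    return mapped


# Illustrative input only; model names and scores are made up.
example = [
    {"source": "GAIA", "underlying_models": ["model-a"], "score": 61.0},
    {"source": "GAIA", "underlying_models": ["model-a", "model-b"], "score": 70.0},
    {"source": "Terminal-Bench", "underlying_models": [], "score": 55.0},
]
kept = map_rows(example)  # only the first row survives
```

The higher-scoring second row is dropped on purpose: a run that mixes two models says nothing attributable about either one.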

Use this page to choose an LLM for agent work. Use the underlying source leaderboard to choose the harness.
