weekly refresh
basedagi.org
▸ vertical

What is the best LLM for agents?

Ranked from public autonomous-task results. Agent benchmarks test a model inside a harness, so this ranking names the underlying model behind the strongest attributable published runs. A winner is named only when two current, independent source publishers cover at least 20 callable models. The scaffold is never treated as irrelevant.
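The naming rule can be read as a simple gate over the collected result rows. The sketch below is one possible reading, not the site's actual pipeline: the `Row` fields, the `is_current` flag, and the function name `can_name_winner` are all hypothetical, and it assumes the 20-model threshold counts distinct models across all current rows.

```python
from dataclasses import dataclass

MIN_PUBLISHERS = 2   # "two current independent source publishers"
MIN_MODELS = 20      # "at least 20 callable models"


@dataclass(frozen=True)
class Row:
    """One published result row (hypothetical shape, for illustration only)."""
    publisher: str    # independent source that published the run
    model: str        # underlying callable model the run is attributed to
    is_current: bool  # whether the result still counts as current


def can_name_winner(rows: list[Row]) -> bool:
    """Return True only if the current rows satisfy both thresholds."""
    current = [r for r in rows if r.is_current]
    publishers = {r.publisher for r in current}
    models = {r.model for r in current}
    return len(publishers) >= MIN_PUBLISHERS and len(models) >= MIN_MODELS
```

Under this reading, a single publisher covering many models still cannot produce a named winner, and neither can two publishers covering only a handful.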

▸ evidence used
GAIA: general assistant tasks · public test set · agent-dependent
Terminal-Bench 2.0: shell task completion · model plus harness
tau2-bench: customer-support domains · verified standard submissions
SWE Atlas QnA: codebase investigation tasks
HiL-Bench: human-in-the-loop escalation tasks
Remote Labor Index: economically scoped remote work tasks
▸ read this before trusting the rank

Agent performance is not model performance with a different label. A harness, tools, prompts, retry policy, and budget can move the score.

Rows from GAIA and Terminal-Bench are stored as agent-dependent evidence. Rows that cannot be attributed to one underlying callable model are skipped.
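The storage rule above can be sketched as a small ingest step. This is a minimal illustration under stated assumptions, not the site's implementation: the `RawRow` and `StoredRow` shapes, the `ingest` function, and the idea that a row lists its underlying models as a tuple are all hypothetical.

```python
from dataclasses import dataclass

# Sources whose rows are flagged as agent-dependent evidence.
AGENT_DEPENDENT_SOURCES = {"GAIA", "Terminal-Bench 2.0"}


@dataclass(frozen=True)
class RawRow:
    """A leaderboard row as scraped (hypothetical shape)."""
    source: str
    models: tuple[str, ...]  # underlying callable models named for the run
    score: float


@dataclass(frozen=True)
class StoredRow:
    """A row kept for ranking, attributed to exactly one model."""
    source: str
    model: str
    score: float
    agent_dependent: bool


def ingest(rows: list[RawRow]) -> list[StoredRow]:
    """Keep only rows attributable to one model; flag agent-dependent sources."""
    stored = []
    for row in rows:
        if len(row.models) != 1:
            # Zero or several underlying models: attribution is unclear, so skip.
            continue
        stored.append(
            StoredRow(
                source=row.source,
                model=row.models[0],
                score=row.score,
                agent_dependent=row.source in AGENT_DEPENDENT_SOURCES,
            )
        )
    return stored
```

Keeping the `agent_dependent` flag on the stored row, rather than dropping such rows, matches the text: the evidence is retained but labeled as depending on the harness.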

Use this page to choose an LLM for agent work. Use the underlying source leaderboard to choose the harness.