The best LLM right now.
Anthropic: Claude Opus 4.6 leads overall. No current math winner is published. Score date unavailable.
Math answer withheld: no current winner is published because the qualifying independent evidence is older than 30 days.
Today's primary vertical is agentic coding — the most data-rich and the wedge for this project. More verticals come online as their evidence base matures.
Public benchmark data aggregated and normalized into one score per model. Filter by task, price, or context window. Every number traces back to a source you can verify.
Best LLMs for coding.
Five headline benchmarks anchor the coding story here: SWE-Bench Verified, Aider Polyglot, BFCL, Terminal-Bench, and LiveCodeBench. The full coding composite currently draws from 14 benchmark inputs listed on /methodology.
Emotional intelligence
EQ evidence, with a broad winner withheld until weighting rationale is finalized.
Factual accuracy
Factuality evidence, with a broad winner withheld until weighting rationale is finalized.
Agent workflows
GAIA, tau2-bench, and terminal-task evidence, with harness dependence stated plainly.
Multilingual
Two independent cross-language suites. Winner withheld until coverage is comparable.
One index. Every benchmark sourced.
One index. Every public benchmark worth running, aggregated and scored continuously.
Every score traces back to a public benchmark run you can verify yourself. If a model is declining, the index says so.
Every score traces back to a public benchmark run and carries its source status and refresh date.
Each ranking averages the public benchmarks assigned to that category, with recency and missing-data penalties.
Models within 2 points are marked effectively tied. A missing score stays missing rather than becoming a fake neutral score.
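As a rough illustration of the averaging and tie rules described above, here is a minimal Python sketch. It is not the index's actual scoring code: the function names, the flat per-benchmark `missing_penalty`, and the inclusive 2-point boundary are all assumptions made for the example, and the recency penalty is left out because its form is not stated here.

```python
from typing import Optional

TIE_MARGIN = 2.0  # "within 2 points" from the text; inclusive boundary is an assumption


def category_score(scores: list[Optional[float]],
                   missing_penalty: float = 1.0) -> Optional[float]:
    """Average the benchmark scores that exist; None means 'not measured'.

    missing_penalty (points subtracted per missing benchmark) is a
    hypothetical parameter, not a published value.
    """
    present = [s for s in scores if s is not None]
    if not present:
        # Nothing was measured: keep the result missing instead of
        # substituting a made-up neutral score.
        return None
    mean = sum(present) / len(present)
    n_missing = len(scores) - len(present)
    return mean - missing_penalty * n_missing


def effectively_tied(a: Optional[float], b: Optional[float]) -> bool:
    """Two models are tied only if both have scores within TIE_MARGIN."""
    if a is None or b is None:
        return False  # a missing score cannot tie with anything
    return abs(a - b) <= TIE_MARGIN
```

Note that missing inputs are never replaced with a placeholder value before averaging; they only reduce the result through the penalty, and a model with no inputs at all has no score.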
Every score links to its source benchmark run. Dates and confidence levels are shown next to each result.
Every ranked surface displays its score date. Source data older than 30 days is labeled stale.
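The 30-day staleness rule could be expressed along these lines. This is a sketch under assumptions: the function name and the "current" and "date unavailable" labels are hypothetical, and only the "stale" label and the 30-day threshold come from the text.

```python
from datetime import date
from typing import Optional

STALE_AFTER_DAYS = 30  # threshold stated in the text


def freshness_label(source_date: Optional[date], today: date) -> str:
    """Label a source by age; 'older than 30 days' is read as age > 30."""
    if source_date is None:
        return "date unavailable"  # hypothetical label for a missing date
    age_days = (today - source_date).days
    return "stale" if age_days > STALE_AFTER_DAYS else "current"
```

For example, a source dated 2025-01-01 would still count as current on 2025-01-31 (exactly 30 days old) and would become stale on 2025-02-01.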
Configured runs such as thinking or high-effort modes stay distinct unless an identity mapping is reviewed.