independent · open benchmarks · weekly refresh

The best LLM
right now.

Anthropic: Claude Opus 4.6 leads overall. No current math winner is published. Score date unavailable.

Math answer withheld: no current winner is published because qualifying independent evidence is older than 30 days.

Today's primary vertical is agentic coding, the most data-rich vertical and the wedge for this project. More verticals come online as their evidence base matures.

▸ chart

top models · composite score · default weights · orientation only

full index →

composite scores · default weights · higher = stronger aggregate

methodology →

Public benchmark data aggregated and normalized into one score per model. Filter by task, price, or context window. Every number traces back to a source you can verify.

How scores are calculated →

▸ index status

models tracked: 40

benchmarks: 89 across 8 verticals

refresh cadence: weekly

this week: DeepSeek: DeepSeek V4 Flash +0.5 (largest move) · no significant moves

▸ coding

Best LLMs for coding.

Five headline benchmarks anchor the coding story here: SWE-Bench Verified, Aider Polyglot, BFCL, Terminal-Bench, and LiveCodeBench. The full coding composite currently draws from 14 benchmark inputs listed on /methodology.

coding leaderboard →

▸ coding benchmarks · 5 headline / 14 total

SWE-Bench Verified: real bugs · Python · independent

Aider Polyglot: multi-language code edits

BFCL: function calling · tool use

Terminal-Bench: agentic shell tasks

LiveCodeBench: time-windowed · low contamination

▸ ranked answers

vertical

Emotional intelligence

EQ evidence, with a broad winner withheld until weighting rationale is finalized.

see evidence →

vertical

Factual accuracy

Hallucination evidence, without naming a broad winner from one independent source.

see evidence →

vertical

Agent workflows

GAIA, tau2-bench, and terminal-task evidence, with harness dependence stated plainly.

see evidence →

vertical

Multilingual

Two independent cross-language suites. Winner withheld until coverage is comparable.

see evidence →

▸ scoring

One index. Every benchmark sourced.

One index. Every public benchmark worth running, aggregated and scored continuously.

Every score traces back to a public benchmark run you can verify yourself. If a model is declining, the index says so.

Full methodology →

open benchmarks

Every score traces back to a public benchmark run and carries its source status and refresh date.

category composites

Each ranking averages the public benchmarks assigned to that category, with recency and missing-data penalties.
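
As a rough illustration of the sentence above, a category composite could be sketched as follows. The exact formula is not given here, so everything specific in this block is an assumption: the function name `category_composite`, the exponential recency weighting with a `half_life_days` parameter, and the per-benchmark `missing_penalty` are hypothetical choices, not the index's published method.

```python
from typing import Optional


def category_composite(
    scores: dict[str, Optional[float]],
    ages_days: dict[str, int],
    half_life_days: float = 90.0,   # assumed: weight halves every 90 days
    missing_penalty: float = 0.05,  # assumed: 5% reduction per missing benchmark
) -> Optional[float]:
    """Hypothetical composite: recency-weighted mean with a missing-data penalty."""
    present = {name: s for name, s in scores.items() if s is not None}
    if not present:
        return None  # no evidence at all: the composite stays missing

    # Older results count for less (exponential decay by age in days).
    weights = {name: 0.5 ** (ages_days[name] / half_life_days) for name in present}
    weighted_mean = sum(present[n] * weights[n] for n in present) / sum(weights.values())

    # Each benchmark without a score scales the composite down slightly.
    missing = len(scores) - len(present)
    return weighted_mean * (1.0 - missing_penalty) ** missing
```

With these assumed defaults, two fresh scores of 80 and 60 plus one missing benchmark give a weighted mean of 70, reduced by the single missing-data penalty. A model with no scores in the category returns `None` rather than a number.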

honest margins

Models within 2 points are marked effectively tied. A missing score stays missing rather than becoming a fake neutral score.
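
The tie rule above can be sketched in a few lines. This is a minimal illustration under stated assumptions: `rank_with_ties` is a hypothetical helper, "within 2 points" is read as a gap of at most 2 from the top score, and ties are measured against the leader only (the text does not say whether ties are also marked pairwise further down the list).

```python
from typing import Optional

TIE_MARGIN = 2.0  # from the text: models within 2 points are effectively tied


def rank_with_ties(
    scores: dict[str, Optional[float]], margin: float = TIE_MARGIN
) -> tuple[list[tuple[str, float, bool]], list[str]]:
    """Return (ranked rows, unscored models).

    Each ranked row is (model, score, tied_with_leader).
    Models without a score are listed separately and never given a filler value.
    """
    scored = sorted(
        ((m, s) for m, s in scores.items() if s is not None),
        key=lambda pair: pair[1],
        reverse=True,
    )
    unscored = [m for m, s in scores.items() if s is None]
    if not scored:
        return [], unscored

    top = scored[0][1]
    rows = [(m, s, (top - s) <= margin) for m, s in scored]
    return rows, unscored
```

The second return value keeps missing scores visibly missing, which mirrors the rule that a gap is never replaced by a neutral number.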

verifiable

Every score links to its source benchmark run. Dates and confidence levels are shown next to each result.

freshness

Every ranked surface displays its score date. Source data older than 30 days is labeled stale.
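
A minimal sketch of the freshness rule, assuming "older than 30 days" means strictly more than 30 days between the score date and today. The function name `freshness_label` and the "fresh" label are hypothetical; only the 30-day threshold, the "stale" label, and the "score date unavailable" wording come from this page.

```python
from datetime import date
from typing import Optional

STALE_AFTER_DAYS = 30  # from the text: source data older than 30 days is stale


def freshness_label(score_date: Optional[date], today: date) -> str:
    """Label a result by the age of its source data."""
    if score_date is None:
        return "score date unavailable"
    age_days = (today - score_date).days
    # Assumed boundary: exactly 30 days old still counts as fresh.
    return "stale" if age_days > STALE_AFTER_DAYS else "fresh"
```

Passing `today` explicitly instead of calling `date.today()` inside the function keeps the label reproducible for a given refresh.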

attribution

Configured runs such as thinking or high-effort modes stay distinct unless an identity mapping is reviewed.
