live
weekly refresh
basedagi.org
independent · open benchmarks · weekly refresh

The best LLM
right now.

Anthropic: Claude Opus 4.6 leads overall. No current math winner is published. Score date unavailable.

Math answer withheld: no current winner is published because qualifying independent evidence is older than 30 days.

Today's primary vertical is agentic coding, the most data-rich vertical and the wedge for this project. More verticals come online as their evidence base matures.

▸ chart
top models · composite score · default weights · orientation only
full index →
1 · Claude Opus 4.6 · 95.7
2 · Claude Opus 4.7 · 95.4
3 · GPT-5.5 · 91.1
4 · GPT-5.4 · 91.0
5 · Claude Sonnet 4.6 · 86.6
6 · GLM 5 · 85.1
7 · GPT-5.4 Mini · 84.9
8 · Claude Sonnet 4.5 · 84.8
9 · GPT-5.2 · 81.6
10 · DeepSeek V3.2 · 81.3
11 · Gemini 2.0 Flash Lite · 81.2
12 · Claude Opus 4.5 · 80.2
composite scores · default weights · higher = stronger aggregate
methodology →

Public benchmark data aggregated and normalized into one score per model. Filter by task, price, or context window. Every number traces back to a source you can verify.

How scores are calculated →

▸ index status
models tracked: 40
benchmarks: 83 across 8 verticals
refresh cadence: weekly
▸ coding

Best LLMs for coding.

Five headline benchmarks anchor the coding story here: SWE-Bench Verified, Aider Polyglot, BFCL, Terminal-Bench, and LiveCodeBench. The full coding composite currently draws from 14 benchmark inputs listed on /methodology.

▸ coding benchmarks · 5 headline / 14 total
SWE-Bench Verified: real bugs · Python · independent
Aider Polyglot: multi-language code edits
BFCL: function calling · tool use
Terminal-Bench: agentic shell tasks
LiveCodeBench: time-windowed · low contamination
▸ scoring

One index. Every benchmark sourced.

One index. Every public benchmark worth running, aggregated and scored continuously.

Every score traces back to a public benchmark run you can verify yourself. If a model is declining, the index says so.

Full methodology →

open benchmarks

Every score traces back to a public benchmark run and carries its source status and refresh date.

category composites

Each ranking averages the public benchmarks assigned to that category, with recency and missing-data penalties.
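
As a rough illustration of what averaging with recency and missing-data penalties could look like, here is a minimal sketch. The page does not publish the formula here, so everything specific is an assumption: the `BenchmarkResult` type, the `category_composite` function, the half weight for older runs, the 30-day cutoff reused as the recency boundary, and the flat per-benchmark deduction for missing scores are all hypothetical. See /methodology for the actual calculation.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class BenchmarkResult:
    name: str
    score: Optional[float]    # normalized 0-100; None when no public result exists
    run_date: Optional[date]  # date of the source run; None when the score is missing


def category_composite(results: List[BenchmarkResult], as_of: date,
                       stale_after_days: int = 30,   # assumed recency boundary
                       stale_weight: float = 0.5,    # hypothetical down-weight for old runs
                       missing_penalty: float = 1.0  # hypothetical deduction per missing score
                       ) -> Optional[float]:
    """Weighted mean of available scores, minus a penalty per missing benchmark."""
    weighted_sum = 0.0
    total_weight = 0.0
    missing = 0
    for r in results:
        if r.score is None or r.run_date is None:
            missing += 1  # a missing score never enters the mean as a neutral value
            continue
        age_days = (as_of - r.run_date).days
        weight = stale_weight if age_days > stale_after_days else 1.0
        weighted_sum += weight * r.score
        total_weight += weight
    if total_weight == 0:
        return None  # no evidence at all, so no composite is reported
    return weighted_sum / total_weight - missing_penalty * missing
```

In this sketch an older run still counts, but less than a recent one, and each gap in coverage lowers the composite instead of being filled with an invented value.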

honest margins

Models within 2 points are marked effectively tied. A missing score stays missing rather than becoming a fake neutral score.
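
The tie rule can be sketched in a few lines. The 2-point margin comes from the text above; the function name `effectively_tied`, the inclusive boundary (exactly 2 points counts as tied), and returning `None` for an undefined comparison are assumptions made for illustration.

```python
from typing import Optional

TIE_MARGIN = 2.0  # from the text: models within 2 points are effectively tied


def effectively_tied(a: Optional[float], b: Optional[float],
                     margin: float = TIE_MARGIN) -> Optional[bool]:
    """Return True/False for a tie, or None when either score is missing."""
    if a is None or b is None:
        return None  # a missing score stays missing; no comparison is made
    return abs(a - b) <= margin  # assumption: the boundary is inclusive


print(effectively_tied(95.7, 95.4))  # True
print(effectively_tied(95.4, 91.1))  # False
print(effectively_tied(None, 90.0))  # None
```

Returning `None` instead of substituting a default keeps a model with no data from appearing tied with, or behind, a model that has a real score.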

verifiable

Every score links to its source benchmark run. Dates and confidence levels are shown next to each result.

freshness

Every ranked surface displays its score date. Source data older than 30 days is labeled stale.
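
A minimal sketch of the staleness label, assuming "older than 30 days" means strictly more than 30 days between the source date and today. The function name `freshness_label` and the "fresh" label are hypothetical; "stale" and "score date unavailable" are the wordings used on this page.

```python
from datetime import date
from typing import Optional

STALE_AFTER_DAYS = 30  # from the text: source data older than 30 days is stale


def freshness_label(source_date: Optional[date], today: date) -> str:
    """Label a score by the age of its source data."""
    if source_date is None:
        return "score date unavailable"
    age_days = (today - source_date).days
    return "stale" if age_days > STALE_AFTER_DAYS else "fresh"


print(freshness_label(date(2025, 3, 1), date(2025, 3, 31)))  # fresh (exactly 30 days)
print(freshness_label(date(2025, 3, 1), date(2025, 4, 1)))   # stale (31 days)
print(freshness_label(None, date(2025, 4, 1)))               # score date unavailable
```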

attribution

Configured runs such as thinking or high-effort modes stay distinct unless an identity mapping is reviewed.