▸ methodology

How scores are calculated.

Every score traces back to a public benchmark run you can verify. Sources, dates, and confidence levels are shown for every result.

▸ data sources

SWE-Bench Verified: real GitHub issue resolution · 500 tasks

Aider Polyglot: multi-language code editing

BFCL v4: function calling · tool use accuracy

Terminal-Bench: agentic shell tasks

GPQA Diamond: graduate-level science · expert-verified

MMLU-Pro: multi-domain academic · 10-way

HLE: Humanity's Last Exam · frontier-level

LiveBench: time-windowed · contamination-resistant

AlpacaEval 2.0: instruction following · GPT-4 judged

Artificial Analysis: independent intelligence index

Open LLM Leaderboard v2: IFEval · BBH · MATH · GPQA

OpenRouter: real-time pricing data

refresh cadence varies by source · all sources public

▸ approach

BasedAGI aggregates public benchmark results, labels vendor-published evidence explicitly, normalizes scores to a 0–100 scale, and computes task-specific composite rankings using documented benchmark mappings.

One named winner is published for each question. Overall and task-specific winners can differ because they answer different questions. Each vertical maps to the benchmarks listed below and does not change silently.

Models without enough benchmark coverage show — (insufficient data) rather than an interpolated or neutral score. A missing score is not the same as a bad score.

Broad task and vertical winners are published only when at least two independent source publishers have tested the model within the last 30 days and at least 20 callable models meet that standard. Multiple panels from one publisher and older results remain visible as evidence, but cannot establish a current general winner.
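
The publication rule above can be sketched as a two-step check. This is a minimal illustration, not BasedAGI's actual code: the names `is_current` and `can_publish_winner`, the `(publisher, test date)` evidence shape, and the inclusive 30-day boundary are all assumptions made for the example.

```python
from datetime import date, timedelta
from typing import Dict, List, Tuple

# Hypothetical evidence record: (source publisher, date the model was tested).
Evidence = Tuple[str, date]


def is_current(evidence: List[Evidence], today: date,
               window_days: int = 30, min_publishers: int = 2) -> bool:
    """True when enough distinct publishers tested the model inside the window."""
    cutoff = today - timedelta(days=window_days)
    # A set collapses multiple panels from one publisher into a single entry.
    recent_publishers = {pub for pub, tested in evidence if tested >= cutoff}
    return len(recent_publishers) >= min_publishers


def can_publish_winner(models: Dict[str, List[Evidence]], today: date,
                       min_models: int = 20) -> bool:
    """True when at least `min_models` models meet the per-model standard."""
    qualifying = sum(1 for ev in models.values() if is_current(ev, today))
    return qualifying >= min_models
```

Counting distinct publishers rather than results is what keeps several panels from one publisher from establishing a winner on their own.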

▸ normalization

NormalizedScore = (raw − min) / (max − min) × 100

Raw scores are converted to a 0–100 scale per benchmark. Percentile scoring is used where available to account for benchmark difficulty differences.
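
As a minimal sketch of the formula above: the function name `normalize` is hypothetical, and the page does not say how min and max are chosen per benchmark, so they are passed in explicitly here. Returning `None` for a missing raw score is an assumption that follows the "insufficient data" rule rather than a documented behavior.

```python
from typing import Optional


def normalize(raw: Optional[float], lo: float, hi: float) -> Optional[float]:
    """Min-max normalize a raw benchmark score onto a 0-100 scale."""
    if raw is None:
        return None  # a missing score stays missing; it is not turned into 0
    if hi == lo:
        raise ValueError("max and min must differ")
    return (raw - lo) / (hi - lo) * 100.0


# An Elo-style rating of 1250 on a benchmark assumed to range from 1000 to 1500.
print(normalize(1250.0, 1000.0, 1500.0))  # 50.0
```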

▸ category inputs

Coding: SWE-Bench Verified · BigCodeBench · EvalPlus · Aider · LiveBench Coding · BenchLM · Terminal-Bench

Reasoning: GPQA Diamond · MMLU-Pro · HLE · SimpleQA · BenchLM · LegalBench · MMMU

Writing: LMSYS Elo · LiveBench Language · AlpacaEval LC · BenchLM Instruction Following

JSON: BFCL Overall · Non-Live · Live · Multi-Turn

weight = recency × source confidence × data status × result confidence · composite = weighted average − missing-data penalty
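
The two formulas above might be combined as follows. This is a sketch under stated assumptions: the `Result` fields, the 0 to 1 range of each factor, and the flat `penalty_per_missing` deduction are all hypothetical, since the page names a missing-data penalty without giving its form.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Result:
    score: Optional[float]      # normalized 0-100, or None when missing
    recency: float              # each factor is assumed to lie in [0, 1]
    source_confidence: float
    data_status: float
    result_confidence: float

    @property
    def weight(self) -> float:
        # weight = recency x source confidence x data status x result confidence
        return (self.recency * self.source_confidence
                * self.data_status * self.result_confidence)


def composite(results: List[Result],
              penalty_per_missing: float = 2.0) -> Optional[float]:
    """Weighted average of available scores minus a missing-data penalty."""
    present = [r for r in results if r.score is not None]
    total_weight = sum(r.weight for r in present)
    if not present or total_weight == 0:
        return None  # insufficient data, not a neutral score
    weighted_avg = sum(r.score * r.weight for r in present) / total_weight
    missing = len(results) - len(present)
    # Assumed penalty form: a flat deduction per missing benchmark input.
    return weighted_avg - penalty_per_missing * missing
```

With scores of 90 (weight 1.0) and 60 (weight 0.5) plus one missing input, the weighted average is 80 and the assumed penalty brings the composite to 78.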

▸ benchmark inventory

The homepage stays compact. This page carries the full benchmark inventory behind the index, with scope made explicit.

Per-language multilingual rows count separately here because they are separate public benchmark entries in the database, even when a broad leaderboard rolls them up into one composite.

The homepage talks about 8 public-facing verticals. This table is broader: it shows all 13 benchmark categories currently tracked underneath the index, including supporting categories like tool use, computer use, uncensored, and overall.

▸ vertical breakdown · 85

Vertical	Benchmarks	Count
Coding	Aider Coding Benchmark · BenchLM Coding · BigCodeBench Complete · BigCodeBench Instruct · EvalPlus HumanEval+ · EvalPlus MBPP+ · LiveBench Coding · LiveCodeBench · Scale Coding Evaluation · SWE Atlas - Refactoring · SWE Atlas - Test Writing · SWE-bench Verified · Terminal-Bench 2.0 · WebDev Arena	14
Reasoning	BBH (Open LLM Leaderboard) · BenchLM Knowledge · BenchLM Reasoning · EnigmaEval · GPQA Diamond · Humanity's Last Exam · LegalBench · MMLU-Pro · MMMU · SimpleQA Factual Accuracy	10
Math	AIME 2025 · BenchLM Math · MATH Level 5 (Open LLM Leaderboard) · MATH-500 · Scale Math Evaluation	5
Writing	AlpacaEval 2.0 LC Win Rate · BenchLM Instruction Following · LiveBench Language · TutorBench	4
Function Calling	BFCL Live · BFCL Multi-Turn · BFCL Non-Live · BFCL Overall	4
EQ	EQ-Bench Creative Writing v3 · EQ-Bench Judgemark · EQ-Bench v3	3
Factuality	TruthfulQA · Vectara HHEM	2
Agentic	GAIA · HiL-Bench Pass@3 · Remote Labor Index · SWE Atlas - Codebase QnA · tau2-bench Airline · tau2-bench Banking Knowledge · tau2-bench Retail · tau2-bench Telecom	8
Tool Use	MCP Atlas · Toolathlon	2
Computer Use	OSWorld-Verified · ScreenSpot-Pro	2
Multilingual	AI Language Proficiency Monitor Arabic · AI Language Proficiency Monitor Average · AI Language Proficiency Monitor Bengali · AI Language Proficiency Monitor French · AI Language Proficiency Monitor German · AI Language Proficiency Monitor Hindi · AI Language Proficiency Monitor Japanese · AI Language Proficiency Monitor Mandarin Chinese · AI Language Proficiency Monitor Portuguese · AI Language Proficiency Monitor Russian · AI Language Proficiency Monitor Spanish · Global-MMLU-Lite Arabic · Global-MMLU-Lite Average · Global-MMLU-Lite Bengali · Global-MMLU-Lite French · Global-MMLU-Lite German · Global-MMLU-Lite Hindi · Global-MMLU-Lite Japanese · Global-MMLU-Lite Mandarin Chinese · Global-MMLU-Lite Portuguese · Global-MMLU-Lite Spanish · MGSM (Aggregate)	22
Uncensored	UGI — ugi-natural-intelligence · UGI — ugi-overall · UGI — ugi-willingness · UGI — ugi-writing	4
Overall	Arena-Hard · Artificial Analysis Intelligence Index · Chatbot Arena (LMSYS) · IFEval (Open LLM Leaderboard) · LiveBench Instruction Following	5
Total	Full benchmark inventory currently tracked on BasedAGI.	85

▸ how each vertical is weighted

These are the benchmark panels that currently drive each public vertical, plus the argument for why each benchmark deserves its share of the composite.

If a vertical is marked withheld, the evidence stays public but BasedAGI does not publish a named winner from that composite until the weighting rationale is stable enough to defend.

▸ Agentic · published

last reviewed 2026-05-27

GAIA leads because it is the broadest public agent benchmark with recognizable frontier difficulty. Terminal-Bench and tau2-bench stay close because they measure operational, tool-using task completion rather than pure QA. SWE Atlas, HiL-Bench, and the Remote Labor Index add narrower agent behaviors, but they remain supporting evidence instead of the backbone of the claim.

Benchmark	Weight	Rationale
gaia	0.22	Broad autonomous assistant tasks make GAIA the best single public signal for general agent capability.
terminal-bench	0.16	Shell execution under harness constraints captures real operational competence instead of static QA.
tau2-bench-retail	0.12	Customer-support task completion stresses long-horizon policy following in a practical domain.
tau2-bench-airline	0.10	Airline workflows add a different constrained-service domain with distinct policy and planning structure.
tau2-bench-telecom	0.08	Telecom support introduces procedural tool use that is less redundant than a generic average would suggest.
tau2-bench-banking-knowledge	0.07	Banking knowledge tasks emphasize policy-sensitive agent reasoning rather than broad task generality.
swe-atlas-qna	0.10	Codebase investigation is a distinct agent workflow that general assistant suites underweight.
hil-bench-pass-at-3	0.08	Human-in-the-loop escalation behavior matters because good agents must know when to defer.
remote-labor-index	0.07	Economic task framing is valuable, but the suite is still too narrow to outweigh broader agent panels.
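
Because each weight is described as a share of the composite, the weights in a panel should sum to 1.0. The check below is a small sketch using the Agentic weights from the table above; the helper name `weights_sum_to_one` is hypothetical, and the same check could be applied to any other panel.

```python
import math
from typing import Dict

# Weights copied from the Agentic table above.
AGENTIC_WEIGHTS: Dict[str, float] = {
    "gaia": 0.22,
    "terminal-bench": 0.16,
    "tau2-bench-retail": 0.12,
    "tau2-bench-airline": 0.10,
    "tau2-bench-telecom": 0.08,
    "tau2-bench-banking-knowledge": 0.07,
    "swe-atlas-qna": 0.10,
    "hil-bench-pass-at-3": 0.08,
    "remote-labor-index": 0.07,
}


def weights_sum_to_one(weights: Dict[str, float], tol: float = 1e-9) -> bool:
    """Check that panel weights form a complete share of the composite."""
    # A tolerance is needed because decimal fractions are inexact in floats.
    return math.isclose(sum(weights.values()), 1.0, abs_tol=tol)


print(weights_sum_to_one(AGENTIC_WEIGHTS))  # True
```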

▸ Coding · published

last reviewed 2026-05-27

SWE-bench Verified stays on top because it is still the cleanest public proxy for real bug-fix work under repo constraints. LiveCodeBench and Aider stay near it because they measure contamination-resistant code generation and edit quality in active workflows. BigCodeBench, EvalPlus, and the agentic coding suites broaden coverage, but they are downweighted when they duplicate the same narrow pass/fail coding loop.

Benchmark	Weight	Rationale
swe-bench-verified	0.18	Real repository bug fixing under realistic constraints; still the strongest public proxy for production coding.
livecodebench-pass-at-1	0.14	Time-windowed coding evaluation reduces contamination and captures current code-generation ability.
aider-coding	0.12	Measures iterative edit quality in an agent-style coding loop rather than one-shot generation.
terminal-bench	0.10	Tests whether a model can execute code-adjacent shell tasks, which pure code benchmarks miss.
bigcodebench-complete	0.09	Full problem completion captures longer synthesis than patch-style editing benchmarks.
bigcodebench-instruct	0.06	Instruction-following variant adds prompt-compliance signal that the complete set does not isolate.
evalplus-humaneval	0.06	Still useful as a clean floor for function-level correctness despite saturation at the frontier.
evalplus-mbpp	0.05	Broader short-program coverage than HumanEval, but lower ceiling signal for frontier models.
livebench-coding	0.05	Fresh coding questions provide recency signal distinct from curated software-engineering suites.
benchlm-coding	0.04	Aggregated coding evidence broadens model coverage, but overlaps with stronger primary coding sources.
swe-atlas-test-writing	0.03	Isolates test-authoring behavior that bug-fix benchmarks do not explicitly measure.
swe-atlas-refactoring	0.03	Captures structural code transformation skill rather than raw patch success.
webdev-arena	0.02	Front-end implementation signal is useful, but the task family is narrower than general coding.
scale-coding-eval	0.03	Helpful as an extra modern coding check, but too close to adjacent coding suites to carry more weight.

▸ Computer Use · published

last reviewed 2026-05-27

OSWorld-Verified leads because end-to-end computer-use task completion matters more than pointwise grounding alone. ScreenSpot-Pro remains important because visual grounding is a prerequisite skill, but it is still a sub-capability rather than the whole desktop task.

Benchmark	Weight	Rationale
osworld-verified	0.60	End-to-end GUI task execution is the strongest current public proxy for actual computer-use competence.
screenspot-pro	0.40	Screen grounding matters because models that cannot localize UI targets will fail before full task execution begins.

▸ EQ · withheld

last reviewed 2026-05-27

EQ-Bench v3, Creative Writing v3, and Judgemark all come from the same publisher and partially share evaluation machinery. I can describe what each sub-panel measures, but I cannot yet defend a public broad-EQ weighting as an independent ranking claim against a hostile reviewer. This vertical stays visible as source evidence until a second independent public EQ-style suite exists or the panel is narrowed to a claim I can defend.

withholding reason: weighting rationale not yet strong enough for a public winner claim

Benchmark	Weight	Rationale
eq-bench-v3	0.50	Core emotional-intelligence rubric across empathy, validation, and social reasoning; the main signal in the suite.
eq-bench-creative-writing-v3	0.30	Measures prose warmth, style, and originality, which are adjacent to EQ but not reducible to empathy alone.
eq-bench-judgemark	0.20	Evaluator-calibration signal matters because EQ scoring is judge-sensitive, but it is still not an independent task source.

▸ Factuality · published

last reviewed 2026-05-31

Three distinct failure modes, one per benchmark. Vectara HHEM targets document-grounded faithfulness, the failure mode most common in RAG pipelines, where a model fabricates content that contradicts its source. SimpleQA tests closed-domain factual recall using short questions with unambiguous correct answers; it is vendor-published by OpenAI, which reports results on its own models, but the results are independently verifiable. TruthfulQA (new) isolates common misconceptions and conspiracy-adjacent claims that models absorb from pretraining data, a different failure mode from both faithfulness and factual recall. Together the panel covers the three most important hallucination axes; two of the three sources are fully independent.

Benchmark	Weight	Rationale
vectara-hhem	0.55	Strongest independent signal — document-grounded faithfulness is the most common real-world hallucination failure mode.
simpleqa	0.25	Closed-domain factual recall adds a distinct second failure mode; vendor-published but independently verifiable.
truthfulqa	0.20	Targets pretraining-absorbed misconceptions and conspiracy-adjacent falsehoods — failure mode invisible to faithfulness and factual-recall suites.

▸ Function Calling · published

last reviewed 2026-05-27

BFCL Overall is weighted highest because it is the best broad proxy for production function-calling reliability across the suite. Multi-Turn and Live stay next because real systems break on stateful tool use and live tool execution, not just one-shot JSON. Non-Live remains useful, but it is the easiest part of the panel and overlaps heavily with the aggregate score.

Benchmark	Weight	Rationale
bfcl-overall	0.40	Broad aggregate function-calling accuracy is still the best single headline measure for production reliability.
bfcl-multi-turn	0.25	Stateful multi-turn tool use captures failure modes that one-shot structured output never exposes.
bfcl-live	0.20	Live tool interaction tests execution under realistic runtime constraints instead of static call formatting alone.
bfcl-non-live	0.15	Static invocation accuracy is still useful as a floor, but it is the easiest BFCL slice and overlaps with Overall.

▸ Instruction Following · published

last reviewed 2026-05-31

Instruction following is a production-critical capability orthogonal to raw reasoning or writing quality: a model can write brilliantly while ignoring format constraints, word limits, or explicit negative rules. IFEval (Instruction Following Eval) is the cleanest public benchmark for this: 541 prompts built from verifiable instructions (JSON output, word limits, forbidden keywords, language constraints), scored by pure programmatic pass/fail with no LLM judge. It is independently published by Google Research and run by both the Open LLM Leaderboard and LLM-Stats, giving two independent source publishers for confidence scoring. The panel deliberately stays narrow until a second benchmark with comparable rigor emerges.

Benchmark	Weight	Rationale
open-llm-ifeval	1.00	Only benchmark in the panel with fully programmatic, LLM-judge-free scoring of explicit instruction constraints.

▸ Long Context · published

last reviewed 2026-05-31

Long-context capability is a distinct production requirement orthogonal to per-token quality: a model can score well on standard benchmarks while completely failing to synthesize information across hundreds of thousands of tokens. LongBench-v2 (THUDM/Tsinghua, 2024) is the strongest open benchmark for this: 503 bilingual, human-verified long-context questions across single-doc QA, multi-doc QA, long in-context learning, and code repo comprehension, with gold labels and no LLM-judge dependency. NIAH and RULER, by contrast, rely on synthetic retrieval tasks. The panel is deliberately narrow while the field standardises: long-context evaluation methodology is still converging, and most alternatives either lack verified data or cover too few models to be actionable.

Benchmark	Weight	Rationale
longbench-v2	1.00	Only benchmark in the panel with verified, multi-task long-context evaluation and no LLM-judge dependency.

▸ Math · published

last reviewed 2026-05-27

AIME 2025 leads because it is the best public proxy here for hard, multi-step symbolic reasoning under real contest pressure. MATH-500 stays close because it covers broader math domains and solution styles. Open LLM Math and BenchLM Math broaden model coverage, while MMLU-Pro and Scale Math Evaluation stay lower as supporting evidence because they partially overlap with the harder dedicated math panels.

Benchmark	Weight	Rationale
aime-2025	0.28	Hard contest math forces multi-step chain construction and remains a top frontier discriminator.
math-500	0.22	Broader subject coverage adds general mathematical robustness beyond olympiad-style question patterns.
open-llm-math	0.16	Open leaderboard math signal improves breadth and comparability across more callable models.
benchlm-math	0.14	Aggregated math evidence is useful for model coverage, but secondary to direct dedicated math benchmarks.
scale-math-eval	0.10	Modern applied math evaluation contributes current signal not fully captured by legacy benchmark sets.
mmlu-pro	0.10	Quantitative academic reasoning adds a cross-domain math-adjacent check without dominating the panel.

▸ Multilingual · published

last reviewed 2026-05-30

The broad multilingual claim is intentionally narrow: one aggregate row from each independent public suite, not every per-language slice multiplied into fake certainty. Global-MMLU-Lite gets the highest weight because it is the cleanest broad cross-language knowledge readout. LanguageBench adds diversity by mixing ARC, MGSM, and MMLU across languages. MGSM-avg is a dedicated multilingual math benchmark (10 languages) from OpenAI simple-evals — a useful third signal that catches language-math reasoning gaps invisible to knowledge-only suites.

Benchmark	Weight	Rationale
global-mmlu-lite-avg	0.50	Cross-language knowledge consistency is the cleanest single broad multilingual capability readout in the current panel.
languagebench-avg	0.35	Mixed-task language evaluation adds reasoning and instruction-following diversity that pure knowledge averages miss.
mgsm-avg	0.15	Multilingual math word problems across 10 languages catch language-math gaps that pure knowledge suites miss entirely.

▸ Multimodal · published

last reviewed 2026-05-30

Multimodal capability is a first-class differentiator between frontier models: most major models in 2026 are vision-capable, and users routinely submit images, charts, and documents. MMMU and MMMU-Pro share top weight: MMMU provides the broadest coverage (30 academic subjects across six disciplines, strong model count) while MMMU-Pro uses vision-only inputs that prevent text-shortcut cheating, making it the harder and cleaner signal. MathVista isolates math reasoning with visual context (diagrams, charts, geometry), a distinct capability from text-only math. MMStar is manually filtered to require genuine visual perception rather than language priors, guarding against contamination that plagues simpler VQA benchmarks.

Benchmark	Weight	Rationale
mmmu	0.30	Broadest multimodal coverage: about 11.5K expert-level questions across 30 subjects in six disciplines. Largest model count in the panel.
mmmu-pro	0.28	Vision-only inputs eliminate text-shortcut cheating and discriminate frontier models more reliably than standard MMMU.
mathvista	0.24	Math reasoning conditioned on visual inputs is a distinct capability — geometry, chart reading, visual equations.
mmstar	0.18	Manually filtered to ensure visual necessity; cleanest contamination-resistant multimodal signal in the panel.

▸ Overall · published

last reviewed 2026-05-30

The overall composite weights general instruction following and conversation quality highest, since these best predict whether a model is broadly useful. Reasoning and coding add depth. Agentic and tool use round out the picture for users who care about autonomous task performance.

Benchmark	Weight	Rationale
lmsys-elo	0.25	Largest human preference signal; crowd-sourced from 6M+ pairwise votes.
artificial-analysis-intelligence	0.20	Independent 10-evaluation composite; most rigorous aggregator outside of crowdsourced preference.
livebench-instruction-following	0.15	Contamination-resistant instruction-following tasks with fresh monthly questions.
open-llm-ifeval	0.15	Format-adherence and constraint-following benchmark with precise pass/fail grading.
alpaca-eval-lc	0.13	Length-controlled win-rate eliminates verbosity bias from AlpacaEval 2.0.
benchlm-instruction-following	0.07	Aggregator instruction-following score adds model coverage for less-benchmarked models.
arena-hard	0.05	Hard instruction-following arena tasks provide differentiation at the frontier.

▸ Reasoning · published

last reviewed 2026-05-27

GPQA Diamond is weighted highest because it is still the strongest public reasoning benchmark for novel, contamination-resistant expert questions. HLE and MMLU-Pro stay close because they stress frontier breadth and multi-domain rigor at larger scale. The rest are supporting panels that cover reasoning sub-modes such as legal analysis, factual recall, multimodal reasoning, and adversarial puzzle solving.

Benchmark	Weight	Rationale
gpqa-diamond	0.22	Contamination-resistant expert science questions remain the cleanest single proxy for frontier reasoning.
hle	0.15	Frontier-difficulty breadth matters because it stresses compositional reasoning outside one academic domain.
mmlu-pro	0.12	Harder multi-domain academic reasoning still adds scale and category diversity beyond GPQA.
mmmu	0.09	Multimodal reasoning catches failure modes text-only reasoning suites never see.
legalbench	0.08	LegalBench isolates rule application and textual nuance in a way general academic suites do not.
benchlm-reasoning	0.08	Broad aggregator signal is useful for coverage, but not strong enough to outrank primary independent suites.
benchlm-knowledge	0.06	Knowledge-heavy reasoning complements pure puzzle solving by stressing retrieval plus inference.
open-llm-bbh	0.06	BBH still captures compositional oddball reasoning tasks that benchmark-specific optimization can miss.
simpleqa	0.05	Short factual reasoning under uncertainty is a distinct failure mode from long-form academic problem solving.
enigma-eval	0.04	Adversarial puzzle-style evaluation adds a harder edge-case reasoning check.
scale-math-eval	0.03	Math-heavy reasoning belongs in the panel, but direct math overlap keeps it below dedicated reasoning suites.
tutorbench	0.01	Explanatory tutoring signal is useful, but it overlaps with writing and instruction following more than core reasoning.
ugi-natural-intelligence	0.01	Useful as a stress signal for broad intelligence claims, but too methodologically loose to weight heavily.

▸ Tool Use · published

last reviewed 2026-05-27

MCP Atlas gets the lead because it is the more direct benchmark for structured, multi-step tool orchestration in agent stacks. Toolathlon stays substantial because it adds breadth across tool scenarios, but it does not yet have the same practical signal weight as the stronger MCP-style workflow benchmark.

Benchmark	Weight	Rationale
mcp-atlas	0.60	Directly measures orchestrated tool workflows in a way closest to production agent systems.
toolathlon	0.40	Broad tool-use coverage adds complementary scenario diversity beyond MCP-shaped orchestration tasks.

▸ Uncensored · withheld

last reviewed 2026-05-27

The uncensored panel is currently all one suite and mixes several normative ideas: willingness, broad intelligence, and free-form writing latitude. I can preserve the source evidence, but I cannot defend one public composite winner from this panel without importing value judgments that are not benchmark-stable. This vertical stays withheld.

withholding reason: weighting rationale not yet strong enough for a public winner claim

Benchmark	Weight	Rationale
ugi-overall	0.40	Broad headline willingness signal is the obvious anchor if this category is ever published.
ugi-willingness	0.25	Direct willingness should matter separately from general capability, but it is still one normative dimension.
ugi-natural-intelligence	0.20	General intelligence overlap helps contextualize willingness, though it is not uniquely about uncensored behavior.
ugi-writing	0.15	Writing latitude contributes style freedom, but it is too indirect to dominate an uncensored claim.

▸ Writing · published

last reviewed 2026-05-27

LMSYS Elo leads because large-scale pairwise preference still best captures whether humans actually prefer the prose. LiveBench Language and AlpacaEval stay close because they add fresher instruction-following and style quality checks. BenchLM, TutorBench, and UGI Writing remain supporting evidence that broaden writing coverage without overriding stronger direct preference and language-quality measures.

Benchmark	Weight	Rationale
lmsys-elo	0.35	Large human-preference comparisons still best approximate whether readers prefer the output.
livebench-language	0.20	Fresh language tasks reduce contamination and capture current instruction-following quality in prose.
alpaca-eval-lc	0.18	Pairwise instruction-following quality complements free-form preference with tighter prompt compliance signal.
benchlm-instruction-following	0.12	Aggregated instruction-following evidence broadens model coverage but overlaps with stronger direct preference panels.
tutorbench	0.08	Teaching-style exposition checks clarity and explanation structure rather than just surface prose polish.
ugi-writing	0.07	Willingness-heavy writing signal captures style latitude, but normative bias keeps it below core writing suites.

▸ confidence levels

High: ≥ 0.7

Medium: < 0.7

Insufficient: no input

▸ what confidence means

High: The weighted source-confidence value is at least 0.7.

Medium: The weighted source-confidence value is below 0.7.

Insufficient: No mapped result is available. Shown as — rather than a fabricated score.

When two models are within 2 points of each other, they are marked as effectively tied. Missing benchmark inputs reduce a composite through the documented missing-data penalty.
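
The confidence thresholds and the tie rule above can be sketched in a few lines. The function names are hypothetical, and treating a gap of exactly 2 points as a tie is an assumption, since the page does not say whether the boundary is inclusive.

```python
from typing import Optional


def confidence_label(weighted_source_confidence: Optional[float]) -> str:
    """Map a weighted source-confidence value to the label shown on the page."""
    if weighted_source_confidence is None:
        return "Insufficient"  # no mapped result available
    return "High" if weighted_source_confidence >= 0.7 else "Medium"


def effectively_tied(score_a: float, score_b: float, margin: float = 2.0) -> bool:
    """True when two composite scores fall within the tie margin (assumed inclusive)."""
    return abs(score_a - score_b) <= margin


print(confidence_label(0.7))         # High
print(confidence_label(0.69))        # Medium
print(effectively_tied(81.0, 79.5))  # True
```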

▸ pricing

Pricing is sourced from OpenRouter, which aggregates real-time provider rates. Input and output costs are shown separately in $ per million tokens. Prices can change without notice — treat displayed prices as a recent snapshot.
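
Since input and output prices are quoted separately per million tokens, the cost of a single request can be estimated as below. This is an illustrative sketch only: the function name is hypothetical and the prices in the example are made-up placeholders, not real OpenRouter rates.

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_price_per_m: float, output_price_per_m: float) -> float:
    """Estimate request cost from token counts and $ per million token prices."""
    input_cost = input_tokens / 1_000_000 * input_price_per_m
    output_cost = output_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost


# Placeholder prices: $3 per million input tokens, $15 per million output tokens.
print(request_cost_usd(2_000_000, 500_000, 3.0, 15.0))  # 13.5
```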

▸ agent-assisted scores

Some SWE-Bench results use agent scaffolds (SWE-agent, OpenHands, mini-swe-agent). These are not pure model-only scores — the scaffold can add 10–25 percentage points. Agent-assisted rows are marked separately in benchmark breakdowns.

▸ data quality

Rankings are computed algorithmically from public benchmark data. Scores are reproducible — every result links back to its source run.

If you find a data error or a missing source, open an issue →
