How scores are calculated.
Every score traces back to a public benchmark run you can verify. Sources, dates, and confidence levels are shown for every result.
refresh cadence varies by source · all sources public
BasedAGI aggregates public benchmark results, labels vendor-published evidence explicitly, normalizes scores to a 0–100 scale, and computes task-specific composite rankings using documented benchmark mappings.
One named winner is published for each question. Overall and task-specific winners can differ because they answer different questions. Each vertical maps to the benchmarks listed below and does not change silently.
Models without enough benchmark coverage show — (insufficient data) rather than an interpolated or neutral score. A missing score is not the same as a bad score.
Broad task and vertical winners are published only when at least two independent source publishers have tested the model within the last 30 days and at least 20 callable models meet that standard. Multiple panels from one publisher and older results remain visible as evidence, but cannot establish a current general winner.
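The publication gate above can be sketched roughly as follows. This is a minimal illustration, not BasedAGI's actual code: the `Result` record, the function names, and the inclusive 30-day boundary are all assumptions, and the input is assumed to contain callable models only.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Result:
    """Hypothetical record for one published benchmark result."""
    model: str
    publisher: str   # organisation that ran and published the result
    tested_on: date


def eligible_models(results, today, window_days=30, min_publishers=2):
    """Models tested by at least `min_publishers` distinct publishers in the window."""
    cutoff = today - timedelta(days=window_days)
    publishers_by_model = {}
    for r in results:
        if r.tested_on >= cutoff:  # older results stay visible but do not count
            publishers_by_model.setdefault(r.model, set()).add(r.publisher)
    return {m for m, pubs in publishers_by_model.items() if len(pubs) >= min_publishers}


def can_publish_winner(results, today, min_models=20):
    """A named winner is published only if enough models meet the standard."""
    return len(eligible_models(results, today)) >= min_models
```

Because publishers are collected in a set, several panels from the same publisher count once, which matches the rule that multiple panels from one publisher cannot establish a current general winner.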
Raw scores are converted to a 0–100 scale per benchmark. Percentile scoring is used where available to account for benchmark difficulty differences.
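The page does not specify the exact conversion, so the following is only a sketch of two common ways to put raw scores on a 0–100 scale: a linear rescale between assumed bounds, and a percentile rank within the field of tested models. Both function names and the "at or below" percentile definition are assumptions.

```python
def min_max_to_100(raw, lo, hi):
    """Linearly rescale a raw score from [lo, hi] onto 0-100, clamped to the range."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    scaled = (raw - lo) / (hi - lo) * 100.0
    return max(0.0, min(100.0, scaled))


def percentile_to_100(raw, all_raw_scores):
    """Share of scores in the field that are at or below `raw`, as a 0-100 value."""
    if not all_raw_scores:
        raise ValueError("need at least one score")
    at_or_below = sum(1 for s in all_raw_scores if s <= raw)
    return at_or_below / len(all_raw_scores) * 100.0
```

For example, an Elo-style raw score of 1250 on an assumed 1000 to 1500 range maps to 50, while a raw 70 in a field of `[50, 60, 70, 80]` sits at the 75th percentile. The percentile form is what lets a hard benchmark with low raw scores still separate models.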
weight = recency × source confidence × data status × result confidence
composite = weighted average − missing-data penalty
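A minimal sketch of the two formulas, assuming each factor is a number in [0, 1] and that the missing-data penalty is a value in score points supplied by the caller. The `Evidence` class and field names are hypothetical, and how the penalty itself is derived is not modelled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """Hypothetical record: one normalized benchmark result and its quality factors."""
    score: float              # already normalized to 0-100
    recency: float            # each factor is assumed to lie in [0, 1]
    source_confidence: float
    data_status: float
    result_confidence: float

    @property
    def weight(self):
        # weight = recency x source confidence x data status x result confidence
        return (self.recency * self.source_confidence
                * self.data_status * self.result_confidence)


def composite(evidence, missing_penalty=0.0):
    """Weighted average of scores minus the missing-data penalty."""
    total_weight = sum(e.weight for e in evidence)
    if total_weight == 0:
        return None  # no usable evidence: report insufficient data, not a number
    weighted_avg = sum(e.score * e.weight for e in evidence) / total_weight
    return weighted_avg - missing_penalty
```

Since the weight is a product, a single weak factor (for example a stale result with low recency) shrinks that result's influence regardless of how strong the other factors are.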
The homepage stays compact. This page carries the full benchmark inventory behind the index, with scope made explicit.
Per-language multilingual rows count separately here because they are separate public benchmark entries in the database, even when a broad leaderboard rolls them up into one composite.
The homepage talks about 8 public-facing verticals. This table is broader: it shows all 13 benchmark categories currently tracked underneath the index, including supporting categories like tool use, computer use, uncensored, and overall.
| Vertical | Benchmarks | Count |
|---|---|---|
| Coding | Aider Coding Benchmark · BenchLM Coding · BigCodeBench Complete · BigCodeBench Instruct · EvalPlus HumanEval+ · EvalPlus MBPP+ · LiveBench Coding · LiveCodeBench · Scale Coding Evaluation · SWE Atlas - Refactoring · SWE Atlas - Test Writing · SWE-bench Verified · Terminal-Bench 2.0 · WebDev Arena | 14 |
| Reasoning | BBH (Open LLM Leaderboard) · BenchLM Knowledge · BenchLM Reasoning · EnigmaEval · GPQA Diamond · Humanity's Last Exam · LegalBench · MMLU-Pro · MMMU · SimpleQA Factual Accuracy | 10 |
| Math | AIME 2025 · BenchLM Math · MATH Level 5 (Open LLM Leaderboard) · MATH-500 · Scale Math Evaluation | 5 |
| Writing | AlpacaEval 2.0 LC Win Rate · BenchLM Instruction Following · LiveBench Language · TutorBench | 4 |
| Function Calling | BFCL Live · BFCL Multi-Turn · BFCL Non-Live · BFCL Overall | 4 |
| EQ | EQ-Bench Creative Writing v3 · EQ-Bench Judgemark · EQ-Bench v3 | 3 |
| Factuality | TruthfulQA · Vectara HHEM | 2 |
| Agentic | GAIA · HiL-Bench Pass@3 · Remote Labor Index · SWE Atlas - Codebase QnA · tau2-bench Airline · tau2-bench Banking Knowledge · tau2-bench Retail · tau2-bench Telecom | 8 |
| Tool Use | MCP Atlas · Toolathlon | 2 |
| Computer Use | OSWorld-Verified · ScreenSpot-Pro | 2 |
| Multilingual | AI Language Proficiency Monitor Arabic · AI Language Proficiency Monitor Average · AI Language Proficiency Monitor Bengali · AI Language Proficiency Monitor French · AI Language Proficiency Monitor German · AI Language Proficiency Monitor Hindi · AI Language Proficiency Monitor Japanese · AI Language Proficiency Monitor Mandarin Chinese · AI Language Proficiency Monitor Portuguese · AI Language Proficiency Monitor Russian · AI Language Proficiency Monitor Spanish · Global-MMLU-Lite Arabic · Global-MMLU-Lite Average · Global-MMLU-Lite Bengali · Global-MMLU-Lite French · Global-MMLU-Lite German · Global-MMLU-Lite Hindi · Global-MMLU-Lite Japanese · Global-MMLU-Lite Mandarin Chinese · Global-MMLU-Lite Portuguese · Global-MMLU-Lite Spanish · MGSM (Aggregate) | 22 |
| Uncensored | UGI — ugi-natural-intelligence · UGI — ugi-overall · UGI — ugi-willingness · UGI — ugi-writing | 4 |
| Overall | Arena-Hard · Artificial Analysis Intelligence Index · Chatbot Arena (LMSYS) · IFEval (Open LLM Leaderboard) · LiveBench Instruction Following | 5 |
| Total | Full benchmark inventory currently tracked on BasedAGI. | 85 |
These are the benchmark panels that currently drive each public vertical, plus the argument for why each benchmark deserves its share of the composite.
If a vertical is marked withheld, the evidence stays public but BasedAGI does not publish a named winner from that composite until the weighting rationale is stable enough to defend.
GAIA leads because it is the broadest public agent benchmark with recognizable frontier difficulty. Terminal-Bench and tau2-bench stay close because they measure operational, tool-using task completion rather than pure QA. SWE Atlas, HiL-Bench, and the Remote Labor Index add narrower agent behaviors, but they remain supporting evidence instead of the backbone of the claim.
| Benchmark | Weight | Rationale |
|---|---|---|
| gaia | 0.22 | Broad autonomous assistant tasks make GAIA the best single public signal for general agent capability. |
| terminal-bench | 0.16 | Shell execution under harness constraints captures real operational competence instead of static QA. |
| tau2-bench-retail | 0.12 | Customer-support task completion stresses long-horizon policy following in a practical domain. |
| tau2-bench-airline | 0.10 | Airline workflows add a different constrained-service domain with distinct policy and planning structure. |
| tau2-bench-telecom | 0.08 | Telecom support introduces procedural tool use that is less redundant than a generic average would suggest. |
| tau2-bench-banking-knowledge | 0.07 | Banking knowledge tasks emphasize policy-sensitive agent reasoning rather than broad task generality. |
| swe-atlas-qna | 0.10 | Codebase investigation is a distinct agent workflow that general assistant suites underweight. |
| hil-bench-pass-at-3 | 0.08 | Human-in-the-loop escalation behavior matters because good agents must know when to defer. |
| remote-labor-index | 0.07 | Economic task framing is valuable, but the suite is still too narrow to outweigh broader agent panels. |
SWE-bench Verified stays on top because it is still the cleanest public proxy for real bug-fix work under repo constraints. LiveCodeBench and Aider stay near it because they measure contamination-resistant code generation and edit quality in active workflows. BigCodeBench, EvalPlus, and the agentic coding suites broaden coverage, but they are downweighted when they duplicate the same narrow pass/fail coding loop.
| Benchmark | Weight | Rationale |
|---|---|---|
| swe-bench-verified | 0.18 | Real repository bug fixing under realistic constraints; still the strongest public proxy for production coding. |
| livecodebench-pass-at-1 | 0.14 | Time-windowed coding evaluation reduces contamination and captures current code-generation ability. |
| aider-coding | 0.12 | Measures iterative edit quality in an agent-style coding loop rather than one-shot generation. |
| terminal-bench | 0.10 | Tests whether a model can execute code-adjacent shell tasks, which pure code benchmarks miss. |
| bigcodebench-complete | 0.09 | Full problem completion captures longer synthesis than patch-style editing benchmarks. |
| bigcodebench-instruct | 0.06 | Instruction-following variant adds prompt-compliance signal that the complete set does not isolate. |
| evalplus-humaneval | 0.06 | Still useful as a clean floor for function-level correctness despite saturation at the frontier. |
| evalplus-mbpp | 0.05 | Broader short-program coverage than HumanEval, but lower ceiling signal for frontier models. |
| livebench-coding | 0.05 | Fresh coding questions provide recency signal distinct from curated software-engineering suites. |
| benchlm-coding | 0.04 | Aggregated coding evidence broadens model coverage, but overlaps with stronger primary coding sources. |
| swe-atlas-test-writing | 0.03 | Isolates test-authoring behavior that bug-fix benchmarks do not explicitly measure. |
| swe-atlas-refactoring | 0.03 | Captures structural code transformation skill rather than raw patch success. |
| webdev-arena | 0.02 | Front-end implementation signal is useful, but the task family is narrower than general coding. |
| scale-coding-eval | 0.03 | Helpful as an extra modern coding check, but too close to adjacent coding suites to carry more weight. |
OSWorld-Verified leads because end-to-end computer-use task completion matters more than pointwise grounding alone. ScreenSpot-Pro remains important because visual grounding is a prerequisite skill, but it is still a sub-capability rather than the whole desktop task.
| Benchmark | Weight | Rationale |
|---|---|---|
| osworld-verified | 0.60 | End-to-end GUI task execution is the strongest current public proxy for actual computer-use competence. |
| screenspot-pro | 0.40 | Screen grounding matters because models that cannot localize UI targets will fail before full task execution begins. |
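As a worked instance of how a panel's weights combine, here is a sketch using the two computer-use weights from the table above. It assumes both benchmarks have normalized 0–100 scores and equal evidence quality; the function name is hypothetical, and the sketch deliberately raises on missing inputs rather than modelling the missing-data penalty.

```python
import math

# Panel weights copied from the computer-use table above.
COMPUTER_USE_PANEL = {"osworld-verified": 0.60, "screenspot-pro": 0.40}


def panel_score(panel, scores):
    """Weighted sum of normalized scores using the panel's published weights."""
    if not math.isclose(sum(panel.values()), 1.0, abs_tol=1e-9):
        raise ValueError("panel weights must sum to 1.0")
    missing = [name for name in panel if name not in scores]
    if missing:
        raise KeyError(f"missing normalized scores for: {missing}")
    return sum(weight * scores[name] for name, weight in panel.items())
```

With illustrative scores of 50 on OSWorld-Verified and 75 on ScreenSpot-Pro, each benchmark contributes about 30 points, so a model strong only at grounding does not outrank one that completes end-to-end tasks.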
EQ-Bench v3, Creative Writing v3, and Judgemark all come from the same publisher and partially share evaluation machinery. I can describe what each sub-panel measures, but I cannot yet defend a public broad-EQ weighting as an independent ranking claim against a hostile reviewer. This vertical stays visible as source evidence until a second independent public EQ-style suite exists or the panel is narrowed to a claim I can defend.
withholding reason: weighting rationale not yet strong enough for a public winner claim
| Benchmark | Weight | Rationale |
|---|---|---|
| eq-bench-v3 | 0.50 | Core emotional-intelligence rubric across empathy, validation, and social reasoning; the main signal in the suite. |
| eq-bench-creative-writing-v3 | 0.30 | Measures prose warmth, style, and originality, which are adjacent to EQ but not reducible to empathy alone. |
| eq-bench-judgemark | 0.20 | Evaluator-calibration signal matters because EQ scoring is judge-sensitive, but it is still not an independent task source. |
Three distinct failure modes, one per benchmark. Vectara HHEM targets document-grounded faithfulness, the failure mode most common in RAG pipelines, where a model fabricates content that contradicts its source. SimpleQA tests closed-domain factual recall using short questions with unambiguous correct answers; its results are vendor-published by OpenAI but independently verifiable. TruthfulQA (new) isolates common misconceptions and conspiracy-adjacent claims that models absorb from pretraining data, a different failure mode from both faithfulness and factual recall. Together the panel covers the three most important hallucination axes; two of the three sources are fully independent.
| Benchmark | Weight | Rationale |
|---|---|---|
| vectara-hhem | 0.55 | Strongest independent signal — document-grounded faithfulness is the most common real-world hallucination failure mode. |
| simpleqa | 0.25 | Closed-domain factual recall adds a distinct second failure mode; vendor-published but independently verifiable. |
| truthfulqa | 0.20 | Targets pretraining-absorbed misconceptions and conspiracy-adjacent falsehoods, a failure mode invisible to faithfulness and factual-recall suites. |
BFCL Overall is weighted highest because it is the best broad proxy for production function-calling reliability across the suite. Multi-Turn and Live stay next because real systems break on stateful tool use and live tool execution, not just one-shot JSON. Non-Live remains useful, but it is the easiest part of the panel and overlaps heavily with the aggregate score.
| Benchmark | Weight | Rationale |
|---|---|---|
| bfcl-overall | 0.40 | Broad aggregate function-calling accuracy is still the best single headline measure for production reliability. |
| bfcl-multi-turn | 0.25 | Stateful multi-turn tool use captures failure modes that one-shot structured output never exposes. |
| bfcl-live | 0.20 | Live tool interaction tests execution under realistic runtime constraints instead of static call formatting alone. |
| bfcl-non-live | 0.15 | Static invocation accuracy is still useful as a floor, but it is the easiest BFCL slice and overlaps with Overall. |
Instruction following is a production-critical capability orthogonal to raw reasoning or writing quality: a model can write brilliantly while ignoring format constraints, word limits, or explicit negative rules. IFEval (Instruction Following Eval) is the cleanest public benchmark for this: 541 prompts built from verifiable instructions (JSON output, word limits, forbidden keywords, language constraints), scored with no LLM judge, purely by programmatic pass/fail. It is independently published by Google Research and run by both the Open LLM Leaderboard and LLM-Stats, giving two independent source publishers for confidence scoring. The panel deliberately stays narrow until a second benchmark with comparable rigor emerges.
| Benchmark | Weight | Rationale |
|---|---|---|
| open-llm-ifeval | 1.00 | Only benchmark in the panel with fully programmatic, LLM-judge-free scoring of explicit instruction constraints. |
Long-context capability is a distinct production requirement orthogonal to per-token quality: a model can score well on standard benchmarks while completely failing to synthesize information across hundreds of thousands of tokens. LongBench-v2 (THUDM/Tsinghua, 2024) is the strongest open benchmark for this: 503 bilingual, human-verified long-context questions across single-doc QA, multi-doc QA, long in-context learning, and code repo comprehension, with gold labels and no LLM-judge dependency. Unlike NIAH and RULER, it does not rely on synthetic retrieval tasks. The panel stays deliberately narrow while the field standardises: long-context evaluation methodology is still converging, and most alternatives either lack verified data or cover too few models to be actionable.
| Benchmark | Weight | Rationale |
|---|---|---|
| longbench-v2 | 1.00 | Only benchmark in the panel with verified, multi-task long-context evaluation and no LLM-judge dependency. |
AIME 2025 leads because it is the best public proxy here for hard, multi-step symbolic reasoning under real contest pressure. MATH-500 stays close because it covers broader math domains and solution styles. Open LLM Math and BenchLM Math broaden model coverage, while MMLU-Pro and Scale Math Evaluation stay lower as supporting evidence because they partially overlap with the harder dedicated math panels.
| Benchmark | Weight | Rationale |
|---|---|---|
| aime-2025 | 0.28 | Hard contest math forces multi-step chain construction and remains a top frontier discriminator. |
| math-500 | 0.22 | Broader subject coverage adds general mathematical robustness beyond olympiad-style question patterns. |
| open-llm-math | 0.16 | Open leaderboard math signal improves breadth and comparability across more callable models. |
| benchlm-math | 0.14 | Aggregated math evidence is useful for model coverage, but secondary to direct dedicated math benchmarks. |
| scale-math-eval | 0.10 | Modern applied math evaluation contributes current signal not fully captured by legacy benchmark sets. |
| mmlu-pro | 0.10 | Quantitative academic reasoning adds a cross-domain math-adjacent check without dominating the panel. |
The broad multilingual claim is intentionally narrow: one aggregate row from each independent public suite, not every per-language slice multiplied into fake certainty. Global-MMLU-Lite gets the highest weight because it is the cleanest broad cross-language knowledge readout. LanguageBench adds diversity by mixing ARC, MGSM, and MMLU across languages. MGSM-avg is a dedicated multilingual math benchmark (10 languages) from OpenAI simple-evals — a useful third signal that catches language-math reasoning gaps invisible to knowledge-only suites.
| Benchmark | Weight | Rationale |
|---|---|---|
| global-mmlu-lite-avg | 0.50 | Cross-language knowledge consistency is the cleanest single broad multilingual capability readout in the current panel. |
| languagebench-avg | 0.35 | Mixed-task language evaluation adds reasoning and instruction-following diversity that pure knowledge averages miss. |
| mgsm-avg | 0.15 | Multilingual math word problems across 10 languages catch language-math gaps that pure knowledge suites miss entirely. |
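The "one aggregate row per suite" rule can be sketched as a simple filter. The aggregate ids below come from the table above; the per-language ids in the usage note are hypothetical, as is the function name.

```python
# Aggregate benchmark ids and weights copied from the multilingual table above.
MULTILINGUAL_PANEL = {
    "global-mmlu-lite-avg": 0.50,
    "languagebench-avg": 0.35,
    "mgsm-avg": 0.15,
}


def aggregate_rows_only(rows):
    """Drop per-language slices so each suite contributes exactly one row."""
    return {bid: score for bid, score in rows.items() if bid in MULTILINGUAL_PANEL}
```

Given rows such as `global-mmlu-lite-french` and `global-mmlu-lite-hindi` alongside `global-mmlu-lite-avg`, only the `-avg` row survives, so ten correlated language slices cannot masquerade as ten independent confirmations.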
Multimodal capability is a first-class differentiator between frontier models: most major models in 2026 are vision-capable, and users routinely submit images, charts, and documents. MMMU and MMMU-Pro share top weight: MMMU provides the broadest coverage (30 subjects across six disciplines, strong model count), while MMMU-Pro uses vision-only inputs that prevent text-shortcut cheating, making it the harder and cleaner signal. MathVista isolates math reasoning with visual context (diagrams, charts, geometry), a distinct capability from text-only math. MMStar is manually filtered to require genuine visual perception rather than language priors, guarding against contamination that plagues simpler VQA benchmarks.
| Benchmark | Weight | Rationale |
|---|---|---|
| mmmu | 0.30 | Broadest multimodal coverage: about 11.5K expert-level questions across 30 subjects in six disciplines. Largest model count in the panel. |
| mmmu-pro | 0.28 | Vision-only inputs eliminate text-shortcut cheating and discriminate frontier models more reliably than standard MMMU. |
| mathvista | 0.24 | Math reasoning conditioned on visual inputs is a distinct capability — geometry, chart reading, visual equations. |
| mmstar | 0.18 | Manually filtered to ensure visual necessity; cleanest contamination-resistant multimodal signal in the panel. |
The overall composite weights general instruction following and conversation quality highest, since these best predict whether a model is broadly useful. Reasoning and coding add depth. Agentic and tool use round out the picture for users who care about autonomous task performance.
| Benchmark | Weight | Rationale |
|---|---|---|
| lmsys-elo | 0.25 | Largest human preference signal; crowd-sourced from 6M+ pairwise votes. |
| artificial-analysis-intelligence | 0.20 | Independent 10-evaluation composite; most rigorous aggregator outside of crowdsourced preference. |
| livebench-instruction-following | 0.15 | Contamination-resistant instruction-following tasks with fresh monthly questions. |
| open-llm-ifeval | 0.15 | Format-adherence and constraint-following benchmark with precise pass/fail grading. |
| alpaca-eval-lc | 0.13 | Length-controlled win rate reduces the verbosity bias of AlpacaEval 2.0. |
| benchlm-instruction-following | 0.07 | Aggregator instruction-following score adds model coverage for less-benchmarked models. |
| arena-hard | 0.05 | Hard instruction-following arena tasks provide differentiation at the frontier. |
GPQA Diamond is weighted highest because it is still the strongest public reasoning benchmark for novel, contamination-resistant expert questions. HLE and MMLU-Pro stay close because they stress frontier breadth and multi-domain rigor at larger scale. The rest are supporting panels that cover reasoning sub-modes such as legal analysis, factual recall, multimodal reasoning, and adversarial puzzle solving.
| Benchmark | Weight | Rationale |
|---|---|---|
| gpqa-diamond | 0.22 | Contamination-resistant expert science questions remain the cleanest single proxy for frontier reasoning. |
| hle | 0.15 | Frontier-difficulty breadth matters because it stresses compositional reasoning outside one academic domain. |
| mmlu-pro | 0.12 | Harder multi-domain academic reasoning still adds scale and category diversity beyond GPQA. |
| mmmu | 0.09 | Multimodal reasoning catches failure modes text-only reasoning suites never see. |
| legalbench | 0.08 | LegalBench isolates rule application and textual nuance in a way general academic suites do not. |
| benchlm-reasoning | 0.08 | Broad aggregator signal is useful for coverage, but not strong enough to outrank primary independent suites. |
| benchlm-knowledge | 0.06 | Knowledge-heavy reasoning complements pure puzzle solving by stressing retrieval plus inference. |
| open-llm-bbh | 0.06 | BBH still captures compositional oddball reasoning tasks that benchmark-specific optimization can miss. |
| simpleqa | 0.05 | Short factual reasoning under uncertainty is a distinct failure mode from long-form academic problem solving. |
| enigma-eval | 0.04 | Adversarial puzzle-style evaluation adds a harder edge-case reasoning check. |
| scale-math-eval | 0.03 | Math-heavy reasoning belongs in the panel, but direct math overlap keeps it below dedicated reasoning suites. |
| tutorbench | 0.01 | Explanatory tutoring signal is useful, but it overlaps with writing and instruction following more than core reasoning. |
| ugi-natural-intelligence | 0.01 | Useful as a stress signal for broad intelligence claims, but too methodologically loose to weight heavily. |
MCP Atlas gets the lead because it is the more direct benchmark for structured, multi-step tool orchestration in agent stacks. Toolathlon stays substantial because it adds breadth across tool scenarios, but it does not yet have the same practical signal weight as the stronger MCP-style workflow benchmark.
| Benchmark | Weight | Rationale |
|---|---|---|
| mcp-atlas | 0.60 | Directly measures orchestrated tool workflows in a way closest to production agent systems. |
| toolathlon | 0.40 | Broad tool-use coverage adds complementary scenario diversity beyond MCP-shaped orchestration tasks. |
The uncensored panel is currently all one suite and mixes several normative ideas: willingness, broad intelligence, and free-form writing latitude. I can preserve the source evidence, but I cannot defend one public composite winner from this panel without importing value judgments that are not benchmark-stable. This vertical stays withheld.
withholding reason: weighting rationale not yet strong enough for a public winner claim
| Benchmark | Weight | Rationale |
|---|---|---|
| ugi-overall | 0.40 | Broad headline willingness signal is the obvious anchor if this category is ever published. |
| ugi-willingness | 0.25 | Direct willingness should matter separately from general capability, but it is still one normative dimension. |
| ugi-natural-intelligence | 0.20 | General intelligence overlap helps contextualize willingness, though it is not uniquely about uncensored behavior. |
| ugi-writing | 0.15 | Writing latitude contributes style freedom, but it is too indirect to dominate an uncensored claim. |
LMSYS Elo leads because large-scale pairwise preference still best captures whether humans actually prefer the prose. LiveBench Language and AlpacaEval stay close because they add fresher instruction-following and style quality checks. BenchLM, TutorBench, and UGI Writing remain supporting evidence that broaden writing coverage without overriding stronger direct preference and language-quality measures.
| Benchmark | Weight | Rationale |
|---|---|---|
| lmsys-elo | 0.35 | Large human-preference comparisons still best approximate whether readers prefer the output. |
| livebench-language | 0.20 | Fresh language tasks reduce contamination and capture current instruction-following quality in prose. |
| alpaca-eval-lc | 0.18 | Pairwise instruction-following quality complements free-form preference with tighter prompt compliance signal. |
| benchlm-instruction-following | 0.12 | Aggregated instruction-following evidence broadens model coverage but overlaps with stronger direct preference panels. |
| tutorbench | 0.08 | Teaching-style exposition checks clarity and explanation structure rather than just surface prose polish. |
| ugi-writing | 0.07 | Willingness-heavy writing signal captures style latitude, but normative bias keeps it below core writing suites. |
High: The weighted source-confidence value is at least 0.7.
Medium: The weighted source-confidence value is below 0.7.
Insufficient: No mapped result is available. Shown as — rather than a fabricated score.
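The three labels can be sketched as a small function. How the source-confidence values are weighted is not spelled out on this page, so a plain weighted mean is assumed here, and the function name and input shape are hypothetical.

```python
def confidence_label(results):
    """Label a composite from (source_confidence, weight) pairs of mapped results."""
    total_weight = sum(weight for _, weight in results)
    if not results or total_weight == 0:
        return "Insufficient"  # no mapped result: never fabricate a score
    weighted = sum(conf * weight for conf, weight in results) / total_weight
    return "High" if weighted >= 0.7 else "Medium"
```

Under this assumption, one fully trusted result with weight 3 and one untrusted result with weight 1 average to 0.75, which still clears the High threshold.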
When two models are within 2 points of each other, they are marked as effectively tied. Missing benchmark inputs reduce a composite through the documented missing-data penalty.
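A minimal sketch of the tie rule, assuming "within 2 points" is inclusive and is measured against the top composite score. The function name is hypothetical.

```python
def tied_with_leader(scores, margin=2.0):
    """Models whose composite is within `margin` points of the top score."""
    if not scores:
        return []
    top = max(scores.values())
    return sorted(model for model, s in scores.items() if top - s <= margin)
```

For composites of 90.0, 88.5, and 85.0, the first two models are reported as effectively tied and the third is not.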
Pricing is sourced from OpenRouter, which aggregates real-time provider rates. Input and output costs are shown separately in $ per million tokens. Prices can change without notice — treat displayed prices as a recent snapshot.
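Because input and output are priced separately per million tokens, the cost of a workload is a two-term sum. The sketch below uses illustrative prices, not real OpenRouter rates, and the function name is hypothetical.

```python
def request_cost_usd(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost in USD given token counts and $ per million token prices."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000
```

At an assumed $3 per million input tokens and $15 per million output tokens, 2,000,000 input tokens plus 500,000 output tokens cost $13.50, with the smaller output volume accounting for more than half the bill.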
Some SWE-bench results use agent scaffolds (SWE-agent, OpenHands, mini-swe-agent). These are not pure model-only scores: the scaffold can add 10–25 percentage points. Agent-assisted rows are marked separately in benchmark breakdowns.
Rankings are computed algorithmically from public benchmark data. Scores are reproducible — every result links back to its source run.
If you find a data error or a missing source, open an issue.