The April 2026 BGI leaderboard arrives with one meaningful infrastructure addition: cost-adjusted value scores. Every model that has public pricing data now shows a value score alongside its BGI utility score — so you can compare not just which model performs best, but which performs best per dollar of inference cost.
This matters more than it might seem. The gap between the top-ranked model and the second-tier on raw utility scores is often smaller than the gap in cost. A model ranked 4th overall that costs one-quarter as much as the model ranked 1st may be the better engineering choice for most production workloads.
April 2026 Rankings
BGI Leaderboard
Ranked by BasedAGI General Intelligence score
| # | Model | Provider | Profile | BGI Score |
|---|---|---|---|---|
| 1 | gemini-2.5-pro | | Full | 25.0 |
| 2 | GLM-4.6 | zai-org | Full | 30.7 |
| 3 | gpt-5-2025-08-07 | openai | Full | 25.0 |
| 4 | Grok-4-0709 | xai | Full | 25.0 |
| 5 | anthropic/claude-sonnet-4 | anthropic | Full | 23.4 |
| 6 | gpt-4.1-20250414 | openai | Full | 23.0 |
| 7 | gemini-3-pro-preview | | Full | 25.0 |
| 8 | o3-20250416 | openai | Full | 20.9 |
| 9 | gpt-5.2-2025-12-11 | openai | Full | 21.5 |
| 10 | claude-opus-4-5-20251101 | anthropic | Full | 18.3 |
| 11 | anthropic/claude-sonnet-4.6 | anthropic | Full | 19.7 |
| 12 | | | Full | 19.3 |
| 13 | Kimi-K2-Instruct | moonshotai | Full | 20.9 |
| 14 | o4-mini | openai | Full | 15.1 |
| 15 | | | Full | 14.7 |
| 16 | | | Full | 14.7 |
| 17 | | | | 25.7 |
| 18 | gemini-2.5-flash | | | 19.6 |
| 19 | gemini-3-flash-preview | | | 20.0 |
| 20 | | | | 16.0 |
| 21 | | | | 18.6 |
What's New This Month
Cost and value scores are now live. Every model with public pricing data from ArtificialAnalysis's cost-quality frontier now shows a price per 1M tokens (blended 3:1 output:input ratio) and a value score. Value score = utility score / log₂(price + 2), which dampens cost so a 10× price difference doesn't overwhelm quality differences. Sort by Value on the leaderboard to see the reordering.
Coverage expanded to 151 use cases. The scoring taxonomy has grown slightly since March, adding new use cases in financial analysis, security operations, and multi-modal workflows. Models already on the leaderboard gained new scores automatically; a few models moved in rank as the additional use cases adjusted their weighted averages.
GLM-4.6 remains the top-ranked open-weight model. The zai-org release consistently passes all BGI thresholds — 151 use cases scored, all 5 dimensions covered, average confidence at 27%. It's the only open-weight model currently meeting the full-profile standard. Kimi-K2-Instruct is close: 58 use cases scored, all 5 dimensions, but average confidence at 21% keeps it just below threshold.
The confidence threshold (25% average across scored use cases) is the most significant filter on the leaderboard. It's not an arbitrary gate — it directly measures how much benchmark evidence backs each ranking. Models below threshold have real scores but thin evidence; showing them in the main ranking would make the leaderboard less reliable, not more complete.
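As a rough sketch, the gate described above reduces to a simple predicate over each model's scored use cases. The record fields, constants, and example values below are illustrative only, not the leaderboard's actual schema:

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.25  # 25% average confidence across scored use cases
REQUIRED_DIMENSIONS = 5      # all five BGI dimensions must be scored

@dataclass
class ModelProfile:
    name: str
    use_case_confidences: list[float]  # one confidence value per scored use case
    dimensions_scored: int

def meets_full_profile(p: ModelProfile) -> bool:
    """True if the model clears the full-profile gate described above."""
    if p.dimensions_scored < REQUIRED_DIMENSIONS or not p.use_case_confidences:
        return False
    avg_conf = sum(p.use_case_confidences) / len(p.use_case_confidences)
    return avg_conf >= CONFIDENCE_THRESHOLD

# A GLM-4.6-style profile (all 5 dimensions, ~27% avg confidence) passes;
# a Kimi-K2-style profile (~21% avg confidence) stays just below the gate.
```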
Reading the Value Score
The value score answers: which model gives the most benchmark-backed utility per unit of cost?
It's computed as bgi_score / log₂(price_per_1m + 2). The log₂ transform means:
- A model at $1/1M contributes a cost divisor of about 1.6 (log₂(3))
- A model at $4/1M contributes about 2.6, not four times as much
- A model at $16/1M contributes about 4.2, not sixteen times as much
This prevents cheap models from dominating just because they're cheap, while still rewarding efficiency. A model that scores 0.30 at $2/1M will rank higher on value than a model that scores 0.31 at $20/1M.
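Here is a minimal sketch of that calculation in Python, using the formula above; the scores and prices are the illustrative numbers from this section, not real pricing data:

```python
import math

def value_score(bgi_score: float, blended_price_per_1m: float) -> float:
    """Value = utility / log2(blended price per 1M tokens + 2), per the formula above."""
    return bgi_score / math.log2(blended_price_per_1m + 2)

# The cheaper, slightly lower-scoring model wins on value:
cheap = value_score(0.30, 2.0)    # 0.30 / log2(4)  = 0.150
pricey = value_score(0.31, 20.0)  # 0.31 / log2(22) ≈ 0.070
assert cheap > pricey
```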
Pricing data reflects public API rates and changes frequently. The value score should inform engineering tradeoffs, not replace them — your actual cost depends on your usage pattern, volume discounts, and whether you're comparing inference costs to fine-tuning and hosting alternatives.
What Changed Since March
The top of the table is stable. Gemini 2.5 Pro, GPT-5, and Grok-4 occupied similar positions last month and this month. The frontier is dense — the utility score gap between rank 1 and rank 4 is smaller than the confidence interval on any individual use-case score.
Mid-table movement from expanded coverage. The 8 new use cases shifted several models' average scores. Models with strong finance and security coverage moved up slightly; models without it moved down. This is working as intended — more coverage means more accurate aggregates, not just more scores.
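To see why broader coverage shifts mid-table averages, here is a toy relevance-weighted mean; the use cases, weights, and scores are invented, and the real weighting is documented at /methodology:

```python
def weighted_avg(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Relevance-weighted mean over only the use cases a model has scores for."""
    total_w = sum(weights[uc] for uc in scores)
    return sum(s * weights[uc] for uc, s in scores.items()) / total_w

weights = {"coding": 1.0, "reasoning": 1.0, "writing": 1.0,
           "financial_analysis": 0.8, "security_ops": 0.8}

march = {"coding": 0.32, "reasoning": 0.28, "writing": 0.25}
april = dict(march, financial_analysis=0.34, security_ops=0.31)  # new use cases land

print(round(weighted_avg(march, weights), 3))  # 0.283
print(round(weighted_avg(april, weights), 3))  # 0.298: strong finance/security coverage lifts the average
```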
Open-weight confidence is growing slowly. The average confidence per use case for HuggingFace open-weight models has increased modestly as new benchmark data from open evals (BigCodeBench, MMLU-Pro community runs) has been ingested. This won't show up as new leaderboard entrants immediately, but the trajectory is right.
Coverage Gaps
The models most conspicuously absent from this month's leaderboard:
Kimi-K2-Instruct — 58 use cases, all 5 dimensions, but avg confidence 21%. Close to qualifying; the next batch of open benchmark data may push it over threshold.
Llama 3.3 70B and Llama 3.1 70B — Both have coverage (29 use cases each) but only 1 scored dimension. Missing EQ, Accuracy, Creativity, and Based dimension scores. Dimension score coverage is the bottleneck, not use-case coverage.
Microsoft Phi-4 — 77 use cases but only 3 dimensions scored. Strong showing on the dimensions it has; needs broader multi-benchmark coverage for inclusion.
Qwen2.5-14B — 32 use cases, 4 dimensions, but avg confidence 17%. More benchmark evidence would move it onto the leaderboard.
All absent models have individual profile pages where you can see their current scores and coverage.
Full Methodology
The complete scoring methodology — source weighting, use-case relevance, confidence calculation, and value score formula — is at /methodology. March-to-April diff details are in the score diff view.