The April 2026 BGI leaderboard arrives with one meaningful infrastructure addition: cost-adjusted value scores. Every model that has public pricing data now shows a value score alongside its BGI utility score — so you can compare not just which model performs best, but which performs best per dollar of inference cost.
This matters more than it might seem. The gap between the top-ranked model and the second-tier on raw utility scores is often smaller than the gap in cost. A model ranked 4th overall that costs one-quarter as much as the model ranked 1st may be the better engineering choice for most production workloads.
April 2026 Rankings
BGI Leaderboard
Ranked by BasedAGI General Intelligence score
| # | Model | Provider | Profile | BGI Score |
|---|---|---|---|---|
| 1 | gemini-2.5-pro | | Full | 25.0 |
| 2 | GLM-4.6 | zai-org | Full | 30.7 |
| 3 | gpt-5-2025-08-07 | openai | Full | 25.0 |
| 4 | Grok-4-0709 | xai | Full | 25.0 |
| 5 | anthropic/claude-sonnet-4 | anthropic | Full | 23.4 |
| 6 | gpt-4.1-20250414 | openai | Full | 23.0 |
| 7 | gemini-3-pro-preview | | Full | 25.0 |
| 8 | o3-20250416 | openai | Full | 20.9 |
| 9 | gpt-5.2-2025-12-11 | openai | Full | 21.5 |
| 10 | claude-opus-4-5-20251101 | anthropic | Full | 18.3 |
| 11 | anthropic/claude-sonnet-4.6 | anthropic | Full | 19.7 |
| 12 | | | Full | 19.3 |
| 13 | Kimi-K2-Instruct | moonshotai | Full | 20.9 |
| 14 | o4-mini | openai | Full | 15.1 |
| 15 | | | Full | 14.7 |
| 16 | | | Full | 14.7 |
| 17 | | | | 25.7 |
| 18 | gemini-2.5-flash | | | 19.6 |
| 19 | gemini-3-flash-preview | | | 20.0 |
| 20 | | | | 16.0 |
| 21 | | | | 18.6 |
What's New This Month
Cost and value scores are now live. Every model with public pricing data from ArtificialAnalysis's cost-quality frontier now shows a price per 1M tokens (blended 3:1 output:input ratio) and a value score. Value score = utility score / log₂(price + 2), which dampens cost so a 10× price difference doesn't overwhelm quality differences. Sort by Value on the leaderboard to see the reordering.
Coverage expanded to 151 use cases. The scoring taxonomy has grown slightly since March, adding new use cases in financial analysis, security operations, and multi-modal workflows. Models already on the leaderboard gained new scores automatically; a few models moved in rank as the additional use cases adjusted their weighted averages.
GLM-4.6 remains the top-ranked open-weight model. The zai-org release consistently passes all BGI thresholds — 151 use cases scored, all 5 dimensions covered, average confidence at 27%. It's the only open-weight model currently meeting the full-profile standard. Kimi-K2-Instruct is close: 58 use cases scored, all 5 dimensions, but average confidence at 21% keeps it just below threshold.
The confidence threshold (25% average across scored use cases) is the most significant filter on the leaderboard. It's not an arbitrary gate — it directly measures how much benchmark evidence backs each ranking. Models below threshold have real scores but thin evidence; showing them in the main ranking would make the leaderboard less reliable, not more complete.
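As a rough sketch, the gate described above reduces to a simple predicate over each model's scored use cases. The record fields, constants, and example values below are illustrative only, not the leaderboard's actual schema:

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.25  # 25% average confidence across scored use cases
REQUIRED_DIMENSIONS = 5      # all five BGI dimensions must be scored

@dataclass
class ModelProfile:
    name: str
    use_case_confidences: list[float]  # one confidence value per scored use case
    dimensions_scored: int

def meets_full_profile(p: ModelProfile) -> bool:
    """True if the model clears the full-profile gate described above."""
    if p.dimensions_scored < REQUIRED_DIMENSIONS or not p.use_case_confidences:
        return False
    avg_conf = sum(p.use_case_confidences) / len(p.use_case_confidences)
    return avg_conf >= CONFIDENCE_THRESHOLD

# A GLM-4.6-style profile (all 5 dimensions, ~27% avg confidence) passes;
# a Kimi-K2-style profile (~21% avg confidence) stays just below the gate.
```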
Reading the Value Score
The value score answers: which model gives the most benchmark-backed utility per unit of cost?
It's computed as bgi_score / log₂(price_per_1m + 2). The log₂ transform means:
- A model at $1/1M contributes a cost divisor of about 1.6 (log₂(3))
- A model at $4/1M contributes about 2.6, not four times as much
- A model at $16/1M contributes about 4.2, not sixteen times as much
This prevents cheap models from dominating just because they're cheap, while still rewarding efficiency. A model that scores 0.30 at $2/1M will rank higher on value than a model that scores 0.31 at $20/1M.
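Here is a minimal sketch of that calculation in Python, using the formula above; the scores and prices are the illustrative numbers from this section, not real pricing data:

```python
import math

def value_score(bgi_score: float, blended_price_per_1m: float) -> float:
    """Value = utility / log2(blended price per 1M tokens + 2), per the formula above."""
    return bgi_score / math.log2(blended_price_per_1m + 2)

# The cheaper, slightly lower-scoring model wins on value:
cheap = value_score(0.30, 2.0)    # 0.30 / log2(4)  = 0.150
pricey = value_score(0.31, 20.0)  # 0.31 / log2(22) ≈ 0.070
assert cheap > pricey
```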
Pricing data reflects public API rates and changes frequently. The value score should inform engineering tradeoffs, not replace them — your actual cost depends on your usage pattern, volume discounts, and whether you're comparing inference costs to fine-tuning and hosting alternatives.
What Changed Since March
The top of the table is stable. Gemini 2.5 Pro, GPT-5, and Grok-4 occupied similar positions last month and this month. The frontier is dense — the utility score gap between rank 1 and rank 4 is smaller than the confidence interval on any individual use-case score.
Mid-table movement from expanded coverage. The 8 new use cases shifted several models' average scores. Models with strong finance and security coverage moved up slightly; models without it moved down. This is working as intended — more coverage means more accurate aggregates, not just more scores.
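To see why broader coverage shifts mid-table averages, here is a toy relevance-weighted mean; the use cases, weights, and scores are invented, and the real weighting is documented at /methodology:

```python
def weighted_avg(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Relevance-weighted mean over only the use cases a model has scores for."""
    total_w = sum(weights[uc] for uc in scores)
    return sum(s * weights[uc] for uc, s in scores.items()) / total_w

weights = {"coding": 1.0, "reasoning": 1.0, "writing": 1.0,
           "financial_analysis": 0.8, "security_ops": 0.8}

march = {"coding": 0.32, "reasoning": 0.28, "writing": 0.25}
april = dict(march, financial_analysis=0.34, security_ops=0.31)  # new use cases land

print(round(weighted_avg(march, weights), 3))  # 0.283
print(round(weighted_avg(april, weights), 3))  # 0.298: strong finance/security coverage lifts the average
```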
Open-weight confidence is growing slowly. The average confidence per use case for HuggingFace open-weight models has increased modestly as new benchmark data from open evals (BigCodeBench, MMLU-Pro community runs) has been ingested. This won't show up as new leaderboard entrants immediately, but the trajectory is right.
Coverage Gaps
The models most conspicuously absent from this month's leaderboard:
Kimi-K2-Instruct — 58 use cases, all 5 dimensions, but avg confidence 21%. Close to qualifying; the next batch of open benchmark data may push it over threshold.
Llama 3.3 70B and Llama 3.1 70B — Both have coverage (29 use cases each) but only 1 scored dimension. Missing EQ, Accuracy, Creativity, and Based dimension scores. Dimension score coverage is the bottleneck, not use-case coverage.
Microsoft Phi-4 — 77 use cases but only 3 dimensions scored. Strong showing on the dimensions it has; needs broader multi-benchmark coverage for inclusion.
Qwen2.5-14B — 32 use cases, 4 dimensions, but avg confidence 17%. More benchmark evidence would move it onto the leaderboard.
All absent models have individual profile pages where you can see their current scores and coverage.
Full Methodology
The complete scoring methodology — source weighting, use-case relevance, confidence calculation, and value score formula — is at /methodology. March-to-April diff details are in the score diff view.