
Best Value LLMs

Most LLM leaderboards rank models by capability alone. That's useful for researchers and for applications where cost isn't a constraint. For production engineering, it's incomplete. The model ranked first on raw capability is often 10–20× more expensive than models ranked 3rd or 4th, and for many workloads the quality difference is smaller than the cost difference.

This report ranks models by value score: utility per log-unit of cost. It's not a replacement for the BGI utility rankings — it's a different question. The utility ranking answers "which model performs best?" The value ranking answers "which model performs best for what you're paying?"

How Value Score Works

Value score = BGI utility score / log₂(price_per_1M_tokens + 2)

The log₂ transform is intentional. With the +2 offset, a free model's denominator is log₂(2) = 1; a model at $2/1M costs two units, a model at $6/1M costs three, and a model at $14/1M costs four, so each additional unit of cost requires roughly doubling the price. This prevents cheap models from dominating just because they're cheap: a model needs a meaningfully better price to overcome meaningfully worse quality, and vice versa. Price is taken from ArtificialAnalysis's cost-quality frontier, blended at a 3:1 output:input token ratio.
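As a concrete reference, here is the computation as a minimal Python sketch. The blended-price helper assumes the 3:1 output:input ratio is a plain weighted average of per-1M prices, and the numbers in the final line are invented for illustration, not BGI data.

```python
from math import log2

def blended_price(input_price: float, output_price: float) -> float:
    """Blend per-1M-token prices at the report's 3:1 output:input ratio
    (assumed here to be a simple weighted average)."""
    return (3 * output_price + input_price) / 4

def value_score(bgi_utility: float, price_per_1m: float) -> float:
    """Utility per log2-unit of cost. The +2 offset makes a free model's
    denominator log2(2) = 1, so its value score equals its utility score."""
    return bgi_utility / log2(price_per_1m + 2)

# Illustrative numbers only -- not actual BGI scores or prices.
print(value_score(50.0, blended_price(0.10, 0.40)))  # ~41.1
```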

Value score is most useful for separating models within a tier. If you need the absolute best quality and cost is secondary, sort by utility score instead. If you need a budget option, value score helps identify which cheap models are actually capable, not just cheap.

Rankings: Sort by Value

Go to the full leaderboard and click Value to see the current live ranking. Models without public pricing don't appear in the value sort; several frontier models (GPT-5, Grok-4, GPT-4.1) don't yet have confirmed public pricing.
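Concretely, the value sort filters unpriced models out rather than ranking them first or last. A minimal sketch of that behavior, with hypothetical model names and scores:

```python
# Hypothetical entries; a None value score means no confirmed public pricing.
models = [
    {"name": "model-a", "value_score": 21.4},
    {"name": "model-b", "value_score": None},
    {"name": "model-c", "value_score": 18.9},
]

# Unpriced models are excluded, not treated as free or infinitely expensive.
value_ranking = sorted(
    (m for m in models if m["value_score"] is not None),
    key=lambda m: m["value_score"],
    reverse=True,
)
```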

What the Data Shows

Gemini 2.5 Flash is the clear efficiency leader. At $0.17/1M tokens blended, it achieves a BGI score that rivals models costing 10–20× more. Its confidence is in the provisional range (24%), meaning the ranking is based on meaningful but not extensive evidence. But the direction is unambiguous: this is what the cost-performance frontier looks like when a major lab decides to compete on price.

Grok-4-1-fast sits at $0.28/1M and punches above its weight. Near-frontier capability at budget pricing is the value proposition. If you're building workloads that need strong reasoning but can't justify $3–6/1M for premium models, this tier is worth serious evaluation.

Gemini 2.5 Pro ($3.44/1M) is the value pick in the premium tier. Among models with confirmed pricing above $2/1M, it offers the best utility-per-dollar ratio. Compared to Claude Sonnet 4 at $6/1M, it achieves comparable BGI scores at roughly half the cost.

Claude and GPT flagships don't optimize for value. Claude Sonnet 4 and Sonnet 4.6 at $6/1M have strong absolute performance but lower value scores than Gemini 2.5 Pro. This isn't a knock on quality — it reflects that Anthropic and OpenAI price their flagships at a premium. For applications requiring the highest possible output quality where money is secondary, raw utility scores matter more than value scores.

Open-weight models aren't in this ranking for good reason. GLM-4.6 and Kimi-K2-Instruct are the top open-weight models on the BGI leaderboard but don't have public API pricing. If you're deploying them self-hosted, your actual cost depends on infrastructure, which this report can't model generically.

Pricing changes frequently. The values shown reflect ArtificialAnalysis data at time of publication. Check current pricing directly with providers before making infrastructure decisions. Volume discounts, reserved capacity, and batch inference pricing can materially change the effective cost per token.

Choosing by Use Case

Value score is computed from the overall BGI score, which aggregates across 151 use cases. For specific workloads:

High-volume, moderate complexity (email drafting, summarization, classification) — Gemini 2.5 Flash is the obvious starting point. The cost advantage is largest here because you're running millions of tokens at tasks that don't require frontier-level reasoning.

Mid-complexity production workloads (customer support, data extraction, RAG) — Grok-4-1-fast and Gemini 2.5 Flash both cover this range well. Test both against your specific task before committing.

Complex reasoning, occasional use (legal review, financial analysis, code generation for difficult problems) — The premium tier makes sense here. Gemini 2.5 Pro at $3.44/1M is the value-optimized pick; Claude Sonnet 4 or GPT-4.1 if you have specific quality requirements that only the highest-ranked models meet.

Maximum quality regardless of cost — Sort by utility score instead of value score.

Methodology

BGI scores are computed from 151 scored use cases across 5 capability dimensions (IQ, EQ, Accuracy, Creativity, Based), weighted by confidence. Pricing is from ArtificialAnalysis's cost-quality frontier at a 3:1 output:input blended rate. Full scoring methodology at /methodology.
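The exact weighting formula isn't published here, but a confidence-weighted mean is the natural reading of "weighted by confidence". A sketch under that assumption, with hypothetical field names:

```python
# Assumed aggregation: confidence-weighted mean over scored use cases.
# Field names ("score", "confidence") are hypothetical, not the BGI schema.
def aggregate_bgi(use_cases: list[dict]) -> float:
    total_weight = sum(uc["confidence"] for uc in use_cases)
    return sum(uc["score"] * uc["confidence"] for uc in use_cases) / total_weight

print(aggregate_bgi([
    {"score": 62.0, "confidence": 0.8},
    {"score": 48.0, "confidence": 0.2},
]))  # 59.2
```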

Not all leaderboard models have pricing data — models without confirmed public pricing are excluded from value rankings, not assumed to be free or expensive.
