
LLM Leaderboard: March 2026

The March 2026 BGI leaderboard represents the current state of general-purpose language model capability as measured by BasedAGI's multi-source, use-case-weighted scoring system. Unlike single-benchmark leaderboards, a BGI ranking requires a model to perform well across a wide range of real-world tasks (coding, reasoning, document analysis, creative writing, customer interaction) and to be backed by enough benchmark evidence to give us confidence in that ranking.

Models appear on this leaderboard only when they meet our coverage and confidence thresholds. A model with narrow coverage but excellent results will not outrank a model with broad, high-quality coverage just because it peaks higher on the few benchmarks it happens to appear in.

How the BGI Score Works

The BGI (BasedAGI General Intelligence) score is not a benchmark result. It's an aggregate computed from 143+ use-case scores, each of which is itself derived from multi-source benchmark evidence.

The pipeline:

  1. Benchmark ingestion — We continuously ingest results from 100+ benchmark sources covering all major capability dimensions
  2. Metric normalization — Raw scores are normalized to a comparable scale within each benchmark family
  3. Use-case scoring — Each use case is scored using the benchmark metrics most relevant to it, weighted by their predictive validity for that use case
  4. BGI aggregation — Use-case scores are aggregated with confidence weighting, so use cases with thin evidence contribute less to a model's final score (a simplified sketch of steps 2 through 4 follows this list)
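
The production weighting scheme is documented in the full methodology; the sketch below only illustrates the shape of steps 2 through 4. The dataclass, the min-max normalization, and the way confidence grows with the amount of relevant evidence are assumptions made for the example, not BasedAGI's actual formulas.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkResult:
    benchmark: str     # benchmark name, e.g. a coding or reasoning suite
    raw_score: float   # score as reported by the source
    family_min: float  # assumed normalization bounds for this benchmark family
    family_max: float


def normalize(r: BenchmarkResult) -> float:
    # Step 2: map the raw score onto a comparable 0-100 scale within its
    # benchmark family (assumes family_max > family_min).
    return 100.0 * (r.raw_score - r.family_min) / (r.family_max - r.family_min)


def use_case_score(results: list[BenchmarkResult],
                   relevance: dict[str, float]) -> tuple[float, float]:
    # Step 3: score one use case from the benchmarks most relevant to it.
    # `relevance` maps benchmark name -> weight (its assumed predictive
    # validity for this use case). Returns (score, confidence); the
    # confidence definition here is a stand-in that simply grows with the
    # amount of relevant evidence.
    weighted = [(normalize(r), relevance.get(r.benchmark, 0.0)) for r in results]
    total_weight = sum(w for _, w in weighted)
    if total_weight == 0:
        return 0.0, 0.0
    score = sum(s * w for s, w in weighted) / total_weight
    confidence = min(1.0, total_weight)
    return score, confidence


def bgi_score(per_use_case: list[tuple[float, float]]) -> float:
    # Step 4: confidence-weighted aggregate of (score, confidence) pairs,
    # so use cases backed by thin evidence contribute less.
    total_confidence = sum(c for _, c in per_use_case)
    if total_confidence == 0:
        return 0.0
    return sum(s * c for s, c in per_use_case) / total_confidence
```

Under this kind of weighting, a model with stellar scores on a handful of benchmarks cannot ride them to a high aggregate unless it also shows evidence across many use cases.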

A high BGI score means the model is broadly capable with evidence to back it up. A middling BGI score can mean either middling capability, thin evidence, or both — which is why we show confidence scores separately.

The BGI leaderboard is intentionally hard on models that benchmark extensively on a narrow set of tasks. Leaderboard-optimized models (those that appear to score well because they've been evaluated primarily on the benchmarks they're strongest on) tend to score lower here because their broad use-case coverage is thin.

March 2026 Rankings

The full rankings table is maintained as live data rather than embedded in this report. View live leaderboard →

Reading the Table

BGI Score — The primary ranking metric. Scaled 0–100 for legibility; higher is better. Confidence-adjusted: a model with high raw scores but thin evidence will score lower here than its raw benchmark numbers might suggest.

Use Cases — The number of use cases the model has been scored on. Models with higher use case coverage have more complete profiles and more reliable BGI scores.

Confidence — The average confidence across all scored use cases. Higher confidence means more benchmark evidence per use case. Below 35% indicates thin coverage; treat those rankings as provisional.

Dimensions — How many of the 5 intelligence dimensions (IQ, EQ, Accuracy, Creativity, Based) the model has been profiled on. A "Full" badge means all 5 dimensions are scored.

Params — Parameter count where available from the HuggingFace catalog. Not available for closed-source API models.
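
For readers who want to work with the table programmatically, each row boils down to a small record with these columns. The field names, types, and the 0-1 confidence scale below are assumptions made for illustration; this report does not document an export schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LeaderboardRow:
    model: str
    bgi_score: float           # confidence-adjusted, scaled 0-100
    use_cases: int             # number of use cases the model has been scored on
    confidence: float          # average confidence across scored use cases (0-1)
    dimensions: int            # of the 5 intelligence dimensions profiled; 5 = "Full"
    params_b: Optional[float]  # parameters in billions; None for closed-source API models


def is_provisional(row: LeaderboardRow) -> bool:
    # Average confidence below 35% indicates thin coverage, so treat the
    # ranking as provisional.
    return row.confidence < 0.35
```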

What Changed Since February

The field moved in several notable ways this month:

Open-weight models continued to close the gap on reasoning tasks. The distance between the top closed-source and top open-weight models on IQ-related use cases has narrowed again. For most coding and analytical use cases, the best open-weight options are competitive with models that were frontier-only 6 months ago.

Coverage is expanding. We added benchmark data from new sources this month, which has moved several models significantly — both up and down. Models that appeared in the top 10 primarily because of limited benchmark coverage have dropped; models with genuine broad capability have moved up as their evidence base expanded.

Confidence scores are improving overall. As we add more benchmark sources, the average confidence per model per use case has increased. This means the rankings are becoming more reliable, and the gap between nominal and confidence-adjusted scores is narrowing for well-covered models.

Coverage Gaps

Not every strong model appears here. The most common reasons for absence:

No public benchmark data — Some commercially deployed models have limited public evaluation. We can only score what's been evaluated.

Alias fragmentation — Large model families with many releases and fine-tunes can be hard to aggregate correctly. We may have coverage under a different alias.

Below threshold — Models that appear in fewer than 21 use cases don't meet our minimum coverage requirement for BGI inclusion. Check individual model profiles for lower-coverage models.

If a model you're evaluating doesn't appear here, check its direct profile page. All models with any benchmark coverage have profile pages even if they don't meet the BGI threshold.
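
Stated as code, the coverage rule above is a single check. Only the 21-use-case minimum comes from this report; the constant and function names are illustrative.

```python
MIN_USE_CASES_FOR_BGI = 21  # minimum coverage for a model to receive a BGI ranking


def appears_on_bgi_leaderboard(use_cases_scored: int) -> bool:
    # Models below the coverage threshold still get a profile page; they
    # just don't appear in the BGI rankings.
    return use_cases_scored >= MIN_USE_CASES_FOR_BGI
```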

Full Methodology

The complete scoring methodology — including how benchmark sources are weighted, how use-case relevance is determined, and how confidence is calculated — is documented at /methodology.

Related Reports