

How we rank models

Every score on BasedAGI is derived from public benchmark data. No vibes, no sponsors, no pay-to-rank. This page explains exactly how models are scored, ranked, and surfaced.

The Big Picture

BasedAGI maps real-world workflows (use cases) to the LLMs best suited to them. Each use case defines which benchmarks matter and how much weight each metric carries. Models are scored purely from benchmark measurements — no manual curation, no hidden adjustments.

The result: a use-case-specific score for every model, backed by traceable evidence you can inspect down to the individual benchmark row.

Scoring Pipeline

1. Benchmark Ingestion

We ingest structured data from public benchmark leaderboards. Each source has a trust tier, refresh SLA, and reliability weight. Snapshots are captured periodically and deduplicated.
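As a sketch, the per-source metadata described above could be modeled like this (field names and types are illustrative assumptions, not BasedAGI's actual schema):

```python
from dataclasses import dataclass

@dataclass
class BenchmarkSource:
    """One ingested leaderboard. Fields mirror the attributes named above."""
    name: str
    trust_tier: str            # "primary" | "secondary" | "experimental"
    refresh_sla_hours: int     # how often a fresh snapshot is expected
    reliability_weight: float  # scales this source's influence in scoring

# Example record for a hypothetical source:
source = BenchmarkSource("Example Board", "primary", 24, 1.0)
```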

2. Model Matching

Raw benchmark rows are matched to canonical model identities using fuzzy matching on name, author, and parameter count. Match rates are tracked per source.

3. Metric Normalization

Raw metric values are normalized to a 0–1 scale within each benchmark metric. This makes scores comparable across benchmarks with different scales.
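Min-max scaling is the standard way to do this; a sketch (the handling of the degenerate all-equal case is an assumption):

```python
def normalize(values: dict[str, float]) -> dict[str, float]:
    """Min-max scale raw metric values to 0-1 within one benchmark metric.

    values maps model -> raw score on that metric.
    """
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        # All models scored identically; treat them as equal (choice is an assumption).
        return {m: 1.0 for m in values}
    return {m: (v - lo) / (hi - lo) for m, v in values.items()}
```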

4. Weight Application

Each use case has a set of benchmark-metric weights that reflect which capabilities matter. Weights are defined in versioned presets.

5. Score Aggregation

The final use-case score is a weighted average: Σ(value_normalized × weight) / Σ(weight). Confidence is derived from evidence coverage and weight diversity.

6. Ranking & Quality Check

Models are ranked by score. Each use case gets an evidence quality assessment based on ranked model count, average evidence, and coverage ratio.

Scoring Formula

Use-Case Score = Σ(value_normalized × weight) / Σ(weight)

value_normalized: The model’s metric score scaled to 0–1 within that benchmark.

weight: How much this benchmark metric matters for this use case.

confidence: Derived from evidence count and weight coverage. Higher = more data.
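Putting the terms together, the use-case score formula above can be computed as:

```python
def use_case_score(evidence: list[tuple[float, float]]) -> float:
    """Weighted average: sum(value_normalized * weight) / sum(weight).

    evidence: (value_normalized, weight) pairs for one model in one use case.
    """
    total_weight = sum(w for _, w in evidence)
    if total_weight == 0:
        return 0.0  # no weighted evidence -> no score (assumption)
    return sum(v * w for v, w in evidence) / total_weight
```

For example, two evidence points (0.8 at weight 2.0, 0.5 at weight 1.0) yield (1.6 + 0.5) / 3.0 = 0.7.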

Utility Score (Global Ranking)

The Utility Score shown on the Model Rankings page is a cross-use-case aggregate:

Utility Score = Σ(use_case_score × confidence) / Σ(confidence)

This confidence-weighted average rewards models that perform well on use cases where the evidence is strong, and discounts scores backed by thin evidence.
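The same shape of computation, one level up; a sketch of the confidence-weighted average above:

```python
def utility_score(per_use_case: list[tuple[float, float]]) -> float:
    """Cross-use-case aggregate: sum(use_case_score * confidence) / sum(confidence).

    per_use_case: (use_case_score, confidence) pairs for one model.
    """
    total_confidence = sum(c for _, c in per_use_case)
    if total_confidence == 0:
        return 0.0  # no confident evidence anywhere (assumption)
    return sum(s * c for s, c in per_use_case) / total_confidence
```

A high score on a thin-evidence use case (e.g. 0.5 at confidence 0.25) moves the aggregate far less than the same score on a well-evidenced one.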

Benchmark Sources

Total Sources: 170

Active: 168

Weighted: 93

Use Cases: 143

Trust Tiers

Each benchmark source is assigned a trust tier that affects its reliability weight in scoring:

Primary (13): Well-established benchmarks with regular updates, high model coverage, and structured data feeds.

Secondary (152): Good coverage but less frequent updates or narrower model selection.

Experimental (3): New or niche benchmarks under evaluation. Lower weight until proven reliable.

Evidence Quality

Each use case receives an evidence quality grade based on three factors:

Ranked Model Count: How many models have enough weighted benchmark data to receive a score.

Average Evidence: Mean number of benchmark evidence points per ranked model. Higher = more trustworthy.

Coverage Ratio: What fraction of expected benchmark metrics actually have data for ranked models.

Use cases are graded Sufficient, Insufficient, or Unscored. Insufficient rankings still appear but display a warning banner.
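The grading logic can be sketched as follows; the cutoff values here are illustrative assumptions, not BasedAGI's actual thresholds:

```python
def grade_use_case(ranked_models: int, avg_evidence: float, coverage: float) -> str:
    """Grade evidence quality for one use case. Thresholds are hypothetical."""
    if ranked_models == 0:
        return "Unscored"       # nothing to rank at all
    if ranked_models >= 5 and avg_evidence >= 3.0 and coverage >= 0.5:
        return "Sufficient"     # all three factors clear their (assumed) bars
    return "Insufficient"       # ranked, but shown with a warning banner
```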

What We Don't Do

  • No manual curation or editorial picks
  • No sponsored placements or pay-to-rank
  • No vibes-based or anecdotal scoring
  • No hidden adjustments or secret weights
  • No self-reported model performance data