Transparency
How we rank models
Every score on BasedAGI is derived from public benchmark data. No vibes, no sponsors, no pay-to-rank. This page explains exactly how models are scored, ranked, and surfaced.
The Big Picture
BasedAGI maps real-world workflows (use cases) to the LLMs best suited to them. Each use case defines which benchmarks matter and how much weight each metric carries. Models are scored purely from benchmark measurements — no manual curation, no hidden adjustments.
The result: a use-case-specific score for every model, backed by traceable evidence you can inspect down to the individual benchmark row.
Scoring Pipeline
Benchmark Ingestion
We ingest structured data from public benchmark leaderboards. Each source has a trust tier, refresh SLA, and reliability weight. Snapshots are captured periodically and deduplicated.
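The snapshot-and-dedupe behavior described above can be sketched as follows. The `Snapshot` class, the `dedupe` helper, and the content-hash scheme are illustrative assumptions, not the pipeline's actual names:

```python
from dataclasses import dataclass
import hashlib

@dataclass(frozen=True)
class Snapshot:
    """One captured copy of a benchmark source's data (illustrative)."""
    source_id: str
    payload: str  # raw leaderboard rows, serialized

    def content_hash(self) -> str:
        return hashlib.sha256(self.payload.encode()).hexdigest()

def dedupe(snapshots: list[Snapshot]) -> list[Snapshot]:
    """Keep only the first snapshot per (source, content) pair."""
    seen, unique = set(), []
    for snap in snapshots:
        key = (snap.source_id, snap.content_hash())
        if key not in seen:
            seen.add(key)
            unique.append(snap)
    return unique
```

Hashing the payload means an unchanged leaderboard re-fetched later is dropped, while any change in the rows produces a new snapshot.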
Model Matching
Raw benchmark rows are matched to canonical model identities using fuzzy matching on name, author, and parameter count. Match rates are tracked per source.
Metric Normalization
Raw metric values are normalized to a 0–1 scale within each benchmark metric. This makes scores comparable across benchmarks with different scales.
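Assuming simple min-max scaling (the page does not specify the exact normalization method), scaling raw values to 0–1 within one benchmark metric might look like:

```python
def normalize(values: dict[str, float]) -> dict[str, float]:
    """Min-max scale raw metric values to 0-1 within one benchmark metric.
    Keys are model names; values are raw scores on that metric."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:  # degenerate case: every model scored the same
        return {model: 1.0 for model in values}
    return {model: (v - lo) / (hi - lo) for model, v in values.items()}
```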
Weight Application
Each use case has a set of benchmark-metric weights that reflect which capabilities matter. Weights are defined in versioned presets.
Score Aggregation
The final use-case score is a weighted average: Σ(value_normalized × weight) / Σ(weight). Confidence is derived from evidence coverage and weight diversity.
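The weighted average above translates directly to code; here `evidence` is a list of (value_normalized, weight) pairs, and the function name is illustrative:

```python
def use_case_score(evidence: list[tuple[float, float]]) -> float:
    """Weighted average over (value_normalized, weight) evidence pairs:
    sum(v * w) / sum(w)."""
    total_weight = sum(w for _, w in evidence)
    if total_weight == 0:
        raise ValueError("no weighted evidence for this use case")
    return sum(v * w for v, w in evidence) / total_weight
```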
Ranking & Quality Check
Models are ranked by score. Each use case gets an evidence quality assessment based on ranked model count, average evidence, and coverage ratio.
Scoring Formula
use_case_score = Σ(value_normalized × weight) / Σ(weight)

value_normalized — the model's metric score scaled to 0–1 within that benchmark.
weight — how much this benchmark metric matters for this use case.
confidence — derived from evidence count and weight coverage; higher means more data.
Utility Score (Global Ranking)
The Utility Score shown on the Model Rankings page is a cross-use-case aggregate: Σ(use_case_score × confidence) / Σ(confidence).
This confidence-weighted average rewards models that perform well on use cases where the evidence is strong, and discounts scores backed by thin evidence.
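Based on the description above, the confidence-weighted average over per-use-case (score, confidence) pairs can be sketched as follows (function and parameter names are illustrative):

```python
def utility_score(per_use_case: list[tuple[float, float]]) -> float:
    """Confidence-weighted average over (use_case_score, confidence)
    pairs: sum(s * c) / sum(c). Use cases with strong evidence dominate;
    a model with no scored use cases gets 0.0."""
    total_confidence = sum(c for _, c in per_use_case)
    if total_confidence == 0:
        return 0.0
    return sum(s * c for s, c in per_use_case) / total_confidence
```

Note how two use cases with equal confidence contribute equally, while a low-confidence use case barely moves the aggregate.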
Benchmark Sources
Total Sources: 170
Active: 168
Weighted: 93
Use Cases: 143
Trust Tiers
Each benchmark source is assigned a trust tier that affects its reliability weight in scoring:
Primary (13)
Well-established benchmarks with regular updates, high model coverage, and structured data feeds.
Secondary (152)
Good coverage but less frequent updates or narrower model selection.
Experimental (3)
New or niche benchmarks under evaluation. Lower weight until proven reliable.
Evidence Quality
Each use case receives an evidence quality grade based on three factors:
Ranked Model Count
How many models have enough weighted benchmark data to receive a score.
Average Evidence
Mean number of benchmark evidence points per ranked model. Higher = more trustworthy.
Coverage Ratio
What fraction of expected benchmark metrics actually have data for ranked models.
Use cases are graded Sufficient, Insufficient, or Unscored. Insufficient rankings still appear but display a warning banner.
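The three-factor grading could work roughly as follows; the page does not publish the actual thresholds, so the cutoffs below (3 ranked models, 2 evidence points, 50% coverage) are invented for the sketch:

```python
def grade_use_case(ranked_models: int, avg_evidence: float,
                   coverage_ratio: float) -> str:
    """Illustrative evidence-quality grade. Real thresholds are not
    published; these cutoffs are assumptions for demonstration only."""
    if ranked_models == 0:
        return "Unscored"
    if ranked_models >= 3 and avg_evidence >= 2.0 and coverage_ratio >= 0.5:
        return "Sufficient"
    return "Insufficient"  # still ranked, but shown with a warning banner
```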
What We Don't Do
- No manual curation or editorial picks
- No sponsored placements or pay-to-rank
- No vibes-based or anecdotal scoring
- No hidden adjustments or secret weights
- No self-reported model performance data