benchmark evidence
Scale Coding Evaluation
Scale Labs coding leaderboard based on reviewed human preference and correctness judgments.
winner on Scale Coding Evaluation
DeepSeek: R1 · 84.2
direct benchmark result, not a broad vertical composite | source row dated 2025-03-05
scored on 2025-03-05 · stale source data (451 days old)
latest mapped results | top 20
| # | Model | Score | Evidence | Tested |
|---|---|---|---|---|
| 1 | DeepSeek: R1 | 84.2 | model-only independent_benchmark | 2025-03-05 |
| 2 | OpenAI: o1 | 82.9 | model-only independent_benchmark | 2025-03-05 |
what this result means
This benchmark contributes direct public evidence. Read its scope before generalizing the result.
A win here is a win on Scale Coding Evaluation. Broad task pages require independent corroboration before naming a general winner.
source record
category: coding
metric: accuracy
matched models: 2
latest source date: 2025-03-05
direction: higher is better
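The record above can be read mechanically: `direction: higher is better` determines which row wins, and the staleness figure is the gap in days between the source date and the day the page was generated. The following is a minimal sketch of that reading, not the site's actual implementation. The function names `pick_winner` and `staleness_days` are hypothetical, and the as-of date `2026-05-30` is an assumption chosen only because it is 451 days after the source date; the page itself does not state when it was generated.

```python
from datetime import date

# Rows copied from the results table: (model, score, tested date).
rows = [
    ("DeepSeek: R1", 84.2, date(2025, 3, 5)),
    ("OpenAI: o1", 82.9, date(2025, 3, 5)),
]


def pick_winner(rows, higher_is_better=True):
    """Return the best row according to the metric direction (hypothetical helper)."""
    chooser = max if higher_is_better else min
    return chooser(rows, key=lambda row: row[1])


def staleness_days(source_date, as_of):
    """Number of days between the source date and a reference date (hypothetical helper)."""
    return (as_of - source_date).days


winner = pick_winner(rows, higher_is_better=True)

# Gap between the best and worst score, rounded to hide float noise.
margin = round(winner[1] - min(row[1] for row in rows), 1)

# ASSUMPTION: as-of date picked so the gap equals the 451-day figure on the page.
assumed_as_of = date(2026, 5, 30)
age = staleness_days(date(2025, 3, 5), assumed_as_of)

print(f"winner: {winner[0]} ({winner[1]}), margin {margin}, source age {age}d")
```

With only two matched models and a 1.3-point margin, the sketch also shows why the result should be read as a win on this benchmark alone rather than as a general ranking.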