benchmark evidence
EvalPlus MBPP+
EvalPlus MBPP+ (evalplus-mbpp): a rigorous coding benchmark with 35x more tests than the original MBPP.
winner on EvalPlus MBPP+: Qwen2.5 Coder 32B Instruct (77.0)
direct benchmark result, not a broad vertical composite | source row dated 2000-01-01
scored on 2000-01-01 · stale source data (9,646 days old)
latest mapped results | top 20
| # | Model | Score | Evidence | Tested |
|---|---|---|---|---|
| 1 | Qwen2.5 Coder 32B Instruct | 77.0 | model-only independent_benchmark | 2000-01-01 |
what this result means
MBPP-derived code-generation benchmarks are too saturated to establish frontier coding leadership on their own.
A win here is a win on EvalPlus MBPP+ only. Broad task pages require independent corroboration before naming a general winner.
source record
category: coding
metric: pass@1
matched models: 1
latest source date: 2000-01-01
direction: higher is better
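
For readers unfamiliar with the pass@1 metric in the record above, here is a minimal sketch of how such a score is commonly computed, using the standard unbiased pass@k estimator. The function names and the sample data are hypothetical, and this is not necessarily how the source row was scored; in particular, whether it came from a single greedy sample per problem is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples passing every test, k: sample budget.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results: list[tuple[int, int]], k: int = 1) -> float:
    """Mean pass@k over all problems, as a percentage (higher is better)."""
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical data: four problems, one sample each, three solved.
example = [(1, 1), (1, 0), (1, 1), (1, 1)]
score = benchmark_score(example)  # 75.0
```

With one sample per problem, pass@1 reduces to the fraction of problems whose single generated solution passes every test, so a score of 77.0 would correspond to 77% of problems solved under that assumption.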