benchmark evidence
HiL-Bench Pass@3
HiL-Bench selective escalation benchmark from Scale Labs, using Pass@3 as the outcome metric.
winner on HiL-Bench Pass@3
OpenAI: GPT-5.5 | 29.1
direct benchmark result, not a broad vertical composite | source row dated 2026-04-29
scored on 2026-04-29
latest mapped results | top 20
| # | Model | Score | Evidence | Tested |
|---|---|---|---|---|
| 1 | OpenAI: GPT-5.5 | 29.1 | model-only independent_benchmark | 2026-04-29 |
| 2 | Anthropic: Claude Opus 4.7 | 27.7 | model-only independent_benchmark | 2026-04-17 |
| 3 | Anthropic: Claude Opus 4.6 | 24.3 | model-only independent_benchmark | 2026-04-16 |
| 4 | OpenAI: GPT-5.4 | 9.3 | model-only independent_benchmark | 2026-04-16 |
what this result means
This benchmark contributes direct public evidence. Read its scope before generalizing the result.
A win here is a win on HiL-Bench Pass@3 only. Broad task pages require independent corroboration before naming a general winner.
source record
category: agentic
metric: accuracy
matched models: 4
latest source date: 2026-04-29
direction: higher is better
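
The source record and the results table are enough to reproduce the ranking by hand. The following is a minimal sketch, not the site's actual pipeline: the rows are copied from the table above, and the names `rows`, `rank`, and `HIGHER_IS_BETTER` are hypothetical, chosen only for illustration.

```python
# Rows copied from the results table: (model, score, tested date).
rows = [
    ("OpenAI: GPT-5.5", 29.1, "2026-04-29"),
    ("Anthropic: Claude Opus 4.7", 27.7, "2026-04-17"),
    ("Anthropic: Claude Opus 4.6", 24.3, "2026-04-16"),
    ("OpenAI: GPT-5.4", 9.3, "2026-04-16"),
]

# From the source record: "direction: higher is better".
HIGHER_IS_BETTER = True


def rank(rows, higher_is_better=True):
    """Return rows ordered best-first according to the metric direction."""
    return sorted(rows, key=lambda r: r[1], reverse=higher_is_better)


ranked = rank(rows, HIGHER_IS_BETTER)
winner = ranked[0]

# Gap between first and second place, rounded to the table's precision.
margin = round(winner[1] - ranked[1][1], 1)

# ISO dates sort correctly as strings, so max() gives the latest source date.
latest_date = max(r[2] for r in rows)

print(f"winner: {winner[0]} ({winner[1]}), margin over #2: {margin}")
print(f"matched models: {len(rows)}, latest source date: {latest_date}")
```

The margin is a plain difference of reported scores. It says nothing about statistical significance, which this page does not report.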