benchmark evidence

HiL-Bench Pass@3

HiL-Bench is a selective escalation benchmark from Scale Labs that uses Pass@3 as the outcome metric.

winner on HiL-Bench Pass@3

direct benchmark result, not a broad vertical composite | source row dated 2026-04-29

scored on 2026-04-29 · stale source data (76d)

latest mapped results | top 20

#	Model	Provider	Score	Evidence	Tested
1	OpenAI: GPT-5.5	OpenAI	29.1	model-only independent_benchmark	2026-04-29
2	Anthropic: Claude Opus 4.7	Anthropic	27.7	model-only independent_benchmark	2026-04-17
3	Anthropic: Claude Opus 4.6	Anthropic	24.3	model-only independent_benchmark	2026-04-16
4	OpenAI: GPT-5.4	OpenAI	9.3	model-only independent_benchmark	2026-04-16

what this result means

This benchmark contributes direct public evidence. Read its scope before generalizing the result.

A win here is a win on HiL-Bench Pass@3. Broad task pages require independent corroboration before naming a general winner.
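
For readers unfamiliar with Pass@k metrics, the sketch below shows one common way such a score can be computed: the unbiased pass@k estimator widely used in code-generation evaluation, averaged over tasks. This page does not describe how HiL-Bench computes Pass@3, so treating it as this estimator, and treating the reported scores as percentages, are both assumptions. The names `pass_at_k` and `benchmark_score` are hypothetical and do not come from the benchmark.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k attempts is correct.

    n: attempts recorded for a task, c: how many were correct.
    Assumption: the k attempts are drawn without replacement from the n.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the draws of k attempts that are all wrong;
    # it is 0 when fewer than k wrong attempts exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(per_task: list[tuple[int, int]], k: int = 3) -> float:
    """Mean pass@k over (n, c) pairs, expressed as a percentage."""
    if not per_task:
        raise ValueError("per_task must not be empty")
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in per_task) / len(per_task)
```

With exactly three attempts per task, a task counts as passed under this estimator when any one of the three attempts is correct, so the score is the percentage of tasks with at least one success.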

source record

category: agentic

metric: accuracy

matched models: 4

latest source date: 2026-04-29

direction: higher is better