benchmark evidence

TutorBench

Scale Labs tutoring benchmark measuring adaptive explanations, feedback, and hint quality.

winner on TutorBench: GPT-5 (OpenAI), score 77.3

direct benchmark result, not a broad vertical composite | source row dated 2025-09-10

scored on 2025-12-15 · stale source data (210d)

latest mapped results | top 20

#	Model	Score	Evidence	Tested
1	OpenAI: GPT-5	77.3	model-only independent_benchmark	2025-09-10
2	OpenAI: GPT-5.2	74.7	model-only independent_benchmark	2025-12-15
3	Anthropic: Claude Opus 4.5	69.6	model-only independent_benchmark	2025-09-10
4	Anthropic: Claude Sonnet 4.5	63.8	model-only independent_benchmark	2025-09-10
5	Meta: Llama 4 Maverick	56.2	model-only independent_benchmark	2025-09-10

what this result means

This benchmark contributes direct public evidence. Read its scope before generalizing the result.

A win here is a win on TutorBench. Broad task pages require independent corroboration before naming a general winner.

source record

category: writing

metric: accuracy

matched models: 5

latest source date: 2025-12-15

direction: higher is better
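
A minimal sketch of how the "higher is better" direction determines the ranking and the winner. The rows are transcribed from the results table; the function name `rank` and the parameter `higher_is_better` are hypothetical, chosen for illustration, and do not come from the source.

```python
# Rows transcribed from the results table: (model, score).
rows = [
    ("OpenAI: GPT-5.2", 74.7),
    ("Meta: Llama 4 Maverick", 56.2),
    ("OpenAI: GPT-5", 77.3),
    ("Anthropic: Claude Sonnet 4.5", 63.8),
    ("Anthropic: Claude Opus 4.5", 69.6),
]


def rank(rows, higher_is_better=True):
    """Return rows ordered best-first for the given metric direction.

    Hypothetical helper: the source only states the direction,
    not how the ranking is computed.
    """
    return sorted(rows, key=lambda row: row[1], reverse=higher_is_better)


ranked = rank(rows)
winner, top_score = ranked[0]

# Gap between first and second place, rounded to the table's precision.
margin = round(top_score - ranked[1][1], 1)

print(winner, top_score, margin)
```

The margin describes TutorBench only; it says nothing about performance on other benchmarks.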