benchmark evidence

Arena-Hard

Arena-Hard win rate vs GPT-4-0314 baseline on 500 challenging prompts.

winner on Arena-Hard | OpenAI: GPT-5 (score 79.2)

direct benchmark result, not a broad vertical composite | source row dated 2026-05-18

scored on 2026-05-18 · source data is stale (57 days old)
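The staleness figure is the age of the source row in days. A minimal sketch of that arithmetic follows. The viewing date of 2026-07-14 is an assumption inferred from the 57-day figure, and the 30-day threshold and both function names are hypothetical, since the page does not state its staleness rule.

```python
from datetime import date

# Hypothetical threshold: the page does not say when a source counts as stale.
STALE_AFTER_DAYS = 30


def age_in_days(source_date: date, as_of: date) -> int:
    """Number of whole days between the source row date and the viewing date."""
    return (as_of - source_date).days


def is_stale(source_date: date, as_of: date, threshold: int = STALE_AFTER_DAYS) -> bool:
    """True when the source row is older than the (assumed) threshold."""
    return age_in_days(source_date, as_of) > threshold


source_date = date(2026, 5, 18)  # "source row dated" value from the page
as_of = date(2026, 7, 14)        # assumed viewing date, inferred from "57d"

print(age_in_days(source_date, as_of))  # 57
print(is_stale(source_date, as_of))     # True
```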

latest mapped results | top 20

#	Model	Vendor	Score (win rate, %)	Evidence	Tested
1	OpenAI: GPT-5	OpenAI	79.2	model-only independent_benchmark	2026-05-18
2	Mistral Large	Mistral AI	37.7	model-only independent_benchmark	2026-05-18

what this result means


This benchmark contributes direct public evidence. Read its scope before generalizing the result.

A win here is a win on Arena-Hard only. Broad task pages require independent corroboration before they name a general winner.
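To make the metric concrete, here is a simplified sketch of a win rate against a fixed baseline. It assumes each prompt has already been judged as a win, tie, or loss for the candidate model, and that a tie counts as half a win. Both are assumptions for illustration: the actual Arena-Hard judging and aggregation pipeline is not described on this page and is not reproduced here. The outcome list is toy data, not benchmark data.

```python
from collections import Counter

VALID_OUTCOMES = {"win", "tie", "loss"}


def win_rate(outcomes, tie_weight=0.5):
    """Percentage of comparisons won against the baseline.

    Ties contribute `tie_weight` of a win (an assumed convention).
    """
    if not outcomes:
        raise ValueError("at least one outcome is required")
    counts = Counter(outcomes)
    unknown = set(counts) - VALID_OUTCOMES
    if unknown:
        raise ValueError(f"unknown outcome labels: {sorted(unknown)}")
    score = counts["win"] + tie_weight * counts["tie"]
    return 100.0 * score / len(outcomes)


# Toy data only: 10 judged prompts, not the 500 used by the benchmark.
toy_outcomes = ["win"] * 6 + ["tie"] * 2 + ["loss"] * 2
print(win_rate(toy_outcomes))  # 70.0
```

Under this convention a score of 50 would mean parity with the baseline, which is why the score alone says nothing about tasks outside the benchmark's prompt set.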

source record

category: overall

metric: win_rate

matched models: 2

latest source date: 2026-05-18

direction: higher is better
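The source record can be read as a small data structure in which the direction field decides the ranking order. The sketch below is one possible representation. The class names, field names, and the `ranked` helper are hypothetical and are not the site's actual schema. The values are copied from the table and record on this page.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Result:
    model: str
    score: float
    tested: date


@dataclass(frozen=True)
class SourceRecord:
    category: str
    metric: str
    direction: str  # "higher is better" or "lower is better"
    results: tuple

    def ranked(self):
        """Results ordered best-first according to `direction`."""
        best_is_high = self.direction == "higher is better"
        return sorted(self.results, key=lambda r: r.score, reverse=best_is_high)

    @property
    def latest_source_date(self) -> date:
        return max(r.tested for r in self.results)


record = SourceRecord(
    category="overall",
    metric="win_rate",
    direction="higher is better",
    results=(
        Result("Mistral Large", 37.7, date(2026, 5, 18)),
        Result("OpenAI: GPT-5", 79.2, date(2026, 5, 18)),
    ),
)

for rank, result in enumerate(record.ranked(), start=1):
    print(rank, result.model, result.score)
print("matched models:", len(record.results))
print("latest source date:", record.latest_source_date.isoformat())
```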