basedagi.org | benchmark evidence | live, weekly refresh

EQ-Bench Creative Writing v3

Pairwise creative writing quality benchmark. Rubric score 0-100; higher means more creative and stylistically rich prose.

winner on EQ-Bench Creative Writing v3
direct benchmark result, not a broad vertical composite | source row dated 2000-01-01
scored on 2000-01-01 · stale source data (9646d)
latest mapped results | top 20
#   Model                         Provider    Score  Evidence                           Tested
1   OpenAI: GPT-5.5               OpenAI      85.0   model-only, independent_benchmark  2000-01-01
2   OpenAI: GPT-5.4               OpenAI      84.5   model-only, independent_benchmark  2000-01-01
3   OpenAI: GPT-5                 OpenAI      84.0   model-only, independent_benchmark  2000-01-01
4   OpenAI: GPT-5.2               OpenAI      83.3   model-only, independent_benchmark  2000-01-01
5   Anthropic: Claude Opus 4.6    Anthropic   82.7   model-only, independent_benchmark  2000-01-01
6   OpenAI: GPT-5.4 Mini          OpenAI      82.5   model-only, independent_benchmark  2000-01-01
7   MoonshotAI: Kimi K2 Thinking  MoonshotAI  82.3   model-only, independent_benchmark  2000-01-01
8   DeepSeek: DeepSeek V4 Pro     DeepSeek    82.3   model-only, independent_benchmark  2000-01-01
9   Anthropic: Claude Opus 4.5    Anthropic   81.8   model-only, independent_benchmark  2000-01-01
10  DeepSeek: DeepSeek V4 Flash   DeepSeek    81.5   model-only, independent_benchmark  2000-01-01
11  DeepSeek: DeepSeek V3.2       DeepSeek    81.4   model-only, independent_benchmark  2000-01-01
12  OpenAI: o3                    OpenAI      81.4   model-only, independent_benchmark  2000-01-01
13  Anthropic: Claude Sonnet 4.5  Anthropic   80.7   model-only, independent_benchmark  2000-01-01
14  Anthropic: Claude Opus 4.7    Anthropic   80.3   model-only, independent_benchmark  2000-01-01
15  OpenAI: GPT-4.1               OpenAI      79.0   model-only, independent_benchmark  2000-01-01
16  Anthropic: Claude Sonnet 4.6  Anthropic   78.8   model-only, independent_benchmark  2000-01-01
17  DeepSeek: R1                  DeepSeek    78.4   model-only, independent_benchmark  2000-01-01
18  Google: Gemma 3 27B           Google      76.7   model-only, independent_benchmark  2000-01-01
19  Google: Gemini 2.0 Flash      Google      71.2   model-only, independent_benchmark  2000-01-01
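
Two pairs of rows in the results share a score (82.3 at positions 7 and 8, 81.4 at positions 11 and 12), so the sequential numbering reflects display order, not a score difference. The following is a minimal sketch of tie-aware ("competition") ranking over a few rows copied from the results. The function name `competition_ranks` and the tuple layout are illustrative assumptions, not part of the site's pipeline.

```python
from typing import List, Tuple


def competition_ranks(rows: List[Tuple[str, float]]) -> List[Tuple[int, str, float]]:
    """Rank (model, score) rows with higher scores first.

    Rows with equal scores share a rank, and the next distinct score
    skips ahead (1, 2, 3, 3, 5). Hypothetical helper for illustration.
    """
    # sorted() is stable, so tied rows keep their input order.
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    ranked: List[Tuple[int, str, float]] = []
    for index, (model, score) in enumerate(ordered):
        if index > 0 and score == ordered[index - 1][1]:
            rank = ranked[-1][0]  # same score as the previous row: share its rank
        else:
            rank = index + 1
        ranked.append((rank, model, score))
    return ranked


# Rows 5 through 9 of the results, scores as listed.
sample = [
    ("Anthropic: Claude Opus 4.6", 82.7),
    ("OpenAI: GPT-5.4 Mini", 82.5),
    ("MoonshotAI: Kimi K2 Thinking", 82.3),
    ("DeepSeek: DeepSeek V4 Pro", 82.3),
    ("Anthropic: Claude Opus 4.5", 81.8),
]

for rank, model, score in competition_ranks(sample):
    print(rank, model, score)
```

Under this scheme the two models at 82.3 would share a rank within the sample, which is one way to avoid reading an ordering into identical scores.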
what this result means

This benchmark contributes direct public evidence. Read its scope before generalizing the result.

A win here is a win on EQ-Bench Creative Writing v3. Broad task pages require independent corroboration before naming a general winner.

source record
category: eq
metric: accuracy
matched models: 19
latest source date: 2000-01-01
direction: higher is better
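
The staleness figure shown with the score ("9646d") appears to be the number of days between the source date and the day the page was generated. Here is a minimal sketch of that arithmetic, assuming the suffix "d" means whole days; the function name `staleness_days` is hypothetical, and the as-of date is derived from the two published values, not stated on the page.

```python
from datetime import date, timedelta


def staleness_days(source_date: date, as_of: date) -> int:
    """Whole days elapsed between the source row date and the as-of date."""
    return (as_of - source_date).days


source_date = date(2000, 1, 1)  # "latest source date" from the source record
reported_age = 9646             # from "stale source data (9646d)"

# Assumption: the page was generated exactly reported_age days after source_date.
implied_as_of = source_date + timedelta(days=reported_age)

print(implied_as_of.isoformat())                   # 2026-05-30
print(staleness_days(source_date, implied_as_of))  # 9646
```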