Creative writing is the use case where the standard benchmark stack fails most badly. MMLU doesn't tell you whether a model can write a character who feels like a person. HumanEval tells you nothing about whether prose has rhythm. Even models that score near the top of every reasoning benchmark can produce fiction that reads like a plot synopsis — technically coherent, emotionally inert.
The gap exists because most benchmarks measure retrieval and logic. Creative writing requires something different: voice, emotional resonance, the ability to show rather than tell, and the judgment to know when a scene needs silence instead of explanation. Those capabilities correlate weakly with IQ scores and strongly with EQ scores.
This report covers what the data actually shows about which models write well — and why the rankings here may surprise you.
What Creative Writing Actually Requires
Creative writing isn't a single capability. It's a cluster of distinct skills that need to work together:
- **Narrative arc** — A story needs to move. Setup, tension, escalation, resolution — these structural requirements hold whether you're writing a 500-word flash piece or a multi-chapter narrative. Models that generate locally coherent sentences but fail to track the larger arc produce prose that feels like it's going nowhere.
- **Voice and style consistency** — Good writing has a distinct voice. A model asked to write in a particular style (minimalist, baroque, darkly comic) needs to sustain that register across the whole piece, not just the first paragraph. Consistency is harder than imitation; many models nail the opening, then drift back to generic prose within a few hundred words.
- **Emotional resonance** — Characters need to feel like people, not functions. Readers don't care about a character's problem unless they care about the character. This is where EQ matters more than IQ: a model that can correctly identify what someone is feeling in a social scenario tends to write characters whose emotional lives make sense.
- **Show, don't tell** — "She was devastated" is telling. Showing what devastation looks like in action is harder and more effective. Models with low EQ scores tend to over-explain emotional states rather than dramatizing them.
- **Avoiding formulaic patterns** — LLM-generated creative writing has recognizable tells: the redemptive arc that resolves too cleanly, the villain who exists to be defeated, the emotional epiphany that arrives exactly at the three-quarter mark. High-quality creative writing models subvert expectations; the weakest ones reproduce the most common narrative templates with near-perfect fidelity.
EQ scores are a better predictor of creative writing quality than IQ scores. The models that understand human emotional dynamics tend to write characters who feel human. The models that merely reason well tend to write characters who function correctly in the plot but feel like archetypes rather than people.
The Benchmark Landscape
Evaluating creative writing automatically is genuinely hard. Most benchmarks sidestep it entirely. The ones that exist are imperfect but useful:
Judgemark is the most relevant benchmark for creative quality. It uses LLM-as-judge evaluation across multiple quality axes — originality, coherence, voice, and emotional impact — with human correlation studies to validate the scoring. It's not perfect, but it's the closest thing to a rigorous automated assessment of writing quality. Strong Judgemark performance is the best single signal for creative writing ability.
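The LLM-as-judge pattern behind this kind of evaluation can be sketched in a few lines. The axis names below match those listed above, but the prompt template, JSON response format, and helper names are illustrative assumptions, not Judgemark's actual implementation:

```python
import json

AXES = ["originality", "coherence", "voice", "emotional_impact"]

# Hypothetical judge prompt; a real harness would also pin the judge model,
# temperature, and a detailed rubric per axis.
JUDGE_PROMPT = (
    "Rate the following story from 0-10 on each axis: "
    + ", ".join(AXES)
    + ". Respond with a JSON object mapping axis name to score.\n\nSTORY:\n{story}"
)

def parse_judge_response(raw: str) -> dict:
    """Extract per-axis scores from the judge model's JSON reply."""
    scores = json.loads(raw)
    return {axis: float(scores[axis]) for axis in AXES}

def overall_score(scores: dict) -> float:
    """Unweighted mean across axes; Judgemark's real weighting may differ."""
    return sum(scores.values()) / len(scores)
```

In practice the fragile step is `parse_judge_response`: judge models occasionally wrap the JSON in prose, so production harnesses add retry and extraction logic around it.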
The creativity dimension on this leaderboard aggregates Judgemark with EQ-Bench and a small number of open-ended generation evaluations. Models that score well on creativity have demonstrated both stylistic range and emotional authenticity across varied prompts and genres.
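A blend of that kind is, at its simplest, a weighted average of normalized sub-scores. The weights and the 0-to-1 normalization below are hypothetical placeholders, since the leaderboard's actual aggregation formula isn't stated here:

```python
def creativity_score(judgemark: float, eq_bench: float, open_ended: float,
                     weights=(0.5, 0.3, 0.2)) -> float:
    """Weighted blend of sub-scores already normalized to [0, 1].

    The weights are illustrative, not the leaderboard's real values.
    """
    assert abs(sum(weights) - 1.0) < 1e-9, "weights should sum to 1"
    components = (judgemark, eq_bench, open_ended)
    return sum(w * s for w, s in zip(weights, components))
```

The normalization step matters more than it looks: benchmarks report on different scales, so blending raw scores without rescaling would let whichever benchmark has the widest range dominate the dimension.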
Longform coherence is tracked through evaluations that examine whether models maintain consistent characterization, plot logic, and prose style across extended generations. This catches a common failure mode: models that write excellent individual paragraphs but lose the thread over longer pieces.
Perplexity and fluency scores — used in many academic NLP evaluations — are poor proxies for creative quality. A model can produce grammatically impeccable, statistically average prose that no human would voluntarily read. Fluency is necessary but nowhere near sufficient.
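For concreteness, perplexity is just the exponentiated mean negative log-probability of the tokens, which is exactly why it rewards statistically average prose:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(-mean(log p(token))).

    Lower means the text is more predictable to the model -- it says
    nothing about whether the prose is worth reading.
    """
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)
```

A model assigning every token probability 0.25 scores a perplexity of 4 regardless of whether the text has any voice at all; a striking, original turn of phrase is by definition less predictable and so *raises* perplexity.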
Current Rankings
Long-form story co-author (creative dimension)
| # | Model | Provider | Score |
|---|---|---|---|
| 1 | grok-4-0709 | xai | 27.1 |
| 2 | gemini-3-pro-preview | google | 25.6 |
| 3 | gemini-2.5-pro | google | 25.5 |
| 4 | gpt-4.1-20250414 | openai | 24.9 |
| 5 | gpt-5-2025-08-07 | openai | 23.3 |
| 6 | o3-20250416 | openai | 22.7 |
| 7 | claude-sonnet-4 | anthropic | 22.4 |
| 8 | gemini-3.1-pro-preview | google | 22.1 |
| 9 | gemini-3-flash-preview | google | 21.2 |
| 10 | gpt-5.2-2025-12-11 | openai | 20.9 |
| 11 | grok-4-1-fast-reasoning | xai-org | 20.7 |
| 12 | qwen-2.5-72b-instruct | qwen | 19.2 |
| 13 | gpt-5.4-2026-03-05 | openai | 18.8 |
| 14 | claude-sonnet-4.6 | anthropic | 18.5 |
| 15 | claude-opus-4-5-20251101 | anthropic | 18.1 |
Reading These Rankings
The scores above reflect performance weighted toward creative quality signals — Judgemark, EQ dimension, and longform coherence — rather than general reasoning ability. A few consistent patterns:
EQ predicts creative rank more than IQ does. This holds across genres. Models with strong theory-of-mind and social reasoning scores write better dialogue, more believable character motivation, and scenes where the emotional subtext actually works. The EQ rankings are worth reading alongside this report.
Larger models have more stylistic range. A bigger model has absorbed more diverse writing and can sustain a wider variety of registers — the difference between writing competent minimalism and competent maximalism, rather than defaulting to the same middle-register prose regardless of what was asked.
Instruction-following quality shapes the output ceiling. Even a highly capable creative model fails if it can't correctly interpret a nuanced style prompt. "Write in the style of a 1970s noir short story with an unreliable narrator" requires both the capability to execute that style and the instruction comprehension to know what was actually asked.
Heavy safety fine-tuning measurably hurts creative writing. This is one of the few use cases where the tradeoff is clearly visible in the data. Models that refuse to engage with morally complex characters, dark themes, or ambiguous endings produce sanitized fiction that satisfies nobody. The models at the top of these rankings are generally willing to go to difficult places when the narrative requires it.
Open-Weights Note
Open-weight models in this ranking are meaningful options for creative writing workloads, particularly for teams building interactive fiction tools, game narrative systems, or writing assistants where inference cost at scale matters. The best open-weight creative models are within striking distance of the top proprietary models on Judgemark — the gap is real but not large.
The practical consideration for creative use cases is that fine-tuning on curated literary data can substantially improve an open-weight model's stylistic range. If you have a specific genre or style target, fine-tuning is more tractable for creative writing than for most other use cases.
Related Use Cases
Creative writing encompasses several distinct sub-tasks with their own characteristics:
- Screenplay and scene writing — Scene structure, dialogue formatting, and visual storytelling differ from prose fiction in ways that affect model performance
- Poetry and lyrics — Meter, constraint satisfaction, and sonic qualities are distinct capabilities from narrative prose
- NPC dialogue — Interactive narrative requires branching coherence and character consistency across player choices, which stresses different capabilities than linear fiction
Full rankings for each are in the use cases browser. The creativity rankings and EQ rankings are the most relevant dimension reports for this use case.
Methodology
Rankings on this page are computed from live benchmark ingestion across the sources described above. Scores update as new benchmark data is ingested. Full methodology at /methodology.