creative
qwen-2.5-72b-instruct vs gpt-4o
For NPC dialogue
Model A wins by +1.0%
Rank #12 · Confidence 35.0% · Evidence 13 pts
Creative Writing Official (EQ-Bench Slice): creative_writing_score
Value 78.4% · Conf 100.0% · Weight 6.1%
artificialanalysis_creative_writing_official.creative_writing_score (Mar 12, 2026)
Judgemark Official (EQ-Bench Slice): judgemark_score
Value 55.6% · Conf 100.0% · Weight 3.3%
artificialanalysis_judgemark_official.judgemark_score (Mar 12, 2026)
EQ-Bench Leaderboard: judgemark_score
Value 55.6% · Conf 100.0% · Weight 1.5%
eq_bench.judgemark_score (Mar 12, 2026)
Galileo Agent Leaderboard v2: Avg AC
Value 76.1% · Conf 100.0% · Weight 1.5%
galileo_agent_v2.avg_ac (Mar 12, 2026)
UGI Leaderboard: Writing ✍️
Value 41.8% · Conf 100.0% · Weight 1.1%
ugi_main.writing (Mar 12, 2026)
Rank #16 · Confidence 26.5% · Evidence 12 pts
Creative Writing Official (EQ-Bench Slice): creative_writing_score
Value 84.4% · Conf 100.0% · Weight 6.6%
artificialanalysis_creative_writing_official.creative_writing_score (Mar 12, 2026)
Judgemark Official (EQ-Bench Slice): judgemark_score
Value 74.3% · Conf 100.0% · Weight 4.4%
artificialanalysis_judgemark_official.judgemark_score (Mar 12, 2026)
EQ-Bench Leaderboard: judgemark_score
Value 74.3% · Conf 100.0% · Weight 2.0%
eq_bench.judgemark_score (Mar 12, 2026)
MEGA-Bench: overall_score
Value 92.8% · Conf 100.0% · Weight 0.7%
mega_bench.overall_score (Mar 12, 2026)
DuckDB NSQL Leaderboard: all_execution_accuracy
Value 76.9% · Conf 100.0% · Weight 0.6%
duckdb_nsql_leaderboard.all_execution_accuracy (Mar 12, 2026)