creative
qwen-2.5-72b-instruct vs gpt-4o
Model A wins by +1.4%
Rank #4 · Confidence 48.2% · Evidence 13 pts
Creative Writing Official (EQ-Bench Slice): creative_writing_score
Value 78.4% · Conf 100.0% · Weight 7.8%
artificialanalysis_creative_writing_official.creative_writing_score (Mar 12, 2026)
Judgemark Official (EQ-Bench Slice): judgemark_score
Value 55.6% · Conf 100.0% · Weight 4.2%
artificialanalysis_judgemark_official.judgemark_score (Mar 12, 2026)
EQ-Bench Leaderboard: judgemark_score
Value 55.6% · Conf 100.0% · Weight 1.9%
eq_bench.judgemark_score (Mar 12, 2026)
Galileo Agent Leaderboard v2: Avg AC
Value 76.1% · Conf 100.0% · Weight 1.9%
galileo_agent_v2.avg_ac (Mar 12, 2026)
UGI Leaderboard: Writing ✍️
Value 41.8% · Conf 100.0% · Weight 1.5%
ugi_main.writing (Mar 12, 2026)
Rank #5 · Confidence 36.5% · Evidence 12 pts
Creative Writing Official (EQ-Bench Slice): creative_writing_score
Value 84.4% · Conf 100.0% · Weight 8.4%
artificialanalysis_creative_writing_official.creative_writing_score (Mar 12, 2026)
Judgemark Official (EQ-Bench Slice): judgemark_score
Value 74.3% · Conf 100.0% · Weight 5.6%
artificialanalysis_judgemark_official.judgemark_score (Mar 12, 2026)
EQ-Bench Leaderboard: judgemark_score
Value 74.3% · Conf 100.0% · Weight 2.6%
eq_bench.judgemark_score (Mar 12, 2026)
MEGA-Bench: overall_score
Value 92.8% · Conf 100.0% · Weight 0.9%
mega_bench.overall_score (Mar 12, 2026)
DuckDB NSQL Leaderboard: all_execution_accuracy
Value 76.9% · Conf 100.0% · Weight 0.8%
duckdb_nsql_leaderboard.all_execution_accuracy (Mar 12, 2026)