creative
qwen-2.5-72b-instruct vs gpt-4o
Model A wins by +0.4%
Rank #1
Confidence 34.9% · Evidence 15 pts
Creative Writing Official (EQ-Bench Slice): creative_writing_score
Value 78.4% · Conf 100.0% · Weight 5.0%
artificialanalysis_creative_writing_official.creative_writing_score (Mar 12, 2026)
Judgemark Official (EQ-Bench Slice): judgemark_score
Value 55.6% · Conf 100.0% · Weight 2.6%
artificialanalysis_judgemark_official.judgemark_score (Mar 12, 2026)
DuckDB NSQL Leaderboard: all_execution_accuracy
Value 82.7% · Conf 100.0% · Weight 1.6%
duckdb_nsql_leaderboard.all_execution_accuracy (Mar 12, 2026)
JSONSchemaBench Leaderboard: medium_schema_compliance_pct
Value 90.1% · Conf 100.0% · Weight 1.4%
jsonschemabench_leaderboard.medium_schema_compliance_pct (Mar 12, 2026)
EQ-Bench Leaderboard: judgemark_score
Value 55.6% · Conf 100.0% · Weight 1.2%
eq_bench.judgemark_score (Mar 12, 2026)
Rank #2
Confidence 27.7% · Evidence 14 pts
Creative Writing Official (EQ-Bench Slice): creative_writing_score
Value 84.4% · Conf 100.0% · Weight 5.3%
artificialanalysis_creative_writing_official.creative_writing_score (Mar 12, 2026)
Judgemark Official (EQ-Bench Slice): judgemark_score
Value 74.3% · Conf 100.0% · Weight 3.5%
artificialanalysis_judgemark_official.judgemark_score (Mar 12, 2026)
EQ-Bench Leaderboard: judgemark_score
Value 74.3% · Conf 100.0% · Weight 1.6%
eq_bench.judgemark_score (Mar 12, 2026)
JSONSchemaBench Leaderboard: medium_schema_compliance_pct
Value 100.0% · Conf 100.0% · Weight 1.6%
jsonschemabench_leaderboard.medium_schema_compliance_pct (Mar 12, 2026)
DuckDB NSQL Leaderboard: all_execution_accuracy
Value 76.9% · Conf 100.0% · Weight 1.5%
duckdb_nsql_leaderboard.all_execution_accuracy (Mar 12, 2026)