adult
Grok-4-0709 vs gpt-4.1-20250414
Model A wins by +0.8%
Rank #21
Confidence: 25.9%
Evidence: 20 pts
UGI Leaderboard: Writing ✍️
Value 99.2% · Conf 100.0% · Weight 3.8%
ugi_main.writing (Mar 12, 2026)
UGI Leaderboard: Entertainment
Value 100.0% · Conf 100.0% · Weight 3.4%
ugi_main.entertainment (Mar 12, 2026)
Galileo Agent Leaderboard v2: Avg TSQ
Value 84.6% · Conf 100.0% · Weight 1.3%
galileo_agent_v2.avg_tsq (Mar 12, 2026)
Galileo Agent Leaderboard v2: Avg AC
Value 56.5% · Conf 100.0% · Weight 1.2%
galileo_agent_v2.avg_ac (Mar 12, 2026)
Vals CorpFin v2: overall_accuracy_pct
Value 93.6% · Conf 100.0% · Weight 0.6%
vals_corp_fin_v2.overall_accuracy_pct (Mar 12, 2026)
Rank #25
Confidence: 25.6%
Evidence: 20 pts
UGI Leaderboard: Writing ✍️
Value 100.0% · Conf 100.0% · Weight 3.8%
ugi_main.writing (Mar 12, 2026)
UGI Leaderboard: Entertainment
Value 73.3% · Conf 100.0% · Weight 2.5%
ugi_main.entertainment (Mar 12, 2026)
Galileo Agent Leaderboard v2: Avg AC
Value 100.0% · Conf 100.0% · Weight 2.1%
galileo_agent_v2.avg_ac (Mar 12, 2026)
Galileo Agent Leaderboard v2: Avg TSQ
Value 64.1% · Conf 100.0% · Weight 1.0%
galileo_agent_v2.avg_tsq (Mar 12, 2026)
Vectara HHEM Leaderboard: overall_hallucination_error_pct
Value 82.5% · Conf 100.0% · Weight 0.7%
vectara_hhem_leaderboard.overall_hallucination_error_pct (Mar 12, 2026)