Grok-4-0709 vs gpt-4.1-20250414

For Adult ERP roleplay (explicit)

Model A wins by +0.8%

Model A

Winner

Grok-4-0709

external/xai/grok-4-0709

20.6%

Rank #21

Confidence

25.9%

Evidence

20 pts

UGI Leaderboard: Writing ✍️

Value 99.2% · Conf 100.0% · Weight 3.8%

ugi_main.writing (Mar 12, 2026)

UGI Leaderboard: Entertainment

Value 100.0% · Conf 100.0% · Weight 3.4%

ugi_main.entertainment (Mar 12, 2026)

Galileo Agent Leaderboard v2: Avg TSQ

Value 84.6% · Conf 100.0% · Weight 1.3%

galileo_agent_v2.avg_tsq (Mar 12, 2026)

Galileo Agent Leaderboard v2: Avg AC

Value 56.5% · Conf 100.0% · Weight 1.2%

galileo_agent_v2.avg_ac (Mar 12, 2026)

Vals CorpFin v2: overall_accuracy_pct

Value 93.6% · Conf 100.0% · Weight 0.6%

vals_corp_fin_v2.overall_accuracy_pct (Mar 12, 2026)

Model B

gpt-4.1-20250414

external/openai/gpt-4-1-20250414

19.8%

Rank #25

Confidence

25.6%

Evidence

20 pts

UGI Leaderboard: Writing ✍️

Value 100.0% · Conf 100.0% · Weight 3.8%

ugi_main.writing (Mar 12, 2026)

UGI Leaderboard: Entertainment

Value 73.3% · Conf 100.0% · Weight 2.5%

ugi_main.entertainment (Mar 12, 2026)

Galileo Agent Leaderboard v2: Avg AC

Value 100.0% · Conf 100.0% · Weight 2.1%

galileo_agent_v2.avg_ac (Mar 12, 2026)

Galileo Agent Leaderboard v2: Avg TSQ

Value 64.1% · Conf 100.0% · Weight 1.0%

galileo_agent_v2.avg_tsq (Mar 12, 2026)

Vectara HHEM Leaderboard: overall_hallucination_error_pct

Value 82.5% · Conf 100.0% · Weight 0.7%

vectara_hhem_leaderboard.overall_hallucination_error_pct (Mar 12, 2026)
