marketing_sales
Grok-4-0709 vs gemini-2.5-pro
For Campaign brief
Model A wins by +0.3%
Rank #1
Confidence: 35.5%
Evidence: 19 pts
Galileo Agent Leaderboard v2: Avg TSQ
Value 84.6% · Conf 100.0% · Weight 4.1%
galileo_agent_v2.avg_tsq (Mar 12, 2026)
UGI Leaderboard: Writing ✍️
Value 99.2% · Conf 100.0% · Weight 2.2%
ugi_main.writing (Mar 12, 2026)
Galileo Agent Leaderboard v2: Avg AC
Value 56.5% · Conf 100.0% · Weight 1.6%
galileo_agent_v2.avg_ac (Mar 12, 2026)
Vals CorpFin v2: overall_accuracy_pct
Value 93.6% · Conf 100.0% · Weight 0.8%
vals_corp_fin_v2.overall_accuracy_pct (Mar 12, 2026)
Vals LiveCodeBench: overall_accuracy_pct
Value 92.8% · Conf 100.0% · Weight 0.8%
vals_lcb.overall_accuracy_pct (Mar 12, 2026)
Rank #2
Confidence: 36.8%
Evidence: 22 pts
Galileo Agent Leaderboard v2: Avg TSQ
Value 79.5% · Conf 100.0% · Weight 3.8%
galileo_agent_v2.avg_tsq (Mar 12, 2026)
UGI Leaderboard: Writing ✍️
Value 96.3% · Conf 100.0% · Weight 2.2%
ugi_main.writing (Mar 12, 2026)
Galileo Agent Leaderboard v2: Avg AC
Value 58.7% · Conf 100.0% · Weight 1.7%
galileo_agent_v2.avg_ac (Mar 12, 2026)
Vectara HHEM Leaderboard: overall_hallucination_error_pct
Value 76.0% · Conf 100.0% · Weight 0.9%
vectara_hhem_leaderboard.overall_hallucination_error_pct (Mar 12, 2026)
FACTS Benchmark Suite: facts_grounding_score_pct
Value 100.0% · Conf 100.0% · Weight 0.8%
facts_benchmark_suite.facts_grounding_score_pct (Mar 12, 2026)