marketing_sales

Grok-4-0709 vs gemini-2.5-pro

For Campaign brief

Model A wins by +0.3%

Model A

Winner

Grok-4-0709

external/xai/grok-4-0709

26.2%

Rank #1

Confidence

35.5%

Evidence

19 pts

Galileo Agent Leaderboard v2: Avg TSQ

Value 84.6% · Conf 100.0% · Weight 4.1%

galileo_agent_v2.avg_tsq (Mar 12, 2026)

UGI Leaderboard: Writing ✍️

Value 99.2% · Conf 100.0% · Weight 2.2%

ugi_main.writing (Mar 12, 2026)

Galileo Agent Leaderboard v2: Avg AC

Value 56.5% · Conf 100.0% · Weight 1.6%

galileo_agent_v2.avg_ac (Mar 12, 2026)

Vals CorpFin v2: overall_accuracy_pct

Value 93.6% · Conf 100.0% · Weight 0.8%

vals_corp_fin_v2.overall_accuracy_pct (Mar 12, 2026)

Vals LiveCodeBench: overall_accuracy_pct

Value 92.8% · Conf 100.0% · Weight 0.8%

vals_lcb.overall_accuracy_pct (Mar 12, 2026)

Model B

gemini-2.5-pro

external/google/gemini-2-5-pro

25.9%

Rank #2

Confidence

36.8%

Evidence

22 pts

Galileo Agent Leaderboard v2: Avg TSQ

Value 79.5% · Conf 100.0% · Weight 3.8%

galileo_agent_v2.avg_tsq (Mar 12, 2026)

UGI Leaderboard: Writing ✍️

Value 96.3% · Conf 100.0% · Weight 2.2%

ugi_main.writing (Mar 12, 2026)

Galileo Agent Leaderboard v2: Avg AC

Value 58.7% · Conf 100.0% · Weight 1.7%

galileo_agent_v2.avg_ac (Mar 12, 2026)

Vectara HHEM Leaderboard: overall_hallucination_error_pct

Value 76.0% · Conf 100.0% · Weight 0.9%

vectara_hhem_leaderboard.overall_hallucination_error_pct (Mar 12, 2026)

FACTS Benchmark Suite: facts_grounding_score_pct

Value 100.0% · Conf 100.0% · Weight 0.8%

facts_benchmark_suite.facts_grounding_score_pct (Mar 12, 2026)
