hr_recruiting
gpt-4.1-20250414 vs Grok-4-0709
Model A wins by +7.1%
Rank #1
Confidence: 29.5%
Evidence: 19 pts
MMLongBench-Doc Leaderboard: acc_score_pct
Value 74.6% · Conf 100.0% · Weight 6.4%
mmlongbench_doc_leaderboard.acc_score_pct (Mar 12, 2026)
Galileo Agent Leaderboard v2: Avg TSQ
Value 64.1% · Conf 100.0% · Weight 2.6%
galileo_agent_v2.avg_tsq (Mar 12, 2026)
Galileo Agent Leaderboard v2: Avg AC
Value 100.0% · Conf 100.0% · Weight 1.9%
galileo_agent_v2.avg_ac (Mar 12, 2026)
Vectara HHEM Leaderboard: overall_hallucination_error_pct
Value 82.5% · Conf 100.0% · Weight 0.6%
vectara_hhem_leaderboard.overall_hallucination_error_pct (Mar 12, 2026)
Vals Tax Eval v2: overall_accuracy_pct
Value 94.2% · Conf 100.0% · Weight 0.5%
vals_tax_eval_v2.overall_accuracy_pct (Mar 12, 2026)
Rank #2
Confidence: 18.7%
Evidence: 18 pts
Galileo Agent Leaderboard v2: Avg TSQ
Value 84.6% · Conf 100.0% · Weight 3.5%
galileo_agent_v2.avg_tsq (Mar 12, 2026)
Galileo Agent Leaderboard v2: Avg AC
Value 56.5% · Conf 100.0% · Weight 1.1%
galileo_agent_v2.avg_ac (Mar 12, 2026)
Vals CorpFin v2: overall_accuracy_pct
Value 93.6% · Conf 100.0% · Weight 0.5%
vals_corp_fin_v2.overall_accuracy_pct (Mar 12, 2026)
Vals LiveCodeBench: overall_accuracy_pct
Value 92.8% · Conf 100.0% · Weight 0.5%
vals_lcb.overall_accuracy_pct (Mar 12, 2026)
Vals MedQA: overall_accuracy_pct
Value 92.4% · Conf 100.0% · Weight 0.5%
vals_medqa.overall_accuracy_pct (Mar 12, 2026)