cybersecurity
gemini-2.5-pro vs o3-20250416
Model A wins by +7.9%
Rank #1
Confidence 43.6% · Evidence 30 pts
BaxBench Leaderboard: average_secure_pass_1_pct
Value 44.1% · Conf 100.0% · Weight 2.0%
baxbench_leaderboard.average_secure_pass_1_pct (Mar 17, 2026)
FACTS Benchmark Suite: facts_grounding_score_pct
Value 100.0% · Conf 100.0% · Weight 2.0%
facts_benchmark_suite.facts_grounding_score_pct (Mar 17, 2026)
VADER Leaderboard: mean_score_pct
Value 80.8% · Conf 100.0% · Weight 1.7%
vader_leaderboard.mean_score_pct (Mar 17, 2026)
Vectara HHEM Leaderboard: overall_hallucination_error_pct
Value 76.0% · Conf 100.0% · Weight 1.6%
vectara_hhem_leaderboard.overall_hallucination_error_pct (Mar 17, 2026)
AI SOC LLM Leaderboard: overall_success_rate_pct
Value 80.6% · Conf 100.0% · Weight 1.4%
ai_soc_llm_leaderboard.overall_success_rate_pct (Mar 17, 2026)
Rank #2
Confidence 27.9% · Evidence 23 pts
BaxBench Leaderboard: average_secure_pass_1_pct
Value 67.2% · Conf 100.0% · Weight 3.0%
baxbench_leaderboard.average_secure_pass_1_pct (Mar 17, 2026)
VADER Leaderboard: mean_score_pct
Value 100.0% · Conf 100.0% · Weight 2.1%
vader_leaderboard.mean_score_pct (Mar 17, 2026)
Vals CorpFin v2: overall_accuracy_pct
Value 75.3% · Conf 100.0% · Weight 1.2%
vals_corp_fin_v2.overall_accuracy_pct (Mar 17, 2026)
SciArena Leaderboard: rating_elo
Value 100.0% · Conf 100.0% · Weight 1.1%
sciarena_leaderboard.rating_elo (Mar 17, 2026)
VADER Leaderboard: remediation_mean_score_pct
Value 100.0% · Conf 100.0% · Weight 1.1%
vader_leaderboard.remediation_mean_score_pct (Mar 17, 2026)