cybersecurity
Threat Intelligence Analysis
Analyzing threat reports, CVEs, and security advisories to produce structured risk assessments.
#1 Recommendation
gemini-2.5-pro
Strong on BaxBench Leaderboard average_secure_pass_1_pct (44%) and FACTS Benchmark Suite facts_grounding_score_pct (100%)
external/google/gemini-2-5-pro
Score: 27.9%
Confidence: 43.6%
Limited benchmark evidence for this use case.
57 ranked models with average evidence of 14.7 points. Rankings may shift as more benchmark data is ingested.
Ranked Models: 30
Evidence Quality: 79%
Scoring: Benchmark-backed
Top Signal: BaxBench Leaderboard: average_secure_pass_1_pct
Compare Models
Model A leads by +7.9%
Model A
gemini-2.5-pro
external/google/gemini-2-5-pro
Rank #1
BaxBench Leaderboard: average_secure_pass_1_pct
Value 44.1% · Conf 100.0% · Weight 2.0%
baxbench_leaderboard.average_secure_pass_1_pct (Mar 17, 2026)
FACTS Benchmark Suite: facts_grounding_score_pct
Value 100.0% · Conf 100.0% · Weight 2.0%
facts_benchmark_suite.facts_grounding_score_pct (Mar 17, 2026)
VADER Leaderboard: mean_score_pct
Value 80.8% · Conf 100.0% · Weight 1.7%
vader_leaderboard.mean_score_pct (Mar 17, 2026)
Vectara HHEM Leaderboard: overall_hallucination_error_pct
Value 76.0% · Conf 100.0% · Weight 1.6%
vectara_hhem_leaderboard.overall_hallucination_error_pct (Mar 17, 2026)
Model B
o3-20250416
external/openai/o3-20250416
Rank #2
BaxBench Leaderboard: average_secure_pass_1_pct
Value 67.2% · Conf 100.0% · Weight 3.0%
baxbench_leaderboard.average_secure_pass_1_pct (Mar 17, 2026)
VADER Leaderboard: mean_score_pct
Value 100.0% · Conf 100.0% · Weight 2.1%
vader_leaderboard.mean_score_pct (Mar 17, 2026)
Vals CorpFin v2: overall_accuracy_pct
Value 75.3% · Conf 100.0% · Weight 1.2%
vals_corp_fin_v2.overall_accuracy_pct (Mar 17, 2026)
SciArena Leaderboard: rating_elo
Value 100.0% · Conf 100.0% · Weight 1.1%
sciarena_leaderboard.rating_elo (Mar 17, 2026)
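Each signal above is listed as a "Value · Conf · Weight" line. The sketch below shows one way such lines could be parsed and combined into a single confidence-weighted mean. The aggregation formula is an assumption made for illustration; this page does not state how its Score is computed, and the names parse_signal and weighted_mean are hypothetical.

```python
import re

# Matches lines of the form "Value 44.1% · Conf 100.0% · Weight 2.0%".
SIGNAL_RE = re.compile(
    r"Value\s+(?P<value>[\d.]+)%\s*·\s*"
    r"Conf\s+(?P<conf>[\d.]+)%\s*·\s*"
    r"Weight\s+(?P<weight>[\d.]+)%"
)


def parse_signal(line):
    """Parse one signal line into a dict of floats (all in percent units)."""
    match = SIGNAL_RE.search(line)
    if match is None:
        raise ValueError(f"not a signal line: {line!r}")
    return {key: float(text) for key, text in match.groupdict().items()}


def weighted_mean(signals):
    """Confidence-weighted mean of signal values.

    ASSUMPTION: each signal contributes value * (conf / 100) * weight,
    normalized by the sum of (conf / 100) * weight. The page's actual
    scoring formula is not documented here.
    """
    total = sum(s["conf"] / 100 * s["weight"] for s in signals)
    if total == 0:
        raise ValueError("no usable signals")
    weighted = sum(s["value"] * s["conf"] / 100 * s["weight"] for s in signals)
    return weighted / total


# Usage: the four Model A signal lines shown on this page.
model_a_lines = [
    "Value 44.1% · Conf 100.0% · Weight 2.0%",
    "Value 100.0% · Conf 100.0% · Weight 2.0%",
    "Value 80.8% · Conf 100.0% · Weight 1.7%",
    "Value 76.0% · Conf 100.0% · Weight 1.6%",
]
model_a_signals = [parse_signal(line) for line in model_a_lines]
model_a_mean = weighted_mean(model_a_signals)
```

This illustrates the arithmetic only. The result is not expected to reproduce the Score shown for either model, since the page lists only its top signals and its real weighting scheme is unknown.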
Ranking Diagnostics & Missing Models
Source Lift
Ranked: 57
Sources: 8
Quality: Insufficient
Vals CorpFin v2 (vals_corp_fin_v2): 43 rows, 1.0% avg lift
Vals Legal Bench (vals_legal_bench): 37 rows, 0.2% avg lift
Vals MedQA (vals_medqa): 36 rows, 0.2% avg lift
Vals Tax Eval v2 (vals_tax_eval_v2): 34 rows, 0.2% avg lift
Missing Strong Models
No obvious gaps right now.
Taxonomy Details
Core Tasks
Required Modes
Domains
Related Use Cases
cybersecurity
Security incident triage
Triage security incidents from alerts/logs into impact and next steps.
Top: gemini-2.5-pro
cybersecurity
Vulnerability-oriented code review
Review code for security vulnerabilities and propose mitigations.
Top: gemini-2.5-pro
cybersecurity
Malware analysis report (defensive)
Explain suspicious code and produce a defensive analysis report.
Top: gemini-2.5-pro