BasedAGI

cybersecurity

Threat Intelligence Analysis

Analyzing threat reports, CVEs, and security advisories to produce structured risk assessments.

#1 Recommendation

gemini-2.5-pro

Strong on BaxBench Leaderboard average_secure_pass_1_pct (44%) and FACTS Benchmark Suite facts_grounding_score_pct (100%)

external/google/gemini-2-5-pro

Score

27.9%

Confidence

43.6%

Limited benchmark evidence for this use case.

57 ranked models with average evidence of 14.7 points. Rankings may shift as more benchmark data is ingested.

Ranked Models

30

Evidence Quality

79%

Scoring

Benchmark-backed

Top Signal

BaxBench Leaderboard: average_secure_pass_1_pct

All Ranked Models

Rank Model                                          Score
#1   gemini-2.5-pro                                 27.9%
     Strong on BaxBench Leaderboard average_secure_pass_1_pct (44%) and FACTS Benchmark Suite facts_grounding_score_pct (100%)
#2   o3-20250416                                    20.0%
     Strong on BaxBench Leaderboard average_secure_pass_1_pct (67%) and VADER Leaderboard mean_score_pct (100%)
#3   gemini-3-pro-preview                           19.8%
     Strong on FACTS Benchmark Suite facts_grounding_score_pct (88%) and FACTS Benchmark Suite facts_search_score_pct (100%)
#4   gpt-4.1-20250414                               19.3%
#5   gpt-5-2025-08-07                               16.2%
#6   gpt-5-mini-2025-08-07                          16.1%
#7   anthropic/claude-sonnet-4.6                    14.9%
#8   Grok-4-0709                                    14.4%
#9   google/gemini-3.1-pro-preview                  13.6%
#10  openai/gpt-5.4-2026-03-05                      13.4%
#11  gemini-2.5-flash                               13.1%
#12  claude-opus-4-5-20251101                       13.0%
#14  gpt-5.1-2025-11-13                             11.9%
#15  claude-sonnet-4-20250514                       11.8%
#16  openai/gpt-4.1                                 11.8%
#17  gemini-3-flash-preview                         11.6%
#18  x-ai/grok-3                                    11.5%
#19  google/gemini-3.1-flash-lite-preview           11.1%
#20  xai-org/grok-4-fast-reasoning                  11.0%
#21  gpt-4.1-mini-20250414                          10.9%
#23  xai-org/grok-4-1-fast-reasoning                10.4%
#24  anthropic/claude-opus-4-6-thinking             10.4%
#25  gpt-5.2-2025-12-11                             10.4%
#26  kimi/kimi-k2.5-thinking                        9.8%
#27  anthropic/claude-opus-4-5-20251101-thinking    9.6%
#28  deepseek/deepseek-r1                           9.5%
#29  gpt-4o                                         9.3%
#30  gpt-4o-2024-05-13                              9.1%
#31  anthropic/claude-sonnet-4-5-20250929-thinking  9.0%
#32  openai/gpt-4o-mini-2024-07-18                  8.9%
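
The rank column above runs to #32 but lists only 30 rows, so two positions are absent. The page does not say why. As a minimal sketch of how such gaps could be spotted when auditing a scraped ranking, the hypothetical helper below (`missing_ranks` is not part of the site) returns the positions missing between 1 and the highest rank seen:

```python
def missing_ranks(ranks: list[int]) -> list[int]:
    """Return rank positions absent between 1 and the highest rank seen."""
    if not ranks:
        return []
    seen = set(ranks)
    return [r for r in range(1, max(seen) + 1) if r not in seen]


# Rank numbers exactly as they appear in the list above: 1-12, 14-21, 23-32.
listed = [*range(1, 13), *range(14, 22), *range(23, 33)]

print(len(listed), missing_ranks(listed))  # 30 [13, 22]
```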

Compare Models

Model A leads by +7.9%


Model A

gemini-2.5-pro

external/google/gemini-2-5-pro

27.9%

Rank #1

Confidence 43.6% · 30 evidence pts

BaxBench Leaderboard: average_secure_pass_1_pct

Value 44.1% · Conf 100.0% · Weight 2.0%

baxbench_leaderboard.average_secure_pass_1_pct (Mar 17, 2026)

FACTS Benchmark Suite: facts_grounding_score_pct

Value 100.0% · Conf 100.0% · Weight 2.0%

facts_benchmark_suite.facts_grounding_score_pct (Mar 17, 2026)

VADER Leaderboard: mean_score_pct

Value 80.8% · Conf 100.0% · Weight 1.7%

vader_leaderboard.mean_score_pct (Mar 17, 2026)

Vectara HHEM Leaderboard: overall_hallucination_error_pct

Value 76.0% · Conf 100.0% · Weight 1.6%

vectara_hhem_leaderboard.overall_hallucination_error_pct (Mar 17, 2026)

Model B

o3-20250416

external/openai/o3-20250416

20.0%

Rank #2

Confidence 27.9% · 23 evidence pts

BaxBench Leaderboard: average_secure_pass_1_pct

Value 67.2% · Conf 100.0% · Weight 3.0%

baxbench_leaderboard.average_secure_pass_1_pct (Mar 17, 2026)

VADER Leaderboard: mean_score_pct

Value 100.0% · Conf 100.0% · Weight 2.1%

vader_leaderboard.mean_score_pct (Mar 17, 2026)

Vals CorpFin v2: overall_accuracy_pct

Value 75.3% · Conf 100.0% · Weight 1.2%

vals_corp_fin_v2.overall_accuracy_pct (Mar 17, 2026)

SciArena Leaderboard: rating_elo

Value 100.0% · Conf 100.0% · Weight 1.1%

sciarena_leaderboard.rating_elo (Mar 17, 2026)
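
Each signal above is reported as a value, a confidence, and a weight. The page does not state how these combine into the headline score, so the following is only a sketch under one assumption: that a signal's contribution is the product value × confidence × weight. The names `Signal`, `parse_signal`, and `contribution` are hypothetical. Only four signals per model are displayed (out of 30 and 23 evidence points), so the totals here are not expected to reproduce the 27.9% and 20.0% scores.

```python
import re
from typing import NamedTuple


class Signal(NamedTuple):
    value: float   # benchmark value, in percent
    conf: float    # confidence, in percent
    weight: float  # weight, in percent


# Matches lines such as "Value 44.1% · Conf 100.0% · Weight 2.0%".
_SIGNAL_RE = re.compile(
    r"Value\s+([\d.]+)%\s*·\s*Conf\s+([\d.]+)%\s*·\s*Weight\s+([\d.]+)%"
)


def parse_signal(line: str) -> Signal:
    """Parse one 'Value · Conf · Weight' line into a Signal."""
    match = _SIGNAL_RE.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"unrecognized signal line: {line!r}")
    return Signal(*(float(group) for group in match.groups()))


def contribution(signal: Signal) -> float:
    """ASSUMPTION: contribution = value x confidence x weight, in percentage points."""
    return signal.value / 100 * signal.conf / 100 * signal.weight


# The four signals displayed for Model A (gemini-2.5-pro).
model_a_lines = [
    "Value 44.1% · Conf 100.0% · Weight 2.0%",
    "Value 100.0% · Conf 100.0% · Weight 2.0%",
    "Value 80.8% · Conf 100.0% · Weight 1.7%",
    "Value 76.0% · Conf 100.0% · Weight 1.6%",
]

model_a_total = sum(contribution(parse_signal(line)) for line in model_a_lines)
print(f"{model_a_total:.4f}")  # 5.4716
```

Under the same assumption, the four signals shown for Model B sum to about 6.12, higher than Model A's 5.47 even though Model A ranks first. That suggests the signals not displayed here account for most of the gap, though the page does not confirm this.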

Ranking Diagnostics & Missing Models

Source Lift

Ranked

57

Sources

8

Quality

Insufficient

Vals CorpFin v2

vals_corp_fin_v2

43 rows

1.0% avg lift

Vals Legal Bench

vals_legal_bench

37 rows

0.2% avg lift

Vals MedQA

vals_medqa

36 rows

0.2% avg lift

Vals Tax Eval v2

vals_tax_eval_v2

34 rows

0.2% avg lift

Missing Strong Models

No obvious gaps right now.

Taxonomy Details

Core Tasks

task.multi_doc_synthesis
task.risk_assessment

Required Modes

mode.long_context

Domains

domain.cybersecurity_defense

Related Use Cases