Cybersecurity
Vulnerability-oriented code review
Review code for security vulnerabilities and propose mitigations.
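To make the task concrete: a model scored on this use case should flag a vulnerability in submitted code and propose a working mitigation. A minimal hypothetical example (not drawn from any benchmark) is the classic SQL injection found via string interpolation, mitigated with a parameterized query:

```python
import sqlite3

def find_user_vulnerable(conn, username):
    # VULNERABLE: string interpolation lets an attacker inject SQL,
    # e.g. username = "x' OR '1'='1" matches every row
    return conn.execute(
        f"SELECT id, name FROM users WHERE name = '{username}'"
    ).fetchall()

def find_user_fixed(conn, username):
    # MITIGATION: parameterized query; the driver binds the value safely
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (username,)
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "alice"), (2, "bob")])

payload = "x' OR '1'='1"
print(len(find_user_vulnerable(conn, payload)))  # injection returns all rows
print(len(find_user_fixed(conn, payload)))       # parameterized: no match
```

Benchmarks like VADER score exactly this pattern: identifying the flaw, explaining it, and remediating it (the `mean_score_pct`, `explanation_mean_score_pct`, and `remediation_mean_score_pct` signals below).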
#1 Recommendation
gemini-2.5-pro
Strong on VADER Leaderboard mean_score_pct (81%) and BaxBench Leaderboard average_secure_pass_1_pct (44%)
external/google/gemini-2-5-pro
Score: 21.2%
Confidence: 32.1%
Limited benchmark evidence for this use case: 36 ranked models with an average of 12.6 evidence points each. Rankings may shift as more benchmark data is ingested.
Ranked Models: 30
Evidence Quality: 80%
Scoring: Benchmark-backed
Top Signal: VADER Leaderboard mean_score_pct
All Ranked Models
| Rank | Model | Score |
|---|---|---|
| #1 | gemini-2.5-pro (strong on VADER mean_score_pct 81%, BaxBench average_secure_pass_1_pct 44%) | 21.2% |
| #4 | Meta-Llama-3-8B-Instruct | 16.0% |
| #5 | gpt-4o-2024-05-13 | 15.8% |
| #6 | Llama-2-7b-chat-hf | 15.1% |
| #8 | openai/gpt-4o-mini-2024-07-18 | 13.1% |
| #9 | deepseek/deepseek-r1 | 13.1% |
| #10 | gpt-4.1-20250414 | 12.5% |
| #11 | Kimi K2 Thinking | 12.2% |
| #12 | gemma-7b-it | 12.2% |
| #13 | gemma-2b-it | 12.2% |
| #15 | z-ai/glm-4.7 | 11.9% |
| #17 | falcon-7b-instruct | 11.3% |
| #19 | minimax/minimax-m2.1 | 11.2% |
| #21 | gemini-3-pro-preview | 10.8% |
| #23 | zephyr-7b-beta | 10.4% |
| #24 | GLM-5 | 10.4% |
| #28 | Grok-4-0709 | 10.2% |
| #29 | google/gemini-3.1-pro-preview | 9.8% |
| #30 | claude-sonnet-4-20250514 | 9.8% |
| #32 | gpt-5-2025-08-07 | 9.0% |
| #33 | openai/gpt-5.4-2026-03-05 | 8.9% |
| #34 | gpt-4o | 8.8% |
| #35 | gpt-5.1-2025-11-13 | 8.6% |
| #36 | anthropic/claude-sonnet-4.6 | 8.6% |
| #37 | claude-opus-4-5-20251101 | 8.5% |
| #38 | gpt-5-mini-2025-08-07 | 8.3% |
| #39 | gemini-3-flash-preview | 8.1% |
| #40 | alpaca-native | 8.1% |
| #41 | x-ai/grok-3 | 8.0% |
| #42 | Mistral-7B-OpenOrca | 8.0% |
Compare Models
Model A leads by +5.2%
Model A
gemini-2.5-pro
external/google/gemini-2-5-pro
Rank #1
VADER Leaderboard: mean_score_pct
Value 80.8% · Conf 100.0% · Weight 2.5%
vader_leaderboard.mean_score_pct (Mar 12, 2026)
BaxBench Leaderboard: average_secure_pass_1_pct
Value 44.1% · Conf 100.0% · Weight 2.0%
baxbench_leaderboard.average_secure_pass_1_pct (Mar 12, 2026)
VADER Leaderboard: explanation_mean_score_pct
Value 100.0% · Conf 100.0% · Weight 1.4%
vader_leaderboard.explanation_mean_score_pct (Mar 12, 2026)
VADER Leaderboard: remediation_mean_score_pct
Value 66.7% · Conf 100.0% · Weight 1.1%
vader_leaderboard.remediation_mean_score_pct (Mar 12, 2026)
Model B
Meta-Llama-3-8B-Instruct
meta-llama/Meta-Llama-3-8B-Instruct
Rank #4
LLM Trustworthy Leaderboard: adv
Value 100.0% · Conf 100.0% · Weight 2.8%
llm_trustworthy_leaderboard.adv (Mar 12, 2026)
LLM Trustworthy Leaderboard: privacy
Value 69.0% · Conf 100.0% · Weight 2.2%
llm_trustworthy_leaderboard.privacy (Mar 12, 2026)
RepoQA Official Results: overall_average_pass_at_1_pct
Value 64.9% · Conf 100.0% · Weight 1.8%
repoqa_leaderboard.overall_average_pass_at_1_pct (Mar 12, 2026)
LLM Trustworthy Leaderboard: fairness
Value 46.8% · Conf 100.0% · Weight 1.6%
llm_trustworthy_leaderboard.fairness (Mar 12, 2026)
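Each signal above carries a value, a confidence, and a weight, which suggests the use-case score aggregates weighted benchmark contributions. A minimal sketch of such an aggregation, using Model A's four listed signals; note the formula (`value × confidence × weight`, all as fractions) is an assumption, and these four are only the top contributors, so the sum is a partial contribution and will not reproduce the full 21.2% score:

```python
# Assumed aggregation: score ≈ sum(value * confidence * weight) over all
# signals, with each quantity expressed as a fraction. Only the four top
# signals shown for gemini-2.5-pro are included here.
signals = [
    # (value, confidence, weight)
    (0.808, 1.0, 0.025),  # VADER mean_score_pct
    (0.441, 1.0, 0.020),  # BaxBench average_secure_pass_1_pct
    (1.000, 1.0, 0.014),  # VADER explanation_mean_score_pct
    (0.667, 1.0, 0.011),  # VADER remediation_mean_score_pct
]

partial = sum(value * conf * weight for value, conf, weight in signals)
print(f"{partial:.1%}")  # prints 5.0% — the top four signals alone
```

The gap between this partial sum and the displayed 21.2% indicates the full score draws on many more signals (the diagnostics below list 8 sources feeding the ranking).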
Ranking Diagnostics & Missing Models
Source Lift
Ranked: 36 · Sources: 8 · Quality: Insufficient
Vals CorpFin v2 (vals_corp_fin_v2): 21 rows, 0.3% avg lift
Vals Tax Eval v2 (vals_tax_eval_v2): 21 rows, 0.4% avg lift
Vals GPQA (vals_gpqa): 21 rows, 0.3% avg lift
Vals MedQA (vals_medqa): 20 rows, 0.4% avg lift
Missing Strong Models
gpt-5.2-2025-12-11 (external/openai/gpt-5-2-2025-12-11): Rank #16, 16.2%
anthropic/claude-opus-4-6-thinking (external/anthropic/claude-opus-4-6-thinking): Rank #17, 16.1%
google/gemini-3.1-flash-lite-preview (external/google/gemini-3-1-flash-lite-preview): Rank #19, 15.6%
anthropic/claude-opus-4-5-20251101-thinking (external/anthropic/claude-opus-4-5-20251101-thinking): Rank #21, 15.2%