Cybersecurity is one of the highest-stakes domains for LLM deployment. The consequences of a wrong answer range from a missed alert that escalates into a breach to a false positive that sends responders chasing a non-issue while the real attack progresses. Getting model selection right here matters more than in most domains.
Security work is also technically demanding in ways that expose model weaknesses that don't appear in general benchmarks. Triage requires synthesizing signals across logs, alerts, and threat intelligence with incomplete information under time pressure. Vulnerability analysis requires precise technical understanding of code and system architecture. Threat hunting requires forming hypotheses and reasoning about adversary behavior — a capability profile that sits between technical analysis and strategic reasoning.
What Cybersecurity Work Actually Requires
The security use case is not monolithic. Different security tasks have different capability requirements, and models that excel at one may underperform at another.
Incident triage — Classifying alerts as genuine threats, false positives, or ambiguous cases requires pattern matching across large volumes of log and alert data combined with knowledge of attacker TTPs (tactics, techniques, and procedures). Speed matters; so does precision. False negatives (missed real incidents) and false positives (wasted response effort) have asymmetric costs that depend on the specific security posture.
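The asymmetry has a concrete consequence: minimizing expected cost implies an escalation threshold that drops as the cost of a miss grows. A minimal sketch of the arithmetic (all cost figures are hypothetical, not drawn from our data):

```python
# Choosing an alert-escalation threshold under asymmetric error costs.
# cost_fn: cost of dismissing a real incident (false negative)
# cost_fp: cost of escalating a benign alert (false positive)

def escalation_threshold(cost_fn: float, cost_fp: float) -> float:
    """Escalate when P(threat) exceeds this value.

    Expected cost of escalating:  (1 - p) * cost_fp
    Expected cost of dismissing:  p * cost_fn
    Escalate when the first is smaller, i.e. p > cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)

def triage(p_threat: float, cost_fn: float, cost_fp: float) -> str:
    return "escalate" if p_threat > escalation_threshold(cost_fn, cost_fp) else "dismiss"
```

With a miss costing 50× a wasted escalation, the break-even probability is below 2% — which is why triage tuned for high-cost environments escalates aggressively and tolerates a high false-positive rate.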
Vulnerability analysis — Understanding whether a piece of code or system configuration contains an exploitable vulnerability requires deep technical reasoning about security-relevant code paths, data flows, and system trust boundaries. This is harder than code generation because the model must reason about what an adversary could do, not just what the code does under normal conditions.
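A toy illustration of that distinction (the base directory and filenames are invented): the first function is perfectly correct under normal conditions, and the flaw only appears when you ask what an adversary could pass instead.

```python
import os

# Hypothetical path-traversal example. BASE and the filenames are
# illustrative; the point is the adversarial-input reasoning.
BASE = "/var/app/uploads"

def upload_path_vulnerable(name: str) -> str:
    # Fine when name is "report.pdf" — but an attacker can pass
    # "../../etc/passwd" and escape BASE entirely.
    return os.path.join(BASE, name)

def upload_path_checked(name: str) -> str:
    # Resolve both sides before comparing, so "../" sequences and
    # symlinks cannot escape the upload directory.
    base = os.path.realpath(BASE)
    path = os.path.realpath(os.path.join(BASE, name))
    if not path.startswith(base + os.sep):
        raise ValueError("path escapes upload directory")
    return path
```

Reasoning about the second function requires exactly the trust-boundary analysis described above: not "what does this code do," but "what can untrusted input make it do."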
Threat intelligence synthesis — Summarizing threat reports, extracting indicators of compromise, and mapping observed activity to known threat actor profiles requires both reading comprehension and structured knowledge about the threat landscape. Models with strong factual accuracy on security knowledge perform significantly better here.
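The mechanical part of IOC extraction can be sketched in a few lines; the patterns below are a deliberately small, assumed subset (real extractors handle defanged indicators, domains, URLs, and more):

```python
import re

# Minimal IOC extractor sketch — patterns are illustrative, not exhaustive.
IOC_PATTERNS = {
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
    "cve": re.compile(r"\bCVE-\d{4}-\d{4,}\b"),
}

def extract_iocs(text: str):
    """Return deduplicated, sorted indicators found in a threat report."""
    return {kind: sorted(set(pat.findall(text)))
            for kind, pat in IOC_PATTERNS.items()}
```

The hard part — mapping the extracted activity to threat actor profiles — is where model knowledge of the threat landscape, not regex coverage, makes the difference.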
Security advisory generation — Communicating findings clearly to technical and non-technical stakeholders requires good writing, appropriate calibration of severity language, and the ability to translate technical findings into risk terms. This is where EQ-dimension capabilities (communicating clearly under uncertainty) matter.
Accuracy and IQ scores are the strongest predictors of cybersecurity performance in our data. Security tasks are both knowledge-intensive (requiring accurate recall of vulnerability patterns, attack techniques, and protocol behavior) and reasoning-intensive (requiring causal inference about what an attacker could do with a given access level).
The Policy Question
Cybersecurity is one of the domains where model providers apply the most policy restrictions. Many flagship models refuse to explain how vulnerabilities work in detail, decline to write proof-of-concept exploit code, or add caveats to outputs that make them less useful for legitimate security work.
Our nanny index tracks this behavior. In security contexts, over-cautious models create real operational problems: they refuse to analyze malware samples, decline to explain how an observed technique works, or hedge so heavily on vulnerability severity that the output becomes ambiguous. For defenders who need precise technical analysis, models with low nanny index scores — those that treat security professionals as professionals — are meaningfully more useful.
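One crude way to quantify this on your own prompt set is a refusal-rate check over model responses. The marker phrases below are a hypothetical heuristic for illustration, not our scoring methodology:

```python
# Rough over-refusal measurement on a benign security prompt set.
# REFUSAL_MARKERS is an assumed heuristic; production scoring would
# use a classifier, not substring matching.
REFUSAL_MARKERS = ("i can't help", "i cannot assist", "i'm unable to")

def is_refusal(response: str) -> bool:
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_rate(responses: list) -> float:
    """Fraction of responses flagged as refusals."""
    return sum(is_refusal(r) for r in responses) / len(responses)
```

Run the same benign triage and malware-analysis prompts through each candidate model and compare rates; a model that refuses a third of legitimate analyst queries is degraded regardless of its capability scores.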
This tradeoff is legitimate: the same capability that lets a model help a defender analyze an attack can help an attacker refine one. The practical resolution is that mature security teams evaluate models against their actual workflows and select based on demonstrated utility on real work, not on how a model's policy restrictions look on paper.
Use LLM assistance for defensive security, threat analysis, and security research in authorized contexts only. Verify AI-generated analysis against primary sources before acting on it in production environments — models can and do make technically plausible but incorrect statements about vulnerabilities, CVEs, and attacker techniques. Security decisions made on incorrect analysis can be worse than no analysis at all.
The Benchmark Landscape
CyberSecEval (Meta) is the most comprehensive public benchmark for cybersecurity-specific LLM evaluation. It tests both capability (ability to assist with security tasks) and safety (tendency to assist with clearly malicious requests). The capability component is most relevant for model selection.
CTF benchmark performance — Some evaluations test model performance on Capture the Flag challenges, which require reasoning about vulnerabilities and exploitation techniques in controlled settings. CTF performance correlates with vulnerability analysis capability.
IQ and accuracy dimension scores in our leaderboard are the strongest general-purpose proxies for security performance. Security tasks require both precise factual knowledge and multi-step causal reasoning — the two signals these dimensions capture.
Nanny index is relevant in the opposite direction: a high nanny index predicts over-refusal on legitimate security analysis tasks.
Current Rankings
Security incident triage
| # | Model | Score |
|---|---|---|
| 1 | gemini-2.5-pro | 30.6 |
| 2 | gemini-3.1-pro-preview | 26.4 |
| 3 | gpt-5-2025-08-07 | 23.7 |
| 4 | o3-20250416 | 23.5 |
| 5 | gpt-5-mini-2025-08-07 | 20.9 |
| 6 | gemini-3-flash-preview | 20.1 |
| 7 | gemini-3-pro-preview | 20.1 |
| 8 | Grok-4-0709 | 19.7 |
| 9 | gpt-5.2-2025-12-11 | 19.5 |
| 10 | gpt-4.1-20250414 | 19.5 |
| 11 | gemini-3.1-flash-lite-preview | 18.5 |
| 12 | claude-sonnet-4.6 | 18.4 |
| 13 | claude-sonnet-4 | 17.7 |
| 14 | gpt-5.4-2026-03-05 | 17.3 |
| 15 | claude-opus-4-5-20251101 | 16.8 |
Reading These Rankings
Frontier models dominate technically demanding security tasks. Incident triage, vulnerability analysis, and threat intelligence synthesis all require both broad security knowledge and precise reasoning — a combination where the gap between frontier and mid-tier models is larger than in most other use cases.
Reasoning models have an edge on complex analysis. For multi-step vulnerability analysis — tracing how an attacker could chain access from an initial foothold through privilege escalation to data exfiltration — models with extended reasoning capabilities produce more thorough and accurate analysis than models optimized for speed.
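The chaining described above can be modelled as path-finding over an access graph, where nodes are access levels and edges are individual techniques. A toy sketch with an entirely hypothetical environment:

```python
from collections import deque

# Hypothetical access graph: edges are single attacker techniques
# (privilege escalation, credential dumping, lateral movement).
ACCESS_GRAPH = {
    "phishing-foothold": ["workstation-user"],
    "workstation-user": ["local-admin"],       # unpatched privesc
    "local-admin": ["domain-creds"],           # credential dumping
    "domain-creds": ["file-server", "db-server"],
    "db-server": ["data-exfiltration"],
    "file-server": [],
}

def attack_path(graph, start, goal):
    """Breadth-first search for the shortest chain from foothold to goal."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        for nxt in graph.get(path[-1], []):
            if nxt == goal:
                return path + [nxt]
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # goal unreachable from this foothold
```

Each hop in the returned chain is one inference step; a model doing this analysis well must hold the whole chain, plus the preconditions of each edge, in working context — which is exactly where extended-reasoning models pull ahead.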
Nanny index is a practical filter. Models with high nanny index scores often refuse or heavily caveat outputs that are operationally necessary for security work — analyzing malware behavior, explaining how a specific CVE is exploited, or generating detection signatures for attack patterns. This is a real capability degradation for security workflows, not a minor UX issue.
Open-weight models are viable for on-premises deployment. For security contexts where analyzed data (logs, malware samples, internal code) cannot be sent to external APIs, strong open-weight alternatives deployed on-premises are worth serious evaluation. Several rank competitively for security-relevant tasks.
Related Use Cases
- Log triage — Structured log analysis and anomaly detection, a narrower slice of the security triage workflow
- Debugging — Code-level reasoning capabilities that underpin vulnerability analysis
- Accuracy rankings — Full model rankings on the Accuracy dimension
Full use-case rankings at /use-cases. Methodology at /methodology.