Risk & Eval

Prompt injection resistance (eval)

Measure resistance to prompt injection in RAG and tool settings.

task.prompt_injection_resistance

Evidence quality is currently limited for this use case. Rankings below are useful for exploration, not a strong winner claim.

Provisional leader

gemini-3.1-pro-preview

Current leader based on limited benchmark evidence. Treat this ranking as directional until coverage improves.

25.3%

Best benchmark score

29.0%

Confidence

All ranked models — top 3

🥇

gemini-3.1-pro-preview

25.3%

🥈

gemini-2.5-pro

23.6%

🥉

gpt-5-2025-08-07

23.1%

Ranked Models

Evidence Quality

81%

Evidence Points

Top Signal

Vals Finance Agent: overall_accuracy_pct

All Ranked Models

30 of 30 models

Rank	Model	Score	Confidence	Price / 1M	Evidence sources
🥇	gemini-3.1-pro-preview Strong on Vals Finance Agent overall_accuracy_pct and FACTS Benchmark Suite facts_search_score_pct	25.3%	29%	$4.50	Vals Finance AgentFACTS Benchmark Suite
🥈	gemini-2.5-pro Strong on FACTS Benchmark Suite facts_grounding_score_pct and Vectara HHEM Leaderboard overall_hallucination_error_pct	23.6%	35%	$3.44	FACTS Benchmark SuiteVectara HHEM Leaderboard
🥉	gpt-5-2025-08-07 Strong on FACTS Benchmark Suite facts_grounding_score_pct and Vals Finance Agent overall_accuracy_pct	23.1%	31%	—	FACTS Benchmark SuiteVals Finance Agent
#4	claude-sonnet-4.6 Strong on Vals Finance Agent overall_accuracy_pct and Vals CorpFin v2 overall_accuracy_pct	20.8%	30%	$6.00	Vals Finance AgentVals CorpFin v2
#5	Grok-4-0709 Strong on Vals CorpFin v2 overall_accuracy_pct and Galileo Agent Leaderboard v2 Avg TSQ	20.8%	30%	—	Vals CorpFin v2Galileo Agent Leaderboard v2
#6	gemini-3-flash-preview Strong on Vals CorpFin v2 overall_accuracy_pct and FACTS Benchmark Suite facts_grounding_score_pct	19.9%	26%	$1.13	Vals CorpFin v2FACTS Benchmark Suite
#7	gpt-5-mini-2025-08-07 Strong on Vals Finance Agent overall_accuracy_pct and Vals CorpFin v2 overall_accuracy_pct	19.2%	31%	—	Vals Finance AgentVals CorpFin v2
#8	claude-sonnet-4 Strong on Galileo Agent Leaderboard v2 Avg AC and Galileo Agent Leaderboard v2 Avg TSQ	19.1%	31%	$6.00	Galileo Agent Leaderboard v2Galileo Agent Leaderboard v2
#9	gpt-4.1-20250414 Strong on Galileo Agent Leaderboard v2 Avg AC and Vectara HHEM Leaderboard overall_hallucination_error_pct	18.9%	27%	—	Galileo Agent Leaderboard v2Vectara HHEM Leaderboard
#10	gemini-3-pro-preview Strong on Vals Finance Agent overall_accuracy_pct and Vals CorpFin v2 overall_accuracy_pct	17.9%	26%	$4.50	Vals Finance AgentVals CorpFin v2
#11	gpt-5.2-2025-12-11 Strong on FACTS Benchmark Suite facts_grounding_score_pct and Vals CorpFin v2 overall_accuracy_pct	17.1%	22%	—	FACTS Benchmark SuiteVals CorpFin v2
#12	gemini-3.1-flash-lite-preview Strong on FACTS Benchmark Suite facts_grounding_score_pct and Vectara HHEM Leaderboard overall_hallucination_error_pct	17.0%	25%	$0.56	FACTS Benchmark SuiteVectara HHEM Leaderboard
#13	gpt-5.4-2026-03-05 Strong on Vectara HHEM Leaderboard overall_hallucination_error_pct and Vals CorpFin v2 overall_accuracy_pct	16.2%	21%	—	Vectara HHEM LeaderboardVals CorpFin v2
#14	kimi-k2.5-thinking Strong on Vals CorpFin v2 overall_accuracy_pct and Vals Finance Agent overall_accuracy_pct	15.7%	25%	—	Vals CorpFin v2Vals Finance Agent
#15	gpt-5.1-2025-11-13 Strong on Vals Finance Agent overall_accuracy_pct and Vals CorpFin v2 overall_accuracy_pct	15.6%	24%	—	Vals Finance AgentVals CorpFin v2
#16	claude-opus-4-5-20251101 Strong on FACTS Benchmark Suite facts_grounding_score_pct and Vals CorpFin v2 overall_accuracy_pct	15.2%	24%	—	FACTS Benchmark SuiteVals CorpFin v2
#17	grok-4-fast-reasoning Strong on Vals CorpFin v2 overall_accuracy_pct and Vals Finance Agent overall_accuracy_pct	15.2%	29%	$0.28	Vals CorpFin v2Vals Finance Agent
#18	o3-20250416 Strong on Vals CorpFin v2 overall_accuracy_pct and SciArena Leaderboard rating_elo	15.1%	24%	$3.50	Vals CorpFin v2SciArena Leaderboard
#19	claude-opus-4-6 Strong on AgentSet LLM Leaderboard elo_score and OpenHands Issue Resolution issue_resolution_score_pct	14.9%	19%	$10.00	AgentSet LLM LeaderboardOpenHands Issue Resolution
#20	gemini-2.5-flash Strong on FACTS Benchmark Suite facts_grounding_score_pct and Galileo Agent Leaderboard v2 Avg TSQ	14.7%	24%	$0.17	FACTS Benchmark SuiteGalileo Agent Leaderboard v2
#21	grok-4-1-fast-reasoning Strong on Vals CorpFin v2 overall_accuracy_pct and Vals Finance Agent overall_accuracy_pct	12.9%	19%	$0.28	Vals CorpFin v2Vals Finance Agent
#22	grok-3 Strong on Vectara HHEM Leaderboard overall_hallucination_error_pct and Vals CorpFin v2 overall_accuracy_pct	12.1%	17%	$6.00	Vectara HHEM LeaderboardVals CorpFin v2
#23	claude-opus-4-6-thinking Strong on Vals CorpFin v2 overall_accuracy_pct and Vals Finance Agent overall_accuracy_pct	11.8%	13%	—	Vals CorpFin v2Vals Finance Agent
#24	claude-opus-4-5-20251101-thinking Strong on Vals Finance Agent overall_accuracy_pct and Vals CorpFin v2 overall_accuracy_pct	10.9%	13%	—	Vals Finance AgentVals CorpFin v2
#25	Kimi K2 Thinking Strong on Vals CorpFin v2 overall_accuracy_pct and Vals CorpFin v2 shared_max_context_accuracy_pct	10.8%	23%	$1.07	Vals CorpFin v2Vals CorpFin v2
#26	claude-sonnet-4-5-20250929-thinking Strong on Vals Finance Agent overall_accuracy_pct and Vals CorpFin v2 overall_accuracy_pct	10.2%	13%	—	Vals Finance AgentVals CorpFin v2
#28	grok-4.20-0309-reasoning Strong on Vals CorpFin v2 overall_accuracy_pct and Vals Finance Agent overall_accuracy_pct	10.0%	13%	—	Vals CorpFin v2Vals Finance Agent
#29	glm-5-thinking Strong on Vals CorpFin v2 overall_accuracy_pct and Vals Finance Agent overall_accuracy_pct	10.0%	14%	—	Vals CorpFin v2Vals Finance Agent
#30	gpt-4o-2024-05-13 Strong on LLM Trustworthy Leaderboard privacy and LLM Trustworthy Leaderboard adv	9.8%	18%	—	LLM Trustworthy LeaderboardLLM Trustworthy Leaderboard
#31	o4-mini Strong on Vals CorpFin v2 overall_accuracy_pct and Vals CorpFin v2 shared_max_context_accuracy_pct	9.8%	22%	$1.93	Vals CorpFin v2Vals CorpFin v2

Compare Models

Select two different models above to compare their evidence side by side.

▶Ranking diagnostics & missing models

Source lift

Ranked

Sources

Quality

Low

Vals CorpFin v2

42 rows · 1.2% avg lift

Vals MedQA

32 rows · 0.3% avg lift

Vals Finance Agent

31 rows · 1.1% avg lift

Vals Legal Bench

30 rows · 0.3% avg lift

Missing frontier models

No obvious gaps right now.

▶Taxonomy & task details

Core tasks

task.prompt_injection_resistance

Required modes

mode.tool_calling

Domains

domain.general_business

Related in Risk & Eval

Disinformation and manipulation resistance (eval)

Measure refusal and safe handling of deceptive content generation requests.

Crisis escalation protocol (eval)

Measure safe crisis escalation behavior under the selected policy.

Jailbreak resistance (eval)

Measure robustness to adversarial prompts that attempt to bypass policy.

Overrefusal (eval)

Measure how often benign requests are incorrectly refused.