Supply Chain

Supplier risk monitoring

Track supplier risk signals from multi-source text and summarize actions.

task.multi_doc_synthesis, task.risk_assessment

Evidence quality is currently limited for this use case. The rankings below are useful for exploration but do not support a strong claim about which model is best.

Provisional leader

gemini-3.1-pro-preview

Current leader based on limited benchmark evidence. Treat this ranking as directional until coverage improves.

28.7%

Best benchmark score

32.6%

Confidence

All ranked models — top 3

🥇

gemini-3.1-pro-preview

28.7%

🥈

gemini-2.5-pro

26.1%

🥉

gpt-5-2025-08-07

25.4%

Ranked Models

Evidence Quality

82%

Evidence Points

Top Signal

Vals Finance Agent: overall_accuracy_pct

All Ranked Models

30 of 30 models

Rank	Model	Strengths	Score	Confidence	Price / 1M	Evidence sources
🥇	gemini-3.1-pro-preview	Strong on Vals Finance Agent overall_accuracy_pct and FACTS Benchmark Suite facts_search_score_pct	28.7%	33%	$4.50	Vals Finance Agent, FACTS Benchmark Suite
🥈	gemini-2.5-pro	Strong on FACTS Benchmark Suite facts_grounding_score_pct and Vectara HHEM Leaderboard overall_hallucination_error_pct	26.1%	40%	$3.44	FACTS Benchmark Suite, Vectara HHEM Leaderboard
🥉	gpt-5-2025-08-07	Strong on SciArena Leaderboard rating_elo and FACTS Benchmark Suite facts_grounding_score_pct	25.4%	35%	—	SciArena Leaderboard, FACTS Benchmark Suite
#4	gpt-5-mini-2025-08-07	Strong on SciArena Leaderboard rating_elo and Vals Finance Agent overall_accuracy_pct	24.0%	39%	—	SciArena Leaderboard, Vals Finance Agent
#5	gemini-3-pro-preview	Strong on SciArena Leaderboard rating_elo and Vals Finance Agent overall_accuracy_pct	22.9%	31%	$4.50	SciArena Leaderboard, Vals Finance Agent
#6	Grok-4-0709	Strong on Vals CorpFin v2 overall_accuracy_pct and Galileo Agent Leaderboard v2 Avg TSQ	22.9%	33%	—	Vals CorpFin v2, Galileo Agent Leaderboard v2
#7	gemini-3-flash-preview	Strong on Vals CorpFin v2 overall_accuracy_pct and FACTS Benchmark Suite facts_grounding_score_pct	21.9%	30%	$1.13	Vals CorpFin v2, FACTS Benchmark Suite
#8	gpt-5.2-2025-12-11	Strong on FACTS Benchmark Suite facts_grounding_score_pct and Vals CorpFin v2 overall_accuracy_pct	21.2%	26%	—	FACTS Benchmark Suite, Vals CorpFin v2
#9	claude-sonnet-4.6	Strong on Vals Finance Agent overall_accuracy_pct and Vals CorpFin v2 overall_accuracy_pct	20.6%	27%	$6.00	Vals Finance Agent, Vals CorpFin v2
#10	claude-sonnet-4	Strong on Galileo Agent Leaderboard v2 Avg TSQ and Vectara HHEM Leaderboard overall_hallucination_error_pct	20.6%	32%	$6.00	Galileo Agent Leaderboard v2, Vectara HHEM Leaderboard
#11	gpt-4.1-20250414	Strong on Vectara HHEM Leaderboard overall_hallucination_error_pct and Vals CorpFin v2 overall_accuracy_pct	20.5%	32%	—	Vectara HHEM Leaderboard, Vals CorpFin v2
#12	gemini-3.1-flash-lite-preview	Strong on FACTS Benchmark Suite facts_grounding_score_pct and Vectara HHEM Leaderboard overall_hallucination_error_pct	20.1%	29%	$0.56	FACTS Benchmark Suite, Vectara HHEM Leaderboard
#13	gpt-5.4-2026-03-05	Strong on Vectara HHEM Leaderboard overall_hallucination_error_pct and Vals CorpFin v2 overall_accuracy_pct	18.8%	23%	—	Vectara HHEM Leaderboard, Vals CorpFin v2
#14	gemini-2.5-flash	Strong on FACTS Benchmark Suite facts_grounding_score_pct and Galileo Agent Leaderboard v2 Avg TSQ	18.5%	31%	$0.17	FACTS Benchmark Suite, Galileo Agent Leaderboard v2
#15	claude-opus-4-5-20251101	Strong on FACTS Benchmark Suite facts_grounding_score_pct and Vals CorpFin v2 overall_accuracy_pct	18.3%	28%	—	FACTS Benchmark Suite, Vals CorpFin v2
#16	o3-20250416	Strong on SciArena Leaderboard rating_elo and Vals CorpFin v2 overall_accuracy_pct	18.0%	28%	$3.50	SciArena Leaderboard, Vals CorpFin v2
#17	gpt-5.1-2025-11-13	Strong on Vals Finance Agent overall_accuracy_pct and Vals CorpFin v2 overall_accuracy_pct	16.9%	26%	—	Vals Finance Agent, Vals CorpFin v2
#18	grok-4-fast-reasoning	Strong on Vals CorpFin v2 overall_accuracy_pct and Vals Finance Agent overall_accuracy_pct	16.8%	32%	$0.28	Vals CorpFin v2, Vals Finance Agent
#19	grok-4-1-fast-reasoning	Strong on Vals CorpFin v2 overall_accuracy_pct and Vals Finance Agent overall_accuracy_pct	15.1%	23%	$0.28	Vals CorpFin v2, Vals Finance Agent
#20	claude-opus-4-6-thinking	Strong on Vals CorpFin v2 overall_accuracy_pct and Vals Finance Agent overall_accuracy_pct	14.0%	15%	—	Vals CorpFin v2, Vals Finance Agent
#21	grok-3	Strong on Vectara HHEM Leaderboard overall_hallucination_error_pct and Vals CorpFin v2 overall_accuracy_pct	13.6%	20%	$6.00	Vectara HHEM Leaderboard, Vals CorpFin v2
#22	kimi-k2.5-thinking	Strong on Vals CorpFin v2 overall_accuracy_pct and Vals Finance Agent overall_accuracy_pct	13.6%	20%	—	Vals CorpFin v2, Vals Finance Agent
#23	claude-opus-4-5-20251101-thinking	Strong on Vals Finance Agent overall_accuracy_pct and Vals CorpFin v2 overall_accuracy_pct	12.9%	15%	—	Vals Finance Agent, Vals CorpFin v2
#24	glm-5-thinking	Strong on Vals CorpFin v2 overall_accuracy_pct and Vals Finance Agent overall_accuracy_pct	12.2%	18%	—	Vals CorpFin v2, Vals Finance Agent
#25	claude-sonnet-4-5-20250929-thinking	Strong on Vals Finance Agent overall_accuracy_pct and Vals CorpFin v2 overall_accuracy_pct	12.1%	15%	—	Vals Finance Agent, Vals CorpFin v2
#27	grok-4.20-0309-reasoning	Strong on Vals CorpFin v2 overall_accuracy_pct and Vals Finance Agent overall_accuracy_pct	11.8%	15%	—	Vals CorpFin v2, Vals Finance Agent
#28	o4-mini	Strong on Vals CorpFin v2 overall_accuracy_pct and SciArena Leaderboard rating_elo	11.6%	27%	$1.93	Vals CorpFin v2, SciArena Leaderboard
#29	grok-4-1-fast-non-reasoning	Strong on Vals Finance Agent overall_accuracy_pct and Vectara HHEM Leaderboard overall_answer_rate_pct	11.4%	22%	$0.28	Vals Finance Agent, Vectara HHEM Leaderboard
#30	mistral-large-2512	Strong on Vals CorpFin v2 overall_accuracy_pct and Vectara HHEM Leaderboard overall_answer_rate_pct	10.9%	20%	—	Vals CorpFin v2, Vectara HHEM Leaderboard
#31	MiniMax-M2.7	Strong on Vals CorpFin v2 overall_accuracy_pct and Vals Finance Agent overall_accuracy_pct	10.8%	14%	$0.53	Vals CorpFin v2, Vals Finance Agent
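
Because several rows list a price while others show "—", one way to explore the table is to compare score against price for the priced models only. The sketch below is purely illustrative: the `score_per_dollar` helper and the score-divided-by-price heuristic are assumptions made for this example, not a metric the ranking itself uses, and the rows are a small sample copied from the table above.

```python
from typing import Optional

# (model, score_pct, confidence_pct, price_per_1m_usd), copied from the table.
# None marks a row whose price is shown as "—".
ROWS: list[tuple[str, float, int, Optional[float]]] = [
    ("gemini-3.1-pro-preview", 28.7, 33, 4.50),
    ("gemini-2.5-pro", 26.1, 40, 3.44),
    ("gpt-5-2025-08-07", 25.4, 35, None),
    ("gemini-3-flash-preview", 21.9, 30, 1.13),
    ("gemini-3.1-flash-lite-preview", 20.1, 29, 0.56),
    ("gemini-2.5-flash", 18.5, 31, 0.17),
]


def score_per_dollar(rows):
    """Return (model, score / price) pairs for priced rows, best value first.

    Hypothetical heuristic for exploration only; unpriced rows are skipped
    rather than guessed at.
    """
    priced = [
        (model, score / price)
        for model, score, _confidence, price in rows
        if price is not None
    ]
    return sorted(priced, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for model, value in score_per_dollar(ROWS):
        print(f"{model}: {value:.1f} score points per $ / 1M")
```

Given how low the confidence values are across the table, a ratio like this is best read as a rough screening aid rather than a recommendation.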

Ranking diagnostics & missing models

Source lift

Ranked

Sources

Quality

Low

Vals CorpFin v2

44 rows · 1.4% avg lift

Vals Legal Bench

35 rows · 0.3% avg lift

Vals MedQA

35 rows · 0.3% avg lift

Vals Tax Eval v2

34 rows · 0.3% avg lift

Missing frontier models

No obvious gaps right now.

Taxonomy & task details

Core tasks

task.multi_doc_synthesis, task.risk_assessment

Required modes

mode.long_context

Domains

domain.supply_chain_logistics

Related in Supply Chain

Disruption monitoring brief

Summarize disruptions into risk, options, and recommendations.

Vendor contract summary (procurement)

Summarize vendor contracts into key terms, risks, and deviations.

Route alternatives planning

Propose alternative routes and tradeoffs under disruptions.

Tail spend categorization

Categorize tail spend purchases into taxonomy buckets for sourcing.