Biomedical

Cross-paper contradiction analysis

Identify contradictions and uncertainty across papers with citations.

task.contradiction_detectiontask.claim_check_with_evidence

Evidence quality is currently limited for this use case. Rankings below are useful for exploration, not a strong winner claim.

Provisional leader

gemini-3.1-pro-preview

Current leader based on limited benchmark evidence. Treat this ranking as directional until coverage improves.

37.7%

Best benchmark score

43.0%

Confidence

All ranked models — top 3

🥇

gemini-3.1-pro-preview

37.7%

🥈

gpt-5-2025-08-07

30.0%

🥉

gemini-2.5-pro

29.6%

Ranked Models

Evidence Quality

85%

Evidence Points

Top Signal

FACTS Benchmark Suite: facts_grounding_score_pct

All Ranked Models

30 of 30 models

Rank	Model	Score	Confidence	Price / 1M	Evidence sources
🥇	gemini-3.1-pro-preview Strong on FACTS Benchmark Suite facts_grounding_score_pct and Vals GPQA overall_accuracy_pct	37.7%	43%	$4.50	FACTS Benchmark SuiteVals GPQA
🥈	gpt-5-2025-08-07 Strong on FACTS Benchmark Suite facts_grounding_score_pct and Vals GPQA overall_accuracy_pct	30.0%	41%	—	FACTS Benchmark SuiteVals GPQA
🥉	gemini-2.5-pro Strong on FACTS Benchmark Suite facts_grounding_score_pct and Vectara HHEM Leaderboard overall_hallucination_error_pct	29.6%	45%	$3.44	FACTS Benchmark SuiteVectara HHEM Leaderboard
#4	gemini-3-pro-preview Strong on Vals GPQA overall_accuracy_pct and Vals Finance Agent overall_accuracy_pct	29.1%	39%	$4.50	Vals GPQAVals Finance Agent
#5	gpt-5-mini-2025-08-07 Strong on FACTS Benchmark Suite facts_grounding_score_pct and Vals GPQA overall_accuracy_pct	28.6%	46%	—	FACTS Benchmark SuiteVals GPQA
#6	gemini-3-flash-preview Strong on FACTS Benchmark Suite facts_grounding_score_pct and Vals GPQA overall_accuracy_pct	28.6%	38%	$1.13	FACTS Benchmark SuiteVals GPQA
#7	gpt-5.2-2025-12-11 Strong on FACTS Benchmark Suite facts_grounding_score_pct and Vals GPQA overall_accuracy_pct	26.8%	33%	—	FACTS Benchmark SuiteVals GPQA
#8	gemini-3.1-flash-lite-preview Strong on FACTS Benchmark Suite facts_grounding_score_pct and Vectara HHEM Leaderboard overall_hallucination_error_pct	26.7%	40%	$0.56	FACTS Benchmark SuiteVectara HHEM Leaderboard
#9	gpt-4.1-20250414 Strong on MMLongBench-Doc Leaderboard acc_score_pct and Vectara HHEM Leaderboard overall_hallucination_error_pct	25.2%	39%	—	MMLongBench-Doc LeaderboardVectara HHEM Leaderboard
#10	Grok-4-0709 Strong on Vals GPQA overall_accuracy_pct and Vals CorpFin v2 overall_accuracy_pct	24.4%	36%	—	Vals GPQAVals CorpFin v2
#11	claude-opus-4-5-20251101 Strong on FACTS Benchmark Suite facts_grounding_score_pct and Vals GPQA overall_accuracy_pct	24.3%	37%	—	FACTS Benchmark SuiteVals GPQA
#12	gpt-5.4-2026-03-05 Strong on Vals GPQA overall_accuracy_pct and Vectara HHEM Leaderboard overall_hallucination_error_pct	24.2%	31%	—	Vals GPQAVectara HHEM Leaderboard
#13	claude-sonnet-4 Strong on FACTS Benchmark Suite facts_grounding_score_pct and Vectara HHEM Leaderboard overall_hallucination_error_pct	23.8%	41%	$6.00	FACTS Benchmark SuiteVectara HHEM Leaderboard
#14	claude-sonnet-4.6 Strong on Vals Finance Agent overall_accuracy_pct and Vals GPQA overall_accuracy_pct	23.8%	31%	$6.00	Vals Finance AgentVals GPQA
#15	gemini-2.5-flash Strong on FACTS Benchmark Suite facts_grounding_score_pct and Vectara HHEM Leaderboard overall_hallucination_error_pct	21.1%	36%	$0.17	FACTS Benchmark SuiteVectara HHEM Leaderboard
#16	gpt-5.1-2025-11-13 Strong on Vals GPQA overall_accuracy_pct and Vals Finance Agent overall_accuracy_pct	20.9%	32%	—	Vals GPQAVals Finance Agent
#17	o3-20250416 Strong on Vals GPQA overall_accuracy_pct and Vals CorpFin v2 overall_accuracy_pct	20.8%	34%	$3.50	Vals GPQAVals CorpFin v2
#18	grok-4-fast-reasoning Strong on Vals CorpFin v2 overall_accuracy_pct and Vals GPQA overall_accuracy_pct	20.7%	41%	$0.28	Vals CorpFin v2Vals GPQA
#19	kimi-k2.5-thinking Strong on Vals CorpFin v2 overall_accuracy_pct and Vals GPQA overall_accuracy_pct	18.0%	26%	—	Vals CorpFin v2Vals GPQA
#20	grok-4-1-fast-reasoning Strong on Vals CorpFin v2 overall_accuracy_pct and Vals GPQA overall_accuracy_pct	18.0%	30%	$0.28	Vals CorpFin v2Vals GPQA
#21	claude-opus-4-6-thinking Strong on Vals GPQA overall_accuracy_pct and Vals CorpFin v2 overall_accuracy_pct	17.3%	20%	—	Vals GPQAVals CorpFin v2
#22	claude-opus-4-5-20251101-thinking Strong on Vals GPQA overall_accuracy_pct and Vals Finance Agent overall_accuracy_pct	17.2%	21%	—	Vals GPQAVals Finance Agent
#23	grok-3 Strong on Vectara HHEM Leaderboard overall_hallucination_error_pct and Vectara HHEM Leaderboard science_hallucination_error_pct	17.0%	24%	$6.00	Vectara HHEM LeaderboardVectara HHEM Leaderboard
#24	claude-opus-4-1-20250805 Strong on FACTS Benchmark Suite facts_grounding_score_pct and Vectara HHEM Leaderboard overall_hallucination_error_pct	16.2%	32%	—	FACTS Benchmark SuiteVectara HHEM Leaderboard
#25	claude-sonnet-4-5-20250929-thinking Strong on Vals GPQA overall_accuracy_pct and Vals Finance Agent overall_accuracy_pct	15.3%	21%	—	Vals GPQAVals Finance Agent
#26	mistral-large-2512 Strong on Vals CorpFin v2 overall_accuracy_pct and Vectara HHEM Leaderboard overall_answer_rate_pct	14.2%	25%	—	Vals CorpFin v2Vectara HHEM Leaderboard
#27	grok-4-1-fast-non-reasoning Strong on Vals Finance Agent overall_accuracy_pct and Vectara HHEM Leaderboard overall_answer_rate_pct	14.1%	29%	$0.28	Vals Finance AgentVectara HHEM Leaderboard
#28	grok-4.20-0309-reasoning Strong on Vals GPQA overall_accuracy_pct and Vals CorpFin v2 overall_accuracy_pct	14.0%	20%	—	Vals GPQAVals CorpFin v2
#29	claude-opus-4-6 Strong on AgentSet LLM Leaderboard elo_score and Vectara HHEM Leaderboard overall_answer_rate_pct	14.0%	20%	$10.00	AgentSet LLM LeaderboardVectara HHEM Leaderboard
#30	o4-mini Strong on Vals GPQA overall_accuracy_pct and Vals CorpFin v2 overall_accuracy_pct	13.7%	33%	$1.93	Vals GPQAVals CorpFin v2

Compare Models

Select two different models above to compare their evidence side by side.

▶Ranking diagnostics & missing models

Source lift

Ranked

Sources

Quality

Low

Vals GPQA

45 rows · 1.7% avg lift

Vals CorpFin v2

44 rows · 1.4% avg lift

Vectara HHEM Leaderboard

31 rows · 1.3% avg lift

Vals Finance Agent

31 rows · 1.3% avg lift

Missing frontier models

No obvious gaps right now.

▶Taxonomy & task details

Core tasks

task.contradiction_detectiontask.claim_check_with_evidence

Required modes

mode.citationsmode.long_context

Domains

domain.biomed_literature

Related in Biomedical

Literature synthesis with citations

Synthesize papers and guidelines with citations and uncertainty.

Protocol structuring

Convert protocols/methods text into structured steps and tables.