Category: biomed_science
Best LLM for Literature Review
Ranked models for synthesizing papers and guidelines with citations and uncertainty.
Provisional leader: google/gemini-3.1-pro-preview (`external/google/gemini-3-1-pro-preview`)

The best current option given the available benchmark evidence, though the evidence does not yet support a strong winner claim.
- Score: 30.2%
- Confidence: 35.0%
- Evidence Points: 23
- Ranked Models: 30
- Evidence Quality: 83%
- Top Signal: FACTS Benchmark Suite (facts_grounding_score_pct)
- Benchmark Sources: 36
- Last Updated: 8h ago
All Ranked Models
| Rank | Model | Strengths | Score |
|---|---|---|---|
| 🥇 | gemini-3.1-pro-preview | FACTS Benchmark Suite (facts_grounding_score_pct); Vals GPQA (overall_accuracy_pct) | 30.2% |
| 🥈 | gpt-5-2025-08-07 | FACTS Benchmark Suite (facts_grounding_score_pct); Vals GPQA (overall_accuracy_pct) | 25.8% |
| 🥉 | gemini-2.5-pro | FACTS Benchmark Suite (facts_grounding_score_pct); Vectara HHEM Leaderboard (overall_hallucination_error_pct) | 25.6% |
| #4 | gemini-3-flash-preview | FACTS Benchmark Suite (facts_grounding_score_pct); Vals CorpFin v2 (overall_accuracy_pct) | 25.2% |
| #5 | gpt-5-mini-2025-08-07 | FACTS Benchmark Suite (facts_grounding_score_pct); Vals GPQA (overall_accuracy_pct) | 24.7% |
| #6 | gemini-3-pro-preview | Vals GPQA (overall_accuracy_pct); Vals CorpFin v2 (overall_accuracy_pct) | 24.3% |
| #7 | gpt-5.2-2025-12-11 | FACTS Benchmark Suite (facts_grounding_score_pct); Vals GPQA (overall_accuracy_pct) | 23.7% |
| #8 | gemini-3.1-flash-lite-preview | FACTS Benchmark Suite (facts_grounding_score_pct); Vectara HHEM Leaderboard (overall_hallucination_error_pct) | 23.1% |
| #9 | gpt-4.1-20250414 | MMLongBench-Doc Leaderboard (acc_score_pct); Vectara HHEM Leaderboard (overall_hallucination_error_pct) | 21.8% |
| #10 | claude-opus-4.7 | Vals Finance Agent (overall_accuracy_pct); Vals GPQA (overall_accuracy_pct) | 21.2% |
| #11 | claude-sonnet-4 | FACTS Benchmark Suite (facts_grounding_score_pct); Vectara HHEM Leaderboard (overall_hallucination_error_pct) | 21.0% |
| #12 | Grok-4-0709 | Vals CorpFin v2 (overall_accuracy_pct); Vals GPQA (overall_accuracy_pct) | 20.8% |
| #13 | claude-sonnet-4.6 | Vals Finance Agent (overall_accuracy_pct); Vals CorpFin v2 (overall_accuracy_pct) | 20.7% |
| #14 | claude-opus-4-5-20251101 | FACTS Benchmark Suite (facts_grounding_score_pct); Vals CorpFin v2 (overall_accuracy_pct) | 20.6% |
| #15 | gpt-5.4-2026-03-05 | Vals GPQA (overall_accuracy_pct); Vectara HHEM Leaderboard (overall_hallucination_error_pct) | 20.2% |
| #16 | gemini-2.5-flash | FACTS Benchmark Suite (facts_grounding_score_pct); Vectara HHEM Leaderboard (overall_hallucination_error_pct) | 18.6% |
| #18 | grok-4-fast-reasoning | Vals CorpFin v2 (overall_accuracy_pct); Vals GPQA (overall_accuracy_pct) | 17.6% |
| #19 | o3-20250416 | Vals GPQA (overall_accuracy_pct); Vals CorpFin v2 (overall_accuracy_pct) | 17.4% |
| #20 | gpt-5.1-2025-11-13 | Vals GPQA (overall_accuracy_pct); Vals CorpFin v2 (overall_accuracy_pct) | 17.2% |
| #21 | grok-4-1-fast-reasoning | Vals CorpFin v2 (overall_accuracy_pct); Vals GPQA (overall_accuracy_pct) | 16.9% |
| #24 | kimi-k2.5-thinking | Vals CorpFin v2 (overall_accuracy_pct); Vals GPQA (overall_accuracy_pct) | 15.4% |
| #25 | claude-opus-4-6-thinking | Vals CorpFin v2 (overall_accuracy_pct); Vals GPQA (overall_accuracy_pct) | 15.3% |
| #27 | phi-4 | Open LLM Leaderboard MMLU-Pro (mmlu_pro_accuracy_pct); Open LLM Leaderboard GPQA (gpqa) | 14.5% |
| #31 | grok-4-1-fast-non-reasoning | Vals Finance Agent (overall_accuracy_pct); Vectara HHEM Leaderboard (overall_answer_rate_pct) | 13.7% |
| #32 | claude-opus-4-1-20250805 | FACTS Benchmark Suite (facts_grounding_score_pct); Vectara HHEM Leaderboard (overall_hallucination_error_pct) | 13.4% |
| #36 | o4-mini | Vals CorpFin v2 (overall_accuracy_pct); Vals GPQA (overall_accuracy_pct) | 12.4% |
| #37 | grok-4.20-0309-reasoning | Vals GPQA (overall_accuracy_pct); Vals CorpFin v2 (overall_accuracy_pct) | 12.4% |
| #38 | qwen-2.5-72b-instruct | Open LLM Leaderboard MMLU-Pro (mmlu_pro_accuracy_pct); Open LLM Leaderboard GPQA (gpqa) | 12.2% |
| #46 | Kimi K2 Thinking | Vals CorpFin v2 (overall_accuracy_pct); Vals GPQA (overall_accuracy_pct) | 10.9% |
| #47 | claude-sonnet-4-5-20250929 | Vals CorpFin v2 (overall_accuracy_pct); Vectara HHEM Leaderboard (overall_hallucination_error_pct) | 10.5% |
Head-to-Head: #1 vs #2
- #1 (Top Pick): google/gemini-3.1-pro-preview — strong on FACTS Benchmark Suite (facts_grounding_score_pct) and Vals GPQA (overall_accuracy_pct); confidence 35.0%
- #2: gpt-5-2025-08-07 — strong on FACTS Benchmark Suite (facts_grounding_score_pct) and Vals GPQA (overall_accuracy_pct); confidence 35.0%
Related Lookups
- Best LLM for Code Generation — benchmark-backed ranking of models for generating correct, secure code from requirements.
- Best LLM for Debugging — top-ranked models for localizing bugs and proposing fixes with explanations.
- Best LLM for Unit Test Generation — ranked models for generating meaningful unit tests and edge cases from code.
- Best LLM for Code Review — compare models for automated PR review covering correctness, security, and maintainability.
- Best LLM for Autonomous Coding — benchmark-backed ranking of models for end-to-end autonomous software engineering and issue resolution.
- Best LLM for Function Calling — compare models for reliable tool use, function selection, and multi-step API orchestration.