biomed_science
Protocol structuring
Convert protocols/methods text into structured steps and tables.
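To make the use case concrete, here is a minimal sketch of turning free-text methods prose into ordered step records. The function name, field names (`step`, `action`, `duration`), and the sentence/duration parsing rules are illustrative assumptions, not this product's actual schema or pipeline:

```python
import re

def structure_protocol(text: str) -> list[dict]:
    """Split free-text protocol prose into ordered step records.

    Assumed rule (illustrative only): each sentence becomes one step,
    and durations like "10 min" are pulled into a separate field.
    """
    steps = []
    for i, sentence in enumerate(re.split(r"(?<=[.;])\s+", text.strip()), start=1):
        if not sentence:
            continue
        match = re.search(r"(\d+(?:\.\d+)?)\s*(min|h|s)\b", sentence)
        steps.append({
            "step": i,
            "action": sentence.rstrip(".;"),
            "duration": f"{match.group(1)} {match.group(2)}" if match else None,
        })
    return steps

protocol = ("Centrifuge the lysate at 12000 g for 10 min. "
            "Transfer the supernatant to a fresh tube.")
for step in structure_protocol(protocol):
    print(step)
```

A real extractor for this use case would be LLM-driven rather than regex-driven; the sketch only shows the target shape of the structured output.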
#1 Recommendation
gpt-4.1-20250414
Strong on the MMLongBench-Doc Leaderboard (acc_score_pct, 75%) and the Vectara HHEM Leaderboard (science_hallucination_error_pct, 93%).
external/openai/gpt-4-1-20250414
Score: 21.6%
Confidence: 31.0%
Limited benchmark evidence for this use case: 48 ranked models with an average of 14.4 evidence points each. Rankings may shift as more benchmark data is ingested.
Ranked Models: 30
Evidence Quality: 80%
Scoring: Benchmark-backed
Top Signal: MMLongBench-Doc Leaderboard: acc_score_pct
All Ranked Models
Compare Models
Model A leads by +1.0%
Model A
gpt-4.1-20250414 (external/openai/gpt-4-1-20250414) · Rank #1
MMLongBench-Doc Leaderboard: acc_score_pct · Value 74.6% · Conf 100.0% · Weight 5.0% (mmlongbench_doc_leaderboard.acc_score_pct, Mar 12, 2026)
Vectara HHEM Leaderboard: science_hallucination_error_pct · Value 93.3% · Conf 100.0% · Weight 2.1% (vectara_hhem_leaderboard.science_hallucination_error_pct, Mar 12, 2026)
Vals GPQA: overall_accuracy_pct · Value 53.3% · Conf 100.0% · Weight 1.6% (vals_gpqa.overall_accuracy_pct, Mar 12, 2026)
Galileo Agent Leaderboard v2: Avg AC · Value 100.0% · Conf 100.0% · Weight 1.6% (galileo_agent_v2.avg_ac, Mar 12, 2026)
Model B
google/gemini-3.1-pro-preview (external/google/gemini-3-1-pro-preview) · Rank #2
Vals GPQA: overall_accuracy_pct · Value 100.0% · Conf 100.0% · Weight 3.0% (vals_gpqa.overall_accuracy_pct, Mar 12, 2026)
Vals MedCode: overall_accuracy_pct · Value 100.0% · Conf 100.0% · Weight 2.7% (vals_medcode.overall_accuracy_pct, Mar 12, 2026)
MathArena Models: average_score_pct · Value 84.9% · Conf 100.0% · Weight 2.0% (matharena_models.average_score_pct, Mar 12, 2026)
Vectara HHEM Leaderboard: science_hallucination_error_pct · Value 84.7% · Conf 100.0% · Weight 1.9% (vectara_hhem_leaderboard.science_hallucination_error_pct, Mar 12, 2026)
Ranking Diagnostics & Missing Models
Source Lift
Ranked: 48 · Sources: 8 · Quality: Insufficient
Vals GPQA (vals_gpqa): 41 rows, 2.1% avg lift
Vals Legal Bench (vals_legal_bench): 40 rows, 0.4% avg lift
Vals Tax Eval v2 (vals_tax_eval_v2): 39 rows, 0.4% avg lift
Vals MedQA (vals_medqa): 38 rows, 0.4% avg lift
Missing Strong Models
qwen/qwen3-max (external/qwen/qwen3-max): Rank #55, 10.3%
deepseek-v3 (external/deepseek-ai/deepseek-v3): Rank #66, 8.8%