healthcare
Medical chart summary
Summarize a patient's chart into a timeline, a problem list, and a medication list for review.
#1 Recommendation: gpt-4.1-20250414 (external/openai/gpt-4-1-20250414)
Strong on Galileo Agent Leaderboard v2 Healthcare AC (100%) and MMLongBench-Doc Leaderboard acc_score_pct (75%).
Score: 20.8%
Confidence: 27.0%
Limited benchmark evidence for this use case.
47 models are ranked, with an average evidence score of 15.1 points. Rankings may shift as more benchmark data is ingested.
Ranked Models: 30
Evidence Quality: 80%
Scoring: Benchmark-backed
Top Signal: Galileo Agent Leaderboard v2: Healthcare AC
Compare Models
Model A leads by +1.0%
Model A
gpt-4.1-20250414
external/openai/gpt-4-1-20250414
Rank #1
Galileo Agent Leaderboard v2: Healthcare AC
Value 100.0% · Conf 100.0% · Weight 2.6%
galileo_agent_v2.healthcare_ac (Mar 12, 2026)
MMLongBench-Doc Leaderboard: acc_score_pct
Value 74.6% · Conf 100.0% · Weight 2.6%
mmlongbench_doc_leaderboard.acc_score_pct (Mar 12, 2026)
Vals MedQA: overall_accuracy_pct
Value 90.0% · Conf 100.0% · Weight 2.5%
vals_medqa.overall_accuracy_pct (Mar 12, 2026)
Vectara HHEM Leaderboard: medicine_hallucination_error_pct
Value 96.2% · Conf 100.0% · Weight 1.8%
vectara_hhem_leaderboard.medicine_hallucination_error_pct (Mar 12, 2026)
Model B
gemini-2.5-flash
external/google/gemini-2-5-flash
Rank #2
BRIDGE Medical Leaderboard: average_performance_pct
Value 100.0% · Conf 100.0% · Weight 2.9%
bridge_medical_leaderboard.average_performance_pct (Mar 12, 2026)
Vals MedScribe: overall_accuracy_pct
Value 84.6% · Conf 100.0% · Weight 1.9%
vals_medscribe.overall_accuracy_pct (Mar 12, 2026)
Galileo Agent Leaderboard v2: Healthcare TSQ
Value 97.8% · Conf 100.0% · Weight 1.8%
galileo_agent_v2.healthcare_tsq (Mar 12, 2026)
Vectara HHEM Leaderboard: medicine_hallucination_error_pct
Value 92.5% · Conf 100.0% · Weight 1.7%
vectara_hhem_leaderboard.medicine_hallucination_error_pct (Mar 12, 2026)
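The comparison lists each signal's Value, Conf, and Weight but does not state how they combine into a model's overall score. As a minimal sketch, assuming a hypothetical aggregation in which each signal contributes Value × Conf × Weight (percentages read as fractions of 100), the Python below totals the four signals displayed for each model. The `Signal` type and `weighted_contribution` function are illustrative names, not part of any published API. Because only the top signals are shown, these partial totals (about 8.52 and 7.84 points) are not expected to reproduce the listed scores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    """One benchmark signal as displayed in the comparison (percent values)."""
    name: str
    value_pct: float   # "Value" column
    conf_pct: float    # "Conf" column
    weight_pct: float  # "Weight" column


def weighted_contribution(signals: list[Signal]) -> float:
    """Sum Value x Conf x Weight over the given signals, in percentage points.

    Hypothetical formula for illustration only; it is an assumption,
    not the ranking site's documented scoring method.
    """
    return sum(
        s.value_pct * (s.conf_pct / 100) * (s.weight_pct / 100)
        for s in signals
    )


# Signals copied from the Model A and Model B listings above.
model_a = [
    Signal("galileo_agent_v2.healthcare_ac", 100.0, 100.0, 2.6),
    Signal("mmlongbench_doc_leaderboard.acc_score_pct", 74.6, 100.0, 2.6),
    Signal("vals_medqa.overall_accuracy_pct", 90.0, 100.0, 2.5),
    Signal("vectara_hhem_leaderboard.medicine_hallucination_error_pct", 96.2, 100.0, 1.8),
]
model_b = [
    Signal("bridge_medical_leaderboard.average_performance_pct", 100.0, 100.0, 2.9),
    Signal("vals_medscribe.overall_accuracy_pct", 84.6, 100.0, 1.9),
    Signal("galileo_agent_v2.healthcare_tsq", 97.8, 100.0, 1.8),
    Signal("vectara_hhem_leaderboard.medicine_hallucination_error_pct", 92.5, 100.0, 1.7),
]

if __name__ == "__main__":
    # Partial totals over the displayed signals only.
    print(f"Model A (displayed signals only): {weighted_contribution(model_a):.2f} pts")
    print(f"Model B (displayed signals only): {weighted_contribution(model_b):.2f} pts")
```

Multiplying by confidence lets a low-confidence benchmark count for less; with every displayed Conf at 100.0%, that factor has no effect here.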
Ranking Diagnostics & Missing Models
Source Lift
Ranked: 47 · Sources: 8 · Quality: Insufficient
Vals MedQA (vals_medqa): 37 rows · 2.3% avg lift
Vals Legal Bench (vals_legal_bench): 37 rows · 0.3% avg lift
Vals LiveCodeBench (vals_lcb): 36 rows · 0.3% avg lift
Vals Tax Eval v2 (vals_tax_eval_v2): 34 rows · 0.3% avg lift
Missing Strong Models
zai/glm-5-thinking (external/zai/glm-5-thinking): Rank #32 · 13.0%
alibaba/qwen3.5-flash (external/alibaba/qwen3-5-flash): Rank #33 · 12.3%
qwen/qwen3-max (external/qwen/qwen3-max): Rank #55 · 10.3%
Related Use Cases
healthcare · Patient education bot (RAG-grounded)
Answer patient FAQs using trusted sources, with cautious wording.
Top: gemini-2.5-pro

healthcare · Medical coding support (suggestions)
Extract coding-relevant facts and suggest codes for human review.
Top: gemini-2.5-pro

healthcare · Patient-friendly explanations
Rewrite technical notes into clear, accessible patient language.
Top: gemini-2.5-flash

healthcare · Clinical note drafting
Summarize encounters into structured notes for clinician review.
Top: gpt-4.1-20250414