BasedAGI

healthcare

Medical chart summary

Summarize a patient's chart into timeline, problems, and meds for review.

#1 Recommendation

gpt-4.1-20250414

Strong on Galileo Agent Leaderboard v2 Healthcare AC (100%) and MMLongBench-Doc Leaderboard acc_score_pct (75%)

external/openai/gpt-4-1-20250414

Score: 20.8%

Confidence: 27.0%

Limited benchmark evidence for this use case.

47 ranked models with average evidence of 15.1 points. Rankings may shift as more benchmark data is ingested.

Ranked Models: 30

Evidence Quality: 80%

Scoring: Benchmark-backed

Top Signal: Galileo Agent Leaderboard v2: Healthcare AC

All Ranked Models

30 of 30
Rank  Model  Score
#1  gpt-4.1-20250414  20.8%
    Strong on Galileo Agent Leaderboard v2 Healthcare AC (100%) and MMLongBench-Doc Leaderboard acc_score_pct (75%)
#2  gemini-2.5-flash  19.8%
    Strong on BRIDGE Medical Leaderboard average_performance_pct (100%) and Vals MedScribe overall_accuracy_pct (85%)
#3  claude-sonnet-4-20250514  19.7%
    Strong on Galileo Agent Leaderboard v2 Healthcare AC (100%) and Vals MedQA overall_accuracy_pct (88%)
#4  gemini-2.5-pro  19.5%
#5  qwen-2.5-72b-instruct  16.9%
#6  gpt-4o  16.8%
#7  gemini-3-pro-preview  16.3%
#8  Grok-4-0709  15.8%
#9  google/gemini-3.1-pro-preview  15.4%
#10  claude-opus-4-5-20251101  14.9%
#11  gpt-5-mini-2025-08-07  14.3%
#12  openai/gpt-5.4-2026-03-05  13.9%
#13  gemini-3-flash-preview  13.4%
#14  gpt-5-2025-08-07  13.3%
#15  gpt-5.1-2025-11-13  12.9%
#17  gpt-4.1-mini-20250414  11.7%
#18  xai-org/grok-4-fast-reasoning  11.7%
#20  anthropic/claude-opus-4-6-thinking  11.6%
#21  anthropic/claude-opus-4-1-20250805  11.4%
#22  anthropic/claude-opus-4-5-20251101-thinking  11.4%
#23  gpt-5.2-2025-12-11  11.4%
#24  anthropic/claude-sonnet-4.6  11.3%
#25  xai-org/grok-4-1-fast-reasoning  10.8%
#26  anthropic/claude-sonnet-4-5-20250929-thinking  10.5%
#28  o3-20250416  10.2%
#29  google/gemini-2.0-flash-001  10.0%
#30  kimi/kimi-k2.5-thinking  10.0%
#31  gpt-4o-2024-08-06  9.7%
#33  deepseek/deepseek-r1  9.4%
#35  openai/gpt-4o-mini-2024-07-18  9.1%

Compare Models

Model A leads by +1.0%

Model A

gpt-4.1-20250414

external/openai/gpt-4-1-20250414

20.8%

Rank #1

Confidence 27.0% · 22 evidence pts

Galileo Agent Leaderboard v2: Healthcare AC

Value 100.0% · Conf 100.0% · Weight 2.6%

galileo_agent_v2.healthcare_ac (Mar 12, 2026)

MMLongBench-Doc Leaderboard: acc_score_pct

Value 74.6% · Conf 100.0% · Weight 2.6%

mmlongbench_doc_leaderboard.acc_score_pct (Mar 12, 2026)

Vals MedQA: overall_accuracy_pct

Value 90.0% · Conf 100.0% · Weight 2.5%

vals_medqa.overall_accuracy_pct (Mar 12, 2026)

Vectara HHEM Leaderboard: medicine_hallucination_error_pct

Value 96.2% · Conf 100.0% · Weight 1.8%

vectara_hhem_leaderboard.medicine_hallucination_error_pct (Mar 12, 2026)

Model B

gemini-2.5-flash

external/google/gemini-2-5-flash

19.8%

Rank #2

Confidence 26.2% · 19 evidence pts

BRIDGE Medical Leaderboard: average_performance_pct

Value 100.0% · Conf 100.0% · Weight 2.9%

bridge_medical_leaderboard.average_performance_pct (Mar 12, 2026)

Vals MedScribe: overall_accuracy_pct

Value 84.6% · Conf 100.0% · Weight 1.9%

vals_medscribe.overall_accuracy_pct (Mar 12, 2026)

Galileo Agent Leaderboard v2: Healthcare TSQ

Value 97.8% · Conf 100.0% · Weight 1.8%

galileo_agent_v2.healthcare_tsq (Mar 12, 2026)

Vectara HHEM Leaderboard: medicine_hallucination_error_pct

Value 92.5% · Conf 100.0% · Weight 1.7%

vectara_hhem_leaderboard.medicine_hallucination_error_pct (Mar 12, 2026)
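
The comparison figures above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the `Signal` class, the `lead` and `top_signal` helpers, and in particular the contribution formula (value × confidence × weight) are assumptions made for this example, since the page does not state how it combines signals into a score. The numbers are copied from the Model A and Model B panels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    """One benchmark signal from the comparison panel (all fields in percent)."""
    key: str
    value: float
    conf: float
    weight: float

    def contribution(self) -> float:
        # ASSUMPTION: value x confidence x weight. The page does not publish
        # its scoring formula; this is only a plausible reading of the fields.
        return self.value * (self.conf / 100.0) * (self.weight / 100.0)


# Signals listed for Model A (gpt-4.1-20250414).
MODEL_A = [
    Signal("galileo_agent_v2.healthcare_ac", 100.0, 100.0, 2.6),
    Signal("mmlongbench_doc_leaderboard.acc_score_pct", 74.6, 100.0, 2.6),
    Signal("vals_medqa.overall_accuracy_pct", 90.0, 100.0, 2.5),
    Signal("vectara_hhem_leaderboard.medicine_hallucination_error_pct", 96.2, 100.0, 1.8),
]

# Signals listed for Model B (gemini-2.5-flash).
MODEL_B = [
    Signal("bridge_medical_leaderboard.average_performance_pct", 100.0, 100.0, 2.9),
    Signal("vals_medscribe.overall_accuracy_pct", 84.6, 100.0, 1.9),
    Signal("galileo_agent_v2.healthcare_tsq", 97.8, 100.0, 1.8),
    Signal("vectara_hhem_leaderboard.medicine_hallucination_error_pct", 92.5, 100.0, 1.7),
]


def lead(score_a: float, score_b: float) -> float:
    """Difference between two overall scores, in percentage points."""
    return round(score_a - score_b, 1)


def top_signal(signals: list[Signal]) -> Signal:
    """Signal with the largest assumed contribution."""
    return max(signals, key=lambda s: s.contribution())


if __name__ == "__main__":
    print(f"Model A leads by +{lead(20.8, 19.8)}%")
    print("Top signal, Model A:", top_signal(MODEL_A).key)
    print("Top signal, Model B:", top_signal(MODEL_B).key)
```

Under this assumed formula the largest contributor for Model A is `galileo_agent_v2.healthcare_ac`, which agrees with the Top Signal shown on the page. Only four signals per model are listed here, so these contributions are not expected to sum to the overall 20.8% and 19.8% scores.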

Ranking Diagnostics & Missing Models

Source Lift

Ranked

47

Sources

8

Quality

Insufficient

Vals MedQA

vals_medqa

37 rows

2.3% avg lift

Vals Legal Bench

vals_legal_bench

37 rows

0.3% avg lift

Vals LiveCodeBench

vals_lcb

36 rows

0.3% avg lift

Vals Tax Eval v2

vals_tax_eval_v2

34 rows

0.3% avg lift

Missing Strong Models

zai/glm-5-thinking

external/zai/glm-5-thinking

Rank #32

13.0%

Thin evidence after weighting

alibaba/qwen3.5-flash

external/alibaba/qwen3-5-flash

Rank #33

12.3%

Thin evidence after weighting

qwen/qwen3-max

external/qwen/qwen3-max

Rank #55

10.3%

Thin evidence after weighting

Taxonomy Details

Core Tasks

task.summarize_medical_chart
task.timeline_extraction

Required Modes

mode.long_context

Domains

domain.healthcare_clinical

Related Use Cases