real_estate
Title-like document entity extraction
Extract and reconcile owners/entities across fragmented property docs.
#1 Recommendation
gpt-4.1-20250414
Strong on MMLongBench-Doc Leaderboard acc_score_pct (75%) and Galileo Agent Leaderboard v2 Avg AC (100%)
external/openai/gpt-4-1-20250414
Score: 16.4%
Confidence: 21.6%
Limited benchmark evidence for this use case.
11 ranked models with average evidence of 15.0 points. Rankings may shift as more benchmark data is ingested.
Ranked Models: 11
Evidence Quality: 68%
Scoring: Benchmark-backed
Top Signal: MMLongBench-Doc Leaderboard: acc_score_pct
All Ranked Models
| Rank | Model | Score |
|---|---|---|
| #2 | gpt-4.1-20250414 | 16.4% |
| #15 | gemini-2.5-pro | 10.8% |
| #19 | gpt-4o-20241120 | 10.1% |
| #20 | gemini-3-pro-preview | 9.8% |
| #21 | claude-sonnet-4-20250514 | 9.7% |
| #22 | qwen-2.5-72b-instruct | 9.7% |
| #23 | Grok-4-0709 | 9.4% |
| #24 | gemini-2.5-flash | 9.1% |
| #25 | gpt-4o | 8.7% |
| #26 | deepseek/deepseek-r1 | 7.5% |
| #28 | openai/gpt-4o-mini-2024-07-18 | 5.8% |
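
The lead of one ranked model over another can be recomputed directly from the Score column. The following is a minimal sketch with scores copied from the table above; the `SCORES` dictionary and the `score_gap` helper are illustrative names, not part of the dashboard.

```python
# Scores (in percent) copied from the "All Ranked Models" table.
SCORES = {
    "gpt-4.1-20250414": 16.4,
    "gemini-2.5-pro": 10.8,
    "gpt-4o-20241120": 10.1,
    "gemini-3-pro-preview": 9.8,
    "claude-sonnet-4-20250514": 9.7,
    "qwen-2.5-72b-instruct": 9.7,
    "Grok-4-0709": 9.4,
    "gemini-2.5-flash": 9.1,
    "gpt-4o": 8.7,
    "deepseek/deepseek-r1": 7.5,
    "openai/gpt-4o-mini-2024-07-18": 5.8,
}


def score_gap(model_a: str, model_b: str, scores: dict = SCORES) -> float:
    """Return model_a's lead over model_b in percentage points (one decimal)."""
    return round(scores[model_a] - scores[model_b], 1)


if __name__ == "__main__":
    # Lead of the top-ranked model over the second-listed model.
    print(score_gap("gpt-4.1-20250414", "gemini-2.5-pro"))
```

For the top two rows this gives a gap of 5.6 percentage points, which matches the lead reported in the comparison panel.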
Compare Models
Model A leads by +5.6%
Model A
gpt-4.1-20250414
external/openai/gpt-4-1-20250414
Rank #2
MMLongBench-Doc Leaderboard: acc_score_pct
Value 74.6% · Conf 100.0% · Weight 4.8%
mmlongbench_doc_leaderboard.acc_score_pct (Mar 12, 2026)
Galileo Agent Leaderboard v2: Avg AC
Value 100.0% · Conf 100.0% · Weight 3.2%
galileo_agent_v2.avg_ac (Mar 12, 2026)
Galileo Agent Leaderboard v2: Avg TSQ
Value 64.1% · Conf 100.0% · Weight 0.7%
galileo_agent_v2.avg_tsq (Mar 12, 2026)
Vectara HHEM Leaderboard: overall_hallucination_error_pct
Value 82.5% · Conf 100.0% · Weight 0.5%
vectara_hhem_leaderboard.overall_hallucination_error_pct (Mar 12, 2026)
Model B
gemini-2.5-pro
external/google/gemini-2-5-pro
Rank #15
Galileo Agent Leaderboard v2: Avg AC
Value 58.7% · Conf 100.0% · Weight 1.9%
galileo_agent_v2.avg_ac (Mar 12, 2026)
LEXam Leaderboard: average_score_pct
Value 89.4% · Conf 100.0% · Weight 1.3%
lexam_leaderboard.average_score_pct (Mar 12, 2026)
Galileo Agent Leaderboard v2: Avg TSQ
Value 79.5% · Conf 100.0% · Weight 0.8%
galileo_agent_v2.avg_tsq (Mar 12, 2026)
Vectara HHEM Leaderboard: overall_hallucination_error_pct
Value 76.0% · Conf 100.0% · Weight 0.4%
vectara_hhem_leaderboard.overall_hallucination_error_pct (Mar 12, 2026)
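The page lists a Value, Conf, and Weight for each signal but does not state how they combine into the overall score. As a rough sketch only, the code below assumes each signal contributes value × confidence × weight; that formula, the `Signal` class, and `total_contribution` are hypothetical and not confirmed by the dashboard. The numbers are copied from the Model A panel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    key: str
    value_pct: float
    conf_pct: float
    weight_pct: float

    def contribution(self) -> float:
        # ASSUMPTION: value x confidence x weight. The dashboard does not
        # document its actual scoring formula.
        return (self.value_pct / 100) * (self.conf_pct / 100) * self.weight_pct


# Signals shown for Model A (gpt-4.1-20250414).
MODEL_A_SIGNALS = [
    Signal("mmlongbench_doc_leaderboard.acc_score_pct", 74.6, 100.0, 4.8),
    Signal("galileo_agent_v2.avg_ac", 100.0, 100.0, 3.2),
    Signal("galileo_agent_v2.avg_tsq", 64.1, 100.0, 0.7),
    Signal("vectara_hhem_leaderboard.overall_hallucination_error_pct", 82.5, 100.0, 0.5),
]


def total_contribution(signals) -> float:
    """Sum the assumed per-signal contributions, in percentage points."""
    return sum(s.contribution() for s in signals)


if __name__ == "__main__":
    for s in MODEL_A_SIGNALS:
        print(f"{s.key}: {s.contribution():.2f}")
    print(f"total: {total_contribution(MODEL_A_SIGNALS):.2f}")
```

Under this assumption the four listed signals add up to roughly 7.6 points, well short of the 16.4% score shown for the model, so the real score presumably draws on additional signals or a different formula.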
Ranking Diagnostics & Missing Models
Source Lift
Ranked: 11
Sources: 8
Quality: Insufficient
Vals CorpFin v2 (vals_corp_fin_v2) · 7 rows · 0.3% avg lift
Galileo Agent Leaderboard v2 (galileo_agent_v2) · 6 rows · 1.6% avg lift
Vals Legal Bench (vals_legal_bench) · 6 rows · 0.4% avg lift
Vals Mortgage Tax (vals_mortgage_tax) · 6 rows · 0.4% avg lift
Missing Strong Models
anthropic/claude-sonnet-4.6 (external/anthropic/claude-sonnet-4-6) · Rank #4 · 21.1%
gpt-5-mini-2025-08-07 (external/openai/gpt-5-mini-2025-08-07) · Rank #7 · 19.6%
google/gemini-3.1-pro-preview (external/google/gemini-3-1-pro-preview) · Rank #8 · 19.3%
gpt-5-2025-08-07 (external/openai/gpt-5-2025-08-07) · Rank #9 · 19.2%