weekly refresh
basedagi.org

Best LLM for Factual Accuracy

Vectara HHEM measures summarization faithfulness. SimpleQA adds published factual-recall evidence, but it is reported by OpenAI, a model vendor, so it does not count as independent. BasedAGI will not name a least-hallucinating model until two current independent sources cover at least 20 callable models.

no broad winner published

One independent source is not enough for a least-hallucinating claim, and a lab-published score is not independent corroboration. The source measurements remain available for inspection.
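
The publication rule above can be sketched as a small gate. This is a minimal illustration, not BasedAGI's actual implementation: the `Source` class, its field names, and the coverage counts are hypothetical, and the sketch assumes each qualifying source must cover at least 20 callable models on its own, which the text does not spell out.

```python
from dataclasses import dataclass

MIN_INDEPENDENT_SOURCES = 2  # from the text: two current independent sources
MIN_CALLABLE_MODELS = 20     # from the text: at least 20 callable models


@dataclass(frozen=True)
class Source:
    """Hypothetical record describing one benchmark source."""
    name: str
    independent: bool     # False for lab- or vendor-published scores
    current: bool         # hypothetical freshness flag
    callable_models: int  # rows that resolve to callable models


def can_publish_winner(sources: list[Source]) -> bool:
    """True only if enough current, independent sources each cover enough models."""
    qualifying = [
        s for s in sources
        if s.independent and s.current and s.callable_models >= MIN_CALLABLE_MODELS
    ]
    return len(qualifying) >= MIN_INDEPENDENT_SOURCES


# Illustrative coverage counts, not real measurements.
sources = [
    Source("Vectara HHEM", independent=True, current=True, callable_models=40),
    Source("SimpleQA", independent=False, current=True, callable_models=40),
]
print(can_publish_winner(sources))  # False: only one independent source qualifies
```

Under these assumptions a vendor-published source never counts toward the threshold, however many models it covers.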

▸ what hallucination benchmarks measure
Summarization faithfulness: Does the model add facts not present in the source document when summarizing?
Factual recall: Can the model answer closed-domain factual questions without inventing answers?
Factual consistency rate: Fraction of summaries with no hallucinated details (Vectara HHEM's primary metric).
SimpleQA accuracy: Share of questions answered correctly on OpenAI's 4,326-question factual benchmark.
Registry matching: Only source rows that resolve to callable OpenRouter model variants enter this ranking.
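
Registry matching can be pictured as a filter over source rows. The sketch below is an assumption about how such a step might look, not the site's pipeline: the function name, the alias table, the row fields, and every model name are placeholders invented for illustration.

```python
def resolve_rows(rows, aliases, callable_ids):
    """Keep only rows whose source name maps to a callable registry ID.

    rows:         list of dicts with a "source_name" key (hypothetical shape)
    aliases:      dict from normalized source names to registry IDs
    callable_ids: set of registry IDs that can currently be called
    """
    matched = []
    for row in rows:
        key = row["source_name"].strip().lower()
        registry_id = aliases.get(key)  # None when the name is unknown
        if registry_id in callable_ids:
            matched.append({**row, "registry_id": registry_id})
    return matched


# Placeholder data: names and scores are illustrative, not real results.
rows = [
    {"source_name": "Model A ", "score": 97.0},
    {"source_name": "Model B", "score": 94.0},
    {"source_name": "Unlisted Model", "score": 99.0},
]
aliases = {"model a": "example/model-a", "model b": "example/model-b"}
callable_ids = {"example/model-a"}

ranked = resolve_rows(rows, aliases, callable_ids)
print([r["registry_id"] for r in ranked])  # ['example/model-a']
```

A row with a high score but no callable match is dropped, not ranked, which is the behavior the definition describes.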
▸ benchmarks used
Vectara HHEM: Summarization faithfulness · factual consistency rate 0-100%
SimpleQA: OpenAI closed-domain factual Q&A · 4,326 questions · % correct
Sources: Vectara HHEM · SimpleQA. Both are public; HHEM is independent and SimpleQA is published by OpenAI.
▸ analysis
Read consistency gaps correctly

A 97% versus 94% factual consistency rate implies 3% versus 6% inconsistent summaries on that measure: twice the failure rate, not a cosmetic difference.
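
The arithmetic is easy to check. A minimal sketch, with hypothetical function names, that converts consistency rates into failure rates and compares them:

```python
def failure_rate(consistency_pct: float) -> float:
    """Percentage of summaries judged inconsistent on this measure."""
    return 100.0 - consistency_pct


def relative_failure(better_pct: float, worse_pct: float) -> float:
    """How many times more often the lower-scoring model fails."""
    better_failures = failure_rate(better_pct)
    if better_failures == 0:
        raise ValueError("ratio is undefined when the better model never fails")
    return failure_rate(worse_pct) / better_failures


print(failure_rate(97.0))            # 3.0
print(failure_rate(94.0))            # 6.0
print(relative_failure(97.0, 94.0))  # 2.0
```

A three-point gap in consistency therefore reads as a doubling of inconsistent summaries, which is the comparison that matters at volume.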

Deployment does not replace evidence

If deployment constraints require an open-weight model, compare its current measured factuality row against the leader. Do not infer factuality from weight availability or vendor claims.

Hallucination type matters

HHEM tests summarization faithfulness; SimpleQA tests factual recall. A model that scores well on one may not score well on the other. Always evaluate on your specific task — a model that rarely hallucinates in summaries may still confabulate on open-domain questions.

Use cases: legal, medical, RAG

Low hallucination matters most for legal document review, medical information, RAG pipelines, and any application where invented facts cause real harm. For creative writing or brainstorming, hallucination score is irrelevant — pick by writing or EQ score instead.

▸ frequently asked

Which LLM hallucinates the least?

No broad factuality winner is published yet. Vectara HHEM is independent evidence, but SimpleQA is vendor-published, so it cannot serve as the second independent source required to corroborate a ranking across at least 20 callable models.

What is Vectara HHEM?

HHEM (Hughes Hallucination Evaluation Model) is an independent benchmark by Vectara that measures how often LLMs introduce factual inconsistencies when summarizing documents. Models are scored on Factual Consistency Rate (0-100%), where higher means fewer hallucinations.

How reliable is hallucination benchmarking?

Hallucination benchmarks measure different failure modes. Vectara HHEM focuses on summarization faithfulness — whether a model invents details not in the source document. SimpleQA measures factual recall on closed-domain questions. Neither captures all hallucination types. Models that score well on summarization may still hallucinate on open-domain questions or multi-step reasoning tasks.