Best LLM for Factual Accuracy
Vectara HHEM measures summarization faithfulness. SimpleQA adds published factual-recall evidence, but it is reported by OpenAI. BasedAGI does not name the least-hallucinating model until two current independent sources cover at least 20 callable models.
One independent source is not enough for a least-hallucinating claim, and a lab-published score is not independent corroboration. The source measurements remain available for inspection.
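The publication rule above can be sketched as a simple gate. This is a minimal illustration, not BasedAGI's actual implementation: the `Source` type, its field names, and the model identifiers are hypothetical. It also assumes one reading of the rule, namely that at least 20 callable models must each appear in at least two current, independent sources.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable

# Thresholds taken from the rule stated in the text.
MIN_INDEPENDENT_SOURCES = 2
MIN_CALLABLE_MODELS = 20


@dataclass(frozen=True)
class Source:
    """Hypothetical description of one benchmark source."""
    name: str
    independent: bool          # False for lab-published scores
    current: bool              # False for stale measurements
    callable_models: FrozenSet[str]


def can_name_winner(sources: Iterable[Source]) -> bool:
    """Return True only if the two-source corroboration rule is met.

    Assumption: a model counts as corroborated when it is covered by
    at least two sources that are both current and independent.
    """
    qualifying = [s for s in sources if s.independent and s.current]
    if len(qualifying) < MIN_INDEPENDENT_SOURCES:
        return False

    coverage = {}
    for source in qualifying:
        for model in source.callable_models:
            coverage[model] = coverage.get(model, 0) + 1

    corroborated = [
        m for m, n in coverage.items() if n >= MIN_INDEPENDENT_SOURCES
    ]
    return len(corroborated) >= MIN_CALLABLE_MODELS


# Placeholder model ids, for illustration only; not real coverage data.
models = frozenset(f"model-{i}" for i in range(25))
hhem = Source("Vectara HHEM", independent=True, current=True,
              callable_models=models)
simpleqa = Source("SimpleQA", independent=False, current=True,
                  callable_models=models)

winner_allowed = can_name_winner([hhem, simpleqa])
```

With one independent source and one lab-published source, `winner_allowed` is `False` regardless of how many models each covers, which matches the position taken here.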
A 97% versus 94% factual consistency rate implies 3% versus 6% inconsistent summaries on that measure: twice the failure rate, not a cosmetic difference.
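The arithmetic behind that comparison can be checked with a short sketch. The function names are illustrative, and the inputs are the two example rates from the text, not measurements for any particular model.

```python
def failure_rate(consistency_pct: float) -> float:
    """Convert a factual consistency rate (0-100) into an inconsistency rate."""
    if not 0.0 <= consistency_pct <= 100.0:
        raise ValueError("consistency_pct must be between 0 and 100")
    return 100.0 - consistency_pct


def relative_failure(worse_pct: float, better_pct: float) -> float:
    """How many times more often the lower-scoring model fails."""
    better_fail = failure_rate(better_pct)
    if better_fail == 0.0:
        raise ValueError("ratio is undefined when the better model never fails")
    return failure_rate(worse_pct) / better_fail


# Example rates from the text: 94% versus 97% factual consistency.
ratio = relative_failure(94.0, 97.0)  # 6.0 / 3.0 = 2.0
```

Comparing failure rates rather than consistency rates shows the gap that matters in production: a 3-point difference near the top of the scale doubles the number of inconsistent summaries.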
If deployment constraints require an open-weight model, compare its current measured factuality row against the leader. Do not infer factuality from weight availability or vendor claims.
HHEM tests summarization faithfulness; SimpleQA tests factual recall. A model that scores well on one may not score well on the other. Always evaluate on your specific task — a model that rarely hallucinates in summaries may still confabulate on open-domain questions.
Low hallucination matters most for legal document review, medical information, RAG pipelines, and any application where invented facts cause real harm. For creative writing or brainstorming, hallucination score is irrelevant — pick by writing or EQ score instead.
Which LLM hallucinates the least?
No broad factuality winner is published yet. Vectara HHEM is independent evidence, but SimpleQA is vendor-published, so it cannot serve as the second independent source. Naming a winner requires two current independent sources covering at least 20 callable models.
What is Vectara HHEM?
HHEM (Hughes Hallucination Evaluation Model) is a hallucination-detection model from Vectara. Its leaderboard is an independent benchmark that measures how often LLMs introduce factual inconsistencies when summarizing documents. Models are scored on Factual Consistency Rate (0-100%), where higher means fewer hallucinations.
How reliable is hallucination benchmarking?
Hallucination benchmarks measure different failure modes. Vectara HHEM focuses on summarization faithfulness: whether a model invents details not in the source document. SimpleQA measures factual recall on short, fact-seeking questions. Neither captures all hallucination types. Models that score well on summarization may still hallucinate on open-domain questions or multi-step reasoning tasks.