
Best LLM for Factual Accuracy

Vectara HHEM measures summarization faithfulness; TruthfulQA tests resistance to common misconceptions. SimpleQA adds vendor-published factual-recall evidence. A factuality winner is named when 20 callable models have current independent evidence.


▸ no current factual accuracy winner published

No current winner is published: qualifying independent evidence is older than 30 days. The table below shows available corroborated evidence, not a publishable current winner.
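
The publication rule can be sketched as a small check. This is a minimal illustration and not the site's actual implementation: the `EvidenceRow` record, its field names, and the `can_publish_winner` function are hypothetical. Only the two thresholds (20 models, 30 days) come from the text.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Thresholds taken from the text; everything else here is an assumption.
MIN_MODELS = 20
MAX_AGE_DAYS = 30


@dataclass
class EvidenceRow:
    """Hypothetical record: one model's most recent evidence."""
    model: str
    independent: bool   # True for independent sources, False for vendor-published
    measured_on: date


def can_publish_winner(rows, today):
    """Return True when enough models have current independent evidence."""
    cutoff = today - timedelta(days=MAX_AGE_DAYS)
    # Count distinct models whose independent evidence is not older than the cutoff.
    current = {
        r.model
        for r in rows
        if r.independent and r.measured_on >= cutoff
    }
    return len(current) >= MIN_MODELS
```

Under these assumptions, 15 rows of stale evidence fail the check twice over: too few models, and none of them current.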

▸ corroborated evidence · factuality score

#	Model	Factuality	Reasoning	Price/M
1	Meta: Llama 3.3 70B Instruct	94.9	60.4	$0.10/M
2	OpenAI: GPT-5.4 Mini	93.5	—	$0.75/M
3	Qwen: Qwen3 32B	93.1	—	$0.08/M
4	DeepSeek: DeepSeek V3.2	92.7	54.5	$0.21/M
5	OpenAI: GPT-4.1	92.6	63.2	$2.00/M
6	Google: Gemini 2.5 Pro	92.0	53.2	$1.25/M
7	OpenAI: GPT-5.4	92.0	72.9	$2.50/M
8	Google: Gemma 3 27B	91.6	60.0	$0.08/M
9	Google: Gemini 2.5 Flash	91.2	29.2	$0.30/M
10	DeepSeek: DeepSeek V4 Pro	90.4	—	$0.43/M
11	OpenAI: GPT-5.5	89.7	74.3	$5.00/M
12	Qwen: Qwen3 235B A22B	89.7	—	$0.45/M
13	Anthropic: Claude Sonnet 4	88.7	67.3	$3.00/M
14	DeepSeek: R1	87.7	36.4	$0.70/M
15	Qwen: Qwen2.5 Coder 32B Instruct	53.2	—	$0.66/M

scored by Vectara HHEM + SimpleQA composite

▸ what hallucination benchmarks measure

Summarization faithfulness: Does the model add facts not present in the source document when summarizing?

Factual recall: Can the model answer closed-domain factual questions without inventing answers?

Factual consistency rate: Fraction of summaries with no hallucinated details; the primary Vectara HHEM metric.

SimpleQA accuracy: Share of correct answers on OpenAI's 4,326-question factual benchmark.

Registry matching: Only source rows resolving to callable OpenRouter model variants enter this ranking.
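
The factual consistency rate defined above is a simple fraction, which the following sketch makes concrete. It assumes a hypothetical list of per-summary judgments (`True` for a summary with no hallucinated details). How each judgment is produced is outside this sketch and is not described in the text.

```python
def factual_consistency_rate(judgments):
    """Percentage of summaries judged free of hallucinated details.

    `judgments` is a hypothetical list of booleans, one per summary:
    True means the summary was consistent with its source document.
    """
    if not judgments:
        raise ValueError("need at least one judged summary")
    return 100.0 * sum(judgments) / len(judgments)


def hallucination_rate(judgments):
    """Complement of the consistency rate, in percent."""
    return 100.0 - factual_consistency_rate(judgments)
```

For example, 19 consistent summaries out of 20 give a consistency rate of 95.0 and a hallucination rate of 5.0.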

▸ benchmarks used

Vectara HHEM: Summarization faithfulness · factual consistency rate 0-100%

SimpleQA: OpenAI closed-domain factual Q&A · 4,326 questions · % correct

TruthfulQA: Resistance to common misconceptions and pretraining-absorbed falsehoods

Sources: Vectara HHEM · SimpleQA · TruthfulQA — HHEM and TruthfulQA are independent; SimpleQA is published by OpenAI.

▸ analysis

Read consistency gaps correctly

A 97% versus 94% factual consistency rate implies 3% versus 6% inconsistent summaries on that measure: twice the failure rate, not a cosmetic difference.
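
The arithmetic behind that comparison can be checked with a short sketch. The function names are illustrative only; the inputs are the two consistency rates quoted above.

```python
def failure_rate(consistency_pct):
    """Inconsistent-summary rate implied by a consistency rate, in percent."""
    return 100.0 - consistency_pct


def relative_failure(better_pct, worse_pct):
    """How many times more often the lower-scoring model fails."""
    better_failures = failure_rate(better_pct)
    if better_failures == 0:
        raise ValueError("ratio is undefined when the better model never fails")
    return failure_rate(worse_pct) / better_failures


print(relative_failure(97.0, 94.0))  # prints 2.0
```

Comparing failure rates rather than headline scores is what exposes the doubling: the three-point gap looks small against 97, but it is the whole of the better model's error budget.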

Deployment does not replace evidence

If deployment constraints require an open-weight model, compare its current measured factuality row against the leader. Do not infer factuality from weight availability or vendor claims.

Hallucination type matters

HHEM tests summarization faithfulness; SimpleQA tests factual recall. A model that scores well on one may not score well on the other. Always evaluate on your specific task — a model that rarely hallucinates in summaries may still confabulate on open-domain questions.

Use cases: legal, medical, RAG

Low hallucination matters most for legal document review, medical information, RAG pipelines, and any application where invented facts cause real harm. For creative writing or brainstorming, hallucination score is irrelevant — pick by writing or EQ score instead.

▸ frequently asked

Which LLM hallucinates the least?

No broad factuality winner is published yet. Vectara HHEM is independent evidence; SimpleQA is vendor-published and cannot provide the current two-source corroboration required across 20 callable models.

What is Vectara HHEM?

HHEM (Hughes Hallucination Evaluation Model) is an independent benchmark by Vectara that measures how often LLMs introduce factual inconsistencies when summarizing documents. Models are scored on Factual Consistency Rate (0-100%), where higher means fewer hallucinations.

How reliable is hallucination benchmarking?

Hallucination benchmarks measure different failure modes. Vectara HHEM focuses on summarization faithfulness — whether a model invents details not in the source document. SimpleQA measures factual recall on closed-domain questions. Neither captures all hallucination types. Models that score well on summarization may still hallucinate on open-domain questions or multi-step reasoning tasks.
