benchmark evidence

Vectara HHEM

Hallucination evaluation by Vectara using the Hughes Hallucination Evaluation Model. Factual Consistency Rate (0-100): higher means less hallucination.
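
As a rough illustration of how a 0-100 factual consistency rate can be read, the sketch below turns per-summary consistency scores into a percentage. It is not taken from the upstream source: the function name, the sample scores, and the 0.5 decision threshold are all assumptions chosen for illustration. The page itself states only the 0-100 scale and that higher means less hallucination.

```python
def factual_consistency_rate(scores, threshold=0.5):
    """Return the percentage of summaries judged factually consistent.

    `scores` are per-summary consistency scores in [0, 1].
    The 0.5 threshold is an assumption for this sketch, not a value
    stated on this page.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    consistent = sum(1 for s in scores if s >= threshold)
    return 100.0 * consistent / len(scores)


# Made-up scores for four summaries; three clear the threshold.
sample_scores = [0.98, 0.91, 0.12, 0.77]
rate = factual_consistency_rate(sample_scores)
print(rate)          # 75.0
print(100.0 - rate)  # 25.0, the complementary share judged inconsistent
```

Under this reading, a score of 95.9 would mean roughly 4 in 100 summaries were flagged as inconsistent with their source text.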

active | factuality evidence -> | upstream source ->

winner on Vectara HHEM

Meta: Llama 3.3 70B Instruct | 95.9

direct benchmark result, not a broad vertical composite | source row dated 2000-01-01

scored on 2000-01-01 · stale source data (9691d)
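
The staleness figure can be checked with plain date arithmetic. The sketch below assumes that "9691d" counts whole days elapsed since the source date; the helper name and the reference date are hypothetical, with the reference date derived by adding 9691 days to 2000-01-01 rather than taken from the page.

```python
from datetime import date, timedelta


def staleness_days(scored_on, as_of):
    """Whole days between the source date and a reference date."""
    return (as_of - scored_on).days


scored_on = date(2000, 1, 1)

# Hypothetical reference date: the source date plus the reported 9691 days.
as_of = scored_on + timedelta(days=9691)

print(as_of)                             # 2026-07-14
print(staleness_days(scored_on, as_of))  # 9691
```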

latest mapped results | top 20

#	Model	Provider	Score	Evidence	Tested
1	Meta: Llama 3.3 70B Instruct	Meta Llama	95.9	model-only independent_benchmark	2000-01-01
2	OpenAI: GPT-5.4 Mini	OpenAI	94.5	model-only independent_benchmark	2000-01-01
3	OpenAI: GPT-4.1	OpenAI	94.4	model-only independent_benchmark	2000-01-01
4	Qwen: Qwen3 32B	Qwen	94.1	model-only independent_benchmark	2000-01-01
5	DeepSeek: DeepSeek V3.2	DeepSeek	93.7	model-only independent_benchmark	2000-01-01
6	OpenAI: GPT-5.4	OpenAI	93.0	model-only independent_benchmark	2000-01-01
7	Google: Gemini 2.5 Pro	Google	93.0	model-only independent_benchmark	2000-01-01
8	Google: Gemma 3 27B	Google	92.6	model-only independent_benchmark	2000-01-01
9	Google: Gemini 2.5 Flash	Google	92.2	model-only independent_benchmark	2000-01-01
10	DeepSeek: DeepSeek V4 Pro	DeepSeek	91.4	model-only independent_benchmark	2000-01-01
11	OpenAI: GPT-5.5	OpenAI	90.7	model-only independent_benchmark	2000-01-01
12	Qwen: Qwen3 235B A22B	Qwen	90.7	model-only independent_benchmark	2000-01-01
13	Anthropic: Claude Sonnet 4	Anthropic	89.7	model-only independent_benchmark	2000-01-01
14	DeepSeek: R1	DeepSeek	88.7	model-only independent_benchmark	2000-01-01

what this result means

This benchmark contributes direct public evidence. Read its scope before generalizing the result.

A win here is a win on Vectara HHEM. Broad task pages require independent corroboration before naming a general winner.

source record

category: hallucination

metric: accuracy

matched models: 14

latest source date: 2000-01-01

direction: higher is better

inspect upstream source ->