Hallucination Benchmarks Leaderboard: April 2026
Rankings of the top AI models on factuality and hallucination benchmarks: TruthfulQA, SimpleQA, FACTS Grounding, Vectara HHEM, HaluEval, HalluLens, and AA-Omniscience as of April 2026.

Every frontier model provider claims their system is more accurate and less hallucinatory than the competition. This leaderboard cuts through those claims by looking at what the benchmarks actually show - and the picture is messier than the marketing suggests.
No single benchmark captures the full scope of how models fail on facts. TruthfulQA tests whether models parrot common misconceptions. SimpleQA probes short-form factual recall. FACTS Grounding measures faithfulness to provided source documents. The Vectara leaderboard tracks summarization-time hallucinations. AA-Omniscience penalizes wrong answers and rewards abstention. Together they give a more honest picture of where models stand.
TL;DR
- Gemini 2.5 Pro leads SimpleQA at 53.0%; no model cracks 70% on FACTS Grounding
- Reasoning models score worse on grounded summarization - "thinking more" can actively hurt faithfulness
- Phi-3.5-MoE-instruct tops TruthfulQA at 0.775, outscoring much larger closed models on that older benchmark
Why Factuality Benchmarks Diverge
Before diving into the numbers, it helps to understand what each benchmark actually tests. They don't measure the same thing, and strong performance on one doesn't predict performance on another.
If you want the conceptual background, our guide to AI hallucinations explains the core failure modes, and our understanding AI benchmarks guide covers what benchmark scores can and can't tell you.
| Benchmark | What it measures | Format | Dataset size |
|---|---|---|---|
| TruthfulQA | Resistance to common misconceptions | Multiple-choice (MC1/MC2) | 817 questions |
| SimpleQA | Short-form factual recall | Open-ended Q&A | 4,326 questions |
| FACTS Grounding | Faithfulness to source documents (up to 32K tokens) | Long-form generation | 1,719 examples |
| Vectara HHEM | Hallucination rate in document summarization | Summarization | 7,700+ articles |
| HaluEval | Hallucination detection across QA, dialogue, summarization | Classification | 35,000 examples |
| HalluLens | Extrinsic and intrinsic hallucination taxonomy | Multi-task | Dynamic generation |
| AA-Omniscience | Factual recall across 42 topics, rewards abstention | Open-ended Q&A | 6,000 questions |
One important distinction runs through all of these: hallucination and factuality aren't the same thing. HalluLens (from Meta FAIR, published at ACL 2025) formalizes this: an extrinsic hallucination contradicts the model's own training data, while an intrinsic hallucination contradicts the context provided in the prompt. Benchmarks frequently conflate these two failure modes, which is why a model can look excellent on one test and poor on another.
TruthfulQA
TruthfulQA, introduced by Lin et al. in 2021, targets imitative falsehoods - wrong answers that feel plausible because they appear in human-written text. The 817 questions span 38 categories including health, law, finance, and politics.
The benchmark has a well-documented weakness: it's easy to game by training on similar questions, and contamination from public benchmarks is a real concern. Still, it remains widely reported and its MC2 scoring (normalized probability assigned to the set of true answers) is more robust than MC1.
| Rank | Model | Provider | TruthfulQA Score |
|---|---|---|---|
| 1 | Phi-3.5-MoE-instruct | Microsoft | 0.775 |
| 2 | Granite 3.3 8B Instruct | IBM | 0.669 |
| 3 | Phi 4 Mini | Microsoft | 0.664 |
| 4 | Phi-3.5-mini-instruct | Microsoft | 0.640 |
| 5 | Hermes 3 70B | Nous Research | 0.633 |
| 6 | Llama 3.1 Nemotron 70B Instruct | NVIDIA | 0.586 |
| 7 | Qwen2.5 14B Instruct | Alibaba Cloud | 0.584 |
| 8 | Jamba 1.5 Large | AI21 Labs | 0.583 |
| 9 | Qwen2.5 32B Instruct | Alibaba Cloud | 0.578 |
| 10 | Command R+ | Cohere | 0.563 |
Source: llm-stats.com, 17 self-reported results, no verified external evaluation.
The most striking pattern: Microsoft's Phi family controls the top three slots. Phi-3.5-MoE-instruct's 0.775 score stands well above the field. Notably absent from the top 10 are GPT-4o, Claude Opus, and Gemini - the headline frontier models. That's partly because TruthfulQA has become saturated for the largest models (which were likely trained with TruthfulQA-style data in mind), and partly because the leaderboard only has 17 submissions - closed model providers don't always report this score publicly.
The original paper documented an inverse scaling effect: larger models answered less truthfully on this benchmark, not more. That relationship has weakened with instruction tuning and RLHF, but it is a reminder that scale alone doesn't fix factuality.
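The MC2 scoring mentioned above reduces to a few lines of arithmetic: normalize the model's likelihood over all answer choices, then sum the probability mass landing on the reference-true set. A minimal sketch - the function name and inputs are ours for illustration, not the benchmark's code:

```python
import math

def mc2_score(choice_logprobs, true_indices):
    """MC2-style score: normalized probability mass assigned to the
    set of true answers, given per-choice log-probabilities."""
    # Softmax over the answer choices (shift by max for numerical stability).
    m = max(choice_logprobs)
    weights = [math.exp(lp - m) for lp in choice_logprobs]
    total = sum(weights)
    probs = [w / total for w in weights]
    return sum(probs[i] for i in true_indices)

# Example: the model puts most of its mass on the first (true) choice.
score = mc2_score([-0.5, -2.0, -3.0], true_indices=[0])  # ~0.77
```

Because the score is a probability mass rather than a hard accuracy, MC2 rewards models that merely lean toward true answers, which is part of why it is more robust than MC1's single-choice scoring.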
SimpleQA
SimpleQA, released by OpenAI in October 2024, focuses on short-form factual questions with verifiable single answers. The 4,326 questions were verified by multiple human raters and are designed to have clear, objective correct answers - not ambiguous or opinion-based queries.
This is arguably the most important factuality benchmark right now, because it's recent, resistant to training contamination (questions weren't publicly available before the benchmark launched), and covers a wide factual domain.
| Rank | Model | Provider | SimpleQA Score |
|---|---|---|---|
| 1 | Gemini 2.5 Pro | Google | 53.0% |
| 2 | Qwen3 235B A22B Instruct 2507 | Alibaba | 50.6% |
| 3 | Qwen3 VL 235B A22B Instruct | Alibaba | 46.7% |
| 4 | GPT-4.1 | OpenAI | 40.4% |
| 5 | Qwen3 Next 80B A3B Instruct | Alibaba | 40.1% |
| 6 | Grok 3 Beta | xAI | 38.3% |
| 7 | Qwen3 VL 235B A22B Thinking | Alibaba | 37.9% |
| 8 | Grok 3 | xAI | 37.4% |
| 9 | ERNIE 4.5 300B A47B | Baidu | 36.9% |
| 10 | Claude 3.7 Sonnet Thinking | Anthropic | 32.8% |
| 11 | Claude 3.7 Sonnet | Anthropic | 32.8% |
| 12 | DeepSeek R1 | DeepSeek | 29.1% |
Source: pricepertoken.com SimpleQA leaderboard, updated April 17, 2026. 45 models evaluated, average score 20.8%.
A few things stand out. First, Gemini 2.5 Pro's lead is real but not commanding - the gap between first and fourth (GPT-4.1) is roughly 13 percentage points. Second, the Qwen3 family punches above its weight, with three models in the top five. Third, the absolute scores are low across the board. The leader gets 53% right. The field average is 20.8%. These aren't figures a vendor would put on a press release, which is probably why SimpleQA scores often get buried when companies announce new models.
The Thinking variants don't consistently outperform their non-thinking counterparts on this benchmark - Claude 3.7 Sonnet Thinking ties with Claude 3.7 Sonnet at 32.8%.
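SimpleQA's grader assigns each response to one of three buckets - correct, incorrect, or not attempted - and the headline metrics fall out as simple ratios. A rough sketch of that bookkeeping (the function and field names are ours, not OpenAI's code):

```python
from collections import Counter

def simpleqa_summary(grades):
    """Summarize SimpleQA-style grades, where each grade is one of
    'correct', 'incorrect', or 'not_attempted' (assigned by a grader
    model in the real benchmark)."""
    counts = Counter(grades)
    n = len(grades)
    attempted = counts["correct"] + counts["incorrect"]
    return {
        # Share of all questions answered correctly (the leaderboard number).
        "overall_correct": counts["correct"] / n,
        # Accuracy restricted to questions the model actually tried,
        # which separates recall failures from sensible abstention.
        "correct_given_attempted": (
            counts["correct"] / attempted if attempted else 0.0
        ),
    }

grades = ["correct", "incorrect", "not_attempted", "correct"]
stats = simpleqa_summary(grades)  # overall 0.5, given-attempted ~0.67
```

The two ratios diverge whenever a model abstains often, which is exactly the behavior that benchmarks like AA-Omniscience (below) try to reward explicitly.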
FACTS Grounding
FACTS Grounding, from Google DeepMind, tests something different: given a long document (up to 32,000 tokens) and a user request, does the model answer faithfully based on what's in the document without hallucinating content not present in the source?
The benchmark uses 1,719 examples across finance, technology, medicine, law, and retail. Three LLM judges - Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet - each score responses, and the final score averages their judgments to reduce individual model bias. Responses that fail to address the user request are disqualified before scoring.
| Rank | Model | Provider | FACTS Grounding Score |
|---|---|---|---|
| 1 | Gemini 2.0 Flash Experimental | Google | 83.6% |
| 2 | Gemini 1.5 Flash | Google | 82.9% |
| 3 | Gemini 1.5 Pro | Google | 80.0% |
| 4 | Claude 3.5 Sonnet | Anthropic | 79.4% |
| 5 | GPT-4o | OpenAI | 78.8% |
Source: FACTS Grounding paper (arxiv 2501.03200), original leaderboard results from January 2025.
Google's models dominate the top three, which isn't surprising given they designed the benchmark - though the research team took care to use a multi-judge setup and include non-Google judges. Claude 3.5 Sonnet and GPT-4o are close behind, both above 78%.
The FACTS Benchmark Suite (announced in early 2026) expands this to four dimensions: Grounding v2, Parametric, Search, and Multimodal. Under that harder suite, Gemini 3 Pro leads with an overall FACTS Score of 68.8%, and no model breaks 70%. The added difficulty comes from longer documents and more complex reasoning requirements. The full suite is described on the Google DeepMind FACTS Benchmark Suite blog post.
A caveat worth noting: the research team found models rate their own outputs 3.23 percentage points higher than those from competing providers on average, which is why the multi-judge approach matters. Single-judge evaluations of grounding are suspect.
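The aggregation described above can be sketched roughly as follows: an eligibility filter, per-judge grounding verdicts, then an average across judges. The response schema and the choice to count disqualified responses as failures are our illustration, not DeepMind's published code:

```python
def facts_score(responses):
    """Aggregate FACTS-style judgments. Each response carries:
      - 'eligible': passed the 'addresses the user request' filter
      - 'judge_grounded': one grounded/not-grounded verdict per judge
    (illustrative field names, not the benchmark's actual schema)."""
    n_judges = len(responses[0]["judge_grounded"])
    per_judge_scores = []
    for j in range(n_judges):
        # Ineligible responses count against the model rather than
        # being dropped, so disqualification can't inflate the score.
        grounded = [r["eligible"] and r["judge_grounded"][j] for r in responses]
        per_judge_scores.append(sum(grounded) / len(responses))
    # Averaging across judges dilutes any single judge's self-preference bias.
    return sum(per_judge_scores) / n_judges

responses = [
    {"eligible": True, "judge_grounded": [True, True, True]},
    {"eligible": True, "judge_grounded": [True, False, True]},
    {"eligible": False, "judge_grounded": [True, True, True]},  # disqualified
]
score = facts_score(responses)  # ~0.56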
Vectara HHEM Leaderboard
Vectara's hallucination leaderboard, now in its second generation, measures how often models introduce unsupported information when summarizing documents. The evaluation uses Vectara's HHEM-2.3 model (with a FaithJudge approach for some comparisons) to score factual consistency.
The refreshed dataset expanded from 1,000 to 7,700+ articles spanning law, medicine, finance, technology, education, sports, and news. Articles now reach 32K tokens, which is a meaningful difficulty increase. The leaderboard was last updated March 20, 2026.
| Rank | Model | Provider | Hallucination Rate |
|---|---|---|---|
| 1 | finix_s1_32b | Ant Group | 1.8% |
| 2 | gpt-5.4-nano | OpenAI | 3.1% |
| 3 | gemini-2.5-flash-lite | Google | 3.3% |
| 4 | Phi-4 | Microsoft | 3.7% |
| 5 | Llama-3.3-70B-Instruct-Turbo | Meta | 4.1% |
Source: Vectara hallucination leaderboard GitHub (last updated March 20, 2026). Lower hallucination rate = better.
The top result from Ant Group's finix_s1_32b at 1.8% is impressive, though this is a less-known model with limited independent benchmarking. The more notable finding from the updated leaderboard: reasoning-focused frontier models - GPT-5, Claude Sonnet 4.5, and Grok-4 - all show hallucination rates above 10% on the harder dataset. Vectara's explanation is that these models "overthink" summarization, deviating from source material in ways that smaller, more focused models don't.
This has direct implications for RAG pipelines. Our RAG Benchmarks Leaderboard covers the retrieval side; on the generation side, these hallucination rates suggest that raw intelligence and grounding faithfulness don't move together.
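The leaderboard's headline number reduces to a threshold count over per-summary factual-consistency scores. A minimal sketch, assuming HHEM-style scores in [0, 1] and an illustrative 0.5 cutoff (the leaderboard's exact protocol may differ):

```python
def hallucination_rate(consistency_scores, threshold=0.5):
    """Fraction of summaries judged hallucinated, given per-summary
    factual-consistency scores from a model such as HHEM.
    The 0.5 cutoff here is illustrative, not Vectara's spec."""
    flagged = [s for s in consistency_scores if s < threshold]
    return len(flagged) / len(consistency_scores)

# Example: 2 of 5 summaries fall below the consistency cutoff.
rate = hallucination_rate([0.92, 0.41, 0.88, 0.30, 0.77])  # 0.4
```

Because the rate is a per-document count rather than a per-claim one, a single unsupported sentence can flag an otherwise faithful summary - worth remembering when comparing models separated by fractions of a percent.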
HalluLens and HaluEval
HaluEval (RUCAIBox, EMNLP 2023) is a research benchmark with 35,000 examples across question answering, knowledge-grounded dialogue, and text summarization. It found that ChatGPT produced hallucinated content in roughly 19.5% of user queries when prompted in specific topic domains. It doesn't maintain a live leaderboard, but it's widely used in academic hallucination research as a standard evaluation set.
HalluLens (Meta FAIR, ACL 2025) goes further by distinguishing extrinsic from intrinsic hallucinations. The key result: Llama-3.1-405B-Instruct showed the lowest false acceptance rate (6.88%) on non-existent entity prompts, while some Mistral variants hit rates above 80%. GPT-4o balanced precision and recall best across tasks, scoring 52.59% accuracy on PreciseWikiQA.
The benchmark creates test sets dynamically to prevent data leakage - a design choice that matters for assessing newer models that may have trained on older static benchmarks.
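HalluLens's non-existent-entity test reduces to a false-acceptance count: the model should refuse, and answering as if the fabricated entity exists counts against it. A toy sketch - marker-phrase refusal detection stands in for the benchmark's actual judging, and the example entity and answers are fabricated:

```python
def false_acceptance_rate(responses):
    """Share of answers to non-existent-entity prompts where the model
    plays along instead of refusing. Substring-based refusal detection
    is a simplification of how a real judge would classify responses."""
    refusal_markers = ("does not exist", "no record", "not aware", "cannot find")
    accepted = [
        r for r in responses
        if not any(marker in r.lower() for marker in refusal_markers)
    ]
    return len(accepted) / len(responses)

answers = [
    "The Zorvath Institute was founded in 1923 by ...",  # fabricated detail
    "I'm not aware of any entity by that name.",         # correct refusal
    "There is no record of such an organization.",       # correct refusal
]
far = false_acceptance_rate(answers)  # 1/3
```

Under this framing, the Llama-3.1-405B figure of 6.88% means the model fabricated details for non-existent entities in fewer than 1 in 14 prompts, while the worst Mistral variants did so more than 4 times in 5.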
AA-Omniscience
Artificial Analysis released AA-Omniscience in early 2026 as a knowledge and hallucination benchmark that jointly rewards correct answers and penalizes hallucinations. The scoring metric, the AA-Omniscience Index, runs from -100 to 100: a score of 0 means the model answers incorrectly as often as it answers correctly.
The 6,000 questions span 42 economically relevant topics across six domains. A key design choice: abstaining when uncertain is rewarded, unlike benchmarks that count refusals as failures.
| Rank | Model | Provider | AA-Omniscience Index | Hallucination Rate |
|---|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview | Google | 33 | ~50% |
| 2 | Claude Opus 4.7 (Adaptive Reasoning) | Anthropic | 26 | - |
| 3 | Gemini 3 Pro Preview | Google | 16 | ~88% |
| - | Grok 4.20 (Reasoning) | xAI | - | 17% (lowest) |
| - | Claude 4.5 Haiku | Anthropic | - | 25% |
Source: artificialanalysis.ai/evaluations/omniscience. The AA-Omniscience Index measures net factual reliability; the hallucination rate is the share of non-correct responses that were wrong answers rather than abstentions.
The highest AA-Omniscience Index belongs to Gemini 3.1 Pro Preview at 33. Grok 4.20 in reasoning mode achieves the lowest raw hallucination rate at 17%, with Claude 4.5 Haiku next among the listed models at 25% - an interesting result for a smaller, non-reasoning model.
The gap between the accuracy ranking and the index ranking shows why the metric design matters. Gemini 3 Pro has a higher raw accuracy (56%) than Gemini 3.1 Pro Preview (55%), but its hallucination rate of ~88% drags the index down severely. A model that guesses more and is wrong more often gets penalized harder under this scoring system, which better reflects real-world reliability requirements.
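Those numbers are reproducible under a plausible reading of the metric: correct answers score +1, wrong answers -1, abstentions 0, scaled to the -100..100 range. This reconstruction is our assumption, not the published formula, but it lands close to the leaderboard values:

```python
def omniscience_index(accuracy, hallucination_rate):
    """Assumed reconstruction of the AA-Omniscience Index:
    (correct - incorrect) * 100, abstentions neutral.
    hallucination_rate is the share of non-correct responses that
    were wrong answers rather than abstentions."""
    incorrect = hallucination_rate * (1.0 - accuracy)
    return (accuracy - incorrect) * 100

# Gemini 3.1 Pro Preview: 55% accuracy, ~50% hallucination rate
idx_31 = omniscience_index(0.55, 0.50)  # ~32.5, vs 33 on the leaderboard
# Gemini 3 Pro: higher accuracy (56%) but ~88% hallucination rate
idx_30 = omniscience_index(0.56, 0.88)  # ~17.3, vs 16 on the leaderboard
```

The arithmetic makes the ranking flip concrete: Gemini 3 Pro's extra point of accuracy is swamped by the penalty on its much larger pool of confident wrong answers.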
Key Takeaways
No single benchmark tells the whole story
A model that ranks first on TruthfulQA may not rank first on SimpleQA, and a model that grounds faithfully to documents (FACTS Grounding) may still hallucinate during open-ended generation (AA-Omniscience). The leaderboards need to be read together. Our overall LLM rankings cover aggregate performance, but factuality is best assessed per-use-case.
Reasoning models introduce a grounding tradeoff
The Vectara finding that GPT-5, Claude Sonnet 4.5, and Grok-4 exceed 10% hallucination rates on their harder dataset is consistent with a pattern showing up across evaluations: chain-of-thought reasoning helps with tasks that need derivation, but it can hurt faithfulness on tasks that just need the model to stick to what's in front of it. If your application is document-grounded (RAG, summarization, contract review), a smaller, more focused model may serve you better than the biggest frontier reasoning model.
Benchmark contamination is a real concern for TruthfulQA
TruthfulQA's age (2021) and public availability mean that training datasets likely include examples similar to its questions. SimpleQA was designed to mitigate this by using a withheld question set. FACTS Grounding uses a private held-out set for the same reason. When evaluating new models, weight newer and harder benchmarks more heavily.
Open-source models can match or exceed closed models on factuality
Phi-3.5-MoE-instruct leads TruthfulQA. Llama-3.1-405B-Instruct performs best on HalluLens extrinsic hallucinations. Ant Group's finix_s1_32b leads the Vectara leaderboard. The narrative that frontier closed models are always more accurate doesn't hold up across these benchmarks.
Practical Guidance
For document-grounded applications (summarization, contract review, RAG): Focus on FACTS Grounding and Vectara HHEM scores. Prefer models that score above 78% on FACTS Grounding. Avoid reasoning-heavy models unless you've tested their grounding behavior on your specific document types.
For factual Q&A assistants (research tools, knowledge bases): Use SimpleQA as your primary signal. Top performers are Gemini 2.5 Pro (53.0%), the Qwen3 235B family (50.6%), and GPT-4.1 (40.4%). The field average of 20.8% means you should build retrieval augmentation into any production system rather than relying on parametric knowledge alone.
For general trust calibration: AA-Omniscience gives the most complete picture because it penalizes overconfidence. Models with a high index score are doing something real - they're either answering correctly more often, hedging appropriately, or both.
Budget-conscious options: Phi-3.5-MoE-instruct punches above its weight class on TruthfulQA. Gemini 2.5 Flash Lite leads the Vectara leaderboard among accessible models at 3.3% hallucination rate. Smaller models with factuality-focused training can outperform larger general-purpose models on specific tasks.
FAQ
Which model has the lowest hallucination rate in 2026?
On the Vectara summarization benchmark, Ant Group's finix_s1_32b leads at 1.8%. On AA-Omniscience, Grok 4.20 in reasoning mode reaches the lowest raw hallucination rate at 17%.
What is SimpleQA and why does it matter?
SimpleQA is OpenAI's 4,326-question benchmark for short-form factual accuracy. It's considered more contamination-resistant than TruthfulQA because it used a withheld question set at launch. The field average score of 20.8% shows factual recall remains a weak point across models.
Does TruthfulQA still matter in 2026?
It's useful but limited. The benchmark dates to 2021, is publicly available, and likely appears in training data. It's also small (817 questions). Use it as a sanity check, not a primary signal. SimpleQA and FACTS Grounding are more reliable for current model comparisons.
Why do reasoning models hallucinate more on summarization?
Vectara's leaderboard data suggests reasoning models deviate from source documents because their chain-of-thought process adds inferences beyond what's written. Document summarization rewards strict faithfulness, not elaboration - a task better suited to smaller, focused models.
What does the FACTS Benchmark Suite measure beyond the original?
The full FACTS Suite adds Parametric (knowledge without retrieval), Search (using web search tools), and Multimodal (image-grounded factuality) on top of the original Grounding benchmark. No model has topped 70% average across all four components as of April 2026.
Sources:
- TruthfulQA paper (arxiv 2109.07958)
- TruthfulQA leaderboard at llm-stats.com
- SimpleQA leaderboard at pricepertoken.com
- FACTS Grounding paper (arxiv 2501.03200)
- FACTS Grounding - Google DeepMind blog
- Vectara hallucination leaderboard (GitHub)
- Vectara next-gen leaderboard announcement
- HaluEval paper (arxiv 2305.11747)
- HalluLens paper (arxiv 2504.17550)
- AA-Omniscience benchmark - Artificial Analysis
- FELM paper (arxiv 2310.00741)
✓ Last verified April 17, 2026
