Hallucination Benchmarks Leaderboard: April 2026
Rankings of the top AI models on factuality and hallucination benchmarks: TruthfulQA, SimpleQA, FACTS Grounding, Vectara HHEM, HaluEval, HalluLens, and AA-Omniscience as of April 2026.

Every frontier model provider claims their system is more accurate and less hallucinatory than the competition. This leaderboard cuts through those claims by looking at what the benchmarks actually show - and the picture is messier than the marketing suggests.
No single benchmark captures the full scope of how models fail on facts. TruthfulQA tests whether models parrot common misconceptions. SimpleQA probes short-form factual recall. FACTS Grounding measures faithfulness to provided source documents. The Vectara leaderboard tracks summarization-time hallucinations. AA-Omniscience penalizes wrong answers and rewards abstention. Together they give a more honest picture of where models stand.
TL;DR
- Gemini 2.5 Pro leads SimpleQA at 53.0%; no model cracks 70% on FACTS Grounding
- Reasoning models score worse on grounded summarization - "thinking more" can actively hurt faithfulness
- Phi-3.5-MoE-instruct tops TruthfulQA at 0.775, outscoring much larger closed models on that older benchmark
Why Factuality Benchmarks Diverge
Before diving into the numbers, it helps to understand what each benchmark actually tests. They don't measure the same thing, and strong performance on one doesn't predict performance on another.
If you want the conceptual background, our guide to AI hallucinations explains the core failure modes, and our understanding AI benchmarks guide covers what benchmark scores can and can't tell you.
| Benchmark | What it measures | Format | Dataset size |
|---|---|---|---|
| TruthfulQA | Resistance to common misconceptions | Multiple-choice (MC1/MC2) | 817 questions |
| SimpleQA | Short-form factual recall | Open-ended Q&A | 4,326 questions |
| FACTS Grounding | Faithfulness to source documents (up to 32K tokens) | Long-form generation | 1,719 examples |
| Vectara HHEM | Hallucination rate in document summarization | Summarization | 7,700+ articles |
| HaluEval | Hallucination detection across QA, dialogue, summarization | Classification | 35,000 examples |
| HalluLens | Extrinsic and intrinsic hallucination taxonomy | Multi-task | Dynamic generation |
| AA-Omniscience | Factual recall across 42 topics, rewards abstention | Open-ended Q&A | 6,000 questions |
One important distinction runs through all of these: hallucination and factuality aren't the same thing. HalluLens (from Meta FAIR, published at ACL 2025) formalizes this: an extrinsic hallucination contradicts the model's own training data, while an intrinsic hallucination contradicts the context provided in the prompt. Benchmarks frequently conflate these two failure modes, which is why a model can look excellent on one test and poor on another.
TruthfulQA
TruthfulQA, introduced by Lin et al. in 2021, targets imitative falsehoods - wrong answers that feel plausible because they appear in human-written text. The 817 questions span 38 categories including health, law, finance, and politics.
The benchmark has a well-documented weakness: it's easy to game by training on similar questions, and contamination from public benchmarks is a real concern. Still, it remains widely reported and its MC2 scoring (normalized probability assigned to the set of true answers) is more robust than MC1.
| Rank | Model | Provider | TruthfulQA Score |
|---|---|---|---|
| 1 | Phi-3.5-MoE-instruct | Microsoft | 0.775 |
| 2 | Granite 3.3 8B Instruct | IBM | 0.669 |
| 3 | Phi 4 Mini | Microsoft | 0.664 |
| 4 | Phi-3.5-mini-instruct | Microsoft | 0.640 |
| 5 | Hermes 3 70B | Nous Research | 0.633 |
| 6 | Llama 3.1 Nemotron 70B Instruct | NVIDIA | 0.586 |
| 7 | Qwen2.5 14B Instruct | Alibaba Cloud | 0.584 |
| 8 | Jamba 1.5 Large | AI21 Labs | 0.583 |
| 9 | Qwen2.5 32B Instruct | Alibaba Cloud | 0.578 |
| 10 | Command R+ | Cohere | 0.563 |
Source: llm-stats.com, 17 self-reported results, no verified external evaluation.
The most striking pattern: Microsoft's Phi family controls the top three slots. Phi-3.5-MoE-instruct's 0.775 score stands well above the field. Notably absent from the top 10 are GPT-4o, Claude Opus, and Gemini - the headline frontier models. That's partly because TruthfulQA has become saturated for the largest models (which were likely trained with TruthfulQA-style data in mind), and partly because the leaderboard only has 17 submissions - closed model providers don't always report this score publicly.
The original paper documented an inverse scaling effect: larger models answered less truthfully on this benchmark, not more. That relationship has weakened with instruction tuning and RLHF, but it is a reminder that scale alone doesn't fix factuality.
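The MC2 scoring mentioned above reduces to a few lines of arithmetic: normalize the model's likelihood over all answer choices, then sum the probability mass landing on the reference-true set. A minimal sketch - the function name and inputs are ours for illustration, not the benchmark's code:

```python
import math

def mc2_score(choice_logprobs, true_indices):
    """MC2-style score: normalized probability mass assigned to the
    set of true answers, given per-choice log-probabilities."""
    # Softmax over the answer choices (shift by max for numerical stability).
    m = max(choice_logprobs)
    weights = [math.exp(lp - m) for lp in choice_logprobs]
    total = sum(weights)
    probs = [w / total for w in weights]
    return sum(probs[i] for i in true_indices)

# Example: the model puts most of its mass on the first (true) choice.
score = mc2_score([-0.5, -2.0, -3.0], true_indices=[0])  # ~0.77
```

Because the score is a probability mass rather than a hard accuracy, MC2 rewards models that merely lean toward true answers, which is part of why it is more robust than MC1's single-choice scoring.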
SimpleQA
SimpleQA, released by OpenAI in October 2024, focuses on short-form factual questions with verifiable single answers. The 4,326 questions were verified by multiple human raters and are designed to have clear, objective correct answers - not ambiguous or opinion-based queries.
This is arguably the most important factuality benchmark right now, because it's recent, resistant to training contamination (questions weren't publicly available before the benchmark launched), and covers a wide factual domain.
| Rank | Model | Provider | SimpleQA Score |
|---|---|---|---|
| 1 | Gemini 2.5 Pro | Google | 53.0% |
| 2 | Qwen3 235B A22B Instruct 2507 | Alibaba | 50.6% |
| 3 | Qwen3 VL 235B A22B Instruct | Alibaba | 46.7% |
| 4 | GPT-4.1 | OpenAI | 40.4% |
| 5 | Qwen3 Next 80B A3B Instruct | Alibaba | 40.1% |
| 6 | Grok 3 Beta | xAI | 38.3% |
| 7 | Qwen3 VL 235B A22B Thinking | Alibaba | 37.9% |
| 8 | Grok 3 | xAI | 37.4% |
| 9 | ERNIE 4.5 300B A47B | Baidu | 36.9% |
| 10 | Claude 3.7 Sonnet Thinking | Anthropic | 32.8% |
| 11 | Claude 3.7 Sonnet | Anthropic | 32.8% |
| 12 | DeepSeek R1 | DeepSeek | 29.1% |
Source: pricepertoken.com SimpleQA leaderboard, updated April 17, 2026. 45 models evaluated, average score 20.8%.
A few things stand out. First, Gemini 2.5 Pro's lead is real but not commanding - the gap between first and fourth (GPT-4.1) is roughly 13 percentage points. Second, the Qwen3 family punches above its weight, with three models in the top five. Third, the absolute scores are low across the board. The leader gets 53% right. The field average is 20.8%. These aren't figures a vendor would put on a press release, which is probably why SimpleQA scores often get buried when companies announce new models.
The Thinking variants don't consistently outperform their non-thinking counterparts on this benchmark - Claude 3.7 Sonnet Thinking ties with Claude 3.7 Sonnet at 32.8%.
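SimpleQA's grader assigns each response to one of three buckets - correct, incorrect, or not attempted - and the headline metrics fall out as simple ratios. A rough sketch of that bookkeeping (the function and field names are ours, not OpenAI's code):

```python
from collections import Counter

def simpleqa_summary(grades):
    """Summarize SimpleQA-style grades, where each grade is one of
    'correct', 'incorrect', or 'not_attempted' (assigned by a grader
    model in the real benchmark)."""
    counts = Counter(grades)
    n = len(grades)
    attempted = counts["correct"] + counts["incorrect"]
    return {
        # Share of all questions answered correctly (the leaderboard number).
        "overall_correct": counts["correct"] / n,
        # Accuracy restricted to questions the model actually tried,
        # which separates recall failures from sensible abstention.
        "correct_given_attempted": (
            counts["correct"] / attempted if attempted else 0.0
        ),
    }

grades = ["correct", "incorrect", "not_attempted", "correct"]
stats = simpleqa_summary(grades)  # overall 0.5, given-attempted ~0.67
```

The two ratios diverge whenever a model abstains often, which is exactly the behavior that benchmarks like AA-Omniscience (below) try to reward explicitly.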
FACTS Grounding
FACTS Grounding, from Google DeepMind, tests something different: given a long document (up to 32,000 tokens) and a user request, does the model answer faithfully based on what's in the document without hallucinating content not present in the source?
The benchmark uses 1,719 examples across finance, technology, medicine, law, and retail. Three LLM judges - Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet - each score responses, and the final score averages their judgments to reduce individual model bias. Responses that fail to address the user request are disqualified before scoring.
| Rank | Model | Provider | FACTS Grounding Score |
|---|---|---|---|
| 1 | Gemini 2.0 Flash Experimental | Google | 83.6% |
| 2 | Gemini 1.5 Flash | Google | 82.9% |
| 3 | Gemini 1.5 Pro | Google | 80.0% |
| 4 | Claude 3.5 Sonnet | Anthropic | 79.4% |
| 5 | GPT-4o | OpenAI | 78.8% |
Source: FACTS Grounding paper (arxiv 2501.03200), original leaderboard results from January 2025.
Google's models dominate the top three, which isn't surprising given they designed the benchmark - though the research team took care to use a multi-judge setup and include non-Google judges. Claude 3.5 Sonnet and GPT-4o are close behind, both above 78%.
The FACTS Benchmark Suite (announced in early 2026) expands this to four dimensions: Grounding v2, Parametric, Search, and Multimodal. Under that harder suite, Gemini 3 Pro leads with an overall FACTS Score of 68.8%, and no model breaks 70%. The added difficulty comes from longer documents and more complex reasoning requirements. The full suite is described on the Google DeepMind FACTS Benchmark Suite blog post.
A caveat worth noting: the research team found models rate their own outputs 3.23 percentage points higher than those from competing providers on average, which is why the multi-judge approach matters. Single-judge evaluations of grounding are suspect.
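The aggregation described above can be sketched roughly as follows: an eligibility filter, per-judge grounding verdicts, then an average across judges. The response schema and the choice to count disqualified responses as failures are our illustration, not DeepMind's published code:

```python
def facts_score(responses):
    """Aggregate FACTS-style judgments. Each response carries:
      - 'eligible': passed the 'addresses the user request' filter
      - 'judge_grounded': one grounded/not-grounded verdict per judge
    (illustrative field names, not the benchmark's actual schema)."""
    n_judges = len(responses[0]["judge_grounded"])
    per_judge_scores = []
    for j in range(n_judges):
        # Ineligible responses count against the model rather than
        # being dropped, so disqualification can't inflate the score.
        grounded = [r["eligible"] and r["judge_grounded"][j] for r in responses]
        per_judge_scores.append(sum(grounded) / len(responses))
    # Averaging across judges dilutes any single judge's self-preference bias.
    return sum(per_judge_scores) / n_judges

responses = [
    {"eligible": True, "judge_grounded": [True, True, True]},
    {"eligible": True, "judge_grounded": [True, False, True]},
    {"eligible": False, "judge_grounded": [True, True, True]},  # disqualified
]
score = facts_score(responses)  # ~0.56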
Vectara HHEM Leaderboard
Vectara's hallucination leaderboard, now in its second generation, measures how often models introduce unsupported information when summarizing documents. The evaluation uses Vectara's HHEM-2.3 model (with a FaithJudge approach for some comparisons) to score factual consistency.
The refreshed dataset expanded from 1,000 to 7,700+ articles spanning law, medicine, finance, technology, education, sports, and news. Articles now reach 32K tokens, which is a meaningful difficulty increase. The leaderboard was last updated March 20, 2026.
| Rank | Model | Provider | Hallucination Rate |
|---|---|---|---|
| 1 | finix_s1_32b | Ant Group | 1.8% |
| 2 | gpt-5.4-nano | OpenAI | 3.1% |
| 3 | gemini-2.5-flash-lite | Google | 3.3% |
| 4 | Phi-4 | Microsoft | 3.7% |
| 5 | Llama-3.3-70B-Instruct-Turbo | Meta | 4.1% |
Source: Vectara hallucination leaderboard GitHub (last updated March 20, 2026). Lower hallucination rate = better.
The top result from Ant Group's finix_s1_32b at 1.8% is impressive, though this is a less-known model with limited independent benchmarking. The more notable finding from the updated leaderboard: reasoning-focused frontier models - GPT-5, Claude Sonnet 4.5, and Grok-4 - all show hallucination rates above 10% on the harder dataset. Vectara's explanation is that these models "overthink" summarization, deviating from source material in ways that smaller, more focused models don't.
This has direct implications for RAG pipelines. Our RAG Benchmarks Leaderboard covers the retrieval side; on the generation side, these hallucination rates suggest that raw intelligence and grounding faithfulness don't move together.
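The leaderboard's headline number reduces to a threshold count over per-summary factual-consistency scores. A minimal sketch, assuming HHEM-style scores in [0, 1] and an illustrative 0.5 cutoff (the leaderboard's exact protocol may differ):

```python
def hallucination_rate(consistency_scores, threshold=0.5):
    """Fraction of summaries judged hallucinated, given per-summary
    factual-consistency scores from a model such as HHEM.
    The 0.5 cutoff here is illustrative, not Vectara's spec."""
    flagged = [s for s in consistency_scores if s < threshold]
    return len(flagged) / len(consistency_scores)

# Example: 2 of 5 summaries fall below the consistency cutoff.
rate = hallucination_rate([0.92, 0.41, 0.88, 0.30, 0.77])  # 0.4
```

Because the rate is a per-document count rather than a per-claim one, a single unsupported sentence can flag an otherwise faithful summary - worth remembering when comparing models separated by fractions of a percent.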
HalluLens and HaluEval
HaluEval (RUCAIBox, EMNLP 2023) is a research benchmark with 35,000 examples across question answering, knowledge-grounded dialogue, and text summarization. It found that ChatGPT produced hallucinated content in roughly 19.5% of user queries when prompted in specific topic domains. It doesn't maintain a live leaderboard, but it's widely used in academic hallucination research as a standard evaluation set.
HalluLens (Meta FAIR, ACL 2025) goes further by distinguishing extrinsic from intrinsic hallucinations. The key result: Llama-3.1-405B-Instruct showed the lowest false acceptance rate (6.88%) on non-existent entity prompts, while some Mistral variants hit rates above 80%. GPT-4o balanced precision and recall best across tasks, scoring 52.59% accuracy on PreciseWikiQA.
The benchmark creates test sets dynamically to prevent data leakage - a design choice that matters for assessing newer models that may have trained on older static benchmarks.
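HalluLens's non-existent-entity test reduces to a false-acceptance count: the model should refuse, and answering as if the fabricated entity exists counts against it. A toy sketch - marker-phrase refusal detection stands in for the benchmark's actual judging, and the example entity and answers are fabricated:

```python
def false_acceptance_rate(responses):
    """Share of answers to non-existent-entity prompts where the model
    plays along instead of refusing. Substring-based refusal detection
    is a simplification of how a real judge would classify responses."""
    refusal_markers = ("does not exist", "no record", "not aware", "cannot find")
    accepted = [
        r for r in responses
        if not any(marker in r.lower() for marker in refusal_markers)
    ]
    return len(accepted) / len(responses)

answers = [
    "The Zorvath Institute was founded in 1923 by ...",  # fabricated detail
    "I'm not aware of any entity by that name.",         # correct refusal
    "There is no record of such an organization.",       # correct refusal
]
far = false_acceptance_rate(answers)  # 1/3
```

Under this framing, the Llama-3.1-405B figure of 6.88% means the model fabricated details for non-existent entities in fewer than 1 in 14 prompts, while the worst Mistral variants did so more than 4 times in 5.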
AA-Omniscience
Artificial Analysis released AA-Omniscience in early 2026 as a knowledge and hallucination benchmark that jointly rewards correct answers and penalizes hallucinations. The scoring metric, the AA-Omniscience Index, runs from -100 to 100: a score of 0 means the model answers incorrectly as often as it answers correctly.
The 6,000 questions span 42 economically relevant topics across six domains. A key design choice: abstaining when uncertain is rewarded, unlike benchmarks that count refusals as failures.
| Rank | Model | Provider | AA-Omniscience Index | Hallucination Rate |
|---|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview | Google | 33 | ~50% |
| 2 | Claude Opus 4.7 (Adaptive Reasoning) | Anthropic | 26 | - |
| 3 | Gemini 3 Pro Preview | Google | 16 | ~88% |
| - | Grok 4.20 (Reasoning) | xAI | - | 17% (lowest) |
| - | Claude 4.5 Haiku | Anthropic | - | 25% |
Source: artificialanalysis.ai/evaluations/omniscience. The AA-Omniscience Index measures net factual reliability; the hallucination rate is the share of non-correct responses that were wrong answers rather than abstentions.
The highest AA-Omniscience Index belongs to Gemini 3.1 Pro Preview at 33. Grok 4.20 in reasoning mode achieves the lowest raw hallucination rate at 17%, with Claude 4.5 Haiku next among the listed models at 25% - an interesting result for a smaller, non-reasoning model.
The gap between the accuracy ranking and the index ranking shows why the metric design matters. Gemini 3 Pro has a higher raw accuracy (56%) than Gemini 3.1 Pro Preview (55%), but its hallucination rate of ~88% drags the index down severely. A model that guesses more and is wrong more often gets penalized harder under this scoring system, which better reflects real-world reliability requirements.
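Those numbers are reproducible under a plausible reading of the metric: correct answers score +1, wrong answers -1, abstentions 0, scaled to the -100..100 range. This reconstruction is our assumption, not the published formula, but it lands close to the leaderboard values:

```python
def omniscience_index(accuracy, hallucination_rate):
    """Assumed reconstruction of the AA-Omniscience Index:
    (correct - incorrect) * 100, abstentions neutral.
    hallucination_rate is the share of non-correct responses that
    were wrong answers rather than abstentions."""
    incorrect = hallucination_rate * (1.0 - accuracy)
    return (accuracy - incorrect) * 100

# Gemini 3.1 Pro Preview: 55% accuracy, ~50% hallucination rate
idx_31 = omniscience_index(0.55, 0.50)  # ~32.5, vs 33 on the leaderboard
# Gemini 3 Pro: higher accuracy (56%) but ~88% hallucination rate
idx_30 = omniscience_index(0.56, 0.88)  # ~17.3, vs 16 on the leaderboard
```

The arithmetic makes the ranking flip concrete: Gemini 3 Pro's extra point of accuracy is swamped by the penalty on its much larger pool of confident wrong answers.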
Key Takeaways
No single benchmark tells the whole story
A model that ranks first on TruthfulQA may not rank first on SimpleQA, and a model that grounds faithfully to documents (FACTS Grounding) may still hallucinate during open-ended generation (AA-Omniscience). The leaderboards need to be read together. Our overall LLM rankings cover aggregate performance, but factuality is best assessed per-use-case.
Reasoning models introduce a grounding tradeoff
The Vectara finding that GPT-5, Claude Sonnet 4.5, and Grok-4 exceed 10% hallucination rates on their harder dataset is consistent with a pattern showing up across evaluations: chain-of-thought reasoning helps with tasks that need derivation, but it can hurt faithfulness on tasks that just need the model to stick to what's in front of it. If your application is document-grounded (RAG, summarization, contract review), a smaller, more focused model may serve you better than the biggest frontier reasoning model.
Benchmark contamination is a real concern for TruthfulQA
TruthfulQA's age (2021) and public availability mean that training datasets likely include examples similar to its questions. SimpleQA was designed to mitigate this by using a withheld question set. FACTS Grounding uses a private held-out set for the same reason. When evaluating new models, weight newer and harder benchmarks more heavily.
Open-source models can match or exceed closed models on factuality
Phi-3.5-MoE-instruct leads TruthfulQA. Llama-3.1-405B-Instruct performs best on HalluLens extrinsic hallucinations. Ant Group's finix_s1_32b leads the Vectara leaderboard. The narrative that frontier closed models are always more accurate doesn't hold up across these benchmarks.
Practical Guidance
For document-grounded applications (summarization, contract review, RAG): Focus on FACTS Grounding and Vectara HHEM scores. Prefer models that score above 78% on FACTS Grounding. Avoid reasoning-heavy models unless you've tested their grounding behavior on your specific document types.
For factual Q&A assistants (research tools, knowledge bases): Use SimpleQA as your primary signal. Top performers are Gemini 2.5 Pro (53.0%), the Qwen3 235B family (50.6%), and GPT-4.1 (40.4%). The field average of 20.8% means you should build retrieval augmentation into any production system rather than relying on parametric knowledge alone.
For general trust calibration: AA-Omniscience gives the most complete picture because it penalizes overconfidence. Models with a high index score are doing something real - they're either answering correctly more often, hedging appropriately, or both.
Budget-conscious options: Phi-3.5-MoE-instruct punches above its weight class on TruthfulQA. Gemini 2.5 Flash Lite leads the Vectara leaderboard among accessible models at 3.3% hallucination rate. Smaller models with factuality-focused training can outperform larger general-purpose models on specific tasks.
FAQ
Which model has the lowest hallucination rate in 2026?
On the Vectara summarization benchmark, Ant Group's finix_s1_32b leads at 1.8%. On AA-Omniscience, Grok 4.20 in reasoning mode reaches the lowest raw hallucination rate at 17%.
What is SimpleQA and why does it matter?
SimpleQA is OpenAI's 4,326-question benchmark for short-form factual accuracy. It's considered more contamination-resistant than TruthfulQA because it used a withheld question set at launch. The field average score of 20.8% shows factual recall remains a weak point across models.
Does TruthfulQA still matter in 2026?
It's useful but limited. The benchmark dates to 2021, is publicly available, and likely appears in training data. It's also small (817 questions). Use it as a sanity check, not a primary signal. SimpleQA and FACTS Grounding are more reliable for current model comparisons.
Why do reasoning models hallucinate more on summarization?
Vectara's leaderboard data suggests reasoning models deviate from source documents because their chain-of-thought process adds inferences beyond what's written. Document summarization rewards strict faithfulness, not elaboration - a task better suited to smaller, focused models.
What does the FACTS Benchmark Suite measure beyond the original?
The full FACTS Suite adds Parametric (knowledge without retrieval), Search (using web search tools), and Multimodal (image-grounded factuality) on top of the original Grounding benchmark. No model has topped 70% average across all four components as of April 2026.
Sources:
- TruthfulQA paper (arxiv 2109.07958)
- TruthfulQA leaderboard at llm-stats.com
- SimpleQA leaderboard at pricepertoken.com
- FACTS Grounding paper (arxiv 2501.03200)
- FACTS Grounding - Google DeepMind blog
- Vectara hallucination leaderboard (GitHub)
- Vectara next-gen leaderboard announcement
- HaluEval paper (arxiv 2305.11747)
- HalluLens paper (arxiv 2504.17550)
- AA-Omniscience benchmark - Artificial Analysis
- FELM paper (arxiv 2310.00741)
✓ Last verified April 17, 2026
