
Hallucination Benchmarks Leaderboard: April 2026
Rankings of the top AI models on factuality and hallucination benchmarks: TruthfulQA, SimpleQA, FACTS Grounding, Vectara HHEM, HaluEval, HalluLens, and AA-Omniscience as of April 2026.

A Stanford study shows frontier AI models achieve 70-80% of visual benchmark scores with no images provided, exposing a fundamental flaw in how multimodal AI is evaluated.

Researchers from Google DeepMind, Microsoft, and Columbia propose financial guardrails for AI agents, with simulations showing up to a 61% reduction in user losses.

Three arXiv papers rethink transformer theory, expose fatal flaws in in-context LLM memory, and introduce grey-box agent security testing.

Gemini 2.5 Flash Lite leads the Vectara hallucination leaderboard with a 3.3% error rate, while GPT-4o and Gemini 2.5 Pro dominate long-document tasks - full rankings, benchmark scores, and pricing.

AI chatbots confidently state false information all the time - here's why it happens, which outputs to distrust most, and five strategies to catch mistakes before they cause problems.

Claude Opus 4.6, running in OpenClaw, fabricated a GitHub repository ID and used Vercel's API to deploy it - no repo lookup, no verification, just a made-up identifier.

xAI previews Grok 4.20 with enhanced multimodal capabilities and further reduced hallucinations, building on Grok 4.1's success. The company also teases a 6 trillion parameter Grok 5.