Best AI Models for Text Summarization - June 2026

TL;DR

Gemini 2.5 Flash Lite holds its #1 position at 3.3% hallucination rate on the Vectara leaderboard - but Gemini 2.0 Flash (which briefly led at 0.7%) was deprecated June 1, 2026
Best value shift since March: Mistral Large 3 (2512) dropped from ~$2.00/M to $0.50/M - now the cheapest API option with a real 262K context window
Gemini 3.5 Flash (released May 2026, $1.50/M) adds a strong 1M-context mid-tier option that wasn't there before

The best model for summarization still depends on what you mean by "best." Three months after the March rankings, the faithfulness leaders haven't changed much - Gemini 2.5 Flash Lite, Phi-4, Llama 3.3 70B, and Mistral Large remain the top performers on the Vectara grounded summarization benchmark. What has changed is the pricing picture and the available mid-tier options.

Gemini 3.5 Flash launched at Google I/O in May with 1M token context at $1.50/M - slotting cleanly between Mistral Large 3 (value) and GPT-4.1 (premium long-context). Mistral Large 3 (2512) saw a significant price cut, making it the cheapest API option with a 262K context window. And Gemini 2.0 Flash - which briefly led the original Vectara dataset at 0.7% - was deprecated on June 1.

Rankings

Rankings use the Vectara Hallucination Leaderboard (updated May 11, 2026) as the primary source for faithfulness scores, and HELMET benchmark data for long-context performance. The Vectara leaderboard now runs two datasets: the original shorter-document set, and a harder enterprise-length set (documents up to 32K tokens). Scores below reflect the enterprise dataset where available; original-dataset scores are noted.

Rank	Model	Provider	Hallucination Rate	Price (Input)	Verdict
1	Gemini 2.5 Flash Lite	Google	3.3%	$0.10/M	Top faithfulness; cheapest reliable API option
2	Phi-4	Microsoft	3.7%	~$0.07/M (self-host)	Exceptional at 14B; 16K context limits use cases
3	Llama 3.3 70B Instruct	Meta	4.1%	Free (self-host)	Best open-source faithfulness; 128K context
4	Mistral Large 3 (2512)	Mistral	4.5%	$0.50/M	Significant price cut since March; 262K context
5	GPT-4.1	OpenAI	5.6%	$2.50/M	1M context; long-doc stability leader
6	Gemini 3.5 Flash	Google	Not yet rated*	$1.50/M	New May 2026; 1M context; fast parallel processing
7	DeepSeek V4 Pro	DeepSeek	8.6%	$1.74/M	1M context; strong value for bulk summarization
8	Gemini 3.1 Pro	Google	10.4%	$2.00/M	1M context; strong narrative coherence
9	Claude Sonnet 4.6	Anthropic	10.6%	$3.00/M	Better structure and nuance; poor faithfulness
10	Claude Opus 4.8	Anthropic	est. ~10.9%	$5.00/M	Best Claude coherence; released May 28, 2026
11	GPT-5.4	OpenAI	est. ~10.8%	$2.50/M	Strong quality; frontier-scale faithfulness penalty
12	GPT-5.5	OpenAI	4.2% (Pro mode); higher w/o	$5.00/M	Extended thinking cuts hallucinations; very expensive

*Gemini 3.5 Flash was released May 19, 2026 and hasn't yet been independently scored on the Vectara leaderboard.

Detailed Analysis

Gemini 2.5 Flash Lite

Three months later, nothing has unseated Gemini 2.5 Flash Lite at 3.3% hallucination rate on the enterprise Vectara dataset. The model that briefly beat it - Gemini 2.0 Flash, at 0.7% on the original shorter-document dataset - was deprecated on June 1, 2026. That record is gone. For active deployments, Flash Lite remains the top option on faithfulness.

The pricing picture also holds. At $0.10/M input and $0.40/M output, its input price is 50x lower than Claude Opus 4.8's ($5.00/M) for identical tasks. For high-volume pipelines summarizing customer support tickets, news articles, or product reviews, the math doesn't change: Flash Lite wins on both cost and accuracy. One limitation is worth noting: on long structured documents (legal filings, multi-chapter reports), larger models catch nuance that Flash Lite misses. The faithfulness benchmark measures grounded accuracy, not summary completeness.

Mistral Large 3 (2512)

The most significant change since March isn't a new model - it's a price cut. Mistral Large 3 (2512) dropped from roughly $2.00/M input to $0.50/M. At 262K context window and 4.5% hallucination rate, it now occupies a gap that didn't exist in March: a cheap API option that handles long enterprise documents with solid faithfulness, at a fraction of GPT-4.1's cost.

For organizations that need API access (no self-hosting) and regularly summarize 100K-200K token documents, Mistral Large 3 (2512) is now the strongest value-for-accuracy option. It scores slightly worse than Gemini 2.5 Flash Lite and Phi-4 on faithfulness, but it handles longer documents than both.

Gemini 3.5 Flash

Released at Google I/O on May 19, Gemini 3.5 Flash fills the space between Mistral Large 3 and GPT-4.1. At $1.50/M input and 1M token context, it's positioned as Google's high-throughput mid-tier model. Independent evaluations show strong agentic performance - it beats Gemini 3.1 Pro on some coding and tool-use benchmarks - but specific Vectara faithfulness scores aren't published yet.

One important caveat: at maximum 1M context, Gemini 3.5 Flash degrades. Independent testing shows its MRCR score dropping from 77.3% at 128K tokens to 26.6% at the full 1M - a significant quality decline. For summarization of very long documents near the 1M token limit, GPT-4.1 or Gemini 3.1 Pro may deliver more consistent results. For documents in the 32K-256K range, it's a solid and fast choice at its price point.

GPT-4.1

GPT-4.1 remains the long-document stability leader. The HELMET benchmark (ICLR 2025) showed GPT-4o to be one of the few models that held stable summarization quality all the way to 128K tokens. GPT-4.1 extends that with 1M token context at the same price ($2.50/M), and its 5.6% hallucination rate on the enterprise Vectara dataset is respectable for a frontier model.

For documents that genuinely require the full 128K-1M range - legal agreements, medical records, long research reports - GPT-4.1 is still the most consistent API option. Gemini 3.1 Pro scores better on other benchmarks but has a 10.4% hallucination rate on the enterprise Vectara set, nearly double GPT-4.1's score. See our long-context retrieval capability page for full comparisons at scale.
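
As a rough illustration of the trade-off between context length and faithfulness, the sketch below picks the lowest-hallucination model whose context window fits a given document. The rates and context sizes are the ones quoted on this page; the `Model` class, the `pick_model` helper, and the 20% headroom reserved for instructions and output are assumptions made for illustration, not part of any vendor API. Gemini 2.5 Flash Lite is left out only because this page doesn't state its context window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    context_tokens: int        # context window as quoted on this page
    hallucination_pct: float   # enterprise Vectara rate as quoted on this page


# Only models whose context window is stated in the rankings above.
MODELS = [
    Model("Phi-4", 16_000, 3.7),
    Model("Llama 3.3 70B Instruct", 128_000, 4.1),
    Model("Mistral Large 3 (2512)", 262_000, 4.5),
    Model("GPT-4.1", 1_000_000, 5.6),
]


def pick_model(doc_tokens: int, headroom: float = 0.2) -> Model:
    """Return the most faithful listed model whose window fits the document.

    `headroom` reserves a fraction of the window for instructions and the
    generated summary (an assumed safety margin, not a vendor requirement).
    """
    fits = [m for m in MODELS if doc_tokens <= m.context_tokens * (1 - headroom)]
    if not fits:
        raise ValueError(f"No listed model fits {doc_tokens} tokens")
    return min(fits, key=lambda m: m.hallucination_pct)


for tokens in (10_000, 90_000, 180_000, 600_000):
    print(tokens, "->", pick_model(tokens).name)
```

Under these assumptions a 180K-token document lands on Mistral Large 3 (2512) and anything larger falls through to GPT-4.1. A production router would also weigh price and long-context stability, which this sketch ignores.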

Open-Source Options

Llama 3.3 70B Instruct still leads open-source faithfulness among long-context models at 4.1% - a lower hallucination rate than GPT-4.1 and far lower than frontier-scale models like Gemini 3.1 Pro and Claude. For organizations processing sensitive documents that can't go through commercial APIs, it remains the most defensible option. Phi-4 (Microsoft, ~14B parameters) scores even lower at 3.7%, but its 16K context window limits it to shorter documents.

Two additions worth noting from the latest Vectara update: Gemma 3 12B scored a 4.4% hallucination rate - impressive for a 12B self-hosted model - and Qwen3-8B came in at 4.8%. Both can run on consumer hardware and are viable options for on-device summarization. Our open-source LLM leaderboard tracks these and other self-hosted options.

[Image: dense pages of text being condensed into a single highlighted summary sheet. Caption: The core challenge in summarization is faithfulness: keeping every factual claim accurate while cutting document length by 80-90%. Source: unsplash.com]

Why Frontier Models Still Fail Faithfulness Tests

The Vectara results continue to show the same pattern from March: larger, more capable models score worse on faithfulness than smaller constrained ones. Claude Opus 4.8 scores roughly 10.9% hallucination; Gemini 2.5 Flash Lite scores 3.3%. GPT-5.5, designed for the hardest reasoning tasks, shows very high hallucination rates on standard summarization without extended thinking engaged.

Frontier models trained for broad instruction-following confabulate more during summarization because they have more world knowledge to draw from - and they use it even when the source document doesn't support it.

GPT-5.5 with extended thinking (Pro mode) drops to 4.2% on Vectara-style grounded tasks. That's a significant improvement - but it also adds latency and costs $5.00/M input. For most production summarization pipelines, the extra compute isn't worth it when Gemini 2.5 Flash Lite reaches 3.3% at $0.10/M. Extended thinking makes more sense for complex reasoning tasks, not grounded document summarization where you just need the model to stay within the source.

See our hallucination benchmarks leaderboard for detailed cross-benchmark analysis of these patterns.

Methodology

Rankings draw from two primary sources.

Vectara Hallucination Leaderboard (primary) assesses factual consistency when summarizing a passage using only information in that passage. The leaderboard now runs two datasets: the original (~7,700 articles, shorter documents) and an enterprise dataset (documents up to 32K tokens spanning law, medicine, finance, and tech), launched in late 2025. The enterprise dataset better reflects real workloads; rates are 3-10x higher across all models. Caveats: HHEM-2.3 scoring doesn't measure writing quality, structure, or completeness. A model can lead faithfulness and still produce disorganized summaries.
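
To make the metric concrete: a leaderboard-style hallucination rate is the share of summaries that a consistency scorer judges unsupported by the source. The sketch below assumes each summary already has a consistency score between 0 and 1 (for example from an HHEM-style classifier) and that scores below 0.5 count as hallucinated. The threshold, the function name, and the sample scores are illustrative assumptions, not Vectara's published pipeline.

```python
def hallucination_rate(scores: list[float], threshold: float = 0.5) -> float:
    """Percentage of summaries whose consistency score falls below `threshold`."""
    if not scores:
        raise ValueError("need at least one scored summary")
    flagged = sum(1 for s in scores if s < threshold)
    return 100.0 * flagged / len(scores)


# Hypothetical scores for ten summaries; two fall below the assumed threshold.
sample = [0.97, 0.91, 0.12, 0.88, 0.99, 0.76, 0.43, 0.95, 0.83, 0.90]
print(f"{hallucination_rate(sample):.1f}% hallucinated")
```

Note that nothing in this calculation looks at structure or completeness, which is why a model can lead on this number and still write poorly organized summaries.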

HELMET Benchmark (secondary) from Princeton NLP (ICLR 2025) tests summarization at 8K to 128K tokens across 59 models. It's the most rigorous academic evaluation for long-context summarization. Most open-source models fail badly past 32K; only a handful of closed-source models hold stable quality at 128K. Newer models (Gemini 3.5 Flash, Gemini 3.1 Pro, GPT-5.4) were released after HELMET's evaluation set was finalized.

ROUGE scores are still cited in older papers but correlate poorly with human judgment on modern LLMs. The field has moved to LLM-as-judge evaluation (G-Eval with GPT-4o as evaluator) for open-ended summarization. Use ROUGE only to compare within the same test set.
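
A toy example shows why lexical overlap is a weak proxy for faithfulness. The function below is a simplified unigram ROUGE-1 F1 (lowercase whitespace tokens, no stemming), written from scratch for illustration rather than taken from a ROUGE library, and the sentences are invented.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap on lowercase whitespace tokens."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "revenue rose 12 percent in the third quarter"
wrong_number = "revenue rose 21 percent in the third quarter"  # factual error
paraphrase = "third quarter sales grew by 12 percent"          # faithful rewording

print(f"wrong number: {rouge1_f1(reference, wrong_number):.2f}")
print(f"paraphrase:   {rouge1_f1(reference, paraphrase):.2f}")
```

The candidate with the wrong figure scores higher than the faithful paraphrase, because only one token differs from the reference. That is the failure mode that pushed evaluation toward consistency scorers and LLM judges.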

Historical Progression

2022-2023 - ROUGE-optimized fine-tuned models (BART-large-CNN, PEGASUS) lead academic benchmarks. Human judges consistently prefer GPT-3 and GPT-4 output despite lower ROUGE scores.
Early 2024 - GPT-4 Turbo and Claude 3 Opus confirm that instruction-tuned frontier LLMs beat task-specific fine-tuned models for abstractive summarization. Context windows hit 128K-200K tokens, making chunking unnecessary for most documents.
Mid-2024 - HELMET benchmark published (Princeton NLP). First rigorous proof that most open-source models collapse at 128K token summarization. GPT-4o emerges as the most consistent long-context model.
Late 2024 - Vectara publishes the Hallucination Leaderboard with HHEM-2.3 scoring. Clear evidence that faithfulness and general capability diverge - smaller models outperform larger ones on factual consistency.
2025 - Context windows hit 1M+ tokens (Gemini 2.5 Pro, Llama 4). Vectara launches a harder enterprise dataset with documents up to 32K tokens. Hallucination rates jump 3-10x across all models on the harder benchmark.
March 2026 - Gemini 2.0 Flash leads the original Vectara dataset at 0.7%. Gemini 2.5 Flash Lite holds the enterprise dataset at 3.3%. GPT-5.4-nano (a smaller variant) scores 3.1% on the original dataset.
June 2026 - Gemini 2.0 Flash deprecated June 1. Gemini 3.5 Flash launches at $1.50/M with 1M context (May 19). Mistral Large 3 (2512) price drops to $0.50/M. Claude Opus 4.8 released May 28. Gemini 2.5 Flash Lite remains the top active model at 3.3% on the enterprise dataset.

Pricing Reality Check

Summarization is token-intensive. A 10,000-word document runs roughly 13,000 input tokens; a 200-word summary is ~260 output tokens. With about 50 input tokens for every output token, input pricing dominates the bill.

Model	Input	Output	Input cost per 1K docs (10K words each)
Gemini 2.5 Flash Lite	$0.10/M	$0.40/M	$1.30
Mistral Large 3 (2512)	$0.50/M	$1.50/M	$6.50
Gemini 3.5 Flash	$1.50/M	$9.00/M	$19.50
DeepSeek V4 Pro	$1.74/M	$3.48/M	$22.62
Gemini 3.1 Pro	$2.00/M	$12.00/M	$26.00
GPT-4.1	$2.50/M	$10.00/M	$32.50
Claude Sonnet 4.6	$3.00/M	$15.00/M	$39.00
Claude Opus 4.8	$5.00/M	$25.00/M	$65.00
GPT-5.5	$5.00/M	$30.00/M	$65.00

Mistral Large 3 (2512) is the biggest pricing shift since March: it went from $2.00/M to $0.50/M, cutting the cost from $26/1K docs to $6.50. For teams needing API access with long-document capability, it's now a clear first stop before considering GPT-4.1.
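
The per-1K-document figures in the table can be reproduced with a few lines of arithmetic. The sketch below assumes the roughly 1.3 tokens-per-word ratio implied above (10,000 words to about 13,000 tokens) and a 200-word summary per document; `cost_per_1k_docs` is a hypothetical helper, not a vendor tool. It also shows that the table's figures match input cost alone, with output adding a smaller amount on top.

```python
TOKENS_PER_WORD = 1.3  # rough ratio implied by 10,000 words ~ 13,000 tokens


def cost_per_1k_docs(input_price, output_price, doc_words=10_000, summary_words=200):
    """Return (input_cost, output_cost) in USD for 1,000 documents.

    Prices are USD per million tokens, as listed in the table above.
    """
    docs = 1_000
    input_tokens = doc_words * TOKENS_PER_WORD * docs
    output_tokens = summary_words * TOKENS_PER_WORD * docs
    return (input_tokens / 1e6 * input_price, output_tokens / 1e6 * output_price)


prices = {  # model: (input $/M, output $/M), copied from the table
    "Gemini 2.5 Flash Lite": (0.10, 0.40),
    "Mistral Large 3 (2512)": (0.50, 1.50),
    "GPT-4.1": (2.50, 10.00),
    "Claude Opus 4.8": (5.00, 25.00),
}

for model, (inp, out) in prices.items():
    in_cost, out_cost = cost_per_1k_docs(inp, out)
    print(f"{model:24s} input ${in_cost:6.2f}  output ${out_cost:5.2f}")
```

For Claude Opus 4.8 the output side adds about $6.50 per 1K documents on top of the $65.00 input cost, so input still dominates, but output is not negligible at premium output prices.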

Prompt caching (supported by Anthropic, Google, and OpenAI) discounts cached input tokens by 50-90%, which helps when requests share a system prompt or repeated context - especially relevant for batch summarization pipelines with consistent instructions.
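
How much caching saves in practice depends on what share of each request is actually reusable. The sketch below is a back-of-the-envelope model, assuming a hypothetical `cached_fraction` of input tokens billed at a discounted rate. The prompt sizes and discount are illustrative, and it ignores any cache-write or storage fees a provider may charge.

```python
def effective_input_price(base_price, cached_fraction, cache_discount):
    """Blended input price ($/M tokens) when part of the prompt is served from cache.

    cached_fraction: share of input tokens that hit the cache (0..1)
    cache_discount:  price reduction on cached tokens (0.9 = 90% off)
    """
    if not (0 <= cached_fraction <= 1 and 0 <= cache_discount <= 1):
        raise ValueError("fractions must be between 0 and 1")
    cached = cached_fraction * base_price * (1 - cache_discount)
    uncached = (1 - cached_fraction) * base_price
    return cached + uncached


# A 2,000-token shared instruction block on top of a 13,000-token document:
shared, document = 2_000, 13_000
fraction = shared / (shared + document)
print(f"cached share of input: {fraction:.1%}")
print(f"blended price at $3.00/M with a 90% cache discount: "
      f"${effective_input_price(3.00, fraction, 0.9):.2f}/M")
```

In this example only the shared prefix is cached, so the blended input price falls from $3.00/M to about $2.64/M, a saving of roughly 12%. Savings near the 50-90% range only appear when most of the input repeats across requests, for example when many questions are asked of the same long document.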

For a full breakdown of API costs across all use cases, see our LLM API pricing comparison.

[Image: a bar chart on a screen showing model comparison metrics. Caption: Mistral Large 3 (2512)'s price drop to $0.50/M shifts it from the "expensive API" tier to direct competition with self-hosted options. Source: unsplash.com]

FAQ

What's the most faithful AI model for summarization?

Gemini 2.5 Flash Lite, at a 3.3% hallucination rate on the Vectara enterprise leaderboard. That puts it ahead of Phi-4 (3.7%) and Llama 3.3 70B (4.1%), and well ahead of GPT-4.1 (5.6%) and frontier models like Claude Sonnet 4.6 (10.6%).

Which model handles the longest documents?

Gemini 3.5 Flash, Gemini 3.1 Pro, GPT-4.1, and DeepSeek V4 Pro all support 1M token context. GPT-4.1 has the best proven long-doc stability, building on GPT-4o's HELMET results. Gemini 3.5 Flash degrades noticeably toward the 1M limit in independent testing, with its MRCR score falling from 77.3% at 128K tokens to 26.6% at 1M.

Is open-source competitive for summarization?

For documents under 128K tokens, Llama 3.3 70B (4.1% hallucination) beats most commercial models on faithfulness. Gemma 3 12B (4.4%) and Qwen3-8B (4.8%) are strong self-hosted options for shorter documents. Infrastructure cost is the main barrier, not quality.

Why does Claude Opus 4.8 score worse than Phi-4 on faithfulness?

Frontier models trained on varied instruction-following tasks are more prone to extending source material with plausible facts. Smaller constrained models stay closer to the input text. This is a consistent finding across every Vectara leaderboard update.

How often do summarization model rankings change?

The Vectara leaderboard updates monthly. Rankings shift every 1-3 months as new models release and pricing changes. Check the last-verified date on this page and the hallucination benchmarks leaderboard before committing a model to a production pipeline.

What's the best budget option?

Gemini 2.5 Flash Lite ($0.10/M input) leads both faithfulness and cost. For API access with longer documents, Mistral Large 3 (2512) at $0.50/M is now the strongest budget-tier option. For self-hosting with no per-token API fees, Phi-4 or Llama 3.3 70B are the top picks, though you still pay for the infrastructure.

Sources: