Best AI Models for Text Summarization - March 2026
Gemini 2.5 Flash Lite leads the Vectara hallucination leaderboard at 3.3% error rate while GPT-4o and Gemini 2.5 Pro dominate long-document tasks - full rankings, benchmark scores, and pricing.

TL;DR
- Gemini 2.5 Flash Lite has the lowest hallucination rate in summarization tasks at 3.3% on the Vectara leaderboard - better than Claude Opus, GPT-5, and most frontier models
- For long documents (100K+ tokens), GPT-4o and Gemini 2.5 Pro are the only models that don't degrade badly, per HELMET benchmark results from ICLR 2025
- Best budget pick: Gemini 2.5 Flash Lite at $0.10/$0.40 per million tokens combines low cost with the top faithfulness score
The best model for summarization depends almost completely on what you mean by "best." If you want faithful output that doesn't invent facts, smaller and more constrained models beat frontier giants by a wide margin. If you need to summarize a 200-page legal agreement without chunking, only a handful of models stay coherent through the whole thing.
These two requirements point in opposite directions. Gemini 2.5 Flash Lite leads on faithfulness. Gemini 2.5 Pro (1M token context) leads on very long documents. GPT-4o holds up through 128K tokens while maintaining output quality that smaller models can't match. There's no single answer - but there's a clear answer for each use case.
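That per-use-case split can be expressed as a simple routing rule. A minimal sketch - the model names come from the rankings below, but the token thresholds are illustrative assumptions, not benchmark-derived cutoffs:

```python
def pick_summarization_model(doc_tokens: int) -> str:
    """Route a document to a summarization model by length.

    Thresholds are illustrative assumptions, not benchmark cutoffs.
    """
    if doc_tokens <= 50_000:
        # Short-to-medium docs: take the lowest hallucination rate.
        return "gemini-2.5-flash-lite"
    if doc_tokens <= 128_000:
        # GPT-4o holds stable through 128K per HELMET.
        return "gpt-4o"
    # Beyond 128K, a 1M-token-context model is needed.
    return "gemini-2.5-pro"
```

In production this would sit in front of the API client, with the 50K cutoff tuned to wherever your own review data shows the smaller model's faithfulness advantage fading.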
Rankings
The primary ranking uses the Vectara Hallucination Leaderboard (updated March 10, 2026), which measures factual consistency in summarization tasks across 7,700+ documents at temperature 0. For long-context performance, HELMET benchmark results (ICLR 2025) are noted separately.
| Rank | Model | Provider | Hallucination Rate | Price (Input) | Verdict |
|---|---|---|---|---|---|
| 1 | Gemini 2.5 Flash Lite | Google | 3.3% | $0.10/M | Best faithfulness; also cheapest frontier option |
| 2 | Phi-4 | Microsoft | 3.7% | ~$0.07/M (self-host) | Exceptional for 14B params; 16K context limit |
| 3 | Llama 3.3 70B Instruct | Meta | 4.1% | Free (self-host) | Best open-source faithfulness; 128K context |
| 4 | Mistral Large (Dec 2025) | Mistral | 4.5% | $2.00/M | 256K context; strong long-doc faithfulness |
| 5 | GPT-4.1 | OpenAI | 5.6% | $2.50/M | 1M context; solid all-rounder |
| 6 | Gemini 2.5 Pro | Google | 7.0% | $1.25/M | 1M context; strong long-doc quality |
| 7 | Gemini 2.5 Flash | Google | 7.8% | $0.30/M | 1M context; fast; faithfulness drops vs Lite |
| 8 | GPT-4o | OpenAI | 9.6% | $2.50/M | HELMET leader at 128K; stable on long docs |
| 9 | Claude Haiku 4.5 | Anthropic | 9.8% | $0.25/M | Cheapest Claude; weakest faithfulness |
| 10 | Claude Sonnet 4.6 | Anthropic | 10.3% | $3.00/M | Better nuance; worse faithfulness than Haiku |
| 11 | Claude Opus 4.6 | Anthropic | 10.9% | $5.00/M | Best Claude for long-doc coherence; poor faithfulness score |
| 12 | GPT-5 | OpenAI | 10.5-15.1% | Varies | Strong quality; surprisingly high hallucination rate |
Detailed Analysis
Gemini 2.5 Flash Lite
The Vectara leaderboard puts Gemini 2.5 Flash Lite at 3.3% hallucination rate - the best among major commercial models and more than 6 percentage points below GPT-4o (9.6%). That gap matters in production. If you're running 10,000 document summaries per day, the difference between 3.3% and 9.6% is the difference between 330 and 960 summaries that need review.
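That arithmetic is worth making explicit, since it directly drives review-queue staffing. A trivial check using the Vectara rates quoted above:

```python
def expected_reviews(daily_docs: int, hallucination_rate: float) -> int:
    """Expected number of daily summaries flagged for human review."""
    return round(daily_docs * hallucination_rate)

# Vectara rates: Gemini 2.5 Flash Lite 3.3%, GPT-4o 9.6%
flash_lite_reviews = expected_reviews(10_000, 0.033)  # 330 per day
gpt4o_reviews = expected_reviews(10_000, 0.096)       # 960 per day
```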
The pricing picture makes it more compelling. At $0.10 per million input tokens and $0.40 output, it's an order of magnitude cheaper than Claude Opus 4.6 ($5.00/$25.00). For most routine summarization - customer support tickets, news articles, product reviews - Flash Lite hits the right combination of cost, speed, and factual accuracy.
The caveat isn't context length - at 1M tokens, that's not the constraint. The issue is that faithfulness scores in the Vectara leaderboard measure short-to-medium document summarization. For very long inputs where the model needs to synthesize across dozens of pages, the gap between Flash Lite and larger models is less clear, and HELMET benchmark data (which specifically tests long-context summarization) doesn't include Flash Lite in its rankings. For anything under 50K tokens, Flash Lite is hard to beat on value.
GPT-4o
GPT-4o scores 9.6% hallucination rate on Vectara - well below the best performers - but its strength is long-document stability. The HELMET benchmark (Princeton NLP, ICLR 2025) assessed 59 models on summarization tasks across context lengths of 8K, 16K, 32K, 64K, and 128K tokens. GPT-4o (August 2024) is one of the very few models that maintains stable performance all the way through 128K tokens. Open-source models collapse; many closed models degrade noticeably past 32K.
For meeting transcripts (often 50-100K tokens), financial documents, legal filings, and medical records, GPT-4o's consistency at length is worth the faithfulness trade-off - especially for dense documents where a model that degrades mid-context would drop or conflate material.
See our long-context retrieval capability page for how GPT-4o and Gemini 2.5 Pro compare on retrieval tasks at full context length.
Gemini 2.5 Pro
Gemini 2.5 Pro sits at 7.0% hallucination rate on Vectara - better than GPT-4o but worse than Flash Lite. Where it separates from the pack is context: 1M tokens, matching Gemini 2.5 Flash but with stronger overall quality. For book-length summarization, full legal agreements, or research corpora, it's currently the strongest commercial option.
The pricing is reasonable for what you get: $1.25 per million input tokens for documents under 200K tokens (context pricing increases above that). Compared to Claude Opus 4.6 ($5.00/M) with its 10.9% hallucination rate, Gemini 2.5 Pro delivers better faithfulness at lower cost for the same long-document tasks.
Our long-context benchmarks leaderboard has detailed scoring for both models across HELMET and ZeroSCROLLS.
Open-Source Options
Llama 3.3 70B Instruct leads open-source faithfulness at 4.1% - below GPT-4.1 (5.6%) and well below GPT-4o (9.6%). For organizations that need to summarize sensitive documents and can't send data to commercial APIs, Llama 3.3 70B is the most defensible choice available.
Phi-4 (Microsoft, ~14B parameters) hits 3.7% hallucination - the second-best score among all models - but its 16K context window limits it to shorter documents. It's exceptional for article-length summarization at minimal compute cost.
For long-context open-source work, Mistral Large (December 2025 release, 256K context) scores 4.5% on Vectara. Qwen3-30B-A3B offers 256K context at comparable cost through API providers ($0.10/$0.40). Our open-source LLM leaderboard tracks current rankings for self-hosted options.
The core challenge in summarization is faithfulness: keeping every factual claim accurate while cutting document length by 80-90%.
Why Frontier Models Fail Faithfulness Tests
The Vectara results have a counterintuitive pattern: more capable models score worse on faithfulness. Claude Opus 4.6 (10.9% hallucination) performs significantly worse than Llama 3.3 70B (4.1%). GPT-5 variants log hallucination rates between 10.5% and 15.1%. Gemini 2.5 Flash Lite (3.3%) beats Gemini 2.5 Pro (7.0%).
Faithfulness does not scale with general capability. The models that answer harder questions more fluently also confabulate more often when summarizing documents.
The explanation is that frontier models trained for broad instruction-following have learned to produce confident, fluent completions even when the source material doesn't support them. Smaller, constrained models are more likely to stay within the document's scope because they don't have the same reserve of world knowledge to draw from.
This matters for production summarization pipelines. A model that adds plausible-sounding but wrong context to a legal summary is worse than a model that produces a shorter summary missing some detail. Faithfulness errors in high-stakes domains (medical, legal, financial) are clearly more costly than omissions.
Methodology
Rankings draw from two primary sources.
Vectara Hallucination Leaderboard (primary) assesses models on factual consistency when summarizing a passage using only information in that passage. The leaderboard uses 7,700+ documents at temperature 0 and scores consistency with the HHEM-2.3 model. It's updated monthly. Caveats: the dataset skews toward short-to-medium documents (under 4K tokens); long-document faithfulness may differ. It doesn't test writing quality, structure, or completeness. A model can score well here and still produce poorly organized summaries.
HELMET Benchmark (secondary, long-context only) from Princeton NLP (ICLR 2025) tests 59 models across 7 task categories including summarization at 8K to 128K tokens, using Multi-LexSum and InfBench datasets. It's the most rigorous academic evaluation for long-context summarization and the source for claims about which models hold up at 128K tokens.
One structural limitation: ROUGE scores (still cited in older papers) correlate poorly with human judgment for modern LLMs. The field has moved toward LLM-as-judge evaluation (G-Eval using GPT-4o as evaluator) for open-ended summarization tasks. ROUGE remains useful for comparing within the same test set but shouldn't be compared across different benchmarks.
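For teams building their own evaluation, an LLM-as-judge check reduces to constructing a judging prompt and parsing the returned score. A hedged sketch - the prompt wording here is illustrative, not the published G-Eval template, and the actual judge call (e.g. to GPT-4o) is left out:

```python
def faithfulness_prompt(source: str, summary: str) -> str:
    """Build a G-Eval-style consistency prompt (illustrative wording)."""
    return (
        "Rate the summary's factual consistency with the source "
        "on a 1-5 scale, where 5 means every claim in the summary "
        "is supported by the source. Reply with the number only.\n\n"
        f"Source:\n{source}\n\nSummary:\n{summary}\n\nScore:"
    )

def parse_score(judge_reply: str) -> int:
    """Extract and validate the 1-5 score from the judge's reply."""
    score = int(judge_reply.strip().split()[0])
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return score
```

Averaging scores over several judge samples (or weighting by token probabilities, as G-Eval does) smooths out the judge's own variance.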
Historical Progression
2022-2023 - ROUGE-optimized fine-tuned models (BART-large-CNN, PEGASUS) lead academic benchmarks. Abstractive quality is high by automatic metrics but human judges consistently prefer GPT-3 and GPT-4 output.
Early 2024 - GPT-4 Turbo and Claude 3 Opus establish that instruction-tuned frontier LLMs outperform task-specific fine-tuned models for abstractive summarization. Context windows reach 128K-200K tokens, making chunking unnecessary for most real-world documents.
Mid-2024 - HELMET benchmark published (Princeton NLP). First rigorous proof that most open-source models completely collapse at 128K token summarization tasks. GPT-4o emerges as the most consistent long-context model.
Late 2024 - Vectara publishes the Hallucination Leaderboard with HHEM-2.3 scoring. First clear evidence that faithfulness and general capability diverge - smaller models are more trustworthy for factual summarization than larger ones.
2025 - Context windows reach 1M+ tokens (Gemini 2.5 Pro, Llama 4). Faithfulness research accelerates with clinical studies achieving under 2% hallucination rates through prompt engineering. Gemini 2.5 Flash Lite emerges as the top faithfulness performer in production settings.
March 2026 - Gemini 2.5 Flash Lite leads the Vectara leaderboard at 3.3%. No single model leads across all dimensions. The market is routing different document types to different models - small/fast for high-volume news; long-context frontier for legal/medical.
Pricing Reality Check
Summarization is token-intensive. A 10,000-word document runs roughly 13,000 input tokens; a concise 200-word summary is ~260 output tokens. Output-to-input ratio is low, which means input pricing matters more than output pricing for this task.
| Model | Input | Output | Cost per 1K docs (10K words each, input tokens only) |
|---|---|---|---|
| Gemini 2.5 Flash Lite | $0.10/M | $0.40/M | $1.30 |
| Claude Haiku 4.5 | $0.25/M | $1.25/M | $3.25 |
| Gemini 2.5 Flash | $0.30/M | $2.50/M | $3.90 |
| Gemini 2.5 Pro | $1.25/M | $10.00/M | $16.25 |
| GPT-4o | $2.50/M | $10.00/M | $32.50 |
| Claude Sonnet 4.6 | $3.00/M | $15.00/M | $39.00 |
| Claude Opus 4.6 | $5.00/M | $25.00/M | $65.00 |
At $1.30 per 1,000 documents, Gemini Flash Lite runs 50x cheaper than Claude Opus for the same task. Prompt caching (supported by Anthropic, Google, and OpenAI) reduces input costs 50-90% for documents with shared system prompts or repeated context, which can shift the economics further.
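The table's per-1K-document figures can be reproduced from per-token prices. A sketch using the assumptions stated above (~1.3 tokens per word, input tokens only; the ~260 output tokens per summary add a smaller second-order cost):

```python
def input_cost_per_1k_docs(input_price_per_m_tokens: float,
                           words_per_doc: int = 10_000,
                           tokens_per_word: float = 1.3) -> float:
    """USD input-token cost to summarize 1,000 documents."""
    total_tokens = words_per_doc * tokens_per_word * 1_000
    return total_tokens / 1_000_000 * input_price_per_m_tokens

# Matches the table: Flash Lite at $0.10/M -> $1.30; Opus at $5.00/M -> $65.00
```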
For a full breakdown of API costs across all use cases, see our LLM API pricing comparison.
Cost-per-document varies 50x across models for a standard summarization task - pricing differences dwarf quality differences for most routine workloads.
FAQ
What's the most faithful AI model for summarization?
Gemini 2.5 Flash Lite at 3.3% hallucination rate on the Vectara leaderboard - lower than Phi-4 (3.7%), Llama 3.3 70B (4.1%), and far below GPT-4o (9.6%) and Claude Opus 4.6 (10.9%).
Which model handles the longest documents?
Gemini 2.5 Pro and GPT-4.1 both support 1M token context. GPT-4o is proven stable at 128K by HELMET benchmarks. Open-source models degrade significantly past 32K tokens in controlled evaluations.
Is open-source good enough for summarization?
For documents under 128K tokens, Llama 3.3 70B (4.1% hallucination) beats most commercial models. Phi-4 at 14B parameters is excellent for article-length summaries. Both require infrastructure to self-host, but the quality is there.
Why does Claude Opus score worse than smaller models on faithfulness?
Frontier models have broader world knowledge and are more prone to supplementing source material with plausible-sounding facts. Smaller models stay closer to the input because they don't have the same reserve to draw from. This is a known finding from the Vectara hallucination research.
How often do summarization model rankings change?
The Vectara leaderboard updates monthly. Rankings shift every 1-3 months as new models release. Check before committing a model to a production summarization pipeline.
What's the cheapest model that still produces usable summaries?
Gemini 2.5 Flash Lite ($0.10/M input) leads both faithfulness and cost. For self-hosted zero-cost summarization, Phi-4 or Llama 3.3 70B are the strongest options.
Sources:
- Vectara Hallucination Leaderboard - GitHub
- HELMET Benchmark - Princeton NLP
- HELMET arXiv paper (ICLR 2025)
- ZeroSCROLLS / LongBench-v2 - EmergentMind
- Best Open Source LLMs for Summarization - SiliconFlow
- LLM API Pricing 2026 - tldl.io
- AI API Pricing Comparison 2026 - dev.to
- Hallucination detection in summarization - Nature
- Clinical LLM summarization study - Nature npj Digital Medicine
- Evaluating LLMs for Text Summarization - SEI CMU
✓ Last verified March 11, 2026
