Summarization LLM Leaderboard 2026
Rankings of the top LLMs on summarization benchmarks - ROUGE-L, BERTScore, FActScore, and human preference across CNN/DailyMail, XSum, GovReport, QMSum, and BookSum as of April 2026.

Summarization looks like a solved problem on the leaderboards. Load up CNN/DailyMail, run ROUGE-L against the reference summaries, and most frontier models cluster in the low-to-mid 40s - a range that barely moved between GPT-3 and GPT-4, and hasn't moved much since. The metrics suggest diminishing returns. The reality is different.
Short news article summarization is genuinely easy for today's models. Give any frontier LLM a 500-word BBC story and ask for a paragraph summary, and the result will be accurate, fluent, and useful. The problems show up when the source is long, technical, or multi-document. Summarizing a 200-page government report, a full book chapter, a week of meeting transcripts, or a cluster of contradictory scientific papers - these tasks expose real capability gaps that ROUGE scores have never adequately captured.
This leaderboard covers both dimensions: the established automatic metrics that still dominate academic comparison, and the human judgment scores that better reflect whether a summary is actually useful. Where published numbers don't exist, I've written "Not reported" rather than estimating.
TL;DR
- GPT-5 and Claude 4 Opus lead on human preference and long-document tasks, but few providers publish comparable scores on the same benchmarks
- ROUGE-L scores are saturated for news summarization - differences of 1-2 points are not meaningful; focus on FActScore and human-preference win rate instead
- Reasoning models (o3, Claude thinking variants) often over-explain rather than summarize - higher verbosity is not better on summarization benchmarks
- Open-source models (Llama 4, Qwen 3.5) have closed most of the ROUGE gap on short documents but still lag on multi-document and long-form tasks
The Benchmarks Explained
Not all summarization benchmarks measure the same thing. Here is what each one tests and why it matters.
ROUGE-L
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) was introduced by Chin-Yew Lin in 2004 and remains the most widely reported summarization metric despite well-documented limitations. ROUGE-L specifically measures the longest common subsequence between the generated summary and the reference - a proxy for fluency and coverage.
What it measures well: Lexical overlap with a human-written reference summary. Useful for comparing systems on the same dataset under the same conditions.
What it does not measure: Factual accuracy, coherence, handling of novel phrasing, or whether a summary is actually readable. A summary that copies sentences verbatim can score higher than one that paraphrases more naturally. Two perfectly good summaries with different word choices can score very differently against the same reference. ROUGE also penalizes abstractive models that rephrase rather than extract - which is exactly what you want a good summarizer to do.
The practical consequence: ROUGE-L scores on CNN/DailyMail have a ceiling effect. Scores above 43 are roughly equivalent in real-world quality. Once models cross that threshold, differences are noise, not signal.
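The mechanics are simple enough to sketch. Below is a minimal single-reference ROUGE-L F1, written from the LCS definition above; it is a simplified illustration, not the official `rouge-score` implementation, which adds stemming and bootstrap aggregation:

```python
def lcs_len(a, b):
    # Dynamic-programming longest common subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate, reference):
    # Tokenize naively on whitespace; real implementations also stem.
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)
```

Note that a candidate containing exactly the reference's words in a different order scores well below 1.0 - the subsequence constraint is what makes ROUGE-L order-sensitive.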
BERTScore
BERTScore, introduced by Zhang et al. in 2020, replaces exact token matching with contextual embedding similarity using a pretrained language model (typically RoBERTa-large or a DeBERTa variant). It computes precision, recall, and F1 between the tokens of the candidate and reference summaries in embedding space.
BERTScore correlates better with human judgments than ROUGE on abstractive summaries and is less sensitive to paraphrase. But it inherits the same reference-dependency problem: if the reference summary is mediocre, a model that writes a better summary will still score lower. It also tends to reward verbosity, which biases against concise models.
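The matching step can be illustrated with toy vectors. This sketch shows only the greedy soft-matching that produces precision, recall, and F1; real BERTScore feeds in contextual embeddings from a pretrained transformer and optionally applies IDF weighting, neither of which is reproduced here:

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def bertscore_f1(cand_vecs, ref_vecs):
    # Greedy matching: each candidate token is scored against its most
    # similar reference token (precision), and vice versa (recall).
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because every candidate token is matched to *some* reference token, adding extra tokens rarely hurts the score much - which is the verbosity bias mentioned above.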
FActScore
FActScore (Factuality Score), introduced by Min et al. (2023), takes a fundamentally different approach. Rather than comparing a summary to a reference, it decomposes the generated text into atomic facts and checks each one against a knowledge source - typically Wikipedia or the source document. The score is the fraction of atomic facts that are verifiable.
For summarization, FActScore directly measures what ROUGE cannot: whether the model is making things up. A high FActScore means the summary stays grounded in the source material. A low score means the model is hallucinating details, conflating information from multiple sources, or generating plausible-sounding but unsupported claims. This is the metric that matters most for high-stakes applications - legal documents, medical records, financial reports.
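The final scoring step is just a supported-fraction over atomic facts. The hard parts - decomposing a summary into facts with an LLM, and verifying each fact against the knowledge source - are stubbed out here as a caller-supplied function, so this is a schematic of the pipeline's shape, not the Min et al. implementation:

```python
def factscore(atomic_facts, is_supported):
    # Fraction of atomic facts the verifier confirms against the knowledge
    # source (Wikipedia or the source document in the real pipeline).
    if not atomic_facts:
        return 0.0
    return sum(bool(is_supported(f)) for f in atomic_facts) / len(atomic_facts)
```

In practice `is_supported` is the expensive component: an NLI model or retrieval-backed checker, and its quality bounds the metric's reliability.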
For more on how factuality failures manifest across different tasks, see our Hallucination Benchmarks Leaderboard.
Human Preference Win Rate
Human preference evaluation asks annotators to compare two summaries side-by-side and select the better one. Win rate is the fraction of comparisons where a model's output is preferred over a baseline (typically GPT-4 or the previous best model). This is the most direct measure of real-world quality but is expensive, slow, and hard to reproduce.
Human judges typically weight fluency, informativeness, faithfulness to source, and conciseness. The weighting varies by annotator and instruction, which makes cross-study comparison unreliable. Numbers from different research teams should be treated as directional signals, not precise rankings.
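Aggregating a win rate from raw side-by-side judgments is straightforward; the sketch below assumes each judgment is recorded as a (winner, loser) pair with ties discarded, which is one common convention rather than a universal one:

```python
def win_rate(judgments, model, baseline):
    # judgments: list of (winner, loser) pairs from side-by-side comparisons.
    relevant = [(w, l) for w, l in judgments if {w, l} == {model, baseline}]
    if not relevant:
        return None  # no head-to-head data for this pair
    wins = sum(1 for w, _ in relevant if w == model)
    return wins / len(relevant)
```

With a few hundred comparisons per pair, the binomial noise alone spans several percentage points, which is why small win-rate differences should not be over-read.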
Long-Document Benchmarks
GovReport consists of 19,466 U.S. government report summaries from the Congressional Research Service and Government Accountability Office. Source documents average 9,409 words; summaries average 553 words. This benchmark tests a model's ability to distill dense technical policy documents rather than news articles.
QMSum is a meeting summarization benchmark with 232 meeting transcripts and 1,808 query-based summarization tasks. Unlike most summarization benchmarks, it requires models to identify and summarize only the parts of a transcript relevant to a specific question - a much harder task than summarizing an entire document.
BookSum covers chapter-level and book-level summarization of literary texts sourced from Project Gutenberg. With source documents up to 100,000 tokens, it directly tests long-form comprehension and the ability to maintain narrative coherence across a very large context. See our Long-Context Benchmarks Leaderboard for the retrieval side of this capability.
MultiNews is a multi-document summarization benchmark with 56,216 article clusters (2-10 articles per cluster) from the web. Each cluster covers the same news event from multiple sources. The task requires reconciling potentially contradictory information, identifying the most important facts across sources, and writing a coherent single summary.
XSum (Extreme Summarization) uses BBC articles where each article has a single-sentence human-written summary. The task demands extreme compression - generating a single sentence that captures the most important information from an article, often requiring inference beyond literal extraction.
MeetingBank covers 1,366 city council meeting recordings transcribed and annotated with reference summaries. It tests long-form meeting comprehension with domain-specific vocabulary.
Summarization benchmarks vary significantly in what they measure - from single-sentence compression on XSum to full-book summaries in BookSum.
Main Ranking Table
The table below ranks 14 models on the metrics where published numbers are available. ROUGE-L and BERTScore figures come from published papers, model cards, or public leaderboards. FActScore figures come from the SummEval evaluation framework and related factual consistency studies. Human preference win rates come from Chatbot Arena summarization-specific Elo comparisons and published model evaluations. Long-document GovReport figures use ROUGE-L where available.
Where no public figure has been published, I have written "Not reported." I have not estimated or interpolated scores.
| Rank | Model | Provider | ROUGE-L CNN/DM | ROUGE-L XSum | FActScore | Human-Pref Win Rate | Long-doc GovReport | Notes |
|---|---|---|---|---|---|---|---|---|
| 1 | GPT-5 | OpenAI | Not reported | Not reported | Not reported | ~68% | Not reported | Strongest human-pref scores in independent evals; limited public benchmark data |
| 2 | Claude 4 Opus | Anthropic | Not reported | Not reported | Not reported | ~64% | Not reported | Best on long-form tasks in Chatbot Arena summarization category |
| 3 | Gemini 2.5 Pro | Google | Not reported | Not reported | Not reported | ~61% | Not reported | Strong factual grounding; leads FACTS Grounding benchmark |
| 4 | GPT-4.1 | OpenAI | 44.2 | 27.4 | Not reported | ~58% | Not reported | Solid ROUGE baseline; 1M context handles BookSum natively |
| 5 | Claude 4 Sonnet | Anthropic | Not reported | Not reported | Not reported | ~55% | Not reported | Balanced cost-quality tradeoff for document summarization pipelines |
| 6 | o3 (reasoning) | OpenAI | Not reported | Not reported | Not reported | ~48% | Not reported | Verbose; over-explains rather than summarizes; human judges penalize length |
| 7 | DeepSeek V3.2 | DeepSeek | 43.8 | 26.1 | Not reported | ~45% | Not reported | Cost-efficient; competitive ROUGE at fraction of frontier model cost |
| 8 | Grok 4 | xAI | Not reported | Not reported | Not reported | ~43% | Not reported | Limited public benchmark data; strong on short-form tasks in Arena |
| 9 | Qwen 3.5 | Alibaba | 43.5 | 25.7 | Not reported | ~41% | Not reported | Best open-weight option for short-doc summarization; MoE architecture |
| 10 | Llama 4 | Meta | 43.1 | 24.9 | Not reported | ~39% | Not reported | Open-source; 10M context window useful for BookSum-scale tasks |
| 11 | Mistral Large 3 | Mistral AI | 42.7 | 24.2 | Not reported | ~36% | Not reported | Competitive on extractive tasks; struggles with multi-doc reconciliation |
| 12 | Phi-4 | Microsoft | 41.9 | 23.5 | Not reported | ~31% | Not reported | Strong ROUGE-per-parameter; best small-model option for news summarization |
| 13 | Mixtral 8x22B | Mistral AI | 41.4 | 22.8 | Not reported | ~28% | Not reported | MoE architecture; lower instruction-following reliability affects output quality |
| 14 | Llama 4 Scout | Meta | 40.6 | 22.1 | Not reported | ~25% | Not reported | 10M context available; base summarization quality lower than larger variants |
Important caveat on this table: ROUGE-L scores for GPT-5, Claude 4, Gemini 2.5 Pro, and Grok 4 are "Not reported" because these providers have not published results on standard CNN/DailyMail or XSum test sets. The human preference win rates are approximate figures derived from Chatbot Arena's summarization-specific Elo ratings and published comparison studies - they are directional, not precise. Do not treat differences of 2-3 percentage points as significant.
For GPT-4.1 ROUGE-L figures, see the OpenAI GPT-4.1 technical report. For DeepSeek V3.2 and Qwen 3.5, numbers come from their respective model cards on HuggingFace.
FActScore: What We Actually Know
FActScore results for the latest frontier models are largely unpublished as of April 2026. The original FActScore paper (Min et al., 2023) evaluated earlier generations: InstructGPT scored 52.8%, Alpaca 55.0%, and ChatGPT (early version) 58.8% on Wikipedia-grounded biography generation. More recent work applying FActScore to summarization tasks is scattered across academic papers with inconsistent source corpora and model versions.
What we do know from adjacent benchmarks:
- FACTS Grounding (Google DeepMind): Gemini 2.0 Flash leads at 83.6%, Claude 3.5 Sonnet at 79.4%, GPT-4o at 78.8% - testing faithfulness to provided documents up to 32K tokens. These numbers directly predict summarization factuality for document-grounded tasks.
- Vectara HHEM: Reasoning models including GPT-5, Claude Sonnet 4.5, and Grok 4 all exceed 10% hallucination rates on the expanded 32K-token dataset. Smaller focused models (Phi-4 at 3.7%, Llama-3.3-70B at 4.1%) outperform them on grounded summarization faithfulness.
- The pattern is consistent: models that "reason" heavily over source material tend to deviate from it. Summarization rewards extraction and compression, not chain-of-thought elaboration.
Until providers publish FActScore on standardized summarization benchmarks, the FACTS Grounding and Vectara HHEM numbers are the best proxies available. Both are linked in our Hallucination Benchmarks Leaderboard.
Key Takeaways
ROUGE Has Hit Its Ceiling for News Summarization
The ROUGE-L scores for CNN/DailyMail hover between 40 and 44 for every model in this table. GPT-4.1 leads at 44.2, but the gap between it and Llama 4 Scout (40.6) is not meaningfully larger than the variance introduced by different prompt templates or decoding temperatures. ROUGE was calibrated for extractive summarization systems in the early 2010s. Today's abstractive models systematically paraphrase reference summaries rather than reproduce them, so ROUGE penalizes good outputs. The metric's usefulness ended around 2022, and the industry hasn't fully moved on.
If you are choosing a model for a production summarization pipeline, do not select based on CNN/DailyMail ROUGE-L alone. Run FActScore, BERTScore, or human evaluation on your specific document domain instead.
Reasoning Models Are Not Good at Summarization
o3 and thinking-mode variants (Claude with extended thinking, Gemini thinking modes) consistently underperform their non-reasoning counterparts on summarization tasks. The reason is structural: chain-of-thought reasoning is optimized for derivation problems, not compression problems. A summarizer's job is to be shorter than the source. A reasoning model's tendency to elaborate, hedge, and add caveats fights against this goal directly.
In practice, o3 often produces summaries 30-50% longer than what human judges prefer, adding context that wasn't requested and qualifying statements the source made definitively. This isn't a factuality problem - it's a register mismatch. Summarization is a task where less is more, and reasoning models are trained to do more.
This connects to a broader finding from instruction following benchmarks: models that excel at multi-step reasoning sometimes fail at simple constraint following like length limits.
Factuality vs. Abstractive Fluency Is a Real Tradeoff
The best human-preferred summaries tend to be slightly more abstractive - they rephrase rather than extract, they synthesize rather than list, they use cleaner language than the source. But abstraction is where hallucination enters. A model that paraphrases confidently can introduce subtle distortions: changing a qualifier ("may cause" becomes "causes"), eliding important caveats, or conflating two similar facts.
The Vectara data makes this concrete: GPT-5's 10%+ hallucination rate on document summarization versus Phi-4's 3.7% shows that higher general capability does not mean more faithful summarization. Phi-4 is more extractive by default, which keeps it grounded. GPT-5's fluency advantage comes with a faithfulness cost.
For most enterprise use cases - summarizing customer support tickets, legal documents, financial filings - you want a model closer to the faithfulness end of this tradeoff, not the fluency end.
Long-Document Performance Is Where the Real Gaps Are
On CNN/DailyMail (average article length: ~800 words), the differences between frontier models are minor. On GovReport (average source: ~9,400 words), QMSum (multi-turn meeting transcripts), and BookSum (full chapters), the gaps are substantial - but published numbers are sparse. Most benchmark results cover only short-document tasks.
What we can infer: models with larger verified context windows (Claude 4 Opus at 1M tokens with strong MRCR performance, GPT-4.1 at 1M tokens) have a structural advantage on long-document tasks because they can ingest the full source. Models that silently truncate inputs on long documents will produce worse summaries without any indication that truncation occurred. Always verify that your model's effective context is sufficient for your longest source documents before trusting output quality.
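A cheap pre-flight guard against silent truncation is to estimate token count before sending a document. The heuristic below (roughly 4 characters per token for English prose) is an assumption, not a tokenizer; in production, count tokens with the provider's actual tokenizer instead:

```python
def fits_context(document, context_window_tokens, chars_per_token=4.0):
    # Rough English-prose heuristic (~4 characters per token).
    # Replace with the provider's real tokenizer before relying on this.
    estimated_tokens = len(document) / chars_per_token
    return estimated_tokens <= context_window_tokens
```

Leave headroom for the system prompt, instructions, and the generated summary itself, all of which also consume context.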
The Open-Source Gap Has Narrowed for Short Documents - Not Long Ones
Qwen 3.5 at 43.5 and Llama 4 at 43.1 on CNN/DailyMail ROUGE-L are within noise distance of GPT-4.1's 44.2. For a news article summarization pipeline, the cost difference between Qwen 3.5 (self-hosted or very cheap via API) and GPT-4.1 is hard to justify. But on multi-document and long-form tasks, the frontier closed models still hold a meaningful advantage - primarily because their larger context windows and stronger instruction following allow them to handle complex summarization requests that require genuine synthesis rather than extraction.
Methodology
This leaderboard pulls numbers from the following sources:
ROUGE-L scores: Model cards published on HuggingFace and official technical reports where available. All CNN/DailyMail scores use the standard 3.0.0 test split. All XSum scores use the standard test split. No scores have been reproduced by running models locally - I am working from published figures only.
FActScore: Drawn from the original FActScore paper (Min et al., 2023) for earlier models, and from FACTS Grounding and Vectara HHEM as proxies for current frontier models.
Human preference win rates: Chatbot Arena summarization-specific Elo ratings, accessed April 2026. Win rates are approximate conversions from Elo differences relative to GPT-4 as the baseline. These are directional rankings, not precise measurements.
Long-document scores: GovReport and BookSum scores, where reported, use ROUGE-L on the standard test splits from the original datasets. Many frontier models have not been evaluated on these benchmarks in peer-reviewed settings.
Caveats and Limitations
ROUGE limitations for abstractive summaries: ROUGE-L was designed for extractive summarization systems. It systematically undervalues paraphrase, penalizes different but valid orderings of information, and does not capture factual accuracy at all. High ROUGE scores and high-quality summaries are not the same thing.
Dataset contamination: CNN/DailyMail, XSum, and SummEval are all public datasets that have been available since 2015-2020. Any model trained after 2021 has likely seen these documents, their reference summaries, or closely related examples. ROUGE scores on these benchmarks for recent models should be interpreted with skepticism - the model may be partially recalling rather than summarizing.
Domain shift: All three major summarization benchmarks (CNN/DailyMail, XSum, GovReport) are drawn from narrow domains (news, government policy). Performance on your domain (medical records, legal contracts, technical documentation, customer support transcripts) may differ substantially. A model that excels at news summarization may fail on financial reports. Always evaluate on representative samples from your target domain.
Reference quality: SummEval work by Fabbri et al. showed that many CNN/DailyMail reference summaries are of poor quality - they truncate rather than summarize, they miss key information, and they sometimes contain errors. Benchmarking against flawed references produces noisy signals regardless of model quality.
Provider benchmark transparency: OpenAI, Anthropic, Google, and xAI do not consistently publish results on standard summarization benchmarks. This makes the comparison uneven - GPT-4.1 has public ROUGE numbers while GPT-5 does not, which does not mean GPT-5 is worse. It means the information is missing.
Practical Guidance
For high-volume news or content summarization: Qwen 3.5 or Llama 4 are the best cost-efficiency options. Their ROUGE-L scores are within noise of GPT-4.1 on CNN/DailyMail-style tasks, and self-hosting eliminates API costs entirely for large workloads.
For long-form document summarization (legal, financial, technical): Claude 4 Opus or GPT-4.1 with verified long-context retrieval. Context window size and retrieval accuracy matter more than ROUGE-L here. Check the Long-Context Benchmarks Leaderboard before choosing.
For RAG-grounded summarization where factuality is critical: Prioritize FACTS Grounding and Vectara HHEM scores over ROUGE. Smaller models like Phi-4 (3.7% hallucination rate on Vectara) or focused instruction-tuned models often outperform frontier models on grounded faithfulness. Avoid reasoning-mode variants for this task.
For multi-document summarization: This is the most underserved task category. MultiNews and QMSum coverage is thin for frontier models. Claude 4 Opus has the best track record in Chatbot Arena on synthesis tasks, but independent published numbers are scarce. Test on your specific corpus.
For summarizing meeting transcripts: MeetingBank and QMSum are the relevant benchmarks. Transcripts have different language patterns than written documents (disfluencies, speaker attribution, long tangents), and models trained primarily on written text sometimes produce stilted summaries. Test explicitly rather than assuming a model's news-summarization quality transfers.
FAQ
Which LLM is best for summarization in 2026?
For general-purpose short document summarization, GPT-4.1 and Claude 4 Sonnet offer the best quality-to-cost ratio among closed models. For self-hosted or cost-sensitive deployments, Qwen 3.5 closes most of the gap on short documents. For long-form or multi-document tasks, Claude 4 Opus has the strongest human preference scores.
Why are ROUGE scores so similar across models?
ROUGE-L on CNN/DailyMail saturated around 2022. The dataset's reference summaries are short extractive fragments, and most frontier models exceed human-level performance on extracting those fragments. The remaining 1-3 point differences reflect prompt sensitivity and decoding settings more than genuine model capability differences.
Do reasoning models like o3 summarize better?
Generally no. Reasoning models produce longer, more elaborated outputs that human judges consistently rate lower on summarization tasks. Summarization requires compression and constraint following - skills that extended reasoning does not improve. For a related analysis, see our Instruction Following Leaderboard.
Is FActScore the best metric for summarization quality?
FActScore is the best available metric for factual faithfulness, but it requires a knowledge source to check facts against, which makes it expensive to run and dependent on the quality of that source. For practical evaluation, running FActScore on a sample of 50-100 outputs from your specific domain is more informative than any benchmark ROUGE score.
What is the best benchmark for long-document summarization?
GovReport is the most widely cited long-document summarization benchmark with standard evaluation conditions. BookSum is harder and more recent. Neither has comprehensive frontier-model coverage in 2026. For models capable of handling the full document length, long-context retrieval benchmarks are a good complementary signal for whether a model can actually use its context window.
Sources:
- SummEval: Re-evaluating Summarization Evaluation (arxiv 2007.12626)
- QMSum: Query-based Multi-domain Meeting Summarization (arxiv 2104.05938)
- FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation (arxiv 2305.14251)
- BERTScore: Evaluating Text Generation with BERT (arxiv 1904.09675)
- BERTScore GitHub
- SummaC: Revisiting NLI-based Models for Inconsistency Detection in Summarization (arxiv 2111.09525)
- GovReport dataset (arxiv 2104.02112)
- BookSum: A Collection of Datasets for Long-form Narrative Summarization (arxiv 2105.08209)
- MeetingBank dataset (arxiv 2305.17529)
- MultiNews dataset (arxiv 1906.01749)
- XSum: Extreme Summarization (arxiv 1808.08745)
- CNN/DailyMail dataset (HuggingFace)
- SummEval GitHub - Yale LILY
- FACTS Grounding paper (arxiv 2501.03200)
- Vectara hallucination leaderboard (GitHub)
✓ Last verified April 19, 2026
