
Long-Context Benchmarks Leaderboard: MRCR, RULER, and LongBench v2

Rankings of the best AI models for long-context tasks, measuring retrieval accuracy, reasoning, and comprehension across massive context windows from 128K to 10M tokens.


Every model ships with a context window number on the spec sheet. Gemini 3 Pro says 10 million tokens. Llama 4 Scout says 10 million. GPT-5.2 says 400K. But claiming a context window and actually using it reliably are two very different things. Research consistently shows that effective capacity is usually 60 to 70 percent of the advertised maximum, and performance degrades in ways that simple token counts never capture - missed retrievals, hallucinated details, lost reasoning chains.

This leaderboard cuts through the marketing and ranks models on what actually matters: can they find information buried deep in context, reason across scattered facts, and maintain accuracy as input length scales? We track three benchmarks that test this from different angles.

The Benchmarks Explained

MRCR v2 (Multi-Round Coreference Resolution) is a needle-in-a-haystack benchmark, originally introduced by Google DeepMind and later open-sourced by OpenAI as OpenAI-MRCR, that tests a model's ability to locate and distinguish between multiple pieces of target information ("needles") hidden within large volumes of distractor text. The v2 variant comes in 2-needle, 4-needle, and 8-needle configurations tested at context lengths up to 1M tokens. The 8-needle 1M configuration is the hardest and represents the current frontier of long-context retrieval. A model that scores well here can reliably find and cross-reference multiple specific facts across enormous documents.
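To make that concrete, here is a minimal sketch of the idea in Python, a simplification rather than the released harness: several requests of the same kind, each with a canned reply, are shuffled into a long synthetic conversation, and the model is asked to reproduce the reply to one of them by position. Function names and prompt phrasing here are illustrative only.

```python
import random
from difflib import SequenceMatcher

def build_multi_needle_prompt(needles, distractor_turns, target_index):
    """needles: list of (topic, canned_reply) pairs hidden among distractor turns.

    Returns (prompt, expected_reply); the prompt ends by asking the model to
    reproduce the reply to the target_index-th hidden request.
    """
    items = [("distractor", turn, None) for turn in distractor_turns]
    items += [("needle", topic, reply) for topic, reply in needles]
    random.shuffle(items)

    lines, replies_in_order = [], []
    for kind, text, reply in items:
        if kind == "needle":
            lines.append(f"User: write a short poem about {text}")
            lines.append(f"Assistant: {reply}")
            replies_in_order.append(reply)          # order of appearance
        else:
            lines.append(f"User: {text}")
            lines.append("Assistant: Noted.")
    question = (f"\nUser: reproduce, word for word, your reply to poem request "
                f"number {target_index + 1} above.")
    return "\n".join(lines) + question, replies_in_order[target_index]

def grade(model_output, expected):
    """Soft score in [0, 1]: string similarity between the reply and the target."""
    return SequenceMatcher(None, model_output.strip(), expected.strip()).ratio()
```

Scaling this up is mostly a matter of adding distractor turns until the prompt reaches the context length you care about.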

RULER is NVIDIA's synthetic benchmark that goes well beyond vanilla needle-in-a-haystack testing. It includes 13 task types across four categories: retrieval (finding specific facts), multi-hop tracing (following chains of references), aggregation (collecting and summarizing distributed information), and question answering. RULER tests at configurable sequence lengths and is designed to expose models that pass simple retrieval but fail on more complex long-context reasoning. The original evaluation of 17 models found that despite claiming 32K+ context sizes, only half maintained satisfactory performance at 32K tokens.
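The multi-hop tracing category is the easiest to picture with a toy example. Here is a rough sketch of a RULER-style variable-tracking item, my own simplification rather than NVIDIA's generator: a chain of variable assignments is scattered through filler text, and the model must name every variable that ultimately resolves to the target value.

```python
import random
import string

def variable_tracking_item(chain_length=4, filler_sentences=500):
    """Build a multi-hop tracing probe: VAR_A = <number>, VAR_B = VAR_A, ...,
    then ask which variables resolve to that number."""
    value = str(random.randint(10000, 99999))
    names = ["VAR_" + "".join(random.choices(string.ascii_uppercase, k=5))
             for _ in range(chain_length)]

    facts = [f"{names[0]} = {value}."]
    facts += [f"{names[i]} = {names[i - 1]}." for i in range(1, chain_length)]

    body = ["The grass is green. The sky is blue."] * filler_sentences
    # Scatter the assignment chain through the filler at random (ascending) points.
    for fact, pos in zip(facts, sorted(random.sample(range(len(body)), len(facts)))):
        body.insert(pos, fact)

    prompt = (" ".join(body)
              + f"\nQuestion: list every variable that is equal to {value}.")
    return prompt, set(names)
```

A model that passes single-needle retrieval but loses the thread here is exactly the failure mode RULER was built to expose.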

LongBench v2 is a benchmark from Tsinghua University consisting of 503 challenging multiple-choice questions with contexts ranging from 8K to 2M words. It covers six task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding. Human experts scored 53.7% accuracy under a 15-minute time constraint, making this one of the few benchmarks where the human baseline is genuinely difficult to beat.
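Because every item is four-option multiple choice, scoring reduces to exact-match accuracy over the chosen letters. A minimal sketch, assuming items with context, question, choices, and answer fields (illustrative names, not the official schema) and any prompt-to-text model client:

```python
def score_longbench_style(items, call_model):
    """items: dicts with 'context', 'question', 'choices' (four strings) and
    'answer' ('A'-'D'). call_model is any prompt -> text function."""
    correct = 0
    for item in items:
        options = "\n".join(f"{letter}. {text}"
                            for letter, text in zip("ABCD", item["choices"]))
        prompt = (f"{item['context']}\n\n{item['question']}\n{options}\n"
                  "Answer with a single letter.")
        reply = call_model(prompt).strip().upper()
        correct += reply[:1] == item["answer"]
    return correct / len(items)
```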

Overall Long-Context Rankings

| Rank | Model | Provider | Max Context | MRCR v2 (8-needle, 1M) | MRCR v2 (4-needle, 256K) | LongBench v2 | Notes |
|------|-------|----------|-------------|------------------------|--------------------------|--------------|-------|
| 1 | Gemini 3 Pro | Google DeepMind | 10M | 26.3% (1M) / 77% (128K) | ~85% | 68.2% | Largest window, but steep accuracy drop beyond 128K |
| 2 | Claude Opus 4.6 | Anthropic | 1M | 76% | ~90% | - | 4x improvement over predecessor on 8-needle 1M |
| 3 | GPT-5.2 | OpenAI | 400K | - | 98% | 54.5% | Near-perfect at 4-needle 256K, first to hit this mark |
| 4 | Claude Sonnet 4.6 | Anthropic | 1M | - | ~82% | - | 1M context at fraction of Opus pricing |
| 5 | Grok 4 Fast | xAI | 2M | - | - | - | 2M window, limited independent benchmarks available |
| 6 | Llama 4 Scout | Meta | 10M | - | - | - | Near-perfect NIAH up to 10M, open-source |
| 7 | DeepSeek V4 | DeepSeek | 1M | - | - | - | Engram O(1) memory architecture, recently launched |
| 8 | Qwen 3.5 | Alibaba | 1M | - | - | - | MoE architecture, 256K native with YaRN extension |
| 9 | GPT-4.1 | OpenAI | 1M | - | ~80% | - | Predecessor model, still competitive |
| 10 | DeepSeek V3.2 | DeepSeek | 128K | - | - | - | Strong NIAH at 128K, cost-efficient |

A note on the table above: dashes indicate scores I could not verify from published sources. The long-context benchmarking landscape is fragmented - not every model has been evaluated on every benchmark under identical conditions, which makes direct comparison harder than it should be.

Key Takeaways

The Context Window Arms Race Has Outpaced Evaluation

We are in a strange moment where models advertise 1M and 10M token context windows, but the benchmarking infrastructure has not kept pace. MRCR v2, RULER, and LongBench v2 were all designed when 128K was considered long context. While they have been extended to test at larger scales, coverage is spotty. Many providers self-report needle-in-a-haystack results at their maximum context length but have not been independently evaluated on harder multi-hop or aggregation tasks at those scales. Until standardized evaluation at 1M+ tokens becomes routine, take advertised context window sizes with skepticism.

Gemini 3 Pro: Biggest Window, Sharpest Dropoff

Gemini 3 Pro's 10 million token context window is the largest in the industry, and at 128K tokens it performs exceptionally well - 77% on the 8-needle MRCR variant and 68.2% on LongBench v2, the highest published score on that benchmark. But Google's own evaluation card reveals the uncomfortable truth: at the actual 1M-token mark, MRCR 8-needle performance drops to 26.3%. That is a massive 50-point cliff. The 10M claim looks impressive on a spec sheet, but practical reliability appears to peak well before that. For document-intensive workloads under 128K tokens, Gemini 3 Pro is arguably the best choice. Beyond that, proceed with caution.

Claude Opus 4.6: The Long-Context Breakthrough

The most dramatic improvement in this generation belongs to Claude Opus 4.6. On the 8-needle MRCR v2 at 1M tokens, it scores 76% - compared to its predecessor Claude Sonnet 4.5's 18.5% on the same test. That is a 4x improvement and represents the strongest verified performance at 1M tokens on a multi-needle retrieval task. Anthropic has clearly invested heavily in long-context architecture for this release, and the result is a model that maintains retrieval quality across its full 1M window in a way that competitors have not yet matched. At $15 per million input tokens for Opus and $3 for Sonnet 4.6 (which also gets the 1M window), the pricing reflects the premium positioning.

GPT-5.2: Precision King at 256K

OpenAI's GPT-5.2 takes a different approach. Rather than chasing the largest context window, it focuses on being exceptionally reliable within its 400K token range. It is the first model to achieve near-100% accuracy on the 4-needle MRCR variant out to 256K tokens - a 98% score that no other model has matched at that scale. The 8-needle variant drops to 70% at 256K, which is still strong. For use cases where you need rock-solid retrieval within a few hundred thousand tokens rather than shaky performance at a million, GPT-5.2 is the safest bet.

Open-Source Long Context: Promising but Unproven

Llama 4 Scout's 10M token context window with near-perfect needle-in-a-haystack retrieval sounds extraordinary for an open-source model. Meta's MTOB (Machine Translation of Books) results for the sibling model Llama 4 Maverick are genuinely impressive - 50.8% on English-to-Kalamang translation versus Gemini 2.0 Flash's 45.5% - demonstrating real long-context comprehension. DeepSeek V4's Engram architecture with O(1) memory retrieval could be a fundamental shift in long-context economics. But independent benchmark coverage for both models is thin. Until third-party evaluations confirm the self-reported numbers on harder tasks like multi-hop reasoning and aggregation, I am cautiously optimistic rather than fully sold.

Context Window vs. Effective Context: The Gap That Matters

The single most important thing to understand about long-context models is the difference between the advertised context window and the effective context - the range within which the model maintains acceptable accuracy. Research from multiple sources consistently finds that effective capacity is 60-70% of the maximum for most models.

Anthropic's Claude Sonnet 4 shows less than 5% accuracy degradation across its full 200K token range - an unusually consistent performance profile. Gemini 3 Pro, despite its massive 10M token window, shows practical reliability that drops sharply beyond 128K on multi-needle tasks. GPT-5.2's strength is maintaining near-perfect retrieval up to 256K, making its effective context ratio one of the highest in the industry.

If you are building applications that depend on long-context retrieval, always test at your actual working context length with multi-needle tasks, not just single-needle retrieval. A model that passes the basic needle-in-a-haystack test may still fail when it needs to track and cross-reference multiple pieces of information.
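A minimal version of that kind of smoke test, assuming a generic call_model(prompt) -> str client and a rough four-characters-per-token heuristic (both are placeholders, not any particular vendor's API):

```python
import random

def multi_needle_smoke_test(call_model, target_tokens=128_000, n_needles=8, trials=5):
    """Hide n_needles key-value pairs in filler padded to roughly target_tokens
    and measure how many values the model returns. Returns mean recall in [0, 1]."""
    recalls = []
    for _ in range(trials):
        needles = {f"code-{i}": str(random.randint(100_000, 999_999))
                   for i in range(n_needles)}
        filler = "The quick brown fox jumps over the lazy dog. "
        chunks = [filler] * (target_tokens * 4 // len(filler))  # ~4 chars per token
        for key, value in needles.items():
            chunks.insert(random.randrange(len(chunks)),
                          f"The secret value for {key} is {value}. ")
        prompt = "".join(chunks) + "\nList the secret value for each code mentioned above."
        reply = call_model(prompt)
        recalls.append(sum(value in reply for value in needles.values()) / n_needles)
    return sum(recalls) / len(recalls)
```

Run it at the context length your application actually uses, not at the model's advertised maximum, and treat any recall meaningfully below 1.0 as a warning sign.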

Pricing for Long-Context Workloads

Long context is expensive. Here is what you are looking at per million tokens:

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Max Context |
|-------|-----------------------|------------------------|-------------|
| Gemini 2.0 Flash | $0.08 | $0.30 | 1M |
| GPT-5 Nano | $0.05 | $0.40 | 128K |
| DeepSeek V3.2 | $0.27 | $0.55 | 128K |
| GPT-5.2 | $1.50 | $6.00 | 400K |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 1M |
| Gemini 3 Pro | $12.00 | $12.00 | 10M |
| Claude Opus 4.6 | $15.00 | $75.00 | 1M |
| Claude Opus 4.5 | $25.00 | $75.00 | 200K |

The cost gap is enormous. Processing 1M tokens through Claude Opus 4.6 costs $15 in input alone, while Gemini 2.0 Flash would cost $0.08 for the same input at its 1M window. For bulk document processing where retrieval accuracy is less critical, the cheaper models offer compelling economics. For high-stakes applications - legal review, research synthesis, codebase analysis - the frontier models justify their premium.
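To make the comparison concrete, here is a quick back-of-the-envelope helper using the rates from the table above (the model keys are informal labels for this sketch, not API identifiers):

```python
# Price per 1M tokens (input, output), taken from the pricing table above.
PRICES = {
    "gemini-2.0-flash":  (0.08, 0.30),
    "gpt-5-nano":        (0.05, 0.40),
    "deepseek-v3.2":     (0.27, 0.55),
    "gpt-5.2":           (1.50, 6.00),
    "claude-sonnet-4.6": (3.00, 15.00),
    "gemini-3-pro":      (12.00, 12.00),
    "claude-opus-4.6":   (15.00, 75.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the listed per-million-token rates."""
    inp, out = PRICES[model]
    return input_tokens / 1e6 * inp + output_tokens / 1e6 * out

# A single 1M-token document summarized into a 2K-token reply:
print(request_cost("claude-opus-4.6", 1_000_000, 2_000))   # ~ $15.15
print(request_cost("gemini-2.0-flash", 1_000_000, 2_000))  # ~ $0.08
```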

Practical Guidance

For legal and compliance document review (where missing a single clause can be costly): Claude Opus 4.6 offers the best combination of large context window and verified multi-needle retrieval accuracy. GPT-5.2 is a strong alternative if your documents fit within 256K tokens.

For codebase analysis and large repository understanding: Gemini 3 Pro's 10M token window can ingest entire codebases, and its performance at sub-128K context lengths is excellent. Llama 4 Scout is the open-source option for teams wanting to self-host.

For cost-sensitive batch processing: Gemini 2.0 Flash at $0.08 per million input tokens with a 1M context window is hard to beat on price. DeepSeek V3.2 is another budget option, with strong retrieval performance at 128K and low per-token rates.

For research and multi-document synthesis: Claude Opus 4.6's ability to maintain retrieval accuracy across 1M tokens makes it the top choice for synthesizing information across many papers or reports. The 76% score on 8-needle MRCR at 1M tokens means it can reliably track multiple threads of information across massive inputs.

For maximum context at any cost: Gemini 3 Pro and Llama 4 Scout both offer 10M token windows, but keep in mind that accuracy degrades significantly at those scales. For most practical workloads, staying within 128K-256K tokens and picking the model with the best accuracy in that range will give you better results than pushing to the maximum.


About the author: James, AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.