Best Models for Long-Context Retrieval - March 2026
Claude Opus 4.6 leads multi-needle retrieval at 1M tokens with 76% on MRCR v2, while GPT-5.4 achieves near-perfect single-needle accuracy across its full 1M context.

TL;DR
- Claude Opus 4.6 leads the hardest retrieval benchmark (MRCR v2 8-needle at 1M tokens) at 76%, a 4x improvement over the previous best Claude score
- GPT-5.4 achieves near-perfect single-needle accuracy across its full 1M token context, the strongest "needle in a haystack" result to date
- Context window size alone means little - Llama 4 Scout claims 10M tokens but scores 15.6% on long-context reasoning, while models with 200K windows beat it
The gap between advertised context windows and actual retrieval performance is the defining story of long-context AI in 2026. Claude Opus 4.6 scores 76% on the MRCR v2 8-needle benchmark at 1M tokens, meaning it can locate and reproduce eight specific facts hidden across a million-token document. That score represents a 4x improvement over Claude Sonnet 4.5's 18.5% on the same test. GPT-5.4 achieves near-perfect accuracy on single-needle retrieval across its full 1M context window, and GPT-5.2 Codex leads the Artificial Analysis Long Context Reasoning benchmark at 75.7%.
For practical long-document applications, the choice depends on whether you need retrieval (finding facts) or reasoning (synthesizing information across documents). These are different skills, and models that excel at one often lag on the other.
Rankings Table
| Rank | Model | Provider | Context Window | MRCR v2 8-needle (1M) | AA-LCR | Price (Input/Output) | Verdict |
|---|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | Anthropic | 200K (1M beta) | 76% | - | $5/$25 | Best multi-needle retrieval at 1M tokens |
| 2 | GPT-5.4 | OpenAI | 1M | - | - | $2.50/$20 | Near-perfect single-needle across full context |
| 3 | GPT-5.2 Codex | OpenAI | 400K | - | 75.7% | $1.75/$14 | Top long-context reasoning score |
| 4 | Gemini 3.1 Pro | Google | 2M | 26.3% (1M) | - | $2/$12 | Largest window, retrieval drops at 1M |
| 5 | Claude Sonnet 4.6 | Anthropic | 200K (1M beta) | - | - | $3/$15 | <5% accuracy drop across full window |
| 6 | Gemini 3 Pro | Google | 2M | 24.5% (1M) | - | $1.25/$10 | Budget option with massive window |
| 7 | Grok 4 | xAI | 2M | - | - | $3/$15 | 2M window with real-time web search |
| 8 | Llama 4 Scout | Meta | 10M | - | - | Free (open) | Largest window, weak on reasoning tasks |
| 9 | GPT-5.2 | OpenAI | 400K | 63.9% (256K) | 75.6% | $1.75/$14 | Strong reasoning, weaker multi-needle |
| 10 | Llama 4 Maverick | Meta | 1M | - | - | Free (open) | Open-weight 1M option, limited eval data |
Detailed Analysis
Claude Opus 4.6 - The Retrieval Accuracy Leader
Anthropic's Claude Opus 4.6 holds the most impressive number in long-context evaluation: 76% on MRCR v2 8-needle at 1M tokens. This benchmark asks models to find and reproduce eight specific facts scattered across a million-token prompt. The previous best from any Claude model was 18.5% from Sonnet 4.5. That's not incremental improvement. It's a generational leap.
At 256K tokens, Opus 4.6 pushes MRCR 8-needle accuracy to 92-93%, crushing GPT-5.2's 63.9% at the same length. For applications that involve searching through large document collections, contracts, or codebases for specific information, Opus 4.6 is the clear leader.
The context window itself is 200K tokens by default, with a 1M token beta available to higher-tier API customers. That limitation means teams needing routine access to million-token contexts may prefer Gemini's 2M default window, even though retrieval accuracy at that range is weaker.
GPT-5.4 - Near-Perfect at Full Range
OpenAI's GPT-5.4 ships with a 1M token context window (922K input, 128K output) and achieves near-perfect accuracy on single-needle retrieval tests across the entire range. Previous GPT models showed significant accuracy degradation beyond 200K tokens. GPT-5.4 doesn't.
OpenAI also reports that the model is 33% less likely to make factual errors on individual claims compared to GPT-5.2, with overall responses 18% less likely to contain errors. That hallucination reduction matters for long-context tasks where the model processes large volumes of information and synthesizes answers from scattered sources.
At $2.50/$20, GPT-5.4 sits between Gemini 3.1 Pro's $2/$12 and Opus 4.6's $5/$25 on pricing. For teams that need reliable retrieval across very long documents without gating access behind beta programs, it's the most straightforward option.
GPT-5.2 Codex - Long-Context Reasoning Champion
While retrieval measures whether a model can find specific facts, the AA-LCR benchmark tests whether it can reason across long documents - synthesizing information from academic papers, financials, legal documents, and reports. GPT-5.2 Codex leads this evaluation at 75.7%, followed by GPT-5.2 at 75.6% and GPT-5.1 at 75.0%.
The AA-LCR benchmark requires multi-step reasoning to connect information dispersed across sections of documents spanning roughly 100K tokens. OpenAI's models dominate this category, suggesting their training pipeline handles multi-document synthesis well. At $1.75/$14, GPT-5.2 Codex offers the best reasoning-per-dollar for long-context work.
Gemini 3.1 Pro - The Biggest Window, With Caveats
Gemini 3.1 Pro ships with a 2M token context window by default, the largest among major API providers. At $2/$12, it also offers the lowest per-token cost for processing very long inputs. For applications where you need to ingest entire codebases or document collections in a single prompt, the raw window size is useful.
The problem shows up in retrieval accuracy. Gemini 3.1 Pro scores only 26.3% on MRCR v2 8-needle at 1M tokens, compared to Opus 4.6's 76%. At 128K tokens, both models score similarly (around 84.9%), but performance diverges dramatically as context grows. If your use case involves finding specific facts in million-token documents, the 2M window doesn't help if the model can't locate what you need.
For bulk processing where approximate understanding matters more than precise retrieval, Gemini 3.1 Pro's window size and pricing make it the practical choice. For applications where missing a single fact has consequences, Opus 4.6 or GPT-5.4 are safer picks.
Llama 4 Scout - Big Claims, Mixed Reality
Llama 4 Scout advertises a 10M token context window, the largest of any model. Its needle-in-a-haystack single-needle performance is reportedly perfect across all depths. But the model scores only 15.6% on Fiction.LiveBench, a long-context reasoning benchmark, and practical inference requires 8xH100 GPUs with effective limits around 1.4M tokens.
Scout illustrates why context window size alone is a misleading metric. The ability to accept 10M tokens means little if the model can't reason effectively across that span. For teams evaluating open-source long-context options, Scout's retrieval capability is genuine, but its reasoning at scale falls well short of proprietary alternatives.
Methodology
Long-context evaluation requires multiple benchmarks because "long context performance" isn't a single capability:
MRCR v2 (Multi-needle) tests whether models can locate and reproduce multiple specific facts hidden in long contexts. The 8-needle variant at 1M tokens is currently the hardest standardized retrieval test. Scores are reported as mean match ratios.
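To make "mean match ratio" concrete, here is a minimal sketch of how such a score could be computed, using Python's standard-library `SequenceMatcher` as the per-needle similarity. The function names and the choice of similarity metric are illustrative assumptions, not the official MRCR v2 scoring code.

```python
from difflib import SequenceMatcher

def match_ratio(expected: str, reproduced: str) -> float:
    """Similarity between one expected needle and the model's reproduction (0.0-1.0)."""
    return SequenceMatcher(None, expected, reproduced).ratio()

def mean_match_ratio(needles: list[str], answers: list[str]) -> float:
    """Average match ratio across all needles in a multi-needle trial."""
    assert len(needles) == len(answers), "one answer per needle"
    return sum(match_ratio(n, a) for n, a in zip(needles, answers)) / len(needles)
```

A perfect 8-needle run scores 1.0; reproducing only some needles, or reproducing them with errors, pulls the mean down proportionally.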
AA-LCR (Artificial Analysis Long Context Reasoning) uses 100 questions across diverse document types requiring multi-step reasoning over roughly 100K tokens. It tests synthesis, not just retrieval.
Needle-in-a-Haystack remains the simplest retrieval test - can the model find one specific fact at various depths? Most frontier models now score near-perfectly on this, making it useful only as a minimum bar rather than a differentiator.
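The test itself is simple to reproduce. Below is a hedged sketch of how a single-needle haystack prompt can be built, with the needle placed at a chosen fractional depth; the filler text and function name are placeholders, not any benchmark's official harness.

```python
FILLER = "The quick brown fox jumps over the lazy dog. "  # 9 words per repeat

def build_haystack(needle: str, total_words: int, depth: float) -> str:
    """Embed `needle` at fractional `depth` (0.0 = start, 1.0 = end) in filler text."""
    filler_words = (FILLER * (total_words // 9 + 1)).split()[:total_words]
    pos = int(len(filler_words) * depth)
    return " ".join(filler_words[:pos] + [needle] + filler_words[pos:])
```

Sweeping `depth` from 0.0 to 1.0 at several context lengths, then asking the model to repeat the needle, reproduces the classic depth-vs-length heatmap.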
RULER (NVIDIA) tests 13 task types including retrieval, multi-hop tracing, aggregation, and QA at configurable sequence lengths. Coverage at 1M+ tokens is still spotty.
A critical caveat about context window claims: many models advertise windows far beyond where they perform reliably. Gemini 3.1 Pro offers 2M tokens but drops to 26% retrieval accuracy at 1M. Llama 4 Scout claims 10M but struggles past 1.4M in practice. Always test at your actual working context length before committing to a model.
Historical Progression
March 2025 - Most models maxed out at 128K tokens. GPT-4 Turbo and Claude 3 Opus led with reliable retrieval at this range. Gemini 1.5 Pro offered 1M tokens but with significant accuracy drops.
July 2025 - Claude Opus 4.0 and Gemini 2.5 Pro both shipped 1M+ token windows. Multi-needle retrieval tests emerged as the new standard, replacing simple needle-in-a-haystack.
October 2025 - Llama 4 Scout launched with 10M tokens, generating headlines but mixed benchmark results. The gap between window size and retrieval quality became a major discussion point.
February 2026 - Claude Opus 4.6 reached 76% on MRCR v2 8-needle at 1M tokens, quadrupling the previous best from Anthropic and establishing a clear lead in multi-needle retrieval.
March 2026 - GPT-5.4 shipped near-perfect single-needle retrieval across its full 1M window. The focus shifts from window size to retrieval quality and reasoning capability at scale.
The next twelve months will likely see the window-size race cool down while the quality-at-scale race heats up. Models with 2M+ token windows are already common, but reliable multi-needle retrieval at 1M+ tokens remains a capability that only one or two models deliver.
FAQ
What's the cheapest model for long-context work?
Gemini 3.1 Pro at $2/$12 with a 2M token window offers the lowest per-token cost. Retrieval accuracy drops markedly past 500K tokens, but for bulk processing it's the value leader.
Does a bigger context window mean better retrieval?
No. Llama 4 Scout has a 10M token window but scores 15.6% on long-context reasoning. Opus 4.6 with a 200K default (1M beta) leads retrieval accuracy at 76% on 8-needle MRCR v2 at 1M tokens.
Is RAG still necessary with million-token context windows?
For most production workloads, yes. RAG offers better cost control, fresher data, and deterministic source tracking. Long context is better for tasks requiring holistic understanding of a single large document.
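The core difference is that RAG selects a small, relevant slice of the corpus before the model ever sees it. A deliberately minimal sketch of that selection step, using crude lexical overlap in place of a real embedding model (both functions are illustrative assumptions):

```python
def score(query: str, chunk: str) -> float:
    """Crude lexical overlap between query terms and a chunk (stand-in for embeddings)."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the top-k chunks by overlap score - the RAG selection step."""
    return sorted(chunks, key=lambda ch: score(query, ch), reverse=True)[:k]
```

Only the retrieved chunks are sent to the model, which is why RAG keeps token costs flat as the corpus grows, while long-context prompting pays for every token on every call.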
How do I test a model's real context performance?
Run multi-needle retrieval tests at your actual working context length. Single-needle tests are too easy - every frontier model passes them. The 8-needle MRCR v2 benchmark is the best publicly available stress test.
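A do-it-yourself multi-needle check is straightforward to sketch. The helpers below scatter unique markers through filler text and score how many the model reproduces; the function names and filler are assumptions for illustration, and the actual model call is left to your client of choice.

```python
import random

def make_multi_needle_prompt(needles: list[str], total_words: int, seed: int = 0) -> str:
    """Scatter each needle at a random position in filler text and return the prompt."""
    rng = random.Random(seed)  # seeded, so trials are reproducible
    words = ["lorem"] * total_words
    for needle in needles:
        words.insert(rng.randrange(len(words)), needle)
    return " ".join(words)

def recall(needles: list[str], model_answer: str) -> float:
    """Fraction of needles reproduced verbatim in the model's answer."""
    return sum(1 for n in needles if n in model_answer) / len(needles)
```

Size `total_words` to your actual working context length, use 8 distinct needles, and run several seeds: a model that aces single-needle tests can still miss most needles here.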
Which model is best for analyzing legal documents?
Claude Opus 4.6 for precision retrieval across long contracts. GPT-5.2 Codex for reasoning across multiple legal documents simultaneously. Gemini 3.1 Pro for high-volume, cost-sensitive processing.
Last verified March 11, 2026
