Best LLMs with 1M+ Context Window in 2026

A practical comparison of every production LLM with a 1M+ token context window - verified pricing, real retrieval notes, and clear picks for different workloads.

For most of AI's short history, running out of context was a given. You chunked documents, built RAG pipelines, and accepted that the model only ever saw a fraction of your data at once. That constraint is largely gone now. Ten production models can process a million tokens or more in a single call - roughly 750,000 words, several long novels, or an entire mid-size codebase.
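As a quick sanity check on the "1M tokens is roughly 750,000 words" figure, here is a minimal sketch that estimates token counts from word counts. The 0.75 words-per-token ratio is only the rule of thumb implied above, not a real tokenizer, and the function names and the 64K output reserve are illustrative assumptions.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb implied by "1M tokens ~ 750,000 words"


def estimate_tokens(text: str) -> int:
    """Rough token estimate from a whitespace word count (not a real tokenizer)."""
    words = len(text.split())
    return round(words / WORDS_PER_TOKEN)


def fits_in_window(text: str,
                   window_tokens: int = 1_000_000,
                   reserve_for_output: int = 64_000) -> bool:
    """True if the estimated prompt plus an assumed output reserve fits the window."""
    return estimate_tokens(text) + reserve_for_output <= window_tokens


if __name__ == "__main__":
    sample = "word " * 750_000          # about 750,000 words
    print(estimate_tokens(sample))      # estimated prompt tokens
    print(fits_in_window(sample))       # the output reserve pushes this over 1M
```

Real tokenizers vary by model and by language, so treat the estimate as a first filter and count tokens with the provider's own tooling before sending a request near the limit.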

The question isn't whether 1M-token context exists anymore. It's which model actually delivers on it, and at what price.

This article covers every production LLM with 1M+ tokens of context as of May 2026. Pricing figures come from official model documentation and verified API listings. We skip models where context window claims aren't backed by production availability.

TL;DR

  • Best overall: Gemini 3.1 Pro at $2/$12 per million tokens - 1M context, strong accuracy, with the rate rising to $4/$18 on requests above 200K tokens
  • Best budget API: DeepSeek V4-Flash at $0.14/$0.28 - full 1M window, MIT-licensed open weights
  • Best open source: Llama 4 Scout - 10M token window, Apache 2.0, fits on a single H100

Every Model with 1M+ Context

Ten production models support at least 1M tokens in a single context window as of May 2026. One - Llama 4 Scout - reaches 10M.

| Model | Context | Input ($/M) | Output ($/M) | License |
|---|---|---|---|---|
| Llama 4 Scout | 10M | $0.08 (DeepInfra) | $0.30 | Apache 2.0 |
| DeepSeek V4-Flash | 1M | $0.14 | $0.28 | MIT |
| Gemini 2.5 Pro | 1M | $1.25 / $2.50* | $10 / $15* | Proprietary |
| Gemini 3.5 Flash | 1M | $1.50 | $9.00 | Proprietary |
| DeepSeek V4-Pro | 1M | $1.74 | $3.48 | MIT |
| GPT-4.1 | 1M | $2.00 | $8.00 | Proprietary |
| Gemini 3.1 Pro | 1M | $2.00 | $12.00 | Proprietary |
| Claude Sonnet 4.6 | 1M | $3.00 | $15.00 | Proprietary |
| Claude Opus 4.6 | 1M | $5.00 | $25.00 | Proprietary |
| Claude Opus 4.7 | 1M | $5.00 | $25.00 | Proprietary |

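*Gemini 2.5 Pro reprices the entire request at 2x input and 1.5x output once it exceeds 200K tokens. That's not a surcharge on the overflow tokens alone - the whole request moves to the higher tier. Gemini 3.1 Pro's listed $2/$12 rate likewise applies only under 200K tokens; above that it moves to $4/$18.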
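To make the whole-request repricing concrete, here is a minimal sketch of the billing rule using the Gemini 2.5 Pro rates from the table. It assumes the 200K threshold is measured on input tokens; the function name is illustrative, and this is not an official billing formula.

```python
def gemini_25_pro_cost(input_tokens: int, output_tokens: int) -> float:
    """Request cost in USD under the two tiers listed in the table.

    Assumption: the 200K threshold applies to input (prompt) tokens.
    """
    threshold = 200_000
    if input_tokens > threshold:
        in_rate, out_rate = 2.50, 15.00   # higher tier, applied to ALL tokens
    else:
        in_rate, out_rate = 1.25, 10.00   # lower tier
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


if __name__ == "__main__":
    below = gemini_25_pro_cost(200_000, 2_000)
    above = gemini_25_pro_cost(200_001, 2_000)
    print(f"200,000 input tokens: ${below:.2f}")
    print(f"200,001 input tokens: ${above:.2f}")
```

One extra input token roughly doubles the bill in this example, which is why a prompt sitting just above the threshold is worth trimming.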



Llama 4 Scout - The Open-Source Outlier

Llama 4 Scout isn't just an open-weight model - it has the longest context window of any production model, commercial or otherwise. Meta ships it at 10M tokens with a 17B active / 109B total MoE architecture and native multimodal support for text and images. It fits on a single 80GB H100 with int4 quantization.

At DeepInfra's hosted rate of $0.08 per million input tokens, it has the lowest input price in this list, though its $0.30 output rate sits marginally above DeepSeek V4-Flash's $0.28. Self-hosting the weights from Hugging Face can cut that further at sufficient volume.
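The single-H100 claim can be checked with back-of-the-envelope arithmetic. The sketch below estimates weight memory only, assuming 1 GB = 1e9 bytes and ignoring the KV cache, activations, and runtime overhead, all of which grow with context length. The function name is illustrative.

```python
def weight_memory_gb(total_params: float, bits_per_param: int) -> float:
    """Approximate memory needed just to hold the weights, in GB (1e9 bytes)."""
    return total_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 109e9   # an MoE model keeps all experts resident, not just the 17B active
H100_MEMORY_GB = 80

if __name__ == "__main__":
    for label, bits in [("bf16", 16), ("int8", 8), ("int4", 4)]:
        needed = weight_memory_gb(TOTAL_PARAMS, bits)
        verdict = "fits" if needed < H100_MEMORY_GB else "does not fit"
        print(f"{label}: {needed:.1f} GB of weights, {verdict} in {H100_MEMORY_GB} GB")
```

At int4 the weights take about 54.5 GB, which leaves roughly 25 GB for everything else. That headroom is what bounds how much context a single card can actually serve, so the full 10M window is a separate question from whether the weights fit.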

The tradeoff is capability. Llama 4 Scout performs well on straightforward retrieval tasks but trails frontier proprietary models on complex multi-hop reasoning at scale. For scanning large codebases for patterns, summarizing bulk transcripts, or parsing extensive user activity logs, it covers those workloads cheaply. For nuanced reasoning across a million tokens of dense text, test it on your specific data before committing.

For a full self-hosting setup guide, see our best open-source LLMs to self-host article.


DeepSeek V4-Flash - Best Budget API

DeepSeek's V4 family ships two variants, both with 1M context and MIT licensing, so you can run them via the DeepSeek V4 API or self-host the weights. The Flash version at $0.14 input / $0.28 output costs roughly one-twentieth of Claude Sonnet 4.6's input rate ($3.00) and about one-fiftieth of its output rate ($15.00) for the same window.

V4-Pro (1.6T total / 49B active parameters) is substantially more capable at $1.74/$3.48. It also has the largest maximum output capacity in this group - up to 384K tokens per response, compared to the 64K-128K ceiling common elsewhere. That matters when you're producing full reports, complete documentation, or large code outputs from long input contexts.

V4-Flash handles lighter loads. Bulk document processing, transcript summarization, structured data extraction from large files - it covers those use cases at a price that makes million-token batch runs viable without burning through budget.

DeepSeek V4 at a glance

| Variant | Parameters | Context | Input | Output | Max Output |
|---|---|---|---|---|---|
| V4-Flash | 284B total / 13B active | 1M | $0.14/M | $0.28/M | 64K |
| V4-Pro | 1.6T total / 49B active | 1M | $1.74/M | $3.48/M | 384K |
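The practical effect of the max-output difference is how many separate responses a long generation needs. A minimal sketch, assuming "64K" and "384K" mean 64,000 and 384,000 tokens and that a long output can be split cleanly across calls (the function name is illustrative):

```python
def responses_needed(total_output_tokens: int, max_output_tokens: int) -> int:
    """Minimum number of responses needed to emit the requested output (ceiling division)."""
    return -(-total_output_tokens // max_output_tokens)


MAX_OUTPUT = {"V4-Flash": 64_000, "V4-Pro": 384_000}  # assumed exact token limits

if __name__ == "__main__":
    target = 300_000  # e.g. a very large generated report or code dump
    for variant, limit in MAX_OUTPUT.items():
        print(variant, responses_needed(target, limit))
```

For a 300K-token output this gives five calls on V4-Flash against one on V4-Pro. Each extra call also resends the long input context, so the output ceiling affects input cost as well.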

GPT-4.1 - Built for Code at Scale

GPT-4.1 is OpenAI's coding-optimized model with a 1M token window at $2/$8 per million. On SWE-bench Verified it scores 54.6%, which is the strongest coding score among the models in this list. It handles repository-scale context well, which makes it a natural choice for the use case most developers actually want 1M context for: feeding a full codebase into a single prompt to debug, document, or refactor.

The pricing is flat at $2/$8 across the entire 1M token range - no cliff, no tier change. That predictability matters when budgeting for large context calls.

For non-coding workloads, GPT-4.1 is a solid mid-tier option, but Gemini 3.1 Pro and Claude Opus 4.6 beat it on complex reasoning and instruction following. Its advantage is narrow and specific: software engineering tasks on large codebases.


Gemini 3.1 Pro - Best Price-to-Performance

Gemini 3.1 Pro leads most long-context benchmarks among models in the $2 input tier. It ships with a 1M token window, up to 64K tokens of output, and dynamic thinking by default - the model allocates more compute to harder problems rather than answering right away.

Google released it on February 19, 2026, with benchmark-leading scores on 13 of 16 evaluated tasks, including 77.1% on ARC-AGI-2. That's strong general reasoning performance, which translates to better results on the multi-hop reasoning tasks that actually need 1M context.

At $2/$12 per million tokens for requests under 200K tokens, combined with strong overall performance, it's the easiest recommendation for teams that need reliable long-context inference at reasonable cost.

One caveat that applies here too: the $2 rate holds only for requests under 200K tokens. Above that, Gemini 3.1 Pro moves to $4/$18 (the same cliff structure as Gemini 2.5 Pro, but at a higher base). If your workload consistently lands between 200K and 1M tokens, verify the actual request costs before comparing to flat-rate alternatives.


Gemini 2.5 Pro and 3.5 Flash - Watch the Cliffs

Gemini 2.5 Pro has a 1M context window and strong reasoning capability. At $1.25 input / $10 output for requests under 200K tokens, it's cheaper than Gemini 3.1 Pro at that tier.

The pricing cliff problem is real. Cross 200K tokens and the entire request reprices at $2.50 input / $15 output. A 201K-token request is billed at the same per-token rates as a full 1M-token request. If your use case truly uses the long context - meaning you're regularly sending 300K-1M token prompts - budget at the higher rate.

Context caching partially offsets this. If the same large document stays in context across multiple queries, caching the key-value representation cuts repeat-context costs sharply. Gemini 2.5 Pro supports caching, and for agentic workflows with a stable document as a reference, that changes the economics.

Gemini 3.5 Flash is worth mentioning here. It has the same 1M token window at $1.50/$9 - slightly cheaper than 3.1 Pro at the lower tier, markedly faster at 289 tokens/second, and the default model for Gemini and Google AI Mode. For high-throughput batch use cases where speed matters more than peak reasoning capability, it's the better choice.


Claude - Flat Pricing, No Cliffs

Anthropic's models across the Claude 4 family all carry 1M token windows with no pricing tiers. The input rate is identical whether you send 9K or 900K tokens.


Claude Opus 4.6 at $5/$25 is Anthropic's most capable long-context model for agentic and enterprise work. The 1M window has been GA since March 2026. It supports up to 128K output tokens - double the standard limit - which matters for tasks that produce long documents or detailed analyses. It also supports prompt caching, which cuts effective costs sharply when the same large document anchors repeated queries.

Claude Opus 4.7 at the same $5/$25 adds improved vision (3x higher resolution), a new xhigh effort level, and task budgets for cost control. The context window and pricing are identical to Opus 4.6.

Claude Sonnet 4.6 at $3/$15 offers the same 1M window at a meaningful discount. On most enterprise knowledge tasks, Sonnet 4.6 performs close to the Opus tier. If you're running long-context Q&A, document review, or information extraction rather than complex multi-hop reasoning, Sonnet 4.6 is the better value.

The premium over Gemini and GPT-4.1 is real. For use cases where the quality difference matters - legal analysis, compliance review, research synthesis that requires nuanced judgment - it's justified. For bulk document processing where speed and cost drive the decision, it isn't.


What You Need to Know Before Using 1M+ Context

Buying a 1M token context window and getting reliable performance from it are two separate things.

Lost in the middle. Every model in this list shows measurable accuracy degradation when target information sits in the center of a very long context. Single-needle retrieval (find one specific fact) holds up well at 1M tokens across all models here. Multi-needle retrieval (find five related facts scattered through the document) degrades noticeably past 500K tokens on every current model, regardless of what the spec sheet says. Test your specific retrieval task before assuming the full window works for it.
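One way to run the test suggested above is a small multi-needle harness: scatter known facts through filler text, ask for all of them, and score how many come back. The sketch below is a hypothetical setup, not a standard benchmark. `ask_model` stands in for whatever client call you use, and the substring match is a deliberately crude scorer.

```python
import random
from typing import Callable


def build_haystack(filler_sentences: int, needles: list[str], seed: int = 0) -> str:
    """Scatter the needle sentences at random depths inside synthetic filler text."""
    rng = random.Random(seed)
    lines = [f"Filler sentence number {i}." for i in range(filler_sentences)]
    positions = sorted(rng.sample(range(filler_sentences), len(needles)))
    # Insert from the deepest position backwards so earlier indexes stay valid.
    for pos, needle in zip(reversed(positions), reversed(needles)):
        lines.insert(pos, needle)
    return " ".join(lines)


def multi_needle_recall(answer: str, expected_facts: list[str]) -> float:
    """Fraction of expected facts that appear (case-insensitively) in the answer."""
    found = sum(1 for fact in expected_facts if fact.lower() in answer.lower())
    return found / len(expected_facts)


def run_trial(ask_model: Callable[[str], str], haystack: str,
              question: str, expected_facts: list[str]) -> float:
    """Send haystack plus question to the model and score the reply."""
    answer = ask_model(f"{haystack}\n\nQuestion: {question}")
    return multi_needle_recall(answer, expected_facts)


if __name__ == "__main__":
    needles = [f"The access code for vault {n} is {n * 111}." for n in range(1, 6)]
    facts = [str(n * 111) for n in range(1, 6)]
    haystack = build_haystack(10_000, needles)
    # Stand-in "model" that echoes its prompt; replace with a real API call.
    print(run_trial(lambda prompt: prompt, haystack, "List all five vault codes.", facts))
```

Running the same trial at several haystack sizes, for example 100K, 500K, and 1M tokens, shows where recall starts to drop for your own kind of data.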

Latency. A 100K-token prompt is already slow. At 1M tokens, time-to-first-token across these models runs in tens of seconds or longer. DeepSeek V4-Flash and Gemini 3.5 Flash offer the best throughput, but none of these are suitable for real-time, latency-sensitive applications at full context depth.

Cost vs. RAG. A single 1M-token call at Gemini 3.1 Pro's higher tier ($4 input per million) costs $4 in input tokens alone. Processing 1,000 such queries per day runs $4,000 daily in input tokens. A well-tuned RAG pipeline handles the same workload at one to two orders of magnitude less. Long-context inference earns its advantage when full-context reasoning matters - legal analysis that requires cross-document synthesis, code review where understanding the entire call graph changes the answer, compliance audits where a missed connection between two distant passages creates liability.
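The arithmetic above is easy to reproduce and adapt. In this sketch the long-context figures come from the paragraph; the RAG figure assumes each query sends about 20K tokens of retrieved context at the $2 lower-tier rate, which is an illustrative number that ignores embedding and retrieval infrastructure costs.

```python
def daily_input_cost(tokens_per_call: int, calls_per_day: int, rate_per_million: float) -> float:
    """Daily input-token spend in USD."""
    return tokens_per_call * calls_per_day * rate_per_million / 1_000_000


if __name__ == "__main__":
    # Full-context calls at Gemini 3.1 Pro's higher tier ($4 per million input tokens).
    long_context = daily_input_cost(1_000_000, 1_000, 4.00)
    # Hypothetical RAG setup: ~20K retrieved tokens per query at the $2 lower tier.
    rag = daily_input_cost(20_000, 1_000, 2.00)
    print(f"long context: ${long_context:,.0f}/day")
    print(f"RAG (assumed 20K tokens/query): ${rag:,.0f}/day")
    print(f"ratio: {long_context / rag:.0f}x")
```

With these assumptions the gap is about 100x, at the upper end of the "one to two orders of magnitude" range. A larger retrieved context narrows it.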

Context caching. If you're sending the same large document as part of every call, check whether your provider supports context caching before paying full per-token rates. Anthropic, Google, and DeepSeek all support caching, and on workflows where a large system prompt or reference document stays constant across dozens of queries, caching can cut effective per-call input cost by roughly 70-85%.
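How much caching saves depends on how the provider prices cached tokens, so it helps to model it before committing. The sketch below is a simplified model with a hypothetical `cached_price_fraction` parameter (the share of the normal input rate charged for cached prefix tokens). It ignores cache-write surcharges, storage fees, and cache expiry, which vary by provider.

```python
def input_cost_no_cache(prefix_tokens: int, query_tokens: int,
                        calls: int, rate_per_million: float) -> float:
    """Total input cost in USD when every call resends the full prefix."""
    return calls * (prefix_tokens + query_tokens) * rate_per_million / 1_000_000


def input_cost_with_cache(prefix_tokens: int, query_tokens: int, calls: int,
                          rate_per_million: float, cached_price_fraction: float) -> float:
    """Total input cost in USD when the shared prefix is cached after the first call.

    cached_price_fraction is a hypothetical discount factor, not a published rate.
    """
    first_call = (prefix_tokens + query_tokens) * rate_per_million
    later_calls = (calls - 1) * (prefix_tokens * cached_price_fraction + query_tokens) * rate_per_million
    return (first_call + later_calls) / 1_000_000


if __name__ == "__main__":
    # 900K-token reference document, 1K-token questions, 50 calls, $3/M input,
    # cached tokens assumed to cost 10% of the normal rate.
    full = input_cost_no_cache(900_000, 1_000, 50, 3.00)
    cached = input_cost_with_cache(900_000, 1_000, 50, 3.00, 0.10)
    print(f"without cache: ${full:.2f}")
    print(f"with cache:    ${cached:.2f}")
    print(f"saving:        {1 - cached / full:.0%}")
```

The saving shrinks as the number of calls per cached prefix drops or as the uncached per-query portion grows, so plug in your own call pattern and your provider's published cache pricing.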

Advertised context size is a capacity statement. It doesn't tell you how accurately the model uses that capacity at full depth.


Best Picks by Use Case

For coding and repository analysis: GPT-4.1 is the top choice. 54.6% SWE-bench Verified, 1M context, flat pricing at $2/$8. Nothing else in this list matches it on software engineering tasks specifically.

For cost-sensitive API workloads: DeepSeek V4-Flash at $0.14/$0.28. Roughly an order of magnitude cheaper than any proprietary model here. For bulk transcript summarization, large document scanning, or structured data extraction at scale, only hosted Llama 4 Scout comes close on price.

For self-hosting: Llama 4 Scout is the only option with an Apache 2.0 license and a 10M token window. It runs on a single H100 in int4. That's a practical on-premises deployment, not a multi-GPU research project.

For best overall: Gemini 3.1 Pro at $2/$12 - strongest benchmark profile for the price, 1M context, and a flat rate up to 200K tokens ($4/$18 above that).

For accuracy-first enterprise work: Claude Opus 4.6 or Claude Opus 4.7. Flat pricing with no tier changes, 128K output, and consistently better performance on complex reasoning tasks than anything at this price range.


FAQ

What model has the longest context window?

Llama 4 Scout holds the record at 10M tokens. Every proprietary model in this list tops out at 1M.

Is 1M token context reliable across the whole window?

Single-fact retrieval is reliable at 1M tokens for all models listed here. Multi-fact retrieval (extracting multiple related pieces of information from different parts of a long document) degrades for every model past roughly 500K tokens. Always test on your actual data before assuming full-window performance.

Do providers charge more for longer contexts?

Some do. Gemini 2.5 Pro and Gemini 3.1 Pro both double their input rate and raise their output rate by 1.5x when a request tops 200K tokens, repricing the entire request. GPT-4.1, DeepSeek V4, Claude, and Llama 4 Scout use flat pricing across their full context window.

When is long-context better than RAG?

When the task requires reasoning across relationships that span the full document and a retrieval system would miss them. Legal contracts where a clause on page 3 changes a definition on page 87, codebases where understanding a bug requires tracing 15 function calls, research synthesis where the insight comes from comparing contradictory passages. For straightforward information retrieval, RAG is cheaper and faster.

What is context caching?

Context caching saves the model's internal representation of a large context prefix so subsequent requests with the same prefix don't recompute it. Anthropic, Google, and DeepSeek support it. On workflows where the same large document or system prompt anchors every call, caching typically cuts effective per-call input cost by 70-85%.

Can I self-host a 1M+ context model?

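Yes. Llama 4 Scout is the only model here that combines open weights, an Apache 2.0 license, and a 10M context window, and in int4 quantization it fits on a single 80GB H100. The DeepSeek V4 weights are MIT-licensed and can also be self-hosted, though at 284B and 1.6T total parameters they need far more hardware. The remaining models in this list are API-only.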



✓ Last verified May 23, 2026

About the author: James Kowalski, AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.