MIT's Recursive Language Models Bypass the Context Ceiling
MIT researchers show that treating long documents as a Python environment - and letting models recursively spawn sub-models to explore them - beats RAG and extended context windows on every benchmark tested.

The RLM architecture: a root model orchestrates the Python environment, delegating document exploration to recursive sub-model calls.
Every few months, the long-context problem gets declared solved. First it was RAG - retrieve the relevant chunks, stuff them in the prompt. Then it was ever-expanding context windows, with vendors now advertising 1M, 2M, even 10M token limits. Neither approach has been fully satisfying. RAG loses information in the retrieval step. Extended windows work, but the model's attention drifts - a phenomenon sometimes called "context rot" - and the compute costs scale roughly quadratically with length.
A new paper from MIT CSAIL takes a different angle. Rather than asking how to fit more text into a model's context, Alex L. Zhang, Tim Kraska, and Omar Khattab ask what happens if you never load the full document into the model's context at all.
The answer is Recursive Language Models (RLMs), and the benchmark results are striking enough that this architecture deserves serious attention.
The Core Idea: Documents as an External Environment
The central move in RLMs is conceptually clean. Instead of feeding a long document directly into the model's prompt, you store it in a Python environment - literally as a variable in a running REPL - and give the model access to that environment through code execution.
The model doesn't read the document. It writes code to interact with it.
That code can slice the text into chunks, run regex searches, count occurrences, extract sections by header, or do anything else Python allows on a string. Crucially, it can also call llm_query() - a function that spawns a sub-model, hands it a snippet of the document, and gets a response back. Those sub-models can themselves spawn further sub-models. Hence "recursive."
When the root model is confident in an answer, it emits FINAL(answer) or FINAL_VAR(variable_name) and the call returns. The model's own context window stays clean throughout: it only ever sees the query, the code it has written, and the (truncated) outputs from code execution. The full document never enters.
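The control flow described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' implementation: `call_model` is a hypothetical stub standing in for a real LLM API, and the interaction loop, depth cap, and truncation length are assumptions based on the paper's description. Only the names `document`, `llm_query`, and `FINAL` come from the paper.

```python
import contextlib
import io


def call_model(prompt: str) -> str:
    """Hypothetical stub for a real LLM API call (an assumption here)."""
    # A real root model would emit Python code or a FINAL(...) answer.
    return "FINAL(done)"


def rlm(query: str, document: str, depth: int = 0, max_depth: int = 2) -> str:
    """Root-model loop: the document is a REPL variable, never prompt text."""

    def llm_query(snippet: str) -> str:
        # Spawn a sub-model on a snippet; recursion is capped by max_depth.
        if depth + 1 >= max_depth:
            return call_model(f"{query}\n---\n{snippet}")
        return rlm(query, snippet, depth + 1, max_depth)

    # The environment the model's code runs against: the full text plus
    # the recursive call. Neither ever enters the root model's context.
    env = {"document": document, "llm_query": llm_query}
    transcript = f"Query: {query}\nWrite Python using `document` and llm_query()."

    for _ in range(10):  # bounded interaction loop
        response = call_model(transcript)
        if response.startswith("FINAL("):
            return response[len("FINAL("):-1]
        # Otherwise treat the response as code: execute it in the
        # environment and feed the (truncated) output back to the model.
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(response, env)
        transcript += f"\n# output (truncated):\n{buf.getvalue()[:500]}"
    return "no answer"
```

The key property to notice: `transcript` only ever accumulates the query, emitted code, and truncated execution output, so the root model's context stays small no matter how large `document` is.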
This is architecturally distinct from RAG in an important way. RAG requires a retrieval system - an index, an embedding model, a similarity search - that is built before the query arrives. The retrieval strategy is fixed at index time. RLMs have no pre-built index. At query time, the model examines the document, decides what is relevant, and determines its own decomposition strategy on the fly. The model does retrieval; the programmer doesn't hard-code it.
What Emerges Naturally
One thing the paper highlights is that RLMs, given this environment, develop interpretable strategies without being explicitly trained to do so. The researchers observed four recurring patterns in model behavior:
Peeking: The model examines the first few hundred characters of the document to understand its structure before deciding how to proceed. Sensible behavior, and not something you'd get from a retrieval system that treats all chunks equally.
Grepping: The model uses regex or keyword matching to narrow the search space before doing anything more expensive. Cheap filtering before expensive recursive calls.
Partition + Map: The model chunks the document, fans out a sub-model call per chunk, and aggregates the results. Standard map-reduce, discovered by the model as the natural approach to certain aggregation tasks.
Summarization: The model extracts a subset of information from each chunk and uses those summaries for the outer model's decision. Unlike summarization-based long-context scaffolds, this is selective and targeted rather than indiscriminate.
These patterns are not programmed in. They emerge because the Python environment makes them naturally expressible, and because the underlying model is capable of recognizing when each is appropriate.
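Three of those patterns are easy to picture as the code a root model might actually emit into the REPL. The snippet below is a hand-written illustration with an invented document and a stubbed `llm_query`, not observed model output:

```python
import re

def llm_query(snippet: str) -> str:
    """Stub for the sub-model call; a real RLM would spawn an LLM here."""
    return f"summary of {len(snippet)} chars"

# Invented document purely for illustration.
document = ("Q3 revenue rose. " * 50) + "The CFO resigned in Q3. " + ("Filler. " * 50)

# Peeking: inspect the head of the document to learn its structure
# before committing to a strategy.
head = document[:80]

# Grepping: cheap regex filtering to narrow the search space before
# any expensive recursive call.
hits = [m.start() for m in re.finditer(r"CFO", document)]

# Partition + Map: chunk the text, fan out one sub-model call per
# chunk, then aggregate the per-chunk results.
chunk_size = 200
chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
summaries = [llm_query(c) for c in chunks]
answer = " | ".join(summaries)
```

The point of the environment design is that each of these strategies is a few lines of ordinary Python, so a code-capable model can reach for whichever one fits the task.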
Benchmark Results
The paper evaluates four tasks at increasing levels of complexity:
S-NIAH - A needle-in-a-haystack task at constant complexity, used as a sanity check. RLMs perform at ceiling.
BrowseComp-Plus - Multi-hop retrieval across up to 1,000 documents. RLM(GPT-5) achieves 91.33% accuracy; the CodeAct baseline reaches 51.00%. More telling, in the authors' words, RLM(GPT-5) is the only system that "maintains perfect performance at the 1,000-document scale." The retrieval-augmented baselines degrade as document count grows; RLMs don't.
OOLONG - Semantic aggregation at linear complexity, requiring the model to combine information across a document. RLM(GPT-5) scores 56.50% versus 44.00% for vanilla GPT-5 on the same task.
OOLONG-Pairs - The hardest benchmark, requiring pairwise comparison across all document segments (quadratic complexity). This is where the gap is most dramatic. RLM(GPT-5) achieves 58.00% F1. Vanilla GPT-5, given the same context, scores 0.04%. Standard models effectively cannot do this task at all; RLMs handle it routinely.
The authors also post-trained a native recursive variant called RLM-Qwen3-8B - the first model explicitly trained to think recursively rather than treating the REPL as a tool to be prompted into using. That model outperforms the base Qwen3-8B by 28.3% on average and approaches vanilla GPT-5 on three of the four long-context tasks. A fine-tuned 8B model approaching a frontier model on hard long-context tasks is not a trivial result.
Cost and Scale
The cost profile is counterintuitive in a good way. RLMs spend tokens on code and sub-model calls rather than on loading raw context. On BrowseComp-Plus, RLM(GPT-5) averages $0.99 per query. Direct context extension for the same task runs $1.50 to $2.75. Summarization-based approaches are up to 3x more expensive than RLMs while being substantially less accurate.
The architecture also scales to inputs that no existing context window can handle. The paper tests inputs reaching 2^18 tokens (roughly 262,000 tokens), and the framework is designed to handle 10M+ token documents by making the recursion depth configurable. At these scales, "context window extension" is not a meaningful alternative - there is no context window large enough. RLMs continue to work by never needing one.
Limitations Worth Knowing
The authors are candid about what the current implementation doesn't do well.
Recursive calls are blocking. There is no asynchrony, and the implementation doesn't use prefix caching across sub-calls. A single query can take anywhere from a few seconds to several minutes depending on document length and task complexity. This is a systems engineering problem more than an architectural one - the authors describe it as "rich with opportunities for optimization" - but it matters for latency-sensitive applications.
Cost is also non-deterministic. The median cost is reasonable; the tail is not. A query that the model handles efficiently might cost pennies. One that requires many recursive expansions might cost much more. There are currently no hard caps or budgets built in.
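Until such budgets exist upstream, a caller can impose one externally. The wrapper below is a hypothetical sketch of that idea, not part of the authors' code; the per-token price and the word-count token estimate are illustrative assumptions:

```python
class BudgetExceeded(Exception):
    """Raised when a query would push spending past the hard cap."""


class BudgetedLLM:
    """Hypothetical cost cap around an LLM call function (not the paper's API)."""

    def __init__(self, call_fn, max_cost_usd: float, price_per_token: float = 2e-6):
        self.call_fn = call_fn          # the underlying model call
        self.max_cost = max_cost_usd    # hard budget for this query
        self.price = price_per_token    # assumed flat price, for illustration
        self.spent = 0.0

    def __call__(self, prompt: str) -> str:
        # Crude token estimate: one token per whitespace-separated word.
        est = len(prompt.split()) * self.price
        if self.spent + est > self.max_cost:
            raise BudgetExceeded(f"spent ${self.spent:.6f} of ${self.max_cost}")
        self.spent += est
        return self.call_fn(prompt)
```

Passing a `BudgetedLLM` wherever the recursive framework expects a model call would turn the unbounded tail cost into a raised exception, which the caller can catch and handle.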
Finally, RLMs require a model capable of reasoning in code. The paper notes that Llama 3 8B "likely struggles to navigate the Python environment without specific distillation or fine-tuning." This isn't a critique of the approach - it's a statement about the model capability floor. The pattern works well with GPT-5, Claude 3.5 Sonnet, Qwen-Coder, and Gemini 3; it is not a plug-in for every model.
Why This Matters
The practical implications depend on where you sit.
If you are building RAG pipelines for document-heavy workloads, the OOLONG-Pairs result (58.00% versus 0.04% for standard models) should give you pause. There is a class of aggregation task - pairwise comparison, cross-document synthesis, anything requiring quadratic complexity across segments - where retrieval-based approaches and fixed context windows are structurally ill-suited. RLMs are not.
If you are thinking about long-context models as a scaling story, RLMs offer a different scaling axis. Rather than scaling the context window (hardware-constrained, expensive), you scale the recursion depth (software-defined, cheaper). The 10M+ token capability isn't a hardware spec; it's a consequence of the architecture.
The code is available on GitHub and the authors plan integration into the DSPy framework. RLM-Qwen3-8B is the first model explicitly trained for this paradigm - expect more.
Sources:
- Recursive Language Models - Zhang, Kraska, Khattab, arXiv 2512.24601
- RLM GitHub Repository - Alex L. Zhang
- Recursive Language Models: A new framework for infinite context - BD TechTalks
- MIT's Recursive Language Models Improve Performance on Long-Context Tasks - InfoQ
