Fix 8% of Tokens, Dodge Memory Attacks, Cut Agent Costs

Three papers from today's arXiv sit at the intersection of interpretability, agent security, and deployment efficiency. Each one has something concrete for practitioners - whether you're squeezing more reasoning out of a base model, building memory-augmented agents, or running web automation at scale.

TL;DR

Reasoning restored via 8% of tokens - A sparse delegation method recovers reasoning-model performance by correcting only the highest-disagreement planning tokens in a base model's output
Memory laundering in agent systems - Toxic content survives summarization in agent memory, bypassing safety filters while retaining behavioral influence
Skim cuts web agent costs 1.9x - Speculative execution that pre-profiles website structure reduces inference cost and latency with no accuracy drop

Only 8% of Tokens Drive the Reasoning Gap

If you've ever wondered why reasoning models beat base models, researchers from AlphaLab-USTC now have a precise answer. The paper "Reasoning Can Be Restored by Correcting a Few Decision Tokens" (Changshuo Shen, Leheng Sheng, Yuxin Chen, An Zhang, Xiang Wang) finds that the gap isn't distributed evenly across every generated token. It concentrates in a small, identifiable subset.

Their analysis of token-level disagreements between base and reasoning models found that roughly 8% of generated tokens account for the key disagreement between the two. These aren't random tokens - they cluster at the beginning of responses and correspond to planning decisions. The base model's uncertainty is highest at exactly these positions, which makes sense: early planning choices cascade through the rest of the generation.

The practical implication is direct. Rather than running an expensive reasoning model end-to-end, you can use a technique the authors call disagreement-guided token intervention: run the base model by default, identify positions where it diverges from a reasoning model's distribution, delegate only those positions to the reasoning model, then hand control back to the base model. Sparse delegation. The reasoning model only touches the ~8% that matters.
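To make the control flow concrete, here is a minimal sketch of sparse delegation. It is not the authors' implementation. The paper selects positions by disagreement with the reasoning model, whereas this toy version uses the base model's next-token entropy as a cheap stand-in trigger. That substitution is an assumption, motivated only by the observation above that base-model uncertainty peaks at the same positions. The callables `base_next` and `reasoner_next`, the `threshold` value, and the `<eos>` marker are all hypothetical.

```python
import math
from typing import Callable, Dict, List, Tuple

Dist = Dict[str, float]  # next-token distribution: token -> probability


def entropy(dist: Dist) -> float:
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def generate_with_delegation(
    base_next: Callable[[List[str]], Dist],     # hypothetical base-model step
    reasoner_next: Callable[[List[str]], str],  # hypothetical reasoning-model step
    prompt: List[str],
    max_tokens: int,
    threshold: float,                           # assumed entropy cutoff for delegation
    eos: str = "<eos>",
) -> Tuple[List[str], float]:
    """Decode with the base model, delegating uncertain positions.

    Returns the generated tokens and the fraction that was delegated.
    """
    tokens = list(prompt)
    delegated = 0
    generated = 0
    for _ in range(max_tokens):
        dist = base_next(tokens)
        if entropy(dist) > threshold:
            # Base model is uncertain: let the reasoning model pick this token.
            tok = reasoner_next(tokens)
            delegated += 1
        else:
            # Base model is confident: take its greedy choice.
            tok = max(dist, key=dist.get)
        tokens.append(tok)
        generated += 1
        if tok == eos:
            break
    rate = delegated / generated if generated else 0.0
    return tokens[len(prompt):], rate
```

The returned delegation rate is the number to watch. If it climbs well above the single-digit range the paper reports, the trigger is probably too loose for the savings to hold.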

[Figure: arXiv abstract page for "Reasoning Can Be Restored by Correcting a Few Decision Tokens". The core insight is that high-disagreement tokens are enriched 17x in planning positions. Source: arxiv.org]

Results show this approach substantially recovers reasoning model performance on challenging benchmarks - sometimes matching or topping same-size reasoning models - while keeping computational cost well below full reasoning-model inference.

If you're running a reasoning model and wondering whether you can cut costs without sacrificing quality, this paper offers a direct answer: for many queries, most tokens don't need the reasoning model at all. The question becomes how cheaply you can identify which tokens actually need it.

What It Changes for Practitioners

The 8% figure is more useful than it might seem. It means you can think about reasoning model cost as roughly proportional to the number of genuinely difficult planning decisions in a task, not the raw token count. Tasks with repetitive structure - code formatting, template completion, summarization - probably hit far fewer of these tokens than open-ended reasoning. Profiling disagreement rates per task type could let you route more aggressively.
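As a rough illustration of that cost model, the sketch below estimates per-task cost from a measured delegation rate. The `TaskProfile` fields, the per-token prices, and the function name are hypothetical, and the estimate deliberately ignores whatever it costs to detect which tokens need delegating, which a real deployment would have to add in.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    name: str
    delegation_rate: float  # fraction of tokens sent to the reasoning model (measured offline)
    avg_tokens: int         # average generated tokens per task of this type


def blended_cost(
    profile: TaskProfile,
    base_cost_per_token: float,
    reasoner_cost_per_token: float,
) -> float:
    """Estimated cost per task when only delegated tokens use the reasoning model.

    Assumption: detection overhead is zero, so this is a lower bound.
    """
    r = profile.delegation_rate
    per_token = (1.0 - r) * base_cost_per_token + r * reasoner_cost_per_token
    return per_token * profile.avg_tokens
```

With an 8% delegation rate and a reasoning model that costs 10x the base model per token, the blended per-token price works out to 1.72x the base price, compared with 10x for running the reasoning model end-to-end.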

Memory Laundering: A Hidden Agent Vulnerability

The second paper lands on agent security. Yian Wang, Agam Goyal, Yuen Chen, and Hari Sundaram (in "State Contamination in Memory-Augmented LLM Agents") document what they call memory laundering - a failure mode where toxic or adversarial content survives the summarization process that agent memory systems rely on.

The setup is familiar to anyone who's built a memory-augmented agent. Agents build up conversation history, summarize it periodically to keep context manageable, and store those summaries for future retrieval. The safety assumption most builders make is that if the summary passes a toxicity filter, the stored memory is clean.

That assumption is wrong.

The paper shows that toxic content can be compressed into summaries that look benign to standard safety detectors. The hostile framing or conflict structure gets preserved at a sub-detection level - what the authors call a Sub-Threshold Propagation Gap (SPG). The summary scores safe. But downstream, it still nudges the agent's outputs in the direction the original toxic content intended.

[Figure: arXiv abstract page for "State Contamination in Memory-Augmented LLM Agents". The key distinction is between raw transcript toxicity and the sub-threshold behavioral influence carried by summaries. Source: arxiv.org]

Two findings stand out. First, the toxicity channel is path-dependent: raw transcripts drive overt toxic output, while compressed memory carries hidden influence. Second, intervention timing is everything. Sanitizing content before summarization effectively limits contamination. Cleaning summaries after the fact does almost nothing - the influence is already baked in.


For anyone building production agents with persistent memory, the mitigation is clear: filter and sanitize before summarization, not after. This is a harder pipeline change than it sounds. Most agent frameworks run the summarization step automatically; inserting a sanitization pass before it requires restructuring the compression loop.
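A minimal sketch of what "sanitize before summarization" looks like in a compression loop follows. It makes no claim about how the paper implements its intervention. The `is_unsafe` and `summarize` callables are hypothetical placeholders for whatever classifier and summarizer your stack uses, and dropping whole turns is just the simplest possible sanitization policy.

```python
from typing import Callable, List, Tuple


def summarize_with_presanitization(
    turns: List[str],
    is_unsafe: Callable[[str], bool],       # hypothetical per-turn safety check
    summarize: Callable[[List[str]], str],  # hypothetical summarizer
) -> Tuple[str, int]:
    """Filter raw turns first, then compress only what survives.

    Returns the summary and the number of turns that were dropped.
    """
    # Sanitize the raw transcript, where toxicity is still overt and detectable.
    kept = [t for t in turns if not is_unsafe(t)]
    dropped = len(turns) - len(kept)
    # The summarizer never sees flagged content, so it cannot compress it
    # into a benign-looking memory entry.
    return summarize(kept), dropped
```

The design point is ordering: the filter runs on raw turns, where standard detectors still work, instead of on the compressed summary, where the paper finds they do not.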

The broader issue this raises is that agent safety research has mostly focused on model-level safeguards and output filtering. This paper treats memory as a control system with its own state dynamics - and shows that interventions which work on outputs don't automatically work on state transitions.

Skim: Web Agents Get 1.9x Cheaper

The third paper is the most directly deployable. "Skim: Speculative Execution for Fast and Efficient Web Agents" (Mike Wong, Kevin Hsieh, Suman Nath, Ravi Netravali) applies a classic computer architecture technique - speculative execution - to web agent inference.

The core observation is that purpose-built websites are repetitive. An e-commerce product page has the same structure as every other product page on that site. A search result page follows a fixed URL pattern. A checkout flow takes the same steps every time. Current web agents don't use this. They invoke expensive vision-language models to re-analyze each page from scratch, even when the structure is identical to a page they just processed.

Skim fixes this with a two-phase approach. In the offline phase, it profiles each website: capturing URL patterns, answer formats, and common task trajectories. At runtime, it matches an incoming query against these templates, synthesizes the destination URL directly, and extracts the answer with a smaller, faster model. If the prediction fails verification, it falls through to full agent execution - warm-started from where the fast path reached, so you don't pay full cost even for misses.
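The runtime half of that flow can be sketched in a few lines. This is an illustration of the fast-path-then-fallback pattern, not Skim's actual code or API. `SiteTemplate`, the single `q` slot, and the `extract_small`, `verify`, and `full_agent` callables are all hypothetical, and `shop.example` is a placeholder domain.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple
from urllib.parse import quote_plus


@dataclass
class SiteTemplate:
    query_pattern: re.Pattern  # matches the task text and captures a named group "q"
    url_template: str          # contains a "{q}" placeholder


def fast_path_url(task: str, templates: List[SiteTemplate]) -> Optional[str]:
    """Synthesize a destination URL from the first template that matches."""
    for t in templates:
        m = t.query_pattern.search(task)
        if m:
            return t.url_template.format(q=quote_plus(m.group("q")))
    return None


def run_task(
    task: str,
    templates: List[SiteTemplate],
    extract_small: Callable[[str, str], Optional[str]],  # small model: (url, task) -> answer
    verify: Callable[[str, str], bool],                  # (task, answer) -> accepted?
    full_agent: Callable[[str, Optional[str]], str],     # (task, start_url) -> answer
) -> Tuple[str, str]:
    """Try the speculative fast path; fall back to the full agent on a miss."""
    url = fast_path_url(task, templates)
    if url is not None:
        answer = extract_small(url, task)
        if answer is not None and verify(task, answer):
            return answer, "fast"
    # Warm start: hand the agent the URL the fast path reached, if any.
    return full_agent(task, url), "fallback"
```

Passing the synthesized URL to `full_agent` is how this sketch models the warm start: a failed prediction still saves the agent the navigation steps needed to reach that page.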

[Figure: arXiv abstract page for "Skim: Speculative Execution for Fast and Efficient Web Agents". The system was tested with three backbone agents: WebVoyager, AgentOccam, and BrowserUse. Source: arxiv.org]

Numbers from the paper: 1.9x median per-task cost reduction, 33.4% latency improvement, no accuracy degradation across benchmarks. The system was tested with three different backbone agents (WebVoyager, AgentOccam, and BrowserUse), which suggests the approach generalizes across agent architectures.

The technique is directly relevant to anyone running web automation in production. If your agentic workflows hit the same set of websites repeatedly, profiling those sites and caching their structure is essentially free compute. Skim formalizes that intuition into a framework with a fallback path that preserves correctness.

The Speculative Execution Parallel

Borrowing from CPU architecture here is clever and apt. Speculative execution in CPUs predicts branch outcomes and executes ahead, rolling back if the prediction fails. Skim predicts task outcomes using website structure and executes with lightweight models, falling back to full agent execution on failures. The key in both cases is that the fast path is cheap: as long as a fast-path attempt costs less than half of a full agent run, even a 50% hit rate saves money overall.
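The economics behind that parallel fit in a short back-of-the-envelope model. This is a simplification I am assuming for illustration, not a formula from the paper: every task pays for one fast-path attempt, and only misses pay for a full agent run, optionally discounted by a warm start. All parameter names are hypothetical.

```python
def expected_cost(
    hit_rate: float,
    fast_cost: float,
    full_cost: float,
    warm_start_discount: float = 0.0,  # assumed fraction of a full run saved on a miss
) -> float:
    """Expected per-task cost when every task tries the fast path first."""
    miss_cost = full_cost * (1.0 - warm_start_discount)
    return fast_cost + (1.0 - hit_rate) * miss_cost


def break_even_hit_rate(fast_cost: float, full_cost: float) -> float:
    """Hit rate above which speculation beats always running the full agent.

    Derived from fast_cost + (1 - h) * full_cost < full_cost, with no warm start.
    """
    return fast_cost / full_cost
```

Under this model, a fast path that costs 10% of a full run breaks even at a 10% hit rate, and any warm-start savings on misses push the break-even point lower still.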

A Common Thread

These three papers share an underlying premise: the expensive, general-purpose computation isn't always necessary. Reasoning models spend most of their effort on tokens that base models already handle fine. Agent memory systems apply heavy safety filtering to the wrong stage of the pipeline. Web agents pay full-model costs for tasks that fit a pre-profiled template.

The research agenda across all three is about identifying where the hard work actually lives - and only paying for it there. That's a useful frame for anyone thinking about how to build efficient AI agents without just throwing more compute at the problem.

Sources: