When to Stop - Overthinking, Handoffs, and Abstention

Three new papers show that AI agents fail not by doing the wrong thing, but by doing things when they should have stopped.

Three papers published today on arXiv arrive at the same diagnosis from different angles: the AI systems we're building are clearly better at starting tasks than at stopping them. A reasoning model that overrides a correct answer with more thinking. A coding agent that can't efficiently pick up where another left off. An autonomous agent that proceeds when it should have waited for authorization. These aren't unrelated edge cases. They're the same gap in training and evaluation, showing up in different places.

TL;DR

  • Thinking Past the Answer - Reasoning models that keep thinking past a correct answer reduce their own accuracy by up to 21%; early stopping helps with verbose overthinking but not the harmful kind
  • Handoff Debt - Structured handoff notes between coding agents cut token costs 42-63% and agent actions 20-59% versus handing off raw repository state alone
  • What Benchmarks Don't Measure - Agents trained under RLHF develop "compliance bias," a tendency to proceed even when they lack authorization or safe inputs; runtime abstention reaches 89.2% hazard blocking while preserving 87.5% usability

Reasoning Models That Think Their Way to the Wrong Answer

The Setup

Reasoning models - systems that generate a chain of thought before producing a final answer - have become a standard approach for hard tasks. The implicit contract is that more reasoning means better answers. A team including Simone Caldarella, Davide Talon, Rahaf Aljundi, Elisa Ricci, and Massimiliano Mancini tested that assumption directly and found it doesn't hold.

Their paper, "Thinking Past the Answer: Evaluating Harmful Overthinking in Large Reasoning Models", introduces a protocol for identifying the earliest point in a reasoning trace where the model first produces a correct answer. Then they track what happens if reasoning continues.

What They Found

Stopping at the first correct prefix improves accuracy by up to 21% over letting the model finish its reasoning. The model reaches the right answer, then argues itself out of it.

The researchers distinguish two types of overthinking. Verbose overthinking is the unnecessary kind - extra steps and restatements that don't change the outcome. Harmful overthinking is the kind that actually reverses a correct answer. Early stopping strategies, the most common mitigation in production systems, cut verbose overthinking by up to 50%. They do basically nothing for harmful overthinking.
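The two categories are easy to state in code. What follows is a minimal sketch, not the paper's protocol: it assumes a trace has already been cut into prefixes and that an answer has been extracted after each one. The names `first_correct_index` and `classify_trace`, and the label strings, are hypothetical.

```python
from typing import Optional, Sequence


def first_correct_index(answers: Sequence[Optional[str]], gold: str) -> Optional[int]:
    """Return the index of the earliest prefix whose extracted answer matches gold."""
    for i, answer in enumerate(answers):
        if answer == gold:
            return i
    return None


def classify_trace(answers: Sequence[Optional[str]], gold: str) -> str:
    """Label a trace by what happens after its first correct answer.

    answers[i] is the answer extracted after reasoning prefix i
    (None if no answer could be extracted at that point).
    """
    first = first_correct_index(answers, gold)
    if first is None:
        return "never_correct"
    if first == len(answers) - 1:
        return "no_overthinking"       # the trace ends as soon as it is right
    if answers[-1] == gold:
        return "verbose_overthinking"  # extra steps, same final outcome
    return "harmful_overthinking"      # continued reasoning flipped a correct answer


# Illustrative traces (made-up answers, not data from the paper).
flipped = classify_trace([None, "12", "12", "15"], gold="12")
padded = classify_trace([None, "12", "12"], gold="12")
print(flipped, padded)
```

Note the simplification: a trace that wanders off a correct answer and later returns to it is counted here as verbose, since only the final answer is compared.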

That's the result with real practical weight. Token budget controls and early stopping are standard tools in deployed reasoning pipelines. This work says they're solving the easier problem. Harmful overthinking - where continued reasoning flips a correct answer - requires something an inference-time fix can't provide.

For multimodal tasks, the mechanism they identify is "logical drift and visual reinterpretation": the model reconvinces itself that what it perceived wasn't what it perceived. The problem extends to language-only reasoning benchmarks as well.

[Image: a researcher reviews handwritten notes and analysis papers. Caption: Reasoning traces that look thorough can be working against accuracy. The paper shows that earlier answers are often better answers. Source: unsplash.com]

Why This Matters

Reasoning-capable models are expensive. Every token in a chain of thought adds latency and cost. The popular framing is that you optimize by setting a budget - run the model until you hit the token limit. This paper says that framing misses the core issue: the model may have been right much earlier, before your budget even kicked in.

If early stopping can't catch a correct answer being reversed, the fix has to come at training time, not inference time. That's a harder engineering problem, and it's one the field hasn't fully confronted yet.

We've covered how reasoning models hit hard capability limits even as they scale. Harmful overthinking is a different kind of limit - not a ceiling on what the model can reach, but a floor it can fall through if left running too long.


The Cost of Handing Off Incomplete Context

The Setup

Most coding agent benchmarks evaluate a single agent completing a single task from start to finish. Real software development doesn't work that way. Tasks get interrupted. Engineers shift to other work. A different agent - or a different person - inherits a partial repository state with unclear history.

Dipesh KC and Anjila Budathoki introduce a concept they call "handoff debt": the rediscovery cost imposed on a successor agent when a predecessor's work is opaque or incomplete. Their paper, "Handoff Debt: The Rediscovery Cost When Coding Agents Take Over Interrupted Tasks", is the first systematic study of this problem.

What They Found

The researchers built a takeover protocol. They interrupted coding agents at deterministic handoff points, froze the repository state, and evaluated successor agents under four handoff conditions: repository state only, raw predecessor trace, summary notes, and structured notes.

Over 75 source tasks, which yielded 181 handoff points and 724 takeover runs with three successor models, the results were consistent:

  • Structured handoff notes reduced median agent events by 20-59% versus repository-only handoffs
  • Cumulative prompt tokens dropped 42-63% with structured context
  • Solved-rate improvements were smaller and model-dependent, but efficiency gains held across all three successors

The format of the handoff matters enormously. Raw predecessor traces are nearly as bad as handing off no context at all. Structured notes - not summaries, specifically structured notes with clear state descriptions - are what successor agents need to avoid rederiving work that's already been done.
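To make "structured notes with clear state descriptions" concrete, here is one possible shape for such a note. This is a sketch under assumptions: the article does not give the paper's note schema, so the `HandoffNote` class, its fields, and the `render` format are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HandoffNote:
    """Hypothetical structured note a predecessor agent leaves for its successor."""
    task: str
    completed_steps: List[str] = field(default_factory=list)
    files_touched: List[str] = field(default_factory=list)
    failing_checks: List[str] = field(default_factory=list)
    next_steps: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Serialize the note as plain text to place ahead of the successor's prompt."""
        sections = [
            ("Task", [self.task]),
            ("Completed", self.completed_steps),
            ("Files touched", self.files_touched),
            ("Failing checks", self.failing_checks),
            ("Next steps", self.next_steps),
        ]
        lines: List[str] = []
        for title, items in sections:
            lines.append(f"{title}:")
            # Empty sections are written out explicitly so the successor
            # knows the field was considered, not forgotten.
            lines.extend(f"  - {item}" for item in (items or ["(none)"]))
        return "\n".join(lines)


# Illustrative values only.
note = HandoffNote(
    task="Fix timezone handling in the date parser",
    completed_steps=["Reproduced the bug with a failing unit test"],
    files_touched=["parser/dates.py", "tests/test_dates.py"],
    next_steps=["Normalize inputs to UTC before comparison"],
)
print(note.render())
```

The design choice worth noting is the fixed set of fields: unlike a free-form summary, the successor can find the current state and the next step without rereading the predecessor's history.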

[Image: software engineers collaborating and reviewing work together. Caption: Agent handoffs are less like documentation and more like a baton pass: the format of the transfer determines whether the race continues or restarts. Source: unsplash.com]

Why This Matters

The gap between benchmark performance and real-world utility for coding agents is well-documented. This paper adds a dimension the benchmark community hasn't addressed: continuity. SWE-bench and similar evaluations measure isolated problem-solving ability. They say nothing about whether an agent can inherit partial work and continue efficiently.

The immediate practical implication is concrete. If you're building multi-agent coding pipelines, the structure of handoff artifacts isn't optional - it's a direct multiplier on token costs and agent actions, with smaller, model-dependent effects on solve rates. Designing a handoff format is now part of pipeline architecture.

The paper also contains an implicit call to benchmark maintainers: add interrupted-and-resumed task variants. Current evaluations don't penalize agents for being bad at this. They don't even measure it.


Agents That Don't Know When Not to Act

The Setup

Most agent safety work focuses on what agents say. Output filtering. Jailbreak resistance. Whether the model's response contains harmful content. Victor Ojewale and Suresh Venkatasubramanian take a different angle: when agents act.

Their paper, "What Benchmarks Don't Measure: The Case for Evaluating Abstention Competence in Autonomous Agents", won Best Paper at the ACM CAIS 2026 RLEval Workshop. It identifies "compliance bias" - a structural tendency in RLHF-trained agents to proceed with tasks even when they shouldn't.

The mechanism is straightforward. RLHF reward signals treat task completion as the correct default. Benchmarks score agents on whether they completed the task. Abstaining doesn't score points. So agents learn to proceed, even when proceeding isn't safe or authorized.

What They Found

The researchers introduce a three-gap taxonomy for when abstention is the correct behavior:

  • Specification gaps: the agent lacks necessary inputs or information to act correctly
  • Verification gaps: the agent can't confirm whether a planned action is safe or authorized
  • Authority gaps: the agent lacks authorization to take the action regardless of technical capability

Across 144 enterprise agent scenarios and five model families, they show that runtime-enforced abstention mechanisms block 89.2% of hazardous actions while preserving 87.5% usability on authorized scenarios. The balance between the two isn't a fixed tradeoff - it's an adjustable parameter that current evaluation frameworks don't even expose.
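A runtime check built on the three-gap taxonomy could look something like the sketch below. It is an illustration, not the mechanism the paper evaluates: `Gap`, `PlannedAction`, `abstention_gate`, and the permission strings are hypothetical names, and real systems would need far richer checks than these boolean fields.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional


class Gap(Enum):
    SPECIFICATION = "specification"  # required inputs are missing
    VERIFICATION = "verification"    # safety or authorization cannot be confirmed
    AUTHORITY = "authority"          # the agent is not permitted to act


@dataclass(frozen=True)
class PlannedAction:
    name: str
    required_inputs: FrozenSet[str]
    provided_inputs: FrozenSet[str]
    safety_verified: bool
    required_permission: str


def abstention_gate(action: PlannedAction, granted: FrozenSet[str]) -> Optional[Gap]:
    """Return the first gap that should block the action, or None to proceed."""
    if not action.required_inputs <= action.provided_inputs:
        return Gap.SPECIFICATION
    if not action.safety_verified:
        return Gap.VERIFICATION
    if action.required_permission not in granted:
        return Gap.AUTHORITY
    return None


# Illustrative scenario: all inputs present and verified, but the agent
# was only ever granted read access, so the gate reports an authority gap.
refund = PlannedAction(
    name="issue_refund",
    required_inputs=frozenset({"order_id", "amount"}),
    provided_inputs=frozenset({"order_id", "amount"}),
    safety_verified=True,
    required_permission="refunds:write",
)
blocked_by = abstention_gate(refund, granted=frozenset({"orders:read"}))
print(blocked_by)
```

Because the gate runs outside the model, it blocks the action regardless of how strongly the agent itself is inclined to proceed.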

Different model families vary substantially in their baseline abstention behavior. Some are structurally more inclined to proceed despite authority gaps than others. That variation isn't currently part of how organizations select models for deployment.

Why This Matters

Anyone who has watched a deployed agent confidently execute a task it was never authorized to handle will recognize the compliance bias problem. The agent isn't malfunctioning. It's doing exactly what it was trained to do: complete tasks.

Research into agent autonomy and oversight has focused heavily on behavioral alignment - making sure agents don't lie, manipulate, or pursue hidden goals. Abstention competence is a different and more immediate problem: making sure agents know when to pause and ask rather than proceed and fail.

The benchmark gap is also worth noting. Standard agent evaluations don't include abstention as a category. An agent that blocks a hazardous action because it recognizes an authority gap gets no credit in current scoring regimes. The paper proposes three evaluation metrics - Safety Rate, Usability Rate, and Informed Refusal Rate - as additions to standard benchmarks.
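The article does not reproduce the paper's formulas for these metrics, so the sketch below uses assumed definitions: Safety Rate as the share of should-abstain scenarios where the agent abstained, Usability Rate as the share of authorized scenarios where it proceeded, and Informed Refusal Rate as the share of correct refusals that named a reason. `ScenarioResult` and `abstention_metrics` are hypothetical names, and the paper's actual definitions may differ.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class ScenarioResult:
    should_abstain: bool  # ground truth: a gap is present in this scenario
    abstained: bool       # what the agent actually did
    cited_gap: bool       # whether a refusal named the reason for stopping


def abstention_metrics(results: Iterable[ScenarioResult]) -> Dict[str, float]:
    """Compute the three rates under the ASSUMED definitions described above."""
    results = list(results)
    hazardous = [r for r in results if r.should_abstain]
    authorized = [r for r in results if not r.should_abstain]
    correct_refusals = [r for r in hazardous if r.abstained]

    def rate(numerator: int, denominator: int) -> float:
        # An empty category yields NaN rather than a misleading 0 or 1.
        return numerator / denominator if denominator else float("nan")

    return {
        "safety_rate": rate(len(correct_refusals), len(hazardous)),
        "usability_rate": rate(sum(not r.abstained for r in authorized), len(authorized)),
        "informed_refusal_rate": rate(sum(r.cited_gap for r in correct_refusals), len(correct_refusals)),
    }


# Illustrative results only, not data from the paper.
demo = [
    ScenarioResult(should_abstain=True, abstained=True, cited_gap=True),
    ScenarioResult(should_abstain=True, abstained=False, cited_gap=False),
    ScenarioResult(should_abstain=False, abstained=False, cited_gap=False),
    ScenarioResult(should_abstain=False, abstained=True, cited_gap=False),
]
print(abstention_metrics(demo))
```

Reporting the first two rates side by side is what exposes the tradeoff: an agent that refuses everything scores perfectly on safety and zero on usability.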


The Shared Gap

Three teams. Three failure modes. One structural problem.

Current training frameworks reward agents for completing tasks. They don't penalize - and often actively select against - the capacity to recognize when stopping, pausing, or refusing is the correct action. The overthinking paper shows reasoning systems trained to produce long chains keep reasoning past correct answers. The handoff paper shows benchmark protocols that evaluate isolated completion miss continuity entirely. The abstention paper shows RLHF reward signals make agents structurally averse to waiting for authorization.

None of these are fixable at inference time alone. The evaluation frameworks that shape training incentives need to add a question: not just whether the agent completed the task, but whether it knew when not to.


About the author
Elena Marchetti, Senior AI Editor & Investigative Journalist

Elena is a technology journalist with over eight years of experience covering artificial intelligence, machine learning, and the startup ecosystem.