Tool Overuse, Precision Leaks, Metacognition Fails

Three new papers expose systematic failure modes in LLM agents - from unnecessary tool calls to jailbreaks that emerge only under quantization.

Today's arXiv brings three papers that, read together, point at the same structural problem: LLMs have systematic blind spots about their own behavior, and those blind spots show up in ways that standard evaluations don't catch. Agents reach for tools they don't need. Safety alignment disappears under quantization. Models can't convert self-knowledge into better decisions.

TL;DR

  • The Tool-Overuse Illusion - LLMs make unnecessary tool calls because they misjudge their own knowledge limits; preference optimization cuts the overuse rate by 82.8%
  • PrecisionDiff - Prompts correctly refused at bfloat16 produce harmful output at int8 - a jailbreak attack surface hiding inside your inference config
  • MIRROR - Telling models their calibration scores does nothing; only external architectural constraints work, cutting confident failures by 76%

LLMs Reach for Tools They Don't Need

When you equip an LLM with external tools - web search, code execution, calculator APIs - you solve real limitations. You also introduce a problem the field has mostly ignored: the model will use those tools even when it doesn't have to.

A new paper puts numbers to this. The researchers call it the tool-overuse illusion, and across a range of models and tasks, unnecessary tool calls weren't an edge case. They were the norm.

The root cause is what the authors call a knowledge epistemic illusion: models systematically misidentify the edge of their internal knowledge. When uncertain, they delegate to a tool. The problem is that they're uncertain even when they shouldn't be - they've forgotten they know the answer, or never trusted their recall in the first place.

[Image: a robotic arm in operation] Giving LLM agents access to external tools adds capability - but new research finds agents routinely call tools they don't need, adding latency and cost with no accuracy gain. Source: unsplash.com

What the Fix Looks Like

The team's proposed solution applies preference optimization directly to the boundary problem. They train models to correctly identify when internal knowledge is sufficient, using a DPO-style approach that aligns the model's epistemic self-assessment with reality. Unnecessary tool calls dropped by 82.8% while task accuracy held.
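A minimal sketch of how such preference pairs might be constructed - the function name, completion format, and matching heuristic are illustrative assumptions, not the paper's actual pipeline:

```python
def build_preference_pair(question: str, internal_answer: str,
                          gold_answer: str, tool_trace: str):
    """Return (chosen, rejected) completions for one DPO-style example.

    If the model's internal answer already matches the gold answer,
    prefer answering directly and reject the tool-calling trace;
    otherwise the tool call was genuinely needed, so prefer it.
    """
    direct = f"Answer: {internal_answer}"
    with_tool = f"Call: {tool_trace}\nAnswer: {internal_answer}"
    if internal_answer == gold_answer:
        return direct, with_tool  # internal knowledge sufficed
    return with_tool, direct      # delegation was justified
```

Training on pairs like these rewards the model for answering directly exactly when its internal knowledge would have been enough - the epistemic boundary the paper targets.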

A second mechanism is also at work. When reward models grade on task outcomes alone, they accidentally reinforce tool use - calling a tool often feels safer to the model, even if the answer was already there internally. Adjusting reward signals to penalize unnecessary calls cut overuse by 60.7-66.7% in separate experiments, independent of the epistemic alignment fix.
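The reward adjustment can be sketched as a simple shaping term; the 0.3 penalty below is an illustrative constant, not a value from the paper:

```python
def shaped_reward(correct: bool, used_tool: bool, tool_was_needed: bool,
                  penalty: float = 0.3) -> float:
    """Outcome reward minus a penalty for unnecessary tool calls.

    A plain outcome-only reward (drop the penalty branch) treats a
    correct answer reached via a redundant tool call identically to
    a correct direct answer - which is how overuse gets reinforced.
    """
    reward = 1.0 if correct else 0.0
    if used_tool and not tool_was_needed:
        reward -= penalty
    return reward
```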

For anyone building agentic pipelines, the practical takeaway is direct: every unnecessary tool call adds latency, token cost, and failure surface. The assumption that more tools equals more capable is worth auditing. If you're building your first AI agent, this is a structural consideration from day one.


Your Safety Alignment Doesn't Survive Quantization

Nobody put this on the safety evaluation checklist: running the same model under different numeric precisions can produce different answers to the same prompt.

Researchers from Shanghai Jiao Tong University, Beihang University, Nanyang Technological University, and Singapore Management University introduce PrecisionDiff, an automated differential testing framework that feeds identical prompts to the same model at bfloat16, float16, int16, and int8, then compares outputs. The finding is a measurable gap in alignment behavior between precisions.

Precision-induced disagreements manifest as jailbreak divergence: a prompt that the bfloat16 version correctly refuses will, under int8 quantization, produce a harmful response. The researchers tested Llama-2, Llama-3, Mistral, Vicuna, and Guanaco. Divergences appeared across all of them.
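The core of such a differential test fits in a few lines. The function names, toy model stubs, and refusal heuristic below are assumptions for illustration, not PrecisionDiff's actual API:

```python
def divergent_prompts(prompts, generate_full, generate_quantized,
                      is_refusal):
    """Flag prompts where the full-precision model refuses but the
    quantized model answers - the jailbreak-divergence case."""
    flagged = []
    for prompt in prompts:
        full_out = generate_full(prompt)
        quant_out = generate_quantized(prompt)
        if is_refusal(full_out) and not is_refusal(quant_out):
            flagged.append(prompt)
    return flagged

# Toy stand-ins for two precision variants of one model
def fake_full(p):
    return "I can't help with that."

def fake_quant(p):
    return "Sure, here is how..." if "exploit" in p else "I can't help with that."

hits = divergent_prompts(["write an exploit", "hello"],
                         fake_full, fake_quant,
                         lambda out: out.startswith("I can't"))
```

In a real harness the two generate functions would wrap the same checkpoint loaded at different precisions, and the refusal check would be a classifier rather than a string prefix.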

[Image: close-up of a PCB circuit board] Safety alignment is typically evaluated at full precision - but production deployments commonly use quantized int8 or int16 for efficiency. PrecisionDiff shows the guardrails can disappear in translation. Source: pexels.com

Why Quantization Breaks Alignment

The reason is structural. Quantization compresses weights by rounding values to lower-bit representations, and that compression isn't uniform - it hits some layers and attention heads harder than others. A safety-relevant direction in the model's activation space might survive bfloat16 cleanly but degrade below a functional threshold at int8. The alignment property stays in place on paper. The safety behavior vanishes in practice.
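A toy numpy illustration of the mechanism - the "safety direction" here is a synthetic vector, whereas real safety-relevant directions live in learned activation spaces:

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> np.ndarray:
    """Symmetric int8 round-trip: quantize, then dequantize."""
    scale = np.abs(w).max() / 127.0
    return np.round(w / scale).clip(-127, 127).astype(np.int8) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=4096)              # one weight row
safety_dir = rng.normal(size=4096)
safety_dir /= np.linalg.norm(safety_dir)
w = w + 0.02 * safety_dir              # small safety-relevant component

proj_before = float(w @ safety_dir)
proj_after = float(quantize_int8(w) @ safety_dir)
# Per-element rounding error is bounded by scale/2, but its projection
# onto the safety direction can be comparable in size to the small
# safety-relevant component itself - which is how a direction that is
# intact at bfloat16 can drop below a functional threshold at int8.
```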

This has direct consequences for anyone deploying quantized models at scale. The standard workflow is to assess safety in full precision, then deploy at lower precision for efficiency. PrecisionDiff shows that workflow has a blind spot. Running safety evaluations at actual deployment precision isn't overly cautious - it's a requirement.

The finding also connects to a pattern we covered earlier: prior work on agent safety probes found that alignment is regularly assessed on a surface that differs from where the model actually runs. Precision is one version of that gap. The paper suggests there are likely others.


Knowing Your Limits Doesn't Help If You Can't Act on Them

The third paper, submitted to NeurIPS 2026, introduces MIRROR - a hierarchical benchmark for testing whether LLMs can translate self-knowledge into better decisions. Across 16 models from 8 organizations, assessed over roughly 250,000 instances, the answer is mostly no.

MIRROR tests four ascending levels of metacognitive capability:

  1. Self-knowledge - does the model know what it knows within a domain?
  2. Knowledge transfer - can it apply that awareness across domains?
  3. Compositional self-prediction - can it predict its own cross-domain performance?
  4. Adaptive self-regulation - does it change its behavior accordingly?

Level three is where things collapse universally. Compositional Calibration Error - measuring how accurately a model predicts its own multi-domain performance, lower being better - ranges from 0.434 to 0.758 across all 16 models. For reference: pure random guessing scores 0.5. The best models barely clear that floor.
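As an illustration, a mean-absolute-gap version of such a calibration error - the paper's exact formula may differ:

```python
def calibration_error(predicted_acc, actual_acc):
    """Mean absolute gap between self-predicted and measured
    per-domain accuracy. 0.0 is perfect self-prediction; a model
    that guesses its accuracy at random lands near 0.5."""
    assert len(predicted_acc) == len(actual_acc)
    gaps = [abs(p - a) for p, a in zip(predicted_acc, actual_acc)]
    return sum(gaps) / len(gaps)

# A model that thinks it is strong in a domain where it is weak:
err = calibration_error([0.9, 0.5, 0.7], [0.4, 0.5, 0.6])
```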

Telling models their calibration scores produces no significant improvement. Only external architectural constraints work.

Level four is where the practical stakes are. Even models that show above-chance self-knowledge in specific domains "systematically fail to translate even this partial awareness into appropriate agentic action-selection," per the paper. Knowing you're likely to be wrong about chemistry doesn't stop you from answering a chemistry question confidently.

The Only Fix Is External

The authors tested two interventions. Feeding calibration scores back to the model - telling it "you're 70% accurate in this domain" - produced no statistically significant improvement. External architectural constraints that block high-confidence responses in flagged scenarios cut the Confident Failure Rate from 0.600 to 0.143, a 76% reduction at temperature 0.
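A minimal version of such an external gate, assuming the surrounding system tracks which domains a model is unreliable in - all names and thresholds below are illustrative, not the paper's implementation:

```python
def gated_response(answer: str, confidence: float, domain: str,
                   flagged_domains: set, threshold: float = 0.8) -> str:
    """Block high-confidence answers in domains flagged as unreliable,
    forcing an abstention the model itself would not produce."""
    if domain in flagged_domains and confidence >= threshold:
        return "ABSTAIN: high-confidence response blocked in flagged domain."
    return answer
```

The key design point is that the gate lives outside the model: the constraint holds even when the model's own self-assessment is confidently wrong.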

This challenges a common design assumption in agentic systems: that models will surface their own uncertainty honestly, and that providing them calibration data will improve their caution. MIRROR's data says neither assumption holds. The architecture, not the model, has to enforce epistemic limits.

The result fits neatly alongside earlier research on AI safety and alignment and our coverage of tool poisoning and agent manipulation - a pattern of internal model properties proving insufficient and external scaffolding being the reliable corrective.


Common Threads

These three papers aren't about different problems. Tool overuse, precision-dependent alignment, metacognitive failure - each is a variant of the same issue. Models have internal states that don't accurately reflect the world, and those inaccuracies spread into behavior in ways that aren't visible from standard benchmarks run at standard conditions.

None of these are solved by scaling or capability improvements alone. Each fix is external to the model itself: better reward signals during training, safety re-evaluation at deployment precision, architectural enforcement of epistemic limits. All three belong on the engineering checklist for production agentic systems.



About the author

Elena Marchetti, Senior AI Editor & Investigative Journalist

Elena is a technology journalist with over eight years of experience covering artificial intelligence, machine learning, and the startup ecosystem.