Bad Science, Poisoned Tools, and Aligned Reasoning

Three papers landed today that each poke at something we thought we understood about AI systems - and find it shakier than expected.

TL;DR

AI scientists produce results without reasoning scientifically - In 25,000+ runs across eight domains, LLM agents ignored available evidence 68% of the time and rarely revised beliefs when contradicted
How Adversarial Environments Mislead Agentic AI - POTEMKIN, an MCP-compatible harness, shows that fixing one class of agent robustness problem often makes another worse
Reasoning Structure Matters for Safety Alignment - AltTrain uses 1,000 supervised examples to fix reasoning models' safety failures - no reinforcement learning required

AI Agents Don't Reason Like Scientists - They Just Look Like They Do

A team led by Martiño Ríos-García, Nawaf Alampara, and Kevin Maik Jablonka ran over 25,000 agent runs across eight scientific domains and asked a simple question: do these agents actually reason scientifically? Their answer is a clear no.

The number that stands out is 68%: that's the share of agent traces where available evidence was simply ignored. When agents had data that contradicted their working hypothesis, they mostly proceeded anyway. Genuine refutation-driven belief revision - updating your position because the data says you're wrong - appeared in only 26% of runs.

A researcher reviews data at a computer workstation AI scientific agents execute the right workflow steps but skip the reasoning that connects evidence to conclusions. Source: unsplash.com

The study also breaks down where this behavior comes from. The underlying language model accounts for 41.4% of explained variance in agent reasoning patterns. The agent scaffold - whether it's a ReAct loop, an AutoGPT-style harness, or something custom - contributes just 1.5%. The base model is doing almost all the epistemic heavy lifting, and most base models don't reason the way scientists do.

There are two things worth unpacking here. First, this pattern holds regardless of whether agents are running fixed workflows or open-ended hypothesis-driven inquiry. The researchers tested both modes and found the problem persists across both. Second, providing successful reasoning examples as context didn't fix the underlying issues. You can show an agent what good scientific reasoning looks like, and it still won't reproduce it reliably.

Why Outcome-Based Evaluation Misses This

The paper makes a pointed methodological argument: if you only evaluate whether agents reach correct answers, you'll miss these failures entirely. An agent can reach a correct result via flawed reasoning - which means standard eval suites give a green light to a process that's broken at its core.

The proposed fix isn't a smarter scaffold. It's making reasoning structure itself a training target. Models need to learn what updating a belief on evidence actually means, not just how to mimic the surface pattern of a scientific workflow. Until that happens, AI-assisted research pipelines are producing outputs that look like science without the underlying rigor.

If you're using these tools in research workflows, the guide to using AI for academic research covers practical guardrails - but this paper is a reminder that no guardrail substitutes for understanding what's happening inside the loop.

Agents That Trust Their Tools Are Easy Targets

The second paper introduces a threat model most practitioners haven't thought about yet: what happens when the tools your agent uses are actively lying to it?

Zhonghao Zhan, Huichi Zhou, Zhenhao Li, Peiyuan Jing, and colleagues introduce Adversarial Environmental Injection (AEI). The idea is to attack agents not directly, but by poisoning the outputs of the external tools they call. Their testing harness, POTEMKIN, is MCP-compatible, making it plug-and-play against current agent architectures.

The researchers identify a "Trust Gap" central to how agents are currently built and assessed: capability is measured extensively, but skepticism toward tool outputs is never tested. Agents are assessed on whether they use tools correctly - not on whether they can detect when tools are lying.

Green matrix-style code representing adversarial environments for AI agents POTEMKIN creates MCP-compatible adversarial tool outputs to probe whether agents can detect when the environment is lying to them. Source: unsplash.com

POTEMKIN targets two distinct attack surfaces. "The Illusion" is a breadth attack: the tool returns plausible but false information that gradually corrupts the agent's beliefs about the world. "The Maze" is a depth attack: the tool creates conditions that cause the agent's policy to collapse, trapping it in loops or dead ends it can't escape.

Across 11,000+ runs testing five frontier agents, the core finding is uncomfortable. Robustness to epistemic attacks (Illusions) and robustness to navigational attacks (Mazes) are in tension. Agents that become better at detecting false information tend to become worse at navigating when they're stuck - and vice versa. There's no single tuning that defends against both.

What This Means in Practice

Anyone building agents that call external APIs, databases, or MCP servers should read this carefully. The standard defense posture is: confirm inputs before they reach the model, and trust nothing from the outside. This research shows you also need to worry about inputs that are bad but look completely legitimate - and that defending against the two failure modes may require distinct mitigations with no single unified fix.

The POTEMKIN harness is the practical deliverable here. Running your agent through adversarial tool-output testing is now something you can actually do. Whether that becomes standard practice before shipping depends on whether teams treat this as a real attack surface, not a theoretical one. The agentic AI benchmarks leaderboard currently has no adversarial tool-environment track - that's the gap this paper is calling out.

Fixing Reasoning Model Safety Doesn't Require Reinforcement Learning

Large reasoning models - the kind that produce extended chains of thought before answering - tend to be good at hard problems and bad at refusing harmful ones. Earlier work showed they can be manipulated by exploiting the reasoning phase itself. The fix, according to a new ACL 2026 paper, is simpler than most assumed.

Yeonjun In, Wonjoong Kim, Sangwu Park, and Chanyoung Park introduce AltTrain, a post-training approach that adjusts the structure of the reasoning chain rather than just the final output. The training cost is minimal: supervised fine-tuning on 1,000 examples. No reward model, no RL loop, no expensive custom training pipeline.

The results generalize across model backbones and sizes, and transfer to reasoning, question-answering, summarization, and multilingual settings. That breadth is the key credential - safety interventions that only work on one model architecture or one task type have limited production value.

Where the Safety Failure Actually Lives

The paper's conceptual contribution is identifying the right level at which to intervene. Safety failures in reasoning models aren't just about the final output token. They originate in the reasoning structure that generates the output. Aligning on the surface while leaving the chain of thought untouched is like waterproofing the roof while ignoring a broken pipe inside the wall.

AltTrain addresses the structure directly, and the ACL 2026 acceptance suggests reviewers found the evidence credible. For practitioners working with reasoning models, this connects to a recurring question in AI safety and alignment: do safety interventions transfer as capabilities scale? The paper's cross-backbone and cross-size results suggest they can, at least at current scales.

The Pattern Across All Three

Each paper is documenting a gap between what AI systems appear to do and what they actually do.

Scientific agents look like they're reasoning - they run experiments, query data, produce outputs with citations - but 68% of the time they're ignoring the evidence in front of them. Tool-integrated agents look solid until the tools themselves become attack vectors. Reasoning models look aligned after standard fine-tuning until someone probes the chain of thought.

All three papers also share a diagnostic: outcome-based evaluation doesn't catch any of these failures. An agent that ignores evidence can still reach the right answer. An agent being fed false tool outputs can still complete tasks. A reasoning model with a broken safety structure can still refuse most harmful requests. The failures only become visible when you assess process, not just result.

Sources: