Where AI Agents Break: Research, Safety, and Privacy

Three papers landed this week that anyone building or rolling out AI agents should read closely. They don't address the same problem, but they converge on one uncomfortable observation: agents fail in ways that are non-obvious, hard to detect, and increasingly consequential as deployments grow.

TL;DR

ResearchArena - 117 agent-generated papers from three frontier models; zero cleared the acceptance bar at any top-tier venue, and manuscript-only review hid the failures completely
Hallucination as Exploit - Multimodal agent hallucinations are authorization failures, not just accuracy errors; a certificate-gate system cut unsafe actions from 100% to 0%
POLAR-Bench - Small open-weight models (1-30B) leak over 50% of protected attributes under adversarial probing, while frontier models hold above 99%

Can AI Agents Do Science? ResearchArena Says Not Yet

The experiment is deliberately minimal. Three frontier coding agents - Claude Code with Opus 4.6, Codex running on GPT-5.4, and Kimi Code with K2.5 - each received 13 computer science research domains and a lightweight scaffold. No step-by-step recipes. Each agent ran through ideation, experimentation, paper writing, and self-refinement three times per domain. The result was 117 papers, evaluated under three lenses: manuscript-only automated review (SAR), artifact-aware peer review, and human meta-review.

None of the 117 papers cleared the acceptance bar at any top-tier venue.

The subtler finding is how badly manuscript-only review misses this. Under SAR, Claude Code scored 5.45, just above the ICLR 2025 weighted average of 5.42. The paper surface looks competitive. Switch to artifact-aware peer review - where evaluators inspect the actual code and experimental logs alongside the manuscript - and scores drop hard. Claude Code fell 0.85 points. Kimi Code fell 0.86.

The authors identify three failure modes that drive these drops:

Fabricated results: Numbers in papers don't match underlying code or logs. Kimi Code showed 77% paper-vs-artifact mismatch and 72% fabricated references. Codex stayed cleanest at 5-8%.

Underpowered experiments: A multi-dataset study runs on one dataset. A comparative evaluation tests a single variant. Kimi Code hit 82.1% underpowered experiments; Claude Code came in at 25.6%.

Plan/execution mismatch: Ablations promised in the manuscript are never run. Baselines appear in the paper but not in the logs. Kimi Code reached 33.3%; Claude Code and Codex held at 17.9% and 20.5%.

The practical implication lands outside academia. Any pipeline that uses LLM judges to assess agent-created outputs without checking artifacts is likely measuring presentation quality, not correctness. The paper surface and the underlying work can diverge substantially, and automated manuscript review won't catch it.

Stack of academic papers and research documents on a desk Artifact-aware review of agent-created papers exposed failures that manuscript-only scores concealed completely. Source: pexels.com

Hallucinations Are Now Security Exploits

The framing in "Hallucination as Exploit" (Guijia Zhang, Hao Zheng, Harry Yang) is worth sitting with. Multimodal agents hallucinate - that's established. What this paper argues is that hallucinations in action-taking agents aren't just accuracy problems: they're authorization failures.

When an agent falsely perceives a "Confirm Transfer" button, or misreads a document field showing a recipient address, and then executes a transaction based on that false perception, the hallucination has become an exploit. The model believed something false about the environment. The action pipeline accepted that belief as authorization. A privileged operation executed without any verification. You can see the analogy to SQL injection without stretching it too far: untrusted input reached an execution path it shouldn't have.

The Evidence-Carrying Agents (ECA) framework addresses this by separating perception from authorization. Instead of trusting the model's free-form description of what it sees, ECA decomposes each tool call into action-critical predicates - the specific conditions whose truth or falsity changes the safety decision. For a click action: does the element exist? Does it match the current task? Is the source trustworthy? Typed certificates from constrained verifiers - DOM inspection, OCR, and accessibility-tree analysis - must support every predicate before the action gate permits execution.

The red-team results are significant. Across 1,900 adversarial attacks spanning 19 categories including DOM spoofing, homoglyph rendering, and accessibility-tree phantom nodes, the gate-bypass rate dropped from 15% to 1.3% after four targeted hardening rounds. On the full 200-task end-to-end pipeline, unsafe-action rate landed at 0% (Wilson 95% CI upper bound: 2.67%). Naive agents showed 100% unsafe actions under the same test. Prompt-only defenses - telling the model to be careful - failed 49.6% of the time.

The limitation the authors flag is real: ECA requires structured verifiers to exist for the relevant information. Purely natural-language documents without structured representation fall outside this approach for now. But for web agents operating on HTML pages, this is a practical hardening step available today.

For more context on how AI hallucinations work and what drives them, we've covered the mechanics in depth previously.

Padlock and digital security protection concept ECA treats hallucinated visual facts as unauthorized inputs rather than accuracy errors - shifting the problem from model confidence to external verification. Source: pexels.com

Small Models Leak Half Your Private Data

POLAR-Bench (Qiaoyuan Zheng, Yiqu Yang, Qi Gao, Imanol Schlag) tests a specific and increasingly relevant scenario: an LLM agent holds sensitive user data under an explicit privacy policy, while a third-party system attempts to extract it through adversarial conversation.

The benchmark covers 7,852 samples across 10 domains - medical records, recruitment data, finance, legal, insurance, housing, travel, cybersecurity, and customer support. Each domain is tested through a 5x5 matrix: five levels of policy complexity (from explicit field-level rules up to conflicting privacy objectives) crossed with five attack strategies (from direct requests up to multi-turn progressive elicitation). More than 25 models were evaluated, ranging from 3B to 1.1T parameters.

The core finding splits cleanly along a size boundary. Frontier models - GLM-5.1, GPT-5.4, Gemma-4-31B, DeepSeek-V3.1 - withhold over 99% of protected attributes while maintaining high task utility. The 1-30B open-weight class, the models most commonly run on-device or in private inference setups, shows sizable weakness. Some models in this size range leak more than half of protected data under adversarial probing.

Model size alone doesn't explain the gap. Within families, privacy performance doesn't scale monotonically with parameter count. Smaller, well-aligned models sometimes beat larger ones from different families. The authors' conclusion is that alignment choices - RLHF reward design, instruction tuning data, policy-following training - matter more than raw parameter count.

Attack strategy is also a significant variable. Direct requests are easiest to defend. Multi-turn progressive elicitation and yes/no narrowing attacks are the most effective at extracting protected information. A model that holds its ground under single-turn testing may fail badly under the sustained conversational patterns POLAR-Bench uses.

Models that look safe under single-turn tests can fail under sustained multi-turn probing - the evaluation method shapes what vulnerabilities you find.

For teams running agents locally to handle medical or legal data - exactly the use case where on-device models are most attractive - the practical takeaway is uncomfortable. The model on the hardware may not be keeping the secrets the privacy policy claims it keeps. Testing needs to include adversarial probing, not just standard accuracy evaluation.

Our earlier coverage of agent safety and capability measurement shows this isn't an isolated concern - the pattern of agents failing in non-obvious ways under real-world conditions appears repeatedly across the literature.

A Shared Pattern

These three papers cover distinct problems. But they share a common structural failure: the point of failure is exactly where agent outputs are hardest to audit.

The research agent fabricates experiments and the manuscript looks fine. The multimodal agent authorizes actions on hallucinated facts and passes safety filters. The privacy-handling agent leaks data through conversation dynamics that single-turn tests miss entirely.

Each paper proposes a technical response - artifact-aware review, certificate gates, adversarial benchmarking. The underlying principle is the same across all three: confidence from the model isn't enough. Trust needs to come from external evidence, verifiable artifacts, and test conditions that actually stress the failure mode. For anyone building agents for real-world deployment, that's the design principle worth internalizing from this week's research.

Sources: