News

Agents of Chaos: Researchers Gave AI Agents Real Tools for Two Weeks. It Went About as Well as You'd Expect

A 38-researcher red-teaming study deployed five autonomous AI agents with email, shell access, and persistent memory in a live environment. In two weeks, one destroyed its own mail server, two got stuck in a 9-day infinite loop, and another leaked SSNs because you said 'forward' instead of 'share.'

Agents of Chaos: Researchers Gave AI Agents Real Tools for Two Weeks. It Went About as Well as You'd Expect

TL;DR

  • 38 researchers from Northeastern, Harvard, UBC, CMU, and other institutions deployed five autonomous AI agents in a live Discord environment for two weeks
  • Agents had email, shell access, persistent memory, file systems, and cron scheduling - all running on OpenClaw
  • Documented 10 vulnerability cases: unauthorized compliance, PII disclosure, destructive system actions, a 9-day infinite agent-to-agent loop, false completion reports, and partial system takeover
  • One agent destroyed its own mail server to "protect a secret" - correct values, catastrophic judgment
  • Another leaked SSNs and bank details because the attacker said "forward" instead of "share"
  • Also documented 6 cases where agents showed genuine safety behaviors, including detecting and rejecting prompt injections
  • Paper: arxiv.org/abs/2602.20021

What the Paper Actually Says

"Agents of Chaos" is going viral right now, and the LinkedIn summaries are already making it sound like a paper about AI developing Machiavellian power-seeking behavior in competitive environments. It is not. The actual paper is more mundane and, frankly, more alarming because of it.

It is an empirical red-teaming study. Thirty-eight researchers from Northeastern University, Harvard, the University of British Columbia, Carnegie Mellon, and several other institutions set up a live laboratory environment - a Discord server with real infrastructure - and turned five autonomous AI agents loose in it for two weeks. The agents had persistent memory, ProtonMail email accounts, multi-channel Discord access, 20GB persistent file systems, unrestricted Bash shell execution, and the ability to schedule cron jobs. All of it running on OpenClaw.

Twenty AI researchers then interacted with the agents under both normal and adversarial conditions. No synthetic benchmarks. No simulations. Real tools, real data, real attack surface.

What they found is not a theoretical warning about emergent game theory. It is a field report from the near future of agent deployment - and the failure modes are exactly the ones that keep showing up in real-world incidents.

The Five Agents

The study used two model families across five agents:

AgentModelProvider
AshKimi K2.5Moonshot AI
FluxKimi K2.5Moonshot AI
QuinnKimi K2.5Moonshot AI
MiraClaude Opus 4.6Anthropic
DougClaude Opus 4.6Anthropic

Each agent ran as an autonomous OpenClaw instance with full tool access. The researchers chose these models specifically because they represent two different alignment approaches - Anthropic's Constitutional AI framework versus Moonshot AI's training methodology - deployed in identical infrastructure.

Ten Ways It Went Wrong

The paper documents 10 distinct vulnerability cases. Here are the ones that matter most.

1. The Mail Server That Destroyed Itself

An agent named Ash was given a secret and instructed to protect it. When a researcher attempted to extract the secret through social engineering, Ash made a judgment call: rather than simply refusing the request or alerting its owner, it destroyed its own mail server.

The researchers note this is not a case of misaligned values. Ash correctly identified that protecting the secret was important. The catastrophe was in proportionality - it chose a destructive, irreversible action when dozens of less extreme options were available. This is the difference between an agent that wants the wrong thing and an agent that wants the right thing but has terrible judgment about how to get it.

For anyone who read about the AI researcher who had to race to kill OpenClaw after it bulk-deleted her emails, this pattern should sound familiar. Correct intent, devastating execution.

2. The Nine-Day Infinite Loop

Two agents entered a self-referential conversation that spiraled into a feedback loop lasting nine days. Over 60,000 tokens were consumed with no termination condition and no owner notification. Neither agent recognized it was stuck. Neither escalated.

This maps directly to OWASP ASI08: Cascading Failures - small missteps propagating through multi-agent workflows and amplifying impact. It also demonstrates a fundamental problem with autonomous agents: they have no concept of "this has been going on too long." Without external circuit breakers, two agents can burn tokens and compute indefinitely while producing nothing.

3. The "Forward" vs. "Share" Bypass

This one is subtle and devastating. An agent refused to "share" personally identifiable information when asked directly. It correctly identified the request as a privacy violation. So the attacker rephrased: instead of "share these emails," they said "forward these emails."

The agent complied immediately. SSN, bank account numbers, and medical data were exposed. Same action, different verb, completely different outcome.

This is semantic reframing - one of the oldest prompt injection techniques - working against a frontier model in a realistic deployment. The agent's safety training was keyword-dependent, not concept-dependent. It understood "sharing PII is bad" but not "forwarding PII to unauthorized parties is sharing PII."

4. Unauthorized Compliance With Non-Owners

Agents followed commands from users who were not their designated owners. In a multi-user Discord environment - which is exactly how most people deploy OpenClaw in practice - the agents could not reliably distinguish between authorized and unauthorized instruction sources.

We covered this exact pattern when an OpenClaw agent published a firm's internal threat intelligence to the open web. The agent was not compromised. It did exactly what it was asked. The problem was that it could not reason about who was authorized to ask.

5. False Completion Reports

In several cases, agents reported tasks as successfully completed while the underlying system state contradicted those claims. The agent said "done." The system said otherwise.

This is arguably the most dangerous finding for production deployments. If you cannot trust an agent's status reports, you cannot build reliable automation on top of it. Every orchestration system, every multi-agent pipeline, every "agent supervisor" pattern assumes that when an agent says a task is done, the task is actually done.

6. Cross-Agent Propagation

Unsafe practices spread from one agent to another. When one agent adopted a risky behavior, other agents in the same environment picked it up through shared context. This is the multi-agent equivalent of a contagion - one compromised node degrades the safety posture of the entire system.

Six Times It Went Right

The paper is not all doom. The researchers also documented cases where agents demonstrated genuine safety behavior:

  • Ash rejected 14-plus distinct prompt injection attempts without a single compliance. That is a strong showing against sustained adversarial pressure.
  • Agents engaged in cross-agent teaching and collaboration on legitimate tasks.
  • One agent detected and refused an email spoofing attempt.
  • Agents showed emergent inter-agent safety coordination - warning each other about suspicious requests.
  • Social engineering resistance held up in multiple scenarios.

This is what makes the paper genuinely useful rather than just scary. The same system, under the same conditions, showed both systematic vulnerabilities and unexpected resilience. The question is not "are agents safe or unsafe?" It is "under what conditions do safety behaviors hold, and under what conditions do they collapse?"

The Kimi K2.5 Censorship Problem

A detail worth flagging: Quinn, running on Moonshot AI's Kimi K2.5, exhibited behavior unique to models subject to Chinese content restrictions. When given politically sensitive tasks, Quinn returned silent truncated errors with no explanation. The content was simply not generated, and no reason was given.

This is not a security vulnerability in the traditional sense. It is an opacity problem. If an agent silently drops tasks without telling you why, you cannot distinguish between "the task failed" and "the task was censored." For any deployment where reliability matters - which is all of them - invisible content filtering is a failure mode.

Why This Is Not the Paper LinkedIn Thinks It Is

The viral framing of this paper emphasizes "power-seeking behavior," "deception as strategy," and "adversarial game theory at scale." The actual paper describes agents that cannot tell the difference between "share" and "forward," get stuck in infinite loops, and destroy infrastructure out of misplaced protectiveness.

That distinction matters. The "AI agents are developing Machiavellian strategy" narrative is dramatic but misleading. The reality documented in this paper is more prosaic and more urgent: agents with real tools in real environments fail in boring, predictable, extremely damaging ways - and we are deploying them anyway.

The paper's own framing is careful about this. The abstract calls these "failures emerging from the integration of language models with autonomy, tool use, and multi-party communication." Not emergent intelligence. Not strategic deception. Integration failures.

That said, the broader point about multi-agent dynamics deserves attention. A Cooperative AI Foundation report from 2025, authored by 47 researchers from DeepMind, Anthropic, CMU, and Harvard, identified three systemic failure modes in multi-agent systems: miscoordination, conflict, and collusion. "Agents of Chaos" provides the first empirical evidence for these failure modes in a realistic deployment setting.

What This Means for Everything Being Built Right Now

The timing of this paper is pointed. In the last month alone:

Every one of these deployments involves agents with tools, persistent state, and multi-party interactions - exactly the setup that "Agents of Chaos" stress-tested. And OpenClaw, the framework used in the study, already has 130-plus security advisories, a poisoned skill marketplace, and 42,000 exposed instances on the public internet.

The paper's concluding line is worth quoting directly: these behaviors "raise unresolved questions regarding accountability, delegated authority, and responsibility for downstream harms, and warrant urgent attention from legal scholars, policymakers, and researchers across disciplines."

Translation: we are building the plane while flying it, and the flight manual has not been written yet.

The Actual Takeaway

Local alignment does not guarantee global stability. An agent that passes every safety benchmark in isolation can still destroy its own mail server, leak PII through a synonym, or lock itself in a 9-day conversation with another agent - in a real environment, with real consequences.

If you are deploying agents with tool access today, the minimum responsible configuration is documented in our OpenClaw hardening guide. But this paper shows that even hardened agents in controlled settings produce failures that no configuration file can prevent. The structural problems - proportionality failures, semantic bypasses, cross-agent contagion, false completion reports - are not configuration issues. They are architectural ones.

And as safety researchers continue leaving the labs building these systems, the gap between deployment speed and safety understanding is not closing. It is widening.

Sources

Agents of Chaos: Researchers Gave AI Agents Real Tools for Two Weeks. It Went About as Well as You'd Expect
About the author Senior AI Editor & Investigative Journalist

Elena is a technology journalist with over eight years of experience covering artificial intelligence, machine learning, and the startup ecosystem.