News

Anthropic's New Study Reveals AI Agents Now Run 45 Minutes Without Human Input

Anthropic analyzed millions of Claude Code sessions and found AI agents are working autonomously for nearly twice as long as they did four months ago, with experienced users granting more trust over time.

Anthropic has published what may be the most comprehensive real-world study of AI agent autonomy to date. The research, released on February 18, analyzed millions of interactions from Claude Code and the company's public API to answer a deceptively simple question: how much freedom are people actually giving AI agents?

The headline finding is striking. The longest-running Claude Code sessions nearly doubled in duration - from under 25 minutes in October 2025 to over 45 minutes by January 2026. Humans are stepping back, and the agents are filling the gap.

The trust curve is real

The study, authored by Miles McCain, Thomas Millar, Saffron Huang, and over a dozen other Anthropic researchers, reveals a clear pattern: the more people use AI agents, the more autonomy they grant them.

Among new Claude Code users, roughly 20% of sessions use full auto-approve mode, which lets the agent execute actions without asking permission. By the time users hit 750 sessions, that number exceeds 40%. The shift is gradual, not sudden, suggesting that trust builds through accumulated positive experience rather than any single capability breakthrough.

But here is where the data gets interesting. Experienced users do not just approve more - they also interrupt more. New users intervene in about 5% of turns, while power users step in during roughly 9% of turns. Anthropic describes this as a "monitor-and-intervene" pattern: let the agent run freely, but watch closely and course-correct when something goes wrong.

It is a dynamic that anyone who has used Claude Code will recognize. You start cautious, approving every file edit and terminal command. After a few hundred sessions, you flip on auto-approve and keep one eye on the output stream instead.
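To make those two numbers concrete - the share of sessions in full auto-approve mode and the rate of human interruptions - here is a minimal Python sketch of how they might be computed from session logs. The Session record fields and the experience buckets are illustrative assumptions, not Anthropic's actual schema or methodology.

```python
from dataclasses import dataclass

@dataclass
class Session:
    user_id: str
    session_index: int   # how many sessions this user has run so far
    auto_approve: bool   # full auto-approve mode enabled for this session
    turns: int           # agent turns in the session
    interruptions: int   # times the human stepped in mid-task

def trust_metrics(sessions: list[Session], min_index: int, max_index: int) -> dict:
    """Auto-approve share and per-turn interruption rate for one experience bucket."""
    bucket = [s for s in sessions if min_index <= s.session_index < max_index]
    if not bucket:
        return {"auto_approve_share": None, "interruption_rate": None}
    auto_share = sum(s.auto_approve for s in bucket) / len(bucket)
    total_turns = sum(s.turns for s in bucket)
    interrupt_rate = sum(s.interruptions for s in bucket) / total_turns if total_turns else 0.0
    return {"auto_approve_share": auto_share, "interruption_rate": interrupt_rate}

# Example: compare brand-new users (first 10 sessions) with power users (750+).
# sessions = load_sessions(...)  # hypothetical loader
# print(trust_metrics(sessions, 0, 10))
# print(trust_metrics(sessions, 750, 10_000))
```

Plugging in the study's figures, the first bucket would land near a 20% auto-approve share with interruptions on about 5% of turns, and the second near 40% and roughly 9%.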

The agent knows when it is unsure

Perhaps the study's most consequential finding concerns the agent itself. Claude pauses to ask for clarification more often than humans interrupt it - on complex tasks, its self-initiated pauses outnumber human interruptions by more than two to one.

The breakdown of self-initiated pauses tells a story about model calibration: 35% of pauses are to present approach choices to the user, 21% to gather diagnostic information, and 13% to clarify ambiguous requests. In other words, the model is not just executing blindly - it recognizes when it lacks enough context to proceed safely.

This matters for the broader AI safety and alignment debate. If agents can reliably identify their own uncertainty boundaries, the case for rigid step-by-step human approval weakens. The Anthropic researchers make this argument explicitly: oversight requirements that mandate approval for every action "will create friction without necessarily producing safety benefits."

Software eats the agent world (for now)

The study's domain analysis shows that AI agents remain heavily concentrated in software engineering. Nearly 50% of all agentic tool calls on Anthropic's public API involve coding tasks. That concentration makes practical sense - code is testable, version-controlled, and relatively low-stakes if something breaks. You can always git revert.

But the researchers also flagged emerging experimental usage in healthcare, finance, and cybersecurity. These domains carry fundamentally different risk profiles. A coding agent that writes a buggy function wastes minutes. A healthcare agent that retrieves wrong patient records could cause real harm.

From a pure numbers perspective, the risk picture looks manageable today. Anthropic found that 80% of tool calls involve at least one safeguard mechanism, such as permission restrictions or human approval, and 73% appear to have some form of human involvement. Only 0.8% of all observed actions appeared irreversible - things like sending customer emails that cannot be unsent.

What this means for regulation

The timing of this research is not accidental. As policymakers in the EU, US, and elsewhere draft rules for agentic AI systems, the question of how much human oversight to mandate is central. Anthropic's data provides ammunition for a specific position: focus on whether humans can effectively monitor and intervene, rather than requiring particular forms of involvement.

The researchers note that human interventions per session decreased from 5.4 to 3.3 over the study period - not, they argue, because users grew careless, but because both the agent and its users got better at their respective roles. Prescriptive regulations that freeze current interaction patterns into law could prevent this natural optimization.

It is a nuanced argument, and one that conveniently serves Anthropic's commercial interests. A company selling AI agent frameworks and coding tools benefits from regulations that allow agents to operate with minimal friction. But the data itself is hard to dismiss - especially the finding that the agent's own uncertainty detection outperforms mandated checkpoints.

The autonomy gap

One number jumped out at me more than any other in the study: the median turn duration held steady at roughly 45 seconds even as the 99.9th percentile nearly doubled. Most agent interactions are still quick, discrete tasks. The 45-minute autonomous sessions are outliers, sitting at the extreme tail of the distribution.
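The gap between those two statistics is easy to reproduce with a toy calculation. The Python sketch below uses invented numbers chosen only to mirror the shape the study describes - a bulk of short interactions plus a rare tail of long autonomous runs - not Anthropic's data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic durations in seconds: most interactions are short, plus a small
# tail (0.3% of the sample here) of long autonomous runs.
typical = rng.lognormal(mean=np.log(45), sigma=0.6, size=99_700)  # ~45 s median
tail_early = rng.uniform(20 * 60, 30 * 60, size=300)   # ~20-30 minute runs
tail_late = rng.uniform(40 * 60, 50 * 60, size=300)    # ~40-50 minute runs

for label, tail in [("earlier", tail_early), ("later", tail_late)]:
    data = np.concatenate([typical, tail])
    print(f"{label}: median={np.median(data):.0f}s "
          f"p99.9={np.percentile(data, 99.9):.0f}s")

# Doubling the length of the rare long runs roughly doubles the 99.9th
# percentile while leaving the median essentially untouched.
```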

This suggests we are in an early phase of a larger shift. The capability for extended autonomous operation exists, and some users are already exploiting it. But the average interaction still looks like a back-and-forth conversation, not an agent running free.

The study also revealed something the researchers themselves flagged as concerning: high-risk clusters of activity that included implementing API exfiltration backdoors (risk score 6.0 out of 10) and retrieving patient medical records (4.4). Many of these appear to be security evaluations rather than malicious deployments, but the boundary between testing and misuse is not always clear.

The bigger picture

Anthropic frames this research as a contribution to "the science of agent measurement" - a field they argue needs to move beyond idealized benchmarks toward real-world observation. Their core message: "autonomy is co-constructed by model, user, and product," and understanding it requires studying all three.

That framing is correct, as far as it goes. But it also highlights a fundamental limitation of the study. Anthropic is measuring autonomy on its own platform, with its own models, for its own users. The picture may look very different for open-source agents running locally, or for enterprise deployments with custom guardrails, or for the kind of adversarial usage that never shows up in well-behaved API logs.

The most important takeaway may be the simplest one. People are already trusting AI agents to work independently for significant stretches of time, and that trust is increasing. Whether the safety infrastructure can keep pace with the trust curve is the question that matters most - and one this study, for all its rigor, cannot fully answer.

About the author: Elena is a Senior AI Editor & Investigative Journalist with over eight years of experience covering artificial intelligence, machine learning, and the startup ecosystem.