News

Anthropic Drops Its Flagship Safety Pledge - The Hard Stop Is Gone

Anthropic's updated Responsible Scaling Policy removes the binding commitment to pause AI development if safety measures fall behind. The company will now only delay training if it simultaneously leads the AI race and judges catastrophic risk to be significant.

Anthropic, the company that built its identity on being the safety-first AI lab, has quietly dismantled the central mechanism that made that claim credible. On February 24, 2026, the company published Version 3.0 of its Responsible Scaling Policy, replacing the categorical commitment to halt development whenever safety measures could not keep pace with capabilities with a conditional promise to merely "delay" training - and only under circumstances that critics say are nearly impossible to trigger.

The timing is not subtle. The same week Anthropic dropped its safety pledge, Defense Secretary Pete Hegseth gave CEO Dario Amodei until Friday to grant the Pentagon unrestricted access to Claude or face blacklisting under the Defense Production Act. The company just closed a $30 billion funding round at a $380 billion valuation. And two weeks ago, the head of Anthropic's Safeguards Research team resigned, warning that "the world is in peril."

TL;DR

  • Anthropic published RSP Version 3.0, removing its binding commitment to pause AI development if safety measures were inadequate
  • The new policy only promises to "delay" training if Anthropic simultaneously (a) leads the AI race and (b) judges catastrophic risk to be significant - a dual condition critics say is effectively untriggerable
  • Quantitative safety thresholds for ASL levels have been replaced with vague qualitative descriptions
  • SaferAI downgraded Anthropic's safety score from 2.2 to 1.9, placing it in the "weak" category alongside OpenAI and Google DeepMind
  • The change arrives the same week as a Pentagon standoff over military access to Claude and days after a $30 billion funding round

What the Original Pledge Actually Said

In September 2023, Anthropic published its Responsible Scaling Policy - a framework modeled after the US government's biosafety level standards. The flagship commitment was unambiguous: Anthropic would never train an AI system unless it could guarantee in advance that its safety measures were adequate. If safety science fell behind capability advances, development would stop. Not slow down. Stop.

The policy defined AI Safety Levels (ASLs) with progressively stricter requirements. ASL-2 covered current models. ASL-3 - activated in May 2025 for the Claude Opus 4 launch - required enhanced security against model weight theft and deployment safeguards against catastrophic misuse. The thresholds for triggering each level were quantitative - specific benchmarks with defined success rates.

This was the structural mechanism that separated Anthropic's safety claims from the voluntary commitments made by OpenAI and Google DeepMind. It was not a promise to try hard. It was a promise to stop.

What Replaced It

RSP Version 3.0 removes the hard stop. In its place, Anthropic introduces a conditional "delay" promise with two requirements that must be met simultaneously: leadership must consider Anthropic to be the leader of the AI race, and leadership must judge the risks of catastrophe to be significant.

Both conditions must be true at once. If Anthropic's leadership decides that OpenAI or Google is ahead - a subjective judgment made by the same executives under $380 billion in valuation pressure - the delay trigger does not activate, regardless of how dangerous the technology might be.

The quantitative thresholds for ASL escalation are also gone. Where ASL-3 was previously defined with specific numeric benchmarks, the new policy describes capabilities in vaguer language such as "ability to either fully automate...or cause dramatic acceleration" without concrete metrics. This shift from numbers to words makes external accountability substantially harder.

The new RSP is structured into two tracks: "unilateral commitments" that Anthropic will pursue regardless of what competitors do, and "industry recommendations" that it believes the entire AI sector should adopt. The binding internal safety mechanism has been replaced by transparency mechanisms - public Frontier Safety Roadmaps and Risk Reports published every 3-6 months with third-party review.

Anthropic's Chief Science Officer Jared Kaplan told TIME: "We felt that it wouldn't actually help anyone for us to stop training AI models."

The Reasoning, and Its Problems

Anthropic cited three structural challenges. First, a "zone of ambiguity": model capabilities had "clearly approached" the original RSP thresholds but not definitively "passed" them, creating internal uncertainty about when to act. As Anthropic put it, "what they previously imagined might look like a bright red line came into focus as a fuzzy gradient."

Second, the political climate. Government action on AI safety has stalled, particularly under the Trump Administration's pivot toward AI competitiveness. Anthropic acknowledged that "safety-oriented discussions have yet to gain meaningful traction at the federal level."

Third, the impossibility of unilateral action at higher safety levels. Anthropic argued that ASL-4 and ASL-5 requirements "might prove outright impossible to implement without collective action" and require "assistance from the national security community."

The underlying logic, stated explicitly in the RSP v3 document: "If one AI developer paused development to implement safety measures while others moved forward training and deploying AI systems without strong mitigations, that could result in a world that is less safe."

This is a textbook race-to-the-bottom argument - and it is the exact argument that Anthropic's original RSP was designed to reject. The entire premise of the 2023 commitment was that at least one frontier lab should be willing to stop, even if competitors would not. That premise is now gone.

The Reactions

SaferAI, an independent safety evaluation organization, downgraded Anthropic's score from 2.2 to 1.9, placing it in the "weak" category alongside OpenAI and Google DeepMind. Their assessment was blunt: "We were expecting an improvement. Unfortunately, the results are disconcerting. By allowing more leeway to decide if a model meets thresholds, Anthropic risks prioritizing scaling over safety, especially as competitive pressures intensify."

Chris Painter, Director of Policy at METR - the AI evaluation nonprofit that reviewed an early draft of RSP v3 - called it "a bearish signal" for managing catastrophe risk. He said the update shows Anthropic "believes it needs to shift into triage mode with its safety plans, because methods to assess and mitigate risk are not keeping up with the pace of capabilities."

Holden Karnofsky, an Anthropic board member, defended the changes in a public post, acknowledging the anticipated backlash over "the move away from a 'hard commitments'/'binding ourselves to the mast' vibe" but arguing that the new framework enables more realistic safety targets.

The most pointed critique came from inside the company. Mrinank Sharma, who led Anthropic's Safeguards Research team, resigned on February 9 - two weeks before the RSP v3 announcement. In his resignation letter, he wrote: "The world is in peril. And not just from AI, or bioweapons, but from a whole series of interconnected crises unfolding in this very moment." He described employees who "constantly face pressures to set aside what matters most" and said he had "repeatedly seen how hard it is to truly let our values govern our actions."

The Context You Cannot Ignore

Three things happened in February 2026 that make the RSP v3 timing impossible to read in isolation.

The money. Anthropic closed a $30 billion Series G on February 12 at a $380 billion post-money valuation - the second-largest private tech financing round ever. Annualized revenue has reached $14 billion. Dario Amodei, in a podcast interview on February 17, admitted the pressure openly: "We're under an incredible amount of commercial pressure and make it even harder for ourselves because we have all this safety stuff we do." He added that if Anthropic "sits on the sidelines, we're just going to lose and stop existing as a company."

The military. On the same day RSP v3 was published, the Pentagon escalated a standoff over Anthropic's $200 million Department of Defense contract. Anthropic wants restrictions on mass surveillance and autonomous weapons applications. Defense Secretary Hegseth gave Amodei until the end of the week to grant unrestricted military access to Claude, threatening to invoke the Defense Production Act and designate Anthropic a "supply chain risk" - effectively blacklisting it from all government contracts. NPR reported Hegseth's concerns centered partly on what he called "woke AI" restrictions.

The departures. Sharma's resignation from the Safeguards Research team was not an isolated event. It followed a pattern of safety-oriented researchers leaving or being sidelined at frontier labs - a dynamic that has accelerated across the industry as commercial pressure intensifies.

What This Actually Means

The practical implication is straightforward: Claude model releases are unlikely to be delayed by safety concerns alone. The hard gate is gone. The decision to slow down is now at the discretion of leadership, conditional on competitive position, and framed as something that would only happen if Anthropic were clearly ahead of all rivals - a subjective judgment that the company's own commercial incentives make nearly impossible to reach.

Anthropic's RSP v3 does introduce some genuine improvements in transparency. The commitment to publish Risk Reports with third-party review, and to grade its own Frontier Safety Roadmap publicly, creates accountability mechanisms that did not exist before. These are not nothing. But they are disclosures, not constraints. They tell you what happened after the fact. The original RSP told you what would not happen in the first place.

The question the original RSP answered was: what happens when safety science cannot keep up with capability science? The answer was: you stop. The question RSP v3 answers is different: what happens when stopping is commercially inconvenient? The answer, now, is: you keep going and publish a report about it.

Anthropic is still, by most measures, the frontier lab that takes safety most seriously. That is a statement about the industry, not a compliment.


About the author

Daniel, AI Industry & Policy Reporter, covers the business side of artificial intelligence - funding rounds, corporate strategy, regulatory battles, and the power dynamics between the labs racing to build frontier models.