Anthropic's Claude Code Post-Mortem: Three Bugs Fixed
Anthropic's April 23 post-mortem confirms three app-layer changes degraded Claude Code since early March - all reverted in v2.1.116 by April 20.

On April 23, Anthropic published a technical post-mortem admitting that Claude Code had measurably degraded since early March - and that three engineering decisions made at the app and SDK layer were responsible. The core models (Sonnet 4.6 and Opus 4.6) were not changed. The API was not affected. What broke was the product wrapper around them.
TL;DR
- Anthropic confirmed three distinct app-layer changes degraded Claude Code between March 4 and April 20
- Reasoning effort was silently lowered from high to medium on March 4, making outputs shallower
- A caching bug introduced March 26 cleared session context every turn instead of once after idle time
- A verbosity-reduction prompt added April 16 caused a 3% intelligence drop in internal evaluations
- All three were fully reverted by April 20 in v2.1.116; usage limits were reset for all subscribers on April 23
What They Showed
The post-mortem walks through three separate incidents in chronological order. Each one degraded Claude Code in a different way, and their overlap in late March and early April explains why the complaints spiked sharply across developer forums during that window.
Issue 1 - Reasoning effort downgrade (March 4 to April 7). Anthropic lowered Claude Code's default reasoning effort from high to medium to reduce latency. Opus 4.6 and Sonnet 4.6 were both affected. The change was quiet - no changelog entry, no opt-in path. Users noticed Claude felt slower to engage with complex tasks and quicker to propose the simplest available fix. Anthropic heard the feedback and reverted on April 7, bumping Opus 4.7 to xhigh and restoring high for all other models.
Issue 2 - Caching bug (March 26 to April 10, introduced in v2.1.101). An optimization to clear older thinking from sessions idle for more than an hour had a defect: instead of running once after inactivity, it cleared built-up context on every single turn for the rest of the session. The result was a model that appeared forgetful and repetitive, unable to retain what it had already reasoned through. Anthropic's own testing missed the regression because the bug only manifested in a specific session state. The post-mortem notes that Opus 4.7 caught the bug during Code Review testing while Opus 4.6 didn't.
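The failure mode described here is easy to reproduce in miniature: an idle-cleanup step whose trigger condition becomes sticky. A minimal sketch with hypothetical names (not Anthropic's actual code), contrasting the buggy and fixed behavior:

```python
class Session:
    IDLE_THRESHOLD_S = 3600  # clear older thinking after 1+ hour idle

    def __init__(self, now=0.0):
        self.context = []    # accumulated thinking blocks
        self.last_turn = now
        self.stale = False   # the buggy version never resets this flag

    def turn_buggy(self, now, message):
        # Once the session has ever been idle past the threshold, `stale`
        # stays True forever, so context is wiped on EVERY later turn.
        if now - self.last_turn > self.IDLE_THRESHOLD_S:
            self.stale = True
        if self.stale:
            self.context.clear()
        self.context.append(message)
        self.last_turn = now

    def turn_fixed(self, now, message):
        # Clear once when returning from idle, then resume accumulating.
        if now - self.last_turn > self.IDLE_THRESHOLD_S:
            self.context.clear()
        self.context.append(message)
        self.last_turn = now
```

Three active turns, a long gap, then three more: the fixed version clears once and rebuilds context; the buggy version ends every later turn with a single-message context, which matches the "forgetful and repetitive" behavior users reported.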
Issue 3 - Verbosity reduction prompt (April 16 to April 20). A new system prompt instruction told Claude to keep text between tool calls to 25 words or fewer. Anthropic measured a 3% drop in intelligence evaluations from this change alone. Combined with other active prompt changes at the time, the coding depth took a real hit. The instruction was reverted four days after shipping.
By the time the post-mortem published, all three issues had already been fixed in v2.1.116.
What We Tried
The degradation wasn't invisible or anecdotal. Stella Laurenzo, Senior Director at AMD's AI group, filed a detailed GitHub issue on April 2 documenting 6,852 Claude Code sessions, 17,871 thinking blocks, and 234,760 tool calls across three months of team usage.
Her numbers were specific. During the "good" window (January 30 through February 12), Claude read an average of 6.6 files before editing any one of them. By the degraded period (March 8 through March 23), that ratio had collapsed to 2.0. The share of edits made to files Claude hadn't recently read jumped from 6.2% to 33.7%. Median visible thinking length dropped from 2,200 characters in January to 600 by March - a 73% reduction. API call retry counts climbed by up to 80x over the same period.
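Metrics like the read-before-edit ratio fall out of an ordered event log. A simplified sketch, assuming a hypothetical ("read"/"edit", path) event format rather than Claude Code's actual session schema, and treating "not recently read" as "never read this session" for brevity:

```python
def read_before_edit_stats(events):
    """events: ordered list of ("read", path) or ("edit", path) tuples."""
    read_paths = set()
    reads_before_first_edit = 0
    seen_edit = False
    blind_edits = 0   # edits to files never read earlier in the session
    total_edits = 0
    for kind, path in events:
        if kind == "read":
            read_paths.add(path)
            if not seen_edit:
                reads_before_first_edit += 1
        elif kind == "edit":
            seen_edit = True
            total_edits += 1
            if path not in read_paths:
                blind_edits += 1
    blind_share = blind_edits / total_edits if total_edits else 0.0
    return reads_before_first_edit, blind_share
```

Averaging the first value across sessions gives the 6.6-vs-2.0 comparison; the second value corresponds to the 6.2%-to-33.7% blind-edit share.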
"Claude cannot be trusted to perform complex engineering tasks," Laurenzo wrote, noting that every senior engineer on her team had reported similar experiences. Her conclusion wasn't impressionistic; it was backed by session logs Anthropic's own tooling produced.
The compound effects explain why developers kept describing something that felt like model-level regression even though, technically, the underlying weights didn't change. Lower reasoning effort means shallower planning. Session context clearing every turn means Claude re-derives state it already computed. A verbosity cap means it skips the explanatory steps that also help steer it toward correct solutions. Three independent decisions, all landing in the same six-week window, stacked badly.
"Six months ago, Claude stood alone in terms of reasoning quality and execution. We have switched to another provider which is doing superior quality work, but Claude has been good to us." - Stella Laurenzo, AMD
The Gap
The gap between what was claimed and what developers experienced isn't about the models being secretly nerfed. Anthropic's post-mortem is credible on that point. The API layer wasn't touched. But the product layer around a model can meaningfully change how it behaves in practice, and those changes got deployed without adequate regression testing against real-world usage patterns.
Anthropic's own evaluation suite didn't catch the caching bug. The 3% intelligence drop from the verbosity prompt showed up in evals, but shipped anyway. The reasoning effort downgrade didn't trigger any automated alert significant enough to stop the rollout.
What the post-mortem gets right is the transparency: Anthropic published exact dates, exact version numbers, exact prompt text, and internal evaluation scores. The post-mortem also acknowledged that the margin on the verbosity prompt wasn't compelling enough to justify the risk, with hindsight.
Stella Laurenzo's analysis of 6,852 sessions gave Anthropic specific data points to correlate against their own internal timeline.
Requirements to Reproduce the Prior State
For developers who want to verify the fix works for their workflows:
| Factor | Status as of v2.1.116 |
|---|---|
| Default reasoning effort (Opus 4.7) | xhigh |
| Default reasoning effort (Sonnet 4.6 / Opus 4.6) | high |
| Session context clearing | Once after 1+ hour idle (bug fixed) |
| Verbosity system prompt cap | Reverted |
| Usage limits | Reset for all subscribers as of April 23 |
The fix is in. Whether it holds depends on what Anthropic does next. The company says it'll require a larger share of internal staff to use the exact public builds of Claude Code from now on - not internal builds - so regressions hit Anthropic's own engineers before they hit subscribers. A new @ClaudeDevs account on X will be used to explain product decisions in more depth as they ship.
Verdict
Anthropic shipped three separate changes without enough real-world validation, didn't catch the combined effects before users did, and then published one of the more specific incident post-mortems this industry has produced. The post-mortem is genuinely useful: dates, version numbers, evaluation scores, exact prompt text. That's more than most AI labs offer when something breaks.
The structural fix - staff on public builds, new evals for every system prompt change, staged rollouts - addresses the process failures. The open question is whether those commitments hold when teams are under pressure to ship the next capability. The previous Claude Code quota investigation and the March login outage both pointed to similar patterns: product changes outpacing validation. Anthropic has named the process problem clearly this time. Following through on it is the harder part.
Sources:
- An update on recent Claude Code quality reports - Anthropic Engineering
- Mystery solved: Anthropic reveals changes caused degradation - VentureBeat
- Claude Code has become dumber, lazier: AMD director - The Register
- Claude Code Drama: 6,852 Sessions Prove Performance Collapse - Scortier
- Anthropic admits it dumbed down Claude with upgrades - The Register
