75% of AI Coding Agents Break Working Code Over Time
Alibaba's SWE-CI benchmark tested 18 AI models on 100 real codebases across 233 days of maintenance. Most agents accumulate technical debt and break previously working code. Only Claude Opus stays above 50% zero-regression.

TL;DR
- Alibaba's SWE-CI benchmark tracks AI coding agents across 100 real Python codebases, each averaging 233 days of history and 71 consecutive commits
- 75% of models break previously working code during long-term maintenance
- Only two Claude Opus models exceed a 50% zero-regression rate - every other model falls below 25%
- New EvoScore metric penalizes agents that sacrifice code quality for quick wins by weighting later iterations more heavily
- 18 models from 8 providers tested, including Claude, GPT, DeepSeek, Qwen, MiniMax, Kimi, GLM-5, and Doubao
AI coding agents are great at passing tests. They're terrible at keeping code working six months later.
That's the core finding from SWE-CI, a new benchmark from Alibaba researchers that shifts evaluation from one-shot bug fixes to long-term codebase maintenance. The paper, published March 4 on arXiv, tested 18 models from 8 providers against 100 real-world Python repositories - and the results aren't flattering for most of them.
What SWE-CI Measures
Existing benchmarks like SWE-bench test whether a model can solve a single isolated issue: here's a bug, fix it, run the tests. SWE-CI asks a harder question: can your agent maintain a codebase through months of real evolution without breaking things?
Each of the 100 tasks corresponds to a real open-source Python repository with an average history of 233 days and 71 consecutive commits. The agent doesn't fix one bug and walk away. It works through dozens of rounds of changes, each building on the last, mirroring how actual codebases evolve. If the agent's fix for commit 15 breaks something that worked at commit 10, that counts against it.
The curation was rigorous. Researchers started with 4,923 repositories, filtered through multiple stages - requiring 3+ years of active maintenance, 500+ GitHub stars, permissive licenses, and at least 500 modified lines - and ended with 100 final samples.
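The stated filters can be sketched as a simple predicate over repository metadata. The field names, schema, and license list below are illustrative assumptions, not the paper's actual curation pipeline:

```python
from datetime import datetime, timedelta

# Illustrative set of permissive licenses; the paper's exact list isn't given.
PERMISSIVE_LICENSES = {"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause"}

def passes_curation(repo: dict) -> bool:
    """Apply the paper's stated filters: 3+ years of active maintenance,
    500+ GitHub stars, a permissive license, and 500+ modified lines."""
    maintained_for = repo["last_commit"] - repo["first_commit"]
    return (
        maintained_for >= timedelta(days=3 * 365)
        and repo["stars"] >= 500
        and repo["license"] in PERMISSIVE_LICENSES
        and repo["modified_lines"] >= 500
    )

# Hypothetical candidate records for illustration.
candidates = [
    {"name": "example/pkg", "first_commit": datetime(2019, 1, 1),
     "last_commit": datetime(2023, 6, 1), "stars": 1200,
     "license": "MIT", "modified_lines": 4200},
    {"name": "example/tiny", "first_commit": datetime(2022, 1, 1),
     "last_commit": datetime(2023, 1, 1), "stars": 90,
     "license": "GPL-3.0", "modified_lines": 300},
]
selected = [r["name"] for r in candidates if passes_curation(r)]
print(selected)  # ['example/pkg']
```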
The Results
Most models collapse under sustained maintenance. Three-quarters of tested agents break previously working code during the evaluation period. The zero-regression rate - the fraction of iterations where an agent's changes don't break anything that previously passed - tells the story:
- Claude Opus models: The only ones exceeding 50% zero-regression rate
- GLM-5: Identified as a strong secondary performer
- Everyone else: Below 25% zero-regression
That means for the majority of AI coding agents, more than three out of every four maintenance iterations introduce some form of regression. The code passes today's tests but breaks yesterday's functionality. Technical debt compounds with each round.
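The zero-regression rate reduces to a simple bookkeeping exercise over per-iteration test outcomes. A minimal sketch, assuming a test-name-to-pass/fail mapping per iteration (this data layout is an illustration, not SWE-CI's actual format):

```python
def zero_regression_rate(test_results: list[dict[str, bool]]) -> float:
    """Fraction of iterations whose changes break nothing that previously passed.

    test_results[i] maps test name -> pass/fail after the agent's
    change at iteration i.
    """
    ever_passed: set[str] = set()
    clean = 0
    for results in test_results:
        # A regression: a test that passed at some earlier iteration now fails.
        regressed = any(name in ever_passed and not ok
                        for name, ok in results.items())
        if not regressed:
            clean += 1
        ever_passed |= {name for name, ok in results.items() if ok}
    return clean / len(test_results)

# Three iterations: the second breaks test_a, which passed in the first.
history = [
    {"test_a": True, "test_b": False},
    {"test_a": False, "test_b": True},
    {"test_a": True, "test_b": True},
]
print(zero_regression_rate(history))  # 2 of 3 iterations are clean
```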
The gap between Claude Opus and the field is large enough to be qualitatively different. Where other models treat each commit as an isolated problem, Opus appears to maintain more coherent awareness of the codebase's existing behavior. The paper notes that models within the same provider tend to show consistent patterns, suggesting training methodology matters more than model size for this task.
EvoScore: Penalizing Short-Term Thinking
The researchers also introduce EvoScore, a metric designed to catch agents that game snapshot benchmarks by writing brittle code that passes immediate tests but becomes unmaintainable.
The formula: EvoScore = sum(gamma^i * a(c_i)) / sum(gamma^i), where the sum runs over iterations i, a(c_i) is the agent's score on commit c_i, and gamma is the recency weight.
When gamma equals 1, this is just a simple average. But when gamma exceeds 1, later iterations count more than earlier ones. An agent that performs well on commits 1-20 but degrades on commits 50-71 gets a lower EvoScore than one that maintains steady performance throughout.
This matters because the failure mode of current AI coding agents is predictable: they optimize for the test in front of them without considering downstream consequences. A fix that patches the immediate failure but introduces a subtle dependency conflict won't show up until three commits later. EvoScore is designed to make that pattern visible in a single number.
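The weighting scheme is a one-liner in practice. A sketch of the formula above, with gamma = 1.1 as an arbitrary illustrative choice (the paper's actual value isn't quoted here):

```python
def evoscore(scores: list[float], gamma: float = 1.1) -> float:
    """Recency-weighted average of per-iteration scores:
        EvoScore = sum(gamma**i * a_i) / sum(gamma**i)
    With gamma == 1 this is the plain mean; with gamma > 1,
    later iterations count more than earlier ones.
    """
    weights = [gamma**i for i in range(len(scores))]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

# An agent that starts strong but degrades vs. one that stays steady.
early_strong = [1.0] * 20 + [0.4] * 20   # collapses in later iterations
steady       = [0.7] * 40                # consistent throughout
print(round(evoscore(early_strong), 3))  # ~0.478, pulled down by late failures
print(round(evoscore(steady), 3))        # 0.7
```

Both agents have the same unweighted mean (0.7), but the degrading agent scores lower because gamma > 1 concentrates weight where its performance collapsed.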
Why This Matters for Real Codebases
The disconnect between benchmark performance and real-world utility is something practitioners have felt for a while. A model scoring 70%+ on SWE-bench can still produce code that becomes a maintenance burden within weeks. SWE-CI gives that intuition a number.
The implications are practical. Teams using AI coding agents for ongoing development - not just one-off fixes - need to know whether the agent accumulates technical debt over time. A tool that resolves today's ticket but introduces regressions you'll discover next sprint isn't saving engineering time. It's borrowing from the future at a high interest rate.
The Python-only scope is a real limitation. The benchmark doesn't yet cover JavaScript, TypeScript, Go, Rust, or other languages where maintenance patterns differ significantly. The researchers acknowledge this, and the framework is designed to extend.
The Benchmark Gap
SWE-CI fills a specific hole in how the industry evaluates coding agents:
| Benchmark | What it measures | Time horizon |
|---|---|---|
| HumanEval | Isolated function generation | Single shot |
| SWE-bench | Bug fix in real repo | Single issue |
| SWE-bench Verified | Bug fix, human-validated | Single issue |
| SWE-CI | Codebase maintenance | 233 days, 71 commits (avg) |
The first three benchmarks answer "can it write code that works right now?" SWE-CI answers "can it write code that still works after six months of changes?" Those are different skills, and the gap in model performance between the two suggests that current training approaches optimize heavily for the former at the expense of the latter.
For teams evaluating which AI coding tool to integrate into their development workflow, the SWE-CI results add a dimension that SWE-bench alone can't provide. Passing tests once is table stakes. Not breaking everything over time is the actual job.
