METR: Half of SWE-Bench Passes Fail Real Code Review

SWE-bench Verified has become the standard scoreboard for AI coding agents. When a new model ships, the first question most engineers ask is how it scores on SWE-bench. OpenAI publishes it. Anthropic publishes it. Leaderboards aggregate it. Investors quote it in pitch decks.

A new study from METR is now asking whether any of those numbers reflect what actually happens when an AI agent submits code to a real repository. The answer is a clear no - and the gap is larger than most practitioners assumed.

TL;DR

Maintainers rejected roughly half of AI-generated PRs that passed SWE-bench automated grading
The automated grader overstates merge rates by an average of 24 percentage points across frontier models
Claude 4.5's effective "time horizon" drops from 50 minutes (benchmark) to 8 minutes (maintainer review) - a 6x overestimation
Primary rejection reasons: code quality, breaking unrelated code, and core functionality failures
GPT-5 showed notably weaker code quality than Anthropic models in the study
OpenAI is already moving to SWE-bench Pro as its recommended frontier eval

The Gap Between Grader and Maintainer

Metric	Automated Grader	Maintainer Decision
Average pass/merge rate (frontier models)	~72%	~48%
Time horizon, Claude 4.5	~50 minutes	~8 minutes
Annual improvement rate	baseline	9.6 pp/yr slower
Golden patch (human-written) merge rate	n/a	68%

[Figure: METR chart comparing SWE-bench automated grader pass rates vs maintainer merge rates over time.] Pass rates over time: the orange line (automated grader) consistently runs 24 percentage points above the blue line (actual maintainer decisions). Source: metr.org

The study, published March 10 by Parker Whitfill, Cheryl Wu, Joel Becker, and Nate Rush, recruited four active maintainers from three SWE-bench Verified repositories: two from scikit-learn, one from Sphinx, one from pytest. The maintainers reviewed 296 AI-generated pull requests that had already passed the automated grader - working in historical repository states via GitHub, blinded to the source of each patch.

"Those forecasting AI progress and its real-world impact should view benchmarks as one piece of evidence, rather than as decisive."
METR researchers, March 10 study

The 24 percentage point gap isn't a rounding error. On a benchmark that frontier models are now passing at 70-plus percent, a 24-point reduction in actual merge rates puts real-world performance at roughly the level the benchmark reported two years ago.

What They Measured

The Merge Rate Methodology

Each maintainer was paid hourly with quality bonuses and asked to give a binary accept/reject decision plus a structured reason for rejection. They weren't told whether a patch came from a human or a model. The baseline calibration ran against 47 golden patches - human-written PRs that were actually merged into main before the study. Maintainers accepted those at a 68% rate, which gives the experiment a real-world anchor.
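With only 47 golden patches, the 68% baseline carries real sampling uncertainty. The sketch below is a rough sanity check, not part of METR's analysis. It assumes that 68% of 47 corresponds to 32 accepted patches (inferred from rounding, not stated in the study) and computes a 95% Wilson score interval using only the standard library.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    )
    return center - half_width, center + half_width


GOLDEN_PATCHES = 47  # from the article
ACCEPTED = 32        # assumption: 68% of 47, rounded to a whole patch

low, high = wilson_interval(ACCEPTED, GOLDEN_PATCHES)
print(f"accept rate {ACCEPTED / GOLDEN_PATCHES:.0%}, 95% interval {low:.0%} to {high:.0%}")
```

Under these assumptions the interval runs from the mid-50s to about 80%, so the golden-patch anchor is informative but far from precise.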

The models tested span the recent generation of frontier coding agents: Claude 3.5 Sonnet (Old), Claude 3.7 Sonnet, Claude 4 Opus, Claude 4.5 Sonnet, and GPT-5. All had high automated pass rates before entering maintainer review.

Why Maintainers Rejected

The structured feedback identified three main buckets:

Code quality issues - the patch didn't follow repository conventions, style, or idioms. The automated grader checks whether tests pass, not whether the code looks like it belongs in the codebase.
Breaking other code - the patch fixed the target issue but caused failures elsewhere. SWE-bench only grades against the specific test suite for each issue.
Core functionality failure - the patch appeared to pass grading but didn't actually solve the problem. The tests passed; the behavior didn't.

Code quality and collateral breakage were the bigger problems. On the GPT-5 results specifically, the authors note that it showed substantially weaker code quality than the Anthropic models tested.

[Figure: stacked bar chart showing rejection reasons by model across Claude and GPT-5.] Rejection breakdown by model: code quality (dark) and breaking other code (mid) account for most failures across the board. GPT-5 shows a notably larger code quality failure share. Source: metr.org

The Time Horizon Metric

METR uses a "time horizon" framing to measure effective agent capability: the length of task a model can complete reliably. Against the automated grader, Claude 4.5 shows a ~50-minute time horizon. Under maintainer review, that collapses to ~8 minutes. That is roughly a 6x overestimate (50 / 8 ≈ 6.25), which means the benchmark describes a meaningfully different, and more forgiving, environment than production code review.
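To make the time-horizon figures concrete, here is a minimal sketch. It assumes the horizon is read as the task length at which a success curve crosses 50%, and it models that curve as a logistic function of log task length. The logistic shape and the slope value are illustrative assumptions, not parameters from the study; only the 50-minute and 8-minute horizons come from the article.

```python
import math


def success_probability(task_minutes: float, horizon_minutes: float, slope: float) -> float:
    """Logistic success curve in log2(task length); exactly 0.5 at the horizon."""
    log_ratio = math.log2(task_minutes) - math.log2(horizon_minutes)
    return 1.0 / (1.0 + math.exp(slope * log_ratio))


GRADER_HORIZON = 50.0     # minutes, from the article
MAINTAINER_HORIZON = 8.0  # minutes, from the article
SLOPE = 0.6               # hypothetical steepness, chosen only for illustration

overestimate = GRADER_HORIZON / MAINTAINER_HORIZON
print(f"horizon overestimate: {overestimate:.2f}x")

for minutes in (2, 8, 30, 50, 120):
    p_grader = success_probability(minutes, GRADER_HORIZON, SLOPE)
    p_maintainer = success_probability(minutes, MAINTAINER_HORIZON, SLOPE)
    print(f"{minutes:>4} min task: grader {p_grader:.2f}, maintainer {p_maintainer:.2f}")
```

Whatever slope is chosen, a 50-minute task sits at even odds under the grader's curve but well below even odds under the maintainer's curve, which is the practical meaning of the shorter horizon.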

A separate analysis by Entropic Thoughts applied Brier scoring to three candidate statistical models of AI coding progress using METR's data. The two that predict constant or flat merge rates fit the data better than a linear growth trend - suggesting that despite benchmark scores rising through 2025, the rate at which maintainers actually accept the code may not have improved meaningfully.
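For readers unfamiliar with Brier scoring, the sketch below shows the mechanics: each forecast is a probability that a PR gets merged, each outcome is 1 (merged) or 0 (rejected), and the score is the mean squared difference, so lower is better. The outcomes and both forecast series are made-up toy values for illustration. This is not the Entropic Thoughts analysis or METR's data.

```python
def brier_score(forecasts: list[float], outcomes: list[int]) -> float:
    """Mean squared error between probability forecasts and 0/1 outcomes."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("forecasts and outcomes must be non-empty and equal length")
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical merge decisions for eight PRs, ordered by model release date.
outcomes = [1, 0, 1, 0, 0, 1, 1, 0]

# Two toy forecasting models: a flat 50% merge rate vs. steady linear growth.
constant_forecast = [0.5] * len(outcomes)
linear_forecast = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

constant_score = brier_score(constant_forecast, outcomes)
linear_score = brier_score(linear_forecast, outcomes)
print(f"constant: {constant_score:.3f}, linear: {linear_score:.3f}")
```

On this toy data the flat forecast scores lower (better) than the growth forecast, which is the shape of result the analysis reports. The toy numbers carry no evidential weight.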

What They Didn't

Coverage Is Narrow

The study covers 3 of 12 SWE-bench Verified repositories and 95 of 500 issues. That's 25% of repos and 19% of issues. Scikit-learn, Sphinx, and pytest are all mature, style-conscious Python projects with established contribution conventions - which may make them stricter about code quality than the full SWE-bench distribution. Different repos might show different gaps.

Single-Attempt Submissions

[Figure: screenshot of a maintainer code review on an AI-generated PR flagging code quality issues.] A real maintainer review on an AI-generated PR from the study - the patch passed the automated grader but was rejected for code quality concerns. Source: metr.org

AI agents in this study submitted a patch once. Human developers iterate - they respond to reviewer comments, fix CI failures, refactor based on feedback. The comparison isn't fully symmetric. Joel Becker acknowledged this directly: an agent given reviewer feedback and the ability to resubmit might close a meaningful portion of the gap. That's also an argument for building agentic code review loops into evaluation methodology.

No CI Was Run

The maintainer review process didn't include running CI or verifying the full test suite. Some rejections may have flagged patches that would actually pass full automated testing. The authors note this as a limitation.

For context on what more realistic CI-integrated evals look like, Alibaba's SWE-CI benchmark tested 18 models across 233 days of maintenance and found that most agents build up technical debt over time - a related finding about the gap between pass-once and maintain-over-time performance.

Should You Care?

If you're building a product that relies on AI-generated code being merged into production repositories, yes. SWE-bench scores are a reasonable signal for comparing models against each other on a consistent task, but they don't tell you what fraction of patches will make it through human code review.

The practical implication isn't to stop tracking SWE-bench. It's to add a human-review layer to any evaluation pipeline that matters - or at minimum, to treat a model's SWE-bench score as the ceiling of its coding usefulness, not the floor.
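One way to picture a human-review layer is to record the reviewer's decision next to the grader's verdict and report both rates side by side. The sketch below is a hypothetical illustration: `PatchResult`, `summarize`, and the sample records are invented names and data, not part of SWE-bench tooling or METR's methodology.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PatchResult:
    issue_id: str
    grader_passed: bool
    reviewer_accepted: Optional[bool] = None  # None until a human has reviewed it


def summarize(results: list[PatchResult]) -> dict:
    """Report the grader pass rate and the human accept rate among reviewed passes."""
    passed = [r for r in results if r.grader_passed]
    reviewed = [r for r in passed if r.reviewer_accepted is not None]
    accepted = [r for r in reviewed if r.reviewer_accepted]
    return {
        "grader_pass_rate": len(passed) / len(results) if results else None,
        "review_accept_rate": len(accepted) / len(reviewed) if reviewed else None,
        "unreviewed_passes": len(passed) - len(reviewed),
    }


# Hypothetical sample: three grader passes, two of them reviewed so far.
results = [
    PatchResult("issue-a", grader_passed=True, reviewer_accepted=True),
    PatchResult("issue-b", grader_passed=True, reviewer_accepted=False),
    PatchResult("issue-c", grader_passed=False),
    PatchResult("issue-d", grader_passed=True),
]

summary = summarize(results)
print(summary)
```

Keeping unreviewed passes as a separate count avoids silently treating a missing review as either an acceptance or a rejection.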

OpenAI seems to be reading the same signal. The company recently announced it's moving from SWE-bench Verified to SWE-bench Pro as its recommended frontier coding eval - a harder benchmark with more realistic tasks. Whether Pro fares better on maintainer review than Verified isn't yet known.

This isn't an isolated finding. Gemini 3.1 Pro leads 13 benchmarks but runs into serious reliability problems in production, and SkillsBench found that small models with expert guides outperform frontier models on real-world tasks. The pattern is consistent: automated evals measure what automated evals measure, and real usage is something else.

For teams that have already started deploying AI coding agents on pull requests, Anthropic's multi-agent code review dispatches parallel agents to filter and rank patches before human review - which may be closer to how these tools need to operate to hit acceptable merge rates in practice.

METR's study covers a narrow slice of the full SWE-bench distribution. The 24-point gap may shrink across different repository types or widen further in codebases with stricter standards. What it removes is the ability to treat a high SWE-bench score as equivalent to high production code quality. Those are different things, and now there's data to show how different.
