Anthropic Ships Multi-Agent Code Review for PRs

Anthropic's new Code Review dispatches parallel AI agents on every pull request to find bugs, rank them by severity, and filter false positives - at $15-25 per review.

Anthropic just added a multi-agent code review system to Claude Code. When a developer opens a pull request on GitHub, Code Review dispatches a team of AI agents that scan the diff in parallel, cross-verify each other's findings, and post a single summary comment plus inline annotations ranked by severity. It's available now in research preview for Team and Enterprise customers.

TL;DR

  • Code Review sends parallel AI agents to scan every PR for logic bugs, not style nits
  • Large PRs (1,000+ lines) trigger findings 84% of the time, averaging 7.5 issues
  • Reviews cost $15-25 on token-based billing with org-level spending caps
  • Internal testing saw substantive review comments jump from 16% to 54% of PRs

How It Works Under the Hood

The Agent Pipeline

Code Review isn't a single model making a single pass. Multiple specialized agents analyze the diff and surrounding code simultaneously on Anthropic's infrastructure. Each agent hunts for a different class of issue - one might focus on type mismatches, another on concurrency bugs, a third on security vulnerabilities. A verification step then checks flagged candidates against actual code behavior to filter out false positives before anything gets posted.
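
Anthropic hasn't published the pipeline's internals, but the fan-out-and-verify pattern it describes can be sketched in a few lines. Everything below - the agent focuses, the run_agent and verify stand-ins - is a hypothetical illustration, not Anthropic's code.

```python
# Hypothetical sketch of the fan-out/verify pattern described above.
# run_agent and verify are stand-ins for model calls, not real APIs.
from concurrent.futures import ThreadPoolExecutor

AGENT_FOCUSES = ["type mismatches", "concurrency bugs", "security vulnerabilities"]

def run_agent(focus: str, diff: str) -> list[dict]:
    """Stand-in for one specialized agent scanning the diff for its issue class."""
    return [{"focus": focus, "claim": f"possible {focus} near changed code"}]

def verify(finding: dict, codebase: str) -> bool:
    """Stand-in for the cross-check against actual code behavior."""
    return "claim" in finding  # placeholder acceptance rule

def review(diff: str, codebase: str) -> list[dict]:
    # Fan out: every agent scans the diff in parallel.
    with ThreadPoolExecutor() as pool:
        nested = pool.map(lambda focus: run_agent(focus, diff), AGENT_FOCUSES)
    candidates = [f for findings in nested for f in findings]
    # Only candidates that survive verification get posted as comments.
    return [f for f in candidates if verify(f, codebase)]
```

The structural point is that false positives are meant to die in the verification step, before a developer ever sees them.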

The system scales effort based on PR complexity. A 20-line dependency bump gets a lightweight scan. A 2,000-line refactor gets the full treatment with additional agents and deeper analysis. Average review time is roughly 20 minutes for a typical PR.
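
A minimal version of that scaling rule might look like the following; the thresholds are assumptions inferred from the sizes mentioned above, since the real cutoffs aren't published:

```python
# Hypothetical effort tiers keyed on diff size; thresholds are assumptions.
def review_tier(changed_lines: int) -> str:
    if changed_lines < 50:
        return "lightweight"  # e.g. a 20-line dependency bump
    if changed_lines < 1000:
        return "standard"
    return "deep"             # extra agents, deeper analysis

print(review_tier(20))    # lightweight
print(review_tier(2000))  # deep
```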

What It Catches

Anthropic is explicitly positioning this as a bug finder, not a linter. The focus is logic errors, behavioral regressions, and latent issues in touched-but-unchanged code - the categories human reviewers are statistically most likely to miss when scanning diffs.

One example from Anthropic's testing: during a TrueNAS ZFS encryption refactor, Code Review flagged a pre-existing type mismatch that was silently wiping the encryption key cache on every sync operation. The bug lived in code adjacent to the diff - code the change interacted with but didn't modify - which a human reviewer focused on the diff would likely never catch.

The Numbers

Anthropic's internal testing produced some sharp results:

PR Size                   % Receiving Findings    Avg Issues Found
Large (1,000+ lines)      84%                     7.5
Small (under 50 lines)    31%                     0.5

Before Code Review, 16% of PRs at Anthropic received substantive review comments. After enabling it, that number jumped to 54%. The false positive rate - findings marked incorrect by reviewers - sits under 1%.

That last number matters most. Developers abandon review tools that cry wolf. Anthropic's approach of using agent-to-agent verification before posting seems designed specifically to avoid the noise problem that killed earlier automated review tools.

Setup and Cost Controls

Getting Started

Admins enable Code Review through Claude Code settings, install the GitHub App, and select which repositories to activate it on. Once enabled, reviews run automatically on new PRs. Developers don't need to change their workflow - findings show up as standard GitHub review comments.

Pricing

Reviews are billed on token usage. Anthropic quotes $15-25 per review as the typical range, scaling with PR size and codebase complexity. For teams processing hundreds of PRs per week, that adds up. Anthropic provides three cost controls:

  • Monthly organization spending caps
  • Repository-level activation (skip it on low-risk repos)
  • An analytics dashboard tracking reviews, acceptance rates, and total spend
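
For budgeting, the quoted $15-25 range reduces to simple arithmetic. A rough estimator, where PR volume and a 20-working-day month are the only assumptions:

```python
# Back-of-envelope monthly cost from the quoted $15-25 per-review range.
def monthly_cost_range(prs_per_day: int, working_days: int = 20) -> tuple[int, int]:
    reviews = prs_per_day * working_days
    return reviews * 15, reviews * 25

low, high = monthly_cost_range(20)
print(f"${low}-{high} per month")  # $6000-10000 per month
```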

Where It Falls Short

The $15-25 per review price will give smaller teams pause. A team merging 20 PRs per day is looking at $300-500 daily, or $6,000-10,000 monthly - on top of existing Claude Code subscriptions. That's a hard sell for startups, even if the bug catch rate is genuine.

There's also the dependency risk. Teams that lean heavily on AI review may gradually lose the habit of careful human review. Anthropic frames Code Review as augmentation, not replacement, but the incentive structure pushes the other direction: if agents are already surfacing findings on 84% of large PRs, why would a human reviewer invest the same effort?

The 20-minute review time also won't work for every workflow. Teams practicing continuous deployment with rapid merge cycles may find the latency disruptive, though small PRs should clear faster.

Finally, this is a research preview. Pricing, availability, and capabilities may shift. Teams building it into their workflow should plan for that.

Who Should Care

Code Review makes the most sense for large engineering organizations already using Claude Code at scale - Anthropic names Uber, Salesforce, and Accenture as target customers. If your team is generating more code with AI assistance than your human reviewers can keep up with, paying $15-25 per PR to have agents verify the output is a reasonable trade.

For smaller teams or open-source projects, the math doesn't work yet. But the architecture - parallel agents with cross-verification to suppress false positives - is the interesting part. It's the first commercial deployment of multi-agent review where the agents are checking each other's work, not just running independent passes. Whether that pattern holds up outside Anthropic's own codebase is the real test.

Code Review is available now for Claude Code Team and Enterprise plans. Details at claude.com/blog/code-review.

About the Author

Sophie is a journalist and former systems engineer who covers AI infrastructure, open-source models, and the developer tooling ecosystem.