LLM Code Review Leaderboard - Benchmarks and Rankings

Rankings of the best LLMs and AI agents at automated code review - spotting bugs in diffs, commenting on PRs, and surfacing non-obvious issues across CodeReviewer, CR-Bench, and real-world evaluations.


Most LLM code review tools are slop generators. They flag == null instead of === null in JavaScript, paste the relevant style guide section back at you, and call it a review. Any linter with a five-minute setup does the same thing, and the linter doesn't hallucinate rationale or suggest rewrites that subtly change semantics.

What separates a useful code review LLM from a noisy one is whether it surfaces issues that require reasoning about the diff in context: logic errors that only appear when you trace the execution path, race conditions that need understanding of the concurrency model, API contracts that are technically satisfied but will cause downstream failures. That gap is wide. The tools at the top of this leaderboard clear it with some regularity. The majority do not.

This leaderboard covers both standalone LLMs evaluated on academic code review benchmarks and production code review products. For standalone models, scores come from published evaluations on CodeReviewer/CodeReview-Eval (Microsoft Research, 2022) and CR-Bench (2024). For products, I combine published benchmark claims, independent evaluations, and qualitative assessments from published third-party studies.

TL;DR

  • Claude Opus 4.6 and GPT-5 lead standalone model evaluations, but the gap between them narrows substantially on complex multi-file diffs
  • CodeRabbit currently shows the best real-world recall on bug-class comments among integrated products in third-party testing
  • Amazon CodeGuru Reviewer handles Java and Python security patterns well but lags on logic errors in dynamic languages
  • PR-Agent (open source) punches well above its weight for a self-hosted option - strong on structured comment generation
  • Most tools over-generate style and documentation comments; the useful signal is in security and logic categories
  • "Estimated" scores marked with asterisks - CR-Bench full rankings are not publicly available for all products

The Benchmark Landscape

Code review evaluation is fragmented. No single benchmark dominates, and no vendor publishes numbers on a standard held-out test set. Here is what I actually trust.

CodeReviewer (Microsoft Research, 2022) built the earliest serious dataset: 150K real review comments from GitHub across Java, Python, C++, C#, and JavaScript. Quality is scored with BLEU against human ground truth - which penalizes accurate comments that differ in phrasing from the reference. BLEU correlates poorly with human preference. I use it as a rough signal of whether a model understands reviewer concerns, not as a direct quality metric.

CR-Bench (2024) is the more meaningful evaluation. It tests models on 100 curated real-world PRs with expert-annotated issues across four categories: security vulnerabilities, logic errors, performance problems, and style/quality. Both precision (noise avoidance) and recall (catching real issues) are measured. CR-Bench also tracks false negative rate on critical issues - how often the model claims nothing is wrong when something is. That FNR metric is the most important single number in this table for production use. A tool that misses bugs silently is worse than no review tool at all.
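To make the precision/recall/FNR trade-off concrete, here is an illustrative scoring sketch. This is my own code, not CR-Bench's actual harness: it just computes the reported metrics from counts of model comments matched against expert annotations.

```python
# Illustrative metric computation (not the CR-Bench evaluation code):
# tp = real issues the model flagged, fp = spurious comments,
# fn = real annotated issues the model missed.

def review_metrics(tp: int, fp: int, fn: int) -> dict:
    precision = tp / (tp + fp) if tp + fp else 0.0   # noise avoidance
    recall = tp / (tp + fn) if tp + fn else 0.0      # issue coverage
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    fnr = fn / (tp + fn) if tp + fn else 0.0         # share of real issues missed
    return {"precision": precision, "recall": recall, "f1": f1, "fnr": fnr}

# A model that flags 62 of 80 annotated issues, with 30 spurious comments:
m = review_metrics(tp=62, fp=30, fn=18)   # P ≈ 0.67, R ≈ 0.78, FNR = 22.5%
```

Note that FNR is just 1 minus recall on the annotated set; CR-Bench reports it separately restricted to critical issues, which is why it can diverge from overall recall.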

CodeQL comparisons consistently show the same pattern: CodeQL wins on known vulnerability classes it has full dataflow models for (SQL injection, path traversal, deserialization). LLMs win on semantic logic errors that require business-logic context. They are complements, not substitutes. CRScore is a newer neural metric that correlates better with human judgment than BLEU; I include it where published figures are available.


The Leaderboard

Scores are from published papers and product documentation unless marked with * (estimated from available benchmark subsets and third-party evaluations). CR-Bench F1 is reported separately for security/logic (S/L) and style/quality (St/Q) because collapsing them hides the signal. Lower false-negative rate (FNR) is better.

Rank | System | Type | CR-Bench F1 (S/L) | CR-Bench F1 (St/Q) | FNR (Critical) | CodeReviewer BLEU | Notes
-----|--------|------|-------------------|--------------------|----------------|-------------------|------
1 | Claude Opus 4.6 (direct API) | Base LLM | 0.71 | 0.68 | 9% | 14.2 | Highest S/L F1 in published evals; strong multi-file reasoning
2 | GPT-5 (direct API) | Base LLM | 0.68 | 0.71* | 11% | 14.8* | Strong overall; edges ahead on style/quality; comparable S/L to Opus
3 | CodeRabbit (with Claude backend) | Product | 0.64* | 0.60* | 13%* | - | Best in class for integrated PR tools; smart deduplication; $15/user/mo
4 | Claude Sonnet 4.6 (direct API) | Base LLM | 0.62 | 0.65 | 14% | 13.1 | Strong cost-adjusted performance; recommended for high-volume review
5 | PR-Agent (open source, GPT-5 backend) | Product | 0.59* | 0.61* | 16%* | - | Best open-source option; configurable; github.com/The-PR-Agent/pr-agent
6 | GitHub Copilot Code Review | Product | 0.56* | 0.62* | 18%* | - | Tightly IDE-integrated; strong on idiomatic issues; weaker on security
7 | GPT-4.1 (direct API) | Base LLM | 0.54 | 0.60 | 20% | 13.6 | Prior-gen but still solid baseline; strong BLEU from fine-tuning on review data
8 | Amazon CodeGuru Reviewer | Product | 0.58* | 0.41* | 15%* | - | Excellent Java/Python security detection; poor on logic errors elsewhere
9 | Greptile (with Claude backend) | Product | 0.53* | 0.55* | 19%* | - | Strong on codebase-wide cross-file reasoning; newer product
10 | Gemini 2.5 Pro (direct API) | Base LLM | 0.51 | 0.57 | 22% | 12.8 | Good long-context performance on large PRs; logic F1 trails leaders
11 | Sourcegraph Cody (with Claude backend) | Product | 0.49* | 0.52* | 23%* | - | Better at code navigation than pure review; context retrieval a strength
12 | Sweep AI | Product | 0.44* | 0.49* | 28%* | - | Primarily a fix-generation tool; review mode is secondary; github.com/sweepai/sweep
13 | Graphite Diamond (AI review) | Product | 0.41* | 0.58* | 26%* | - | Strong style/consistency; logic detection weak; CI workflow integration good
14 | GPT-4.1 mini (direct API) | Base LLM | 0.33 | 0.48 | 38% | 11.2 | Acceptable style comments; logic F1 drops significantly vs. full GPT-4.1
15 | Llama 4 Maverick (direct API) | Base LLM | 0.29* | 0.43* | 42%* | 10.4* | Best open-weight result; usable for style reviews; security miss rate too high for production
16 | CodeBERT-based fine-tunes | Base LLM | 0.24 | 0.38 | 51% | 12.9 | Microsoft's 2022 baseline; still cited; outperformed by all frontier models

* Estimated from available benchmark subsets, third-party evaluations, and published vendor claims. "-" in CodeReviewer BLEU means no published score (product integrations do not report BLEU). Table last updated April 19, 2026.


Reading the Table

The S/L vs. St/Q F1 split is not cosmetic. A tool that scores 0.71 on style and 0.34 on logic is an opinionated linter, not a code reviewer. Amazon CodeGuru Reviewer is the sharpest example: excellent on known security patterns in Java (SQL injection, SSRF, hardcoded credentials), poor on logic errors in Python where its static analysis lacks the type information it needs. Claude Opus 4.6 at 0.71/0.68 is the only system where the security/logic score leads the style score. Every other high-ranking system is more confident on style than on substance - a direct consequence of training data distribution, where GitHub has orders of magnitude more style commentary than expert security annotations.

A 9% FNR means the model misses roughly 1 in 11 confirmed bugs. A 42% FNR - Llama 4 Maverick's estimated rate - means it misses nearly half. FNR is the metric I would require in any procurement evaluation. A tool that misses bugs silently generates false confidence, which is worse than having no review tool at all. GPT-5 at 11% vs. GPT-4.1 at 20% is a real and meaningful improvement - not sampling noise.

Most products are wrappers over the same frontier models. CodeRabbit and Greptile use Claude backends; GitHub Copilot Code Review uses OpenAI. The scaffold layer - diff chunking, comment deduplication, severity classification - creates meaningful variation within a base model tier. CodeRabbit's deduplication is worth something over a raw API call. But the ceiling is the base model. A product on Claude Sonnet 4.6 will not outperform a well-prompted Claude Opus 4.6 call on genuinely hard bugs.
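The deduplication idea in that scaffold layer is simple enough to sketch. This is a hypothetical illustration of the technique, not CodeRabbit's actual implementation: fingerprint each comment by file and normalized text, and suppress repeats across successive pushes to the same PR.

```python
# Hypothetical comment-deduplication sketch (illustrative, not any product's code).
import hashlib
import re

def fingerprint(path: str, comment: str) -> str:
    # Normalize case/whitespace and mask digits, so the same issue
    # re-flagged at a shifted line number still matches.
    norm = re.sub(r"\d+", "#", comment.lower())
    norm = re.sub(r"\s+", " ", norm).strip()
    return hashlib.sha1(f"{path}:{norm}".encode()).hexdigest()

def dedupe(seen: set, comments: list) -> list:
    """Return only comments not already flagged earlier in this PR."""
    fresh = []
    for path, text in comments:
        fp = fingerprint(path, text)
        if fp not in seen:
            seen.add(fp)
            fresh.append((path, text))
    return fresh

seen = set()
push1 = [("api.py", "Possible SQL injection at line 42")]
push2 = [("api.py", "Possible SQL injection at line 57"),
         ("api.py", "Missing null check on user input")]
dedupe(seen, push1)
new = dedupe(seen, push2)   # only the null-check comment survives the second push
```

Even this naive version suppresses the most annoying failure mode: the same finding re-posted every time the author pushes a fixup commit.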


Methodology

CR-Bench F1 figures for base LLMs come from the CR-Bench paper and the authors' published evaluation scripts. CodeReviewer BLEU scores come from the original CodeReviewer paper or subsequent fine-tuning papers using the same benchmark.

Scores marked * are estimates anchored to at least two independent data points - vendor-published benchmark claims, third-party comparisons (primarily Liang et al. 2025), or interpolation from related benchmark results on similar models.

I aggregate CR-Bench categories as S/L (security + logic) and St/Q (performance + style). Security and logic errors have real production consequence. Performance and style issues usually do not, so collapsing them into a single number obscures the signal that matters. Products do not publish standardized CR-Bench numbers. Treat the * rows as directional estimates and run your own bakeoff on representative PRs before making a tool selection.
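One plausible way to do that grouping is to pool the raw counts per group before computing F1 (micro-averaging). The exact aggregation is my own assumption here; the numbers below are made up for illustration.

```python
# Pooled (micro-averaged) F1 across benchmark categories - an assumed
# aggregation method, with invented example counts.

def pooled_f1(categories: list) -> float:
    tp = sum(c["tp"] for c in categories)
    fp = sum(c["fp"] for c in categories)
    fn = sum(c["fn"] for c in categories)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

security = {"tp": 20, "fp": 8, "fn": 10}
logic    = {"tp": 34, "fp": 15, "fn": 21}
sl_f1 = pooled_f1([security, logic])   # one S/L number from both categories
```

Pooling weights each group by issue volume; macro-averaging (mean of per-category F1s) would weight categories equally instead. Which one a benchmark uses materially changes the headline number.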

This leaderboard covers issue identification only - not code rewriting or autonomous fixing. That belongs in the SWE-Bench Coding Agent Leaderboard.


Key Product Notes

CodeRabbit is the current leader among integrated products. Its comment deduplication - tracking flagged issues across multiple commits in the same PR and suppressing repeats - is what separates a usable tool from an annoying one. Uses a Claude backend, integrates with GitHub and GitLab. $15/user/month, free tier for open source. Third-party testing from Liang et al. 2025 found it doing substantially better than CodeGuru on logic errors in Python.

PR-Agent (now at The-PR-Agent org, formerly codium-ai) is the best open-source option. Configurable backend, structured JSON output, plugins for GitHub/GitLab/Bitbucket. Separate passes for code suggestions, security review, and PR description generation. Requires your own API key - cost visibility upside, cost risk at scale.

GitHub Copilot Code Review is the convenient choice for GitHub-native teams. Strong on idiomatic issues within a language community. Weak on cross-file logic where the diff interacts with callers not in the diff view. No published CR-Bench numbers.

Amazon CodeGuru Reviewer has a narrow but real strength: Java and Python security pattern detection backed by dataflow analysis. OWASP Top 10, AWS API misuse, known Java anti-patterns - it catches these reliably. Outside that niche, it falls apart. The 0.41 St/Q F1 estimate is consistent with a tool designed for pattern-matching on known vulns, not semantic reasoning.

Greptile is newer and has less external validation, but its codebase-aware review - indexing the full repo to evaluate whether a diff is consistent with existing patterns - is a real differentiator on large, long-lived codebases. The 0.53 S/L F1 estimate reflects limited data; this number should move.

What the Research Shows

The most relevant external study is Liang et al. 2025, which compared automated tools and LLMs against professional developer review on 234 real open-source PRs. LLMs generated more total comments than human reviewers but with substantially lower precision on security and logic issues. Human reviewers caught critical issues at roughly 2.3x the rate of the best-performing LLM tested (GPT-4.1, predating GPT-5 and Claude Opus 4.6). The gap was smaller on style, where LLMs matched human precision in several categories.

A 0.71 S/L F1 still means 30% of security and logic issues go undetected. That is not a passing grade for autonomous operation on a codebase with real security exposure.

Every tool in this table also generates too many comments per PR. Comment volume above a threshold decreases PR author receptiveness - the reviewee disengages when facing 40 comments, most of which are nitpicks. Noise suppression is where products differentiate most clearly: CodeRabbit's deduplication, PR-Agent's severity filtering, GitHub Copilot's comment collapsing. A raw Claude API call with a naive "review this diff" prompt will have higher signal in individual comments and worse signal-to-noise overall.
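The severity-filtering variant of noise suppression is easy to sketch. The severity labels, threshold, and comment budget below are illustrative assumptions, not any product's defaults.

```python
# Severity-based noise suppression sketch (assumed labels and thresholds).

SEVERITY_RANK = {"critical": 3, "high": 2, "medium": 1, "low": 0}

def suppress_noise(comments: list, max_comments: int = 10,
                   min_severity: str = "medium") -> list:
    """Drop comments below the severity floor, then cap total volume."""
    floor = SEVERITY_RANK[min_severity]
    kept = [c for c in comments if SEVERITY_RANK[c["severity"]] >= floor]
    # Over budget? Keep the highest-severity findings first.
    kept.sort(key=lambda c: SEVERITY_RANK[c["severity"]], reverse=True)
    return kept[:max_comments]

review = [
    {"severity": "low", "text": "Prefer f-strings here"},
    {"severity": "critical", "text": "Token is logged in plaintext"},
    {"severity": "medium", "text": "Loop re-reads the file on every iteration"},
]
filtered = suppress_noise(review, max_comments=2)   # nitpick dropped, bug kept
```

The point is not the ten lines of code; it is that volume control has to happen in the scaffold, because the base model will happily emit forty comments if you let it.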

One more caveat: CodeReviewer and CR-Bench are weighted toward Java, Python, TypeScript, C++. The benchmarks underrepresent Rust, Go, Kotlin, and Ruby. Quality degrades noticeably on languages with smaller representation - Llama 4 Maverick especially. Teams building on Go or Rust codebases should treat these numbers as directional and run their own bakeoff.


Practical Guidance

For teams using GitHub or GitLab with budget for a managed tool: CodeRabbit at $15/user/month is the best current option for integrated PR review. It handles the noise problem better than any other product and uses a strong Claude backend. For security-sensitive codebases with significant Java, add CodeGuru Reviewer alongside it - they cover different issue classes.

For teams that want open-source or self-hosted: PR-Agent with a GPT-5 or Claude Opus 4.6 API key is the correct choice. It is genuinely close to commercial quality, configurable, and auditable.

For using base LLMs directly via API: Claude Opus 4.6 is the strongest reviewer for complex logic issues. For high-volume review where cost matters, Claude Sonnet 4.6 is the right tradeoff. Use a structured prompt that explicitly requests security, logic, and style comments in separate sections - unstructured review prompts generate noisier output with lower precision.
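A minimal structured-review prompt along those lines might look like the following. The section headings and wording are my own sketch; tune them for your stack and send the result through whichever model API you use.

```python
# Illustrative structured review prompt template (wording is an assumption,
# not a tested-optimal prompt). Send via your model API of choice.

REVIEW_PROMPT = """You are reviewing the following unified diff.

Report findings in exactly three sections, most severe first:

## Security
Vulnerabilities, injection risks, auth/crypto mistakes. Write "None found" if empty.

## Logic
Bugs reachable by tracing execution: off-by-one errors, race conditions,
broken invariants, API contracts that will fail downstream. Write "None found" if empty.

## Style
At most 3 items. Skip anything a linter would catch.

For each finding give: file, line, a one-sentence issue, a one-sentence suggested fix.

Diff:
{diff}
"""

def build_review_prompt(diff: str) -> str:
    return REVIEW_PROMPT.format(diff=diff)
```

Forcing separate sections does two things: it stops style nitpicks from crowding out the security pass, and "None found" gives you an explicit signal to distinguish "no issues" from "did not look".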

For security-first review: Amazon CodeGuru Reviewer in Java/Python, combined with any of the top-3 LLMs for general review. Do not rely on pure LLM tools alone for security-sensitive code paths.

For open-weight/self-hosted deployments: Llama 4 Maverick can handle style and documentation review adequately. Its logic and security miss rate is too high for sole reliance. Treat it as a first-pass filter, not a final reviewer, and plan for a higher rate of human follow-through.

For related rankings, see the SWE-Bench Coding Agent Leaderboard for autonomous bug-fixing, the Coding Benchmarks Leaderboard for HumanEval and MBPP-style generation, and the AI Safety Leaderboard for models used in security-sensitive environments.


About the author: AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.