SWE-Bench Coding Agent Leaderboard 2026
Rankings of the best LLM-powered software engineering agents on SWE-Bench Verified, with pass rates, pricing, scaffold notes, and methodology - updated April 2026.

Most coding benchmarks measure whether a model can write a function given a docstring. SWE-Bench measures whether an agent can fix a real GitHub issue in a real open-source codebase. That distinction matters enormously. Writing a correct binary_search implementation in isolation is a solved problem. Navigating a 40,000-line Django codebase, understanding what broke, and producing a patch that passes the project's existing test suite is not.
This leaderboard ranks the top LLM-powered software engineering agents by their measured pass rate on SWE-Bench Verified - the 500-task subset that has been human-reviewed to confirm the issues are actually solvable and the acceptance criteria are unambiguous. Where data is available, I also include SWE-Bench Full (2,294 tasks) and SWE-Bench Multimodal scores.
If you are looking for HumanEval, MBPP, and LiveCodeBench-style rankings, see the Coding Benchmarks Leaderboard. For broader agentic capability rankings, see the Agentic AI Benchmarks Leaderboard.
TL;DR
- The top SWE-Bench Verified scores now exceed 60%, with Claude-backed agents leading at 65-72%
- The scaffold matters as much as the base model - the same base model powering different agents can swing 15+ percentage points
- Devin 2.0 sits at 45.8% in standard unassisted mode; carefully engineered scaffolds using the same underlying models score much higher
- Open-source agents (OpenHands, SWE-agent, Moatless) close the gap significantly - the best open-source scaffold paired with a frontier API model now reaches 68%+
- Contamination risk on the original SWE-Bench Full is real; Verified is the number to trust
What Is SWE-Bench?
SWE-Bench was introduced by Princeton NLP researchers in late 2023. Each task presents an agent with a real GitHub issue from a popular Python open-source project - Django, Flask, scikit-learn, matplotlib, sympy, and others - along with a snapshot of the codebase at the time the issue was filed. The agent must produce a patch (a unified diff) that resolves the issue. Success is measured by running the project's existing test suite: the patch must pass all tests that were failing due to the bug and not break any tests that were previously passing.
The benchmark tests capabilities that pure code-generation benchmarks miss entirely: repository-level navigation, reading unfamiliar code at scale, understanding the implicit conventions of a specific codebase, and producing minimal, targeted changes rather than wholesale rewrites.
SWE-Bench Verified
SWE-Bench Verified was created by OpenAI in collaboration with the SWE-Bench authors as a response to concerns about task quality in the original benchmark. A team of human software engineers reviewed all 2,294 tasks and flagged issues with ambiguous specifications, under-specified acceptance criteria, or flawed test harnesses. The result is a 500-task subset where every problem is confirmed solvable and the evaluation is reliable. Verified is now the standard reported number for any serious comparison.
SWE-Bench Full
The original 2,294-task benchmark is still useful for tracking relative progress and for training data generation, but it carries two important caveats. First, some tasks have contested correctness - a patch can fail its tests while being a legitimate fix to the described issue, or pass tests while fixing something slightly different. Second, because SWE-Bench Full has been public since late 2023, there is meaningful contamination risk: open-source models trained after that date have had exposure to the tasks. Treat Full scores with more skepticism than Verified.
SWE-Bench Multimodal
SWE-Bench Multimodal extends the benchmark to include JavaScript repositories (React, Vue.js, D3, etc.) and introduces visual elements - some issues include screenshots of broken UI behavior that the agent must interpret. As of April 2026, relatively few agents report scores here, and the numbers are significantly lower than Verified across the board. I include available data in the main table.
The Leaderboard
Scores are from official publications, lab announcements, or the SWE-Bench leaderboard unless otherwise noted. "Agent" refers to the full system including scaffold, tools, and any retrieval or planning layers. "Base model" is the underlying LLM doing the reasoning and code generation. Pricing reflects the cost to the end user, not inference cost to the provider.
| Rank | Agent | Base Model | SWE-Bench Verified | SWE-Bench Full | SWE-M | Pricing | Notes |
|---|---|---|---|---|---|---|---|
| 1 | Augment Code SWE-Agent | Claude Opus 4.6 | 72.0% | 54.1% | - | Subscription | Best verified score as of Apr 2026; internal scaffold |
| 2 | OpenHands + CodeAct v3 | Claude Opus 4.6 | 68.4% | 51.2% | - | Open source | Community-run; reproduces within ~1% |
| 3 | Cursor Background Agent | Claude Sonnet 4.6 | 65.7% | 48.9% | - | $20-40/mo | IDE-integrated; Sonnet 4.6 base |
| 4 | Composio SWE-Kit | Claude Sonnet 4.6 | 62.3% | 45.8% | - | Pay-per-use | Enterprise tooling layer over Claude |
| 5 | Cline (Autonomous Mode) | Claude Sonnet 4.6 | 59.8% | 44.1% | - | Free (BYOK) | VS Code extension; bring-your-own key |
| 6 | Factory Droid | GPT-5.3-Codex | 58.1% | 43.2% | - | Enterprise | Factory.ai CI/CD integration |
| 7 | Devin 2.0 | Proprietary | 45.8% | 35.6% | - | $500/mo | Cognition AI; unassisted standard eval |
| 8 | OpenHands + CodeAct v2 | GPT-5.2 | 44.7% | 33.9% | - | Open source | Older CodeAct scaffold, GPT-5.2 base |
| 9 | SWE-agent v1 | Claude Sonnet 4.5 | 43.2% | 31.1% | - | Open source | Princeton NLP; ACI scaffold |
| 10 | AutoCodeRover v2 | GPT-5.2 | 38.6% | 29.4% | - | Research | NUS Singapore; program-analysis guided |
| 11 | Moatless Tools | Claude Haiku 4.5 | 35.9% | 27.8% | - | Open source | Strong cost-per-resolved-issue ratio |
| 12 | Agentless v2 | Claude Sonnet 4.5 | 34.2% | 26.1% | - | Open source | Minimal orchestration; no tool calls |
| 13 | Aider (architect mode) | Claude Sonnet 4.5 | 31.4% | 24.0% | - | Free (BYOK) | Two-model architect-editor pipeline |
| 14 | SWE-agent v1 | GPT-5.2 | 29.7% | 22.5% | - | Open source | Same scaffold; weaker base model here |
| 15 | Continue (Agent Mode) | Llama 4 Maverick | 18.3% | 14.1% | - | Free (BYOK) | Open-source; best local-model result |
| 16 | Moatless Tools | Llama 4 Maverick | 14.7% | 11.2% | - | Open source | Useful for air-gapped deployments |
Scores represent pass@1 (single attempt) unless noted. Table last updated April 19, 2026. "-" in SWE-M column means no published Multimodal score.
Reading the Table
Scaffold vs. Base Model
The most important number on this table is not any individual score. It is the spread between rows that share the same base model. SWE-agent v1 on GPT-5.2 scores 29.7% on Verified. OpenHands + CodeAct v2 on the same model scores 44.7%. That is a 15-point gap from orchestration alone - how the agent navigates the codebase, how it structures its edits, how it handles test failures, and how many retries it takes.
The Augment Code result (72.0%) using Claude Opus 4.6 shows what happens when a funded team builds a purpose-designed software engineering scaffold on top of the strongest available base model. The open-source OpenHands + CodeAct v3 result (68.4%) on the same base model is remarkably close, showing that the open-source community has matched proprietary scaffolds nearly point-for-point once the base model is held constant.
This has a practical implication: if you are building on top of these APIs, your retrieval strategy, context management, and error-recovery logic will drive most of the performance delta. The model vendor's leaderboard numbers tell you the ceiling, not what you will get out of the box with naive integration.
The Devin Situation
Devin 2.0's 45.8% on Verified deserves context. Cognition AI runs what they describe as a "standard" evaluation - single-agent, no human-in-the-loop, no best-of-N voting. The score is real and reproducible. The gap between Devin and the Claude-based top of the table is not a sign that Cognition's underlying model is weak. It reflects two things: first, Devin uses a proprietary model that Cognition has not disclosed, which may be smaller than or optimized differently from Opus 4.6; second, the evaluation methodology differs from how some of the top-scoring agents report their numbers.
When Cognition published their original SWE-Bench technical report in early 2024, Devin scored 13.86% - well above the state of the art at the time. Devin 2.0 at 45.8% represents genuine progress, not a stagnant product. But the headline number should be compared to other pass@1 unassisted results, not to best-of-N scaffolded results from academic papers.
Test-Time Compute Tradeoffs
Several entries in this table can be configured to use more compute per task. The practical levers are:
Best-of-N / majority voting. Run the same task N times and take the result that appears most often (for tasks with deterministic outputs) or the highest-confidence result (where a critic model scores candidate patches). OpenHands with 5-sample majority voting on Verified jumps approximately 8-12 points over pass@1, depending on the model. The cost multiplier is linear.
Critic models. Some scaffolds (including Augment Code's) add a separate model pass that reviews candidate patches before submission - looking for obvious regressions, missed edge cases, or incomplete fixes. This adds roughly 15-20% to the token cost of each task and reliably adds 3-6 points on Verified in the published ablations.
Extended context windows. SWE-Bench tasks often require reading substantial portions of large codebases. Models with larger context windows can ingest more of the relevant files in a single pass rather than relying on retrieval. Moving from 32K to 128K context adds noticeable improvement on tasks involving large interconnected modules, at the cost of proportionally higher token spend.
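The first lever above, best-of-N with majority voting, is simple to sketch. This is an illustrative reduction, not any scaffold's actual implementation; it assumes candidate patches have been normalized (e.g. whitespace-stripped unified diffs) so that semantically identical attempts compare equal:

```python
from collections import Counter

def majority_vote(candidate_patches):
    """Pick the most frequently produced patch across N sampled runs.

    Ties resolve to the first-seen candidate, since Counter preserves
    insertion order when counts are equal.
    """
    patch, _count = Counter(candidate_patches).most_common(1)[0]
    return patch

# Five sampled runs at roughly 5x the token cost of pass@1:
samples = ["patch_a", "patch_a", "patch_b", "patch_a", "patch_c"]
print(majority_vote(samples))  # patch_a
```

The linear cost multiplier falls directly out of this structure: N samples means N full agent runs before the vote.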
The table above shows pass@1 standard-eval numbers throughout. If a provider has only published best-of-N numbers, I have noted that in the Notes column and not included those scores in the ranking.
Each SWE-Bench task gives the agent a real GitHub issue and codebase snapshot. Success requires navigating the repo, understanding the bug, and producing a minimal patch that passes all tests.
Benchmark Mechanics and Methodology
How Evaluation Works
The SWE-Bench evaluation harness clones the repository at the issue's commit, applies the agent's patch, and runs the test suite in an isolated environment. "Resolved" means: every test that was failing due to the reported bug now passes, and no previously-passing test has been broken. Partial credit does not exist. A patch that fixes 90% of the issue and introduces a minor regression scores the same as a patch that does nothing.
The harness is open source and runnable by anyone. Independent reproduction of published scores is possible and has been performed by several research groups. The numbers in this table come from either direct reproductions, the official leaderboard, or official lab announcements with clearly stated methodology.
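The all-or-nothing resolution rule is mechanical enough to state in code. A sketch, assuming the harness-style FAIL_TO_PASS / PASS_TO_PASS test sets from the task metadata and a dict of post-patch test outcomes:

```python
def is_resolved(results, fail_to_pass, pass_to_pass):
    """Binary SWE-Bench resolution check - no partial credit.

    results      - test id -> True (passed) / False (failed), collected
                   after applying the agent's patch
    fail_to_pass - tests that failed pre-patch and must now pass
    pass_to_pass - tests that passed pre-patch and must not regress
    """
    fixed = all(results.get(t, False) for t in fail_to_pass)
    no_regressions = all(results.get(t, False) for t in pass_to_pass)
    return fixed and no_regressions

results = {"test_bug": True, "test_api": True, "test_io": False}
print(is_resolved(results, {"test_bug"}, {"test_api"}))  # True
print(is_resolved(results, {"test_bug"}, {"test_io"}))   # False (regression)
```

A patch that fixes the bug but breaks one previously-passing test scores identically to an empty patch.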
Contamination Concerns
SWE-Bench Full tasks have been public since late 2023. Any open-weight model trained after that point may have seen the task descriptions, the correct patches (which are GitHub commits, also public), or both. The contamination is impossible to fully audit because training data composition for most models is not disclosed.
SWE-Bench Verified uses the same underlying tasks, so it has the same contamination exposure. The human-review process reduces false positives in the evaluation (tasks where a model gets "credit" for unrelated reasons) but does not address data leakage.
The most contamination-resistant evaluation is running on tasks that postdate the model's training cutoff. Until a genuinely held-out rolling benchmark with temporal lockout ships, treat all SWE-Bench numbers as an upper bound on real-world generalization, not a lower bound.
Agent Scaffold Components
A competitive SWE-Bench scaffold in 2026 typically includes:
- Retrieval layer. BM25 or dense vector search over repository files, issue text, and git history to identify relevant files before any LLM call.
- Context management. Strategies for fitting large codebases into context windows - typically a combination of file-level relevance scoring and chunk-level deduplication.
- Tool definitions. File read, file write (or apply-diff), shell execution for running tests, and git operations. The specific tool schemas significantly affect how models structure their actions.
- Error recovery. Loops that detect test failures after patch application and re-attempt with the failure output in context. The number of retries allowed and the retry prompt structure affect final scores substantially.
- Linting and static analysis. Some scaffolds (AutoCodeRover notably) integrate program analysis tools - call graphs, symbol resolution, import tracing - to guide retrieval and reduce the search space before making LLM calls.
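The error-recovery component above can be sketched as a retry loop. `generate_patch` and `run_tests` are hypothetical stand-ins for the scaffold's model call and sandboxed test execution, not any particular project's API:

```python
def repair_loop(issue, generate_patch, run_tests, max_retries=3):
    """Retry patch generation with test failures fed back into context.

    generate_patch(issue, feedback) -> patch string    (hypothetical)
    run_tests(patch) -> (passed: bool, log: str)       (hypothetical)
    """
    feedback = None
    for attempt in range(1 + max_retries):
        patch = generate_patch(issue, feedback)
        passed, log = run_tests(patch)
        if passed:
            return patch
        # The next attempt sees the failing output, not just the issue.
        feedback = f"Attempt {attempt + 1} failed:\n{log}"
    return None  # retries exhausted; task reported unresolved
```

The two tuning knobs the text mentions - retry count and retry prompt structure - correspond here to `max_retries` and how `feedback` is composed.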
The Agentless approach (rank 12 in the table) is a deliberate counter-example: no tool calls, no iterative repair, just a structured three-stage pipeline (localization, repair, validation) that makes one shot at the patch. Its 34.2% at a fraction of the token cost of iterative agents makes it relevant for cost-sensitive use cases.
SWE-Bench Verified vs. Full: Which Number to Trust
For comparing agents, always use Verified. The correlation between Full and Verified is high but not perfect, and several historical cases exist where a model's Full score was inflated by flawed task formulations that the Verified review process caught. The 2,294-task Full benchmark is most useful as a training signal and for historical trend analysis going back to 2023.
What the Numbers Actually Mean
The 60% Wall Is Not a Wall Anymore
When SWE-Bench launched, 50% was a moonshot. The first time any system crossed 50% on Verified was early 2025. As of April 2026, the top four systems are above 60%, and the leader is at 72.0%. The benchmark is not saturated - human software engineers working on the same tasks score around 90% with reasonable time budgets - but the distance from "useless" to "better than most junior engineers on this specific task type" has been crossed.
What this does not mean: that you can drop a coding agent into your production codebase and expect 72% of its PRs to merge without review. SWE-Bench Verified is curated tasks with clear issue descriptions and known-good test suites. Real development work involves ambiguous specifications, undocumented behavior, flaky tests, and stakeholder expectations that are not written down anywhere. The benchmark measures a real and important capability, but the translation to production utility requires significant additional scaffolding and human oversight.
Base Model Quality Is the Ceiling
The range in this table from 14.7% (Moatless + Llama 4 Maverick) to 72.0% (Augment + Claude Opus 4.6) shows both how much the scaffold matters and how much the base model matters. The best open-source scaffold (OpenHands CodeAct v3) on a strong model (Claude Opus 4.6) reaches 68.4%. The previous version of that scaffold (CodeAct v2) on GPT-5.2 reaches 44.7%. Scaffold optimization on a weaker model has diminishing returns beyond a certain point.
This creates a clear decision tree for teams building coding agents:
- If you are using a cloud API: pick the strongest model your budget allows and invest in scaffold quality second. Claude Opus 4.6 and GPT-5.3-Codex are the obvious base models.
- If you are using open-source models for on-prem or cost reasons: Llama 4 Maverick is currently the strongest option, but accept a performance gap of 50+ points versus the frontier. Moatless is the most efficient scaffold for cost-per-resolved-task on smaller models.
- If you are building for automated CI/CD (zero human review): budget for best-of-N sampling. The pass@1 numbers in the table are too low for many production use cases; pass@3 or pass@5 with a critic model brings the reliability to a more defensible level.
Open Source Is Competitive on Scaffold, Not on Base Model
OpenHands + CodeAct v3 at 68.4% is within 4 points of the best proprietary system. That gap comes from the base model, not the scaffold. The open-source community has built orchestration that is essentially as good as anything proprietary. The remaining gap between open-source deployments and frontier results is almost entirely base model quality, which for now means API access to Anthropic or OpenAI models.
For teams that need fully self-hosted deployments, Continue's agent mode with Llama 4 Maverick at 18.3% is the best published result, with Moatless (14.7%) as a leaner alternative; both fall far short of API-backed systems. Purpose-fine-tuned coding models like SWE-Llama or equivalent domain-adapted open models narrow the gap somewhat but are not yet tracked in this leaderboard due to limited independent verification.
Agent Profiles
Augment Code SWE-Agent
Augment Code is the enterprise coding assistant from the team that includes former members of Sourcegraph. Their SWE-Bench submission uses Claude Opus 4.6 with a proprietary scaffold focused on large-codebase navigation. The 72.0% Verified score is the highest in this table and is notable for being achieved on a standard pass@1 evaluation without best-of-N tricks. Augment has not published full architectural details, but their blog describes a hybrid BM25 + embedding retrieval system with a code-graph layer for tracking symbol dependencies across files.
OpenHands (formerly OpenDevin)
OpenHands is the most active open-source coding agent project as of April 2026. The CodeAct scaffold defines agent actions as Python code rather than structured tool calls - the agent writes actual Python to read files, run tests, and apply changes, which runs in a sandboxed environment. This turns out to significantly outperform JSON-schema tool calling on complex multi-step tasks. The project publishes their SWE-Bench evaluation scripts, and their Verified scores are independently reproducible. If you are building a coding agent and want a starting point, OpenHands is where to look.
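A toy illustration of the code-as-action idea, not OpenHands' actual sandbox: the agent emits Python source, and the scaffold executes it in a namespace where helper functions stand in for tools. `workspace` here is a hypothetical dict of file path to contents; real CodeAct runs the code in an isolated container with timeouts:

```python
def execute_action(agent_code, workspace):
    """Run agent-emitted Python in a namespace that exposes tool helpers."""
    log = []

    def read_file(path):
        return workspace.get(path, "")

    def write_file(path, content):
        workspace[path] = content
        log.append(f"wrote {path}")

    # The agent composes tools with ordinary Python control flow,
    # instead of emitting one JSON tool call per step.
    exec(agent_code, {"read_file": read_file, "write_file": write_file})
    return log

ws = {"app.py": "x = 1\n"}
action = 'write_file("app.py", read_file("app.py") + "y = 2\\n")'
print(execute_action(action, ws))  # ['wrote app.py']
```

The advantage over JSON tool calling shows up on multi-step tasks: loops, conditionals, and intermediate variables come for free from the host language.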
Cursor Background Agent
Cursor's Background Agent mode takes their IDE-integrated assistant and runs it autonomously on a task specification - no human in the loop. It scores 65.7% on Verified using Claude Sonnet 4.6, which is strong given that it uses a cheaper model than the top two entries. Cursor's scaffold has been optimized over several years of real-world IDE usage, which likely explains the strong efficiency. Background Agent is available on the $40/month plan.
Factory Droid
Factory.ai's Droid is notable for being the only top-10 entry built primarily on GPT-5.3-Codex rather than a Claude model. Factory integrates Droid into CI/CD pipelines, triggering automated fix attempts when tests fail in PRs. Their 58.1% Verified score reflects a GPT-5.3-Codex base with a scaffold optimized for the automated CI context rather than interactive development.
Devin 2.0
Devin from Cognition AI launched in early 2024 as the first publicly marketed "AI software engineer." The current Devin 2.0 scores 45.8% on SWE-Bench Verified in their standard evaluation. The score is genuine and the methodology is conservative (pass@1, no human assistance). At $500/month, it targets enterprise teams that want a fully managed agent with a web interface, not a raw API. Cognition continues to publish detailed technical reports, and their transparency about methodology is among the best in the category.
SWE-agent
SWE-agent is from the Princeton NLP lab, the same group that created SWE-Bench. The Agent-Computer Interface (ACI) - a set of custom tools designed specifically for software engineering tasks like view_file, create_file, and search_dir - is their key architectural contribution. SWE-agent proved that purpose-designed tools significantly outperform general-purpose tool sets. Most commercial scaffolds have since adopted similar patterns. At 43.2% on Verified with Claude Sonnet 4.5, it is not at the frontier, but it is a clean reference implementation that is easy to study and extend.
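The ACI idea can be shown with simplified stand-ins for two of the tools named above (these are not SWE-agent's actual implementations): a bounded file viewer and a repository-scoped search, both shaped for code work rather than raw shell access.

```python
def view_file(lines, start, window=5):
    """Return a numbered window into a file.

    Bounded windows (rather than whole-file dumps) keep context small
    and give the model stable line references to edit against.
    """
    end = min(start + window, len(lines))
    return [f"{i + 1}: {lines[i]}" for i in range(start, end)]

def search_dir(files, term):
    """Return paths whose contents mention `term` (simplified search)."""
    return sorted(path for path, text in files.items() if term in text)

files = {"auth.py": "def login(): ...", "views.py": "def index(): ..."}
print(search_dir(files, "login"))  # ['auth.py']
print(view_file(["import os", "", "def main():", "    pass"], 2, window=2))
# ['3: def main():', '4:     pass']
```

The design choice is the point: constraining what the model can see and do per step turned out to help more than giving it an unrestricted terminal.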
Moatless Tools
Moatless Tools is a lightweight open-source scaffold with an emphasis on cost efficiency. Using Claude Haiku 4.5 (the cheapest Anthropic model), it achieves 35.9% on Verified - a pass-rate-per-dollar that beats most larger setups. The Moatless philosophy is minimal context: fetch only the specific code locations that are directly relevant rather than loading large file chunks. For high-volume automated applications where cost per task is the primary constraint, Moatless is worth evaluating seriously.
Agentless
Agentless is a research scaffold built on the premise that less orchestration is sometimes more. It uses a fixed three-stage pipeline - fault localization (identifying which files and lines are relevant), patch generation (producing multiple candidate patches in parallel), and patch selection (ranking candidates by test results). There are no iterative tool calls, no error recovery loops, no multi-turn conversations. The simplicity makes it reproducible and cheap. At 34.2% on Verified, it significantly underperforms iterative agents, but its cost profile is radically different.
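The pipeline shape is simple enough to sketch in a few lines; the three stage functions here are hypothetical stand-ins for the localization, repair, and selection components described above:

```python
def agentless_pipeline(issue, repo, localize, repair, select):
    """One-shot localize -> repair -> select; no tools, no retries.

    localize(issue, repo) -> list of suspect code locations  (hypothetical)
    repair(issue, loc)    -> list of candidate patches       (hypothetical)
    select(candidates)    -> best patch, e.g. ranked against
                             regression tests                (hypothetical)
    """
    locations = localize(issue, repo)
    candidates = [p for loc in locations for p in repair(issue, loc)]
    return select(candidates) if candidates else None
```

Because every stage runs exactly once, total cost is bounded and predictable, which is the property that makes the approach attractive despite the lower resolution rate.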
Aider
Aider is an open-source command-line coding assistant that focuses on interactive pair programming rather than fully autonomous operation. Its "architect mode" (a two-model pipeline where a stronger model plans changes and a weaker model implements them) is what produces the 31.4% Verified score. Aider's strength is in developer-in-the-loop workflows rather than pure autonomy, which is worth noting when interpreting the leaderboard position.
Practical Guidance
For cloud-API-backed development assistants: Cursor Background Agent and Cline in autonomous mode offer the best value for individual developers. Both use Claude Sonnet 4.6 and sit above 59% on Verified. Cline is free with your own API key; Cursor charges a subscription that includes the API cost.
For enterprise deployments with budget for the best model: Augment Code's scaffold or OpenHands + CodeAct v3 on Claude Opus 4.6 are essentially equivalent and both above 68%. OpenHands is free and auditable; Augment is managed and enterprise-supported.
For CI/CD automation with zero human review: Budget for best-of-3 or best-of-5 sampling. The pass@1 numbers in this table are not sufficient for no-review merging in most codebases. Factory Droid's CI-native integration is worth evaluating if you are on a GPT-based stack.
For fully self-hosted open-source models: Continue (Agent Mode) with Llama 4 Maverick is the strongest current option at 18.3%; Moatless (14.7%) is the leaner choice for air-gapped deployments. Expect significant performance limitations versus API-backed systems and plan for higher human review rates. This will improve as open-weight coding models mature.
For research and prototyping: OpenHands is the most reproducible and well-documented open-source scaffold. SWE-agent and Agentless are useful references for understanding the design space.
For broader tool comparisons, see Best AI Coding Assistants 2026 and Best AI Coding CLI Tools 2026.
FAQ
What is SWE-Bench Verified?
SWE-Bench Verified is a 500-task subset of the full SWE-Bench benchmark, human-reviewed by OpenAI engineers to confirm that every task is solvable and the evaluation criteria are unambiguous. It is the standard comparison point for agent rankings as of 2025 onward.
What is the difference between a "scaffold" and the base model?
The base model is the LLM doing the reasoning - Claude Opus 4.6, GPT-5.3-Codex, etc. The scaffold is everything around it: how the agent retrieves relevant code, what tools it has access to, how it recovers from failed patches, how many retries it takes. The same base model in different scaffolds can vary by 15+ percentage points on SWE-Bench Verified.
Is Devin the best coding agent?
No. Devin 2.0 scores 45.8% on SWE-Bench Verified in standard unassisted evaluation. Several agents with better-optimized scaffolds on stronger base models score significantly higher. Devin was the first widely marketed autonomous coding agent, but the field has advanced considerably since its launch.
Can any of these agents be used unsupervised in production?
With caution and appropriate safety layers. Even the 72.0% leader fails on more than a quarter of tasks, and the failure modes - incorrect patches that pass tests, patches that fix the stated symptom but introduce a subtle regression - are not always easy to detect automatically. Best practice for production CI/CD is to use agents for first-draft patches, run comprehensive test suites, and keep a human in the review loop for non-trivial changes.
Why do scores differ between labs and papers?
Evaluation methodology varies. Key differences: pass@1 vs. best-of-N, with or without a critic model, context window size, number of retries allowed, and which version of the evaluation harness was used. The table above uses pass@1 standard-eval numbers throughout and notes any exceptions.
Is SWE-Bench Multimodal harder?
Yes. SWE-Bench Multimodal adds JavaScript repositories and visual elements (screenshots of broken UI behavior). Most agents that report Verified scores above 40% drop 15-25 points on Multimodal. The visual understanding requirement is a genuine bottleneck for current-generation agents.
Sources:
- SWE-Bench official site and Verified leaderboard
- SWE-Bench paper - arxiv.org
- SWE-Bench Verified paper - arxiv.org
- SWE-Bench Multimodal paper - arxiv.org
- SWE-bench GitHub repository
- OpenHands (All-Hands-AI) GitHub
- OpenHands paper - arxiv.org
- SWE-agent GitHub - Princeton NLP
- SWE-agent paper - arxiv.org
- AutoCodeRover v2 paper - arxiv.org
- Agentless paper - arxiv.org
- Augment Code
- Cognition AI - Devin
- Factory.ai
- Cursor
- Aider
- Cline
- Continue
- Sourcegraph Cody
✓ Last verified April 19, 2026
