Coding Benchmarks Leaderboard: SWE-Bench, Terminal-Bench, and LiveCodeBench
Rankings of the best AI models for coding tasks across SWE-Bench, Terminal-Bench, and LiveCodeBench benchmarks, measuring real-world software engineering and algorithmic problem-solving ability.

Coding ability has become the most important differentiator among frontier AI models. As knowledge benchmarks saturate and reasoning scores converge, the ability to write, debug, and maintain real software is where models truly separate themselves. This leaderboard tracks performance across three major coding benchmarks: SWE-Bench (real-world software engineering), Terminal-Bench (autonomous terminal usage), and LiveCodeBench (competitive programming problems released after training cutoffs).
The Coding Benchmarks Explained
SWE-Bench is the gold standard for evaluating AI on real software engineering tasks. Each problem is a real GitHub issue from a popular open-source project (like Django, Flask, or scikit-learn), paired with the codebase at that point in time. The model must read the issue, navigate the codebase, and produce a patch that resolves the issue and passes the project's test suite. SWE-Bench comes in three variants: the original SWE-Bench (2,294 tasks), SWE-Bench Verified (500 human-validated tasks), and SWE-Bench Pro (a harder subset requiring multi-file changes and deeper reasoning).
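Conceptually, the evaluation loop is simple even though the official harness adds containerization and careful dependency pinning. Here is a minimal sketch of a single evaluation step; the `generate_patch` call is a hypothetical stand-in for the model under test, and the repository and test identifiers are illustrative rather than taken from the real dataset.

```python
# Minimal sketch of a single SWE-Bench-style evaluation step.
# `generate_patch` is a hypothetical stand-in for the model under test;
# the official harness also runs everything in a container and checks
# that previously passing tests still pass (no regressions).
import subprocess
import tempfile

def evaluate_task(repo_url: str, base_commit: str, issue_text: str,
                  failing_tests: list[str]) -> bool:
    with tempfile.TemporaryDirectory() as workdir:
        # Check out the codebase exactly as it was when the issue was filed.
        subprocess.run(["git", "clone", repo_url, workdir], check=True)
        subprocess.run(["git", "checkout", base_commit], cwd=workdir, check=True)

        # Ask the model for a unified diff that resolves the issue.
        patch = generate_patch(issue_text, workdir)  # hypothetical model call

        # Apply the patch; a malformed diff counts as an unresolved task.
        applied = subprocess.run(["git", "apply", "-"], input=patch,
                                 text=True, cwd=workdir)
        if applied.returncode != 0:
            return False

        # The issue counts as resolved only if the originally failing
        # tests now pass.
        result = subprocess.run(["python", "-m", "pytest", *failing_tests],
                                cwd=workdir)
        return result.returncode == 0
```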
Terminal-Bench evaluates a model's ability to operate autonomously within a terminal environment. Tasks range from system administration to data processing pipelines, requiring models to chain together shell commands, handle errors, and manage file systems. This benchmark tests practical DevOps and systems programming ability that SWE-Bench does not capture.
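Under the hood, this kind of benchmark is a propose-execute-observe loop: the model sees the transcript so far, emits the next shell command, and gets the output back. Below is a minimal sketch of such a loop, assuming a hypothetical `propose_command` model call; real harnesses run inside sandboxed containers and score tasks with dedicated verification scripts.

```python
# Minimal sketch of an agentic terminal loop. `propose_command` is a
# hypothetical model call; real harnesses sandbox execution and verify
# task completion with per-task checker scripts.
import subprocess

def run_terminal_task(task_prompt: str, max_steps: int = 20) -> str:
    transcript = [f"TASK: {task_prompt}"]
    for _ in range(max_steps):
        # The model reads the transcript and proposes the next command,
        # or the literal string "DONE" when it thinks the task is complete.
        command = propose_command("\n".join(transcript))  # hypothetical
        if command.strip() == "DONE":
            break
        # Run the command and capture stdout and stderr so the model can
        # react to errors on the next step.
        result = subprocess.run(command, shell=True, capture_output=True,
                                text=True, timeout=60)
        transcript.append(f"$ {command}")
        transcript.append(result.stdout + result.stderr)
    return "\n".join(transcript)
```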
LiveCodeBench collects competitive programming problems from Codeforces, LeetCode, and AtCoder that were published after model training cutoffs. This makes it a clean test of genuine algorithmic reasoning rather than memorization of training data. Problems span easy to hard difficulty, and scoring accounts for both correctness and efficiency.
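The contamination control is essentially a date filter: a problem only counts toward a model's score if it was published after that model's training cutoff. A small sketch of that filter and a pass/fail check is shown below; the field names are illustrative assumptions, not the dataset's actual schema.

```python
# Sketch of LiveCodeBench-style contamination filtering and grading.
# Field names (`release_date`, `tests`) are illustrative assumptions,
# not the dataset's real schema.
from collections.abc import Callable
from datetime import date

def contamination_free(problems: list[dict], training_cutoff: date) -> list[dict]:
    """Keep only problems released after the model's training cutoff."""
    return [p for p in problems if p["release_date"] > training_cutoff]

def solves(submission: Callable[[str], str], tests: list[tuple[str, str]]) -> bool:
    """A submission is correct only if it matches the expected output on every test."""
    return all(submission(stdin).strip() == expected.strip()
               for stdin, expected in tests)
```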
Overall Coding Rankings
| Rank | Model | Provider | SWE-Bench Verified | SWE-Bench Pro | LiveCodeBench | Terminal-Bench |
|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | Anthropic | 72.5% | 45.2% | 68.1% | 58.3% |
| 2 | GLM-5 | Zhipu AI | 77.8% | 38.6% | 55.4% | 42.1% |
| 3 | GPT-5.3-Codex | OpenAI | 68.9% | 56.4% | 71.2% | 61.5% |
| 4 | DeepSeek V3.2-Speciale | DeepSeek | 77.8% | 40.1% | 62.3% | 48.7% |
| 5 | Claude Opus 4.5 | Anthropic | 68.4% | 41.8% | 64.5% | 55.2% |
| 6 | GPT-5.2 Pro | OpenAI | 65.8% | 48.3% | 66.8% | 54.1% |
| 7 | Gemini 3 Pro | Google DeepMind | 63.2% | 42.5% | 60.1% | 50.3% |
| 8 | Grok 4 Heavy | xAI | 61.0% | 39.8% | 58.2% | 46.5% |
| 9 | Qwen 3.5 | Alibaba | 62.5% | 35.2% | 54.8% | 41.8% |
| 10 | Llama 4 Maverick | Meta | 55.8% | 30.1% | 48.3% | 38.2% |
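Which model looks best depends heavily on which column you sort by. As a quick illustration, the sketch below copies the top four rows from the table and re-ranks them by any single benchmark; the overall ranking above considers all four columns at once.

```python
# Re-rank the top rows of the leaderboard by any single benchmark column.
# Scores are copied from the table above.
scores = {
    "Claude Opus 4.6":        {"swe_verified": 72.5, "swe_pro": 45.2,
                               "livecodebench": 68.1, "terminal": 58.3},
    "GLM-5":                  {"swe_verified": 77.8, "swe_pro": 38.6,
                               "livecodebench": 55.4, "terminal": 42.1},
    "GPT-5.3-Codex":          {"swe_verified": 68.9, "swe_pro": 56.4,
                               "livecodebench": 71.2, "terminal": 61.5},
    "DeepSeek V3.2-Speciale": {"swe_verified": 77.8, "swe_pro": 40.1,
                               "livecodebench": 62.3, "terminal": 48.7},
}

def rank_by(column: str) -> list[tuple[str, float]]:
    """Return (model, score) pairs sorted best-first on a single benchmark."""
    return sorted(((model, row[column]) for model, row in scores.items()),
                  key=lambda pair: pair[1], reverse=True)

print(rank_by("swe_pro"))       # GPT-5.3-Codex leads on SWE-Bench Pro
print(rank_by("swe_verified"))  # GLM-5 and DeepSeek V3.2-Speciale tie at the top
```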
Key Takeaways
Claude Opus 4.6 Leads the Overall Coding Pack
Looking across all four score columns (the two SWE-Bench variants, LiveCodeBench, and Terminal-Bench), Claude Opus 4.6 emerges as the most consistently strong performer. Its 72.5% on SWE-Bench Verified is the highest among models from major Western labs, and its strong Terminal-Bench score reflects genuine ability to operate as an autonomous coding agent. Developers using Claude for software engineering tasks consistently report that it excels at understanding large codebases and producing well-structured, production-ready patches.
GLM-5 and DeepSeek V3.2 Are SWE-Bench Verified Champions
Both GLM-5 and DeepSeek V3.2-Speciale hit 77.8% on SWE-Bench Verified, the highest scores on this particular benchmark variant. However, their performance on the harder SWE-Bench Pro subset trails the leaders significantly, suggesting strength on more straightforward issues but limitations on complex multi-file changes. This highlights why looking at a single benchmark number can be misleading.
GPT-5.3-Codex Is Built for Code
OpenAI's dedicated coding model, GPT-5.3-Codex, leads on SWE-Bench Pro at 56.4%, demonstrating particular strength on the hardest software engineering problems. It also tops LiveCodeBench, indicating strong algorithmic reasoning. This model is specifically fine-tuned for code generation and understanding, trading some general-purpose ability for coding specialization.
The SWE-Bench Verified vs. Pro Gap
The difference between SWE-Bench Verified and Pro scores is revealing. Verified tasks are human-validated to be solvable and tend to involve single-file changes with clear specifications. Pro tasks require navigating ambiguous requirements, modifying multiple files, and understanding complex codebases at a deeper level. Among the top four models in the table, the drop from Verified to Pro ranges from roughly 12 points (GPT-5.3-Codex) to nearly 40 points (GLM-5), as the quick calculation below shows, indicating that truly complex software engineering remains a major challenge.
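The per-model drop can be read straight off the table; a short calculation makes the spread explicit (scores copied from the leaderboard above).

```python
# Percentage-point drop from SWE-Bench Verified to SWE-Bench Pro for the
# top four models, using the scores from the leaderboard table above.
verified = {"Claude Opus 4.6": 72.5, "GLM-5": 77.8,
            "GPT-5.3-Codex": 68.9, "DeepSeek V3.2-Speciale": 77.8}
pro      = {"Claude Opus 4.6": 45.2, "GLM-5": 38.6,
            "GPT-5.3-Codex": 56.4, "DeepSeek V3.2-Speciale": 40.1}

drops = {model: round(verified[model] - pro[model], 1) for model in verified}
print(drops)
# {'Claude Opus 4.6': 27.3, 'GLM-5': 39.2,
#  'GPT-5.3-Codex': 12.5, 'DeepSeek V3.2-Speciale': 37.7}
```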
Why Coding Benchmarks Matter
Coding ability is a proxy for some of the most valuable capabilities an AI model can have. Writing correct code requires understanding precise specifications, reasoning about edge cases, maintaining internal consistency across a complex output, and grounding responses in concrete, testable behavior. A model that writes good code is usually also strong at structured reasoning, careful analysis, and following detailed instructions.
For businesses evaluating AI models, coding benchmarks offer the most reliable signal of practical utility. Unlike knowledge benchmarks, where the difference between 85% and 88% is hard to feel in practice, the gap between a model that resolves 55% of real GitHub issues and one that resolves 72% translates directly into developer productivity.
Choosing the Right Model for Coding
For general software engineering (fixing bugs, adding features, code review), Claude Opus 4.6 and GPT-5.3-Codex are the strongest choices. For competitive programming and algorithmic challenges, GPT-5.3-Codex and Claude Opus 4.6 lead on LiveCodeBench. For cost-sensitive coding tasks, DeepSeek V3.2-Speciale delivers near-frontier performance at a fraction of the price. And for self-hosted coding assistants, GLM-5 and Llama 4 Maverick offer strong open-weight options.
The coding benchmark landscape continues to evolve. We expect new benchmarks targeting agentic coding, multi-step debugging, and full-stack development to emerge in the coming months, giving an even richer picture of AI coding capability.