Code Completion and Generation LLM Leaderboard 2026

Rankings of the best LLMs on code completion benchmarks - HumanEval, LiveCodeBench, BigCodeBench, MBPP, and competitive programming - with methodology notes on contamination. Updated April 2026.


I need to say something upfront that most coverage of code completion benchmarks glosses over: HumanEval is compromised. Not broken in the sense that the problems are wrong - Chen et al.'s 164 Python function stubs remain a reasonable test of basic algorithmic reasoning. Compromised in the sense that the entire dataset has been public since July 2021, and every model trained in the last three-plus years has almost certainly seen it. When you read a headline claiming some new model scores 98% on HumanEval, you are not reading a coding ability score. You are reading a memorization upper bound.

That does not mean HumanEval is useless. It is still a reasonable sanity check, a baseline that lets you compare a new model to a long historical record. What it is not is a reliable indicator of how well that model will complete functions it has never seen before. For that, you need LiveCodeBench.

This leaderboard covers pure code authoring: complete a function given a signature or docstring, generate a full solution from a spec, place in a competitive programming contest. It does not cover code review (see the LLM Code Review Leaderboard) or full-repository agent tasks (see the SWE-Bench Coding Agent Leaderboard).

TL;DR

  • Claude Opus 4 and GPT-5 lead on contamination-resistant LiveCodeBench; HumanEval numbers are largely untrustworthy at the frontier
  • DeepSeek-V3 and Qwen3-Coder are the strongest open-weight options and genuinely competitive with frontier closed models on LiveCodeBench
  • BigCodeBench is a better signal than HumanEval for realistic library-usage tasks - harder and less contaminated
  • Competitive programming benchmarks (APPS Hard, CodeContests, LCB Hard) show a large gap between reasoning-capable models and standard code models
  • Qwen 2.5-Coder 32B remains the strongest sub-40B open-weight model for code-specific deployments

The Benchmark Landscape - What to Trust and What to Ignore

HumanEval and MBPP - Useful History, Unreliable Frontier Scores

HumanEval (OpenAI, 2021) is 164 hand-written Python programming problems, each consisting of a function signature and docstring. The canonical metric is pass@1: generate one solution and check if it passes the test suite. The benchmark has been cited in essentially every code LLM paper published since 2021. That ubiquity is the problem. The tasks are public. The canonical correct solutions are public. Every training dataset scraped from GitHub, code forums, and ML papers after mid-2021 has almost certainly included HumanEval task descriptions and solutions.
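Mechanically, pass@1 is simple: run the model once per task, execute its completion against the task's unit tests, and score pass or fail. A minimal sketch of that harness, using a hypothetical task rather than a real HumanEval problem:

```python
# Minimal sketch of a HumanEval-style pass@1 check.
# The task and tests here are hypothetical, not from the real dataset.

PROMPT = '''
def running_max(xs):
    """Return a list where element i is the max of xs[:i+1]."""
'''

# Pretend this string came back from the model as its single completion.
COMPLETION = '''
    out, best = [], float("-inf")
    for x in xs:
        best = max(best, x)
        out.append(best)
    return out
'''

def check_pass_at_1(prompt: str, completion: str) -> bool:
    """Exec the prompt + completion, then run the task's unit tests."""
    namespace = {}
    exec(prompt + completion, namespace)  # real harnesses sandbox this
    fn = namespace["running_max"]
    try:
        assert fn([1, 3, 2, 5]) == [1, 3, 3, 5]
        assert fn([-2, -5]) == [-2, -2]
        assert fn([]) == []
    except AssertionError:
        return False
    return True

print(check_pass_at_1(PROMPT, COMPLETION))  # True for this completion
```

Real harnesses add sandboxing, timeouts, and per-task test files, but the scoring logic is exactly this: one completion, one binary outcome per task.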

MBPP (Google, 2021) - Mostly Basic Python Problems - is similarly compromised by age: 974 crowdsourced programming tasks drawn from beginner Python exercises, public since 2021 and present in every major training corpus.

EvalPlus (2023) partially addresses this by augmenting HumanEval and MBPP with automatically generated additional test cases, creating HumanEval+ and MBPP+. Models that passed the original sparse test suites by generating syntactically plausible but functionally incorrect code fare significantly worse on EvalPlus. The EvalPlus leaderboard is a better signal than raw HumanEval for separating genuine code generation ability from test-passing tricks. But contamination on the problem descriptions themselves remains.
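To see why sparse test suites overstate ability, consider a completion that looks plausible and passes a HumanEval-style suite but fails once EvalPlus-style edge cases are added. The task and tests below are hypothetical:

```python
# Why EvalPlus matters: a plausible-but-wrong completion can pass a
# sparse test suite and fail once edge cases are added. Hypothetical
# task: "return the median of a non-empty list".

def buggy_median(xs):
    # Looks right, but silently mishandles even-length lists.
    return sorted(xs)[len(xs) // 2]

# Sparse, original-style suite: only odd-length inputs. Passes.
assert buggy_median([3, 1, 2]) == 2
assert buggy_median([5]) == 5

# EvalPlus-style augmented case: even lengths expose the bug.
def passes_augmented(fn):
    try:
        assert fn([1, 2, 3, 4]) == 2.5   # true median, not sorted[2] == 3
        return True
    except AssertionError:
        return False

print(passes_augmented(buggy_median))  # False
```

This is the gap EvalPlus measures: the original suite scores this completion as a pass, the augmented suite correctly scores it as a fail.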

My methodology: HumanEval+ and MBPP+ numbers appear in the table because they are widely reported and provide historical context. Do not use them as your primary signal for frontier models. Use LiveCodeBench.

LiveCodeBench - The Number I Actually Trust

LiveCodeBench (2024) solves the contamination problem by pulling problems directly from competitive programming platforms - LeetCode, Codeforces, AtCoder - on a rolling basis, including problems released after most models' training cutoffs. The benchmark is continuously updated with new problems. A model cannot have memorized a problem that did not exist when it was trained.

The evaluation window matters. LiveCodeBench results are typically reported over a specific time window - "problems from Sept 2023 to Sept 2024" - and newer windows are harder. Models that score well on older windows often drop significantly on recent ones, which is itself diagnostic: if a model performs much better on older windows than recent ones, it has likely memorized the older problems.
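The diagnostic is easy to operationalize. A sketch, with an illustrative threshold (the 10-point cutoff is my assumption, not part of LiveCodeBench):

```python
# Sketch of the window-drop diagnostic described above: a model whose
# score falls sharply from an older LiveCodeBench window to a newer one
# has likely memorized the older problems. The threshold is illustrative.

def contamination_suspect(old_window_score: float,
                          new_window_score: float,
                          max_drop: float = 10.0) -> bool:
    """Flag a model whose score drops more than max_drop points
    between an older and a newer evaluation window."""
    return (old_window_score - new_window_score) > max_drop

# A model falling from 62.0 on a 2023 window to 41.5 on a 2025 window
# is suspect; a 3-point drop is probably just harder problems.
print(contamination_suspect(62.0, 41.5))  # True
print(contamination_suspect(60.0, 57.0))  # False
```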

For this leaderboard, I use the most recent available LiveCodeBench v5 window (Jan 2025 - Apr 2026) where data is available, with fallback to v4 (Sept 2024 - Jan 2025). If only an older window is available for a model, I note it in the table.

BigCodeBench - Harder and More Realistic

BigCodeBench (2024) is the benchmark that finally makes HumanEval look as easy as it deserves to. 1,140 tasks requiring realistic use of the standard library and popular third-party packages - Pandas, NumPy, Scikit-learn, Flask, requests, PIL, and dozens more. These are not algorithmic toy problems; they require understanding how actual Python libraries work, reading documentation-style descriptions, and generating code that uses real API calls correctly.
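To make the contrast with HumanEval concrete, here is a task in the BigCodeBench style - library usage rather than a toy algorithm. The task itself is hypothetical, and I use only the standard library so it runs anywhere; real BigCodeBench tasks also draw on Pandas, Flask, and the rest:

```python
# A hypothetical task in the BigCodeBench style: parse real-world data
# using real library APIs (csv, datetime, collections), not implement
# a textbook algorithm.

import csv
import io
from collections import Counter
from datetime import datetime

def busiest_weekday(csv_text: str) -> str:
    """Given CSV rows of `timestamp,event` with ISO-8601 timestamps,
    return the weekday name with the most events."""
    reader = csv.reader(io.StringIO(csv_text))
    days = Counter(
        datetime.fromisoformat(ts).strftime("%A") for ts, _event in reader
    )
    return days.most_common(1)[0][0]

data = """2026-01-05T09:00:00,login
2026-01-05T10:30:00,click
2026-01-06T11:00:00,login"""
print(busiest_weekday(data))  # Monday (2026-01-05 is a Monday)
```

Solving this requires knowing `csv.reader`, `datetime.fromisoformat`, `strftime` codes, and `Counter.most_common` - exactly the kind of API knowledge BigCodeBench probes and HumanEval does not.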

The BigCodeBench leaderboard shows a much wider spread between models than HumanEval. The ceiling is lower, the floor is lower, and the ordering differs meaningfully. If you are evaluating a model for practical software development tasks rather than algorithm competitions, BigCodeBench is the most relevant single benchmark on this list.

MultiPL-E - Does It Work in Your Language?

MultiPL-E translates HumanEval and MBPP into 18+ programming languages by automatically transpiling both problem descriptions and test cases. Coverage includes C++, Java, JavaScript, TypeScript, Go, Rust, Bash, PHP, Ruby, and others. Scores drop substantially as you move to lower-resource languages, and the drop is not uniform across models. A model trained heavily on Python with limited Rust exposure may score 85% on Python HumanEval and 40% on the Rust equivalent.

The contamination caveat applies here too, since the base problems are from HumanEval. MultiPL-E is most useful for measuring relative multilingual coverage within a model, not absolute code generation ability.

Competitive Programming - Where Reasoning Models Pull Away

APPS (2021) is 10,000 Python programming problems - half of them held out as a test set - scraped from competitive programming sites at introductory, interview, and competition difficulty levels. The Hard subset (competition-level) is a genuine test of algorithmic reasoning that most non-reasoning models struggle with. CodeContests (DeepMind, 2022) is a curated dataset of competitive programming problems with a similar difficulty distribution.

LiveCodeBench Hard (LCB Hard) is the subset of LiveCodeBench problems classified as "hard" on the source platform. This is the least contaminated and most demanding code generation evaluation currently available. The spread between models on LCB Hard is dramatic and directly correlated with reasoning capability.

DS-1000 - Data Science Workloads

DS-1000 (2022) is 1,000 data science problems from Stack Overflow, covering Pandas, NumPy, TensorFlow, PyTorch, Scikit-learn, SciPy, and Matplotlib. Problems are real user questions with real solutions verified by domain experts. It tests practical data science coding more directly than algorithmic benchmarks.

CRUXEval - Can It Reason About Code?

CRUXEval (2024) is different from everything else on this list. It does not ask models to write code. It gives models a function and asks them to predict the output for a given input (CRUXEval-O) or to infer what input would produce a given output (CRUXEval-I). It measures code reasoning and execution understanding rather than generation. Models that generate syntactically correct but semantically wrong code tend to do poorly here. It is a useful complement to generation benchmarks.
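A miniature version of the setup (the function is mine, not from the benchmark):

```python
# The CRUXEval setup in miniature (hypothetical function, not from the
# benchmark). CRUXEval-O: given f and an input, predict the output.
# CRUXEval-I: given f and an output, find an input that produces it.

def f(s: str) -> str:
    return "".join(c for c in s if c.isalpha())[::-2]

# CRUXEval-O: what is f("ab1cd2e")?
# Letters only -> "abcde"; [::-2] walks backward in steps of 2 -> "eca".
assert f("ab1cd2e") == "eca"

# CRUXEval-I: find an input such that f(x) == "ca".
assert f("abc") == "ca"

print("both checks pass")
```

Answering either direction requires actually simulating the code in your head - which is exactly the capability that separates models that reason about code from models that pattern-match it.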


The Leaderboard

Scores are from official model releases, paper results, and published leaderboards as of April 2026. HumanEval+ and MBPP+ are from the EvalPlus leaderboard. LiveCodeBench scores use the v5 window where available, with fallback to v4. BigCodeBench uses the Complete track (pass@1 with greedy decoding). A dash (-) means no published score; an asterisk (*) denotes my estimate from related benchmark interpolation.

| Rank | Model | HumanEval+ | MBPP+ | LiveCodeBench | BigCodeBench | DS-1000 | Notes |
|---|---|---|---|---|---|---|---|
| 1 | GPT-5 | 96.9 | 91.2 | 68.4 | 79.3 | 78.1 | Strongest LCB Hard of any non-reasoning model tested |
| 2 | Claude Opus 4 | 95.7 | 90.6 | 66.8 | 78.7 | 77.4 | Highest BigCodeBench among Anthropic models |
| 3 | Gemini 2.5 Pro | 94.2 | 89.1 | 61.3 | 75.4 | 74.8 | Long-context advantage on multi-file generation |
| 4 | Claude Sonnet 4 | 93.8 | 88.4 | 59.7 | 73.9 | 73.2 | Strong cost-to-performance ratio |
| 5 | DeepSeek-V3 | 92.1 | 87.3 | 57.4 | 71.8 | 70.9 | Best open-weight LCB; trained post-HumanEval leak |
| 6 | Qwen3-Coder | 91.6 | 86.8 | 55.9 | 70.4 | 68.7 | Latest Qwen; competitive on multilingual via MultiPL-E |
| 7 | o4 | 97.1 | 92.4 | 71.2 | 80.6 | 79.3 | Reasoning model; highest LCB Hard; slower, expensive |
| 8 | o3-mini | 95.8 | 91.1 | 63.7 | 76.1 | 72.4 | Best cost-adjusted reasoning model for code |
| 9 | DeepSeek-Coder-V3 | 90.8 | 86.1 | 53.6 | 69.8 | 67.3 | Code-specific fine-tune of DeepSeek-V3; MultiPL-E strong |
| 10 | Qwen 2.5-Coder 32B | 90.2 | 85.7 | 52.1 | 68.4 | 65.1 | Best sub-40B open model; strong C++/Java MultiPL-E |
| 11 | Codestral (Mistral) | 88.4 | 83.9 | 47.3 | 64.7 | 61.8 | Strong on code-completion tasks; weaker on reasoning-heavy problems |
| 12 | Llama 3.3 Coder 70B | 86.7 | 82.1 | 44.8 | 61.9 | 58.4 | Best Meta code model; good for self-hosted 70B tier |
| 13 | StarCoder2-33B | 84.1 | 80.3 | 39.2 | 58.3 | 54.7 | BigCode project; strong on low-resource languages |
| 14 | Granite-Code-34B | 82.6 | 79.1 | 36.4 | 56.1 | 52.3 | IBM; strong on enterprise Java/Python; safety-tuned |
| 15 | Llama 3.3 Coder 8B | 79.3 | 76.2 | 31.7 | 51.8 | 46.9 | Good efficiency at 8B; large quality drop from 70B |
| 16 | Yi-Coder-9B | 77.8 | 74.6 | 28.9* | 49.2 | 44.1 | 01.AI; HumanEval strong; LCB estimated |
| 17 | Magicoder-S-DS-6.7B | 76.9 | 73.8 | 24.1* | 45.6* | 40.3 | OSS-Instruct training; punches above 7B weight class |
| 18 | WizardCoder-33B | 73.2 | 71.4 | 21.6* | 42.3* | 37.8 | Older model; useful as historical baseline |

Table last updated April 19, 2026. HumanEval+ and MBPP+ use pass@1 greedy. LiveCodeBench uses pass@1 on the v5 window (Jan 2025 - Apr 2026), with fallback to v4 where v5 is unavailable. BigCodeBench Complete track. Asterisks (*) indicate estimates from related benchmark interpolation.


Reading the Table

The HumanEval Problem in Practice

Look at the top of the HumanEval+ column: every frontier model is above 90%. That means HumanEval is essentially solved at the frontier. This is not because frontier models are all equally capable code generators - it is because the variance is below the noise floor of the benchmark once contamination is accounted for. A 96.9 versus a 93.8 on HumanEval+ at the frontier is not a meaningful gap. A 68.4 versus a 59.7 on LiveCodeBench is.

The contamination issue is why o4's HumanEval+ score (97.1, the highest in the table) has nothing to do with how I rank it. The number that matters is its 71.2 on LiveCodeBench - also the highest in the table - which is why it is the model I would reach for if I needed the best possible code generation on novel problems. It sits at rank 7 only because its speed and cost make it a poor default, a point the next section expands on.

Reasoning Models Are a Different Category

o4 and o3-mini are not code models in the traditional sense. They are reasoning models that think before they write code, explicitly working through the logic of what they need to implement. On straightforward problems - complete a function from a docstring - they offer little advantage and are slower and more expensive. On competition-level algorithmic problems, they are in a class by themselves. LCB Hard scores for o4 are approximately 20-25 points higher than any non-reasoning model. If your use case involves complex algorithmic work - interview-level or competition-level problems, numerical algorithms, complex data structure implementations - reasoning models are worth the cost.

For routine code completion tasks - the cursor-hovering-in-your-IDE use case - they are overkill. DeepSeek-V3 or Claude Sonnet 4 give you more completions per dollar.

The Open-Weight Story is Better Than It Looks

DeepSeek-V3 at 57.4 on LiveCodeBench is genuinely impressive for an open-weight model. For context, it scored within 11 points of the best closed frontier model (GPT-5 at 68.4) on the benchmark that matters most. The gap between open and closed models has compressed dramatically over the past 18 months. Two years ago, HumanEval was the only benchmark where anyone published open-weight numbers because it was the only one where those numbers were not embarrassing.

DeepSeek-Coder-V3 and Qwen3-Coder are both strong, purpose-built code models that genuinely outperform their base model equivalents on code-specific benchmarks. The code fine-tuning signal matters: on BigCodeBench - which requires actual library knowledge - DeepSeek-Coder-V3 and Qwen 2.5-Coder 32B pull ahead of generic models of similar scale.

Contamination in the Middle Tier

WizardCoder and older Magicoder variants are particularly suspect on HumanEval. WizardCoder's training data explicitly included HumanEval-style problems as part of its Evol-Instruct pipeline. The HumanEval numbers for these models are almost certainly inflated relative to their true generalization ability. Their LiveCodeBench and BigCodeBench scores are the numbers to trust - and they land roughly where you would expect a 7-33B model to be.

BigCodeBench as the Better Everyday Signal

If you are evaluating models for a software engineering team doing web development, data engineering, or ML infrastructure work - not algorithm competitions - BigCodeBench is more predictive than LiveCodeBench. Library usage, reading package documentation, generating correct API calls: this is what software engineers actually do. The rank ordering on BigCodeBench is similar to LiveCodeBench but not identical. Gemini 2.5 Pro drops slightly relative to its LiveCodeBench position; Qwen 2.5-Coder 32B drops slightly relative to its raw parameter count peers. Both make sense: Gemini's long-context advantage is less decisive when problems are self-contained, and Qwen's code fine-tuning advantage shows up more in syntactic completion than library reasoning.


Methodology

Benchmark Score Sources

HumanEval+ and MBPP+ scores come from the EvalPlus leaderboard, which accepts community submissions and verifies them against the test harness. LiveCodeBench scores come from the official LiveCodeBench repository and the associated leaderboard. BigCodeBench scores come from the BigCodeBench leaderboard.

For models not yet submitted to official leaderboards - typically the most recent closed-source releases and some smaller open-weight models - I cross-reference lab-published technical reports, independent third-party evaluations, and interpolate from closely related benchmarks. These are marked with asterisks. I report the most conservative available number when estimates conflict.

Why Pass@1 Greedy

Most papers and leaderboards now standardize on pass@1 with greedy decoding for apples-to-apples comparison. Temperature-sampled pass@k numbers (where k > 1) are higher and can be inflated by model verbosity rather than reasoning quality. I do not include best-of-N numbers in the main table. If a vendor only publishes pass@10 or pass@100 results, I note that and do not include their number in the ranking.
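For reference, the unbiased pass@k estimator from the original HumanEval paper, which applies when you sample n completions per task and count c passes; pass@1 with greedy decoding is the degenerate single-sample case:

```python
# Unbiased pass@k estimator from Chen et al. (2021): given n samples of
# which c passed, estimate the probability that at least one of k
# randomly chosen samples passes: 1 - C(n-c, k) / C(n, k).

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:        # fewer failures than k: a passing sample is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 200 samples, 20 correct:
print(round(pass_at_k(200, 20, 1), 3))   # 0.1  (just c/n)
print(round(pass_at_k(200, 20, 10), 3))  # much higher than pass@1
```

This is why vendor pass@10 numbers look so flattering: with a 10% per-sample success rate, pass@10 is already around two-thirds, without the model getting any smarter.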

The Training Cutoff Problem

Contamination is a spectrum, not a binary. A model whose training data cutoff is September 2021 is unlikely to have memorized HumanEval - released July 2021, it took months to propagate into scraped training corpora. A model with a January 2025 cutoff almost certainly has. In between, it depends on what data the training team excluded.

I do not attempt to correct scores for contamination - any such correction requires assumptions about training data that are not publicly verifiable. Instead, I use LiveCodeBench as the primary ranking signal precisely because its contamination risk is low by construction. If you see a model scoring 92%+ on HumanEval+ but below 40% on LiveCodeBench, that gap tells you something about its training data, not its coding ability.

Multilingual Coverage (MultiPL-E)

Full MultiPL-E tables would make this page unwieldy. The directional finding is consistent: models drop roughly 10-20 points from Python to Java/C++/JavaScript, and another 10-20 points moving to Go, Rust, or less common languages. The relative ordering between models is mostly preserved across languages, with one notable exception: models trained with large amounts of system-level C/C++ code (StarCoder2, Granite-Code) hold up better on C++ than their Python scores predict, while models trained primarily on web/scripting data degrade more steeply.


Model Notes

GPT-5 and o4

OpenAI's GPT-5 is the best non-reasoning frontier model on LiveCodeBench. o4 is strictly better on hard problems but 3-5x slower per response, with proportional cost. For production code completion with low latency requirements, GPT-5 is the right choice. For batch jobs on complex algorithmic tasks - where you have seconds to minutes rather than milliseconds - o4 is worth the cost.

Claude Opus 4 and Sonnet 4

Claude Opus 4 leads the Anthropic lineup on BigCodeBench and DS-1000, reflecting strong performance on realistic library-usage tasks. Sonnet 4 is close behind on LiveCodeBench and significantly cheaper. For IDE-integrated completion at scale, Sonnet 4 is the practical choice. Opus 4 makes sense for batch code generation where quality is the only constraint.

DeepSeek-V3 and DeepSeek-Coder-V3

DeepSeek-V3 is the most capable open-weight model on LiveCodeBench. DeepSeek-Coder-V3 is its code-fine-tuned variant, which gains on BigCodeBench and MultiPL-E but does not improve LiveCodeBench scores meaningfully - suggesting the base model's reasoning ability drives competitive programming performance more than domain-specific fine-tuning does. Both are MIT-licensed and deployable without API dependencies. For teams that need frontier-adjacent code generation without closed-model API costs, DeepSeek-V3 is the current answer.

Qwen 2.5-Coder 32B and Qwen3-Coder

Qwen 2.5-Coder 32B from Alibaba remains the strongest sub-40B code model in independent evaluations. Its MultiPL-E scores for Java and C++ are particularly strong relative to its Python results. Qwen3-Coder is newer and improves across the board, but not to the point of competing with the top tier on LiveCodeBench. The 32B size makes it the most deployable high-quality option for teams running inference on four to eight consumer GPUs.

Codestral

Mistral's Codestral is specialized for code and trained on a substantially larger code corpus than Mistral's general models. Fill-in-the-middle (FIM) performance - completing code from both prefix and suffix context, which is what IDE tab completion actually does - is a Codestral strength not captured in the pass@1 metrics here. For IDE autocomplete use cases specifically, Codestral is worth evaluating beyond what the leaderboard table shows.
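To make the FIM distinction concrete, here is how a fill-in-the-middle prompt is assembled. The sentinel strings below are placeholders of my own - every model family defines its own control tokens, and Codestral's are handled by its API rather than written by hand:

```python
# Sketch of fill-in-the-middle (FIM) prompting, the completion mode IDE
# tab completion uses. Sentinel token names vary by model family; the
# ones below are illustrative placeholders, not any model's real tokens.

FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<PRE>", "<SUF>", "<MID>"

def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a prefix/suffix/middle prompt: the model generates the
    text that belongs between prefix and suffix after the <MID> marker."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"

# Cursor sits between the function header and the return statement:
prompt = build_fim_prompt(
    prefix="def area(radius):\n    ",
    suffix="\n    return result",
)
print(prompt)
```

The key difference from left-to-right pass@1 benchmarks: the model must condition on code that comes *after* the cursor, which is exactly what the table's metrics do not measure.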

StarCoder2 and Granite-Code

StarCoder2 from the BigCode project covers over 600 programming languages and is the strongest model on low-resource language coverage. The 33B variant is competitive with significantly larger general-purpose models on MultiPL-E's less common language tracks. Granite-Code-34B from IBM is enterprise-focused, with a safety training layer and strong performance on Java - the language most IBM enterprise codebases are written in. Both are Apache 2.0 licensed.


Practical Guidance

For IDE code completion at scale: Claude Sonnet 4 or DeepSeek-V3 via API. Both offer sub-200ms p50 latency at scale and are in the top tier on LiveCodeBench. Codestral is worth benchmarking specifically for fill-in-the-middle if you are building an IDE integration.

For the best quality without cost constraints: o4 for algorithmic and competitive programming tasks. GPT-5 or Claude Opus 4 for realistic library-usage code generation. The choice between GPT-5 and Claude Opus 4 is close - run your own bakeoff on a sample of your actual tasks.

For self-hosted open-weight deployments: DeepSeek-V3 if you have the multi-GPU budget for a large MoE deployment, Qwen 2.5-Coder 32B for teams constrained to four consumer GPUs. Accept a 10-15 point LiveCodeBench gap versus frontier closed models.

For multilingual codebases: StarCoder2-33B for broad language coverage. Qwen3-Coder for better baseline quality with slightly narrower language support. Both outperform general-purpose models of similar size on non-Python languages.

For data science and ML workloads: DS-1000 scores are the relevant signal. GPT-5 and Claude Opus 4 lead. DeepSeek-V3 is close behind at open-weight pricing.

For evaluating your own models: Do not report HumanEval-only results and expect to be taken seriously. Require LiveCodeBench or BigCodeBench. Run EvalPlus instead of vanilla HumanEval and MBPP. If competitive programming performance matters, include LCB Hard scores.

For related rankings, see the SWE-Bench Coding Agent Leaderboard for full-repo agent tasks and the LLM Code Review Leaderboard for PR review quality.


About the author: AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.