GLM-5.1 Review: Open-Source Model Tops SWE-Bench Pro
Z.ai's GLM-5.1 is a 754B open-weight model that claims the top spot on SWE-Bench Pro without a single NVIDIA chip - here's how it holds up in practice.

On April 7, 2026, Z.ai (formerly Zhipu AI) released the weights of GLM-5.1 under the MIT license and posted a benchmark table that made a lot of AI engineers do a double take. A 754-billion-parameter open-weight model, trained entirely on Huawei Ascend chips with no US silicon, had taken the top score on SWE-Bench Pro - beating GPT-5.4 and Claude Opus 4.6. The margin over GPT-5.4 is a modest 0.7 points, but the symbolic weight is not.
TL;DR
- 8.1/10 - the best open-weight model for long-horizon agentic coding, with caveats
- Tops SWE-Bench Pro (58.4) ahead of GPT-5.4 (57.7) and Claude Opus 4.6 (57.3); also leads CyberGym with 68.7
- Text-only (no vision), slower generation speed, and key benchmark numbers come from Z.ai's own testing
- Use it if you're building autonomous coding agents or long-running engineering pipelines; skip it if you need real-time IDE completions or multimodal workflows
Where GLM-5.1 Comes From
Z.ai is a Beijing-based lab spun out of Tsinghua University's KEG research group. The company appeared on the US Entity List in January 2025, cutting off legal access to NVIDIA's H100, H200, and B200 GPUs. Most frontier AI labs outside China treat those chips as oxygen. Z.ai had to build without them.
The entire GLM-5 family - including 5.1 - was trained on around 100,000 Huawei Ascend 910B chips running MindSpore, Huawei's homegrown deep learning framework. The decision wasn't ideological. It was the only available path under US export controls. That constraint makes the benchmark results much more interesting than they'd be otherwise.
On January 8, 2026, Zhipu completed a Hong Kong IPO - the first publicly traded foundation model company in the world - raising roughly HKD 4.35 billion (around $558 million USD) at a valuation near $31.3 billion. The GLM-5.1 release came three months later, and the timing isn't coincidental. The company needs to justify that valuation with technical output.
The GLM-5 base model launched in February 2026. GLM-5.1 is a post-training refinement focused specifically on agentic coding tasks and long-horizon execution. The architecture is unchanged; the improvements come from training methodology.
Architecture and Specifications
GLM-5.1 uses a Mixture-of-Experts design the company calls GLM_MOE_DSA (Dynamic Sparse Attention), with 754 billion total parameters and 40 billion active per forward pass. That active-parameter count places it in the same inference-compute neighborhood as models like DeepSeek V3.2 and Kimi K2.5, which both use similar MoE approaches.
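The practical upshot of the MoE design is that inference cost tracks the active-parameter count, not the total. A quick back-of-envelope calculation from the figures above:

```python
# Back-of-envelope: what fraction of GLM-5.1's weights fire per token?
# Numbers are taken from the spec table above; the calculation itself is
# just arithmetic, not anything model-specific.
total_params = 754e9   # total parameters (MoE)
active_params = 40e9   # active parameters per forward pass

active_fraction = active_params / total_params
print(f"Active per token: {active_fraction:.1%}")  # ≈ 5.3% of total weights
```

Roughly one parameter in twenty participates in any given forward pass, which is why a 754B model can sit in the same inference-compute bracket as much smaller dense models.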
Key Specs
| Spec | Value |
|---|---|
| Parameters (total) | 754B (MoE) |
| Active parameters | 40B per token |
| Context window | 200,000 tokens |
| Max output tokens | 128,000 |
| Modalities | Text only |
| License | MIT |
| Training hardware | Huawei Ascend 910B |
| Release date | April 7, 2026 |
The 200K context window with 128K maximum output is a generous combination - most frontier models cap output at 32K or 64K. For long-horizon agentic tasks that generate large code artifacts or iterate across many files, that ceiling matters.
Supported inference frameworks include SGLang (v0.5.10+), vLLM (v0.19.0+), Transformers, KTransformers, and xLLM. Ollama also has the model in its library. The API supports function calling, streaming, structured outputs, context caching, and a thinking mode for extended reasoning.
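As a rough sketch of what a request against an OpenAI-compatible endpoint might look like, here is a payload exercising the features listed above. The model id, the `thinking` field, and its shape are assumptions for illustration, not confirmed API details:

```python
import json

# Hypothetical chat-completion payload for an OpenAI-compatible GLM-5.1
# endpoint. Field names beyond the standard ones (notably "thinking")
# are assumptions based on the feature list, not documented API.
payload = {
    "model": "glm-5.1",
    "messages": [
        {"role": "system", "content": "You are a software engineering agent."},
        {"role": "user", "content": "Fix the failing test in utils/parse.py."},
    ],
    "stream": True,                      # token streaming
    "max_tokens": 128_000,               # GLM-5.1's stated output ceiling
    "thinking": {"type": "enabled"},     # hypothetical thinking-mode flag
}
print(json.dumps(payload, indent=2))
```

The unusually high `max_tokens` ceiling is the part worth noticing: it is what lets a single response carry a multi-file patch rather than forcing the agent to chunk its output.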
GLM-5.1's training ran on Huawei Ascend 910B chips - no NVIDIA hardware anywhere in the stack.
Source: pexels.com
Benchmarks
The headline number is SWE-Bench Pro, but GLM-5.1's benchmark profile is more varied than the press releases suggest.
| Benchmark | GLM-5.1 | GPT-5.4 | Claude Opus 4.6 | Gemini 3.1 Pro |
|---|---|---|---|---|
| SWE-Bench Pro | 58.4 | 57.7 | 57.3 | 54.2 |
| Terminal-Bench 2.0 | 63.5 | - | 68.5 | - |
| CyberGym | 68.7 | - | 66.6 | - |
| GPQA-Diamond | 86.2 | - | - | 94.3 |
| AIME 2026 | 95.3 | 98.7 | 98.2 | - |
On coding-specific work, the picture is positive. SWE-Bench Pro tests real software engineering tasks - applying patches, fixing bugs, implementing features from natural language descriptions - and GLM-5.1's 58.4 is currently the highest public score on that leaderboard. It also leads CyberGym at 68.7, a security task completion benchmark across 1,507 tasks.
The gaps appear on general reasoning. AIME 2026 (advanced math competition problems) has GPT-5.4 at 98.7% and Claude Opus 4.6 at 98.2%, while GLM-5.1 comes in at 95.3. GPQA-Diamond (graduate-level science reasoning) shows an even wider gap against Gemini 3.1 Pro's 94.3%. If your workload is heavy on mathematical derivations or scientific research, this isn't the model to reach for.
One thing worth noting: the SWE-Bench Pro results come from Z.ai's internal testing. Arena.ai independently placed GLM-5.1 third on their Code Arena agentic webdev leaderboard with an Elo rating of 1530, which provides at least partial external validation for the coding performance. But a fully independent evaluation on SWE-Bench Pro from a third-party lab hasn't been published as of this writing. Treat the exact margin over Claude and GPT-5.4 as preliminary.
Agentic Capabilities
The most compelling part of GLM-5.1 isn't any individual benchmark score. It's the claim that the model can run autonomously for up to eight hours without human checkpoints, completing a full plan-execute-analyze-optimize loop across hundreds of iterations.
Z.ai demonstrated this with two tasks. First: building a complete Linux desktop environment from scratch - file browser, terminal, text editor, system monitor, and playable games - through 655 autonomous iterations in a single session. Second: optimizing a vector database over 178 rounds of autonomous tuning, improving throughput to 6.9 times the original baseline. In a separate CUDA kernel optimization task, the model improved speedup from 2.6x to 35.7x through sustained iterative refinement.
These are company-produced demos, so the usual skepticism applies. But the underlying technical approach - maintaining goal alignment across hundreds of tool calls without strategy drift - addresses a real failure mode of current agentic systems, where models lose context or start tuning for the wrong objective after a few dozen steps.
The practical implication: GLM-5.1 makes more sense as the backbone of an autonomous coding agent than as an interactive assistant. Plug it into a CI/CD pipeline, give it a ticket, and come back later. For real-time IDE completions where latency matters, Cursor or Claude Code with a faster model will feel more responsive.
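The plan-execute-analyze loop described above can be sketched in a few lines. This is a minimal toy, not Z.ai's implementation: `call_model` and `run_tool` are stand-ins for the real API client and tool executor, and the state is reduced to a failing-test count:

```python
# Minimal sketch of a long-horizon agent loop: plan -> execute -> analyze,
# repeated until the goal is met or an iteration budget runs out.
# call_model and run_tool are hypothetical stand-ins for the real
# model client and tool executor.

def call_model(state: dict) -> dict:
    """Stand-in for a GLM-5.1 call that proposes the next action."""
    if state["tests_failing"] > 0:
        return {"action": "patch"}
    return {"action": "done"}

def run_tool(action: dict, state: dict) -> dict:
    """Stand-in for executing the action (apply edit, re-run tests)."""
    if action["action"] == "patch":
        state["tests_failing"] -= 1  # pretend each patch fixes one test
    return state

def agent_loop(state: dict, max_iters: int = 500) -> dict:
    for _ in range(max_iters):
        action = call_model(state)       # plan
        if action["action"] == "done":   # explicit goal check each iteration
            break
        state = run_tool(action, state)  # execute, then loop back to analyze
    return state

print(agent_loop({"tests_failing": 3}))  # {'tests_failing': 0}
```

The hard part that the sketch hides - and that GLM-5.1's training reportedly targets - is keeping `call_model` pointed at the original objective across hundreds of iterations instead of a few.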
GLM-5.1 is built for long-running agentic loops, not fast interactive completions.
Source: pexels.com
Pricing and Access
The model weights are on Hugging Face at zai-org/GLM-5.1 under the MIT license, which means unrestricted commercial use and self-hosting.
For API access, Z.ai prices GLM-5.1 at $0.95 per million input tokens and $3.15 per million output tokens, with cached inputs at $0.26 per million. Those prices sit comfortably below what Anthropic and OpenAI charge for their frontier models.
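To make those rates concrete, here is the arithmetic for a long agentic run at the listed prices. The token counts and the verbosity factor are illustrative assumptions, not measurements:

```python
# Cost of a hypothetical long agentic run at Z.ai's listed GLM-5.1 rates
# ($0.95/M input, $3.15/M output). Token volumes below are made up
# for illustration.
def job_cost(input_mtok: float, output_mtok: float,
             in_rate: float = 0.95, out_rate: float = 3.15) -> float:
    """USD cost for a job measured in millions of tokens."""
    return input_mtok * in_rate + output_mtok * out_rate

base = job_cost(20, 5)           # 20M input, 5M output tokens
print(f"Base run: ${base:.2f}")  # $34.75

# The verbosity caveat matters here: if GLM-5.1 emits, say, 30% more
# output tokens than a peer for the same work (hypothetical factor),
# the output side of the bill grows with it.
verbose = job_cost(20, 5 * 1.3)
print(f"With 30% more output: ${verbose:.2f}")  # $39.48
```

Even with a verbosity penalty, the per-token rates leave substantial headroom against frontier closed-model pricing, which is what makes overnight agent runs economically plausible.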
Z.ai also offers a subscription-tier Coding Plan aimed at developers who want integrated tooling rather than raw API access:
| Tier | Price | Notes |
|---|---|---|
| Lite | ~$10/month | Basic access to GLM-5.1 and GLM-5-Turbo |
| Pro | ~$30/month | Full model suite, priority throughput |
| Max | ~$80/month | Near-unlimited usage of premium models |
Free-tier users get access to GLM-4.7-Flash and GLM-4.5-Flash at no cost, along with a small allocation of premium model credits.
Local deployment works with standard frameworks. On an NVIDIA DGX Spark or an equivalent multi-GPU setup, you can run the FP8-quantized variant (zai-org/GLM-5.1-FP8) with reduced memory requirements. The full BF16 weights need substantially more VRAM - this isn't a laptop model.
Strengths
- SWE-Bench Pro leader: 58.4 is the highest public score on the benchmark that best predicts real-world software engineering performance
- Genuinely long autonomous execution: eight-hour uninterrupted agentic sessions with maintained goal alignment aren't something most models can do
- MIT license with commercial use: no restrictions, self-hostable, no vendor lock-in
- Competitive API pricing: at $0.95/$3.15 per million tokens, it undercuts frontier closed-source models
- Deep engineering toolkit: function calling, context caching, MCP integration, structured outputs, thinking mode - all there
- CyberGym leader: 68.7 on security task completion ahead of Claude Opus 4.6's 66.6
Weaknesses
- Text-only: no image, audio, or video input while GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro are all multimodal
- Slower generation speed: 40-44 tokens/second against a competitive median above 53 t/s, which is noticeable in interactive use
- Verbose output: the model produces significantly more tokens than peers to reach equivalent answers, which inflates costs and latency
- Math and science gaps: AIME 2026 (95.3) and GPQA-Diamond (86.2) trail the frontier, making it a weaker choice for research or quantitative tasks
- Self-reported benchmarks: the SWE-Bench Pro score hasn't been independently verified by a third-party lab
- Young ecosystem: fewer community integrations and third-party tooling than Claude or OpenAI-compatible APIs
Verdict
GLM-5.1 is the most capable open-weight model available for agentic software engineering work. The SWE-Bench Pro result is real - partially backed up by Arena.ai's independent Code Arena rankings - and the sustained eight-hour autonomous execution is a genuine differentiator for teams building long-running coding agents. The MIT license and competitive API pricing remove the usual friction of adopting a less-established model.
The weaknesses are real too. No multimodal input, slower token generation, and benchmark claims that haven't all been independently verified mean this isn't a drop-in replacement for Claude or GPT-5.4 across general workloads. For coding agents specifically - especially those that need to run overnight on complex engineering tasks - GLM-5.1 earns serious consideration.
The detail that makes the result interesting beyond the benchmark number: it was built entirely on Huawei silicon, by a lab that's been cut off from US chips for over a year. Whether that changes anything about how US policymakers think about export controls is a separate question. The technical fact is that the gap between Western and Chinese frontier models on coding tasks is now measured in decimal points.
Score: 8.1 / 10
Sources
- GLM-5.1 Official Documentation - Z.ai
- zai-org/GLM-5.1 on Hugging Face
- Z.AI Introduces GLM-5.1 - MarkTechPost
- GLM-5.1 - Artificial Analysis
- GLM-5.1 API Pricing - OpenRouter
- GLM-5.1 Tops SWE-Bench Pro - Dataconomy
- GLM-5.1 Open Source #1 on SWE-Bench Pro - ModemGuides
- GLM-5.1 Full Review - BuildFastWithAI
- GLM-5.1 Benchmarks Breakdown - Lushbinary
- Z.ai Coding Plan Pricing
