GLM-5.1
Z.ai's GLM-5.1 is an open-weight 754B MoE model that tops SWE-Bench Pro with 58.4, sustains 8-hour autonomous coding sessions, and runs under MIT license at $0.95/M input tokens.

Z.ai's GLM-5.1 is a post-training refinement of GLM-5 tuned specifically for agentic software engineering. Released April 7, 2026 under a MIT license, it currently holds the top position on SWE-Bench Pro at 58.4 - ahead of GPT-5.4 (57.7) and Claude Opus 4.6 (57.3). That's a modest margin, but the architectural foundation makes the result significant: the model was trained completely on Huawei Ascend 910B chips, with no US silicon anywhere in the stack.
TL;DR
- Current #1 on SWE-Bench Pro (58.4), beating GPT-5.4 and Claude Opus 4.6 - and open-weight under MIT
- 754B total / 40B active MoE, 200K context, 128K max output, text-only input
- $0.95/M input, $3.15/M output at Z.ai - 5-15x cheaper than comparable closed models
The upgrade from GLM-5 to GLM-5.1 isn't an architecture change. Z.ai kept the same GLM_MOE_DSA (Dynamic Sparse Attention) backbone - 754 billion total parameters, 256 experts, 8 active per forward pass - and applied focused post-training on agentic coding tasks, long-horizon execution, and tool use. The result is a 28% improvement on coding benchmarks over GLM-5, plus a new CyberGym-leading score of 68.7 on security task completion.
For teams building autonomous coding agents, the defining capability is sustained execution. Z.ai has demonstrated 655-iteration agentic sessions - roughly eight hours of uninterrupted autonomous work - covering full Linux desktop environment construction from scratch. For a comparison point: most frontier models lose track of objectives or start optimizing for the wrong metric well before that mark.
Key Specifications
| Specification | Details |
|---|---|
| Provider | Zhipu AI (Z.ai) |
| Model Family | GLM |
| Total Parameters | 754B (256 experts, 8 active per token) |
| Active Parameters | ~40B per forward pass |
| Context Window | 200K tokens |
| Max Output Tokens | 128K |
| Input Modality | Text only |
| Input Price (Z.ai API) | $0.95/M tokens |
| Output Price (Z.ai API) | $3.15/M tokens |
| Release Date | April 7, 2026 |
| License | MIT |
| Training Hardware | Huawei Ascend 910B (100K chips) |
| Weights | HuggingFace (BF16 + FP8) |
Benchmark Performance
GLM-5.1's strength is concentrated in coding and agentic tasks. Across general reasoning benchmarks, it falls behind the frontier.
| Benchmark | GLM-5.1 | GPT-5.4 | Claude Opus 4.6 | Gemini 3.1 Pro |
|---|---|---|---|---|
| SWE-Bench Pro | 58.4 | 57.7 | 57.3 | 54.2 |
| Terminal-Bench 2.0 | 63.5 | - | 68.5 | - |
| CyberGym | 68.7 | - | 66.6 | - |
| NL2Repo | 42.7 | - | - | - |
| GPQA-Diamond | 86.2 | - | - | 94.3 |
| AIME 2026 | 95.3 | 98.7 | 98.2 | - |
| AA Intelligence Index | 51 | - | - | - |
The SWE-Bench Pro result is the headline, but it carries a caveat: the score comes from Z.ai's internal testing. Partial external validation exists through Arena.ai's Code Arena, where GLM-5.1 holds an Elo of 1530 and ranks third on the agentic webdev leaderboard. A fully independent third-party SWE-Bench Pro run has not been published as of this writing - treat the exact margin over Claude and GPT-5.4 as preliminary. See the coding benchmarks leaderboard for full context.
On reasoning, the gap is real. AIME 2026 (95.3 versus GPT-5.4's 98.7) and GPQA-Diamond (86.2 versus Gemini 3.1 Pro's 94.3) show that GLM-5.1 isn't a general-purpose frontier replacement. If your workload involves mathematical proofs, scientific research, or graduate-level reasoning, you should look elsewhere. If it involves writing and debugging large codebases autonomously, this is currently the strongest open-weight option available.
GLM-5.1 runs on infrastructure trained entirely on Huawei Ascend 910B chips - the same hardware as its predecessor GLM-5.
Source: pexels.com
Key Capabilities
The most differentiated capability in GLM-5.1 is sustained agentic execution. Z.ai's internal benchmarks show the model completing a full plan-execute-analyze-optimize loop across hundreds of tool calls without strategy drift - a failure mode that affects most current frontier models around the 30-50 iteration mark. In a CUDA kernel optimization task, GLM-5.1 improved speedup from 2.6x to 35.7x through 178 rounds of autonomous iteration.
The API surface covers what production agent pipelines need: function calling with parallel tool use, streaming, structured outputs with JSON schema validation, context caching, and a thinking mode for extended reasoning chains. Supported inference frameworks include SGLang (v0.5.10+), vLLM (v0.19.0+), KTransformers, Transformers, and xLLM. Ollama ships the model in its library for local deployment.
One notable omission is multimodal input. GLM-5.1 accepts text only - no images, PDFs, or audio. Competitors including Claude Opus 4.6 and GPT-5.4 handle vision natively. If your agent pipeline involves reading screenshots, diagrams, or documents, that's a hard constraint.
GLM-5.1 targets long-horizon autonomous coding workflows rather than interactive IDE completions.
Source: pexels.com
Pricing and Availability
Z.ai's direct API is the most affordable access point. OpenRouter and Hugging Face Inference add overhead, pushing input costs to $1.05-$1.40/M and output costs to $3.50-$4.40/M depending on provider.
| Provider | Input | Output |
|---|---|---|
| Z.ai (official) | $0.95/M | $3.15/M |
| OpenRouter | $1.05/M | $3.50/M |
| Hugging Face Inference | $1.40/M | $4.40/M |
| Claude Opus 4.6 (for reference) | $5.00/M | $25.00/M |
| GPT-5.4 (for reference) | $2.50/M | $15.00/M |
Cache reads on the Z.ai API come in at $0.26/M tokens, which substantially lowers effective cost on repeated context. Z.ai also offers a subscription Coding Plan with tiered monthly access to GLM-5.1 and related models.
Self-hosting requires significant hardware. The FP8-quantized variant (zai-org/GLM-5.1-FP8) runs on 8x H200 GPUs at minimum. The full BF16 weights need much more. This isn't a model for consumer hardware.
The model is listed at ollama.com/library/glm-5.1 and the weights are on HuggingFace at zai-org/GLM-5.1 under a MIT license - unrestricted commercial use, no attribution required beyond the license file.
Strengths and Weaknesses
Strengths
- #1 SWE-Bench Pro: 58.4 is the highest publicly reported score on the benchmark most predictive of real software engineering performance
- Sustained agentic execution: eight-hour uninterrupted sessions with maintained goal alignment across hundreds of tool calls
- MIT license: fully open weights, commercial use allowed, no vendor lock-in
- Competitive API pricing: 5x cheaper than Claude Opus 4.6 and 2.5x cheaper than GPT-5.4 at the input token level
- CyberGym leader: 68.7 on security task completion, ahead of Claude Opus 4.6's 66.6
- Strong inference tooling: SGLang, vLLM, KTransformers support with first-class function calling and caching
Weaknesses
- Text-only input: no image, PDF, or audio input while top closed models are all multimodal
- Self-reported benchmark lead: SWE-Bench Pro score from Z.ai's own testing; independent verification pending
- Slower generation: ~47 tokens/second on typical hardware vs. a competitive median above 53 t/s
- Verbose outputs: produces around 110M tokens during standard evaluations against a peer median of 15M - inflates costs in practice
- Reasoning gaps: AIME 2026 and GPQA-Diamond trail GPT-5.4 and Gemini 3.1 Pro by meaningful margins
- High self-hosting bar: 8x H200 minimum for FP8; not practical for most developers outside enterprise settings
Related Coverage
- GLM-5.1 Review: Open-Source Model Tops SWE-Bench Pro - our hands-on evaluation
- GLM-5 - the base model, released February 2026
- Open-Source LLM Leaderboard - where GLM-5.1 ranks among open models
- SWE-Bench Coding Agent Leaderboard - full SWE-Bench rankings
- Coding Benchmarks Leaderboard - broader coding benchmark comparison
- GLM-4.7-Flash - lighter predecessor in the GLM family
Sources
✓ Last verified April 22, 2026
