GLM-5.1

Z.ai's GLM-5.1 is an open-weight 754B MoE model that tops SWE-Bench Pro with 58.4, sustains 8-hour autonomous coding sessions, and runs under MIT license at $0.95/M input tokens.

GLM-5.1

Z.ai's GLM-5.1 is a post-training refinement of GLM-5 tuned specifically for agentic software engineering. Released April 7, 2026 under a MIT license, it currently holds the top position on SWE-Bench Pro at 58.4 - ahead of GPT-5.4 (57.7) and Claude Opus 4.6 (57.3). That's a modest margin, but the architectural foundation makes the result significant: the model was trained completely on Huawei Ascend 910B chips, with no US silicon anywhere in the stack.

TL;DR

  • Current #1 on SWE-Bench Pro (58.4), beating GPT-5.4 and Claude Opus 4.6 - and open-weight under MIT
  • 754B total / 40B active MoE, 200K context, 128K max output, text-only input
  • $0.95/M input, $3.15/M output at Z.ai - 5-15x cheaper than comparable closed models

The upgrade from GLM-5 to GLM-5.1 isn't an architecture change. Z.ai kept the same GLM_MOE_DSA (Dynamic Sparse Attention) backbone - 754 billion total parameters, 256 experts, 8 active per forward pass - and applied focused post-training on agentic coding tasks, long-horizon execution, and tool use. The result is a 28% improvement on coding benchmarks over GLM-5, plus a new CyberGym-leading score of 68.7 on security task completion.

For teams building autonomous coding agents, the defining capability is sustained execution. Z.ai has demonstrated 655-iteration agentic sessions - roughly eight hours of uninterrupted autonomous work - covering full Linux desktop environment construction from scratch. For a comparison point: most frontier models lose track of objectives or start optimizing for the wrong metric well before that mark.

Key Specifications

SpecificationDetails
ProviderZhipu AI (Z.ai)
Model FamilyGLM
Total Parameters754B (256 experts, 8 active per token)
Active Parameters~40B per forward pass
Context Window200K tokens
Max Output Tokens128K
Input ModalityText only
Input Price (Z.ai API)$0.95/M tokens
Output Price (Z.ai API)$3.15/M tokens
Release DateApril 7, 2026
LicenseMIT
Training HardwareHuawei Ascend 910B (100K chips)
WeightsHuggingFace (BF16 + FP8)

Benchmark Performance

GLM-5.1's strength is concentrated in coding and agentic tasks. Across general reasoning benchmarks, it falls behind the frontier.

BenchmarkGLM-5.1GPT-5.4Claude Opus 4.6Gemini 3.1 Pro
SWE-Bench Pro58.457.757.354.2
Terminal-Bench 2.063.5-68.5-
CyberGym68.7-66.6-
NL2Repo42.7---
GPQA-Diamond86.2--94.3
AIME 202695.398.798.2-
AA Intelligence Index51---

The SWE-Bench Pro result is the headline, but it carries a caveat: the score comes from Z.ai's internal testing. Partial external validation exists through Arena.ai's Code Arena, where GLM-5.1 holds an Elo of 1530 and ranks third on the agentic webdev leaderboard. A fully independent third-party SWE-Bench Pro run has not been published as of this writing - treat the exact margin over Claude and GPT-5.4 as preliminary. See the coding benchmarks leaderboard for full context.

On reasoning, the gap is real. AIME 2026 (95.3 versus GPT-5.4's 98.7) and GPQA-Diamond (86.2 versus Gemini 3.1 Pro's 94.3) show that GLM-5.1 isn't a general-purpose frontier replacement. If your workload involves mathematical proofs, scientific research, or graduate-level reasoning, you should look elsewhere. If it involves writing and debugging large codebases autonomously, this is currently the strongest open-weight option available.

Close-up macro photograph of electronic circuit board components GLM-5.1 runs on infrastructure trained entirely on Huawei Ascend 910B chips - the same hardware as its predecessor GLM-5. Source: pexels.com

Key Capabilities

The most differentiated capability in GLM-5.1 is sustained agentic execution. Z.ai's internal benchmarks show the model completing a full plan-execute-analyze-optimize loop across hundreds of tool calls without strategy drift - a failure mode that affects most current frontier models around the 30-50 iteration mark. In a CUDA kernel optimization task, GLM-5.1 improved speedup from 2.6x to 35.7x through 178 rounds of autonomous iteration.

The API surface covers what production agent pipelines need: function calling with parallel tool use, streaming, structured outputs with JSON schema validation, context caching, and a thinking mode for extended reasoning chains. Supported inference frameworks include SGLang (v0.5.10+), vLLM (v0.19.0+), KTransformers, Transformers, and xLLM. Ollama ships the model in its library for local deployment.

One notable omission is multimodal input. GLM-5.1 accepts text only - no images, PDFs, or audio. Competitors including Claude Opus 4.6 and GPT-5.4 handle vision natively. If your agent pipeline involves reading screenshots, diagrams, or documents, that's a hard constraint.

Code displayed on a monitor screen in a dark developer setup GLM-5.1 targets long-horizon autonomous coding workflows rather than interactive IDE completions. Source: pexels.com

Pricing and Availability

Z.ai's direct API is the most affordable access point. OpenRouter and Hugging Face Inference add overhead, pushing input costs to $1.05-$1.40/M and output costs to $3.50-$4.40/M depending on provider.

ProviderInputOutput
Z.ai (official)$0.95/M$3.15/M
OpenRouter$1.05/M$3.50/M
Hugging Face Inference$1.40/M$4.40/M
Claude Opus 4.6 (for reference)$5.00/M$25.00/M
GPT-5.4 (for reference)$2.50/M$15.00/M

Cache reads on the Z.ai API come in at $0.26/M tokens, which substantially lowers effective cost on repeated context. Z.ai also offers a subscription Coding Plan with tiered monthly access to GLM-5.1 and related models.

Self-hosting requires significant hardware. The FP8-quantized variant (zai-org/GLM-5.1-FP8) runs on 8x H200 GPUs at minimum. The full BF16 weights need much more. This isn't a model for consumer hardware.

The model is listed at ollama.com/library/glm-5.1 and the weights are on HuggingFace at zai-org/GLM-5.1 under a MIT license - unrestricted commercial use, no attribution required beyond the license file.

Strengths and Weaknesses

Strengths

  • #1 SWE-Bench Pro: 58.4 is the highest publicly reported score on the benchmark most predictive of real software engineering performance
  • Sustained agentic execution: eight-hour uninterrupted sessions with maintained goal alignment across hundreds of tool calls
  • MIT license: fully open weights, commercial use allowed, no vendor lock-in
  • Competitive API pricing: 5x cheaper than Claude Opus 4.6 and 2.5x cheaper than GPT-5.4 at the input token level
  • CyberGym leader: 68.7 on security task completion, ahead of Claude Opus 4.6's 66.6
  • Strong inference tooling: SGLang, vLLM, KTransformers support with first-class function calling and caching

Weaknesses

  • Text-only input: no image, PDF, or audio input while top closed models are all multimodal
  • Self-reported benchmark lead: SWE-Bench Pro score from Z.ai's own testing; independent verification pending
  • Slower generation: ~47 tokens/second on typical hardware vs. a competitive median above 53 t/s
  • Verbose outputs: produces around 110M tokens during standard evaluations against a peer median of 15M - inflates costs in practice
  • Reasoning gaps: AIME 2026 and GPQA-Diamond trail GPT-5.4 and Gemini 3.1 Pro by meaningful margins
  • High self-hosting bar: 8x H200 minimum for FP8; not practical for most developers outside enterprise settings

Sources

✓ Last verified April 22, 2026

GLM-5.1
About the author AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.