Z.ai's GLM-5.1 is a post-training refinement of GLM-5 tuned specifically for agentic software engineering. Released April 7, 2026 under a MIT license, it currently holds the top position on SWE-Bench Pro at 58.4 - ahead of GPT-5.4 (57.7) and Claude Opus 4.6 (57.3). That's a modest margin, but the architectural foundation makes the result significant: the model was trained completely on Huawei Ascend 910B chips, with no US silicon anywhere in the stack.

TL;DR

Current #1 on SWE-Bench Pro (58.4), beating GPT-5.4 and Claude Opus 4.6 - and open-weight under MIT
754B total / 40B active MoE, 200K context, 128K max output, text-only input
$0.95/M input, $3.15/M output at Z.ai - 5-15x cheaper than comparable closed models

The upgrade from GLM-5 to GLM-5.1 isn't an architecture change. Z.ai kept the same GLM_MOE_DSA (Dynamic Sparse Attention) backbone - 754 billion total parameters, 256 experts, 8 active per forward pass - and applied focused post-training on agentic coding tasks, long-horizon execution, and tool use. The result is a 28% improvement on coding benchmarks over GLM-5, plus a new CyberGym-leading score of 68.7 on security task completion.

For teams building autonomous coding agents, the defining capability is sustained execution. Z.ai has demonstrated 655-iteration agentic sessions - roughly eight hours of uninterrupted autonomous work - covering full Linux desktop environment construction from scratch. For a comparison point: most frontier models lose track of objectives or start optimizing for the wrong metric well before that mark.

Key Specifications

Specification	Details
Provider	Zhipu AI (Z.ai)
Model Family	GLM
Total Parameters	754B (256 experts, 8 active per token)
Active Parameters	~40B per forward pass
Context Window	200K tokens
Max Output Tokens	128K
Input Modality	Text only
Input Price (Z.ai API)	$0.95/M tokens
Output Price (Z.ai API)	$3.15/M tokens
Release Date	April 7, 2026
License	MIT
Training Hardware	Huawei Ascend 910B (100K chips)
Weights	HuggingFace (BF16 + FP8)

Benchmark Performance

GLM-5.1's strength is concentrated in coding and agentic tasks. Across general reasoning benchmarks, it falls behind the frontier.

Benchmark	GLM-5.1	GPT-5.4	Claude Opus 4.6	Gemini 3.1 Pro
SWE-Bench Pro	58.4	57.7	57.3	54.2
Terminal-Bench 2.0	63.5	-	68.5	-
CyberGym	68.7	-	66.6	-
NL2Repo	42.7	-	-	-
GPQA-Diamond	86.2	-	-	94.3
AIME 2026	95.3	98.7	98.2	-
AA Intelligence Index	51	-	-	-

The SWE-Bench Pro result is the headline, but it carries a caveat: the score comes from Z.ai's internal testing. Partial external validation exists through Arena.ai's Code Arena, where GLM-5.1 holds an Elo of 1530 and ranks third on the agentic webdev leaderboard. A fully independent third-party SWE-Bench Pro run has not been published as of this writing - treat the exact margin over Claude and GPT-5.4 as preliminary. See the coding benchmarks leaderboard for full context.

On reasoning, the gap is real. AIME 2026 (95.3 versus GPT-5.4's 98.7) and GPQA-Diamond (86.2 versus Gemini 3.1 Pro's 94.3) show that GLM-5.1 isn't a general-purpose frontier replacement. If your workload involves mathematical proofs, scientific research, or graduate-level reasoning, you should look elsewhere. If it involves writing and debugging large codebases autonomously, this is currently the strongest open-weight option available.

Close-up macro photograph of electronic circuit board components GLM-5.1 runs on infrastructure trained entirely on Huawei Ascend 910B chips - the same hardware as its predecessor GLM-5. Source: pexels.com

Key Capabilities

The most differentiated capability in GLM-5.1 is sustained agentic execution. Z.ai's internal benchmarks show the model completing a full plan-execute-analyze-optimize loop across hundreds of tool calls without strategy drift - a failure mode that affects most current frontier models around the 30-50 iteration mark. In a CUDA kernel optimization task, GLM-5.1 improved speedup from 2.6x to 35.7x through 178 rounds of autonomous iteration.

The API surface covers what production agent pipelines need: function calling with parallel tool use, streaming, structured outputs with JSON schema validation, context caching, and a thinking mode for extended reasoning chains. Supported inference frameworks include SGLang (v0.5.10+), vLLM (v0.19.0+), KTransformers, Transformers, and xLLM. Ollama ships the model in its library for local deployment.

One notable omission is multimodal input. GLM-5.1 accepts text only - no images, PDFs, or audio. Competitors including Claude Opus 4.6 and GPT-5.4 handle vision natively. If your agent pipeline involves reading screenshots, diagrams, or documents, that's a hard constraint.

Code displayed on a monitor screen in a dark developer setup GLM-5.1 targets long-horizon autonomous coding workflows rather than interactive IDE completions. Source: pexels.com

Pricing and Availability

Z.ai's direct API is the most affordable access point. OpenRouter and Hugging Face Inference add overhead, pushing input costs to $1.05-$1.40/M and output costs to $3.50-$4.40/M depending on provider.

Provider	Input	Output
Z.ai (official)	$0.95/M	$3.15/M
OpenRouter	$1.05/M	$3.50/M
Hugging Face Inference	$1.40/M	$4.40/M
Claude Opus 4.6 (for reference)	$5.00/M	$25.00/M
GPT-5.4 (for reference)	$2.50/M	$15.00/M

Cache reads on the Z.ai API come in at $0.26/M tokens, which substantially lowers effective cost on repeated context. Z.ai also offers a subscription Coding Plan with tiered monthly access to GLM-5.1 and related models.

Self-hosting requires significant hardware. The FP8-quantized variant (zai-org/GLM-5.1-FP8) runs on 8x H200 GPUs at minimum. The full BF16 weights need much more. This isn't a model for consumer hardware.

The model is listed at ollama.com/library/glm-5.1 and the weights are on HuggingFace at zai-org/GLM-5.1 under a MIT license - unrestricted commercial use, no attribution required beyond the license file.

Strengths and Weaknesses

Strengths

#1 SWE-Bench Pro: 58.4 is the highest publicly reported score on the benchmark most predictive of real software engineering performance
Sustained agentic execution: eight-hour uninterrupted sessions with maintained goal alignment across hundreds of tool calls
MIT license: fully open weights, commercial use allowed, no vendor lock-in
Competitive API pricing: 5x cheaper than Claude Opus 4.6 and 2.5x cheaper than GPT-5.4 at the input token level
CyberGym leader: 68.7 on security task completion, ahead of Claude Opus 4.6's 66.6
Strong inference tooling: SGLang, vLLM, KTransformers support with first-class function calling and caching

Weaknesses

Text-only input: no image, PDF, or audio input while top closed models are all multimodal
Self-reported benchmark lead: SWE-Bench Pro score from Z.ai's own testing; independent verification pending
Slower generation: ~47 tokens/second on typical hardware vs. a competitive median above 53 t/s
Verbose outputs: produces around 110M tokens during standard evaluations against a peer median of 15M - inflates costs in practice
Reasoning gaps: AIME 2026 and GPQA-Diamond trail GPT-5.4 and Gemini 3.1 Pro by meaningful margins
High self-hosting bar: 8x H200 minimum for FP8; not practical for most developers outside enterprise settings

GLM-5.1 Review: Open-Source Model Tops SWE-Bench Pro - our hands-on evaluation
GLM-5 - the base model, released February 2026
Open-Source LLM Leaderboard - where GLM-5.1 ranks among open models
SWE-Bench Coding Agent Leaderboard - full SWE-Bench rankings
Coding Benchmarks Leaderboard - broader coding benchmark comparison
GLM-4.7-Flash - lighter predecessor in the GLM family