GLM-5.1 Review: Open-Source Model Tops SWE-Bench Pro
Z.ai's GLM-5.1 is a 754B open-weight model that claims the top spot on SWE-Bench Pro without a single NVIDIA chip - here's how it holds up in practice.

On April 7, 2026, Z.ai (formerly Zhipu AI) released the weights of GLM-5.1 under the MIT license and posted a benchmark table that made a lot of AI engineers do a double take. A 754-billion-parameter open-weight model, trained entirely on Huawei Ascend chips with no US silicon, had taken the top score on SWE-Bench Pro - beating GPT-5.4 and Claude Opus 4.6. The margin over GPT-5.4 is a modest 0.7 points, but the symbolic weight is not.
TL;DR
- 8.1/10 - the best open-weight model for long-horizon agentic coding, with caveats
- Tops SWE-Bench Pro (58.4) ahead of GPT-5.4 (57.7) and Claude Opus 4.6 (57.3); also leads CyberGym with 68.7
- Text-only (no vision), slower generation speed, and key benchmark numbers come from Z.ai's own testing
- Use it if you're building autonomous coding agents or long-running engineering pipelines; skip it if you need real-time IDE completions or multimodal workflows
Where GLM-5.1 Comes From
Z.ai is a Beijing-based lab spun out of Tsinghua University's KEG research group. The company appeared on the US Entity List in January 2025, cutting off legal access to NVIDIA's H100, H200, and B200 GPUs. Most frontier AI labs outside China treat those chips as oxygen. Z.ai had to build without them.
The entire GLM-5 family - including 5.1 - was trained on around 100,000 Huawei Ascend 910B chips running MindSpore, Huawei's homegrown deep learning framework. The decision wasn't ideological. It was the only available path under US export controls. That constraint makes the benchmark results much more interesting than they'd be otherwise.
On January 8, 2026, Zhipu completed a Hong Kong IPO - the first publicly traded foundation model company in the world - raising roughly HKD 4.35 billion (around $558 million USD) at a valuation near $31.3 billion. The GLM-5.1 release came three months later, and the timing isn't coincidental. The company needs to justify that valuation with technical output.
The GLM-5 base model launched in February 2026. GLM-5.1 is a post-training refinement focused specifically on agentic coding tasks and long-horizon execution. The architecture is unchanged; the improvements come from training methodology.
Architecture and Specifications
GLM-5.1 uses a Mixture-of-Experts design the company calls GLM_MOE_DSA (Dynamic Sparse Attention), with 754 billion total parameters and 40 billion active per forward pass. That active-parameter count places it in the same inference-compute neighborhood as models like DeepSeek V3.2 and Kimi K2.5, which both use similar MoE approaches.
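The practical upshot of the MoE design is that inference cost tracks the active-parameter count, not the total. A quick back-of-envelope calculation from the figures above:

```python
# Back-of-envelope: what fraction of GLM-5.1's weights fire per token?
# Numbers are taken from the spec table above; the calculation itself is
# just arithmetic, not anything model-specific.
total_params = 754e9   # total parameters (MoE)
active_params = 40e9   # active parameters per forward pass

active_fraction = active_params / total_params
print(f"Active per token: {active_fraction:.1%}")  # ≈ 5.3% of total weights
```

Roughly one parameter in twenty participates in any given forward pass, which is why a 754B model can sit in the same inference-compute bracket as much smaller dense models.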
Key Specs
| Spec | Value |
|---|---|
| Parameters (total) | 754B (MoE) |
| Active parameters | 40B per token |
| Context window | 200,000 tokens |
| Max output tokens | 128,000 |
| Modalities | Text only |
| License | MIT |
| Training hardware | Huawei Ascend 910B |
| Release date | April 7, 2026 |
The 200K context window with 128K maximum output is a generous combination - most frontier models cap output at 32K or 64K. For long-horizon agentic tasks that generate large code artifacts or iterate across many files, that ceiling matters.
Supported inference frameworks include SGLang (v0.5.10+), vLLM (v0.19.0+), Transformers, KTransformers, and xLLM. Ollama also has the model in its library. The API supports function calling, streaming, structured outputs, context caching, and a thinking mode for extended reasoning.
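As a rough sketch of what a request against an OpenAI-compatible endpoint might look like, here is a payload exercising the features listed above. The model id, the `thinking` field, and its shape are assumptions for illustration, not confirmed API details:

```python
import json

# Hypothetical chat-completion payload for an OpenAI-compatible GLM-5.1
# endpoint. Field names beyond the standard ones (notably "thinking")
# are assumptions based on the feature list, not documented API.
payload = {
    "model": "glm-5.1",
    "messages": [
        {"role": "system", "content": "You are a software engineering agent."},
        {"role": "user", "content": "Fix the failing test in utils/parse.py."},
    ],
    "stream": True,                      # token streaming
    "max_tokens": 128_000,               # GLM-5.1's stated output ceiling
    "thinking": {"type": "enabled"},     # hypothetical thinking-mode flag
}
print(json.dumps(payload, indent=2))
```

The unusually high `max_tokens` ceiling is the part worth noticing: it is what lets a single response carry a multi-file patch rather than forcing the agent to chunk its output.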
GLM-5.1's training ran on Huawei Ascend 910B chips - no NVIDIA hardware anywhere in the stack.
Source: pexels.com
Benchmarks
The headline number is SWE-Bench Pro, but GLM-5.1's benchmark profile is more varied than the press releases suggest.
| Benchmark | GLM-5.1 | GPT-5.4 | Claude Opus 4.6 | Gemini 3.1 Pro |
|---|---|---|---|---|
| SWE-Bench Pro | 58.4 | 57.7 | 57.3 | 54.2 |
| Terminal-Bench 2.0 | 63.5 | - | 68.5 | - |
| CyberGym | 68.7 | - | 66.6 | - |
| GPQA-Diamond | 86.2 | - | - | 94.3 |
| AIME 2026 | 95.3 | 98.7 | 98.2 | - |
On coding-specific work, the picture is positive. SWE-Bench Pro tests real software engineering tasks - applying patches, fixing bugs, implementing features from natural language descriptions - and GLM-5.1's 58.4 is currently the highest public score on that leaderboard. It also leads CyberGym at 68.7, a security task completion benchmark across 1,507 tasks.
The gaps appear on general reasoning. AIME 2026 (advanced math competition problems) has GPT-5.4 at 98.7% and Claude Opus 4.6 at 98.2%, while GLM-5.1 comes in at 95.3. GPQA-Diamond (graduate-level science reasoning) shows an even wider gap against Gemini 3.1 Pro's 94.3%. If your workload is heavy on mathematical derivations or scientific research, this isn't the model to reach for.
One thing worth noting: the SWE-Bench Pro results come from Z.ai's internal testing. Arena.ai independently placed GLM-5.1 third on their Code Arena agentic webdev leaderboard with an Elo rating of 1530, which provides at least partial external validation for the coding performance. But a fully independent evaluation on SWE-Bench Pro from a third-party lab hasn't been published as of this writing. Treat the exact margin over Claude and GPT-5.4 as preliminary.
Agentic Capabilities
The most compelling part of GLM-5.1 isn't any individual benchmark score. It's the claim that the model can run autonomously for up to eight hours without human checkpoints, completing a full plan-execute-analyze-optimize loop across hundreds of iterations.
Z.ai demonstrated this with two tasks. First: building a complete Linux desktop environment from scratch - file browser, terminal, text editor, system monitor, and playable games - through 655 autonomous iterations in a single session. Second: optimizing a vector database over 178 rounds of autonomous tuning, improving throughput to 6.9 times the original baseline. In a separate CUDA kernel optimization task, the model improved speedup from 2.6x to 35.7x through sustained iterative refinement.
These are company-produced demos, so the usual skepticism applies. But the underlying technical approach - maintaining goal alignment across hundreds of tool calls without strategy drift - addresses a real failure mode of current agentic systems, where models lose context or start tuning for the wrong objective after a few dozen steps.
The practical implication: GLM-5.1 makes more sense as the backbone of an autonomous coding agent than as an interactive assistant. Plug it into a CI/CD pipeline, give it a ticket, and come back later. For real-time IDE completions where latency matters, Cursor or Claude Code with a faster model will feel more responsive.
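The plan-execute-analyze loop described above can be sketched in a few lines. This is a minimal toy, not Z.ai's implementation: `call_model` and `run_tool` are stand-ins for the real API client and tool executor, and the state is reduced to a failing-test count:

```python
# Minimal sketch of a long-horizon agent loop: plan -> execute -> analyze,
# repeated until the goal is met or an iteration budget runs out.
# call_model and run_tool are hypothetical stand-ins for the real
# model client and tool executor.

def call_model(state: dict) -> dict:
    """Stand-in for a GLM-5.1 call that proposes the next action."""
    if state["tests_failing"] > 0:
        return {"action": "patch"}
    return {"action": "done"}

def run_tool(action: dict, state: dict) -> dict:
    """Stand-in for executing the action (apply edit, re-run tests)."""
    if action["action"] == "patch":
        state["tests_failing"] -= 1  # pretend each patch fixes one test
    return state

def agent_loop(state: dict, max_iters: int = 500) -> dict:
    for _ in range(max_iters):
        action = call_model(state)       # plan
        if action["action"] == "done":   # explicit goal check each iteration
            break
        state = run_tool(action, state)  # execute, then loop back to analyze
    return state

print(agent_loop({"tests_failing": 3}))  # {'tests_failing': 0}
```

The hard part that the sketch hides - and that GLM-5.1's training reportedly targets - is keeping `call_model` pointed at the original objective across hundreds of iterations instead of a few.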
GLM-5.1 is built for long-running agentic loops, not fast interactive completions.
Source: pexels.com
Pricing and Access
The model weights are on Hugging Face at zai-org/GLM-5.1 under the MIT license, which means unrestricted commercial use and self-hosting.
For API access, Z.ai prices GLM-5.1 at $0.95 per million input tokens and $3.15 per million output tokens, with cached inputs at $0.26 per million. Those prices sit comfortably below what Anthropic and OpenAI charge for their frontier models.
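To make those rates concrete, here is the arithmetic for a long agentic run at the listed prices. The token counts and the verbosity factor are illustrative assumptions, not measurements:

```python
# Cost of a hypothetical long agentic run at Z.ai's listed GLM-5.1 rates
# ($0.95/M input, $3.15/M output). Token volumes below are made up
# for illustration.
def job_cost(input_mtok: float, output_mtok: float,
             in_rate: float = 0.95, out_rate: float = 3.15) -> float:
    """USD cost for a job measured in millions of tokens."""
    return input_mtok * in_rate + output_mtok * out_rate

base = job_cost(20, 5)           # 20M input, 5M output tokens
print(f"Base run: ${base:.2f}")  # $34.75

# The verbosity caveat matters here: if GLM-5.1 emits, say, 30% more
# output tokens than a peer for the same work (hypothetical factor),
# the output side of the bill grows with it.
verbose = job_cost(20, 5 * 1.3)
print(f"With 30% more output: ${verbose:.2f}")  # $39.48
```

Even with a verbosity penalty, the per-token rates leave substantial headroom against frontier closed-model pricing, which is what makes overnight agent runs economically plausible.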
Z.ai also offers a subscription-tier Coding Plan aimed at developers who want integrated tooling rather than raw API access:
| Tier | Price | Notes |
|---|---|---|
| Lite | ~$10/month | Basic access to GLM-5.1 and GLM-5-Turbo |
| Pro | ~$30/month | Full model suite, priority throughput |
| Max | ~$80/month | Near-unlimited usage of premium models |
Free-tier users get access to GLM-4.7-Flash and GLM-4.5-Flash at no cost, along with a small allocation of premium model credits.
Local deployment works with standard frameworks. On an NVIDIA DGX Spark or an equivalent multi-GPU setup, you can run the FP8-quantized variant (zai-org/GLM-5.1-FP8) with reduced memory requirements. The full BF16 weights need substantially more VRAM - this isn't a laptop model.
Strengths
- SWE-Bench Pro leader: 58.4 is the highest public score on the benchmark that best predicts real-world software engineering performance
- Genuinely long autonomous execution: eight-hour uninterrupted agentic sessions with maintained goal alignment aren't something most models can do
- MIT license with commercial use: no restrictions, self-hostable, no vendor lock-in
- Competitive API pricing: at $0.95/$3.15 per million tokens, it undercuts frontier closed-source models
- Deep engineering toolkit: function calling, context caching, MCP integration, structured outputs, thinking mode - all there
- CyberGym leader: 68.7 on security task completion ahead of Claude Opus 4.6's 66.6
Weaknesses
- Text-only: no image, audio, or video input while GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro are all multimodal
- Slower generation speed: 40-44 tokens/second against a competitive median above 53 t/s, which is noticeable in interactive use
- Verbose output: the model produces significantly more tokens than peers to reach equivalent answers, which inflates costs and latency
- Math and science gaps: AIME 2026 (95.3) and GPQA-Diamond (86.2) trail the frontier, making it a weaker choice for research or quantitative tasks
- Self-reported benchmarks: the SWE-Bench Pro score hasn't been independently verified by a third-party lab
- Young ecosystem: fewer community integrations and third-party tooling than Claude or OpenAI-compatible APIs
Verdict
GLM-5.1 is the most capable open-weight model available for agentic software engineering work. The SWE-Bench Pro result is real - partially backed up by Arena.ai's independent Code Arena rankings - and the sustained eight-hour autonomous execution is a genuine differentiator for teams building long-running coding agents. The MIT license and competitive API pricing remove the usual friction of adopting a less-established model.
The weaknesses are real too. No multimodal input, slower token generation, and benchmark claims that haven't all been independently verified mean this isn't a drop-in replacement for Claude or GPT-5.4 across general workloads. For coding agents specifically - especially those that need to run overnight on complex engineering tasks - GLM-5.1 earns serious consideration.
The detail that makes the result interesting beyond the benchmark number: it was built entirely on Huawei silicon, by a lab that's been cut off from US chips for over a year. Whether that changes anything about how US policymakers think about export controls is a separate question. The technical fact is that the gap between Western and Chinese frontier models on coding tasks is now measured in decimal points.
Score: 8.1 / 10
Sources
- GLM-5.1 Official Documentation - Z.ai
- zai-org/GLM-5.1 on Hugging Face
- Z.AI Introduces GLM-5.1 - MarkTechPost
- GLM-5.1 - Artificial Analysis
- GLM-5.1 API Pricing - OpenRouter
- GLM-5.1 Tops SWE-Bench Pro - Dataconomy
- GLM-5.1 Open Source #1 on SWE-Bench Pro - ModemGuides
- GLM-5.1 Full Review - BuildFastWithAI
- GLM-5.1 Benchmarks Breakdown - Lushbinary
- Z.ai Coding Plan Pricing
