GLM-5 - China's 744B Open-Source Frontier Model
Zhipu AI's GLM-5 is a 744B MoE model with 40B active parameters, trained on 100K Huawei Ascend chips, scoring 77.8% on SWE-bench Verified and 50 on the Artificial Analysis Intelligence Index - MIT licensed.

GLM-5 is Zhipu AI's flagship large language model - a 744-billion-parameter Mixture-of-Experts model that activates just 40 billion parameters per token. What makes it historically significant isn't just raw performance that rivals Claude Opus 4.6 and GPT-5.2 on key benchmarks: it is also the first frontier-class model trained entirely on Chinese hardware, on 100,000 Huawei Ascend 910B processors with zero Nvidia silicon involved.
Released on February 11, 2026 under an MIT license, GLM-5 marks a turning point in the global AI race. Zhipu AI - China's first publicly traded AI-native company after its $558 million Hong Kong IPO in January 2026 - has proven that frontier performance no longer requires American chips.
TL;DR
- First open-source model to score 50 on Artificial Analysis Intelligence Index, rivaling closed models from OpenAI and Anthropic
- 744B MoE, 40B active, 200K context, $1.00/$3.20 per million tokens - 5x cheaper than Claude Opus 4.6 on input
- Trades blows with GPT-5.2 on SWE-bench (77.8%) and beats it on Humanity's Last Exam with tools (50.4% vs 47.8%)
Key Specifications
| Specification | Details |
|---|---|
| Provider | Zhipu AI (Z.ai) |
| Model Family | GLM |
| Total Parameters | 744B (256 experts, 8 active per token) |
| Active Parameters | 40B per token |
| Context Window | 200K tokens (202,752 max in SFT) |
| Input Price | $1.00/M tokens |
| Output Price | $3.20/M tokens |
| Release Date | February 11, 2026 |
| License | MIT |
| Training Data | 28.5T tokens |
| Training Hardware | 100,000 Huawei Ascend 910B |
| Framework | MindSpore |
| Weights | HuggingFace (BF16 + FP8) |
Benchmark Performance
GLM-5 competes directly with the most expensive closed models at a fraction of the cost. Here is how it stacks up against the current top contenders:
| Benchmark | GLM-5 | GPT-5.2 | Claude Opus 4.6 | Gemini 3.1 Pro |
|---|---|---|---|---|
| SWE-bench Verified | 77.8% | 76.2% | 79.4% | 72.3% |
| Humanity's Last Exam (tools) | 50.4% | 47.8% | 46.2% | - |
| GPQA Diamond | 86.0% | 71.5% | - | 94.3% |
| AIME 2026 I | 92.7% | 100% | - | - |
| BrowseComp | 75.9 | 72.1 | 68.4 | - |
| Terminal-Bench 2.0 | 56.2 | - | 69.9 | - |
| MMLU-Pro | 70.4% | - | - | - |
| AA Intelligence Index | 50 | 51 | 53 | 48 |
GLM-5 trails the SWE-bench Verified leader, Claude Opus 4.6, by just 1.6 percentage points - a margin that would have been unthinkable six months ago, especially considering GLM-5 was trained on hardware most Western engineers have never touched. On Humanity's Last Exam with tool use, GLM-5 actually leads the field at 50.4%, outpacing both GPT-5.2 and Claude Opus 4.6.
Where GLM-5 falls short is speed. At 72.5 tokens per second on the Z.ai API, it is faster than average but still behind the throughput of more mature inference stacks. It is also notably verbose, generating 110 million tokens during evaluation against a median of 15 million - roughly seven times as many. That verbosity inflates costs in production workloads.
GLM-5 was trained on 100,000 Huawei Ascend 910B processors - the largest non-Nvidia training cluster ever assembled for a frontier model.
Architecture and Technical Innovations
GLM-5 scales up its predecessor GLM-4.7-Flash in every dimension: from 355B to 744B total parameters (more than double), from 32B to 40B active parameters, and from 23T to 28.5T training tokens.
The architecture builds on proven components. Multi-head Latent Attention (MLA) reduces memory by roughly 33% versus standard multi-head attention. The MLA-256 variant increases the head dimension from 192 to 256 while cutting attention heads by a third - a trade-off that reduces decoding compute without sacrificing training efficiency.
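A quick back-of-envelope check, using only the ratios quoted above, shows why the MLA-256 trade-off nets out cheaper at decode time. Per-token attention score compute scales with the product of head count and head dimension; the absolute head count below is a hypothetical value for illustration only.

```python
# Per-token attention compute scales with n_heads * head_dim.
# MLA-256 cuts heads by a third and grows head_dim from 192 to 256.
baseline_heads = 96                       # hypothetical head count, illustration only
mla256_heads = baseline_heads * 2 // 3    # "cutting attention heads by a third"

baseline_cost = baseline_heads * 192
mla256_cost = mla256_heads * 256

ratio = mla256_cost / baseline_cost
print(f"relative decode-attention cost: {ratio:.3f}")  # 8/9, about 0.889
```

Whatever the true head count, the ratio (2/3) x (256/192) = 8/9 holds, so the variant spends roughly 11% less on attention scores per decoded token.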
Three innovations stand out:
DeepSeek Sparse Attention (DSA) reduces attention computation by 1.5-2x for long sequences through dynamic token importance selection rather than fixed patterns. This was adapted during continued pre-training on 20 billion tokens.
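The idea behind dynamic token selection can be sketched in a few lines of NumPy. This is a toy single-query version, not Zhipu's kernel; the importance proxy (raw dot-product score) and the keep fraction are assumptions for illustration.

```python
import numpy as np

def sparse_attention(q, K, V, k_keep):
    """Toy dynamic-sparsity attention: rank all keys by a cheap importance
    score, then run the full softmax over only the top k_keep tokens.
    The scoring proxy here is an assumption, not the DSA indexer."""
    scores = K @ q                        # (seq_len,) importance per token
    keep = np.argsort(scores)[-k_keep:]   # indices of the k most relevant tokens
    kept = scores[keep] / np.sqrt(q.shape[0])
    w = np.exp(kept - kept.max())         # stable softmax over the kept subset
    w /= w.sum()
    return w @ V[keep]                    # weighted mix of selected values only

rng = np.random.default_rng(0)
seq, d = 4096, 64
out = sparse_attention(rng.normal(size=d), rng.normal(size=(seq, d)),
                       rng.normal(size=(seq, d)), k_keep=seq // 2)
print(out.shape)
```

Note that this toy still scores every key once; the production scheme uses a lightweight indexer so that selection itself is cheaper than full attention, which is where the 1.5-2x saving on long sequences comes from.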
Multi-Token Prediction shares parameters across three MTP layers during training, improving speculative decoding acceptance length from 2.55 to 2.76 tokens. This boosts inference throughput without adding proportional parameters.
Slime RL Framework is Zhipu's custom asynchronous reinforcement learning infrastructure for post-training. It features Token-in-Token-out (TITO) gateways that eliminate re-tokenization mismatches, FP8 rollouts for speed, and heartbeat-driven fault tolerance. The framework produces 3,000 to 6,000 messages per run, honing the model's long-range planning and tool use capabilities.
Context extension was progressive: 32K (1T tokens), 128K (500B tokens), and finally 200K (50B tokens) through mid-training stages.
GLM-5's MoE architecture uses 256 experts with 8 active per token - reaching frontier performance while keeping inference costs manageable.
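The routing step that makes this work is conceptually simple: score all 256 experts per token, keep the top 8, and softmax-normalize their gates. The sketch below is illustrative only - GLM-5's exact gating (load balancing, bias terms, any shared experts) is not described in the material above.

```python
import numpy as np

def route(hidden, router_weights, top_k=8):
    """Toy top-k MoE router: pick top_k of n_experts per token and
    softmax-normalize the selected gates. Illustrative, not GLM-5's
    actual gating function."""
    logits = hidden @ router_weights              # (n_experts,) affinity scores
    experts = np.argsort(logits)[-top_k:]         # 8 active experts per token
    gates = np.exp(logits[experts] - logits[experts].max())
    return experts, gates / gates.sum()

rng = np.random.default_rng(0)
experts, gates = route(rng.normal(size=1024), rng.normal(size=(1024, 256)))
print(len(experts), round(gates.sum(), 6))
```

With 8 of 256 experts firing per token, only about 40B of the 744B parameters (roughly 5%) participate in any given forward pass, which is what keeps per-token inference cost close to that of a mid-size dense model.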
Key Capabilities
GLM-5 was built with an explicit focus on agentic tasks - not just answering questions but executing multi-step workflows involving code, tools, and web search.
On software engineering benchmarks, it scores 77.8% on SWE-bench Verified across real GitHub issues and 73.3% on SWE-bench Multilingual, covering nine programming languages. The model supports three thinking modes - interleaved (reasoning before each action), preserved (carrying thinking blocks across turns), and turn-level (per-turn control) - giving developers fine-grained control over when and how much the model reasons.
The model's web search capabilities are particularly strong. On BrowseComp, it scores 75.9 - beating every other model in the comparison. Zhipu's keep-recent-k context management strategy lets GLM-5 maintain relevant information across long search sessions without blowing through its context window.
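One plausible reading of a keep-recent-k policy is to retain the full conversation skeleton while keeping complete tool-result bodies only for the k most recent tool messages, eliding older ones. The exact strategy is not public; everything below, including the message schema, is an assumption.

```python
def keep_recent_k(messages, k=3):
    """Sketch of a keep-recent-k context policy (details assumed):
    keep every message, but replace tool-result bodies older than the
    k most recent with a short placeholder to free context budget."""
    tool_idx = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    recent = set(tool_idx[-k:])
    pruned = []
    for i, m in enumerate(messages):
        if m["role"] == "tool" and i not in recent:
            m = {**m, "content": "[older tool result elided]"}
        pruned.append(m)
    return pruned

history = [{"role": "user", "content": "find X"}] + \
          [{"role": "tool", "content": f"page {i}"} for i in range(5)]
print([m["content"] for m in keep_recent_k(history, k=2)])
```

The appeal of a policy like this for long browsing sessions is that the agent keeps a trace of where it has been while only the freshest page contents occupy context.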
For developers wanting to self-host, the FP8-quantized version cuts memory requirements roughly in half. Deployment requires at least 8 H200 or H20 GPUs with tensor parallelism, and both vLLM and SGLang provide first-class support. The community has already produced GGUF quantizations through Unsloth for those running on consumer hardware.
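The 8-GPU floor follows from simple memory arithmetic. Only the 744B parameter count comes from the model card; the per-GPU capacity figure is the published H200 HBM size, and treating FP8 as exactly one byte per parameter is a simplification.

```python
# Rough memory budget for FP8 self-hosting (GB = 1e9 bytes throughout).
params_b = 744                    # total parameters, billions
weight_gb_fp8 = params_b * 1.0    # FP8 ~ 1 byte/param -> ~744 GB of weights
h200_gb = 141                     # H200 HBM capacity per GPU
gpus = 8
headroom = gpus * h200_gb - weight_gb_fp8
print(f"weights: {weight_gb_fp8:.0f} GB, cluster: {gpus * h200_gb} GB, "
      f"headroom for KV cache and activations: {headroom:.0f} GB")
```

Weights alone exceed any five-GPU node, and the remaining ~384 GB across eight GPUs has to absorb KV cache for 200K-token contexts plus activations - hence eight H200s as the practical minimum rather than a comfortable configuration.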
Pricing and Availability
GLM-5 occupies a unique position: frontier-class performance at mid-tier pricing, with full open weights for those who prefer self-hosting.
| Provider | Input | Output | Notes |
|---|---|---|---|
| Z.ai API | $1.00/M | $3.20/M | Official hosted API |
| Self-hosted (MIT) | Hardware cost | Hardware cost | 8x H200 minimum (FP8) |
| Claude Opus 4.6 | $5.00/M | $25.00/M | 5x / 7.8x more expensive |
| GPT-5.2 | $15.00/M | $75.00/M | 15x / 23x more expensive |
| Gemini 3.1 Pro | $2.00/M | $8.00/M | 2x / 2.5x more expensive |
At $1.00 per million input tokens, GLM-5 undercuts Claude Opus 4.6 by 5x and GPT-5.2 by 15x. The blended rate works out to roughly $1.55 per million tokens at a 3:1 input/output ratio - making it one of the most cost-effective frontier models available.
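The $1.55 blended figure is straightforward weighted arithmetic, and the same formula makes provider comparisons at any traffic mix easy to reproduce:

```python
def blended_rate(input_price, output_price, input_ratio=3, output_ratio=1):
    """Blended $/M tokens at a given input:output token ratio
    (defaults to the 3:1 mix used in the text)."""
    total = input_ratio + output_ratio
    return (input_price * input_ratio + output_price * output_ratio) / total

print(blended_rate(1.00, 3.20))    # GLM-5: (3*1.00 + 3.20) / 4 = 1.55
print(blended_rate(5.00, 25.00))   # Claude Opus 4.6 at the same 3:1 mix
```

Note the blend is sensitive to the mix: output-heavy workloads (long agentic traces, verbose models) pull effective cost toward the output price, which is exactly where GLM-5's verbosity eats into its price advantage.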
The February 2026 launch included a 30-60% price increase on Zhipu's coding plans, reflecting heavy demand that kept the platform under constant load. Even after the increase, GLM-5 remains dramatically cheaper than Western alternatives.
The model is available through the Z.ai API, HuggingFace (both BF16 and FP8 weights), ModelScope, and third-party providers like OpenRouter and Novita.
Strengths and Weaknesses
Strengths
- Frontier performance at open-source pricing - 77.8% SWE-bench and 50 Intelligence Index score under MIT license
- Trained entirely on Chinese hardware - proves frontier AI does not require Nvidia GPUs, a geopolitically significant milestone
- Strong agentic capabilities - three thinking modes, web search, and multi-step tool use out of the box
- Full deployment flexibility - MIT license with BF16 and FP8 weights on HuggingFace, plus multiple inference framework support
- Aggressive API pricing - 5-15x cheaper than comparable closed models
- Sharply reduced hallucination rate - 34% vs 90% in predecessor GLM-4.7, according to Zhipu's internal evaluations
Weaknesses
- Benchmarks await independent verification - most published numbers come from Zhipu's own evaluations, and third-party replication is still catching up
- Verbose outputs - produces 7x more tokens than median models during evaluation, inflating effective costs
- Inference speed is moderate - 72.5 tokens/second is acceptable but behind optimized closed-model inference stacks
- High self-hosting requirements - 8x H200 GPUs minimum for FP8, putting local deployment out of reach for most developers
- Limited multimodal support - text-only input/output for now, unlike competitors offering native vision
- Availability constraints - heavy demand has caused capacity issues and forced pricing increases
Related Coverage
- GLM-5 Arrives: 744B-Parameter Open-Source Model Built for Agents - our initial coverage of the release
- China's GLM-5 Rivals GPT-5.2 on Zero Nvidia Silicon - deep dive on the Huawei Ascend training story
- Open-Source LLM Leaderboard - where GLM-5 ranks among open models
- Coding Benchmarks Leaderboard - SWE-bench and Terminal-Bench rankings
- Agentic AI Benchmarks Leaderboard - agent task performance rankings
- GLM-4.7-Flash - the predecessor model profile
Sources
- GLM-5: from Vibe Coding to Agentic Engineering (arXiv technical report)
- GLM-5 on HuggingFace
- GLM-5 Intelligence, Performance & Price Analysis (Artificial Analysis)
- Zhipu AI Launches GLM-5 with 30% Price Increase (Creati.ai)
- GLM-5: China's First Public AI Company Ships a Frontier Model (Maxime Labonne)
- Z.ai's GLM-5 Achieves Record Low Hallucination Rate (VentureBeat)
- China's AI Tiger Goes Public (CNBC)
- Z.ai's GLM-5 Model Boasts Top Open-Weights Intelligence Index Score (The Batch)
