GLM-5 - China's 744B Open-Source Frontier Model

Zhipu AI's GLM-5 is a 744B-parameter MoE model with 40B active parameters, trained on 100,000 Huawei Ascend chips. It scores 77.8% on SWE-bench Verified and 50 on the Artificial Analysis Intelligence Index, and ships under an MIT license.

GLM-5 is Zhipu AI's flagship large language model - a 744-billion-parameter Mixture-of-Experts model that activates just 40 billion parameters per token. What makes it historically significant isn't just its raw performance, which rivals Claude Opus 4.6 and GPT-5.2 on key benchmarks. It's that GLM-5 is the first frontier-class model trained entirely on Chinese hardware - 100,000 Huawei Ascend 910B processors, with zero Nvidia silicon involved.

Released on February 11, 2026 under an MIT license, GLM-5 marks a turning point in the global AI race. Zhipu AI - China's first publicly traded AI-native company after its $558 million Hong Kong IPO in January 2026 - has proven that frontier performance no longer requires American chips.

TL;DR

  • First open-source model to score 50 on Artificial Analysis Intelligence Index, rivaling closed models from OpenAI and Anthropic
  • 744B MoE, 40B active, 200K context, $1.00/$3.20 per million tokens - 5x cheaper than Claude Opus 4.6 on input
  • Trades blows with GPT-5.2 on SWE-bench (77.8%) and beats it on Humanity's Last Exam with tools (50.4% vs 47.8%)

Key Specifications

| Specification | Details |
|---|---|
| Provider | Zhipu AI (Z.ai) |
| Model Family | GLM |
| Total Parameters | 744B (256 experts, 8 active per token) |
| Active Parameters | 40B per token |
| Context Window | 200K tokens (202,752 max in SFT) |
| Input Price | $1.00/M tokens |
| Output Price | $3.20/M tokens |
| Release Date | February 11, 2026 |
| License | MIT |
| Training Data | 28.5T tokens |
| Training Hardware | 100,000 Huawei Ascend 910B |
| Framework | MindSpore |
| Weights | HuggingFace (BF16 + FP8) |

Benchmark Performance

GLM-5 competes directly with the most expensive closed models at a fraction of the cost. Here is how it stacks up against the current top contenders:

| Benchmark | GLM-5 | GPT-5.2 | Claude Opus 4.6 | Gemini 3.1 Pro |
|---|---|---|---|---|
| SWE-bench Verified | 77.8% | 76.2% | 79.4% | 72.3% |
| Humanity's Last Exam (tools) | 50.4% | 47.8% | 46.2% | - |
| GPQA Diamond | 86.0% | 71.5% | - | 94.3% |
| AIME 2026 I | 92.7% | 100% | - | - |
| BrowseComp | 75.9 | 72.1 | 68.4 | - |
| Terminal-Bench 2.0 | 56.2 | - | 69.9 | - |
| MMLU-Pro | 70.4% | - | - | - |
| AA Intelligence Index | 50 | 51 | 53 | 48 |

The 1.6 percentage point gap between first and second place on SWE-bench Verified would have been unthinkable six months ago - especially considering GLM-5 was trained on hardware that most Western engineers have never touched. On Humanity's Last Exam with tool use, GLM-5 actually leads the field at 50.4%, outpacing both GPT-5.2 and Claude Opus 4.6.

Where GLM-5 falls short is speed. At 72.5 tokens per second on the Z.ai API, it is faster than average but still behind the throughput of more mature inference stacks. It is also notably verbose - generating 110 million tokens during evaluation, compared with a median of 15 million. That verbosity inflates costs in production workloads.

[Image: data center server racks] GLM-5 was trained on 100,000 Huawei Ascend 910B processors - the largest non-Nvidia training cluster ever assembled for a frontier model.

Architecture and Technical Innovations

GLM-5 scales up its predecessor GLM-4.7-Flash across every dimension: from 355B to 744B total parameters, from 32B to 40B active parameters, and from 23T to 28.5T training tokens.

The architecture builds on proven components. Multi-head Latent Attention (MLA) reduces memory by roughly 33% versus standard multi-head attention. The MLA-256 variant increases the head dimension from 192 to 256 while cutting attention heads by a third - a trade-off that reduces decoding compute without sacrificing training efficiency.
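The decoding-compute effect of that trade-off can be sketched with back-of-envelope arithmetic. The baseline head count below is hypothetical (GLM-5's exact figure isn't stated here); only the head-dimension change (192 to 256) and the "heads cut by a third" ratio come from the text.

```python
# Illustrative arithmetic for the MLA-256 trade-off described above.
# baseline_heads is a hypothetical value, not a published GLM-5 spec.
baseline_heads, baseline_dim = 96, 192          # hypothetical baseline config
mla256_heads = baseline_heads * 2 // 3          # a third fewer heads -> 64
mla256_dim = 256                                # head dimension raised from 192

baseline_width = baseline_heads * baseline_dim  # per-token attention width
mla256_width = mla256_heads * mla256_dim

print(baseline_width, mla256_width)             # 18432 16384
print(f"decode compute ratio: {mla256_width / baseline_width:.2f}")
```

Under these assumed numbers, fewer-but-wider heads shrink the per-token attention width by roughly a tenth, which is where the decoding savings come from.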

Three innovations stand out:

DeepSeek Sparse Attention (DSA) reduces attention computation by 1.5-2x for long sequences through dynamic token importance selection rather than fixed patterns. This was adapted during continued pre-training on 20 billion tokens.
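The core idea - score cached tokens dynamically and attend only to the most important ones - can be sketched in a few lines. This is an illustration, not Zhipu's implementation; scoring here is a plain query-key dot product.

```python
# Minimal sketch of dynamic top-k token selection, the idea behind
# sparse attention schemes like DSA. Not GLM-5's actual code.
import numpy as np

def select_tokens(query, keys, k):
    """Return indices of the k cached tokens most relevant to the query."""
    scores = keys @ query           # one importance score per cached token
    topk = np.argsort(scores)[-k:]  # keep the k highest-scoring tokens
    return np.sort(topk)            # restore positional order for attention

rng = np.random.default_rng(0)
keys = rng.standard_normal((1024, 64))  # 1024 cached key vectors
query = rng.standard_normal(64)
idx = select_tokens(query, keys, k=256)
print(len(idx))  # attention now runs over 256 tokens instead of 1024
```

Attending to 256 of 1,024 tokens cuts score computation 4x in this toy case; the 1.5-2x end-to-end figure is smaller because selection itself and the non-attention layers still cost full price.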

Multi-Token Prediction shares parameters across three MTP layers during training, improving speculative decoding acceptance length from 2.55 to 2.76 tokens. This boosts inference throughput without adding proportional parameters.
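A rough sense of what that acceptance-length gain buys: assuming each target-model forward pass verifies one speculated run and the MTP draft layers cost roughly nothing (both simplifications), throughput scales with acceptance length.

```python
# Back-of-envelope effect of the acceptance-length gain quoted above.
# Assumes negligible draft cost - a simplification, not a measurement.
old_accept, new_accept = 2.55, 2.76   # tokens accepted per verification step
relative_gain = new_accept / old_accept
print(f"~{(relative_gain - 1) * 100:.0f}% more tokens per target-model pass")
```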

Slime RL Framework is Zhipu's custom asynchronous reinforcement learning infrastructure for post-training. It features Token-in-Token-out (TITO) gateways that eliminate re-tokenization mismatches, FP8 rollouts for speed, and heartbeat-driven fault tolerance. The framework produces 3,000 to 6,000 messages per run, honing the model's long-range planning and tool use capabilities.

Context extension was progressive: 32K (1T tokens), 128K (500B tokens), and finally 200K (50B tokens) through mid-training stages.

[Image: abstract neural network visualization] GLM-5's MoE architecture uses 256 experts with 8 active per token - reaching frontier performance while keeping inference costs manageable.

Key Capabilities

GLM-5 was built with an explicit focus on agentic tasks - not just answering questions but executing multi-step workflows involving code, tools, and web search.

On software engineering benchmarks, it scores 77.8% on SWE-bench Verified across real GitHub issues and 73.3% on SWE-bench Multilingual, covering nine programming languages. The model supports three thinking modes - interleaved (reasoning before each action), preserved (carrying thinking blocks across turns), and turn-level (per-turn control) - giving developers fine-grained control over when and how much the model reasons.
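A request selecting one of these modes might look like the payload below. The endpoint schema and the `thinking` field name are assumptions for illustration - consult Z.ai's API reference for the real parameter names.

```python
# Hypothetical request payload illustrating thinking-mode control.
# Field names under "thinking" are assumed, not confirmed API schema.
import json

payload = {
    "model": "glm-5",
    "messages": [
        {"role": "user", "content": "Fix the failing test in utils.py"},
    ],
    "thinking": {"type": "interleaved"},  # or "preserved" / per-turn control
}
print(json.dumps(payload, indent=2))
```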

The model's web search capabilities are particularly strong. On BrowseComp, it scores 75.9 - beating every other model in the comparison. Zhipu's keep-recent-k context management strategy lets GLM-5 maintain relevant information across long search sessions without blowing through its context window.
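The gist of a keep-recent-k policy is simple: pin the system prompt and retain only the k most recent messages. Zhipu's actual strategy is not detailed here; this is a minimal sketch of the general technique.

```python
# Sketch of a keep-recent-k context policy: keep the system prompt,
# drop everything but the k newest messages. Illustrative only.
def keep_recent_k(messages, k):
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-k:]

history = [{"role": "system", "content": "You are a search agent."}]
history += [{"role": "user", "content": f"result page {i}"} for i in range(50)]
trimmed = keep_recent_k(history, k=8)
print(len(trimmed))  # 9: the system prompt plus the 8 most recent messages
```

Real implementations typically also summarize or cache the dropped turns rather than discarding them outright.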

For developers wanting to self-host, the FP8-quantized version cuts memory requirements roughly in half. Deployment requires at least 8 H200 or H20 GPUs with tensor parallelism, and both vLLM and SGLang provide first-class support. The community has already produced GGUF quantizations through Unsloth for those running on consumer hardware.
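A self-hosted launch with vLLM might look like the command below. The HuggingFace repo id is a guess - substitute the actual GLM-5 FP8 repository; the flags shown are standard vLLM options.

```shell
# Sketch of an 8-GPU FP8 deployment with vLLM. Repo id is hypothetical.
vllm serve zai-org/GLM-5-FP8 \
  --tensor-parallel-size 8 \
  --max-model-len 200000
```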

Pricing and Availability

GLM-5 occupies a unique position: frontier-class performance at mid-tier pricing, with full open weights for those who prefer self-hosting.

| Provider | Input | Output | Notes |
|---|---|---|---|
| Z.ai API | $1.00/M | $3.20/M | Official hosted API |
| Self-hosted (MIT) | Hardware cost | Hardware cost | 8x H200 minimum (FP8) |
| Claude Opus 4.6 | $5.00/M | $25.00/M | 5x / 7.8x more expensive |
| GPT-5.2 | $15.00/M | $75.00/M | 15x / 23x more expensive |
| Gemini 3.1 Pro | $2.00/M | $8.00/M | 2x / 2.5x more expensive |

At $1.00 per million input tokens, GLM-5 undercuts Claude Opus 4.6 by 5x and GPT-5.2 by 15x. The blended rate works out to roughly $1.55 per million tokens at a 3:1 input/output ratio - making it one of the most cost-effective frontier models available.
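The blended-rate figure follows directly from the list prices:

```python
# Reproduce the blended rate quoted above: a 3:1 input/output token
# mix at GLM-5's list prices.
input_price, output_price = 1.00, 3.20   # $ per million tokens
blended = (3 * input_price + 1 * output_price) / 4
print(f"${blended:.2f}/M blended")  # $1.55/M
```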

The February 2026 launch included a 30-60% price increase on Zhipu's coding plans, reflecting heavy demand that kept the platform under constant load. Even after the increase, GLM-5 remains dramatically cheaper than Western alternatives.

The model is available through the Z.ai API, HuggingFace (both BF16 and FP8 weights), ModelScope, and third-party providers like OpenRouter and Novita.

Strengths and Weaknesses

Strengths

  • Frontier performance at open-source pricing - 77.8% SWE-bench and 50 Intelligence Index score under MIT license
  • Trained entirely on Chinese hardware - proves frontier AI does not require Nvidia GPUs, a geopolitically significant milestone
  • Strong agentic capabilities - three thinking modes, web search, and multi-step tool use out of the box
  • Full deployment flexibility - MIT license with BF16 and FP8 weights on HuggingFace, plus multiple inference framework support
  • Aggressive API pricing - 5-15x cheaper than comparable closed models
  • Reduced hallucination rate - 34% vs 90% in predecessor GLM-4.7, according to Zhipu's internal evaluations

Weaknesses

  • Unverified benchmark claims - most published numbers come from Zhipu's own evaluations; independent verification is still catching up
  • Verbose outputs - produces 7x more tokens than median models during evaluation, inflating effective costs
  • Inference speed is moderate - 72.5 tokens/second is acceptable but behind optimized closed-model inference stacks
  • High self-hosting requirements - 8x H200 GPUs minimum for FP8, putting local deployment out of reach for most developers
  • Limited multimodal support - text-only input/output for now, unlike competitors offering native vision
  • Availability constraints - heavy demand has caused capacity issues and forced pricing increases

About the author

AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.