GLM-5 - China's 744B Open-Source Frontier Model

Zhipu AI's GLM-5 is a 744B-parameter MoE model with 40B active parameters, trained on 100,000 Huawei Ascend chips. It scores 77.8% on SWE-bench Verified and 50 on the Artificial Analysis Intelligence Index, and ships under an MIT license.

GLM-5 is Zhipu AI's flagship large language model - a 744-billion-parameter Mixture-of-Experts model that activates just 40 billion parameters per token. What makes it historically significant isn't just its raw performance, which rivals Claude Opus 4.6 and GPT-5.2 on key benchmarks. It's that GLM-5 is the first frontier-class model trained entirely on Chinese hardware - 100,000 Huawei Ascend 910B processors, with zero Nvidia silicon involved.

Released on February 11, 2026 under an MIT license, GLM-5 marks a turning point in the global AI race. Zhipu AI - China's first publicly traded AI-native company after its $558 million Hong Kong IPO in January 2026 - has proven that frontier performance no longer requires American chips.

TL;DR

  • First open-source model to score 50 on Artificial Analysis Intelligence Index, rivaling closed models from OpenAI and Anthropic
  • 744B MoE, 40B active, 200K context, $1.00/$3.20 per million tokens - 5x cheaper than Claude Opus 4.6 on input
  • Trades blows with GPT-5.2 on SWE-bench (77.8%) and beats it on Humanity's Last Exam with tools (50.4% vs 47.8%)

Key Specifications

| Specification | Details |
|---|---|
| Provider | Zhipu AI (Z.ai) |
| Model Family | GLM |
| Total Parameters | 744B (256 experts, 8 active per token) |
| Active Parameters | 40B per token |
| Context Window | 200K tokens (202,752 max in SFT) |
| Input Price | $1.00/M tokens |
| Output Price | $3.20/M tokens |
| Release Date | February 11, 2026 |
| License | MIT |
| Training Data | 28.5T tokens |
| Training Hardware | 100,000 Huawei Ascend 910B |
| Framework | MindSpore |
| Weights | HuggingFace (BF16 + FP8) |

Benchmark Performance

GLM-5 competes directly with the most expensive closed models at a fraction of the cost. Here is how it stacks up against the current top contenders:

| Benchmark | GLM-5 | GPT-5.2 | Claude Opus 4.6 | Gemini 3.1 Pro |
|---|---|---|---|---|
| SWE-bench Verified | 77.8% | 76.2% | 79.4% | 72.3% |
| Humanity's Last Exam (tools) | 50.4% | 47.8% | 46.2% | - |
| GPQA Diamond | 86.0% | 71.5% | - | 94.3% |
| AIME 2026 I | 92.7% | 100% | - | - |
| BrowseComp | 75.9 | 72.1 | 68.4 | - |
| Terminal-Bench 2.0 | 56.2 | - | 69.9 | - |
| MMLU-Pro | 70.4% | - | - | - |
| AA Intelligence Index | 50 | 51 | 53 | 48 |

The 1.6 percentage point gap between first and second place on SWE-bench Verified would have been unthinkable six months ago - especially considering GLM-5 was trained on hardware that most Western engineers have never touched. On Humanity's Last Exam with tool use, GLM-5 actually leads the field at 50.4%, outpacing both GPT-5.2 and Claude Opus 4.6.

Where GLM-5 falls short is speed. At 72.5 tokens per second on the Z.ai API, it is faster than average but still behind the throughput of more mature inference stacks. It is also notably verbose - generating 110 million tokens during evaluation, compared with a median of 15 million. That verbosity inflates costs in production workloads.

[Image: data center server racks] GLM-5 was trained on 100,000 Huawei Ascend 910B processors - the largest non-Nvidia training cluster ever assembled for a frontier model.

Architecture and Technical Innovations

GLM-5 scales up its predecessor GLM-4.7-Flash across every dimension: from 355B to 744B total parameters, from 32B to 40B active parameters, and from 23T to 28.5T training tokens.

The architecture builds on proven components. Multi-head Latent Attention (MLA) reduces memory by roughly 33% versus standard multi-head attention. The MLA-256 variant increases the head dimension from 192 to 256 while cutting attention heads by a third - a trade-off that reduces decoding compute without sacrificing training efficiency.
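The decoding-compute effect of that trade-off can be sketched with back-of-envelope arithmetic. The baseline head count below is hypothetical (GLM-5's exact figure isn't stated here); only the head-dimension change (192 to 256) and the "heads cut by a third" ratio come from the text.

```python
# Illustrative arithmetic for the MLA-256 trade-off described above.
# baseline_heads is a hypothetical value, not a published GLM-5 spec.
baseline_heads, baseline_dim = 96, 192          # hypothetical baseline config
mla256_heads = baseline_heads * 2 // 3          # a third fewer heads -> 64
mla256_dim = 256                                # head dimension raised from 192

baseline_width = baseline_heads * baseline_dim  # per-token attention width
mla256_width = mla256_heads * mla256_dim

print(baseline_width, mla256_width)             # 18432 16384
print(f"decode compute ratio: {mla256_width / baseline_width:.2f}")
```

Under these assumed numbers, fewer-but-wider heads shrink the per-token attention width by roughly a tenth, which is where the decoding savings come from.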

Three innovations stand out:

DeepSeek Sparse Attention (DSA) reduces attention computation by 1.5-2x for long sequences through dynamic token importance selection rather than fixed patterns. This was adapted during continued pre-training on 20 billion tokens.
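The core idea - score cached tokens dynamically and attend only to the most important ones - can be sketched in a few lines. This is an illustration, not Zhipu's implementation; scoring here is a plain query-key dot product.

```python
# Minimal sketch of dynamic top-k token selection, the idea behind
# sparse attention schemes like DSA. Not GLM-5's actual code.
import numpy as np

def select_tokens(query, keys, k):
    """Return indices of the k cached tokens most relevant to the query."""
    scores = keys @ query           # one importance score per cached token
    topk = np.argsort(scores)[-k:]  # keep the k highest-scoring tokens
    return np.sort(topk)            # restore positional order for attention

rng = np.random.default_rng(0)
keys = rng.standard_normal((1024, 64))  # 1024 cached key vectors
query = rng.standard_normal(64)
idx = select_tokens(query, keys, k=256)
print(len(idx))  # attention now runs over 256 tokens instead of 1024
```

Attending to 256 of 1,024 tokens cuts score computation 4x in this toy case; the 1.5-2x end-to-end figure is smaller because selection itself and the non-attention layers still cost full price.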

Multi-Token Prediction shares parameters across three MTP layers during training, improving speculative decoding acceptance length from 2.55 to 2.76 tokens. This boosts inference throughput without adding proportional parameters.
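A rough sense of what that acceptance-length gain buys: assuming each target-model forward pass verifies one speculated run and the MTP draft layers cost roughly nothing (both simplifications), throughput scales with acceptance length.

```python
# Back-of-envelope effect of the acceptance-length gain quoted above.
# Assumes negligible draft cost - a simplification, not a measurement.
old_accept, new_accept = 2.55, 2.76   # tokens accepted per verification step
relative_gain = new_accept / old_accept
print(f"~{(relative_gain - 1) * 100:.0f}% more tokens per target-model pass")
```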

Slime RL Framework is Zhipu's custom asynchronous reinforcement learning infrastructure for post-training. It features Token-in-Token-out (TITO) gateways that eliminate re-tokenization mismatches, FP8 rollouts for speed, and heartbeat-driven fault tolerance. The framework produces 3,000 to 6,000 messages per run, honing the model's long-range planning and tool use capabilities.

Context extension was progressive: 32K (1T tokens), 128K (500B tokens), and finally 200K (50B tokens) through mid-training stages.

[Image: abstract neural network visualization] GLM-5's MoE architecture uses 256 experts with 8 active per token - reaching frontier performance while keeping inference costs manageable.

Key Capabilities

GLM-5 was built with an explicit focus on agentic tasks - not just answering questions but executing multi-step workflows involving code, tools, and web search.

On software engineering benchmarks, it scores 77.8% on SWE-bench Verified across real GitHub issues and 73.3% on SWE-bench Multilingual, covering nine programming languages. The model supports three thinking modes - interleaved (reasoning before each action), preserved (carrying thinking blocks across turns), and turn-level (per-turn control) - giving developers fine-grained control over when and how much the model reasons.
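A request selecting one of these modes might look like the payload below. The endpoint schema and the `thinking` field name are assumptions for illustration - consult Z.ai's API reference for the real parameter names.

```python
# Hypothetical request payload illustrating thinking-mode control.
# Field names under "thinking" are assumed, not confirmed API schema.
import json

payload = {
    "model": "glm-5",
    "messages": [
        {"role": "user", "content": "Fix the failing test in utils.py"},
    ],
    "thinking": {"type": "interleaved"},  # or "preserved" / per-turn control
}
print(json.dumps(payload, indent=2))
```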

The model's web search capabilities are particularly strong. On BrowseComp, it scores 75.9 - beating every other model in the comparison. Zhipu's keep-recent-k context management strategy lets GLM-5 maintain relevant information across long search sessions without blowing through its context window.
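The gist of a keep-recent-k policy is simple: pin the system prompt and retain only the k most recent messages. Zhipu's actual strategy is not detailed here; this is a minimal sketch of the general technique.

```python
# Sketch of a keep-recent-k context policy: keep the system prompt,
# drop everything but the k newest messages. Illustrative only.
def keep_recent_k(messages, k):
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-k:]

history = [{"role": "system", "content": "You are a search agent."}]
history += [{"role": "user", "content": f"result page {i}"} for i in range(50)]
trimmed = keep_recent_k(history, k=8)
print(len(trimmed))  # 9: the system prompt plus the 8 most recent messages
```

Real implementations typically also summarize or cache the dropped turns rather than discarding them outright.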

For developers wanting to self-host, the FP8-quantized version cuts memory requirements roughly in half. Deployment requires at least 8 H200 or H20 GPUs with tensor parallelism, and both vLLM and SGLang provide first-class support. The community has already produced GGUF quantizations through Unsloth for those running on consumer hardware.
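A self-hosted launch with vLLM might look like the command below. The HuggingFace repo id is a guess - substitute the actual GLM-5 FP8 repository; the flags shown are standard vLLM options.

```shell
# Sketch of an 8-GPU FP8 deployment with vLLM. Repo id is hypothetical.
vllm serve zai-org/GLM-5-FP8 \
  --tensor-parallel-size 8 \
  --max-model-len 200000
```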

Pricing and Availability

GLM-5 occupies a unique position: frontier-class performance at mid-tier pricing, with full open weights for those who prefer self-hosting.

| Provider | Input | Output | Notes |
|---|---|---|---|
| Z.ai API | $1.00/M | $3.20/M | Official hosted API |
| Self-hosted (MIT) | Hardware cost | Hardware cost | 8x H200 minimum (FP8) |
| Claude Opus 4.6 | $5.00/M | $25.00/M | 5x / 7.8x more expensive |
| GPT-5.2 | $15.00/M | $75.00/M | 15x / 23x more expensive |
| Gemini 3.1 Pro | $2.00/M | $8.00/M | 2x / 2.5x more expensive |

At $1.00 per million input tokens, GLM-5 undercuts Claude Opus 4.6 by 5x and GPT-5.2 by 15x. The blended rate works out to roughly $1.55 per million tokens at a 3:1 input/output ratio - making it one of the most cost-effective frontier models available.
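The blended-rate figure follows directly from the list prices:

```python
# Reproduce the blended rate quoted above: a 3:1 input/output token
# mix at GLM-5's list prices.
input_price, output_price = 1.00, 3.20   # $ per million tokens
blended = (3 * input_price + 1 * output_price) / 4
print(f"${blended:.2f}/M blended")  # $1.55/M
```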

The February 2026 launch included a 30-60% price increase on Zhipu's coding plans, reflecting heavy demand that kept the platform under constant load. Even after the increase, GLM-5 remains dramatically cheaper than Western alternatives.

The model is available through the Z.ai API, HuggingFace (both BF16 and FP8 weights), ModelScope, and third-party providers like OpenRouter and Novita.

Strengths and Weaknesses

Strengths

  • Frontier performance at open-source pricing - 77.8% SWE-bench and 50 Intelligence Index score under MIT license
  • Trained entirely on Chinese hardware - proves frontier AI does not require Nvidia GPUs, a geopolitically significant milestone
  • Strong agentic capabilities - three thinking modes, web search, and multi-step tool use out of the box
  • Full deployment flexibility - MIT license with BF16 and FP8 weights on HuggingFace, plus multiple inference framework support
  • Aggressive API pricing - 5-15x cheaper than comparable closed models
  • Reduced hallucination rate - 34% vs 90% in predecessor GLM-4.7, according to Zhipu's internal evaluations

Weaknesses

  • Unverified benchmark claims - most published numbers come from Zhipu's own evaluations; independent verification is still catching up
  • Verbose outputs - produces 7x more tokens than median models during evaluation, inflating effective costs
  • Inference speed is moderate - 72.5 tokens/second is acceptable but behind optimized closed-model inference stacks
  • High self-hosting requirements - 8x H200 GPUs minimum for FP8, putting local deployment out of reach for most developers
  • Limited multimodal support - text-only input/output for now, unlike competitors offering native vision
  • Availability constraints - heavy demand has caused capacity issues and forced pricing increases

About the author

AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.