GLM-4.7-Flash

Zhipu's GLM-4.7-Flash is a 30B-A3B MoE model that posts 59.2% on SWE-bench Verified and 79.5% on tau2-Bench while running on a single RTX 4090 - MIT licensed and free via the Z.AI API.

GLM-4.7-Flash is the model that makes you reconsider what a "small" model can do on coding benchmarks. Zhipu AI - the Beijing-based lab now operating as Z.AI - released this 30B-A3B MoE model in January 2026 as the free-tier counterpart to their full GLM-4.7 flagship, and it posts numbers on SWE-bench Verified (59.2%) that embarrass models with ten times its active parameter count.

TL;DR

  • 30B total / ~3B active MoE with 64 routed experts (4 active + 1 shared per token)
  • 59.2% SWE-bench Verified - roughly 2.7x Qwen3-30B-A3B's 22.0% on the same benchmark
  • 79.5% on tau2-Bench for agentic tool use, 91.6% on AIME 2025
  • Runs on 24GB GPUs (RTX 3090/4090) with Q4 quantization at 60-80 tokens/sec
  • MIT licensed with free API tier on Z.AI - no usage caps reported

Overview

The architecture is a straightforward MoE design - 47 layers, 2048 hidden dimension, 64 routed experts with 4 selected per token plus 1 shared expert that is always active. This gives you roughly 3B active parameters per forward pass from a 30B total pool. Nothing exotic in the architecture itself - no Mamba layers, no hybrid attention schemes. The performance gains come from training methodology and data quality, not architectural novelty.
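The routing scheme can be sketched in a few lines of plain Python. This is a toy gate with random weights standing in for the trained router, using the dimensions from the spec table (2048 hidden size, 64 experts, top-4 selection, routed scaling factor 1.8); the shared expert is noted in a comment since it bypasses routing entirely.

```python
import math
import random

HIDDEN = 2048        # hidden dimension from the spec table
N_EXPERTS = 64       # routed experts (the 1 shared expert is always-on, no routing)
TOP_K = 4            # experts selected per token
ROUTED_SCALE = 1.8   # routed scaling factor applied to the gate weights

random.seed(0)
# Random router weights stand in for the trained gating network.
router = [[random.gauss(0, 0.02) for _ in range(N_EXPERTS)] for _ in range(HIDDEN)]

def route(token):
    """Return top-k expert indices and their scaled softmax gate weights."""
    # Affinity logit per expert: dot product of the token with a router column.
    logits = [sum(t * router[i][e] for i, t in enumerate(token)) for e in range(N_EXPERTS)]
    top = sorted(range(N_EXPERTS), key=lambda e: logits[e])[-TOP_K:]
    m = max(logits[e] for e in top)
    exps = [math.exp(logits[e] - m) for e in top]
    z = sum(exps)
    gates = [ROUTED_SCALE * x / z for x in exps]  # softmax over selected, then scale
    return top, gates

token = [random.gauss(0, 1) for _ in range(HIDDEN)]
experts, gates = route(token)
```

Only the 4 selected experts (plus the shared one) run a forward pass for each token, which is how a 30B-parameter pool yields roughly 3B active parameters per step.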

What makes GLM-4.7-Flash interesting is the gap between its coding benchmarks and everything else in the 30B-A3B weight class. SWE-bench Verified at 59.2% is not just good for a 3B-active model - it beats most models regardless of size. For context, the full GLM-4.7 (358B parameters) scores 73.8% on the same benchmark, so Flash retains nearly 80% of the flagship's coding capability at a fraction of the compute. The tau2-Bench score of 79.5% tells a similar story for agentic workflows involving multi-step tool invocation.

Z.AI positioned this explicitly as a developer tool. The model ships with a "preserved thinking" mode that maintains reasoning context across long multi-turn conversations, and a step-by-step thinking mode for debugging. It understands shell syntax, bash scripting, and frontend frameworks (React, Vue) well enough to be practically useful as a coding assistant. The paper reference (arXiv 2508.06471) describes this as part of the "ARC" - Agentic, Reasoning, and Coding - foundation model line.

Key Specifications

Specification            Details
Provider                 Z.AI (Zhipu AI)
Model Family             GLM-4.7
Architecture             Sparse Mixture-of-Experts
Total Parameters         30B
Active Parameters        ~3B
Layers                   47
Hidden Dimension         2,048
MoE Intermediate Size    1,536
Attention Heads          20
Experts                  64 routed + 1 shared (4 active per token)
Routed Scaling Factor    1.8
Context Window           128K tokens (max 202,752)
Max Output               131,072 tokens
Input Modalities         Text
Activation Function      SiLU
Languages                English, Chinese
Release Date             January 19, 2026
License                  MIT

Benchmark Performance

Benchmark             GLM-4.7-Flash   Qwen3-30B-A3B   GPT-OSS-20B
AIME 2025             91.6            85.0            91.7
GPQA Diamond          75.2            73.4            71.5
LiveCodeBench v6      64.0            66.0            61.0
HLE                   14.4            9.8             10.9
SWE-bench Verified    59.2            22.0            34.0
tau2-Bench            79.5            49.0            47.7
BrowseComp            42.8            2.29            28.3

The SWE-bench result demands scrutiny. GLM-4.7-Flash scores 59.2% versus Qwen3-30B-A3B's 22.0% - that is a 2.7x gap on the same benchmark between models with similar active parameter counts. Either Zhipu has discovered something genuinely effective in their training pipeline for code repair tasks, or there are evaluation methodology differences worth investigating. The BrowseComp score (42.8 vs 2.29) shows a similar pattern - massive gaps that suggest different training emphasis rather than strictly better architecture.

On pure reasoning (AIME 91.6, GPQA 75.2) the model is competitive with but does not clearly dominate the field. LiveCodeBench v6 at 64.0 actually trails Qwen3-30B-A3B's 66.0, which is an interesting divergence from the SWE-bench pattern - possibly reflecting different aspects of coding ability (code generation vs. code repair).

Key Capabilities

The primary use case is clear: coding assistance and agentic tool use. The 59.2% SWE-bench Verified score means the model can successfully diagnose and fix real-world GitHub issues nearly 6 out of 10 times. The tau2-Bench score of 79.5% shows it can sequence multi-step tool calls reliably - critical for any agentic framework that needs to chain API calls, file operations, and code modifications.
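The propose-execute-feed-back cycle that tau2-Bench measures can be illustrated with a toy harness loop. Everything here is invented scaffolding for illustration - the tool names, the scripted model turns, and the message shapes are hypothetical, not Z.AI's actual API; a real harness would get each turn from the model rather than a script.

```python
# Toy stand-ins for tools the agent can invoke. In a real harness these would
# be file-system, shell, or HTTP operations.
TOOLS = {
    "run_tests": lambda: "2 passed, 1 failed: test_parse",
    "read_file": lambda path: f"contents of {path}",
}

# Scripted stand-in for the model's turns: each entry is either a tool call
# or a final answer (a real loop would query the model for each turn).
SCRIPTED_TURNS = [
    {"tool": "run_tests", "args": {}},
    {"tool": "read_file", "args": {"path": "parser.py"}},
    {"answer": "test_parse fails; the bug is in parser.py"},
]

def agent_loop(turns):
    """Execute tool calls in sequence until the agent produces a final answer."""
    transcript = []
    for turn in turns:
        if "answer" in turn:
            transcript.append(("final", turn["answer"]))
            break
        result = TOOLS[turn["tool"]](**turn["args"])
        transcript.append((turn["tool"], result))  # result goes back into context
    return transcript

log = agent_loop(SCRIPTED_TURNS)
```

The benchmark rewards exactly this kind of reliability: each tool result must correctly condition the next call, so a single mis-sequenced step derails the whole chain.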

For local deployment, GLM-4.7-Flash fits on consumer hardware. A Q4_K_M quantized version consumes approximately 16GB VRAM, leaving room for the 128K context window on a 24GB card (RTX 3090/4090). Reported inference speeds of 60-80+ tokens per second on these cards make it genuinely usable for interactive coding workflows. Full BF16 precision requires around 60GB VRAM, so a single A100-80GB or two consumer GPUs in a split configuration.
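The VRAM figures above follow from simple arithmetic. A rough weights-only estimate (ignoring KV cache and runtime overhead, and assuming Q4_K_M averages about 4.5 bits per weight - both are assumptions, which is why quoted real-world footprints run a bit higher):

```python
def weight_gb(total_params_b: float, bits_per_param: float) -> float:
    """Approximate weight memory in GB for a given quantization level."""
    return total_params_b * 1e9 * bits_per_param / 8 / 1e9

bf16 = weight_gb(30, 16)   # full precision: ~60 GB
q4 = weight_gb(30, 4.5)    # Q4_K_M at an assumed ~4.5 bits/weight average
print(f"BF16 ~{bf16:.0f} GB, Q4_K_M ~{q4:.1f} GB")
```

The Q4 estimate lands just under 17 GB, consistent with the reported ~16GB footprint plus context headroom on a 24GB card.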

The model supports both vLLM and SGLang for serving, with MTP (Multi-Token Prediction) speculative decoding for additional throughput. The GLM47 tool-call parser and GLM45 reasoning parser are available as plugins for structured agentic deployments.
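A serving sketch using vLLM's standard tool-calling flags; the model ID and parser names here are inferred from the plugin names above and should be verified against the model card before use.

```shell
# Hypothetical launch command - check the exact model ID and parser names
# on the model card; flag spellings follow vLLM's OpenAI-compatible server.
vllm serve zai-org/GLM-4.7-Flash \
  --tensor-parallel-size 1 \
  --enable-auto-tool-choice \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --max-model-len 131072
```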

Pricing and Availability

GLM-4.7-Flash is MIT licensed - the most permissive license you can get. Weights are on HuggingFace and ModelScope. Z.AI offers a free API tier with no reported daily usage caps, though it is limited to one concurrent request. Third-party providers like Novita offer the model at $0.07/M input tokens and $0.40/M output tokens for higher throughput needs.

For self-hosting, the model is available through Ollama, vLLM, SGLang, and standard HuggingFace Transformers. GGUF quantizations from the community (Unsloth and others) provide the easiest path to local deployment.
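A local-run sketch under the assumption that an official Ollama tag and community GGUF files exist as described; the model tag and filename are placeholders.

```shell
# Placeholder names - check the Ollama library / HuggingFace for actual tags.
ollama run glm-4.7-flash

# Or serve a community GGUF directly with llama.cpp:
llama-server -m GLM-4.7-Flash-Q4_K_M.gguf -c 32768 --port 8080
```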

Strengths

  • 59.2% SWE-bench Verified is exceptional for any model, let alone one with 3B active parameters
  • MIT license - no restrictions, no fine print, full commercial freedom
  • Free API tier with no daily caps from Z.AI
  • Runs on consumer GPUs (16GB VRAM quantized) at interactive speeds
  • Strong agentic tool use (tau2-Bench 79.5%) for multi-step workflows

Weaknesses

  • English and Chinese only - no multilingual breadth
  • 128K context window is modest next to same-class competitors offering 1M+ tokens
  • LiveCodeBench v6 (64.0) trails Qwen3-30B-A3B despite dominating SWE-bench - inconsistent coding profile
  • Text-only model - no vision or multimodal capabilities
  • Self-reported benchmarks from Z.AI - the SWE-bench gap over competitors warrants independent verification
  • Chinese AI lab with limited transparency into training data composition

About the author

James, AI Benchmarks & Tools Analyst, is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.