GLM-4.7-Flash

Zhipu's GLM-4.7-Flash is a 30B-A3B MoE model that posts 59.2% on SWE-bench Verified and 79.5% on tau2-Bench while running on a single RTX 4090 - MIT licensed and free via the Z.AI API.

GLM-4.7-Flash is the model that makes you reconsider what a "small" model can do on coding benchmarks. Zhipu AI - the Beijing-based lab now operating as Z.AI - released this 30B-A3B MoE model in January 2026 as the free-tier counterpart to their full GLM-4.7 flagship, and it posts numbers on SWE-bench Verified (59.2%) that embarrass models with ten times its active parameter count.

TL;DR

  • 30B total / ~3B active MoE with 64 routed experts (4 active + 1 shared per token)
  • 59.2% SWE-bench Verified - roughly 2.7x Qwen3-30B-A3B's 22.0% on the same benchmark
  • 79.5% on tau2-Bench for agentic tool use, 91.6% on AIME 2025
  • Runs on 24GB GPUs (RTX 3090/4090) with Q4 quantization at 60-80 tokens/sec
  • MIT licensed with free API tier on Z.AI - no usage caps reported

Overview

The architecture is a straightforward MoE design - 47 layers, 2048 hidden dimension, 64 routed experts with 4 selected per token plus 1 shared expert that is always active. This gives you roughly 3B active parameters per forward pass from a 30B total pool. Nothing exotic in the architecture itself - no Mamba layers, no hybrid attention schemes. The performance gains come from training methodology and data quality, not architectural novelty.
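The routing scheme can be sketched in a few lines of plain Python. This is a toy gate with random weights standing in for the trained router, using the dimensions from the spec table (2048 hidden size, 64 experts, top-4 selection, routed scaling factor 1.8); the shared expert is noted in a comment since it bypasses routing entirely.

```python
import math
import random

HIDDEN = 2048        # hidden dimension from the spec table
N_EXPERTS = 64       # routed experts (the 1 shared expert is always-on, no routing)
TOP_K = 4            # experts selected per token
ROUTED_SCALE = 1.8   # routed scaling factor applied to the gate weights

random.seed(0)
# Random router weights stand in for the trained gating network.
router = [[random.gauss(0, 0.02) for _ in range(N_EXPERTS)] for _ in range(HIDDEN)]

def route(token):
    """Return top-k expert indices and their scaled softmax gate weights."""
    # Affinity logit per expert: dot product of the token with a router column.
    logits = [sum(t * router[i][e] for i, t in enumerate(token)) for e in range(N_EXPERTS)]
    top = sorted(range(N_EXPERTS), key=lambda e: logits[e])[-TOP_K:]
    m = max(logits[e] for e in top)
    exps = [math.exp(logits[e] - m) for e in top]
    z = sum(exps)
    gates = [ROUTED_SCALE * x / z for x in exps]  # softmax over selected, then scale
    return top, gates

token = [random.gauss(0, 1) for _ in range(HIDDEN)]
experts, gates = route(token)
```

Only the 4 selected experts (plus the shared one) run a forward pass for each token, which is how a 30B-parameter pool yields roughly 3B active parameters per step.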

What makes GLM-4.7-Flash interesting is the gap between its coding benchmarks and everything else in the 30B-A3B weight class. SWE-bench Verified at 59.2% is not just good for a 3B-active model - it beats most models regardless of size. For context, the full GLM-4.7 (358B parameters) scores 73.8% on the same benchmark, so Flash retains nearly 80% of the flagship's coding capability at a fraction of the compute. The tau2-Bench score of 79.5% tells a similar story for agentic workflows involving multi-step tool invocation.

Z.AI positioned this explicitly as a developer tool. The model ships with a "preserved thinking" mode that maintains reasoning context across long multi-turn conversations, and a step-by-step thinking mode for debugging. It understands shell syntax, bash scripting, and frontend frameworks (React, Vue) well enough to be practically useful as a coding assistant. The paper reference (arXiv 2508.06471) describes this as part of the "ARC" - Agentic, Reasoning, and Coding - foundation model line.

Key Specifications

Specification            Details
Provider                 Z.AI (Zhipu AI)
Model Family             GLM-4.7
Architecture             Sparse Mixture-of-Experts
Total Parameters         30B
Active Parameters        ~3B
Layers                   47
Hidden Dimension         2,048
MoE Intermediate Size    1,536
Attention Heads          20
Experts                  64 routed + 1 shared (4 active per token)
Routed Scaling Factor    1.8
Context Window           128K tokens (max 202,752)
Max Output               131,072 tokens
Input Modalities         Text
Activation Function      SiLU
Languages                English, Chinese
Release Date             January 19, 2026
License                  MIT

Benchmark Performance

Benchmark             GLM-4.7-Flash   Qwen3-30B-A3B   GPT-OSS-20B
AIME 2025             91.6            85.0            91.7
GPQA Diamond          75.2            73.4            71.5
LiveCodeBench v6      64.0            66.0            61.0
HLE                   14.4            9.8             10.9
SWE-bench Verified    59.2            22.0            34.0
tau2-Bench            79.5            49.0            47.7
BrowseComp            42.8            2.29            28.3

The SWE-bench result demands scrutiny. GLM-4.7-Flash scores 59.2% versus Qwen3-30B-A3B's 22.0% - that is a 2.7x gap on the same benchmark between models with similar active parameter counts. Either Zhipu has discovered something genuinely effective in their training pipeline for code repair tasks, or there are evaluation methodology differences worth investigating. The BrowseComp score (42.8 vs 2.29) shows a similar pattern - massive gaps that suggest different training emphasis rather than strictly better architecture.

On pure reasoning (AIME 91.6, GPQA 75.2) the model is competitive with but does not clearly dominate the field. LiveCodeBench v6 at 64.0 actually trails Qwen3-30B-A3B's 66.0, which is an interesting divergence from the SWE-bench pattern - possibly reflecting different aspects of coding ability (code generation vs. code repair).

Key Capabilities

The primary use case is clear: coding assistance and agentic tool use. The 59.2% SWE-bench Verified score means the model can successfully diagnose and fix real-world GitHub issues nearly 6 out of 10 times. The tau2-Bench score of 79.5% shows it can sequence multi-step tool calls reliably - critical for any agentic framework that needs to chain API calls, file operations, and code modifications.
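The propose-execute-feed-back cycle that tau2-Bench measures can be illustrated with a toy harness loop. Everything here is invented scaffolding for illustration - the tool names, the scripted model turns, and the message shapes are hypothetical, not Z.AI's actual API; a real harness would get each turn from the model rather than a script.

```python
# Toy stand-ins for tools the agent can invoke. In a real harness these would
# be file-system, shell, or HTTP operations.
TOOLS = {
    "run_tests": lambda: "2 passed, 1 failed: test_parse",
    "read_file": lambda path: f"contents of {path}",
}

# Scripted stand-in for the model's turns: each entry is either a tool call
# or a final answer (a real loop would query the model for each turn).
SCRIPTED_TURNS = [
    {"tool": "run_tests", "args": {}},
    {"tool": "read_file", "args": {"path": "parser.py"}},
    {"answer": "test_parse fails; the bug is in parser.py"},
]

def agent_loop(turns):
    """Execute tool calls in sequence until the agent produces a final answer."""
    transcript = []
    for turn in turns:
        if "answer" in turn:
            transcript.append(("final", turn["answer"]))
            break
        result = TOOLS[turn["tool"]](**turn["args"])
        transcript.append((turn["tool"], result))  # result goes back into context
    return transcript

log = agent_loop(SCRIPTED_TURNS)
```

The benchmark rewards exactly this kind of reliability: each tool result must correctly condition the next call, so a single mis-sequenced step derails the whole chain.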

For local deployment, GLM-4.7-Flash fits on consumer hardware. A Q4_K_M quantized version consumes approximately 16GB VRAM, leaving room for the 128K context window on a 24GB card (RTX 3090/4090). Reported inference speeds of 60-80+ tokens per second on these cards make it genuinely usable for interactive coding workflows. Full BF16 precision requires around 60GB VRAM, so a single A100-80GB or two consumer GPUs in a split configuration.
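The VRAM figures above follow from simple arithmetic. A rough weights-only estimate (ignoring KV cache and runtime overhead, and assuming Q4_K_M averages about 4.5 bits per weight - both are assumptions, which is why quoted real-world footprints run a bit higher):

```python
def weight_gb(total_params_b: float, bits_per_param: float) -> float:
    """Approximate weight memory in GB for a given quantization level."""
    return total_params_b * 1e9 * bits_per_param / 8 / 1e9

bf16 = weight_gb(30, 16)   # full precision: ~60 GB
q4 = weight_gb(30, 4.5)    # Q4_K_M at an assumed ~4.5 bits/weight average
print(f"BF16 ~{bf16:.0f} GB, Q4_K_M ~{q4:.1f} GB")
```

The Q4 estimate lands just under 17 GB, consistent with the reported ~16GB footprint plus context headroom on a 24GB card.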

The model supports both vLLM and SGLang for serving, with MTP (Multi-Token Prediction) speculative decoding for additional throughput. The GLM47 tool-call parser and GLM45 reasoning parser are available as plugins for structured agentic deployments.
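A serving sketch using vLLM's standard tool-calling flags; the model ID and parser names here are inferred from the plugin names above and should be verified against the model card before use.

```shell
# Hypothetical launch command - check the exact model ID and parser names
# on the model card; flag spellings follow vLLM's OpenAI-compatible server.
vllm serve zai-org/GLM-4.7-Flash \
  --tensor-parallel-size 1 \
  --enable-auto-tool-choice \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --max-model-len 131072
```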

Pricing and Availability

GLM-4.7-Flash is MIT licensed - the most permissive license you can get. Weights are on HuggingFace and ModelScope. Z.AI offers a free API tier with no reported daily usage caps, though it is limited to one concurrent request. Third-party providers like Novita offer the model at $0.07/M input tokens and $0.40/M output tokens for higher throughput needs.

For self-hosting, the model is available through Ollama, vLLM, SGLang, and standard HuggingFace Transformers. GGUF quantizations from the community (Unsloth and others) provide the easiest path to local deployment.
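A local-run sketch under the assumption that an official Ollama tag and community GGUF files exist as described; the model tag and filename are placeholders.

```shell
# Placeholder names - check the Ollama library / HuggingFace for actual tags.
ollama run glm-4.7-flash

# Or serve a community GGUF directly with llama.cpp:
llama-server -m GLM-4.7-Flash-Q4_K_M.gguf -c 32768 --port 8080
```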

Strengths

  • 59.2% SWE-bench Verified is exceptional for any model, let alone one with 3B active parameters
  • MIT license - no restrictions, no fine print, full commercial freedom
  • Free API tier with no daily caps from Z.AI
  • Runs on consumer GPUs (16GB VRAM quantized) at interactive speeds
  • Strong agentic tool use (tau2-Bench 79.5%) for multi-step workflows

Weaknesses

  • English and Chinese only - no multilingual breadth
  • 128K context window is modest next to same-class competitors offering 1M+ tokens
  • LiveCodeBench v6 (64.0) trails Qwen3-30B-A3B despite dominating SWE-bench - inconsistent coding profile
  • Text-only model - no vision or multimodal capabilities
  • Self-reported benchmarks from Z.AI - the SWE-bench gap over competitors warrants independent verification
  • Chinese AI lab with limited transparency into training data composition

About the author

James, AI Benchmarks & Tools Analyst, is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.