Qwen3.5-35B-A3B vs GLM-4.7-Flash: Two Chinese MoE Models, Very Different Strengths
Head-to-head comparison of Qwen3.5-35B-A3B and GLM-4.7-Flash - two Chinese-origin MoE models with ~3B active parameters and Apache 2.0/MIT licenses that dominate different benchmarks despite near-identical parameter budgets.

Here is a puzzle for anyone who thinks model quality is a simple function of parameter count. Qwen3.5-35B-A3B and GLM-4.7-Flash are 30-35B total parameter MoE models activating roughly 3B parameters per token, both from Chinese AI labs, both under maximally permissive open-source licenses. Qwen dominates on knowledge benchmarks (MMLU-Pro 85.3 vs GLM's ~60) and reasoning (GPQA 84.2 vs 75.2). But GLM posts 59.2% on SWE-bench Verified, a gap to Qwen's 69.2% far smaller than their general-intelligence scores would predict. And on agentic tool use, GLM's tau2-Bench 79.5% is competitive with Qwen's TAU2-Bench 81.2%.
The question is not which model is "better." It is why two models with similar compute budgets, from the same Chinese AI research ecosystem, developed such different capability profiles - and which profile matches your actual workload.
TL;DR
- Choose Qwen3.5-35B-A3B if you need broad intelligence (MMLU-Pro 85.3), strong reasoning (GPQA 84.2), native multimodal (text/image/video), and the highest overall benchmark scores in this weight class
- Choose GLM-4.7-Flash if you need a free, MIT-licensed coding agent with strong SWE-bench (59.2%) and tool-use scores (tau2-Bench 79.5%) that runs on a 24GB GPU at 60-80 tokens/sec
Quick Comparison
| Feature | Qwen3.5-35B-A3B | GLM-4.7-Flash |
|---|---|---|
| Provider | Alibaba Cloud (Qwen) | Z.AI (Zhipu AI) |
| Release Date | February 24, 2026 | January 19, 2026 |
| Total Parameters | 35B | 30B |
| Active Parameters | 3B | ~3B |
| Architecture | Gated DeltaNet + Sparse MoE | Sparse MoE (Transformer) |
| Experts | 256 routed (8 active + 1 shared) | 64 routed (4 active + 1 shared) |
| Context Window | 262K (1M extended) | 128K (max 202K) |
| Modalities | Text, Image, Video | Text only |
| Languages | 201 | English, Chinese |
| License | Apache 2.0 | MIT |
| Free API | Qwen Chat | Z.AI API (no caps) |
| Best For | Broad knowledge, reasoning, multimodal | Coding, agentic tool use, budget deployment |
Qwen3.5-35B-A3B: Broad Intelligence Leader
Qwen3.5-35B-A3B is the headline release of Alibaba's Qwen 3.5 Medium Series, and it makes the case that a 3B-active model can post scores that match or exceed last generation's flagships. MMLU-Pro at 85.3 and GPQA Diamond at 84.2 are the numbers that get the most attention - and they should, because these are frontier-class knowledge and reasoning scores from a model that activates only about twice as many parameters per token as GPT-2 (1.5B).
The architecture is a hybrid of Gated DeltaNet (a form of linear attention with O(n) sequence scaling) and standard softmax attention in a 3:1 ratio across 40 layers. The MoE layer is aggressively sparse: 256 total experts, 8 routed plus 1 shared per token. This extreme sparsity - only about 3.5% of experts activate per token - is what allows the model to pack 35B parameters of knowledge into a 3B inference budget.
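The sparsity claim is easy to verify from the expert counts quoted above. A minimal sketch (the helper function is ours, not from either model card; it counts the shared expert as always-active):

```python
def expert_sparsity(routed_total: int, routed_active: int, shared: int) -> float:
    """Fraction of all experts that fire for each token (shared experts always fire)."""
    return (routed_active + shared) / (routed_total + shared)

# Expert configurations as listed in the comparison table.
qwen = expert_sparsity(256, 8, 1)   # 9 of 257 experts
glm = expert_sparsity(64, 4, 1)     # 5 of 65 experts
print(f"Qwen3.5-35B-A3B: {qwen:.1%} of experts active per token")  # ~3.5%
print(f"GLM-4.7-Flash:   {glm:.1%} of experts active per token")   # ~7.7%
```

Qwen's routing is roughly twice as sparse as GLM's, which is how it fits more total parameters into the same ~3B active budget.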
Where Qwen pulls ahead most decisively is breadth. It is natively multimodal (text, image, video via early fusion), supports 201 languages, and posts strong scores across vision benchmarks like MMMU (81.4), MathVision (83.9), and OCRBench (91.0). It is not just a text model - it is a general-purpose reasoning system that happens to be efficient enough to run on consumer hardware. The TAU2-Bench score of 81.2 for agentic tasks and ScreenSpot Pro 68.6 for GUI grounding show that this breadth extends to real-world interaction tasks.
On coding specifically, Qwen posts SWE-bench Verified 69.2% and LiveCodeBench v6 74.6%. These are strong numbers, but as we will see, the gap to GLM on SWE-bench is smaller than the gap on knowledge benchmarks. The model is Apache 2.0 licensed, with quantized GGUF versions available for consumer deployment and a managed API through Qwen3.5-Flash at $0.10/M input tokens.
GLM-4.7-Flash: The Coding Specialist
GLM-4.7-Flash is a different animal. Where Qwen optimizes for broad intelligence, GLM optimizes for two specific capabilities: code repair and agentic tool use. The results are striking. SWE-bench Verified at 59.2% is nearly 3x the score of the similarly-sized Qwen3-30B-A3B (22.0%) from the previous generation, and within striking distance of models with far larger parameter budgets. The tau2-Bench score of 79.5% for multi-step tool invocation is equally impressive - this model can reliably sequence API calls, file operations, and code modifications in agentic workflows.
The architecture is straightforward: 47 layers, 2048 hidden dimension, 64 routed experts with 4 active per token plus 1 shared. No hybrid attention mechanisms, no Mamba layers, no architectural novelties. The performance comes from training methodology and data quality. Z.AI describes this as part of their "ARC" (Agentic, Reasoning, and Coding) model line, built on the foundation described in arXiv 2508.06471. The model ships with a "preserved thinking" mode for maintaining reasoning context across multi-turn conversations and a step-by-step thinking mode for debugging.
The practical deployment story is what makes GLM-4.7-Flash compelling for individual developers and small teams. With Q4_K_M quantization, it fits in approximately 16GB of VRAM - comfortably within the 24GB envelope of an RTX 3090 or 4090, with room for the 128K context window. Reported inference speeds of 60-80+ tokens per second on consumer cards make it genuinely interactive. The MIT license means zero restrictions, and Z.AI offers a free API tier with no reported daily usage caps. For a developer who wants a competent coding assistant running entirely on their own hardware at zero marginal cost, GLM-4.7-Flash is one of the best options available.
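The ~16GB figure follows from simple arithmetic. A rough weight-only estimate, assuming ~4.25 effective bits per weight for Q4-class quantization (actual GGUF footprints vary by quant variant, and KV cache plus runtime overhead add several GB on top):

```python
def quantized_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight-only memory estimate in GB; excludes KV cache, activations, and overhead."""
    # params * 1e9 weights * bits / 8 bits-per-byte / 1e9 bytes-per-GB
    return params_billion * bits_per_weight / 8

print(f"GLM-4.7-Flash   (30B @ 4.25 bpw): {quantized_weights_gb(30, 4.25):.1f} GB")  # ~15.9 GB
print(f"Qwen3.5-35B-A3B (35B @ 4.25 bpw): {quantized_weights_gb(35, 4.25):.1f} GB")  # ~18.6 GB
```

That ~3GB difference is what separates "comfortable on a 24GB card with room for context" from "tight."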
The weaknesses are equally clear. MMLU-Pro at approximately 60 (using the GLM-4.7 non-Flash reference point) is 25 points behind Qwen - a massive gap on broad knowledge tasks. The model is text-only, English and Chinese only, and the 128K context window is modest compared to competitors. It is a specialist, not a generalist.
Detailed Benchmark Comparison
| Benchmark | Qwen3.5-35B-A3B | GLM-4.7-Flash | Winner |
|---|---|---|---|
| MMLU-Pro | 85.3 | ~60.0 | Qwen (+25.3) |
| GPQA Diamond | 84.2 | 75.2 | Qwen (+9.0) |
| AIME 2025 | 89.0 (HMMT) | 91.6 | GLM (+2.6) |
| SWE-bench Verified | 69.2 | 59.2 | Qwen (+10.0) |
| LiveCodeBench v6 | 74.6 | 64.0 | Qwen (+10.6) |
| TAU2-Bench | 81.2 | 79.5 | Close (Qwen +1.7) |
| HLE | - | 14.4 | GLM (Qwen unreported) |
| BrowseComp | 61.0 | 42.8 | Qwen (+18.2) |
| MMMU (Vision) | 81.4 | N/A (text only) | Qwen |
| MathVision | 83.9 | N/A (text only) | Qwen |
| AndroidWorld | 71.1 | N/A | Qwen |
The pattern is clear but nuanced. Qwen leads almost everywhere - but the margins vary enormously. On broad knowledge (MMLU-Pro), Qwen is 25 points ahead. On reasoning (GPQA), 9 points. On coding (SWE-bench), 10 points. On agentic tool use (TAU2-Bench), less than 2 points. And on pure math (AIME 2025), GLM actually wins by 2.6 points.
This tells you something important about GLM-4.7-Flash's training priorities. Z.AI clearly invested heavily in coding capability and agentic tool use at the expense of broad knowledge benchmarks. The result is a model that punches far above its "general intelligence" weight class on the tasks that matter most for developer workflows. If your use case is a coding assistant or an agentic framework that chains tool calls, the gap between 81.2 and 79.5 on TAU2-Bench is negligible. If your use case requires encyclopedic knowledge or multilingual understanding, the 25-point MMLU-Pro gap makes Qwen the only viable choice.
One important caveat: both models' benchmarks are self-reported by their respective labs. The SWE-bench numbers in particular warrant independent verification - GLM's 59.2% is exceptional for any model at this size, and the gap between SWE-bench (where GLM is close to Qwen) and LiveCodeBench (where Qwen leads by 10.6 points) suggests these benchmarks may be measuring different aspects of coding ability. For community-verified scores, check the Open Source LLM Leaderboard and Coding Benchmarks Leaderboard.
Pricing and Deployment
| Factor | Qwen3.5-35B-A3B | GLM-4.7-Flash |
|---|---|---|
| Self-hosted (BF16) | ~70GB VRAM | ~60GB VRAM |
| Self-hosted (Q4) | ~20GB VRAM | ~16GB VRAM |
| Consumer GPU | RTX 4090 (tight) | RTX 3090/4090 (comfortable) |
| Inference speed (Q4, 4090) | Varies (DeltaNet maturing) | 60-80+ tok/s |
| Managed API (input) | $0.10/M (Qwen3.5-Flash) | Free (Z.AI) / $0.07/M (Novita) |
| Managed API (output) | $0.10/M | Free (Z.AI) / $0.40/M (Novita) |
| License | Apache 2.0 | MIT |
| Framework support | vLLM, SGLang (maturing) | vLLM, SGLang, Ollama |
GLM-4.7-Flash has the edge on deployment economics. The smaller quantized footprint (~16GB vs ~20GB), faster consumer-GPU inference (60-80 tok/s is well-characterized), and free Z.AI API with no reported caps make it the cheaper option by every metric. The MIT license is shorter and simpler than Apache 2.0 (no NOTICE-file or explicit patent-grant provisions), though both are maximally open for commercial use.
Qwen's advantage is the managed ecosystem. If you want a hosted API with 1M context, built-in tool calling, and production SLAs, Qwen3.5-Flash provides that at $0.10/M tokens. GLM's free API tier is remarkable for prototyping and individual use, but it is limited to one concurrent request - not suitable for production workloads. Third-party hosting through Novita or similar providers fills the gap at $0.07/M input, but the ecosystem around GLM is smaller than Qwen's.
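Using the prices from the table above, a back-of-envelope comparison shows the paid-tier trade-off depends on your input/output mix (the 50M/10M monthly workload is illustrative; prices as quoted in this post):

```python
def monthly_cost_usd(in_mtok: float, out_mtok: float,
                     in_price: float, out_price: float) -> float:
    """API spend for a month, given token volumes (millions) and $/M-token prices."""
    return in_mtok * in_price + out_mtok * out_price

# Illustrative workload: 50M input / 10M output tokens per month.
print(f"Qwen3.5-Flash:       ${monthly_cost_usd(50, 10, 0.10, 0.10):.2f}")  # $6.00
print(f"GLM-4.7 via Novita:  ${monthly_cost_usd(50, 10, 0.07, 0.40):.2f}")  # $7.50
```

Note the reversal: Novita's higher output price means output-heavy workloads can actually cost more on GLM than on Qwen's flat $0.10/M, while Z.AI's free tier remains $0 for single-request use.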
For self-hosting, GLM's straightforward MoE architecture (standard Transformer, no Gated DeltaNet) means broader framework compatibility today. Qwen's DeltaNet layers are still getting first-class support in vLLM and SGLang, which means potential deployment friction in production. If you are building something this week and need it to work with your existing inference stack, GLM is the safer bet.
Pros and Cons
Qwen3.5-35B-A3B
Pros:
- MMLU-Pro 85.3 and GPQA 84.2 - highest broad intelligence in the 3B active class
- Native multimodal (text, image, video) via early fusion - not available in GLM
- SWE-bench 69.2% and LiveCodeBench 74.6% - strong coding across both benchmarks
- 262K native context, extensible to 1M - roughly double GLM's 128K
- 201 languages vs GLM's 2 - no comparison for multilingual workloads
- Apache 2.0 license with a mature managed API ecosystem
Cons:
- Gated DeltaNet framework support still maturing - deployment friction today
- 256-expert MoE is sensitive to quantization - quality may degrade with aggressive compression
- Higher VRAM footprint (~20GB Q4 vs GLM's ~16GB)
- No free API with unlimited access like Z.AI's offering
- Self-reported benchmarks pending independent verification
GLM-4.7-Flash
Pros:
- SWE-bench Verified 59.2% - exceptional code repair capability for a 3B-active model
- tau2-Bench 79.5% - near-Qwen agentic tool-use performance at a fraction of the knowledge overhead
- MIT license - among the most permissive open-source licenses available
- Free Z.AI API with no reported daily caps
- 16GB VRAM quantized - runs comfortably on RTX 3090/4090 at 60-80 tok/s
- Standard Transformer MoE - broad framework compatibility, no exotic layer types
Cons:
- MMLU-Pro ~60 is 25 points behind Qwen - not viable for knowledge-intensive tasks
- Text-only model - no vision or multimodal capabilities
- English and Chinese only - no multilingual support
- 128K context is less than half of Qwen's 262K native (and far short of the 1M extended)
- Self-reported SWE-bench score warrants independent verification given the gap to knowledge benchmarks
- Smaller ecosystem and community compared to Qwen
Verdict
These two models represent different philosophies about what matters in a 3B-active open-weight model.
Choose Qwen3.5-35B-A3B if you need a general-purpose model that handles knowledge, reasoning, coding, vision, and agentic tasks across the board. The 25-point MMLU-Pro advantage means it will give better answers on factual queries, the multimodal capabilities open up use cases that GLM simply cannot serve, and the 262K-1M context window handles larger inputs. If you are building a product that needs to do many things well - a general AI assistant, a multi-skill agent, or a research tool - Qwen is the right foundation. For the full family lineup, see Qwen3.5-122B-A10B and Qwen3.5-27B.
Choose GLM-4.7-Flash if your workload is primarily coding and agentic tool use, and you want to minimize cost. The SWE-bench and tau2-Bench scores are close enough to Qwen that the practical difference for coding workflows may be smaller than the headline MMLU-Pro gap suggests. The MIT license, free API, and consumer-GPU deployment story make it the most accessible high-capability coding model available. If you are an individual developer running a local coding assistant, GLM is hard to beat on value.
Choose either if you are building an agentic system that chains tool calls - both post TAU2-Bench scores above 79%, which means reliable multi-step tool orchestration. The tie-breaker in that scenario is whether you also need multimodal input, broad knowledge, or multilingual support (Qwen) versus minimal deployment cost and maximum licensing freedom (GLM). For a broader perspective on how these models fit into the open-source landscape, see our open source vs proprietary AI guide.
Sources:
- Qwen3.5-35B-A3B Model Card (HuggingFace)
- Qwen3.5: Towards Native Multimodal Agents (Blog)
- GLM-4.7-Flash Model Card (HuggingFace)
- GLM-4.7 Developer Documentation (Z.AI)
- Zhipu AI Releases GLM-4.7-Flash (MarkTechPost)
- GLM-4.7-Flash Launch Analysis (llm-stats.com)
- GLM-4.5: Agentic, Reasoning, and Coding Foundation Models (arXiv 2508.06471)
- Qwen3.5 Developer Guide (NxCode)
