Qwen3.5-9B

Qwen3.5-9B is the largest model in the Qwen 3.5 Small Series and arguably the most impressive given what it achieves at its size. At 9 billion parameters, it beats the previous generation's Qwen3-30B on most language benchmarks and dominates GPT-5-Nano and Gemini-2.5-Flash-Lite on vision tasks by double-digit margins.

TL;DR

  • 9B dense model with Gated DeltaNet hybrid attention - beats Qwen3-30B (3x its size) on GPQA, IFEval, LongBench
  • Beats GPT-5-Nano on MMMU-Pro (70.1 vs 57.2), MathVision (78.9 vs 62.2), OmniDocBench (87.7 vs 55.9)
  • 262K native context, extendable to ~1M with YaRN - 201 languages, multi-token prediction
  • Natively multimodal: text, images, and video from the same weights
  • Apache 2.0 - runs on a single consumer GPU (~5GB VRAM at 4-bit quantization)

The 9B strikes a rare sweet spot: it's small enough to run on consumer hardware but capable enough to compete with models 3-9x its size on serious benchmarks. At BF16, it fits on a single RTX 3090 or any 24GB GPU. With 4-bit quantization, it drops to roughly 5GB - viable on an RTX 3060 12GB or an M1 Mac with room to spare.

Key Specifications

| Specification | Details |
| --- | --- |
| Provider | Alibaba Cloud (Qwen) |
| Model Family | Qwen 3.5 |
| Architecture | Gated DeltaNet + Gated Attention (3:1 hybrid) |
| Total Parameters | 9B |
| Active Parameters | 9B (all - dense) |
| Layers | 32 |
| Hidden Dimension | 4,096 |
| FFN Intermediate Dimension | 12,288 |
| Attention Pattern | 8 x (3 x (Gated DeltaNet -> FFN) -> 1 x (Gated Attention -> FFN)) |
| Gated DeltaNet Heads | 32 for V, 16 for QK; Head Dim: 128 |
| Gated Attention Heads | 16 Q, 4 KV; Head Dim: 256; RoPE Dim: 64 |
| Context Window | 262,144 tokens (native), ~1M (YaRN extended) |
| Max Output | 65,536 tokens |
| Input Modalities | Text, Image, Video |
| Vocabulary | 248,320 tokens |
| Languages | 201 |
| Training | Multi-Token Prediction (MTP), strong-to-weak distillation |
| Thinking Mode | Enabled (toggleable via enable_thinking parameter) |
| Release Date | March 2, 2026 |
| License | Apache 2.0 |

The 9B uses the deeper 32-layer stack (shared with the 4B) and the widest hidden dimension (4,096) among the small models. Its FFN intermediate dimension of 12,288 gives it meaningful capacity for world knowledge and reasoning. The vision encoder is a DeepStack Vision Transformer with Conv3d patch embeddings for temporal video understanding.
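The 3:1 interleaving works out exactly to the 32-layer stack above. A toy sketch of the pattern (layer names here are illustrative labels, not the model's actual module names):

```python
def layer_pattern(blocks: int = 8) -> list[str]:
    """Build the hybrid layer order from the spec table: each of the 8
    blocks is 3 Gated DeltaNet layers followed by 1 Gated Attention
    layer, with every layer trailed by its own FFN (omitted here)."""
    return (["deltanet"] * 3 + ["attention"]) * blocks

stack = layer_pattern()
print(len(stack), stack.count("attention"))  # 32 layers, 8 full-attention
```

Only a quarter of the layers pay the quadratic attention cost, which is where the long-context memory savings come from.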

Benchmark Performance

Language (Thinking Mode)

| Benchmark | Qwen3.5-9B | Qwen3.5-4B | Qwen3-30B | Qwen3-80B | Qwen3.5-27B |
| --- | --- | --- | --- | --- | --- |
| MMLU-Pro | 82.5 | 79.1 | 80.9 | 82.7 | 86.1 |
| C-Eval | 88.2 | 85.1 | 87.4 | 89.7 | - |
| SuperGPQA | 58.2 | 52.9 | 56.8 | 60.8 | - |
| GPQA Diamond | 81.7 | 76.2 | 73.4 | 77.2 | 85.5 |
| IFEval | 91.5 | 89.8 | 88.9 | 88.9 | 95.0 |
| AA-LCR (Long Context) | 63.0 | 57.0 | 49.0 | 51.7 | - |
| LongBench v2 | 55.2 | 50.0 | 44.8 | 48.0 | - |

The 9B beats the previous Qwen3-30B on GPQA Diamond by 8 points, IFEval by 3 points, and LongBench v2 by 10 points. It even edges past Qwen3-80B on GPQA Diamond and IFEval - models that required an A100 cluster to run.

Vision-Language

| Benchmark | Qwen3.5-9B | GPT-5-Nano | Gemini-2.5-Flash-Lite | Qwen3.5-27B |
| --- | --- | --- | --- | --- |
| MMMU | 78.4 | 75.8 | - | 82.3 |
| MMMU-Pro | 70.1 | 57.2 | 59.7 | - |
| MathVision | 78.9 | 62.2 | 52.1 | 86.0 |
| MathVista (mini) | 85.7 | 71.5 | 72.8 | 87.8 |
| RealWorldQA | 80.3 | 71.8 | 72.2 | - |
| MMBench | 90.1 | - | 82.7 | - |
| OmniDocBench 1.5 | 87.7 | 55.9 | 79.4 | - |
| VideoMME (w/ sub) | 84.5 | - | 74.6 | - |

The vision results are the 9B's strongest selling point. It beats GPT-5-Nano by 13 points on MMMU-Pro, 17 points on MathVision, and 32 points on document understanding. Against the 27B (3x its size), it trails by only 4 points on MMMU and 7 points on MathVision - a narrow gap for a 3x reduction in compute.

Key Capabilities

Native multimodal at every level - Unlike Qwen 3 which needed separate VL model variants, the 9B handles text, images, and video from the same weights. The DeepStack Vision Transformer uses Conv3d patch embeddings to capture temporal dynamics in video, and merges features from multiple encoder layers rather than just the final one. VideoMME at 84.5 with subtitles shows genuine video comprehension, not just frame sampling.
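Since the same weights take all modalities, a request only differs in its message payload. A sketch of the content-list chat format Qwen's VL processors use (the `type`/`image` field names follow the Qwen2.5-VL convention and are an assumption here; verify against the Qwen3.5 model card):

```python
def vision_message(question: str, image_url: str) -> list[dict]:
    """Build a single-turn multimodal chat message. Content is a list of
    typed parts; the processor interleaves image tokens with the text.
    Field names are assumed from Qwen-VL documentation."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image_url},
            {"type": "text", "text": question},
        ],
    }]

msgs = vision_message("What does this chart show?", "https://example.com/chart.png")
```

Video follows the same shape with a video-typed part instead of an image part.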

Long context without the cost - The Gated DeltaNet linear attention maintains constant memory complexity, enabling 262K native context with extension to roughly 1M tokens via YaRN. LongBench v2 at 55.2 and AA-LCR at 63.0 confirm the model actually uses this context effectively, not just theoretically supports it. For comparison, the previous Qwen3-80B scored 48.0 on LongBench v2 despite being nearly 9x larger.
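The ~1M figure follows from a YaRN scaling factor of 4 over the native window. A minimal sketch of the arithmetic and the factor-based rope-scaling override Qwen documents for earlier releases (the exact config keys for 3.5 are an assumption; check the model card):

```python
NATIVE_CTX = 262_144  # native window from the spec table

# A factor of 4 stretches 262,144 positions to ~1M tokens.
factor = 4.0
extended_ctx = int(NATIVE_CTX * factor)
print(extended_ctx)  # 1048576

rope_scaling = {  # hypothetical config fragment, keys assumed from Qwen3 docs
    "rope_type": "yarn",
    "factor": factor,
    "original_max_position_embeddings": NATIVE_CTX,
}
```

Qwen's earlier guidance was to apply such an override only when prompts actually exceed the native window, since static scaling can slightly degrade short-context quality.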

Thinking and non-thinking modes - The model supports both explicit chain-of-thought reasoning (enclosed in <think>...</think> tags) and direct response generation. Thinking mode is enabled by default and can be toggled via the enable_thinking API parameter or /think and /no_think tags. Thinking mode pushes GPQA Diamond to 81.7; non-thinking mode still delivers competitive results for latency-sensitive applications.
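Downstream code usually strips the reasoning before showing the answer. A minimal helper for the <think>...</think> convention described above (the tag format is from the docs; the helper itself is illustrative):

```python
import re

def split_thinking(response: str) -> tuple[str, str]:
    """Split a thinking-mode completion into (reasoning, answer).
    Non-thinking responses come back with empty reasoning."""
    m = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if m is None:
        return "", response.strip()
    return m.group(1).strip(), response[m.end():].strip()

reasoning, answer = split_thinking("<think>17 * 24 = 408</think>The answer is 408.")
print(answer)  # The answer is 408.
```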

Pricing and Availability

Apache 2.0 licensed and available on HuggingFace and ModelScope. A base model (Qwen3.5-9B-Base) is also available for fine-tuning and research. Multiple quantized variants are available.

| Deployment Option | VRAM Required | Notes |
| --- | --- | --- |
| BF16 (full precision) | ~18 GB | RTX 3090/4090 24GB, A100 |
| 8-bit quantization | ~9 GB | RTX 3060 12GB, M1 Pro Mac |
| 4-bit quantization | ~5 GB | RTX 3060, M1 Mac, most modern GPUs |
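These figures follow from simple weight-size arithmetic; a back-of-envelope sketch (the flat ~0.5 GB overhead constant is an assumption, and real usage adds activation and KV/recurrent-state memory that grows with context length):

```python
def weight_vram_gb(params_billion: float, bits_per_weight: int,
                   overhead_gb: float = 0.5) -> float:
    """Estimate VRAM for model weights alone: params x bits / 8 bytes,
    plus a flat overhead (assumed) for buffers and runtime state."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return round(bytes_total / 1e9 + overhead_gb, 1)

for bits in (16, 8, 4):
    print(bits, weight_vram_gb(9, bits))
# 16 -> 18.5, 8 -> 9.5, 4 -> 5.0 (close to the table's ~18/~9/~5 GB)
```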

Supported inference frameworks: vLLM (recommended), SGLang (recommended), llama.cpp, MLX (Apple Silicon), Hugging Face Transformers, and KTransformers.

Strengths

  • Outperforms Qwen3-30B (3x its size) on GPQA, IFEval, and long-context benchmarks
  • Beats GPT-5-Nano on vision: MMMU-Pro +13, MathVision +17, OmniDocBench +32
  • 262K to 1M context window from just 9B parameters - constant memory via DeltaNet
  • Natively multimodal: text, images, and video without adapter overhead
  • Runs on a single consumer GPU - RTX 4090 at BF16, RTX 3060 at 4-bit
  • 201 languages with 248K vocabulary - strong multilingual coverage
  • VideoMME 84.5 rivals models 10x its size

Weaknesses

  • Trails the Qwen3.5-27B by 4-8 points on top-end benchmarks (MMLU-Pro 82.5 vs 86.1)
  • All benchmarks are self-reported - independent validation pending
  • Gated DeltaNet architecture needs framework-level support (vLLM, SGLang, etc.) to reach full inference speed
  • 9B all-active means higher per-token cost than the 35B-A3B MoE (3B active)
  • No managed API specifically aligned with this model - self-hosting only
  • Thinking mode adds latency and token cost for simple tasks

About the author

James, AI Benchmarks & Tools Analyst, is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.