Qwen3.5-0.8B
Qwen3.5-0.8B is the smallest natively multimodal model in the Qwen 3.5 family - 0.8B parameters handling text, images, and video with a 262K context window. MathVista 62.2, OCRBench 74.5. Apache 2.0.

Qwen3.5-0.8B is the floor of the Qwen 3.5 Small Series - and it's a remarkably high floor. At under a billion parameters, this model natively processes text, images, and video, supports 262K context, covers 201 languages, and posts MathVista 62.2 and OCRBench 74.5. It runs in around 1.6GB of VRAM at full precision and fits on a phone with 4-bit quantization.
TL;DR
- 0.8B dense model - the smallest natively multimodal model in the Qwen 3.5 family
- Handles text, images, and video from under a billion parameters
- MathVista 62.2, OCRBench 74.5, VideoMME 63.8 (w/subtitles)
- 262K native context, 201 languages, multi-token prediction
- Apache 2.0 - ~1.6GB VRAM at BF16, ~0.5GB at 4-bit
The 0.8B exists for devices where every megabyte counts. Phone apps, embedded systems, IoT devices, and Raspberry Pi deployments can run a model that processes screenshots, reads documents, and even handles basic video understanding. A year ago, that required a 7B model and a serious GPU.
Key Specifications
| Specification | Details |
|---|---|
| Provider | Alibaba Cloud (Qwen) |
| Model Family | Qwen 3.5 |
| Architecture | Gated DeltaNet + Gated Attention (3:1 hybrid) |
| Total Parameters | 0.8B |
| Active Parameters | 0.8B (all active; dense model) |
| Layers | 24 |
| Hidden Dimension | 1,024 |
| FFN Intermediate Dimension | 3,584 |
| Attention Pattern | 6 x (3 x (Gated DeltaNet -> FFN) -> 1 x (Gated Attention -> FFN)) |
| Gated DeltaNet Heads | 16 for V, 16 for QK; Head Dim: 128 |
| Gated Attention Heads | 8 Q, 2 KV; Head Dim: 256; RoPE Dim: 64 |
| Context Window | 262,144 tokens (native) |
| Max Output | 65,536 tokens |
| Input Modalities | Text, Image, Video |
| Vocabulary | 248,320 tokens |
| Languages | 201 |
| Training | Multi-Token Prediction (MTP), strong-to-weak distillation |
| Thinking Mode | Non-thinking (default) |
| Release Date | March 2, 2026 |
| License | Apache 2.0 |
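The 3:1 attention pattern from the table can be written out as a simple layer schedule. A minimal sketch - the block structure comes from the spec table above, the layer names are our own labels:

```python
# Sketch of the Qwen3.5-0.8B layer schedule from the spec table:
# 6 blocks, each 3x (Gated DeltaNet -> FFN) followed by 1x (Gated Attention -> FFN).
def layer_schedule(num_blocks: int = 6) -> list[str]:
    layers = []
    for _ in range(num_blocks):
        layers += ["gated_deltanet"] * 3 + ["gated_attention"]
    return layers

schedule = layer_schedule()
assert len(schedule) == 24                     # matches the 24-layer stack
assert schedule.count("gated_attention") == 6  # one full-attention layer per block
```

Every fourth layer is full Gated Attention; the rest are linear-time Gated DeltaNet, which is what keeps long-context memory costs flat.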
The 0.8B shares the 24-layer stack with the 2B but halves the hidden dimension (1,024 vs 2,048) and uses a smaller FFN (3,584 vs 6,144). The attention architecture is identical - same head dimensions, same 3:1 DeltaNet-to-Attention ratio.
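Halving the hidden dimension shrinks each FFN block substantially. A back-of-envelope comparison, assuming a SwiGLU-style FFN with three weight matrices (gate, up, down) as in earlier Qwen releases - the exact Qwen3.5 FFN layout is our assumption here:

```python
# Per-layer FFN parameter count, assuming a SwiGLU-style FFN with three
# weight matrices (gate, up, down). The actual Qwen3.5 FFN layout is an
# assumption; the dimensions are from the spec table and the 2B comparison.
def ffn_params(hidden: int, intermediate: int) -> int:
    return 3 * hidden * intermediate

small = ffn_params(1024, 3584)  # Qwen3.5-0.8B dims
large = ffn_params(2048, 6144)  # Qwen3.5-2B dims quoted above

print(f"0.8B FFN/layer: {small / 1e6:.1f}M params")  # 11.0M
print(f"2B   FFN/layer: {large / 1e6:.1f}M params")  # 37.7M
print(f"ratio: {large / small:.1f}x")                # 3.4x
```

Under that assumption, each FFN block in the 0.8B is roughly 3.4x smaller than in the 2B, even though both models run the same 24-layer stack.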
Benchmark Performance
Language (Non-Thinking Mode)
| Benchmark | Qwen3.5-0.8B |
|---|---|
| MMLU-Pro | 29.7 |
| MMLU-Redux | 48.5 |
| C-Eval | 46.4 |
| SuperGPQA | 16.9 |
| IFEval | 52.1 |
| MMMLU | 34.1 |
Language benchmarks are honest about the 0.8B's limits. This isn't a model for graduate-level reasoning or complex instruction following. MMLU-Pro at 29.7 and SuperGPQA at 16.9 show the constraints of sub-billion parameter language modeling. IFEval at 52.1 is functional but not reliable for production instruction following.
Vision-Language
| Benchmark | Qwen3.5-0.8B | Qwen3.5-2B | Qwen3.5-9B |
|---|---|---|---|
| MMMU | 49.0 | 64.2 | 78.4 |
| MathVista (mini) | 62.2 | 76.7 | 85.7 |
| RealWorldQA | 63.4 | 74.5 | 80.3 |
| MMStar | 58.3 | - | 79.7 |
| MMBench (EN-DEV) | 69.9 | 83.3 | 90.1 |
| OCRBench | 74.5 | 84.5 | - |
| VideoMME (w/sub) | 63.8 | 75.6 | 84.5 |
| RefCOCO (avg) | 79.3 | 84.8 | - |
Vision is where the 0.8B punches above its weight. OCRBench at 74.5 means useful document text extraction from a phone-sized model. MathVista at 62.2 and VideoMME at 63.8 demonstrate genuine multimodal reasoning. RefCOCO at 79.3 (visual grounding - pointing to objects in images) is surprisingly strong for a model this small.
Key Capabilities
Multimodal from under a billion parameters - This is the smallest model in the Qwen 3.5 family and one of the smallest natively multimodal models publicly available. It processes images, video, and text through the same architecture and weights. The DeepStack Vision Transformer with Conv3d handles temporal video understanding even at this scale.
Edge and mobile deployment - At ~1.6GB VRAM (BF16) or ~0.5GB (4-bit quantization), the 0.8B runs on phone-class hardware. Seven quantized variants are available on HuggingFace, including GGUF formats compatible with llama.cpp for CPU-only deployment. This opens multimodal AI to IoT devices, embedded systems, and offline mobile applications.
262K context window - The Gated DeltaNet architecture provides constant-memory attention, enabling 262K token context even at 0.8B parameters. This means the model can process long documents, multi-page PDFs, or extended conversation histories without running out of memory on edge devices.
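To see why the hybrid layout matters at 262K tokens, compare the KV cache a hypothetical all-attention 24-layer stack would need against the 6 attention layers Qwen3.5-0.8B actually uses. A back-of-envelope estimate using the head counts from the spec table (BF16, 2 bytes per value; this is our arithmetic, not an official figure):

```python
# Back-of-envelope KV-cache size at full 262K context (BF16, 2 bytes/value).
# Numbers from the spec table: 2 KV heads, head dim 256, 24 layers total,
# of which only 6 are Gated Attention. The 18 DeltaNet layers keep a
# fixed-size recurrent state instead of a cache that grows with length.
KV_HEADS, HEAD_DIM, SEQ_LEN, BYTES = 2, 256, 262_144, 2

def kv_cache_bytes(attn_layers: int) -> int:
    # K and V tensors for each attention layer
    return 2 * KV_HEADS * HEAD_DIM * SEQ_LEN * BYTES * attn_layers

full_attention = kv_cache_bytes(24)  # hypothetical all-attention stack
hybrid = kv_cache_bytes(6)           # Qwen3.5-0.8B's 6 attention layers

print(f"all-attention: {full_attention / 2**30:.1f} GiB")  # 12.0 GiB
print(f"hybrid 3:1:    {hybrid / 2**30:.1f} GiB")          # 3.0 GiB
```

The hybrid design cuts the growing cache by 4x, and the remaining DeltaNet state stays constant regardless of context length - which is what makes 262K plausible on memory-constrained hardware.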
Pricing and Availability
Apache 2.0 licensed and available on HuggingFace and ModelScope. A base model (Qwen3.5-0.8B-Base) and seven quantized variants are available.
| Deployment Option | VRAM Required | Notes |
|---|---|---|
| BF16 (full precision) | ~1.6 GB | Any modern GPU, M1 Mac |
| 8-bit quantization | ~0.8 GB | Mobile GPUs, integrated graphics |
| 4-bit quantization | ~0.5 GB | Phones, Raspberry Pi, CPU-only |
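The VRAM figures in the table follow directly from the parameter count. A quick weights-only sanity check (real deployments add activation and cache overhead, which is why the table's 4-bit figure sits a bit above the raw weight size):

```python
# Weights-only memory estimate for a 0.8B-parameter model at each precision.
# Activation and KV/state overhead come on top, so the table's ~0.5 GB
# 4-bit figure is slightly above the raw 0.4 GB of weights computed here.
PARAMS = 0.8e9

for name, bits in [("BF16", 16), ("8-bit", 8), ("4-bit", 4)]:
    gb = PARAMS * bits / 8 / 1e9
    print(f"{name}: {gb:.1f} GB")  # 1.6 GB / 0.8 GB / 0.4 GB
```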
Strengths
- Smallest natively multimodal model in Qwen 3.5 - text, images, video from 0.8B params
- OCRBench 74.5 and RefCOCO 79.3 enable practical vision tasks on edge devices
- 262K context from constant-memory DeltaNet attention
- ~0.5GB at 4-bit quantization - runs on phones and embedded hardware
- 201 languages with 248K vocabulary
- 7 quantized variants including GGUF for llama.cpp
- Apache 2.0 with base model for fine-tuning
Weaknesses
- Language benchmarks are limited - MMLU-Pro 29.7, IFEval 52.1
- Non-thinking mode only - no chain-of-thought reasoning available
- Not suitable for complex reasoning, coding, or multi-step instruction following
- Knowledge capacity is constrained by the small parameter count
- Self-reported benchmarks - independent validation pending
- Vision quality degrades noticeably versus the 2B (OCRBench 74.5 vs 84.5)
Related Coverage
- Qwen 3.5 Small Series Ships Four Models
- Qwen3.5-2B
- Qwen3.5-4B
- Qwen3.5-9B
- Qwen 3.5 Medium Series Drops Four Models
- Open Source LLM Leaderboard