Google Gemma 4 QAT Fits Frontier AI in Under 1GB
Google DeepMind's new QAT checkpoints shrink the Gemma 4 E2B model to under 1GB, making serious on-device AI viable for phones and budget laptops.

Google DeepMind published new quantization-aware training (QAT) checkpoints for the Gemma 4 family on June 5, cutting the E2B model's memory footprint to under 1GB and opening a new class of device - smartphones, low-end laptops, single-board computers - to capable local inference.
Key Specs
| Variant | Format | Memory Footprint | Target Hardware |
|---|---|---|---|
| Gemma 4 E2B text-only | Mobile QAT | Under 1 GB | Smartphones, NPUs |
| Gemma 4 E2B (Q4_0) | W4A16 | Reduced vs FP16 | Laptops, consumer GPUs |
| Gemma 4 E4B | QAT | Reduced vs FP16 | Mid-range devices |
| Gemma 4 12B / 26B MoE | Standard (prior) | 16 GB+ | Desktop, server |
Why QAT Is Different From Standard Compression
Most quantization you encounter is post-training quantization (PTQ): take a finished model, compress its weights, accept the quality loss. It's fast to apply but the model has no chance to adapt to the precision reduction during learning.
QAT works differently. The quantization simulation runs inside the training loop itself, so the model's weights can adjust to compensate for reduced precision as they're being optimized. The result, according to Google DeepMind's Olivier Lacombe and Omar Sansevorio, is that "QAT results yield even higher overall quality compared to standard PTQ baselines." That's the promise of the technique, and it's why this release matters even for developers already running PTQ-compressed Gemma models.
The practical gap between PTQ and QAT shows up most in small models, where every parameter counts. A 2B-parameter model that you quantize naively after training loses a noticeable fraction of its capability. One trained with quantization awareness tends to hold that capability much better.
The Mobile Format - Four Optimizations at Once
Google released two quantization formats: the general Q4_0 (4-bit weights with 16-bit activations, W4A16) and a mobile-specialized format that goes further. The mobile format combines four techniques:
Static Activations
Rather than computing per-token scaling factors at inference time, the model pre-calculates and freezes activation scales during QAT. This reduces load on mobile processors - specifically the NPUs found in modern phone SoCs - and cuts latency on batch-constrained devices.
Channel-Wise Quantization
Weights are structured channel-by-channel to align with how mobile accelerator hardware actually performs matrix multiplication. This lets the NPU run native calculations rather than working around mismatched data layouts.
Targeted 2-Bit Quantization
Token generation layers receive the heaviest compression (2-bit quantization), while attention heads and feed-forward components that handle reasoning retain higher precision. Selectively crushing the parts that tolerate it is a cleaner approach than uniform compression across every layer.
Embedding and KV Cache Optimization
Vocabulary embeddings and the key-value cache get dedicated compression treatment. This matters specifically for conversational workloads where context accumulates across turns and KV cache memory can dominate the footprint.
Google's Gemma 4 QAT release targets smartphones and laptops with multiple simultaneous compression techniques.
Source: androidauthority.com
Supported Frameworks
One thing Google has gotten right with the Gemma series is distribution breadth. The QAT checkpoints ship with support for ten inference stacks at launch:
- llama.cpp - GGUF format
- vLLM - compressed tensor format
- Ollama, LM Studio - consumer-friendly local runners
- MLX - Apple Silicon path
- LiteRT-LM - Google's own mobile inference runtime
- Transformers.js - browser and Node.js deployments
- SGLang - batched server deployments
- Hugging Face Transformers
- Unsloth - fine-tuning with quantized base models
The Unsloth inclusion is standout. Most quantization releases target inference only. Supporting QAT bases for fine-tuning means developers can adapt the compressed model to domain-specific tasks without reverting to FP16, which would make edge deployment impractical.
All checkpoints are available on Hugging Face right away.
The QAT variants enable deployment across a much wider range of hardware than the original Gemma 4 release.
Source: androidauthority.com
Context: Gemma 4 So Far
The original Gemma 4 launch in April 2026 introduced a 26B MoE, 31B dense, and two edge variants under Apache 2.0 - Google claimed the highest intelligence-per-parameter of any open model at the time. The AI Edge Gallery app followed in mid-April, putting Gemma 4 E2B and E4B on Android and iOS phones offline. That app already ran the models on-device, but it required downloading full-precision weights, which pushed the storage and RAM requirements higher than most consumer devices handle comfortably.
The QAT release addresses that constraint directly. Getting the E2B below 1GB means it fits in the working memory of nearly any phone sold in the last four years without swapping to storage. The Gemma family has crossed 150 million downloads, so the surface area for this update is large.
What To Watch
The mobile format's static activation quantization trades some flexibility for speed. Models with frozen activation scales can behave oddly on inputs very different from their QAT calibration set - especially very short or very long inputs, or out-of-domain vocabulary. That's not a dealbreaker for most deployments, but it's worth testing against your actual workload before assuming the compressed model matches the FP16 version in your use case.
The 2-bit quantization on token generation layers is aggressive. Benchmark numbers on standard English academic tasks don't always reflect degradation on structured output formats - JSON schemas, code generation, mathematical reasoning with intermediate steps. Developers using Gemma 4 for those tasks specifically should run their own eval on the QAT checkpoints before switching.
Google hasn't published detailed ablation results showing per-benchmark differences between the FP16 baseline, PTQ Q4, and these QAT variants. Those comparisons would make the quality-vs-size tradeoff much more concrete for practitioners. The claim that QAT beats PTQ is credible and consistent with the literature, but the magnitude of the gap in this specific case is still left to developers to measure themselves.
The QAT release moves the Gemma 4 family from "can run on a phone with some effort" to "fits comfortably in phone memory." That's a different deployment category, and for developers building local-first or offline-capable applications, it opens options that were genuinely out of reach at higher precision weights.
Sources:
