Google Gemma 4 QAT Fits Frontier AI in Under 1GB

Google DeepMind published new quantization-aware training (QAT) checkpoints for the Gemma 4 family on June 5, cutting the E2B model's memory footprint to under 1GB and opening a new class of device - smartphones, low-end laptops, single-board computers - to capable local inference.

Key Specs

Variant	Format	Memory Footprint	Target Hardware
Gemma 4 E2B text-only	Mobile QAT	Under 1 GB	Smartphones, NPUs
Gemma 4 E2B (Q4_0)	W4A16	Reduced vs FP16	Laptops, consumer GPUs
Gemma 4 E4B	QAT	Reduced vs FP16	Mid-range devices
Gemma 4 12B / 26B MoE	Standard (prior)	16 GB+	Desktop, server
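
To make the W4A16 row concrete, here is a rough sketch of block-wise 4-bit weight storage in plain Python. The block size of 32 and the 16-bit scale per block are assumptions modeled on how GGUF's Q4_0 layout is commonly described. The rounding scheme is a simplified symmetric variant, not llama.cpp's exact kernel, and every function name is illustrative.

```python
BLOCK_SIZE = 32  # weights per block (assumption modeled on GGUF Q4_0)
QMAX = 7         # largest positive code in a signed 4-bit range [-8, 7]

def quantize_block(weights):
    """Map one block of floats to 4-bit integer codes plus a single scale."""
    amax = max(abs(w) for w in weights)
    scale = amax / QMAX if amax else 1.0  # avoid dividing by zero on an all-zero block
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return scale, codes

def dequantize_block(scale, codes):
    """Recover approximate float weights from codes and scale."""
    return [scale * c for c in codes]

def bits_per_weight(block_size=BLOCK_SIZE, weight_bits=4, scale_bits=16):
    """Effective storage cost once the per-block scale is amortized."""
    return (block_size * weight_bits + scale_bits) / block_size

# Illustrative block of evenly spaced values in [-0.5, 0.5]
block = [i / 31 - 0.5 for i in range(BLOCK_SIZE)]
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
worst_error = max(abs(a - b) for a, b in zip(block, restored))
```

Under these assumptions the effective cost is 4.5 bits per weight, roughly 3.5 times smaller than 16-bit storage, and the worst-case rounding error in a block is about half a quantization step.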

Why QAT Is Different From Standard Compression

Most quantization you encounter is post-training quantization (PTQ): take a finished model, compress its weights, accept the quality loss. It's fast to apply but the model has no chance to adapt to the precision reduction during learning.

QAT works differently. The quantization simulation runs inside the training loop itself, so the model's weights can adjust to compensate for reduced precision as they're being optimized. The result, according to Google DeepMind's Olivier Lacombe and Omar Sanseviero, is that "QAT results yield even higher overall quality compared to standard PTQ baselines." That's the promise of the technique, and it's why this release matters even for developers already running PTQ-compressed Gemma models.

The practical gap between PTQ and QAT shows up most in small models, where every parameter counts. A 2B-parameter model that you quantize naively after training loses a noticeable fraction of its capability. One trained with quantization awareness tends to hold that capability much better.
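
A toy sketch of the difference, in plain Python. This is not Google's training recipe: the one-weight linear model, the quantization step of 0.5, the data, and the names `fake_quant` and `train_qat` are all illustrative assumptions. The forward pass uses the rounded weight, while the backward pass treats rounding as the identity (the straight-through estimator commonly used for QAT).

```python
def fake_quant(w, step):
    """Snap w to the nearest multiple of step, simulating a low-precision weight."""
    return round(w / step) * step

def mse(w_q, b, xs, ys):
    """Mean squared error of the model y = w_q * x + b."""
    return sum((w_q * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def train_qat(w, b, xs, ys, step, lr=0.01, epochs=200):
    """Toy QAT loop that returns the best (loss, quantized weight, bias) seen."""
    n = len(xs)
    # Start from the PTQ state: the trained weight rounded with no adaptation.
    best = (mse(fake_quant(w, step), b, xs, ys), fake_quant(w, step), b)
    for _ in range(epochs):
        w_q = fake_quant(w, step)  # forward pass sees the quantized weight
        errs = [w_q * x + b - y for x, y in zip(xs, ys)]
        # Straight-through estimator: pretend d(w_q)/dw == 1 in the backward pass.
        grad_w = sum(2 * e * x for e, x in zip(errs, xs)) / n
        grad_b = sum(2 * e for e in errs) / n
        w -= lr * grad_w
        b -= lr * grad_b
        loss = mse(fake_quant(w, step), b, xs, ys)
        if loss < best[0]:  # keep the best checkpoint
            best = (loss, fake_quant(w, step), b)
    return best

# Data generated by a "full-precision" model with w = 0.3, b = 0
xs = [1.0, 2.0, 3.0]
ys = [0.3 * x for x in xs]
STEP = 0.5

ptq_loss = mse(fake_quant(0.3, STEP), 0.0, xs, ys)  # round after training
qat_loss, qat_w, qat_b = train_qat(0.3, 0.0, xs, ys, STEP)
```

On this toy problem the QAT loop ends with a lower loss than the PTQ baseline, because the unquantized bias shifts to absorb part of the rounding error. Real models have far more parameters available to compensate, which is the intuition behind the quality claim.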

The Mobile Format - Four Optimizations at Once

Google released two quantization formats: the general Q4_0 (4-bit weights with 16-bit activations, W4A16) and a mobile-specialized format that goes further. The mobile format combines four techniques:

Static Activations

Rather than computing per-token scaling factors at inference time, the model pre-calculates and freezes activation scales during QAT. This reduces load on mobile processors - specifically the NPUs found in modern phone SoCs - and cuts latency on batch-constrained devices.
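
The trade-off can be sketched in a few lines of Python. The 8-bit activation range, the calibration values, and all function names are assumptions for illustration, since the release notes summarized here do not specify the activation bit width.

```python
QMAX = 127  # signed 8-bit range; the bit width is an assumption, not from the release

def dynamic_scale(acts):
    """Per-input scale, recomputed at inference time for each activation vector."""
    return (max(abs(a) for a in acts) or 1.0) / QMAX

def calibrate_static_scale(calibration_batches):
    """One scale computed ahead of time and then frozen."""
    peak = max(abs(a) for batch in calibration_batches for a in batch)
    return (peak or 1.0) / QMAX

def quantize(acts, scale):
    """Round to integer codes and clip anything outside the representable range."""
    return [max(-QMAX, min(QMAX, round(a / scale))) for a in acts]

def dequantize(codes, scale):
    return [c * scale for c in codes]

# Hypothetical calibration data whose largest magnitude is 1.0
calibration = [[0.5, -1.0], [0.25, 0.75]]
static_scale = calibrate_static_scale(calibration)

# An input larger than anything seen during calibration
outlier = [2.0, 0.5]
static_result = dequantize(quantize(outlier, static_scale), static_scale)
dyn = dynamic_scale(outlier)
dynamic_result = dequantize(quantize(outlier, dyn), dyn)
```

With the frozen scale, the value 2.0 is clipped to about 1.0, while the dynamic path recovers it at the cost of an extra pass over the activations for every input. That clipping is the mechanism behind unexpected behavior on inputs far from the calibration data.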

Channel-Wise Quantization

Weights are structured channel-by-channel to align with how mobile accelerator hardware actually performs matrix multiplication. This lets the NPU run native calculations rather than working around mismatched data layouts.
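
Beyond hardware alignment, per-channel scales also tend to preserve accuracy when channels differ widely in magnitude. A minimal sketch, assuming a symmetric 4-bit range and a made-up two-row weight matrix where each row stands for one output channel (names and values are illustrative):

```python
QMAX = 7  # symmetric 4-bit range, an assumption for illustration

def quantize_row(row, scale):
    """Quantize and immediately dequantize one row with the given scale."""
    return [max(-QMAX, min(QMAX, round(w / scale))) * scale for w in row]

def per_tensor(matrix):
    """One scale shared by the whole matrix."""
    scale = (max(abs(w) for row in matrix for w in row) or 1.0) / QMAX
    return [quantize_row(row, scale) for row in matrix]

def per_channel(matrix):
    """A separate scale for each row (output channel)."""
    out = []
    for row in matrix:
        scale = (max(abs(w) for w in row) or 1.0) / QMAX
        out.append(quantize_row(row, scale))
    return out

def max_abs_error(row, approx):
    return max(abs(a - b) for a, b in zip(row, approx))

# Hypothetical weights: one small-magnitude channel, one large-magnitude channel
weights = [[0.01, -0.02, 0.016], [1.0, -2.0, 1.6]]

tensor_err = max_abs_error(weights[0], per_tensor(weights)[0])
channel_err = max_abs_error(weights[0], per_channel(weights)[0])
```

With a single shared scale, the small-magnitude channel rounds entirely to zero. With its own scale, its error drops by roughly an order of magnitude.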

Targeted 2-Bit Quantization

Token generation layers receive the heaviest compression (2-bit quantization), while attention heads and feed-forward components that handle reasoning retain higher precision. Selectively crushing the parts that tolerate it is a cleaner approach than uniform compression across every layer.
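
The footprint effect of mixing bit widths is simple arithmetic. The sketch below uses entirely hypothetical parameter counts, not Gemma 4's actual layer breakdown, and ignores scales and other metadata.

```python
def footprint_bytes(groups):
    """Total weight storage for {name: (parameter_count, bits_per_weight)}."""
    return sum(count * bits for count, bits in groups.values()) / 8

# Hypothetical split of a 2B-parameter model, for illustration only
mixed = {
    "generation_layers": (1_000_000_000, 2),  # heaviest compression
    "attention_and_ffn": (1_000_000_000, 4),  # higher precision retained
}
uniform_4bit = {"all_layers": (2_000_000_000, 4)}

mixed_bytes = footprint_bytes(mixed)
uniform_bytes = footprint_bytes(uniform_4bit)
```

Under these made-up numbers the mixed scheme needs 750 MB against 1 GB for uniform 4-bit, which shows how selective 2-bit compression can pull a model of this size under a 1 GB budget.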

Embedding and KV Cache Optimization

Vocabulary embeddings and the key-value cache get dedicated compression treatment. This matters specifically for conversational workloads where context accumulates across turns and KV cache memory can dominate the footprint.
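
A back-of-the-envelope estimate shows why the cache matters. The formula is the standard one for a transformer KV cache (keys plus values, per layer, per KV head, per position). The layer count, head count, head dimension, and the 8-bit cache precision below are hypothetical, not Gemma 4's published configuration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bits_per_value):
    """Memory for keys and values across all layers at a given context length."""
    values = 2 * n_layers * n_kv_heads * head_dim * seq_len  # 2 = keys + values
    return values * bits_per_value // 8

# Hypothetical model shape, for illustration only
shape = dict(n_layers=30, n_kv_heads=4, head_dim=128, seq_len=8192)

fp16_cache = kv_cache_bytes(**shape, bits_per_value=16)
int8_cache = kv_cache_bytes(**shape, bits_per_value=8)
```

For this made-up shape, an 8K-token context costs about 503 MB at 16 bits and about 252 MB at 8 bits. Against a weight budget of under 1 GB, an uncompressed cache would be a large share of total memory.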

Figure: Gemma 4 QAT model optimization techniques overview. Google's Gemma 4 QAT release targets smartphones and laptops with multiple simultaneous compression techniques. Source: androidauthority.com

Supported Frameworks

One thing Google has gotten right with the Gemma series is distribution breadth. The QAT checkpoints ship with support for ten inference stacks at launch:

llama.cpp - GGUF format
vLLM - compressed tensor format
Ollama, LM Studio - consumer-friendly local runners
MLX - Apple Silicon path
LiteRT-LM - Google's own mobile inference runtime
Transformers.js - browser and Node.js deployments
SGLang - batched server deployments
Hugging Face Transformers
Unsloth - fine-tuning with quantized base models

The Unsloth inclusion stands out. Most quantization releases target inference only. Supporting QAT bases for fine-tuning means developers can adapt the compressed model to domain-specific tasks without reverting to FP16, which would make edge deployment impractical.

All checkpoints are available on Hugging Face right away.

Figure: Gemma 4 QAT comparison across device types. The QAT variants enable deployment across a much wider range of hardware than the original Gemma 4 release. Source: androidauthority.com

Context: Gemma 4 So Far

The original Gemma 4 launch in April 2026 introduced a 26B MoE, 31B dense, and two edge variants under Apache 2.0 - Google claimed the highest intelligence-per-parameter of any open model at the time. The AI Edge Gallery app followed in mid-April, putting Gemma 4 E2B and E4B on Android and iOS phones offline. That app already ran the models on-device, but it required downloading full-precision weights, which pushed the storage and RAM requirements higher than most consumer devices handle comfortably.

The QAT release addresses that constraint directly. Getting the E2B below 1GB means it fits in the working memory of nearly any phone sold in the last four years without swapping to storage. The Gemma family has crossed 150 million downloads, so the surface area for this update is large.

What To Watch

The mobile format's static activation quantization trades some flexibility for speed. Models with frozen activation scales can behave oddly on inputs very different from their QAT calibration set - especially very short or very long inputs, or out-of-domain vocabulary. That's not a dealbreaker for most deployments, but it's worth testing against your actual workload before assuming the compressed model matches the FP16 version in your use case.

The 2-bit quantization on token generation layers is aggressive. Benchmark numbers on standard English academic tasks don't always reflect degradation on structured output formats - JSON schemas, code generation, mathematical reasoning with intermediate steps. Developers using Gemma 4 for those tasks specifically should run their own eval on the QAT checkpoints before switching.
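
One way to start such an eval for structured output is to measure how often each checkpoint produces parseable JSON with the expected keys. This is a minimal sketch: the sample outputs are placeholders rather than real model generations, and `json_valid_rate` is an illustrative helper, not part of any Gemma tooling.

```python
import json

def json_valid_rate(outputs, required_keys=()):
    """Fraction of outputs that parse as a JSON object containing every required key."""
    if not outputs:
        return 0.0
    ok = 0
    for text in outputs:
        try:
            obj = json.loads(text)
        except json.JSONDecodeError:
            continue  # malformed JSON counts as a failure
        if isinstance(obj, dict) and all(k in obj for k in required_keys):
            ok += 1
    return ok / len(outputs)

# Placeholder strings standing in for generations from two checkpoints
baseline_outputs = ['{"name": "a", "age": 1}', '{"name": "b", "age": 2}']
quantized_outputs = ['{"name": "a", "age": 1}', '{"name": "b", "age": ']

baseline_rate = json_valid_rate(baseline_outputs, required_keys=("name", "age"))
quantized_rate = json_valid_rate(quantized_outputs, required_keys=("name", "age"))
```

Running the same prompts through the FP16 and QAT checkpoints and comparing the two rates gives a quick signal before investing in a fuller benchmark.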

Google hasn't published detailed ablation results showing per-benchmark differences between the FP16 baseline, PTQ Q4, and these QAT variants. Those comparisons would make the quality-vs-size tradeoff much more concrete for practitioners. The claim that QAT beats PTQ is credible and consistent with the literature, but the magnitude of the gap in this specific case is still left to developers to measure themselves.

The QAT release moves the Gemma 4 family from "can run on a phone with some effort" to "fits comfortably in phone memory." That's a different deployment category, and for developers building local-first or offline-capable applications, it opens options that were genuinely out of reach with higher-precision weights.
