Name: DiffusionGemma 26B
Author: Google DeepMind

DiffusionGemma 26B is Google DeepMind's first open-weight text diffusion model, released on June 10, 2026. Unlike standard autoregressive language models, it does not produce tokens one at a time. Instead, it denoises entire 256-token blocks in parallel, a technique called block-autoregressive discrete diffusion. The result is roughly 4x faster generation than equivalent autoregressive models, with throughput reaching 1,100+ tokens/sec on a single NVIDIA H100.

TL;DR

Parallel 256-token block generation via discrete diffusion, not autoregressive decoding
25.2B total parameters, 3.8B active at inference - fits in 18 GB VRAM with quantization
Scores 5-19 points below Gemma 4 26B on reasoning benchmarks; 4x faster in return

Overview

The model builds on the Gemma 4 26B MoE backbone, pairing that architecture with a novel diffusion head. The MoE structure - 128 total experts plus one shared, 8 active per token - keeps active parameter count at 3.8B during inference, which is what allows it to fit on consumer and workstation GPUs. The diffusion head replaces standard autoregressive decoding with iterative denoising over a fixed-length canvas.
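To make the "8 active out of 128" idea concrete, here is a minimal sketch of top-k expert routing in plain Python. The expert counts come from the paragraph above; the gating rule (a softmax over the selected logits), the function name `route_token`, and the random router logits are illustrative assumptions, not the model's actual router.

```python
import math
import random

NUM_EXPERTS = 128   # routed experts (from the text above)
TOP_K = 8           # experts activated per token (from the text above)


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k experts for one token and return (expert_id, weight) pairs.

    Weights are a softmax over the selected logits only. This gating rule is
    an assumption made for illustration.
    """
    # Indices of the top_k largest logits.
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Numerically stable softmax restricted to the chosen experts.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # hypothetical router output
routing = route_token(logits)

# Only 8 of the 128 routed experts run for this token; the shared expert
# would run in addition to them.
print(f"routed experts used: {len(routing)} of {NUM_EXPERTS} "
      f"({TOP_K / NUM_EXPERTS:.2%})")
```

Because each token touches only a small slice of the expert weights, compute per token tracks the active parameter count rather than the total.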

For sequences longer than 256 tokens, the model processes blocks sequentially - once a 256-token block is fully denoised, it gets committed to the KV cache, and the next block starts fresh, conditioned on everything committed so far. This block-autoregressive design combines diffusion speed within blocks with autoregressive stability across blocks. Context length is the same 256K tokens inherited from Gemma 4.
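The block-by-block loop described above can be sketched as a toy simulation. This is not the model's decoding code: `toy_denoise_step` is a hypothetical stand-in that fills a fixed number of masked positions per pass with dummy token ids, and the final canvas is truncated rather than padded to keep the example short.

```python
import random

BLOCK_LEN = 256   # canvas length from the text above
MASK = None       # placeholder for a position that is not yet denoised


def toy_denoise_step(canvas, committed, rng, reveal=32):
    """Stand-in for one denoising pass: fill up to `reveal` masked positions.

    A real model would predict every position in parallel, conditioned on
    `committed` and the partially filled canvas. Here we only write dummy ids.
    """
    masked = [i for i, tok in enumerate(canvas) if tok is MASK]
    for i in rng.sample(masked, min(reveal, len(masked))):
        canvas[i] = len(committed) + i   # dummy token id = absolute position
    return canvas


def generate(total_tokens, seed=0):
    rng = random.Random(seed)
    committed = []            # plays the role of the KV cache contents
    steps_per_block = []
    while len(committed) < total_tokens:
        # Simplification: the last canvas is shortened instead of padded.
        size = min(BLOCK_LEN, total_tokens - len(committed))
        canvas = [MASK] * size          # every block starts fully masked
        steps = 0
        while any(tok is MASK for tok in canvas):
            canvas = toy_denoise_step(canvas, committed, rng)
            steps += 1
        committed.extend(canvas)        # commit the finished block
        steps_per_block.append(steps)
    return committed, steps_per_block


tokens, steps = generate(600)
print(len(tokens), "tokens in", len(steps), "blocks; steps per block:", steps)
```

The outer loop is sequential across blocks, while all the parallelism lives inside the inner loop, which is why the number of blocks grows linearly with output length.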

DiffusionGemma handles text, image, and video input (up to 60 seconds at 1fps), but outputs text only. It supports 35+ languages and has a built-in thinking mode via configurable <|think|> tokens. The Apache 2.0 license means no restrictions on commercial deployment. The weights are available now at google/diffusiongemma-26B-A4B-it on Hugging Face.

Key Specifications

Specification	Details
Provider	Google DeepMind
Model Family	Gemma 4
Total Parameters	25.2B
Active Parameters	3.8B
MoE Experts	8 active / 128 total + 1 shared
Context Window	256K tokens
Canvas Length	256 tokens per denoising pass
Languages	35+
Input Modalities	Text, Image, Video (up to 60s)
Output Modalities	Text only
Min VRAM (quantized)	~18 GB
License	Apache 2.0
Release Date	June 10, 2026
Input Price	Free
Output Price	Free

Benchmark Performance

DiffusionGemma trades benchmark accuracy for speed. On every eval it scores below the Gemma 4 26B it's built on - the question is whether the gap matters for your use case.

Benchmark	DiffusionGemma 26B	Gemma 4 26B	Notes
MMLU Pro	77.6%	82.6%	5pt gap
AIME 2026	69.1%	88.3%	19pt gap
LiveCodeBench v6	69.1%	77.1%	8pt gap
GPQA Diamond	73.2%	82.3%	9pt gap
MMMU Pro	54.3%	73.8%	19pt gap
MATH-Vision	70.5%	82.4%	12pt gap

The AIME 2026 and MMMU Pro gaps are the ones that stand out. For complex multi-step math, the diffusion architecture loses a lot - 69.1% vs 88.3% is a meaningful difference. MMMU Pro shows a comparably large drop (54.3% vs 73.8%), suggesting that combined vision-language reasoning takes a harder hit than most pure text tasks. See the coding benchmarks leaderboard and MMLU-Pro leaderboard for full context on where these scores land across all models.

General knowledge and science reasoning hold up better. MMLU Pro at 77.6% and GPQA Diamond at 73.2% are competitive with many models that are purely autoregressive. If your workload is mostly information retrieval, summarization, code infilling, or structured formatting - not olympiad-level math - the benchmark gap is more tolerable.
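For readers who want the gaps unrounded, the following short script recomputes them from the scores in the table above. The numbers are copied from that table; only the variable names are mine. The table's Notes column appears to show whole-point values, so the figures here differ slightly in the decimals.

```python
# (benchmark, DiffusionGemma 26B, Gemma 4 26B), copied from the table above
scores = [
    ("MMLU Pro", 77.6, 82.6),
    ("AIME 2026", 69.1, 88.3),
    ("LiveCodeBench v6", 69.1, 77.1),
    ("GPQA Diamond", 73.2, 82.3),
    ("MMMU Pro", 54.3, 73.8),
    ("MATH-Vision", 70.5, 82.4),
]

# Gap in percentage points, rounded to one decimal place.
gaps = {name: round(base - diff, 1) for name, diff, base in scores}

for name, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{name:<18} {gap:>5.1f} pt")

smallest = min(gaps, key=gaps.get)
largest = max(gaps, key=gaps.get)
mean_gap = sum(gaps.values()) / len(gaps)
print(f"smallest: {smallest}, largest: {largest}, mean: {mean_gap:.1f} pt")
```

Sorting by gap makes the split visible: the knowledge-heavy evals cluster at the low end, while competition math and multimodal reasoning sit at the high end.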

Google's own guidance: use Gemma 4 for maximum-quality production applications; use DiffusionGemma where speed is the dominant constraint.

Key Capabilities

The core use case is latency-sensitive, high-throughput inference. At 1,100+ tokens/sec on H100 and 700+ tokens/sec on RTX 5090, DiffusionGemma is significantly faster than autoregressive models at this quality tier. The closest comparison is another diffusion model, Mercury 2 from Inception Labs, which runs in roughly the same throughput range but at smaller effective parameter counts.
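As a rough back-of-the-envelope illustration of what these throughput numbers mean for a single response, the sketch below converts tokens/sec into wall-clock seconds. The throughput figures are the ones quoted in this article; the 2,000-token response length is a hypothetical workload, and the autoregressive baseline is simply derived by assuming the "roughly 4x" speedup applies on the same H100.

```python
H100_TPS = 1100      # "1,100+ tokens/sec" on H100, from the text
RTX5090_TPS = 700    # "700+ tokens/sec" on RTX 5090, from the text
SPEEDUP = 4          # "roughly 4x" vs an equivalent autoregressive model


def seconds_for(num_tokens, tokens_per_sec):
    """Decode time for num_tokens at a steady throughput, ignoring prefill."""
    return num_tokens / tokens_per_sec


response_tokens = 2000                    # hypothetical response length
baseline_tps = H100_TPS / SPEEDUP         # implied autoregressive throughput

for label, tps in [("DiffusionGemma on H100", H100_TPS),
                   ("DiffusionGemma on RTX 5090", RTX5090_TPS),
                   ("implied autoregressive baseline on H100", baseline_tps)]:
    print(f"{label:<42} {seconds_for(response_tokens, tps):5.2f} s")
```

This ignores prompt processing and batching effects, so treat it as an order-of-magnitude guide rather than a latency prediction.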

The bidirectional attention in the diffusion decoder is a genuine architectural advantage for certain task types. Autoregressive models can only attend leftward during generation; DiffusionGemma sees the entire 256-token canvas at once. This helps with:

Code infilling - filling in gaps in existing code rather than appending from the end
Structured output - JSON, markdown, and tabular formats where consistency across the whole block matters
Non-linear text - any generation where later tokens constrain earlier ones (e.g., completing a crossword-style constrained fill)

Google demonstrated this with Sudoku solving - after supervised fine-tuning on a small synthetic dataset, the model reached an 80% solve rate. A standard autoregressive model starting from the same checkpoint scored 0%.
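The difference between leftward-only and whole-canvas attention is easiest to see as a mask. The sketch below builds both for a tiny sequence. The block-bidirectional mask shape (full attention inside a block, plus attention to all earlier blocks) is my reading of the description in this article, not a published specification, and the sizes are shrunk from 256 so the output fits on screen.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]


def block_bidirectional_mask(n, block_len):
    """Full attention inside a block, plus attention to all earlier blocks."""
    return [[(j // block_len) <= (i // block_len) for j in range(n)]
            for i in range(n)]


def show(mask):
    """Render a mask as rows of 'x' (may attend) and '.' (may not)."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


n, block_len = 8, 4   # tiny sizes for display; the model's canvas is 256
print("causal:")
print(show(causal_mask(n)))
print("block-bidirectional:")
print(show(block_bidirectional_mask(n, block_len)))
```

In the second mask, the first position of a block can already see the last position of the same block, which is what lets a later constraint (a closing bracket, a filled Sudoku cell) influence earlier tokens during denoising.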

The model also supports adaptive inference via entropy-bound denoising. On simple queries it stops early - fewer denoising steps, faster output. Complex queries get the full 48-step budget. This makes average real-world latency lower than the peak numbers suggest for mixed-workload deployments. It's also the first diffusion LLM natively supported in vLLM, which matters for production deployment.
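One way to picture entropy-based early stopping is a loop that halts once the model's per-token predictions become confident enough. The sketch below is a toy under stated assumptions: the stopping rule (mean token entropy at or below a threshold), the threshold value, and `toy_step_fn` are all hypothetical, and only the 48-step budget comes from the text.

```python
import math

MAX_STEPS = 48   # full denoising budget from the text above


def entropy(probs):
    """Shannon entropy in nats of one token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def denoise_with_early_stop(step_fn, threshold, max_steps=MAX_STEPS):
    """Call step_fn(step) until mean token entropy <= threshold or budget ends.

    step_fn returns a list of per-position probability distributions.
    Returns the number of denoising steps actually used.
    """
    for step in range(1, max_steps + 1):
        dists = step_fn(step)
        mean_entropy = sum(entropy(d) for d in dists) / len(dists)
        if mean_entropy <= threshold:
            return step
    return max_steps


def toy_step_fn(sharpen_rate, vocab=4, positions=16):
    """Toy model whose distributions sharpen geometrically with each step."""
    def step_fn(step):
        top = 1.0 - (1.0 - 1.0 / vocab) * (1.0 - sharpen_rate) ** step
        rest = (1.0 - top) / (vocab - 1)
        return [[top] + [rest] * (vocab - 1)] * positions
    return step_fn


easy = denoise_with_early_stop(toy_step_fn(0.5), threshold=0.1)
hard = denoise_with_early_stop(toy_step_fn(0.02), threshold=0.1)
print("easy query steps:", easy, "| hard query steps:", hard)
```

The "easy" toy query converges in a handful of passes while the "hard" one exhausts the budget, which mirrors why average latency on mixed workloads can come in below the worst case.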

Pricing and Availability

The weights are free under Apache 2.0. Self-host anywhere without restrictions. For cloud deployment, Google Cloud's Agent Platform Model Garden carries the model, as does NVIDIA NIM for enterprises that want a managed endpoint.

Inference frameworks currently supporting DiffusionGemma:

vLLM - production serving with FP8 and NVFP4 quantized checkpoints via RedHatAI hub
Hugging Face Transformers - DiffusionGemmaForBlockDiffusion class, available in the transformers library
SGLang - programmatic multi-step inference
MLX - Apple Silicon deployment
llama.cpp - community GGUF quantizations via Unsloth (as of release date, still landing)

VRAM requirements: ~18 GB with quantization, which puts it in RTX 4090 / RTX 5090 territory for local deployment. The DGX Spark (128 GB unified memory) handles it comfortably for development setups.
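The VRAM figure can be sanity-checked with simple arithmetic on weight storage. The sketch below assumes all 25.2B parameters must be resident in memory (no expert offloading), since any expert may be selected for any token, and it ignores quantization scales, the KV cache, and activations. Which precision the ~18 GB figure corresponds to is not stated in the spec table; this calculation suggests it is closer to a 4-bit checkpoint plus runtime overhead than to an 8-bit one, but that is an inference.

```python
TOTAL_PARAMS = 25.2e9   # total parameters from the spec table


def weight_gb(bits_per_param, params=TOTAL_PARAMS):
    """Weight storage in decimal gigabytes, ignoring quantization scales."""
    return params * bits_per_param / 8 / 1e9


for label, bits in [("BF16", 16), ("FP8", 8), ("4-bit (e.g. NVFP4)", 4)]:
    print(f"{label:<20} {weight_gb(bits):5.1f} GB of weights")
```

Whatever remains between the weight footprint and the card's capacity is what is left for the KV cache and activations, which is worth budgeting for if you plan to use long contexts.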

Fine-tuning is supported through Unsloth (LoRA adapters), NVIDIA NeMo, and the official JAX toolbox called Hackable Diffusion, which includes example recipes for SFT on custom tasks.

Compare this to Mercury 2, which is API-only with usage-based pricing - DiffusionGemma offers the full open-source stack at no cost.

Check the AI speed and latency leaderboard for an up-to-date comparison of throughput numbers across inference providers.

Strengths and Weaknesses

Strengths

4x faster token generation vs autoregressive models at equivalent scale
Fits in 18 GB VRAM with quantization - RTX 4090/5090 viable
Bidirectional attention enables infilling, constrained generation, and structured output
Apache 2.0 - no usage restrictions, full self-host
Adaptive early stopping reduces average latency on simple queries
Full multimodal input (text, image, video) inherited from Gemma 4

Weaknesses

5-19 point benchmark gaps vs Gemma 4 26B across all evals - meaningful for complex reasoning
Text-only output limits its use in multimodal generation pipelines
Diffusion architecture is less mature in tooling than standard transformers
Block-sequential processing beyond 256 tokens means the speed advantage shrinks on long generations
llama.cpp / Ollama support still incomplete at launch

Related

Google Gemma 4 Ships Four Open Models Under Apache 2.0 - the Gemma 4 backbone DiffusionGemma is built on
Inception Ships Mercury 2 - A Diffusion LLM That Hits 1,009 Tokens Per Second - competing diffusion approach
Mercury 2 Is 13x Faster Than Claude Haiku - Verified - independent throughput benchmarks for comparison
AI Speed and Latency Leaderboard
Open-Source LLM Leaderboard
MMLU-Pro Leaderboard

FAQ

Is DiffusionGemma better than Gemma 4 26B?

For throughput-sensitive workloads: yes. For complex reasoning, math, or vision-language tasks: no. It's 4x faster but scores 5-19 points lower on key benchmarks. Pick based on whether latency or quality is your binding constraint.

What hardware does DiffusionGemma need?

Minimum ~18 GB VRAM with quantization. An RTX 4090 or RTX 5090 works for single-user local inference. Use an H100 or H200 for production serving. NVIDIA DGX Spark handles it in development.

Can DiffusionGemma produce images or video?

No. It accepts images and video as input (for analysis, OCR, VQA tasks), but it generates text only.

How does DiffusionGemma compare to Mercury 2?

Both are diffusion-based text generators reaching 1,000+ tokens/sec. DiffusionGemma is open-weight under Apache 2.0 with 25.2B total parameters and full self-host support. Mercury 2 is API-only with usage pricing. Quality benchmarks favor Mercury 2 on structured reasoning; DiffusionGemma has a bigger parameter base and broader modality support.

What inference frameworks support DiffusionGemma?

vLLM, Hugging Face Transformers, SGLang, and MLX are supported at launch. llama.cpp GGUF support via Unsloth was in progress at release. Fine-tuning is supported via Unsloth, NVIDIA NeMo, and the official Hackable Diffusion JAX toolbox.
