DiffusionGemma 26B

DiffusionGemma 26B is Google DeepMind's open-weight discrete diffusion language model that generates 256 tokens in parallel, reaching 1,100+ tokens/sec on H100 - roughly 4x faster than autoregressive models of the same size.

DiffusionGemma 26B is Google DeepMind's first open-weight text diffusion model, released on June 10, 2026. Unlike standard autoregressive language models, it doesn't produce tokens one at a time. Instead, it denoises entire 256-token blocks in parallel, a technique called block-autoregressive discrete diffusion. The result is roughly 4x faster generation than equivalent autoregressive models, with throughput hitting 1,100+ tokens/sec on a single NVIDIA H100.

TL;DR

  • Parallel 256-token block generation via discrete diffusion, not autoregressive decoding
  • 25.2B total parameters, 3.8B active at inference - fits in 18 GB VRAM with quantization
  • Scores 5-19 points below Gemma 4 26B on reasoning benchmarks; 4x faster in return

Overview

The model builds on the Gemma 4 26B MoE backbone, pairing that architecture with a novel diffusion head. The MoE structure - 128 total experts plus one shared, 8 active per token - keeps the active parameter count at 3.8B during inference, which holds per-token compute low. The full 25.2B parameters still count toward memory, so it is quantization that brings the footprint within reach of consumer and workstation GPUs. The diffusion head replaces standard autoregressive decoding with iterative denoising over a fixed-length canvas.
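To make the "8 active of 128, plus one shared" structure concrete, here is a minimal sketch of top-k expert routing for a single token. The expert counts come from the article; everything else is an assumption for illustration: the scalar toy experts, the `route_token` and `moe_layer` names, and the choice to apply softmax over the selected logits only (the source does not describe how the router normalizes its weights).

```python
import math
import random

NUM_EXPERTS = 128   # routed experts (from the article)
TOP_K = 8           # experts activated per token (from the article)

def route_token(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k experts of one token."""
    # Indices of the top_k largest router logits.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Assumption: softmax over the selected logits only, so weights sum to 1.
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_layer(x, router_logits, experts, shared_expert):
    """Shared expert output plus the weighted outputs of the selected experts."""
    out = shared_expert(x)          # the shared expert always runs
    for idx, w in route_token(router_logits):
        out += w * experts[idx](x)  # only 8 of the 128 routed experts run
    return out

# Toy scalar "experts": expert i scales its input by (i + 1) / NUM_EXPERTS.
experts = [lambda x, s=(i + 1) / NUM_EXPERTS: s * x for i in range(NUM_EXPERTS)]
shared_expert = lambda x: x

rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route_token(logits)
print(len(selected), round(sum(w for _, w in selected), 6))
print(moe_layer(2.0, logits, experts, shared_expert))
```

Only the selected experts are evaluated per token, which is why compute tracks the active parameter count rather than the total.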

For sequences longer than 256 tokens, the model processes blocks sequentially - once a 256-token block is fully denoised, it gets committed to the KV cache, and the next block starts fresh, conditioned on everything committed so far. This block-autoregressive design combines diffusion speed within blocks with autoregressive stability across blocks. Context length is the same 256K tokens inherited from Gemma 4.
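The block-by-block flow described above can be sketched as a toy loop. This is an illustration under stated assumptions, not the model's implementation: `toy_denoise_step` is a hypothetical stand-in that fills masked slots with dummy token ids, the `committed` list plays the role of the KV cache, and the final block is allowed to be shorter than 256 tokens for simplicity.

```python
import math
import random

BLOCK_SIZE = 256  # canvas length from the article
MASK = None       # placeholder for a not-yet-denoised position

def toy_denoise_step(canvas, committed, rng, fraction=0.25):
    """Hypothetical stand-in for the model: fill a fraction of the masked slots.

    A real denoiser would predict tokens conditioned on `committed` and on the
    whole canvas; here we only write dummy token ids.
    """
    masked = [i for i, t in enumerate(canvas) if t is MASK]
    n_fill = max(1, math.ceil(len(masked) * fraction))
    for i in rng.sample(masked, n_fill):
        canvas[i] = len(committed) + i  # dummy token id = absolute position
    return canvas

def generate(num_tokens, block_size=BLOCK_SIZE, seed=0):
    rng = random.Random(seed)
    committed = []  # tokens already committed (the "KV cache" in this toy)
    blocks = 0
    while len(committed) < num_tokens:
        size = min(block_size, num_tokens - len(committed))
        canvas = [MASK] * size                  # each block starts fully masked
        while any(t is MASK for t in canvas):   # denoise until nothing is masked
            canvas = toy_denoise_step(canvas, committed, rng)
        committed.extend(canvas)                # commit the finished block
        blocks += 1
    return committed, blocks

tokens, blocks = generate(1000)
print(len(tokens), blocks)  # 1000 tokens need ceil(1000 / 256) = 4 sequential blocks
```

Within a block the positions are filled in arbitrary order, while the blocks themselves are strictly sequential, which is the trade-off the paragraph above describes.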

DiffusionGemma handles text, image, and video input (up to 60 seconds at 1fps), but outputs text only. It supports 35+ languages and has a built-in thinking mode via configurable <|think|> tokens. The Apache 2.0 license means no restrictions on commercial deployment. Available now at google/diffusiongemma-26B-A4B-it on Hugging Face.

Key Specifications

| Specification | Details |
| --- | --- |
| Provider | Google DeepMind |
| Model Family | Gemma 4 |
| Total Parameters | 25.2B |
| Active Parameters | 3.8B |
| MoE Experts | 8 active / 128 total + 1 shared |
| Context Window | 256K tokens |
| Canvas Length | 256 tokens per denoising pass |
| Languages | 35+ |
| Input Modalities | Text, Image, Video (up to 60s) |
| Output Modalities | Text only |
| Min VRAM (quantized) | ~18 GB |
| License | Apache 2.0 |
| Release Date | June 10, 2026 |
| Input Price | Free |
| Output Price | Free |

Benchmark Performance

DiffusionGemma trades benchmark accuracy for speed. On every eval it scores below the Gemma 4 26B it's built on - the question is whether the gap matters for your use case.

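| Benchmark | DiffusionGemma 26B | Gemma 4 26B | Notes |
| --- | --- | --- | --- |
| MMLU Pro | 77.6% | 82.6% | 5pt gap |
| AIME 2026 | 69.1% | 88.3% | 19pt gap |
| LiveCodeBench v6 | 69.1% | 77.1% | 8pt gap |
| GPQA Diamond | 73.2% | 82.3% | 9pt gap |
| MMMU Pro | 54.3% | 73.8% | 19pt gap |
| MATH-Vision | 70.5% | 82.4% | 12pt gap |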
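The gaps in the table can be recomputed directly from the two score columns. The short script below does that arithmetic; the scores are copied from the table, and the variable names are illustrative only.

```python
# (DiffusionGemma 26B, Gemma 4 26B) scores in percent, copied from the table.
scores = {
    "MMLU Pro": (77.6, 82.6),
    "AIME 2026": (69.1, 88.3),
    "LiveCodeBench v6": (69.1, 77.1),
    "GPQA Diamond": (73.2, 82.3),
    "MMMU Pro": (54.3, 73.8),
    "MATH-Vision": (70.5, 82.4),
}

# Gap in percentage points, rounded to one decimal to hide float noise.
gaps = {name: round(base - diff, 1) for name, (diff, base) in scores.items()}

for name, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{name:<18}{gap:>5.1f} pt")

smallest = min(gaps, key=gaps.get)
largest = max(gaps, key=gaps.get)
mean_gap = sum(gaps.values()) / len(gaps)
print(f"smallest: {smallest}, largest: {largest}, mean: {mean_gap:.1f} pt")
```

Before rounding to whole points, the differences run from 5.0 (MMLU Pro) to 19.5 (MMMU Pro), with a mean of about 12 points across the six evals.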

The AIME 2026 gap is the one that stands out. For complex multi-step math, the diffusion architecture loses a lot - 69.1% vs 88.3% is a meaningful difference. MMMU Pro also shows a large drop (54.3% vs 73.8%), suggesting that combined vision-language reasoning takes a harder hit than pure text tasks. See the coding benchmarks leaderboard and MMLU-Pro leaderboard for full context on where these scores land across all models.

General knowledge and science reasoning hold up better. MMLU Pro at 77.6% and GPQA Diamond at 73.2% are competitive with many models that are purely autoregressive. If your workload is mostly information retrieval, summarization, code infilling, or structured formatting - not olympiad-level math - the benchmark gap is more tolerable.

Google's own guidance: use Gemma 4 for maximum-quality production applications; use DiffusionGemma where speed is the dominant constraint.

Key Capabilities

The core use case is latency-sensitive, high-throughput inference. At 1,100+ tokens/sec on H100 and 700+ tokens/sec on RTX 5090, DiffusionGemma is significantly faster than any autoregressive model at this quality tier. Mercury 2 from Inception Labs, itself a diffusion model, runs in around the same throughput range but at smaller effective parameter counts.
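As a rough illustration of what these throughput figures mean for response time, the sketch below converts tokens/sec into seconds per response. The 1,100 and 700 tokens/sec figures are the article's; the autoregressive baseline is not a measured number but is derived here from the "roughly 4x" claim, and steady-state throughput is assumed (no prefill or queueing time).

```python
H100_TPS = 1100      # tokens/sec on H100 (from the article)
RTX5090_TPS = 700    # tokens/sec on RTX 5090 (from the article)
SPEEDUP = 4.0        # "roughly 4x" (from the article)

# Assumption: an autoregressive baseline implied by the 4x claim, not a benchmark.
AR_BASELINE_TPS = H100_TPS / SPEEDUP

def seconds_to_generate(num_tokens, tokens_per_sec):
    """Decode time only, assuming constant throughput."""
    return num_tokens / tokens_per_sec

for n in (256, 1024, 4096):
    print(
        f"{n:>5} tokens: "
        f"H100 {seconds_to_generate(n, H100_TPS):5.2f}s, "
        f"RTX 5090 {seconds_to_generate(n, RTX5090_TPS):5.2f}s, "
        f"implied AR baseline {seconds_to_generate(n, AR_BASELINE_TPS):5.2f}s"
    )
```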

The bidirectional attention in the diffusion decoder is a genuine architectural advantage for certain task types. Autoregressive models can only attend leftward during generation; DiffusionGemma sees the entire 256-token canvas at once. This helps with:

  • Code infilling - filling in gaps in existing code rather than appending from the end
  • Structured output - JSON, markdown, and tabular formats where consistency across the whole block matters
  • Non-linear text - any generation where later tokens constrain earlier ones (e.g., completing a crossword-style constrained fill)
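The difference between leftward-only and canvas-wide attention is easiest to see as a mask. The sketch below builds both for a tiny sequence. The block-wise mask is an assumption based on the article's description (full attention inside a block, plus attention to all earlier committed blocks); the source does not publish the exact mask, and the function names are illustrative.

```python
def causal_mask(n):
    """Autoregressive mask: position i may attend to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def block_bidirectional_mask(n, block_size):
    """Assumed block-wise mask: position i may attend to every position in its
    own block and in all earlier blocks."""
    return [[(j // block_size) <= (i // block_size) for j in range(n)]
            for i in range(n)]

def show(mask):
    """Render a mask with 'x' for allowed attention and '.' for blocked."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)

print(show(causal_mask(6)))
print()
print(show(block_bidirectional_mask(6, 3)))
```

With 6 positions and a block size of 3, the causal mask is a lower triangle, while the block-wise mask lets the first three rows see all of block one (`xxx...`) and the last three rows see everything (`xxxxxx`). In the real model the block size would be 256.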

Google demonstrated this with Sudoku solving - after supervised fine-tuning on a small synthetic dataset, the model reached an 80% solve rate. A standard autoregressive model starting from the same checkpoint scored 0%.

The model also supports adaptive inference via entropy-bound denoising. On simple queries it stops early - fewer denoising steps, faster output. Complex queries get the full 48-step budget. This makes average real-world latency lower than the peak numbers suggest for mixed-workload deployments. It's also the first diffusion LLM natively supported in vLLM, which matters for production deployment.
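A minimal sketch of the early-stopping idea follows. Only the 48-step budget comes from the article. The stopping rule (mean per-position entropy falling below a bound), the bound value, and the `sharpen` update that stands in for a real denoising step are all assumptions made for illustration.

```python
import math

MAX_STEPS = 48  # full denoising budget quoted in the article

def entropy(probs):
    """Shannon entropy in nats of one position's token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def sharpen(probs, temperature=0.8):
    """Toy stand-in for a denoising step: raise to 1/temperature, renormalize."""
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

def denoise_with_entropy_bound(position_probs, bound=0.05, max_steps=MAX_STEPS):
    """Run toy denoising steps until mean entropy <= bound or the budget is spent."""
    steps = 0
    while steps < max_steps:
        mean_h = sum(entropy(p) for p in position_probs) / len(position_probs)
        if mean_h <= bound:
            break  # the canvas is confident enough: stop early
        position_probs = [sharpen(p) for p in position_probs]
        steps += 1
    return steps, position_probs

easy = [[0.90, 0.05, 0.05]] * 4     # already confident: few steps needed
hard = [[0.34, 0.33, 0.33]] * 4     # nearly uniform: many more steps
stuck = [[1 / 3, 1 / 3, 1 / 3]] * 4 # never sharpens: uses the whole budget

for name, canvas in [("easy", easy), ("hard", hard), ("stuck", stuck)]:
    steps, _ = denoise_with_entropy_bound(canvas)
    print(name, steps)
```

The confident canvas exits after a handful of steps, the ambiguous one takes several times longer, and the fully uniform one runs into the 48-step cap, which mirrors the mixed-workload behavior described above.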

Pricing and Availability

The weights are free under Apache 2.0. Self-host anywhere without restrictions. For cloud deployment, Google Cloud's Agent Platform Model Garden carries the model, as does NVIDIA NIM for enterprises that want a managed endpoint.

Inference frameworks currently supporting DiffusionGemma:

  • vLLM - production serving with FP8 and NVFP4 quantized checkpoints via RedHatAI hub
  • Hugging Face Transformers - DiffusionGemmaForBlockDiffusion class, available in the transformers library
  • SGLang - programmatic multi-step inference
  • MLX - Apple Silicon deployment
  • llama.cpp - community GGUF quantizations via Unsloth (as of release date, still landing)

VRAM requirements: ~18 GB with quantization, which puts it in RTX 4090 / RTX 5090 territory for local deployment. The DGX Spark (128 GB unified memory) handles it comfortably for development setups.
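For a back-of-the-envelope check on the memory figure, the sketch below estimates weight storage alone at a few precisions. It assumes all 25.2B parameters are kept on the GPU and ignores KV cache, activations, and runtime overhead, so real usage will be higher. The article does not say how the ~18 GB figure breaks down.

```python
TOTAL_PARAMS = 25.2e9   # total parameters (from the article)
VRAM_BUDGET_GB = 18     # quoted minimum VRAM (from the article)

def weight_gb(bits_per_param, params=TOTAL_PARAMS):
    """Weight storage in GB (1 GB = 1e9 bytes) at a uniform precision."""
    return params * bits_per_param / 8 / 1e9

for name, bits in [("BF16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{name:<6}{weight_gb(bits):6.1f} GB")

# Average bits per parameter that would exactly fill the quoted budget
# if it were spent on weights alone.
avg_bits = VRAM_BUDGET_GB * 1e9 * 8 / TOTAL_PARAMS
print(f"{avg_bits:.1f} bits/param fills {VRAM_BUDGET_GB} GB")
```

Weights alone come to about 50.4 GB at 16 bits, 25.2 GB at 8 bits, and 12.6 GB at 4 bits, so an ~18 GB footprint implies an average below 8 bits per parameter.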

Fine-tuning is supported through Unsloth (LoRA adapters), NVIDIA NeMo, and the official JAX toolbox called Hackable Diffusion, which includes example recipes for SFT on custom tasks.

Compare this to Mercury 2, which is API-only with usage-based pricing - DiffusionGemma offers the full open-source stack at no cost.

Check the AI speed and latency leaderboard for an up-to-date comparison of throughput numbers across inference providers.

Strengths and Weaknesses

Strengths

  • 4x faster token generation vs autoregressive models at equivalent scale
  • Fits in 18 GB VRAM with quantization - RTX 4090/5090 viable
  • Bidirectional attention enables infilling, constrained generation, and structured output
  • Apache 2.0 - no usage restrictions, full self-host
  • Adaptive early stopping reduces average latency on simple queries
  • Full multimodal input (text, image, video) inherited from Gemma 4

Weaknesses

  • 5-19 point benchmark gaps vs Gemma 4 26B across all evals - meaningful for complex reasoning
  • Text-only output limits its use in multimodal generation pipelines
  • Diffusion architecture is less mature in tooling than standard transformers
  • Block-sequential processing beyond 256 tokens means long-generation speed advantage shrinks
  • llama.cpp / Ollama support still incomplete at launch

FAQ

Is DiffusionGemma better than Gemma 4 26B?

For throughput-sensitive workloads: yes. For complex reasoning, math, or vision-language tasks: no. It's 4x faster but scores 5-19 points lower on key benchmarks. Pick based on whether latency or quality is your binding constraint.

What hardware does DiffusionGemma need?

Minimum ~18 GB VRAM with quantization. An RTX 4090 or RTX 5090 works for single-user local inference. Use an H100 or H200 for production serving. NVIDIA DGX Spark handles it in development.

Can DiffusionGemma produce images or video?

No. It accepts images and video as input (for analysis, OCR, VQA tasks), but it generates text only.

How does DiffusionGemma compare to Mercury 2?

Both are diffusion-based text generators reaching 1,000+ tokens/sec. DiffusionGemma is open-weight under Apache 2.0 with 25.2B total parameters and full self-host support. Mercury 2 is API-only with usage pricing. Quality benchmarks favor Mercury 2 on structured reasoning; DiffusionGemma has a bigger parameter base and broader modality support.

What inference frameworks support DiffusionGemma?

vLLM, Hugging Face Transformers, SGLang, and MLX are supported at launch. llama.cpp GGUF support via Unsloth was in progress at release. Fine-tuning is available via Unsloth, NVIDIA NeMo, and the official Hackable Diffusion JAX toolbox.

✓ Last verified June 10, 2026

James Kowalski
About the author AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.