DiffusionGemma Generates Text 4x Faster With Open Diffusion-Based Decoding

Google DeepMind released DiffusionGemma, an open 26B model that generates text via parallel diffusion decoding, reaching up to 2,000 tokens per second and running locally.

Dr. Nova Chen★Jun 17, 2026★6 min read

A Different Way to Generate Text — and It Is Fast

Most language models you have heard of generate text one token at a time, left to right, like a typewriter that cannot move on until the current letter is struck. On June 10, 2026, Google DeepMind released something that works on a genuinely different principle: DiffusionGemma, an open model that generates text the way image diffusion models generate pictures — by starting from noise and refining many tokens in parallel. It is an exciting architectural experiment, and the early performance numbers are eye-catching.

How Diffusion-Based Text Generation Works

In a diffusion language model, the system does not commit to each word sequentially. Instead, it denoises up to 256 tokens in parallel per step, gradually resolving a blurry draft into coherent text across several refinement passes. The payoff is throughput. Because the model fills in large spans at once rather than crawling token by token, it can be dramatically faster on the right hardware.
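
To make the idea concrete, here is a toy sketch of parallel refinement. It assumes a masked-diffusion style of decoding, where the draft starts fully masked and the most confident positions are committed on each pass. That is one common formulation in the research literature, not a description of DiffusionGemma's actual sampler, which the announcement does not detail. The toy_predict function is a hypothetical stand-in for the model and simply looks up the answer with a random confidence score.

```python
import random

MASK = "<mask>"


def toy_predict(draft, target, rng):
    """Hypothetical stand-in for a denoising model.

    For every still-masked position, return (proposed_token, confidence).
    A real model would score all positions in one forward pass; this toy
    version cheats by reading the known target and inventing a confidence.
    """
    return {
        i: (target[i], rng.random())
        for i, tok in enumerate(draft)
        if tok == MASK
    }


def parallel_denoise(target, steps, seed=0):
    """Resolve a fully masked draft into `target` in at most `steps` passes."""
    rng = random.Random(seed)
    draft = [MASK] * len(target)
    history = [list(draft)]  # snapshot of the draft after each pass

    for step in range(steps):
        proposals = toy_predict(draft, target, rng)
        if not proposals:
            break  # nothing left to denoise

        # Commit an even share of the remaining masked positions each pass
        # (ceiling division, so the final pass always finishes the draft).
        remaining_steps = steps - step
        k = -(-len(proposals) // remaining_steps)

        # Keep the k positions the "model" is most confident about.
        best = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)[:k]
        for i in best:
            draft[i] = proposals[i][0]
        history.append(list(draft))

    return draft, history


if __name__ == "__main__":
    words = "the quick brown fox jumps over the lazy dog".split()
    final, history = parallel_denoise(words, steps=3)
    for n, snapshot in enumerate(history):
        print(n, " ".join(snapshot))
```

The nine-word sentence is resolved in three passes instead of nine sequential steps, which is the source of the throughput gain: the number of model passes depends on the number of refinement steps, not on the number of tokens.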

DiffusionGemma is built on the Gemma 4 architecture with 26 billion parameters, of which only about 3.8 billion are active per step, so each denoising pass costs far less compute than the full parameter count suggests. With NVIDIA's optimizations, it reaches roughly 1,000 tokens per second on an H100 and up to 2,000 tokens per second on a DGX Station, about four times the speed of comparable autoregressive models. For interactive applications where latency is everything, that kind of speedup is meaningful.
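
A quick back-of-the-envelope calculation shows what those figures mean in practice. The sketch below uses only the numbers quoted above, plus two assumptions of my own: a hypothetical 500-token reply, and that the fourfold speedup applies on the same hardware. It also ignores prompt processing and time to first token, so treat the results as rough orders of magnitude rather than benchmarks.

```python
# Figures quoted in the article.
TOTAL_PARAMS_B = 26.0     # total parameters, in billions
ACTIVE_PARAMS_B = 3.8     # parameters active per step, in billions
H100_TPS = 1_000          # approximate tokens per second on an H100
DGX_STATION_TPS = 2_000   # peak tokens per second on a DGX Station
REPORTED_SPEEDUP = 4.0    # versus comparable autoregressive models

# Assumption for illustration only: a 500-token reply.
REPLY_TOKENS = 500


def generation_seconds(num_tokens, tokens_per_second):
    """Time to emit num_tokens at a steady decoding rate."""
    return num_tokens / tokens_per_second


def implied_baseline_tps(diffusion_tps, speedup=REPORTED_SPEEDUP):
    """Autoregressive throughput implied by the reported speedup,
    assuming the comparison is made on the same hardware."""
    return diffusion_tps / speedup


active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

if __name__ == "__main__":
    print(f"Active share of parameters: {active_fraction:.1%}")
    for name, tps in [("H100", H100_TPS), ("DGX Station", DGX_STATION_TPS)]:
        fast = generation_seconds(REPLY_TOKENS, tps)
        slow = generation_seconds(REPLY_TOKENS, implied_baseline_tps(tps))
        print(f"{name}: {fast:.2f} s vs {slow:.2f} s for an implied baseline")
```

Under these assumptions, a 500-token reply takes about half a second on an H100, against roughly two seconds for a model running at a quarter of the speed. That is the difference between a response that feels instant and one with a noticeable pause.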

Open, Local, and Ready to Build On

Here is the part the open-weight community will appreciate most: DiffusionGemma was released under the permissive Apache 2.0 license, with day-one support in Hugging Face Transformers, vLLM, and Unsloth. It runs fully locally on consumer and professional GPUs, so researchers and developers can download it and start experimenting immediately — no API gatekeeping required.

This matters beyond raw benchmarks. Diffusion-based language modeling has been an active research thread for a while, but a capable, openly licensed model at this scale gives the whole community a serious platform to study the approach in practice. Questions like how diffusion decoding handles long-form reasoning, how it trades off speed against quality, and how it behaves under fine-tuning are now things anyone can investigate directly.

Why This Is Worth Watching

I want to be measured here: autoregressive models remain the workhorses of the field, and a single release does not overturn that. But parallel diffusion decoding is one of the more promising avenues for faster local inference, and seeing it delivered as a free, locally runnable model is exactly the kind of open contribution that accelerates research for everyone. For the mini-computer enthusiasts among our readers who are eyeing local AI, the efficiency angle is especially intriguing: faster decoding means more responsive on-device assistants.

DiffusionGemma is a reminder that the architecture conversation is far from settled, and that the open ecosystem remains where some of the most interesting experiments land first.

Sources: NVIDIA Blog — "RTX AI Garage: local Gemma diffusion," June 10, 2026; MLQ.ai — "Google DeepMind releases DiffusionGemma," June 2026.