DiffusionGemma: Google Accelerates Text Generation via Discrete...

Google has introduced DiffusionGemma—an experimental model based on the Gemma 4 26B MoE architecture that uses the discrete diffusion method to radically accelerate text generation. Instead of sequential token prediction, the model processes data blocks in parallel, enabling extreme inference speeds on modern hardware.

What Happened

Google has developed DiffusionGemma, an experimental model with a Mixture-of-Experts (MoE) architecture. The model has a total of 26B parameters; however, thanks to its sparse structure (only 8 out of 128 experts are active), only 3.8B parameters are used during inference. The core innovation lies in the use of discrete diffusion, which allows for the parallel processing of 256-token blocks (canvases). As a result, generation speeds reach over 1,000 tokens per second on an NVIDIA H100 and over 700 tokens per second on consumer RTX 5090 graphics cards.

Context

Traditional LLMs use an autoregressive approach, predicting each subsequent token sequentially, which creates a speed bottleneck and prevents full utilization of GPU parallelism. DiffusionGemma proposes a paradigm shift, moving from character-by-character or stream-oriented decoding to methods capable of generating entire text fragments in a single step.

Why It Matters for the Industry

The transition to diffusion approaches could radically change industry standards for inference performance. This opens the way for hybrid architectures that solve the latency problem when generating large volumes of text. The success of such prototypes could lead to an industrial standard shift: if diffusion quality matches autoregression, traditional token-by-token streaming may become obsolete.

Why It Matters for Users

For end users, this means the emergence of AI assistants that work almost instantaneously. Instead of the familiar process of text "typing" letter by letter, interfaces will be able to output entire paragraphs at once. This is critical for interactive applications and running on local consumer hardware, such as the RTX 50 series, where high throughput without massive latency is essential.

What Is Not Yet Known / Limitations

At the current stage, there is a significant trade-off between speed and quality: the use of discrete diffusion leads to a reduction in accuracy and text generation quality compared to classical autoregressive models, which may limit their application in serious production systems.

Sources

Author

Look at AI, Editorial Staff