Google Introduces DiffusionGemma

DiffusionGemma, an experimental model based on Gemma 4 26B MoE, uses discrete diffusion to process blocks of 256 tokens in parallel, which enables speeds exceeding 1,000 tokens/s on an NVIDIA H100.
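To make "processing a block in parallel" concrete, here is a minimal toy sketch of confidence-based masked discrete diffusion over one 256-token block. It is not DiffusionGemma's actual algorithm: the announcement does not describe the unmasking schedule, so `toy_denoiser`, the `steps=8` default, and the three-letter vocabulary are all hypothetical stand-ins chosen for illustration.

```python
import random

MASK = None  # marks a position that has not been decoded yet


def toy_denoiser(block, vocab, rng):
    """Hypothetical stand-in for the model.

    Proposes a (token, confidence) pair for every masked position at once,
    which is the part a real model would compute in one parallel forward pass.
    """
    return {
        i: (rng.choice(vocab), rng.random())
        for i, tok in enumerate(block)
        if tok is MASK
    }


def decode_block(block_size=256, steps=8, vocab=("a", "b", "c"), seed=0):
    """Fill a fully masked block in a fixed number of refinement steps."""
    rng = random.Random(seed)
    block = [MASK] * block_size
    for step in range(steps):
        proposals = toy_denoiser(block, vocab, rng)
        if not proposals:
            break
        remaining_steps = steps - step
        # Commit an equal share of the still-masked positions this step
        # (ceiling division, so the final step commits everything left).
        k = -(-len(proposals) // remaining_steps)
        most_confident = sorted(
            proposals, key=lambda i: proposals[i][1], reverse=True
        )[:k]
        for i in most_confident:
            block[i] = proposals[i][0]
    return block


if __name__ == "__main__":
    out = decode_block()
    print(len(out), sum(tok is MASK for tok in out))
```

With these assumed defaults, the 256 positions are filled in 8 passes of 32 tokens each, rather than 256 one-token passes.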

🌍 The diffusion approach departs from strictly token-by-token decoding: refining a whole block of tokens per step exposes more GPU parallelism and reduces latency.
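The latency argument can be checked with back-of-the-envelope arithmetic. The sketch below counts sequential forward passes for the two decoding styles and derives the time budget per block implied by the quoted figures (256-token blocks, 1,000 tokens/s). The `steps_per_block=8` value is a hypothetical assumption, not a number from the announcement, and a diffusion pass does not necessarily cost the same as an autoregressive one.

```python
def sequential_passes_autoregressive(num_tokens):
    """One forward pass per generated token."""
    return num_tokens


def sequential_passes_block_diffusion(num_tokens, block_size=256, steps_per_block=8):
    """Blocks are decoded one after another, each in a few refinement passes.

    steps_per_block is an assumed value used only for illustration.
    """
    blocks = -(-num_tokens // block_size)  # ceiling division
    return blocks * steps_per_block


def block_time_budget_s(block_size=256, tokens_per_s=1000):
    """Maximum time one block may take to sustain the given throughput."""
    return block_size / tokens_per_s


if __name__ == "__main__":
    n = 1024
    print(sequential_passes_autoregressive(n))
    print(sequential_passes_block_diffusion(n))
    print(block_time_budget_s())
```

Under these assumptions, 1,024 tokens take 1,024 sequential passes autoregressively versus 32 with block diffusion, and sustaining 1,000 tokens/s means each 256-token block has to finish in about a quarter of a second.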

👤 Future local AI assistants could output entire paragraphs of text almost instantly, which matters especially when running on consumer GPUs like the RTX 5090.

Source 1: https://blog.google/innovation-and-ai/technology/developers-tools/diffusion-gemma-faster-text-generation/
Source 2: https://huggingface.co/google/diffusiongemma-26B-A4B-it
