🚀 Google Introduces DiffusionGemma
An experimental model based on Gemma 4 26B MoE uses discrete diffusion to process blocks of 256 tokens in parallel. This enables speeds exceeding 1,000 tokens/s on an NVIDIA H100.
🌍 The diffusion approach changes inference standards, maximizing GPU parallelism and reducing latency.
👤 Future local AI assistants could output entire paragraphs of text instantaneously, which is crucial for operation on consumer GPUs like the RTX 5090.
Source 1: https://blog.google/innovation-and-ai/technology/developers-tools/diffusion-gemma-faster-text-generation/ Source 2: https://huggingface.co/google/diffusiongemma-26B-A4B-it
