Google researchers have developed a method that significantly speeds up Gemini Nano v3 models on Pixel devices by adding a lightweight Multi-Token Prediction (MTP) layer on top of a frozen base model.

What Happened

Google introduced MTP, which attaches a specialized layer (head) to the main model. The head uses a cross-attention mechanism to read the main model's existing KV cache, which lets it predict multiple tokens in a single step. On Pixel 9 devices, this approach delivers a speedup of more than 50% and saves approximately 130 MB of RAM, because the cache no longer needs to be duplicated.
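To make the cache-sharing idea concrete, here is a minimal sketch in plain Python of a head that cross-attends over a cache it does not own. It is not Google's implementation: the names `base_cache`, `mtp_queries`, and `cross_attend`, the tiny 2-dimensional vectors, and the idea of one query per future position are all assumptions chosen for illustration. A real head would also project each context vector to vocabulary logits, which is omitted here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(query, keys, values):
    """Single-head scaled dot-product attention of one query over cached keys/values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# Hypothetical KV cache already filled by the frozen base model (3 past tokens).
base_cache = {
    "keys": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    "values": [[0.5, 0.1], [0.2, 0.9], [0.7, 0.3]],
}

# Hypothetical queries owned by the head, one per future position to predict.
mtp_queries = [[1.0, 0.0], [0.0, 1.0]]

# The head reads the base model's cache by reference; no second cache is built.
contexts = [
    cross_attend(q, base_cache["keys"], base_cache["values"])
    for q in mtp_queries
]
```

The point of the sketch is that `base_cache` is passed in rather than copied, so the only extra memory the head needs is for its own queries and outputs.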

Context

Traditionally, text generation has been accelerated with speculative decoding, which relies on a separate draft model (drafter) that requires significant additional memory. Google's new method optimizes already-trained edge models without changing the weights of the base model, addressing the critical problem of limited RAM on mobile devices.
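For readers unfamiliar with speculative decoding, the draft-and-verify loop can be sketched as follows. This is a toy greedy version under stated assumptions, not the method described above: `base_next` and `draft_next_k` are hypothetical stand-ins for the large model and the drafter, and the "models" are simple arithmetic rules so the example runs without any ML library.

```python
def base_next(tokens):
    """Hypothetical stand-in for the large model: next token is (last + 1) mod 10."""
    return (tokens[-1] + 1) % 10

def draft_next_k(tokens, k):
    """Hypothetical cheap drafter: agrees with the base rule except that it
    deliberately guesses 0 whenever the correct token would be 5."""
    out, cur = [], list(tokens)
    for _ in range(k):
        guess = (cur[-1] + 1) % 10
        if guess == 5:
            guess = 0  # deliberate drafting error
        out.append(guess)
        cur.append(guess)
    return out

def speculative_generate(prompt, n_new, k=3):
    """Greedy draft-and-verify loop; returns (tokens, number of verification passes)."""
    tokens = list(prompt)
    base_passes = 0
    while len(tokens) - len(prompt) < n_new:
        draft = draft_next_k(tokens, k)
        # A real system scores all drafted positions in one batched forward pass.
        # The toy below calls base_next per position but counts it as one pass.
        base_passes += 1
        accepted = []
        for guess in draft:
            target = base_next(tokens + accepted)
            if guess == target:
                accepted.append(guess)
            else:
                accepted.append(target)  # keep the base model's token and stop
                break
        else:
            # Every draft token matched, so the same pass yields one extra token.
            accepted.append(base_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[: len(prompt) + n_new], base_passes

generated, passes = speculative_generate([0], n_new=8, k=3)
```

In this toy run the output is identical to what the base rule alone would produce, but it takes 3 verification passes instead of 8 sequential steps. The speedup in practice depends on how often the drafted tokens are accepted.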

Why It Matters for the Industry

This technology demonstrates an efficient path for optimizing edge AI without expensive fine-tuning of massive base models. It significantly lowers the technical and financial barriers to deploying complex LLMs directly on mobile platforms and other edge devices.

Why It Matters for Users

For Pixel smartphone owners, Gemini Nano will become noticeably faster and more energy-efficient. This will directly improve local features such as notification summarization and intelligent autocorrect, which can respond almost instantly without connecting to the cloud.

Author

Look at AI, Editorial Team