Breakthrough in Local Inference: MTP Technology Enables Instant...

Experimental tests indicate that applying Multi-Token Prediction (MTP) to the Qwen 3.6 model family substantially increases generation speed even on consumer-grade hardware, making local deployment of large LLMs more technically and economically viable.

What Happened

In testing, a setup of two RTX 3090 GPUs linked via NVLink reached up to 187 tokens per second running the Qwen 3.6 93B model with MTP. For more compact models such as Qwen 3.6 27B, specialized llama.cpp branches with MTP support deliver a speedup of approximately 1.8x over standard generation.
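To put these figures in practical terms, the short sketch below converts a decode rate into reply time and applies the reported 1.8x speedup. The 187 tokens per second and 1.8x values come from the tests described above. The 500-token reply length and the 30 tokens per second baseline are hypothetical placeholders, since the article does not report them.

```python
def reply_seconds(n_tokens: int, tokens_per_sec: float) -> float:
    """Time to generate n_tokens at a steady decode rate (prompt processing ignored)."""
    return n_tokens / tokens_per_sec


def with_speedup(baseline_tps: float, speedup: float) -> float:
    """Decode rate after applying a multiplicative speedup."""
    return baseline_tps * speedup


REPORTED_TPS = 187.0      # peak figure reported for the dual RTX 3090 setup
REPORTED_SPEEDUP = 1.8    # approximate speedup reported for the 27B model
BASELINE_TPS = 30.0       # hypothetical baseline, not a measured value

print(f"500-token reply: {reply_seconds(500, REPORTED_TPS):.1f} s")
print(f"27B with MTP: {with_speedup(BASELINE_TPS, REPORTED_SPEEDUP):.0f} tokens/s")
```

At the reported peak rate, a 500-token answer would take under three seconds, provided the rate holds for the whole reply.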

Context

Multi-Token Prediction (MTP) changes the traditional autoregressive decoding paradigm. Instead of predicting just one next token per step, the model is trained to predict several tokens in a single step, which significantly increases inference throughput without substantial loss of accuracy.
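One common way to use multi-token predictions at inference time is as a draft that the main model then verifies. The article does not spell out this mechanism, so the toy sketch below assumes it. `target_next` and `mtp_draft` are hypothetical stand-ins for the main model and its MTP heads, and no real model is involved. The sketch shows why accepted drafts reduce the number of decoding steps while the output stays identical to plain greedy decoding.

```python
from typing import List, Tuple


def target_next(context: List[int]) -> int:
    """Hypothetical stand-in for the main model's greedy next token."""
    return (context[-1] * 3 + 1) % 7


def mtp_draft(context: List[int], k: int) -> List[int]:
    """Hypothetical stand-in for MTP heads: guess k tokens in one step.
    Deliberately imperfect: it is wrong whenever the correct token is 0."""
    draft, ctx = [], list(context)
    for _ in range(k):
        tok = target_next(ctx)
        if tok == 0:
            tok = 1  # simulated misprediction
        draft.append(tok)
        ctx.append(tok)
    return draft


def greedy(prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one token per decoding step."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out


def generate(prompt: List[int], n_new: int, k: int) -> Tuple[List[int], int]:
    """Draft k tokens per step, keep the verified prefix, count the steps."""
    out, steps = list(prompt), 0
    while len(out) - len(prompt) < n_new:
        steps += 1
        # A real system would check all drafted tokens in one batched pass.
        for tok in mtp_draft(out, k):
            expected = target_next(out)
            out.append(expected)  # always keep the main model's token
            if tok != expected:
                break  # first mismatch: discard the rest of the draft
    return out[: len(prompt) + n_new], steps


tokens, steps = generate([1], n_new=12, k=4)
print(tokens == greedy([1], 12), steps)
```

In this toy setup, the draft-and-verify loop needs fewer steps than the 12 that one-token-at-a-time decoding would take, and it yields the same tokens. Real speedups depend on how often the drafted tokens are accepted.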

Why It Matters for the Industry

For the AI industry, MTP signals a shift from simply scaling parameter counts toward architectural solutions that optimize generation efficiency, measured in tokens per second per dollar (tokens/sec/$). This encourages the development of edge computing and the standardization of multi-token prediction methods in key open-source tools such as llama.cpp and vLLM.
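The tokens/sec/$ metric can be sketched as throughput divided by hardware cost. The cost and baseline throughput below are hypothetical placeholders, not figures from the tests. The only value taken from the article is the approximate 1.8x speedup.

```python
def tokens_per_sec_per_dollar(tokens_per_sec: float, hardware_cost_usd: float) -> float:
    """Generation efficiency: throughput per dollar of hardware cost."""
    if hardware_cost_usd <= 0:
        raise ValueError("hardware cost must be positive")
    return tokens_per_sec / hardware_cost_usd


HARDWARE_COST_USD = 1500.0   # hypothetical price, for illustration only
BASELINE_TPS = 30.0          # hypothetical throughput without MTP
SPEEDUP = 1.8                # approximate speedup reported in the article

before = tokens_per_sec_per_dollar(BASELINE_TPS, HARDWARE_COST_USD)
after = tokens_per_sec_per_dollar(BASELINE_TPS * SPEEDUP, HARDWARE_COST_USD)
print(f"efficiency gain: {after / before:.1f}x")
```

Because the hardware cost is unchanged, any software-side speedup carries over directly into the efficiency metric.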

Why It Matters for Users

For everyday users and developers, this means that running heavy models in the 93B class on home hardware (for example, used RTX 3090 cards) is becoming realistic, with near-instant response times. This opens up possibilities for building fast, autonomous, and private AI agents that run directly on personal devices without depending on cloud APIs.

Author

Look at AI, Editorial Team