Xiaomi has unveiled the MiMo-V2.5-Pro-UltraSpeed mode for its 1-trillion-parameter flagship model, demonstrating generation speeds of 1,000 to 1,200 tokens per second (TPS). The result was achieved on a standard 8-GPU node, making high-performance inference accessible without specialized accelerators.

What Happened

Xiaomi released the new MiMo-V2.5-Pro-UltraSpeed mode, which reportedly lets its 1T-parameter flagship model generate output about 15 times faster than ChatGPT and Claude. The high performance on a standard 8-GPU node comes from combining three technologies: FP4 quantization of the expert layers, the DFlash speculative decoding method, and the optimized TileRT engine. Generation quality is said to remain on par with Claude Opus.
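To make the role of two of these techniques concrete, here is a rough back-of-the-envelope sketch in Python. It is not Xiaomi's method: it treats all weights as stored at a single precision (the article says only the expert layers use FP4), ignores quantization scale factors and activation memory, and uses the textbook expected-acceptance formula for generic speculative decoding rather than anything specific to DFlash. The acceptance rate and draft length are hypothetical values chosen for illustration.

```python
def weight_footprint_gb(num_params: float, bits_per_param: float) -> float:
    """Storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def expected_tokens_per_pass(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model pass in generic speculative decoding.

    Assumes each drafted token is accepted independently with the same
    probability; the target model always contributes one token of its own.
    """
    if acceptance_rate >= 1.0:
        return float(draft_len + 1)
    return (1 - acceptance_rate ** (draft_len + 1)) / (1 - acceptance_rate)


NUM_PARAMS = 1e12  # "1 trillion parameter" model, from the article
NUM_GPUS = 8       # 8-GPU node, from the article

for bits in (16, 4):
    total = weight_footprint_gb(NUM_PARAMS, bits)
    print(f"{bits}-bit weights: {total:.0f} GB total, {total / NUM_GPUS:.1f} GB per GPU")

# Hypothetical values, not reported figures for DFlash.
ACCEPTANCE_RATE = 0.8
DRAFT_LEN = 4
print(f"{expected_tokens_per_pass(ACCEPTANCE_RATE, DRAFT_LEN):.2f} tokens per target pass")
```

Under these assumptions, moving from 16-bit to 4-bit weights cuts the per-GPU weight footprint by a factor of four, and each pass of the large model yields several tokens instead of one. Both effects reduce the memory traffic per generated token, which is typically what limits decoding speed.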

Context

Traditionally, ultra-high inference speeds for large models have required specialized AI accelerators, such as chips from Groq or Cerebras. Xiaomi's approach runs on existing infrastructure built from standard GPUs, offsetting the advantage of proprietary specialized hardware through software and algorithmic optimization.

Why It Matters for the Industry

For the industry, this means far less dependence on specialized AI accelerators and the ability to run very large models on commodity GPU hardware. It paves the way for mass adoption of real-time AI systems in the cloud and changes the economics of LLM serving, shifting the focus from parameter count to inference efficiency (tokens per joule and tokens per dollar).
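The two efficiency metrics can be computed directly from throughput. The sketch below is illustrative only: the hourly node price and the node power draw are hypothetical placeholders, not figures from Xiaomi, and it considers a single generation stream, whereas a real serving system batches many requests at once.

```python
def tokens_per_dollar(tokens_per_second: float, node_cost_per_hour: float) -> float:
    """Tokens generated for each dollar of node rental."""
    return tokens_per_second * 3600 / node_cost_per_hour


def tokens_per_joule(tokens_per_second: float, node_power_watts: float) -> float:
    """Tokens generated per joule; one watt is one joule per second."""
    return tokens_per_second / node_power_watts


TPS = 1000.0                 # lower bound of the reported 1,000 to 1,200 TPS
NODE_COST_PER_HOUR = 20.0    # hypothetical price of an 8-GPU node, in dollars
NODE_POWER_WATTS = 6000.0    # hypothetical power draw of the node

print(f"{tokens_per_dollar(TPS, NODE_COST_PER_HOUR):,.0f} tokens per dollar")
print(f"{tokens_per_joule(TPS, NODE_POWER_WATTS):.3f} tokens per joule")
```

With these placeholder inputs, the node produces 180,000 tokens per dollar. Both metrics scale linearly with throughput on fixed hardware, which is why a software-only speedup changes serving economics.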

Why It Matters for Users

Users gain access to powerful models with very low response latency. This is critical for responsive AI agents, high-frequency trading systems, and real-time data analysis tools, where response speed directly determines what the product can do.
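To put the reported speeds in user-facing terms, the following sketch estimates how long a reply takes to generate. The 500-token reply length is an arbitrary example, and the baseline speed is simply the reported lower bound divided by the "15 times" figure, which is an assumption about how that comparison was made. Time to first token, network delay, and queueing are not included.

```python
def generation_time_s(num_tokens: int, tokens_per_second: float) -> float:
    """Seconds needed to generate num_tokens at a steady decoding rate."""
    return num_tokens / tokens_per_second


REPLY_TOKENS = 500           # arbitrary example reply length
ULTRASPEED_TPS = (1000.0, 1200.0)  # reported range
BASELINE_TPS = 1000.0 / 15   # assumed baseline implied by "15 times faster"

for tps in ULTRASPEED_TPS:
    print(f"{tps:.0f} TPS: {generation_time_s(REPLY_TOKENS, tps):.2f} s")
print(f"baseline ({BASELINE_TPS:.1f} TPS): {generation_time_s(REPLY_TOKENS, BASELINE_TPS):.1f} s")
```

Under these assumptions, a reply that would take several seconds at the baseline rate is generated in about half a second, which is the difference between a visible wait and an apparently immediate answer.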

Author

Look at AI, Editorial Team