MiniMax has introduced MiniMax-M3, a multimodal model that uses the MiniMax Sparse Attention (MSA) mechanism to handle contexts of up to 1 million tokens while keeping inference fast.


What Happened

MiniMax released MiniMax-M3, an open multimodal model with approximately 428 billion total parameters. Its attention has a dual-branch architecture: an Index Branch that retrieves the relevant context and a Main Branch that computes precise attention over it. With this design, the model uses only 23 billion parameters during inference. Compared with the GQA mechanism on H800 hardware, it achieves a 14.2x speedup in the prefill stage and a 7.6x speedup in decoding, and cuts the compute cost per token by a factor of 28.
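
The two-branch idea can be illustrated with a minimal sketch. This is not MiniMax's implementation: it assumes the index step scores every position with a small, cheap vector and keeps the top `k`, and that the main step then runs ordinary softmax attention over only those positions. The function `sparse_attention` and its parameters (`index_query`, `index_keys`, `k`) are hypothetical names chosen for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    # Subtract the maximum for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_attention(query, keys, values, index_query, index_keys, k):
    """Illustrative two-step attention for a single query (assumed design)."""
    # Index step: rank all positions using the cheap low-dimensional vectors.
    coarse = [dot(index_query, ik) for ik in index_keys]
    top = sorted(range(len(keys)), key=lambda i: coarse[i], reverse=True)[:k]
    top.sort()  # restore positional order

    # Main step: exact scaled dot-product attention over the selected positions only.
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[i]) * scale for i in top])
    dim = len(values[0])
    output = [sum(w * values[i][d] for w, i in zip(weights, top)) for d in range(dim)]
    return output, top

# Toy data: four positions, of which the index step keeps two.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
index_keys = [[1.0], [0.0], [2.0], [-1.0]]
output, selected = sparse_attention([1.0, 0.0], keys, values, [1.0], index_keys, k=2)
print(selected, output)
```

With `k` equal to the number of positions, the sketch reduces to ordinary dense attention. The savings come from the main step touching only `k` positions, however long the context is.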

Context

Traditional attention architectures, such as dense attention and GQA, incur compute costs that grow quadratically with context length. The release of MiniMax-M3 reflects an industry shift from competing solely on total parameter counts toward optimizing inference efficiency and context scalability through sparse attention mechanisms.
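
A rough back-of-the-envelope sketch shows why quadratic growth matters at this scale. It counts only the query-key pairs that causal attention scores, and it assumes a hypothetical sparse scheme in which each token attends to at most `budget` earlier tokens. The budget of 2048 is an illustrative assumption, not a published MSA setting. The count also leaves out the cost of selecting those tokens, so the ratios are not comparable to the speedups reported above.

```python
def dense_pairs(n_tokens):
    """Query-key pairs scored by causal dense attention: token i attends to i tokens."""
    return n_tokens * (n_tokens + 1) // 2

def sparse_pairs(n_tokens, budget):
    """Pairs scored when each token attends to at most `budget` tokens (assumed scheme)."""
    if n_tokens <= budget:
        return dense_pairs(n_tokens)
    # The first `budget` tokens behave densely; every later token scores exactly `budget` pairs.
    return dense_pairs(budget) + (n_tokens - budget) * budget

BUDGET = 2048  # hypothetical per-token budget, for illustration only

for n in (8_000, 128_000, 1_000_000):
    ratio = dense_pairs(n) / sparse_pairs(n, BUDGET)
    print(f"{n:>9} tokens: dense/sparse pair ratio ~ {ratio:.1f}")
```

Doubling the context roughly quadruples the dense count but only doubles the sparse one, so the gap between the two widens as the context grows.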

Why It Matters for the Industry

An efficient sparse attention mechanism that runs faster than GQA paves the way for commercially viable multimodal agents with ultra-long context. It allows systems to scale context length without a quadratic increase in compute cost, making the deployment of models with 1M+ token contexts economically feasible on standard hardware stacks like the H800.

Why It Matters for Users

Users can work with very large inputs, such as entire libraries of books or multi-hour videos, significantly faster and at lower cost. The model also offers two operating modes: 'thinking' for solving complex analytical tasks and 'non-thinking' for maximum response speed.

Author

Look at AI, Editorial Team