Lightricks has released Wan2.2-NVFP4-Sparse, a much faster version of the Wan 2.2 (14B parameters) video generation model. The model uses NVFP4 quantization and Sparse Attention, optimized for the NVIDIA architecture...

What Happened

Inference has been reduced to just 4 steps. Together with quantization and sparse attention, this yields a 50-60x speedup over the standard version: generating a 720p video takes only 45 seconds instead of 2668 seconds on an RTX 5090.
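The quoted timings can be checked with a few lines of arithmetic. The sketch below uses only the two figures given in the text; the variable names are illustrative.

```python
# Figures quoted in the article (720p generation on an RTX 5090).
baseline_seconds = 2668   # standard Wan 2.2
optimized_seconds = 45    # Wan2.2-NVFP4-Sparse

# Speedup is the ratio of the two wall-clock times.
speedup = baseline_seconds / optimized_seconds
baseline_minutes = baseline_seconds / 60

print(f"speedup: {speedup:.1f}x")
print(f"baseline: {baseline_minutes:.1f} min")
```

The ratio lands at the upper end of the stated 50-60x range, and the baseline corresponds to roughly three quarters of an hour per clip.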

Context

Wan2.2-NVFP4-Sparse combines three optimizations. NVFP4 quantization stores values in a 4-bit floating-point format, sharply lowering numerical precision without catastrophic loss of quality and optimizing tensor operations on Blackwell. Sparse Attention reduces the attention cost of each step, and distillation cuts the number of inference steps to 4, a critical factor in moving from batch processing toward real-time generation. The resulting 50-60x speedup (from 2668 to 45 seconds for 720p on an RTX 5090) demonstrates the potential of hardware-aware optimization for heavy models (14B parameters).
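To make the quantization idea concrete, here is a minimal pure-Python sketch of block-scaled 4-bit floating-point quantization. It is an illustration only, not the model's actual implementation: it assumes the E2M1 value grid and 16-element blocks from NVIDIA's public description of NVFP4, and it keeps each block scale in full precision, whereas the real format stores scales in reduced precision. All function names are hypothetical.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
FP4_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # assumption: one shared scale per 16 values


def quantize_block(block):
    """Return (scale, codes), where each code is a signed FP4 level."""
    max_abs = max(abs(v) for v in block)
    # Map the largest magnitude in the block onto the largest FP4 level.
    scale = max_abs / FP4_LEVELS[-1] if max_abs > 0 else 1.0
    codes = []
    for v in block:
        target = abs(v) / scale
        nearest = min(FP4_LEVELS, key=lambda level: abs(level - target))
        codes.append(nearest if v >= 0 else -nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate values from a scale and its FP4 codes."""
    return [c * scale for c in codes]


def quantize(values):
    """Quantize and immediately dequantize, block by block."""
    restored = []
    for i in range(0, len(values), BLOCK_SIZE):
        scale, codes = quantize_block(values[i:i + BLOCK_SIZE])
        restored.extend(dequantize_block(scale, codes))
    return restored


# Toy "weights": 32 evenly spaced values, i.e. two blocks.
weights = [0.01 * i - 0.16 for i in range(32)]
restored = quantize(weights)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max abs error: {max_error:.4f}")
```

Because each block carries its own scale, the rounding error is bounded relative to the largest value in that block rather than the largest value in the whole tensor, which is what lets such a coarse 8-level grid remain usable.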

Why It Matters for the Industry

It demonstrates the capabilities of deep optimization for the new generation of GPUs (Blackwell) through specialized quantization and distillation, making heavy video models suitable for near real-time applications.

Why It Matters for Users

Creating high-quality, high-resolution video can now take less than a minute instead of tens of minutes, radically changing the workflow in AI production.

What Is Not Yet Known / Limitations

The assessments differ notably only on long-term implications: one points to risks of technological dependency on specific hardware (Blackwell) and raises questions about security and lifecycle management, while the others focus exclusively on performance and...

Sources