🎨 Peking University and Alibaba have introduced LoomVideo, a multimodal model for video generation and editing.

The 5-billion-parameter architecture combines Wan 2.2 (TI2V 5B) and Qwen3-VL-8B through Deepstack Injection and Scale-and-Add mechanisms. This speeds up inference by 5.4–6.2x compared with the standard token concatenation method.
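To give a rough intuition for why additive conditioning can be cheaper than token concatenation, here is a minimal Python sketch. It is not LoomVideo's implementation: the function names, the toy shapes, and the `video + scale * condition` formula are assumptions made for illustration only. The sketch shows just one thing: concatenation lengthens the sequence that self-attention must process, while an additive scheme leaves the sequence length unchanged.

```python
# Illustrative sketch only. Function names, shapes, and the additive formula
# are assumptions, not the published LoomVideo code.


def attention_pair_count(num_tokens: int) -> int:
    """Self-attention scores every token against every token: n * n pairs."""
    return num_tokens * num_tokens


def concat_condition(video_tokens, cond_tokens):
    """Token concatenation: conditioning tokens are appended to the sequence."""
    return video_tokens + cond_tokens


def scale_and_add(video_tokens, cond_features, scale):
    """Additive conditioning (assumed form): each video token receives a
    scaled conditioning vector, so the sequence length does not grow."""
    if len(video_tokens) != len(cond_features):
        raise ValueError("expected one conditioning vector per video token")
    return [
        [v + scale * c for v, c in zip(token, feature)]
        for token, feature in zip(video_tokens, cond_features)
    ]


# Toy data: 4 video tokens and 4 conditioning vectors, each of dimension 2.
video_tokens = [[1.0, 2.0] for _ in range(4)]
cond_features = [[0.5, -0.5] for _ in range(4)]

concatenated = concat_condition(video_tokens, cond_features)
added = scale_and_add(video_tokens, cond_features, scale=2.0)

pairs_concat = attention_pair_count(len(concatenated))
pairs_added = attention_pair_count(len(added))
```

In this toy case, concatenation doubles the sequence length and so quadruples the number of attention pairs. The ratio here is an artifact of the toy sizes and should not be read as an explanation of the reported 5.4–6.2x figure.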

🌍 The shift toward compact models (5B) instead of heavyweight ones (13B+), together with efficient conditioning, substantially reduces computational costs and could pave the way for real-time video production.

👤 You will be able to create and edit videos much faster with less resource-intensive models that outperform much larger ones on tasks such as e-commerce and fashion generation.

Source 1: http://msalab-pku.github.io/projects/LoomVideo/index.html
Source 2: https://arxiv.org/abs/2606.06042