LoomVideo: New Multimodal Model for Video Generation and Editing

🎨 Peking University and Alibaba have introduced LoomVideo, a multimodal model for video generation and editing.

The 5-billion-parameter architecture combines Wan 2.2 (TI2V 5B) and Qwen3-VL-8B through Deepstack Injection and Scale-and-Add mechanisms. This makes inference 5.4–6.2x faster than with the standard token concatenation method.
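
A rough sketch of why avoiding token concatenation can pay off. It assumes, purely for illustration, that an additive scheme folds the conditioning signal into the existing video tokens instead of lengthening the sequence; the function names and token counts are hypothetical, and this is neither LoomVideo's actual implementation nor a reproduction of the reported 5.4–6.2x figure.

```python
def attention_pairs(seq_len: int) -> int:
    """Query-key pairs in full self-attention over seq_len tokens."""
    return seq_len * seq_len


def concat_cost(video_tokens: int, cond_tokens: int) -> int:
    # Token concatenation: conditioning tokens are appended to the sequence,
    # so attention runs over video_tokens + cond_tokens.
    return attention_pairs(video_tokens + cond_tokens)


def additive_cost(video_tokens: int, cond_tokens: int) -> int:
    # Assumed additive conditioning: the sequence length stays unchanged,
    # so cond_tokens does not enter the attention cost.
    return attention_pairs(video_tokens)


def scale_and_add(hidden: list[float], cond: list[float], scale: float) -> list[float]:
    # Hypothetical form of an additive update: hidden + scale * cond,
    # applied elementwise to features of matching length.
    return [h + scale * c for h, c in zip(hidden, cond)]


def relative_cost(video_tokens: int, cond_tokens: int) -> float:
    """How many times more attention pairs concatenation needs."""
    return concat_cost(video_tokens, cond_tokens) / additive_cost(video_tokens, cond_tokens)


if __name__ == "__main__":
    # Made-up token counts, chosen only to show the quadratic effect.
    print(relative_cost(1000, 1000))
    print(scale_and_add([1.0, 2.0], [2.0, 4.0], 0.5))
```

Because self-attention cost grows with the square of the sequence length, doubling the sequence by appending conditioning tokens roughly quadruples that term, while an additive update leaves it unchanged.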

🌍 The shift toward compact models (5B) instead of heavyweight ones (13B+), together with efficient conditioning, sharply reduces computational costs and paves the way for real-time video production.

👤 You will be able to create and edit videos much faster with less resource-intensive models that outperform far larger ones in tasks such as e-commerce and fashion generation.

Source 1: http://msalab-pku.github.io/projects/LoomVideo/index.html
Source 2: https://arxiv.org/abs/2606.06042
