LoomVideo: A New Compact Multimodal Model for Video Generation and Editing

Researchers from Peking University and Alibaba have introduced LoomVideo, a 5-billion-parameter multimodal model that combines video generation and editing in a single system. Its conditioning design targets fast inference and accurate instruction following, offering an efficient alternative to much larger models.

What Happened

The 5B-parameter model uses a hybrid architecture that combines Wan 2.2 (TI2V 5B) with Qwen3-VL-8B through a Deepstack Injection mechanism. It supports video generation from text descriptions, instruction-based editing, and conditioning on reference images and videos. The key design choice is Scale-and-Add Conditioning, which replaces standard token concatenation and, according to the authors, speeds up inference by 5.4 to 6.2 times.
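To make the contrast between the two conditioning styles concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the single scalar `scale`, and the assumption that the condition tokens are already aligned one-to-one with the latent tokens are all illustrative guesses, since the article does not describe the exact formulation.

```python
from typing import List

Token = List[float]  # one token is one feature vector


def concat_conditioning(latent: List[Token], cond: List[Token]) -> List[Token]:
    """Baseline: append condition tokens, so the sequence grows."""
    return latent + cond


def scale_and_add_conditioning(
    latent: List[Token], cond: List[Token], scale: float
) -> List[Token]:
    """Hypothetical additive variant: scale the condition and add it
    element-wise, so the sequence length stays the same."""
    if len(latent) != len(cond):
        raise ValueError("condition must be aligned with the latent tokens")
    return [
        [x + scale * c for x, c in zip(latent_tok, cond_tok)]
        for latent_tok, cond_tok in zip(latent, cond)
    ]


latent = [[1.0, 2.0], [3.0, 4.0]]
cond = [[10.0, 20.0], [30.0, 40.0]]

concatenated = concat_conditioning(latent, cond)              # 4 tokens
added = scale_and_add_conditioning(latent, cond, scale=0.5)   # 2 tokens
```

In this toy setup, concatenation doubles the number of tokens the model has to process, while the additive variant keeps the original count and only changes the token values.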

Context

Traditional video generation models often require a very large number of parameters (13B or more) and significant computational power. They typically handle context by token concatenation, which lengthens the input sequence and slows processing. LoomVideo instead takes a more compact approach that focuses on making the conditioning mechanism itself efficient.
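A rough back-of-the-envelope calculation shows why longer sequences are costly. The sketch below assumes full self-attention, where every token attends to every other token, and counts only that term. The function names and token counts are hypothetical, and the result is not meant to reproduce the speedup figures reported for LoomVideo.

```python
def attention_pair_count(seq_len: int) -> int:
    """Number of query-key pairs in full self-attention."""
    return seq_len * seq_len


def concat_slowdown(latent_tokens: int, condition_tokens: int) -> float:
    """Ratio of attention pairs after concatenating condition tokens
    to the pairs needed for the latent tokens alone."""
    base = attention_pair_count(latent_tokens)
    extended = attention_pair_count(latent_tokens + condition_tokens)
    return extended / base


# One reference input with as many tokens as the latent video itself.
one_reference = concat_slowdown(1000, 1000)   # 4.0
# Two such reference inputs concatenated to the sequence.
two_references = concat_slowdown(1000, 2000)  # 9.0
```

Under these assumptions, each additional concatenated input raises the attention cost quadratically, which is the kind of overhead that a conditioning method with a fixed sequence length avoids.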

Why It Matters for the Industry

For the AI industry, this reflects a shift in focus from simple parameter scaling (scaling laws) toward architectural optimization. Techniques such as Scale-and-Add Conditioning suggest that smaller, high-performance models can compete with much larger ones on specialized tasks. That could make real-time video production cheaper and faster, and bring high-quality generation within reach of local deployment on less expensive hardware.

Why It Matters for Users

Users gain the ability to create and edit videos faster and with fewer resources. Companies in sectors such as e-commerce and fashion can use these tools for automated content creation (e.g., virtual clothing try-ons or background changes) at low computational cost. For end consumers, this could mean more affordable and faster tools for real-time video content creation.

Sources

LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing (Project Page)
[arXiv:2606.06042 [cs.CV] LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing](https://arxiv.org/abs/2606.06042)

Author

Look at AI, Editorial Staff