The Allen Institute for AI (Ai2) has introduced MolmoMotion, a 4-billion-parameter vision-language model (VLM) that predicts the 3D motion trajectories of objects from short RGB videos and text instructions.


What Happened
MolmoMotion takes a video clip and predicts the future 3D positions of selected points, expressed in meters, over a horizon of up to 30 frames. It was trained on the purpose-built MolmoMotion-1M dataset, which contains 1.16 million video clips paired with text descriptions of actions.
Context
Unlike traditional approaches that operate in 2D pixel space, MolmoMotion formulates the prediction task in a physically meaningful 3D space. This lets the model work with real-world motion measured in meters rather than simply predicting pixel displacement.
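To see why metric 3D coordinates carry more information than pixel offsets, consider a minimal sketch based on the standard pinhole camera model. It is an illustration only and does not describe how MolmoMotion computes its predictions; the function names and the camera intrinsics (FX, FY, CX, CY) are hypothetical values chosen for the example.

```python
import math

# Hypothetical pinhole intrinsics for a 640x480 camera (assumed values).
FX = FY = 500.0          # focal lengths in pixels
CX, CY = 320.0, 240.0    # principal point in pixels


def back_project(u, v, depth_m):
    """Map pixel (u, v) at a known depth to camera-frame coordinates in meters."""
    x = (u - CX) * depth_m / FX
    y = (v - CY) * depth_m / FY
    return (x, y, depth_m)


def metric_shift(depth_m, pixel_shift=50.0):
    """Real-world distance covered by a horizontal pixel shift at a given depth."""
    start = back_project(CX, CY, depth_m)
    end = back_project(CX + pixel_shift, CY, depth_m)
    return math.dist(start, end)


near = metric_shift(1.0)  # object 1 m from the camera
far = metric_shift(4.0)   # object 4 m from the camera
print(f"50 px at 1 m: {near:.2f} m, 50 px at 4 m: {far:.2f} m")
```

Under these assumed intrinsics, the same 50-pixel shift corresponds to four times more real-world motion at 4 m than at 1 m, which is the kind of ambiguity a purely pixel-space prediction cannot resolve.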
Why It Matters for the Industry
The work points toward more reliable manipulation planning in robotics and more physically plausible video generation. In tests, MolmoMotion achieved a significantly lower Average Displacement Error (ADE) than existing solutions such as Wan2.2 and ObjectForesight.
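For readers unfamiliar with the metric, ADE is commonly defined as the mean Euclidean distance between predicted and ground-truth positions, averaged over all points and frames. The sketch below follows that common definition; the exact evaluation protocol used for MolmoMotion is not described here, and the function name and toy data are made up for illustration.

```python
import math


def average_displacement_error(predicted, ground_truth):
    """Mean Euclidean distance between matching 3D points.

    Both arguments are nested lists indexed as [frame][point] -> (x, y, z),
    with coordinates in meters.
    """
    if len(predicted) != len(ground_truth):
        raise ValueError("trajectories must cover the same number of frames")
    total = 0.0
    count = 0
    for pred_frame, true_frame in zip(predicted, ground_truth):
        if len(pred_frame) != len(true_frame):
            raise ValueError("each frame must contain the same number of points")
        for p, q in zip(pred_frame, true_frame):
            total += math.dist(p, q)
            count += 1
    if count == 0:
        raise ValueError("trajectories are empty")
    return total / count


# Toy data (invented): one tracked point over three future frames.
ground_truth = [[(0.0, 0.0, 1.0)], [(0.1, 0.0, 1.0)], [(0.2, 0.0, 1.0)]]
predicted = [[(0.0, 0.0, 1.0)], [(0.1, 0.03, 1.04)], [(0.2, 0.0, 1.1)]]

ade = average_displacement_error(predicted, ground_truth)
print(f"ADE: {ade:.3f} m")
```

In this toy case the per-frame errors are 0 m, 0.05 m, and 0.1 m, so the ADE comes out at about 0.05 m. Because the inputs are in meters, the result is directly interpretable as a physical distance.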
Why It Matters for Users
The 4B-parameter model is openly available on Hugging Face, so developers and researchers can rapidly prototype robot control systems that follow natural-language instructions, or generate more realistic videos without the "floating object" effect.
What Is Not Yet Known / Limitations
Questions remain about whether such a large volume of video data was legitimately collected for training. Industrial deployment will also require additional data on latency and inference costs.
Author
Look at AI, Editorial Team
