Ai2 Introduces MolmoMotion for 3D Motion Prediction

The Allen Institute for AI has released MolmoMotion, a vision-language model capable of predicting 3D object trajectories from video and text commands.

Compiled by Sergey Kostenchuk. Published 2026-06-18. Updated 2026-06-18.


🤖 MolmoMotion: Predicting 3D Motion from Video

The Allen Institute for AI (Ai2) has introduced MolmoMotion, a vision-language model (VLM) with 4 billion parameters. It takes a short RGB video and a text instruction as input and predicts the 3D positions of points over a horizon of up to 30 frames.
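To make the output format concrete, here is a minimal sketch of how a predicted trajectory for one point could be represented and inspected. The `PredictedTrack` class, its methods, and the choice of plain `(x, y, z)` tuples are hypothetical illustrations, not part of any MolmoMotion API; only the 30-frame cap comes from the description above.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Point3D = Tuple[float, float, float]

# From the article: predictions cover a horizon of up to 30 frames.
MAX_HORIZON = 30


@dataclass
class PredictedTrack:
    """Hypothetical container for one point's predicted 3D positions."""

    positions: List[Point3D]  # one (x, y, z) per future frame

    def __post_init__(self) -> None:
        # Reject empty tracks and tracks longer than the stated horizon.
        if not 1 <= len(self.positions) <= MAX_HORIZON:
            raise ValueError(
                f"expected 1 to {MAX_HORIZON} frames, got {len(self.positions)}"
            )

    def displacement(self) -> Point3D:
        """Net 3D offset from the first to the last predicted frame."""
        first, last = self.positions[0], self.positions[-1]
        return tuple(b - a for a, b in zip(first, last))

    def path_length(self) -> float:
        """Total distance travelled along the predicted path."""
        return sum(
            math.dist(a, b)
            for a, b in zip(self.positions, self.positions[1:])
        )
```

A downstream consumer such as a robot planner would typically care about both quantities: the net displacement says where the point ends up, while the path length says how far it travels to get there.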

🌍 The model moves the prediction task from pixel space to physically meaningful 3D coordinates. This matters for robotics and for generating realistic video in which objects do not violate the laws of physics.
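One way to see why pixel-space motion is not physically meaningful on its own is a small sketch under an ideal pinhole camera model. The camera model, the focal length of 500 pixels, and the metre units are assumptions chosen for illustration; the article does not say how MolmoMotion handles camera geometry.

```python
def project_x(x_m: float, z_m: float, focal_px: float = 500.0) -> float:
    """Horizontal image coordinate (pixels, relative to the image centre)
    of a point at lateral offset x_m and depth z_m, for an ideal pinhole camera."""
    if z_m <= 0:
        raise ValueError("depth must be positive")
    return focal_px * x_m / z_m


def pixel_shift(dx_m: float, z_m: float, focal_px: float = 500.0) -> float:
    """Pixel displacement caused by moving dx_m sideways at constant depth z_m."""
    return project_x(dx_m, z_m, focal_px) - project_x(0.0, z_m, focal_px)
```

Under these assumptions, the same 0.5 m sideways motion gives `pixel_shift(0.5, 1.0) == 250.0` at 1 m depth but `pixel_shift(0.5, 5.0) == 50.0` at 5 m. A pixel-level prediction mixes real motion with distance from the camera, whereas a 3D prediction reports the 0.5 m directly.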

👤 This is a step toward AI that models the physics of motion rather than just "drawing" frames. It could enable more precise robot control and more realistic generated content.

Source 1: https://allenai.org/blog/molmo-motion
Source 2: https://huggingface.co/allenai/MolmoMotion-4B-H3-F30
