MolmoMotion: Predicting 3D Motion from Video
The Allen Institute for AI (Ai2) has introduced MolmoMotion, a 4-billion-parameter vision-language model (VLM). Given a short RGB video and a text instruction, it predicts the positions of points in 3D space over a horizon of up to 30 frames.
The model moves the prediction task out of 2D pixel space and into physically meaningful 3D coordinates. This matters for robotics and for generating realistic video in which objects do not violate the laws of physics.
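To make the output described above more concrete, here is a minimal sketch of how a caller might hold and check such predictions: a list of frames, each holding the (x, y, z) position of every tracked point. The class name `PointTrajectories`, its fields, and the `displacement` helper are hypothetical and are not part of any Ai2 or Hugging Face API; only the 30-frame limit comes from the post.

```python
from __future__ import annotations

import math
from dataclasses import dataclass

# From the post: predictions cover a horizon of up to 30 frames.
MAX_HORIZON = 30


@dataclass
class PointTrajectories:
    """Hypothetical container: positions[t][i] is the (x, y, z) of point i at frame t."""

    positions: list[list[tuple[float, float, float]]]

    def __post_init__(self) -> None:
        # Reject horizons outside the range the post describes.
        if not 1 <= len(self.positions) <= MAX_HORIZON:
            raise ValueError(
                f"horizon must be 1..{MAX_HORIZON} frames, got {len(self.positions)}"
            )
        # Every frame must track the same set of points.
        n_points = len(self.positions[0])
        if any(len(frame) != n_points for frame in self.positions):
            raise ValueError("every frame must contain the same number of points")

    @property
    def horizon(self) -> int:
        """Number of frames in the prediction."""
        return len(self.positions)

    def displacement(self, i: int) -> float:
        """Straight-line 3D distance point i moves between the first and last frame."""
        return math.dist(self.positions[0][i], self.positions[-1][i])


if __name__ == "__main__":
    # Two points over three frames: point 0 rises along z, point 1 stays still.
    traj = PointTrajectories([
        [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)],
        [(0.0, 0.0, 0.5), (1.0, 1.0, 1.0)],
        [(0.0, 0.0, 1.0), (1.0, 1.0, 1.0)],
    ])
    print(traj.horizon, traj.displacement(0), traj.displacement(1))
```

Because the positions are 3D coordinates rather than pixel offsets, a quantity such as displacement is independent of camera projection, which illustrates why a 3D output is more useful for physical reasoning than a 2D one.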
This is an important step toward AI that understands the physics of motion rather than merely "drawing" plausible frames. It should allow more precise robot control and the creation of hyper-realistic content.
Source 1: https://allenai.org/blog/molmo-motion
Source 2: https://huggingface.co/allenai/MolmoMotion-4B-H3-F30
