Researchers have introduced Modality Forcing, a method for efficiently fine-tuning diffusion transformers (DiT) such as FLUX.2 to jointly generate RGB images and depth maps. By applying independent noise to each modality, a single model can handle tasks ranging from text-to-RGB-D generation to reconstructing geometry from existing photographs.

What Happened
Modality Forcing applies independent noise to each modality (RGB and depth). This allows a single FLUX.2-based architecture to perform three types of tasks: joint generation (text -> RGB-D), image-to-depth (image -> depth), and depth-to-image (depth -> image). The method achieved state-of-the-art (SOTA) results on 4 out of 5 depth prediction benchmarks.
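One way to picture how independent noise yields three tasks from one model is the minimal sketch below. It is an illustration under stated assumptions, not the authors' implementation: the task names, the function names, and the linear noise blend are all hypothetical, since the article does not describe the actual schedule or API. The idea shown is that a modality held at noise level 0 acts as clean conditioning, while a modality at a higher level is the one being generated.

```python
import random

def noise_levels(task, t):
    """Return (t_rgb, t_depth) for a task at denoising time t in [0, 1].

    Assumption: 0.0 means the modality is kept clean and used as
    conditioning; a value above 0 means it is still being denoised.
    """
    if task == "joint":            # text -> RGB-D: both modalities are generated
        return t, t
    if task == "image_to_depth":   # image -> depth: RGB stays clean
        return 0.0, t
    if task == "depth_to_image":   # depth -> image: depth stays clean
        return t, 0.0
    raise ValueError(f"unknown task: {task}")

def add_noise(clean, t, rng):
    """Blend a clean signal with Gaussian noise (assumed linear schedule)."""
    return [(1.0 - t) * x + t * rng.gauss(0.0, 1.0) for x in clean]

def sample_training_pair(rgb, depth, rng):
    """Draw a separate noise level for each modality, then noise both."""
    t_rgb, t_depth = rng.random(), rng.random()
    noisy_rgb = add_noise(rgb, t_rgb, rng)
    noisy_depth = add_noise(depth, t_depth, rng)
    return noisy_rgb, noisy_depth, (t_rgb, t_depth)
```

Because training in this sketch sees every combination of noise levels, the same network could in principle be steered at inference time simply by choosing which modality to hold clean, for example `noise_levels("image_to_depth", 0.7)`.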
Context
Traditional methods for training spatial perception often require dense depth datasets, which makes scaling difficult. Modality Forcing addresses this problem by using sparse data, which makes fine-tuning existing SOTA image generation models more accessible and efficient.
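The article does not say how sparse data enters training. A common approach for sparse depth supervision, shown here purely as an assumed illustration, is to compute the loss only over pixels that have a depth measurement. The function name and the choice of mean squared error are hypothetical.

```python
def masked_mse(pred, target, valid):
    """Mean squared error over pixels that have a depth measurement.

    pred, target: flat lists of depth values.
    valid: flat list of booleans, True where ground-truth depth exists.
    """
    total, count = 0.0, 0
    for p, t, v in zip(pred, target, valid):
        if v:
            total += (p - t) ** 2
            count += 1
    # With no valid pixels there is nothing to supervise, so return 0.
    return total / count if count else 0.0
```

Pixels without ground truth contribute nothing to the loss, so a dataset with only a few measured points per image can still provide a training signal.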
Why It Matters for the Industry
The method shows that image generation can serve as a scalable task for training spatial perception. Using sparse data simplifies training and makes the approach applicable to real-world datasets. This paves the way for multimodal DiTs that could serve as the foundation for future world models capable of understanding and reproducing the geometry of the physical world.
Why It Matters for Users
Users gain a tool, built on powerful open-source models, for turning ordinary 2D photos into high-quality depth maps and for creating controllable 3D scenes. Generative neural networks can thus be used not only for visual content but also for tasks that involve precise control of object geometry and 3D structure reconstruction.
Author
Look at AI, Editorial Team
