🖼 Modular Image and Depth Generation
Modality Forcing is a newly introduced method for fine-tuning Diffusion Transformers (DiT) to generate images and depth maps simultaneously. Because noise is injected independently into the RGB and depth modalities, a single FLUX.2-based model can handle joint generation, image-to-depth, and depth-to-image tasks.
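To make the independent noise injection concrete, here is a minimal NumPy sketch. It is an illustration under assumptions, not the authors' implementation: the rectified-flow-style interpolation, the function name `noise_modalities`, and the `TASKS` presets are all hypothetical and chosen only to show how one noise level per modality can select the task.

```python
import numpy as np


def noise_modalities(rgb, depth, t_rgb, t_depth, rng):
    """Noise RGB and depth latents with independent noise levels.

    Assumption: a rectified-flow-style interpolation
    x_t = (1 - t) * x + t * eps, where t = 0 is clean and t = 1 is pure noise.
    """
    eps_rgb = rng.standard_normal(rgb.shape)
    eps_depth = rng.standard_normal(depth.shape)
    noisy_rgb = (1.0 - t_rgb) * rgb + t_rgb * eps_rgb
    noisy_depth = (1.0 - t_depth) * depth + t_depth * eps_depth
    return noisy_rgb, noisy_depth


# Hypothetical starting noise levels (t_rgb, t_depth) for each task.
TASKS = {
    "joint": (1.0, 1.0),           # both modalities start from noise
    "image_to_depth": (0.0, 1.0),  # RGB stays clean and acts as the condition
    "depth_to_image": (1.0, 0.0),  # depth stays clean and acts as the condition
}

rng = np.random.default_rng(0)
rgb = rng.standard_normal((16, 8))    # stand-in for RGB latent tokens
depth = rng.standard_normal((16, 8))  # stand-in for depth latent tokens

t_rgb, t_depth = TASKS["image_to_depth"]
noisy_rgb, noisy_depth = noise_modalities(rgb, depth, t_rgb, t_depth, rng)
```

In this sketch, keeping one modality at noise level 0 turns it into a conditioning signal, which is how the same network could cover all three tasks.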
🌍 The method shows that image generation is a scalable way to train spatial perception. Because it can train on sparse depth data, the training process is simpler and the approach is more applicable to real-world scenarios.
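One common way to train on sparse depth is to compute the loss only at pixels that have a measurement. The sketch below shows that idea with a masked mean squared error. Whether Modality Forcing handles sparsity this way is an assumption, and `masked_mse` is a hypothetical helper.

```python
import numpy as np


def masked_mse(pred, target, valid):
    """Mean squared error over pixels where `valid` is True.

    Returns 0.0 when no pixel is valid, so an empty mask adds nothing to the loss.
    """
    valid = valid.astype(bool)
    if valid.sum() == 0:
        return 0.0
    diff = (pred - target)[valid]
    return float(np.mean(diff ** 2))


pred = np.array([[1.0, 2.0], [3.0, 4.0]])        # predicted depth
target = np.array([[1.0, 0.0], [0.0, 2.0]])      # sparse ground truth (0.0 = missing)
valid = np.array([[True, False], [False, True]])  # where depth was measured

loss = masked_mse(pred, target, valid)
```

Only the two measured pixels contribute here, so missing depth values never penalize the model.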
👤 This paves the way for building accurate 3D scenes, controlling object geometry, and reconstructing high-quality depth from ordinary photographs with powerful generative models.
Source 1: https://modality-forcing.github.io/
Source 2: https://huggingface.co/bartduis/modality_forcing
