Modality Forcing Method for Image and Depth Map Generation

Modality Forcing is a newly introduced method for fine-tuning Diffusion Transformers to generate RGB images and depth maps simultaneously.

Compiled by Sergey Kostenchuk | Published 2026-06-15 | Updated 2026-06-15

2026-06-15 Research HuggingFace

🖼 Modular Image and Depth Generation

Modality Forcing is a new method for fine-tuning Diffusion Transformers (DiT) to generate images and depth maps simultaneously. Because noise is injected independently into the RGB and depth modalities, a FLUX.2-based model can handle three tasks: joint generation, image-to-depth, and depth-to-image.
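To make the idea of independent noise injection concrete, here is a minimal sketch, not the authors' implementation. It assumes that each modality gets its own noise level between 0 (clean) and 1 (pure noise), and that keeping one modality clean turns it into conditioning for the other. The linear blend, the function names `noise_modality` and `sample_levels`, and the task labels are all hypothetical; the toy lists stand in for real RGB and depth latents.

```python
import random

def noise_modality(values, level, rng):
    """Blend values with Gaussian noise: level 0.0 keeps them clean, 1.0 is pure noise."""
    return [(1.0 - level) * v + level * rng.gauss(0.0, 1.0) for v in values]

def sample_levels(task, rng):
    """Pick (rgb_level, depth_level) for one example (assumed scheme, not from the source)."""
    if task == "joint":
        # Both modalities are noised, each with its own independent level.
        return rng.random(), rng.random()
    if task == "image_to_depth":
        # RGB stays clean and acts as the condition; only depth is noised.
        return 0.0, rng.random()
    if task == "depth_to_image":
        # Depth stays clean and acts as the condition; only RGB is noised.
        return rng.random(), 0.0
    raise ValueError(f"unknown task: {task}")

rng = random.Random(0)
rgb = [0.2, 0.5, 0.8]     # toy stand-in for RGB latents
depth = [1.0, 2.0, 3.0]   # toy stand-in for depth latents

rgb_level, depth_level = sample_levels("image_to_depth", rng)
noisy_rgb = noise_modality(rgb, rgb_level, rng)
noisy_depth = noise_modality(depth, depth_level, rng)
```

Under these assumptions, one model covers all three tasks simply by changing which modality is noised, which is why the two noise levels are sampled separately rather than shared.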

🌍 The method suggests that image generation is a scalable way to train spatial perception. Training on sparse depth data simplifies the process and makes the approach more applicable to real-world scenarios.
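The source does not say how sparse depth is consumed during training. One common way to handle it, shown here purely as an assumption, is to compute the loss only at pixels that actually have a depth measurement. The function name `masked_depth_loss` and the toy values are hypothetical.

```python
def masked_depth_loss(pred, target, valid):
    """Mean squared error over pixels that have a depth measurement; others are ignored."""
    errors = [(p - t) ** 2 for p, t, ok in zip(pred, target, valid) if ok]
    # With no valid pixels there is nothing to supervise, so return 0.0.
    return sum(errors) / len(errors) if errors else 0.0

pred = [1.0, 2.0, 3.0, 4.0]            # predicted depth per pixel
target = [1.5, 0.0, 3.0, 0.0]          # sparse ground truth (0.0 where missing)
valid = [True, False, True, False]     # which pixels carry a real measurement

loss = masked_depth_loss(pred, target, valid)
```

With a mask like this, missing pixels contribute nothing to the loss, so depth data with gaps could be used without first filling them in.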

👤 This paves the way for creating accurate 3D scenes, controlling object geometry, and high-quality depth reconstruction from ordinary photographs using powerful generative models.

Source 1: https://modality-forcing.github.io/
Source 2: https://huggingface.co/bartduis/modality_forcing
