TencentHY Introduces UniRL for Multimodal RL Training

Tencent Hunyuan (TencentHY) has released UniRL, a universal framework for multimodal reinforcement learning (RL). The system integrates generation, evaluation, and policy updates for text-to-image models, text/image-to-video models, LLMs, and diffusion models. The release also includes two algorithms: Flow-DPPO, for stabilizing diffusion model training, and DRPO, for stabilizing text LLM training in FP8 mode.

🌍 UniRL unifies the RL process for various types of generative models, eliminating the need for specialized stacks for each modality. This simplifies scaling research in multimodal intelligence and allows for the efficient application of RL methods (such as GRPO) to video and image models.
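For readers unfamiliar with GRPO, the core idea is that several outputs are sampled for the same prompt and each output's reward is normalized against the rest of its group, so no separate value network is needed. The sketch below illustrates only that normalization step. It is not UniRL's API: the function name, the `eps` parameter, and the reward values are hypothetical, and the use of population standard deviation is an assumption (implementations differ on this detail).

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of samples generated from the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    Hypothetical helper for illustration; not part of UniRL.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward-model scores for four images sampled from one prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)

# Samples scoring above the group mean get a positive advantage,
# samples below it get a negative one.
for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.3f}")
```

Because the normalization only needs scalar rewards per sample, the same step applies whether the samples are text completions, images, or videos, which is what makes a modality-agnostic RL loop plausible.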

👤 The ability to train models efficiently in low precision (FP8) without losing stability makes advanced RL fine-tuning more accessible to researchers and developers with limited computational resources.

Source 1: https://unirl-project.github.io/unirl/
Source 2: https://arxiv.org/pdf/2606.09821
