A new study, "Weight-Space Geometry of Offline Reasoning Training," accepted at an ICML 2026 workshop, reports that several popular reasoning training methods are effectively mathematical variations of standard SFT, while DPO follows a fundamentally different and more effective path for model development.

What Happened

Researchers analyzed the geometry of the weight updates produced by several offline reasoning training methods: SFT, RFT, RIFT, DFT, GRPO, and DPO. SFT, RFT, and RIFT produced nearly identical parameter changes, with a cosine similarity of at least 0.97 between their updates. DPO, by contrast, was markedly more accurate on the GSM8K (93.5% vs. 87-88%) and AIME26 benchmarks.
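To make the cosine-similarity comparison concrete, here is a minimal sketch of how the direction of two weight updates can be compared. It is an illustration only, not the study's code: the helper names `weight_delta` and `cosine_similarity`, the parameter names, and all of the toy weight values are assumptions made up for this example.

```python
import math


def weight_delta(base, tuned):
    """Flatten the per-parameter differences (tuned - base) into one vector."""
    delta = []
    for name in sorted(base):  # fixed order so both updates line up
        delta.extend(t - b for b, t in zip(base[name], tuned[name]))
    return delta


def cosine_similarity(u, v):
    """Cosine of the angle between two update vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero update")
    return dot / (norm_u * norm_v)


# Toy, made-up weights (hypothetical): one base model and two fine-tuned copies.
base = {"layer.w": [0.10, -0.20, 0.30], "layer.b": [0.00, 0.05]}
sft = {"layer.w": [0.12, -0.18, 0.33], "layer.b": [0.01, 0.05]}
rft = {"layer.w": [0.13, -0.17, 0.33], "layer.b": [0.01, 0.06]}

sim = cosine_similarity(weight_delta(base, sft), weight_delta(base, rft))
print(f"cosine similarity of the two updates: {sim:.3f}")
```

A value close to 1 means the two methods moved the weights in almost the same direction, even if the step sizes differ. A value near 0 would indicate updates that are largely unrelated.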

Context

Offline reinforcement learning methods are widely used in LLM training to improve models' logical reasoning capabilities. It is often unclear, however, whether methods like RFT or RIFT genuinely change the model's internal geometry or simply duplicate supervised fine-tuning on examples (SFT).

Why It Matters for the Industry

This work demystifies current approaches to distilling reasoning capabilities and lets companies avoid spending compute on methods that are effectively equivalent to SFT. It paves the way for more efficient pipelines built around methods that produce genuinely different updates, such as DPO, and for a deeper understanding of preference geometry.

Why It Matters for Users

Developers and engineers who fine-tune models should reconsider their training pipelines. Instead of running expensive and redundant RFT/RIFT stages, they can focus on optimizing SFT or move to DPO to achieve significant gains in reasoning quality.
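For readers weighing SFT against DPO, the sketch below contrasts the two objectives on a single example. It uses the standard published DPO loss, not anything specific to the study. The function names, the sequence-level log-probabilities, and the value `beta=0.1` are assumptions chosen for illustration.

```python
import math


def sft_loss(logp_chosen):
    """SFT: negative log-likelihood of the target solution only."""
    return -logp_chosen


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the policy being trained
    and under a frozen reference model. beta=0.1 is an assumed value.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin)
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities for one prompt with a good and a bad answer.
policy_chosen, policy_rejected = -12.0, -15.0
ref_chosen, ref_rejected = -13.0, -13.5

print("SFT loss:", sft_loss(policy_chosen))
print("DPO loss:", dpo_loss(policy_chosen, policy_rejected,
                            ref_chosen, ref_rejected))
```

The SFT loss depends only on the chosen answer, while the DPO loss also depends on the rejected answer and on the reference model. That extra signal is one plausible reason the two objectives push the weights in different directions.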

Author

Look at AI, Editorial Staff