🤖 Weight Geometry in Reasoning Model Training

Researchers have presented the paper "Weight-Space Geometry of Offline Reasoning Training," accepted to an ICML 2026 workshop. The study analyzes how several offline RL methods (SFT, RFT, RIFT, DFT, GRPO, DPO) change model weights during reasoning training. The results show that SFT, RFT, and RIFT produce nearly identical weight updates (cosine similarity ≥ 0.97), whereas DPO behaves as a fundamentally different algorithm and delivers a significant accuracy gain on GSM8K (93.5% vs. 87-88%) and on AIME26.
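To make the "cosine similarity of weight updates" idea concrete, here is a minimal sketch of how such a comparison could be computed. It is not the paper's code: the helper names (`weight_delta`, `cosine_similarity`), the toy parameter dictionaries, and the synthetic "methods" are all assumptions made for illustration. Each update is taken as fine-tuned weights minus base weights, flattened into one vector.

```python
import numpy as np

def weight_delta(base, tuned):
    """Flatten (tuned - base) over all parameter tensors into one vector.

    Hypothetical helper: `base` and `tuned` are dicts mapping parameter
    names to NumPy arrays of matching shapes.
    """
    return np.concatenate([(tuned[name] - base[name]).ravel() for name in sorted(base)])

def cosine_similarity(u, v):
    """Cosine of the angle between two flattened update vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy stand-ins for real checkpoints (assumption: synthetic data only).
rng = np.random.default_rng(0)
base = {"layer.weight": rng.normal(size=(4, 4)), "layer.bias": rng.normal(size=4)}
shared = {k: rng.normal(size=v.shape) for k, v in base.items()}

# Methods A and B move along the same direction (B is scaled, with small noise).
method_a = {k: base[k] + shared[k] for k in base}
method_b = {k: base[k] + 0.5 * shared[k] + 0.01 * rng.normal(size=base[k].shape) for k in base}
# Method C moves in an unrelated random direction.
method_c = {k: base[k] + rng.normal(size=base[k].shape) for k in base}

sim_ab = cosine_similarity(weight_delta(base, method_a), weight_delta(base, method_b))
sim_ac = cosine_similarity(weight_delta(base, method_a), weight_delta(base, method_c))
print(f"A vs B: {sim_ab:.4f}")
print(f"A vs C: {sim_ac:.4f}")
```

Cosine similarity ignores the size of the update, so two methods that push the weights the same way with different step sizes still score close to 1. With real models the same comparison would run over checkpoint state dicts, possibly layer by layer.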

🌍 The work debunks the myth that many modern "offline RL for reasoning" methods do anything more than plain SFT. Understanding the geometry of weight updates makes it easier to choose a reasoning distillation method and to avoid combining methods that effectively duplicate each other.

👤 If you are fine-tuning LLMs, be aware that popular methods such as RFT and RIFT may not change the model in any qualitatively new way compared to simple SFT, while real gains in reasoning quality come from approaches with a fundamentally different training objective, such as DPO.

Source 1: https://openreview.net/forum?id=mzgEXubB5M
Source 2: https://github.com/zj-karina/conference-poster