New research by AlexWortega shows that different alignment (optimization) methods produce fundamentally different internal weight structures in large language models, even when external quality metrics are identical.

What Happened

AlexWortega's research found that SFT, RFT, DFT, and offline GRPO produce similar weight landscapes when applied to the same training data. DPO, GRPO, and DAPO, by contrast, create fundamentally different weight structures. This weight geometry effect is stable: it holds across different learning rates and random seeds.

Context

During LLM training, alignment methods are used to fine-tune model behavior. It is traditionally assumed that models with the same benchmark results work similarly on the inside. This work challenges that assumption by analyzing the weight geometry itself.
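The article does not say how the study measured weight geometry. One common way to compare fine-tuned checkpoints, shown here purely as an illustrative sketch and not as the author's method, is to take each model's weight update relative to a shared base model and compute the cosine similarity between those updates. The function names and the toy "checkpoints" below are hypothetical; real models would supply state dicts in place of the random arrays.

```python
import numpy as np

def weight_delta(base, tuned):
    """Flatten the per-tensor differences (tuned - base) into one vector."""
    return np.concatenate([(tuned[name] - base[name]).ravel() for name in sorted(base)])

def apply_update(base, update_vec):
    """Return a copy of `base` with a flat update vector added tensor by tensor."""
    tuned, offset = {}, 0
    for name in sorted(base):
        size = base[name].size
        chunk = update_vec[offset:offset + size].reshape(base[name].shape)
        tuned[name] = base[name] + chunk
        offset += size
    return tuned

def cosine_similarity(u, v):
    """Cosine of the angle between two update vectors; 0.0 if either is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

# Toy stand-ins for real checkpoints (assumption: 20 parameters in total).
rng = np.random.default_rng(0)
base = {"layer.weight": rng.normal(size=(4, 4)), "layer.bias": rng.normal(size=4)}
n_params = sum(t.size for t in base.values())

# Hypothetical scenario: methods A and B move the weights in nearly the same
# direction, while method C moves them in an unrelated direction.
shared_direction = rng.normal(size=n_params)
model_a = apply_update(base, shared_direction + 0.05 * rng.normal(size=n_params))
model_b = apply_update(base, shared_direction + 0.05 * rng.normal(size=n_params))
model_c = apply_update(base, rng.normal(size=n_params))

sim_ab = cosine_similarity(weight_delta(base, model_a), weight_delta(base, model_b))
sim_ac = cosine_similarity(weight_delta(base, model_a), weight_delta(base, model_c))
print(f"A vs B: {sim_ab:.3f}, A vs C: {sim_ac:.3f}")
```

In this toy setup, a similarity near 1 means two methods pushed the weights along almost the same direction. Benchmark scores do not enter the calculation, so two models could score identically while their updates point in very different directions.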

Why It Matters for the Industry

For the industry, this means that high benchmark scores can mask deep differences in internal knowledge representations. This matters for developing transfer learning methods and for assessing model reliability. Understanding these differences could also let companies build unique technological advantages (moats) through specialized training methods.

Why It Matters for Users

For users and developers, this explains why models with identical accuracy scores may behave differently on non-standard tasks or in edge-case scenarios. Their "internal map" of knowledge is built differently depending on the chosen training method, which directly affects how predictable the model's behavior is.

What Remains Unknown / Limitations

No direct technical disagreements regarding the research results have been identified.

Author

Look at AI, Editorial Staff