A new finding in AI training: the hidden geometry of on-policy distillation
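One way to train large models is on-policy distillation (OPD), in which the model learns from answers it generates itself. It works well, but how it actually moves through parameter space has been poorly understood. This paper takes a geometric view: OPD updates touch fewer weights than supervised fine-tuning (SFT) and avoid the principal directions more strongly, and the cumulative update quickly enters a narrow low-dimensional channel and stays locked there. SFT, by comparison, changes the weights more broadly, while reinforcement learning with verifiable rewards (RLVR) is even more tightly constrained than OPD. More interesting still, if training is restricted to the update subspace formed early in training, OPD performance is preserved, whereas SFT degrades substantially. This suggests that OPD is not simply a midpoint between the two but has a movement logic of its own. It is not something you can put to use tomorrow, but if you are tuning a model training strategy, the finding may help you understand why OPD can sometimes be more stable and less resource-hungry.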
📄 Original abstract
On-policy distillation (OPD) is increasingly used to improve large language model reasoning, but its training dynamics remain poorly understood. We characterize the trajectory of OPD updates in parameter space and compare it with supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR). A suite of parameter-space diagnostics consistently places OPD in a relaxed off-principal regime: compared with SFT, its updates affect fewer weights and avoid principal directions more strongly, while compared with RLVR, they remain less tightly constrained. Beyond this static localization, OPD exhibits subspace locking: its cumulative updates rapidly enter a narrow low-dimensional channel. Constraining training to the update subspace formed early in training preserves OPD performance but substantially degrades SFT, indicating that the locked subspace is functionally sufficient for OPD. Control experiments further show that sparsifying the update tokens and shifting rollout generation off-policy preserve the rank dynamics, whereas mixing the OPD objective with RLVR changes them. Overall, these results suggest that OPD is not merely an intermediate point between SFT and RLVR, but induces its own update geometry in parameter space.
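To make the geometric vocabulary in the abstract concrete, here is a minimal NumPy sketch of the kinds of quantities it talks about, computed for a single weight matrix: how many weights an update leaves untouched, how much of the update lies along the principal directions of the original weights, how low-rank the cumulative update is, and what it means to constrain an update to a subspace formed early in training. This is an illustration under stated assumptions, not the paper's diagnostic suite: the function names, the tolerance `tol`, the 99% energy cutoff, and the choices of `k` and `r` are all hypothetical.

```python
import numpy as np


def update_sparsity(w_before, w_after, tol=1e-6):
    """Fraction of weights whose absolute change is at most `tol` (assumed threshold)."""
    delta = w_after - w_before
    return float(np.mean(np.abs(delta) <= tol))


def principal_energy(w_before, delta, k):
    """Share of the update's squared Frobenius norm that lies in the
    top-k left/right singular subspaces of the original weights.
    A low value means the update avoids the principal directions.
    Assumes `delta` is not all zeros."""
    u, _, vt = np.linalg.svd(w_before, full_matrices=False)
    uk, vk = u[:, :k], vt[:k, :].T
    projected = uk @ (uk.T @ delta @ vk) @ vk.T
    return float(np.sum(projected**2) / np.sum(delta**2))


def effective_rank(delta, energy=0.99):
    """Smallest number of singular values capturing `energy` of the
    squared spectrum of the cumulative update (assumed cutoff)."""
    s = np.linalg.svd(delta, compute_uv=False)
    cumulative = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(cumulative, energy) + 1)


def project_onto_early_subspace(early_delta, update, r):
    """Constrain `update` to the span of the top-r left singular vectors
    of an early cumulative update (a simple stand-in for 'training inside
    the early subspace')."""
    u, _, _ = np.linalg.svd(early_delta, full_matrices=False)
    ur = u[:, :r]
    return ur @ (ur.T @ update)
```

In this toy framing, "subspace locking" would show up as `effective_rank` of the cumulative update staying small as training proceeds, and as later updates losing little of their norm when passed through `project_onto_early_subspace`. How the paper actually defines and aggregates these measurements across layers is not stated in the abstract.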