📄 Paper digest

New finding in AI training: the geometric secret of on-policy distillation

Trusted Channel | Tags: on-policy distillation, parameter space, training dynamics, large models

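One way to train large models is on-policy distillation (OPD): the student model generates its own answers and then learns from a teacher's feedback on them. It works well, but how it actually moves through parameter space has been poorly understood. This paper looks at the question geometrically. Compared with ordinary supervised fine-tuning (SFT), OPD changes fewer weights and steers away from the principal directions more strongly; compared with reinforcement learning with verifiable rewards (RLVR), it is less tightly constrained. More importantly, its cumulative updates quickly lock into a narrow low-dimensional channel. If training is restricted to the subspace formed by the early updates, OPD performance is preserved, whereas SFT degrades substantially under the same restriction. The takeaway: OPD is not just a midpoint between SFT and RLVR; it has its own characteristic update geometry. You probably won't use this tomorrow, but understanding it may help in designing more efficient training strategies.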
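To make the "locked channel" idea concrete, here is a minimal NumPy sketch, not the paper's actual procedure (the abstract does not spell that out). It assumes the channel for a single weight matrix can be approximated by the top left singular directions of the early cumulative update, and it measures how much of a later update falls inside that span. The function names (`top_left_subspace`, `project_update`, `captured_energy`), the matrix sizes, and the synthetic data are all hypothetical and chosen only for illustration.

```python
import numpy as np


def top_left_subspace(delta_early: np.ndarray, rank: int) -> np.ndarray:
    """Orthonormal basis (as columns) for the top-`rank` left singular directions."""
    u, _, _ = np.linalg.svd(delta_early, full_matrices=False)
    return u[:, :rank]


def project_update(update: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Project `update` onto the column span of `basis`."""
    return basis @ (basis.T @ update)


def captured_energy(update: np.ndarray, basis: np.ndarray) -> float:
    """Fraction of the update's squared Frobenius norm that lies in the subspace."""
    total = float(np.sum(update ** 2))
    if total == 0.0:
        return 0.0
    return float(np.sum(project_update(update, basis) ** 2)) / total


# Synthetic toy data (an assumption for illustration, not results from the paper).
rng = np.random.default_rng(0)
d_out, d_in, rank = 64, 32, 4

# A "locked" trajectory: early and later updates share one rank-4 column space.
channel = np.linalg.qr(rng.standard_normal((d_out, rank)))[0]
delta_early = channel @ rng.standard_normal((rank, d_in))
later_locked = channel @ rng.standard_normal((rank, d_in))

# An unconstrained update, for contrast.
later_free = rng.standard_normal((d_out, d_in))

basis = top_left_subspace(delta_early, rank)
locked_fraction = captured_energy(later_locked, basis)  # close to 1
free_fraction = captured_energy(later_free, basis)      # well below 1

print(f"locked: {locked_fraction:.3f}, free: {free_fraction:.3f}")
```

In a constrained-training experiment of the kind the digest describes, `project_update` would be applied to each step's update so that training cannot leave the early subspace. Whether the paper projects updates, gradients, or something else is not stated in the abstract, so treat that mapping as an assumption.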

📄 Original abstract (English)

On-policy distillation (OPD) is increasingly used to improve large language model reasoning, but its training dynamics remain poorly understood. We characterize the trajectory of OPD updates in parameter space and compare it with supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR). A suite of parameter-space diagnostics consistently places OPD in a relaxed off-principal regime: compared with SFT, its updates affect fewer weights and avoid principal directions more strongly, while compared with RLVR, they remain less tightly constrained. Beyond this static localization, OPD exhibits subspace locking: its cumulative updates rapidly enter a narrow low-dimensional channel. Constraining training to the update subspace formed early in training preserves OPD performance but substantially degrades SFT, indicating that the locked subspace is functionally sufficient for OPD. Control experiments further show that sparsifying the update tokens and shifting rollout generation off-policy preserve the rank dynamics, whereas mixing the OPD objective with RLVR changes them. Overall, these results suggest that OPD is not merely an intermediate point between SFT and RLVR, but induces its own update geometry in parameter space.
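The two static claims in the abstract (fewer weights affected, principal directions avoided) can be illustrated with two simple per-matrix measurements. The sketch below is an assumption about what such diagnostics might look like, not the paper's definitions: `update_density` counts entries that changed by more than a tolerance, and `principal_energy` is a one-sided simplification that uses only the left singular vectors of the original weight. Both names, the tolerance, and the toy matrices are hypothetical.

```python
import numpy as np


def update_density(w_before: np.ndarray, w_after: np.ndarray, tol: float = 1e-8) -> float:
    """Fraction of entries whose value changed by more than `tol`."""
    return float(np.mean(np.abs(w_after - w_before) > tol))


def principal_energy(w_before: np.ndarray, delta: np.ndarray, k: int) -> float:
    """Fraction of the update's squared Frobenius norm inside the span of the
    top-k left singular vectors of the original weight matrix."""
    u, _, _ = np.linalg.svd(w_before, full_matrices=False)
    top = u[:, :k]
    total = float(np.sum(delta ** 2))
    if total == 0.0:
        return 0.0
    return float(np.sum((top.T @ delta) ** 2)) / total


# Synthetic toy data (an assumption for illustration, not results from the paper).
rng = np.random.default_rng(0)
w = rng.standard_normal((16, 8))
u, _, _ = np.linalg.svd(w, full_matrices=False)
row = rng.standard_normal(8)

delta_on = np.outer(u[:, 0], row)    # update along the top principal direction
delta_off = np.outer(u[:, -1], row)  # update along the weakest retained direction

on_fraction = principal_energy(w, delta_on, k=1)    # close to 1
off_fraction = principal_energy(w, delta_off, k=1)  # close to 0

# A sparse update that touches only 2 of the 128 entries.
w_sparse = w.copy()
w_sparse[0, 0] += 0.5
w_sparse[3, 5] -= 0.5
density = update_density(w, w_sparse)

print(f"on: {on_fraction:.3f}, off: {off_fraction:.3f}, density: {density:.4f}")
```

Read in the abstract's terms, an off-principal update would show low `principal_energy` and a localized one would show low `update_density`; where OPD, SFT, and RLVR actually fall on such measures is what the paper reports, and nothing here reproduces those numbers.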

Original paper on arXiv
