AI Pulse
📄 Paper Digest

RL Training for Large Models: From Hard Cutoffs to Soft Adjustment, and More Stable for It

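Reinforcement learning (RL) post-training of large language models is often unstable because the data is off-policy: the policy that generated the samples differs from the one being updated, whether through training-inference mismatch or policy staleness. Mainstream methods such as PPO limit the update size by clipping the importance ratio, but with a long-tailed vocabulary that ratio is a poor proxy for how far the distribution has actually moved. DPPO instead draws a hard boundary on the sampled token's absolute probability shift and discards the gradient once a token crosses it in a harmful direction. That is like a car that stalls the moment it touches the lane line: the chance to steer back is wasted. The new method, DRPO, replaces the hard boundary with a smooth, advantage-weighted quadratic regularizer: past the boundary the gradient does not vanish but is attenuated continuously and becomes a corrective signal. The authors report that DRPO trains more stably and efficiently across model scales, architectures, and precision settings. It is not something you can use tomorrow, but it explains why RL training tends to collapse and how smarter math can avoid it.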
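To make the long-tail point concrete, here is a small worked example comparing the importance ratio with the absolute probability shift for one rare and one common token. The probabilities and the thresholds `eps=0.2` and `delta=0.05` are illustrative assumptions, not values from the paper.

```python
def ratio_and_shift(p_old, p_new):
    """Return (importance ratio, absolute probability shift) for one sampled token."""
    return p_new / p_old, abs(p_new - p_old)


def inside_ratio_clip(p_old, p_new, eps=0.2):
    """True if the ratio lies in [1 - eps, 1 + eps] (PPO-style clip range; eps is assumed)."""
    ratio, _ = ratio_and_shift(p_old, p_new)
    return 1.0 - eps <= ratio <= 1.0 + eps


def inside_shift_bound(p_old, p_new, delta=0.05):
    """True if the absolute probability shift is at most delta (delta is assumed)."""
    _, shift = ratio_and_shift(p_old, p_new)
    return shift <= delta


# Hypothetical tokens: (probability under old policy, probability under new policy).
rare_token = (0.001, 0.003)    # ratio is about 3, yet the distribution barely moved
common_token = (0.50, 0.58)    # ratio is only about 1.16, yet 8 points of mass moved

for name, (p_old, p_new) in [("rare", rare_token), ("common", common_token)]:
    ratio, shift = ratio_and_shift(p_old, p_new)
    print(name, "ratio:", round(ratio, 3), "shift:", round(shift, 3),
          "in ratio clip:", inside_ratio_clip(p_old, p_new),
          "in shift bound:", inside_shift_bound(p_old, p_new))
```

Under these assumed numbers, the ratio test flags the rare token even though almost no probability mass moved, while it lets the common token through despite a much larger shift. The shift-based test makes the opposite call in both cases.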

📄 Original Abstract (English)

Reinforcement learning (RL) has become a key component of post-training large language models (LLMs). In practice, LLM RL is often off-policy because of training-inference mismatch and policy staleness, making trust-region control essential for stable optimization. Mainstream methods such as PPO and GRPO approximate this control with a ratio-clipping mechanism, but the importance ratio can be a poor proxy for distributional shift in long-tailed vocabularies. Recent work such as DPPO addresses this mismatch by replacing ratio-based clipping with a divergence-based mask, yielding a trust region defined by the sampled token's absolute probability shift. However, DPPO still relies on a hard mask: once a token crosses the trust-region boundary in a harmful direction, its gradient is discarded rather than corrected. To address this, we propose Divergence Regularized Policy Optimization (DRPO), which replaces the hard mask with a smooth advantage-weighted quadratic regularizer on policy shift. DRPO preserves the same trust-region geometry as DPPO while inducing bounded, continuous gradient weights that attenuate diverging updates and provide corrective signals beyond the boundary. Experiments across model scales, architectures, and precision settings show that DRPO improves the stability and efficiency of LLM RL training.
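The contrast between a hard mask and a smooth weight can be sketched for a single token. The abstract does not give DRPO's exact formula, so the quadratic form below is an assumption chosen only to match the properties it describes: weight 1 with no shift, weight 0 at the boundary, and a negative (corrective) weight beyond it. The function names, the threshold `delta`, and the reading of "harmful direction" as "the direction the advantage already pushes" are all hypothetical.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def signed_drift(p_old, p_new, advantage):
    """Probability shift measured along the direction the advantage pushes (assumed convention)."""
    return sign(advantage) * (p_new - p_old)


def hard_mask_weight(p_old, p_new, advantage, delta=0.05):
    """Hard-mask sketch: drop the gradient once the drift exceeds delta."""
    return 0.0 if signed_drift(p_old, p_new, advantage) > delta else 1.0


def smooth_weight(p_old, p_new, advantage, delta=0.05):
    """Smooth sketch: derivative of (drift - drift**2 / (2 * delta)) with respect to drift.

    This is an illustrative quadratic penalty, not the paper's stated objective.
    The weight falls linearly from 1, reaches 0 at drift == delta, and turns
    negative beyond it, which pushes the token back toward the boundary.
    """
    return 1.0 - signed_drift(p_old, p_new, advantage) / delta


# Hypothetical token with positive advantage whose probability keeps rising from 0.50.
for p_new in (0.50, 0.52, 0.55, 0.60):
    print(p_new,
          "hard:", hard_mask_weight(0.50, p_new, advantage=1.0),
          "smooth:", round(smooth_weight(0.50, p_new, advantage=1.0), 3))
```

Because the probability shift is confined to [-1, 1], the smooth weight in this sketch stays bounded for any fixed `delta`, which fits the abstract's description of bounded, continuous gradient weights.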

Original paper on arXiv
