📄 Paper Digest

LLM RL Training: From Hard Clipping to Soft Correction, and More Stable


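Reinforcement learning (RL) for large language models commonly relies on methods such as PPO, but these carry a hidden weakness. When an update shifts a sampled token's probability too far in a harmful direction, the conventional remedy is to discard that token's gradient outright (a hard clip or hard mask), which amounts to "you got it wrong, so say nothing." The new method, DRPO, takes a different approach: rather than discarding the gradient, it discounts it according to the size of the shift, telling the model "you have drifted, come back gradually." The paper's experiments, which span model scales, architectures, and precision settings, show that this soft correction makes training more stable and more efficient. It is not a tool you can pick up tomorrow, but it explains why RL training tends to collapse and how to address that more gracefully.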
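To make the contrast between discarding a gradient and discounting it concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's definition: the abstract only says that DRPO uses an "advantage-weighted quadratic regularizer on policy shift" and does not give the formula. The surrogate `A*d - |A|*d**2 / (2*delta)`, the threshold name `delta`, and all function names are hypothetical choices made for this example.

```python
# Illustrative sketch only. The formulas and names below are assumptions made
# for this example; they are not taken from the DRPO or DPPO papers.


def signed_shift(p_new: float, p_old: float, advantage: float) -> float:
    """Probability shift of the sampled token, measured along the direction
    in which the advantage pushes the policy (positive = update direction)."""
    shift = p_new - p_old
    return shift if advantage >= 0 else -shift


def hard_mask_weight(p_new: float, p_old: float, advantage: float, delta: float) -> float:
    """Hard mask: keep the gradient (1.0) inside the trust region and
    drop it entirely (0.0) once the shift exceeds delta in the update direction."""
    return 0.0 if signed_shift(p_new, p_old, advantage) > delta else 1.0


def soft_weight(p_new: float, p_old: float, advantage: float, delta: float) -> float:
    """Gradient weight implied by the hypothetical surrogate
        A*d - |A| * d**2 / (2*delta),   with d = p_new - p_old.
    Differentiating with respect to d gives A * (1 - sign(A)*d/delta), so the
    weight falls linearly, reaches 0 at the boundary, and turns negative
    (a corrective pull back) beyond it."""
    return 1.0 - signed_shift(p_new, p_old, advantage) / delta


if __name__ == "__main__":
    delta = 0.2  # assumed trust-region radius on the absolute probability shift
    p_old = 0.2
    for p_new in (0.2, 0.3, 0.4, 0.5):
        hard = hard_mask_weight(p_new, p_old, advantage=1.0, delta=delta)
        soft = soft_weight(p_new, p_old, advantage=1.0, delta=delta)
        print(f"p_new={p_new:.1f}  hard={hard:.1f}  soft={soft:+.2f}")
```

In this sketch both schemes place the boundary at the same shift `delta`, but the hard mask jumps from 1 to 0 there while the soft weight passes through 0 continuously. Because a probability shift lies in [-1, 1], the soft weight stays within [1 - 1/delta, 1 + 1/delta], which is one way to read the abstract's "bounded, continuous gradient weights."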

📄 Original Abstract

Reinforcement learning (RL) has become a key component of post-training large language models (LLMs). In practice, LLM RL is often off-policy because of training-inference mismatch and policy staleness, making trust-region control essential for stable optimization. Mainstream methods such as PPO and GRPO approximate this control with a ratio-clipping mechanism, but the importance ratio can be a poor proxy for distributional shift in long-tailed vocabularies. Recent work such as DPPO addresses this mismatch by replacing ratio-based clipping with a divergence-based mask, yielding a trust region defined by the sampled token's absolute probability shift. However, DPPO still relies on a hard mask: once a token crosses the trust-region boundary in a harmful direction, its gradient is discarded rather than corrected. To address this, we propose Divergence Regularized Policy Optimization (DRPO), which replaces the hard mask with a smooth advantage-weighted quadratic regularizer on policy shift. DRPO preserves the same trust-region geometry as DPPO while inducing bounded, continuous gradient weights that attenuate diverging updates and provide corrective signals beyond the boundary. Experiments across model scales, architectures, and precision settings show that DRPO improves the stability and efficiency of LLM RL training.

Original paper on arXiv
