A new trick for training large models: give every token its own "error allowance"
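When training large models, we usually use a "reward" to steer them toward correct answers. But existing methods treat every token identically: each one is held to the same limit on how far it may deviate. That ignores generation order. An error in an early token throws off everything that follows, while an error near the end does far less damage. The new method, CPPO, gives each token a different "error allowance": tight for early tokens and loose for late ones. It also keeps a running total of the deviation already accumulated along the prefix and dynamically adjusts how much the remaining tokens may still deviate. According to the paper's experiments, this finer-grained control makes training more stable and significantly improves reasoning accuracy. It is not a tool you can use tomorrow, but it exposes a key blind spot in how reinforcement learning is used to train large models.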
📄 Original abstract (English)
Reinforcement learning with verifiable rewards (RLVR) has become standard for improving LLM reasoning. However, existing PPO-style trust-region mechanisms remain position-agnostic by enforcing uniform thresholds across all tokens independently. This pointwise treatment conflicts with autoregressive generation in two critical ways. First, uniform thresholds ignore autoregressive asymmetry. Early-stage deviations produce compounding sequence-level drift, causing static thresholds to under-regulate early divergence and excessively constrain late-stage exploration. Second, evaluating token-level divergence in isolation overlooks cumulative prefix drift, granting the same divergence allowance regardless of how far the conditioning history has already deviated from the rollout policy. To address this limitation, we propose CPPO (Cumulative Prefix-divergence Policy Optimization), a token-level masking rule that aligns updates with a finite-horizon policy-improvement bound via two coupled mechanisms. First, a position-weighted threshold imposes stricter limits at early positions whose effects persist longer, relaxing constraints for late-stage tokens. Second, a cumulative prefix budget tracks historical deviations, dynamically restricting further token-level deviation to prevent compounding errors along the prefix. Empirically, CPPO enhances training stability and significantly improves reasoning accuracy across various model scales.
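To make the two mechanisms in the abstract concrete, here is a minimal Python sketch of a position-aware, prefix-aware token mask. It is not the paper's formula, which the abstract does not give. The function names, the linear threshold schedule, the use of the absolute log-probability gap as the per-token divergence, and every default value below are assumptions chosen purely for illustration.

```python
from typing import List, Sequence


def position_thresholds(length: int, eps_early: float, eps_late: float) -> List[float]:
    """Assumed schedule: interpolate linearly from a tight early threshold
    to a looser late one. The paper's actual weighting may differ."""
    if length == 1:
        return [eps_early]
    step = (eps_late - eps_early) / (length - 1)
    return [eps_early + step * t for t in range(length)]


def prefix_aware_mask(
    logp_new: Sequence[float],
    logp_old: Sequence[float],
    eps_early: float = 0.1,
    eps_late: float = 0.3,
    prefix_budget: float = 0.5,
) -> List[int]:
    """Return a 0/1 mask per token: 1 keeps the token in the update, 0 drops it.

    Hypothetical rule: a token is kept only if its divergence is within both
    (a) its position-dependent threshold and (b) whatever remains of a
    cumulative budget after the drift of all earlier tokens is subtracted.
    """
    thresholds = position_thresholds(len(logp_new), eps_early, eps_late)
    mask = []
    prefix_drift = 0.0  # total divergence accumulated over earlier positions
    for new, old, eps in zip(logp_new, logp_old, thresholds):
        divergence = abs(new - old)  # assumed per-token divergence measure
        remaining = max(prefix_budget - prefix_drift, 0.0)
        allowed = min(eps, remaining)
        mask.append(1 if divergence <= allowed else 0)
        prefix_drift += divergence  # drift counts whether or not the token was kept
    return mask


# Same 0.2 deviation at position 0 and position 2: masked early, kept late.
old = [-1.0, -1.0, -1.0, -1.0]
new = [-1.2, -1.0, -1.2, -1.0]
print(prefix_aware_mask(new, old))  # [0, 1, 1, 1]
```

In this sketch the early token is dropped because its threshold is tight, while an identical deviation later in the sequence passes. Shrinking `prefix_budget` makes late tokens fail as well once earlier positions have used up the allowance, which is the "cumulative" half of the idea.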