📄 Paper Digest

RL Training for Large Models: Early Mistakes Cost More, So a New Method Adjusts the Penalty by Position

reinforcement learning · large models · training stability · token-level constraints · cumulative deviation

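When large models are trained with reinforcement learning, existing methods cap how much each token may change in the same way, regardless of its position. But large models generate text token by token: a mistake at the start pushes everything after it further off course. This paper proposes CPPO, which applies stricter limits to early tokens and tracks the deviation accumulated so far to dynamically adjust how much later tokens are allowed to change. According to the paper's experiments, adjusting by position and cumulative deviation improves training stability and reasoning accuracy across several model scales. It is not a tool you can pick up and use tomorrow, but it exposes a key asymmetry in RL training that has been overlooked.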

📄 Original Abstract (English)

Reinforcement learning with verifiable rewards (RLVR) has become standard for improving LLM reasoning. However, existing PPO-style trust-region mechanisms remain position-agnostic by enforcing uniform thresholds across all tokens independently. This pointwise treatment conflicts with autoregressive generation in two critical ways. First, uniform thresholds ignore autoregressive asymmetry. Early-stage deviations produce compounding sequence-level drift, causing static thresholds to under-regulate early divergence and excessively constrain late-stage exploration. Second, evaluating token-level divergence in isolation overlooks cumulative prefix drift, granting the same divergence allowance regardless of how far the conditioning history has already deviated from the rollout policy. To address this limitation, we propose CPPO (Cumulative Prefix-divergence Policy Optimization), a token-level masking rule that aligns updates with a finite-horizon policy-improvement bound via two coupled mechanisms. First, a position-weighted threshold imposes stricter limits at early positions whose effects persist longer, relaxing constraints for late-stage tokens. Second, a cumulative prefix budget tracks historical deviations, dynamically restricting further token-level deviation to prevent compounding errors along the prefix. Empirically, CPPO enhances training stability and significantly improves reasoning accuracy across various model scales.
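To make the two coupled mechanisms concrete, here is a minimal Python sketch of a token-level mask that combines a position-weighted threshold with a cumulative prefix budget. The abstract does not give the paper's formulas, so everything specific here is an assumption made for illustration: the linear threshold schedule, the use of the absolute log-probability gap as the per-token divergence, and the names `position_thresholds`, `cppo_style_mask`, `eps_early`, `eps_late`, and `prefix_budget`. It is not the authors' implementation.

```python
from typing import List


def position_thresholds(length: int, eps_early: float, eps_late: float) -> List[float]:
    """Hypothetical schedule: interpolate linearly from a strict early
    threshold (eps_early) to a looser late threshold (eps_late)."""
    if length == 1:
        return [eps_early]
    step = (eps_late - eps_early) / (length - 1)
    return [eps_early + step * t for t in range(length)]


def cppo_style_mask(
    logp_new: List[float],
    logp_old: List[float],
    eps_early: float = 0.1,
    eps_late: float = 0.3,
    prefix_budget: float = 0.5,
) -> List[int]:
    """Return 1 for tokens kept in the policy update and 0 for masked tokens.

    Assumption: per-token divergence is |log pi_new - log pi_old|, and the
    prefix drift is the running sum of those divergences over earlier tokens.
    """
    thresholds = position_thresholds(len(logp_new), eps_early, eps_late)
    mask = []
    prefix_drift = 0.0  # divergence already accumulated by the prefix
    for new, old, eps in zip(logp_new, logp_old, thresholds):
        divergence = abs(new - old)
        # What is left of the budget after the prefix has spent its share.
        remaining = max(prefix_budget - prefix_drift, 0.0)
        # A token must satisfy both its positional limit and the budget.
        allowed = min(eps, remaining)
        mask.append(1 if divergence <= allowed else 0)
        # The prefix drifts whether or not this token was masked.
        prefix_drift += divergence
    return mask


logp_old = [-1.0, -1.0, -1.0, -1.0]
logp_new = [-1.2, -1.05, -1.2, -1.25]
print(cppo_style_mask(logp_new, logp_old))  # [0, 1, 1, 0]
```

In this toy run the same divergence of 0.2 is masked at position 0 but kept at position 2, which mirrors the stricter-early, looser-late idea. The last token stays under its positional threshold of 0.3 yet is still masked, because the prefix has already used 0.45 of the 0.5 budget.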

Original paper on arXiv
