📄 Paper Digest

AI image generation stops "forgetting its roots": a new method keeps models stable as they learn

Tags: AI image generation, reinforcement learning, training stability, KL divergence, multi-objective optimization

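A common problem when training AI image generators is that learning something new erases old skills. Existing methods use "probability-ratio clipping" to keep the model from drifting, but the authors find that this is like measuring distance with an inaccurate ruler: sometimes too tight, sometimes too loose. They propose Flow-DPPO, which directly computes the KL divergence between the old and new models (a measure of how far apart two distributions are, which can be computed exactly here because each denoising step's policy is Gaussian) and blocks an update only when it both pushes the model further away and exceeds the divergence threshold. Experiments show that the new method lets the model learn new styles without forgetting old ones, balances multiple objectives better, and stays stable across multiple training epochs where ratio clipping degrades. This is not a tool you can use tomorrow, but it makes the quality of AI-generated images and video more reliable and brings models a step closer to "getting stronger the more they learn."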
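The "inaccurate ruler" comparison can be made concrete with a toy experiment. The sketch below is not taken from the paper: it assumes a one-dimensional Gaussian policy whose mean shifts slightly while its standard deviation stays fixed, and it uses a hypothetical clip range `eps = 0.05`. It compares the exact KL divergence, which is a single fixed number, with the per-sample probability ratio that a clipping rule would look at.

```python
import math
import random


def gaussian_logpdf(x, mean, std):
    """Log density of N(mean, std^2) at x."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def exact_kl_same_std(mean_old, mean_new, std):
    """KL(N(mean_old, std^2) || N(mean_new, std^2)) in closed form."""
    return (mean_old - mean_new) ** 2 / (2 * std ** 2)


def single_sample_log_ratios(mean_old, mean_new, std, n, seed=0):
    """log(p_new(x) / p_old(x)) for n samples x drawn from the old policy."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(n):
        x = rng.gauss(mean_old, std)
        ratios.append(gaussian_logpdf(x, mean_new, std) - gaussian_logpdf(x, mean_old, std))
    return ratios


# Toy numbers (assumptions for illustration only).
mean_old, mean_new, std = 0.0, 0.1, 1.0
eps = 0.05  # hypothetical clip range, not a value from the paper

kl = exact_kl_same_std(mean_old, mean_new, std)  # 0.005 for every sample
log_ratios = single_sample_log_ratios(mean_old, mean_new, std, n=10000)

# Averaged over many samples, -log(ratio) recovers the KL divergence...
mean_neg_log_ratio = -sum(log_ratios) / len(log_ratios)

# ...but each individual ratio is noisy, so many samples land outside the clip range
# even though the true policy shift is identical for all of them.
outside_clip = sum(1 for lr in log_ratios if abs(math.exp(lr) - 1.0) > eps) / len(log_ratios)

print(f"exact KL: {kl:.4f}")
print(f"mean of -log ratio: {mean_neg_log_ratio:.4f}")
print(f"fraction of samples outside the clip range: {outside_clip:.2f}")
```

In this toy setting the true divergence is the same for every sample, yet a majority of individual ratios fall outside the clip range while the rest pass freely. That is one way to picture a constraint that is too tight in some places and too loose in others.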

📄 Original abstract

Recent work has demonstrated that online reinforcement learning (RL) can substantially improve the quality and alignment of flow matching models for image and video generation. Methods such as Flow-GRPO and CPS cast the denoising process as a Markov Decision Process and apply PPO-style ratio clipping to enforce a trust region. However, we argue that ratio clipping is structurally ill-suited for flow models: the probability ratio between new and old policies is a noisy, single-sample estimate of the true policy divergence, leading to over-constraining in some regions of the trajectory and under-constraining in others. We propose Flow-DPPO (Flow Divergence Proximal Policy Optimization), which replaces ratio clipping with a divergence proximal constraint. A key observation is that the per-step policy in flow models is Gaussian, enabling exact and cheap computation of the KL divergence between old and new policies. Flow-DPPO employs an asymmetric divergence mask that blocks gradient updates only when they simultaneously move away from the trusted region and violate the divergence threshold. Experiments show that Flow-DPPO achieves higher rewards with better KL-proximal efficiency, alleviates catastrophic forgetting, promotes balanced multi-objective optimization, and enables stable multi-epoch training where ratio clipping degrades. Code and models are available at https://github.com/Tencent-Hunyuan/UniRL/tree/main/FlowDPPO.
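The two ingredients named in the abstract, a closed-form Gaussian KL and an asymmetric divergence mask, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes the old and new per-step policies share the same standard deviation (so only the means differ), and it interprets "moving away from the trusted region" in the PPO sense: the advantage and the log ratio have the same sign. The function names and the `kl_threshold` parameter are hypothetical.

```python
def gaussian_kl_shared_std(mean_old, mean_new, std):
    """KL(N(mean_old, std^2 I) || N(mean_new, std^2 I)) for mean vectors given as lists.

    Assumption: both policies use the same isotropic std, so the KL reduces to
    the squared distance between the means divided by 2 * std^2.
    """
    squared_distance = sum((a - b) ** 2 for a, b in zip(mean_old, mean_new))
    return squared_distance / (2.0 * std ** 2)


def keep_gradient(kl, log_ratio, advantage, kl_threshold):
    """Asymmetric mask (assumed form): return False only when BOTH conditions hold.

    1. The update would push the policy further from the old one:
       positive advantage with ratio already above 1, or
       negative advantage with ratio already below 1.
    2. The exact KL already exceeds the threshold.
    """
    moving_away = (advantage > 0 and log_ratio > 0) or (advantage < 0 and log_ratio < 0)
    return not (moving_away and kl > kl_threshold)


# Toy step: the new policy's mean has drifted from the old one.
kl = gaussian_kl_shared_std([0.0, 0.0], [0.3, 0.4], std=0.5)

# Same KL, different decisions depending on the direction of the update.
blocked_case = keep_gradient(kl, log_ratio=0.2, advantage=1.0, kl_threshold=0.1)
allowed_case = keep_gradient(kl, log_ratio=-0.2, advantage=1.0, kl_threshold=0.1)
print(kl, blocked_case, allowed_case)
```

The point of the asymmetry in this sketch is that a sample whose update would pull the policy back toward the old one still contributes a gradient, even when the divergence is already over the threshold.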

Original paper on arXiv
