AI Pulse
📄 Paper Explained

AI Image Generation Stops "Forgetting Its Roots": A New Method Keeps Models Stable as They Learn

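A common problem when training AI image generators is "catastrophic forgetting": the model picks up a new style and loses skills it already had. Existing methods such as Flow-GRPO and CPS constrain each update with PPO-style probability-ratio clipping, but the ratio is a noisy, single-sample estimate of how far the policy has actually moved. It works like an inaccurate ruler: too tight in some parts of the trajectory and too loose in others. This paper observes that in flow matching models (a widely used approach to image and video generation), the per-step policy is Gaussian, so the difference between the old and new policies (the KL divergence) can be computed exactly and cheaply. The authors replace ratio clipping with this exact value and design an "asymmetric mask" that blocks a gradient update only when it both moves away from the trusted region and violates the divergence threshold; every other update passes through. The reported experiments show higher rewards, less catastrophic forgetting, more balanced optimization across multiple objectives (such as quality and alignment), and stable multi-epoch training in settings where ratio clipping degrades. It is not a tool you can pick up and use tomorrow, but it offers a more reliable underlying mechanism for training generative models.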
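To make the "inaccurate ruler" point concrete, here is a minimal NumPy sketch, not the paper's code. It assumes the old and new per-step policies are Gaussians with different means and one shared isotropic standard deviation `sigma`; the function names and the toy numbers are hypothetical. Under that assumption the KL divergence has a closed form, while the log probability ratio of a single sample scatters widely around it.

```python
import numpy as np

def gaussian_kl(mu_old, mu_new, sigma):
    """KL(old || new) for two Gaussians that share the isotropic std `sigma` (assumption)."""
    diff = np.asarray(mu_old, dtype=float) - np.asarray(mu_new, dtype=float)
    return float(np.sum(diff ** 2) / (2.0 * sigma ** 2))

def single_sample_log_ratio(x, mu_old, mu_new, sigma):
    """log pi_new(x) - log pi_old(x) for one sample x; the normalizing constants cancel."""
    x = np.asarray(x, dtype=float)
    d_old = np.sum((x - mu_old) ** 2)
    d_new = np.sum((x - mu_new) ** 2)
    return float((d_old - d_new) / (2.0 * sigma ** 2))

# Toy setup (hypothetical numbers): a 4-dimensional step with a small mean shift.
rng = np.random.default_rng(0)
mu_old = np.zeros(4)
mu_new = np.full(4, 0.1)
sigma = 0.5

exact_kl = gaussian_kl(mu_old, mu_new, sigma)

# Draw samples from the old policy and evaluate the per-sample log ratio.
samples = mu_old + sigma * rng.standard_normal((10000, 4))
log_ratios = np.array(
    [single_sample_log_ratio(x, mu_old, mu_new, sigma) for x in samples]
)

print("exact KL:", exact_kl)
print("mean of -log ratio (Monte Carlo KL):", -log_ratios.mean())
print("std of a single-sample log ratio:", log_ratios.std())
```

Averaged over many samples, the negative log ratio approaches the exact KL, but any single sample can land far from it. That spread is the noise that a clipping rule applied to one sample has to live with.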

📄 Original Abstract

Recent work has demonstrated that online reinforcement learning (RL) can substantially improve the quality and alignment of flow matching models for image and video generation. Methods such as Flow-GRPO and CPS cast the denoising process as a Markov Decision Process and apply PPO-style ratio clipping to enforce a trust region. However, we argue that ratio clipping is structurally ill-suited for flow models: the probability ratio between new and old policies is a noisy, single-sample estimate of the true policy divergence, leading to over-constraining in some regions of the trajectory and under-constraining in others. We propose Flow-DPPO (Flow Divergence Proximal Policy Optimization), which replaces ratio clipping with a divergence proximal constraint. A key observation is that the per-step policy in flow models is Gaussian, enabling exact and cheap computation of the KL divergence between old and new policies. Flow-DPPO employs an asymmetric divergence mask that blocks gradient updates only when they simultaneously move away from the trusted region and violate the divergence threshold. Experiments show that Flow-DPPO achieves higher rewards with better KL-proximal efficiency, alleviates catastrophic forgetting, promotes balanced multi-objective optimization, and enables stable multi-epoch training where ratio clipping degrades. Code and models are available at https://github.com/Tencent-Hunyuan/UniRL/tree/main/FlowDPPO.
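The asymmetric divergence mask described in the abstract can be sketched as follows. This is one plausible reading, not the released Flow-DPPO implementation: it assumes that an update "moves away from the trusted region" when the advantage pushes the probability ratio further from 1 (positive advantage with ratio above 1, or negative advantage with ratio below 1), and that the per-step KL values have already been computed. The function names, the threshold `delta`, and the toy numbers are hypothetical.

```python
import numpy as np

def asymmetric_divergence_mask(ratio, advantage, kl, delta):
    """Return 0.0 where an update is blocked and 1.0 where it passes.

    An update is blocked only if BOTH conditions hold (assumed reading):
      1. it moves the policy further from the old one, and
      2. the exact KL divergence already exceeds the threshold `delta`.
    """
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    kl = np.asarray(kl, dtype=float)

    # Assumption: "moving away" means the gradient would push ratio further from 1.
    moving_away = ((advantage > 0) & (ratio > 1.0)) | ((advantage < 0) & (ratio < 1.0))
    violates = kl > delta
    return np.where(moving_away & violates, 0.0, 1.0)

def masked_surrogate(ratio, advantage, kl, delta):
    """Policy-gradient surrogate with blocked terms zeroed out.

    In an autodiff framework the mask would be treated as a constant.
    """
    mask = asymmetric_divergence_mask(ratio, advantage, kl, delta)
    return float(np.mean(mask * np.asarray(ratio) * np.asarray(advantage)))

# Toy batch (hypothetical numbers): four steps, all past the KL threshold.
ratio = np.array([1.3, 1.3, 0.7, 0.7])
advantage = np.array([1.0, -1.0, -1.0, 1.0])
kl = np.array([0.2, 0.2, 0.2, 0.2])

mask = asymmetric_divergence_mask(ratio, advantage, kl, delta=0.1)
print(mask)
```

In this toy batch only the first and third steps are blocked, since they alone would push the policy further out. The second and fourth steps pull the ratio back toward 1 and pass through even though their KL is above the threshold, which is the asymmetry the abstract refers to.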

Original paper on arXiv
