📄 Paper Digest

AI Image Generation Can Be Tuned with RL Too, and the Improvement Is Visible to the Naked Eye

Tags: reinforcement learning, AI art, image generation, image editing, Qwen

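AI image generation can now be tuned with reinforcement learning (RL). This paper adds an RL post-training pipeline to the Qwen-Image-2.0 model, aimed at two things: following instructions and producing better-looking images. The authors build several reward models by fine-tuning vision-language models. For text-to-image generation these score alignment with the prompt, aesthetics, and portrait fidelity; for image editing they score instruction-following accuracy and face identity preservation. Training uses a GRPO-based RL framework, with a hybrid classifier-free guidance (CFG) strategy to keep the model from forgetting its pre-trained knowledge. Finally, on-policy distillation merges the task-specialized models into a single all-round model. The results: the overall score on Qwen-Image-Bench rises by 2.61 to 57.84, and Elo ratings rise to 1193 (+78) in the text-to-image arena and 1349 (+93) in the image edit arena. This is not a tool you can pick up tomorrow, but it shows where AI image generation is heading, from "can draw" to "draws accurately and attractively": the model learns to optimize itself through RL instead of relying on manual parameter tuning.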
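To make the GRPO idea concrete, here is a minimal sketch of the group-relative advantage computation that GRPO is generally built on: several images are sampled for the same prompt, each gets a scalar reward, and each sample is scored relative to its own group. The weighted-average composite reward, the weight values, and the example scores are assumptions for illustration only; the summary does not say how the paper combines its reward dimensions.

```python
from statistics import mean, pstdev


def composite_reward(scores, weights):
    """Weighted average of per-dimension reward scores (assumed combination rule)."""
    total_weight = sum(weights[name] for name in scores)
    return sum(weights[name] * value for name, value in scores.items()) / total_weight


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every sample in the group scores the same.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical per-dimension weights; not taken from the paper.
WEIGHTS = {"alignment": 0.5, "aesthetics": 0.3, "portrait_fidelity": 0.2}

# Hypothetical reward-model scores for four images sampled from one prompt.
group_scores = [
    {"alignment": 0.9, "aesthetics": 0.6, "portrait_fidelity": 0.8},
    {"alignment": 0.4, "aesthetics": 0.7, "portrait_fidelity": 0.5},
    {"alignment": 0.7, "aesthetics": 0.9, "portrait_fidelity": 0.6},
    {"alignment": 0.2, "aesthetics": 0.3, "portrait_fidelity": 0.4},
]

rewards = [composite_reward(s, WEIGHTS) for s in group_scores]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.3f}  advantage={a:+.3f}")
```

Samples that beat their group average get a positive advantage and are reinforced; samples below it are pushed down. No separate value network is needed, which is the usual reason given for choosing GRPO.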

📄 Original Abstract (English)

We present Qwen-Image-2.0-RL, a post-training pipeline that applies reinforcement learning from human feedback (RLHF) and on-policy distillation (OPD) to improve both the visual quality and instruction-following capability of the Qwen-Image-2.0 diffusion model. To provide reliable reward signals, we construct task-specific composite reward models by fine-tuning vision-language models with a pointwise scoring paradigm and chain-of-thought reasoning. For text-to-image generation, the reward models cover alignment, aesthetics, and portrait fidelity dimensions. For image editing tasks, the reward system addresses instruction-following accuracy and face identity preservation. Building on this reward system, we develop a scalable GRPO-based RL training framework, incorporating a hybrid classifier-free guidance (CFG) strategy to preserve pre-trained knowledge, prompt curation via intra-group reward range filtering, and per-category reward weight calibration. To merge the task-specialized RL policies for T2I and editing, we propose on-policy distillation as the final training stage, which consolidates multiple teachers into a single student model through trajectory-level velocity matching. Extensive evaluation shows that Qwen-Image-2.0-RL achieves 57.84 overall score on Qwen-Image-Bench (+2.61 over the base model), Elo ratings of 1193 in text-to-image arena (+78) and 1349 in image edit arena (+93), demonstrating consistent gains in aesthetic quality, prompt adherence, and editing accuracy.
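The abstract mentions "prompt curation via intra-group reward range filtering" without giving details. One plausible reading, sketched below, is that a prompt is kept for RL training only if the rewards within its sampled group differ enough, since a group whose samples all score alike yields near-zero group-relative advantages. The function names, the threshold, and the example rewards are all hypothetical and not taken from the paper.

```python
def reward_range(rewards):
    """Spread between the best and worst reward within one prompt's sample group."""
    return max(rewards) - min(rewards)


def filter_prompts_by_reward_range(group_rewards, min_range=0.1):
    """Keep prompts whose intra-group reward range reaches min_range (assumed rule)."""
    return [
        prompt
        for prompt, rewards in group_rewards.items()
        if reward_range(rewards) >= min_range
    ]


# Hypothetical rewards for four samples per prompt.
group_rewards = {
    "a red cube on a blue sphere": [0.20, 0.85, 0.55, 0.40],          # mixed results
    "a photo of a cat": [0.91, 0.92, 0.90, 0.93],                     # already easy
    "a seven-fingered hand holding a pen": [0.05, 0.06, 0.05, 0.04],  # uniformly failed
}

kept = filter_prompts_by_reward_range(group_rewards, min_range=0.1)
print(kept)
```

Under this reading, prompts the model already handles uniformly well, or fails uniformly, are dropped because they carry little training signal.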

Original paper on arXiv
