📄 Paper Digest

AI Image Generation No Longer Needs Human Scores: The Model Can Judge Quality Itself

Tags: image generation, reinforcement learning, discriminator, unsupervised alignment, flow matching

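Today's image generators (Stable Diffusion, for example) show an odd pattern: they see enormous numbers of real images during training, yet their samples often have distorted structure, oversaturated colors, and similar flaws. The authors argue that this comes from a mismatch between the training loss and what people judge to be a good image: score- and flow-matching losses measure ℓ2 regression error on the score or velocity field, not perceptual or semantic quality. Until now the usual fix has been reinforcement learning on human preference scores, which are expensive and subjective. This paper proposes a neat alternative, Discriminator-Guided RL (DRL): use the feature space of a pretrained vision model (such as DINOv2) as the "referee", train a discriminator in that space to tell real images from the base model's samples, and use the discriminator's logit as the reward for KL-regularized RL fine-tuning of the generator. The gains are consistent across SiT, JiT, REPA, and RAE; on SiT, guidance-free FID (an image-quality metric, lower is better) drops from 9.38 to 2.62, and semantic-space FD measured with DINOv3 drops from 88.2 to 19.3. The key point is that the reward needs no human annotation. When the model is later fine-tuned on human preferences, it also reaches a better trade-off between preference reward and image fidelity, with fewer artifacts such as oversaturation and excessive brightness. It is not a tool you can pick up tomorrow, but it offers an automated route to making generated images look more like real ones.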
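To make the "discriminator logit as reward" idea concrete, here is a minimal sketch in Python. It is not the paper's implementation. It assumes the pretrained encoder is frozen and replaces its outputs with synthetic Gaussian features, uses a plain logistic-regression head as the discriminator, and leaves out the RL fine-tuning step entirely. The names `train_discriminator` and `reward` are hypothetical.

```python
import numpy as np

def train_discriminator(real_feats, fake_feats, steps=500, lr=0.1):
    """Fit a logistic-regression head on frozen features; returns (w, b)."""
    x = np.vstack([real_feats, fake_feats])
    # Label 1 for data features, 0 for base-model sample features.
    y = np.concatenate([np.ones(len(real_feats)), np.zeros(len(fake_feats))])
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        logits = x @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))
        grad = probs - y  # per-sample gradient of the BCE loss w.r.t. the logit
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def reward(feats, w, b):
    """Reward is the raw discriminator logit, not the sigmoid probability."""
    return feats @ w + b

# Synthetic stand-ins for encoder features (ASSUMPTION: a real pipeline would
# compute these by passing images through a frozen pretrained encoder).
rng = np.random.default_rng(0)
dim = 8
real_feats = rng.normal(loc=0.5, scale=1.0, size=(2000, dim))
fake_feats = rng.normal(loc=0.0, scale=1.0, size=(2000, dim))

w, b = train_discriminator(real_feats, fake_feats)
mean_real_reward = reward(real_feats, w, b).mean()
mean_fake_reward = reward(fake_feats, w, b).mean()
print(mean_real_reward > mean_fake_reward)
```

In this toy setup, samples that sit closer to the data distribution in feature space receive a higher logit, which is the signal an RL fine-tuning loop would then push the generator toward.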

📄 Original Abstract (English)

Score- and flow-matching models often rely on preference-based reinforcement learning for two purposes: aligning with subjective preferences and, surprisingly, recovering properties such as visual realism and coherent object structure that matching-based training is intended to learn from the data itself. We argue that this reflects a structural mismatch. Matching losses measure ℓ2 regression error on the velocity or score field under training-time marginals, a proxy poorly aligned with the visual and semantic properties that determine sample quality at inference. Given a reward aligned with these properties, RL sidesteps the mismatch by evaluating the model on its own samples and following the reward landscape directly. The challenge is to obtain such a reward without relying on human preferences, which are expensive and conflate data realism with annotator inclinations. We propose Discriminator-Guided RL (DRL). DRL trains a discriminator to separate data from base-model samples in a pretrained representation space and uses its logit as the reward in KL-regularized RL. The pretrained space restricts the discriminator to perceptually meaningful directions, and the logit estimates the log-likelihood ratio between data and model, which is the optimal reward for targeting the data distribution. Across SiT, JiT, REPA, and RAE, DRL reduces guidance-free FID (e.g., 9.38 to 2.62 on SiT) and semantic-space FD (e.g., 88.2 to 19.3 on DINOv3 for SiT), with consistent gains across all backbones, and improves human-preference rewards without training on them. It also yields a better Pareto frontier between preference reward and image fidelity under subsequent preference-based post-training, increasing alignment while reducing low-level artifacts such as oversaturation and excessive brightness.
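The abstract's claim that the logit is "the optimal reward for targeting the data distribution" can be unpacked with two standard results. The notation below is an assumption made for illustration, not taken from the paper: $p_{\mathrm{data}}$ is the data distribution, $p_{0}$ is the base model, $\beta$ is the KL weight, and the two classes are assumed balanced. In DRL the ratio is estimated in a pretrained representation space, so this is only the idealized picture.

```latex
% Bayes-optimal discriminator with balanced classes, and its logit:
D^{*}(x) = \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{data}}(x) + p_{0}(x)}
\quad\Longrightarrow\quad
\operatorname{logit} D^{*}(x) = \log \frac{p_{\mathrm{data}}(x)}{p_{0}(x)}

% KL-regularized RL objective and its closed-form optimum:
\max_{p}\; \mathbb{E}_{x \sim p}\,[\,r(x)\,] \;-\; \beta\, \mathrm{KL}\!\left(p \,\|\, p_{0}\right)
\quad\Longrightarrow\quad
p^{*}(x) \;\propto\; p_{0}(x)\, \exp\!\left(\frac{r(x)}{\beta}\right)

% Substituting r(x) = \log\frac{p_{\mathrm{data}}(x)}{p_{0}(x)} with \beta = 1:
p^{*}(x) \;\propto\; p_{0}(x) \cdot \frac{p_{\mathrm{data}}(x)}{p_{0}(x)} \;=\; p_{\mathrm{data}}(x)
```

Under these assumptions, a perfect log-ratio reward with unit KL weight tilts the base model exactly onto the data distribution, which is why the logit rather than the sigmoid probability is the natural reward.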

Original paper on arXiv
