AI image generation can finally skip human scoring: a learned judge tells good from bad
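Today's image generators built on score or flow matching (the family that includes Stable Diffusion) are trained with a regression loss: an ℓ2 error on the predicted score or velocity field. That is a bit like asking a painter to fill in colors by number without checking whether the result looks right or whether the structure holds together. As a result, models often produce distorted faces or odd lighting, and they need a further round of human-preference scoring to fix it. The paper argues that the root problem is a mismatched training objective: regression error and what people perceive as a good image are two different things. The authors propose a neat alternative called Discriminator-Guided RL (DRL). A discriminator is trained in the feature space of a pretrained vision model (DINOv2, for example) to tell real images from the base model's own samples, and its output logit serves as the reward signal for KL-regularized reinforcement-learning fine-tuning. Without any human labels, image quality improves sharply (guidance-free FID on SiT drops from 9.38 to 2.62), and later preference-based fine-tuning reaches a better trade-off between preference reward and image fidelity. This is not a tool you can use tomorrow, but it points to a trend: models are learning to correct themselves with signals closer to human perception, and may eventually no longer need large amounts of human feedback.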
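To make the "discriminator logit as reward" idea concrete, here is a minimal sketch, not the paper's implementation. It assumes the features are synthetic Gaussian vectors standing in for embeddings from a pretrained encoder, and it uses plain logistic regression as the discriminator. The names `train_logit_discriminator` and `reward` are hypothetical, introduced only for illustration.

```python
import numpy as np

def train_logit_discriminator(real_feats, fake_feats, steps=500, lr=0.5):
    """Fit logistic regression with label 1 = data, 0 = model samples."""
    x = np.concatenate([real_feats, fake_feats])
    y = np.concatenate([np.ones(len(real_feats)), np.zeros(len(fake_feats))])
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        logits = x @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))
        grad = probs - y                      # gradient of the BCE loss w.r.t. logits
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def reward(feats, w, b):
    """The reward is the raw logit, not the sigmoid probability."""
    return feats @ w + b

# Assumption: stand-in features. A real pipeline would embed images
# with a frozen pretrained encoder instead of sampling Gaussians.
rng = np.random.default_rng(0)
dim = 4
real_feats = rng.normal(np.full(dim, 0.5), 1.0, size=(2000, dim))
fake_feats = rng.normal(np.full(dim, -0.5), 1.0, size=(2000, dim))

w, b = train_logit_discriminator(real_feats, fake_feats)
real_reward = reward(real_feats, w, b).mean()
fake_reward = reward(fake_feats, w, b).mean()
print(f"mean reward on data samples:  {real_reward:.2f}")
print(f"mean reward on model samples: {fake_reward:.2f}")
```

In this toy setup the two feature distributions differ only in their mean, so the true log-density ratio is linear in the features and a linear discriminator can recover it. Samples that look more like data get a higher reward, which is the signal an RL fine-tuning step would then push the generator toward.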
📄 Original abstract (English)
Score- and flow-matching models often rely on preference-based reinforcement learning for two purposes: aligning with subjective preferences and, surprisingly, recovering properties such as visual realism and coherent object structure that matching-based training is intended to learn from the data itself. We argue that this reflects a structural mismatch. Matching losses measure ℓ2 regression error on the velocity or score field under training-time marginals, a proxy poorly aligned with the visual and semantic properties that determine sample quality at inference. Given a reward aligned with these properties, RL sidesteps the mismatch by evaluating the model on its own samples and following the reward landscape directly. The challenge is to obtain such a reward without relying on human preferences, which are expensive and conflate data realism with annotator inclinations. We propose Discriminator-Guided RL (DRL). DRL trains a discriminator to separate data from base-model samples in a pretrained representation space and uses its logit as the reward in KL-regularized RL. The pretrained space restricts the discriminator to perceptually meaningful directions, and the logit estimates the log-likelihood ratio between data and model, which is the optimal reward for targeting the data distribution. Across SiT, JiT, REPA, and RAE, DRL reduces guidance-free FID (e.g., 9.38 to 2.62 on SiT) and semantic-space FD (e.g., 88.2 to 19.3 on DINOv3 for SiT), with consistent gains across all backbones, and improves human-preference rewards without training on them. It also yields a better Pareto frontier between preference reward and image fidelity under subsequent preference-based post-training, increasing alignment while reducing low-level artifacts such as oversaturation and excessive brightness.
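The abstract's claim that the log-likelihood ratio is the optimal reward follows from the standard closed-form solution of KL-regularized reward maximization. The sketch below uses notation assumed here rather than taken from the paper: $p_0$ is the base model, $p_{\text{data}}$ the data distribution, $\beta$ the KL weight, and $D^\star$ the Bayes-optimal discriminator trained on equal numbers of data and model samples.

```latex
\begin{aligned}
D^\star(x) &= \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_0(x)}
\;\Longrightarrow\;
\operatorname{logit} D^\star(x) = \log\frac{D^\star(x)}{1 - D^\star(x)}
= \log\frac{p_{\text{data}}(x)}{p_0(x)},\\[4pt]
p^\star &= \arg\max_{p}\; \mathbb{E}_{x\sim p}\big[r(x)\big] - \beta\,\mathrm{KL}\big(p \,\|\, p_0\big)
\;\Longrightarrow\;
p^\star(x) \propto p_0(x)\,\exp\!\big(r(x)/\beta\big),\\[4pt]
r(x) &= \log\frac{p_{\text{data}}(x)}{p_0(x)},\quad \beta = 1
\;\Longrightarrow\;
p^\star(x) \propto p_0(x)\cdot\frac{p_{\text{data}}(x)}{p_0(x)} = p_{\text{data}}(x).
\end{aligned}
```

Under these idealized assumptions (a perfect discriminator and $\beta = 1$), the KL-regularized optimum is exactly the data distribution. In practice the discriminator is only an estimate, so the result is approximate.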