📄 Paper Explained

AI Image Generation: Letting the Generator Teach the Tokenizer How to Do Its Job

Tags: image generation, end-to-end training, vector quantization, autoregressive models, representation alignment

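AI image generation today usually runs in two stages: first a tokenizer is trained to break an image into discrete "visual words", then a generator is trained to predict those words. The catch is that the tokenizer only cares about reconstructing the image accurately, not about whether the generator finds its tokens easy to learn. GEAR trains the two jointly: the generator sends feedback to the tokenizer through a differentiable soft branch, in effect telling it "I can predict the tokens more easily if you split the image this way". As a result, ImageNet gFID converges up to 10x faster than with the LlamaGen-REPA baseline, and the learned patch-level features are markedly better and more spatially coherent. This is not a tool you can pick up and use tomorrow, but it points to a new idea in generative AI: let the downstream task guide the upstream representation, rather than passing information in one direction only.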

📄 Original Abstract (English)

Visual generative models are typically trained in two stages. A tokenizer is first trained for reconstruction and then frozen, after which a generator is trained on its discrete indices or continuous latents. This decoupling leaves the tokenizer unaware of what the generator finds easy to model. We present GEAR (Guided End-to-end AutoRegression), which trains a vector-quantized (VQ) tokenizer and an autoregressive (AR) generator jointly and end-to-end, guided by representation alignment. The key obstacle is that the VQ index fed to the AR model is non-differentiable, so gradients cannot reach the tokenizer, and a straight-through estimator collapses. GEAR resolves this with a dual read-out of the codebook assignment. A hard, one-hot branch trains the AR with next-token prediction, while a differentiable soft branch carries a representation-alignment loss that flows back to guide only the tokenizer. The AR model thereby steers its tokenizer toward an index distribution it can predict more easily. This shifts the alignment burden from the tokenizer to the AR: the tokenizer's own features become less DINOv2-like while the AR's become more so, the opposite of diffusion-side recipes that make the latent itself semantic. GEAR speeds up ImageNet gFID convergence by up to 10x relative to the strong LlamaGen-REPA baseline, learns markedly better patch-level and spatially-coherent features, and generalizes across quantizers (VQVAE, LFQ, IBQ) and to text-to-image generation.
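The dual read-out described in the abstract can be illustrated with a small sketch. The abstract does not give the exact form of the soft branch, so everything here is an assumption made for illustration: the function names `dual_readout` and `soft_embedding`, the `temperature` parameter, and the use of a softmax over negative squared distances. The code uses plain Python with no autodiff, so it only shows what each branch reads out of the same codebook assignment, not the training loop itself.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dual_readout(feature, codebook, temperature=1.0):
    """Read one codebook assignment out in two ways (illustrative only).

    Hard branch: index of the nearest code, a discrete value with no gradient.
    Soft branch: softmax over negative distances, a smooth function of the
    feature (assumed form; the paper's exact soft branch may differ).
    """
    dists = [squared_distance(feature, code) for code in codebook]

    # Hard branch: the token index an AR model would be trained to predict.
    hard_index = min(range(len(codebook)), key=lambda k: dists[k])

    # Soft branch: numerically stable softmax over -distance / temperature.
    logits = [-d / temperature for d in dists]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    soft_weights = [e / total for e in exps]
    return hard_index, soft_weights


def soft_embedding(soft_weights, codebook):
    """Weighted average of code vectors under the soft assignment."""
    dim = len(codebook[0])
    return [
        sum(w * code[i] for w, code in zip(soft_weights, codebook))
        for i in range(dim)
    ]


# Toy codebook with three 2-D codes and one encoder feature (made-up values).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
feature = [0.9, 0.1]

hard_index, soft_weights = dual_readout(feature, codebook, temperature=0.5)
soft_vec = soft_embedding(soft_weights, codebook)
print(hard_index, soft_weights, soft_vec)
```

In a real training setup, the hard index would feed the next-token prediction loss for the AR model, while a representation-alignment loss computed from the soft read-out would be the only path by which gradients reach the tokenizer. How the soft read-out enters that loss is not specified in the abstract and is left out of this sketch.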

Original paper on arXiv
