Compressing images into text-like discrete symbols without losing detail
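When AI models process images, they usually turn them into continuous, high-dimensional vectors, which is computationally expensive. This paper goes the other way and compresses images into discrete symbols of the same kind as text (roughly, translating a picture into a handful of "words"). Earlier attempts at this lost either fine detail or semantics. ViQ trains in two stages: it first aligns the visual encoder with the semantics learned by a language model, then converts images into discrete codes through progressive feature compression and position-aware quantization, keeping both low-level detail and high-level semantics. Across several multimodal tasks, ViQ's discrete representation performs on par with continuous vectors, and multimodal training runs up to 20%-70% faster, depending on the base LLM and training recipe. It is not something you can use tomorrow, but it points in a direction: future multimodal models may handle images as efficiently as they handle text.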
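To make "turning a feature into discrete codes" concrete, here is a minimal sketch of generic head-wise vector quantization: a feature vector is split into equal chunks ("heads"), and each chunk is replaced by the index of its nearest entry in that head's codebook. This is not ViQ's implementation. The function names, the toy codebooks, and the fixed (unlearned) codebook values are all assumptions made for illustration, and the paper's position-aware component and progressive feature-space compaction are not modeled here.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest_code(vec: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vec, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def quantize_headwise(feature: Vector,
                      codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Split feature into len(codebooks) equal chunks and quantize each
    chunk against its own codebook, giving one integer code per head."""
    num_heads = len(codebooks)
    if len(feature) % num_heads != 0:
        raise ValueError("feature length must be divisible by the number of heads")
    head_dim = len(feature) // num_heads
    return [
        nearest_code(feature[h * head_dim:(h + 1) * head_dim], codebooks[h])
        for h in range(num_heads)
    ]


def dequantize_headwise(codes: Sequence[int],
                        codebooks: Sequence[Sequence[Vector]]) -> List[float]:
    """Rebuild an approximate feature by concatenating the chosen entries."""
    restored: List[float] = []
    for head, code in enumerate(codes):
        restored.extend(codebooks[head][code])
    return restored


# Hypothetical toy setup: one 4-dimensional patch feature, 2 heads,
# 3 codes per head. Real codebooks would be learned, not hand-written.
TOY_CODEBOOKS = [
    [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)],  # head 0
    [(0.5, 0.5), (2.0, 0.0), (0.0, -2.0)],  # head 1
]
patch_feature = [0.9, 1.2, 0.1, -1.7]

codes = quantize_headwise(patch_feature, TOY_CODEBOOKS)
restored = dequantize_headwise(codes, TOY_CODEBOOKS)
print(codes, restored)
```

The patch is now two small integers instead of four floats, which is the sense in which an image becomes a short sequence of "words". The gap between `patch_feature` and `restored` is the information lost to quantization, the loss that the summary says ViQ tries to keep small for both detail and semantics.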
📄 Original abstract (English)
A unified representation for text and vision is a natural pursuit, as it enables simpler multimodal modeling and more efficient training. However, representing images as discrete signals in the same way as text inevitably introduces severe information loss. Existing work struggles to balance low-level details and high-level semantics in discrete representations: reconstruction-oriented representations often lack semantic information, whereas semantically stronger features typically suffer from severe loss of detail. We present ViQ, a Visual Quantized Representations framework, which is designed to balance semantics and details in discrete representations while supporting inputs at native resolutions, thereby enabling it to serve as a unified and general discrete representation for arbitrary visual inputs. Our approach structures quantization learning into two stages: text-aligned pre-training and feature discretization. With text-aligned pre-training, we enhance the visual encoder with semantic-rich supervision from the pretrained language model and enable it to process native-resolution visual inputs. During discretization, we propose a proximal representation learning strategy to progressively compact the feature space, along with a position-aware head-wise quantization mechanism that enables flexible processing of arbitrary resolutions. Extensive experiments on multimodal tasks demonstrate that ViQ achieves competitive performance compared to state-of-the-art multimodal vision encoders with continuous and high-dimensional visual features, while maintaining high precision in low-level reconstruction. We also show that multimodal training with visual quantized representations largely improves efficiency, yielding up to 20%-70% acceleration with different base LLMs and training recipes.