AI can finally read images and video with one set of eyes
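Today's multimodal AI models usually handle images and video separately, as if they had two different sets of eyes. HYDRA-X is, according to its authors, the first unified multimodal model to tokenize both images and video with a single Vision Transformer. Its ablations report two findings. First, for reconstruction, frame-level causal temporal attention (each frame attends to itself and to earlier frames, not to later ones) is sufficient, while full attention across all frames actually hurts. Second, compressing video along the time axis in several stages works substantially better than doing it in a single step. Just as important, it injects image-level and video-level semantic knowledge into the same compact latent space, so the model can both understand and generate. This is not a product you can use tomorrow, but it points to a trend: future AI may, like people, use one visual system to understand the world, whether the scene is still or moving.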
📄 Original abstract (English)
Holistic visual tokenizers are fundamental to unified multimodal models (UMMs) as they map diverse visual inputs into a unified representation space. In this paper, we present HYDRA-X, the first UMM that unifies image and video tokenization within a single Vision Transformer (ViT). Our design is driven by two core challenges: efficiently injecting spatiotemporal reconstruction capability into a native ViT, and embedding image- and video-level semantic awareness into the latent space. To address the first, comprehensive ablations reveal two key findings: (1) frame-level causal temporal attention suffices for visual reconstruction, whereas full spatiotemporal attention degrades it; and (2) hierarchical temporal compression substantially outperforms single-step alternatives. To tackle the second, we propose a lightweight decompressor that upsamples temporally compressed features under joint image-video teacher supervision, thereby enforcing complementary semantic structures within the compact latent space. Building on this holistic tokenizer, we further propose a principled improvement of the editing pipeline: source-target interaction should occur at the latent level inside the tokenizer rather than at the semantic level inside the LLM, substantially improving editing consistency and accelerating convergence. Instantiated at the 7B dense model, HYDRA-X achieves strong performance across image and video understanding and generation tasks, paving the way for future unified-tokenizer UMMs.
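To make "frame-level causal temporal attention" concrete, here is a minimal sketch of the attention mask it implies. This is one plausible reading of the abstract and not the authors' code: it assumes tokens are laid out frame by frame, that tokens within a frame see each other freely, and that no token sees a later frame. The function names `frame_causal_mask` and `full_mask` are hypothetical, chosen for illustration.

```python
def frame_causal_mask(num_frames: int, tokens_per_frame: int) -> list[list[bool]]:
    """Return mask[q][k] == True where query token q may attend to key token k.

    Assumption (not from the paper): tokens are ordered frame by frame, so
    token i belongs to frame i // tokens_per_frame. A query sees every token
    in its own frame and in all earlier frames, and nothing from later frames.
    """
    total = num_frames * tokens_per_frame
    frame_of = [i // tokens_per_frame for i in range(total)]
    return [[frame_of[k] <= frame_of[q] for k in range(total)] for q in range(total)]


def full_mask(num_frames: int, tokens_per_frame: int) -> list[list[bool]]:
    """Full spatiotemporal attention for comparison: every token sees every token."""
    total = num_frames * tokens_per_frame
    return [[True] * total for _ in range(total)]


if __name__ == "__main__":
    # 3 frames with 2 tokens each; "x" = attention allowed, "." = blocked.
    for row in frame_causal_mask(3, 2):
        print("".join("x" if allowed else "." for allowed in row))
    # Prints:
    # xx....
    # xx....
    # xxxx..
    # xxxx..
    # xxxxxx
    # xxxxxx
```

The block-staircase shape is the point: a single image is just the one-frame case, where the mask is all `True`, which is one way a single ViT could treat images and video uniformly. Whether HYDRA-X implements it exactly this way is not stated in the abstract.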