📄 Paper Digest

AI Can Finally Tell Where a Sound Comes From When Watching Video

Tags: video understanding · multimodal · question answering · AI reasoning

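Today's video models either look only at the frames or process audio and visuals separately. The result: the model hears a dog bark but cannot tell where the dog is, and the same object gets described as two different things in different clips. This paper proposes a new approach. It first writes a "structured script" for the video: a list of the main entities (people, dogs, cars), followed by a clip-by-clip description of the sounds and visuals tied to each entity, so that references stay consistent from start to finish. The model then mines cross-segment, cross-modal clues from the script and writes questions based on those clues. Models fine-tuned on the resulting dataset, OmniVideo-100K, improve by up to 20.59% on the paper's own test set, OmniVideo-Test, and by up to 12.64% on existing benchmarks such as Daily-Omni and JointAVBench. It is not a tool you can use tomorrow, but it shows that AI video understanding is moving from "watching clips" to "watching the whole thing", one step closer to actually understanding video.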
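To make the "structured script" idea concrete, here is a minimal Python sketch of what such a script could look like as data, plus a toy check for cross-segment, cross-modal clues. Everything in it is an assumption for illustration: the class names `Segment` and `VideoScript`, the entity ids, and the functions `unknown_entity_ids` and `mine_clues` are hypothetical and do not come from the paper. In particular, the paper prompts a model to mine clues from the script; the rule-based filter below is only a stand-in that shows what kind of clue is being looked for.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Segment:
    """One clip of the video, described per entity and per modality."""
    start_s: float
    end_s: float
    visual: Dict[str, str] = field(default_factory=dict)  # entity id -> what is seen
    audio: Dict[str, str] = field(default_factory=dict)   # entity id -> what is heard


@dataclass
class VideoScript:
    """Hypothetical layout: summary, global entity list, segment descriptions."""
    summary: str
    entities: Dict[str, str]  # entity id -> short description (the global prior)
    segments: List[Segment]


def unknown_entity_ids(script: VideoScript) -> List[str]:
    """Entity ids used in some segment but missing from the global entity list."""
    used = set()
    for seg in script.segments:
        used |= set(seg.visual) | set(seg.audio)
    return sorted(used - set(script.entities))


def mine_clues(script: VideoScript) -> List[dict]:
    """Toy stand-in for clue mining: keep entities that are both seen and
    heard somewhere in the video and that span at least two segments."""
    clues = []
    for entity_id in script.entities:
        seen = [i for i, s in enumerate(script.segments) if entity_id in s.visual]
        heard = [i for i, s in enumerate(script.segments) if entity_id in s.audio]
        if seen and heard and len(set(seen) | set(heard)) >= 2:
            clues.append({"entity": entity_id, "seen_in": seen, "heard_in": heard})
    return clues


# Made-up example: a dog is heard off-screen first, then appears on screen.
script = VideoScript(
    summary="A dog barks off-screen, then runs into the yard.",
    entities={"dog_1": "brown dog", "person_1": "woman in a red coat"},
    segments=[
        Segment(0, 10,
                visual={"person_1": "stands at the door"},
                audio={"dog_1": "barking from off-screen"}),
        Segment(10, 20,
                visual={"dog_1": "runs into the yard",
                        "person_1": "kneels down"}),
    ],
)

clues = mine_clues(script)
```

In this example only `dog_1` qualifies as a clue: it is heard in segment 0 and seen in segment 1, which is the kind of link a question such as "which animal that later enters the yard was barking at the start?" would rely on. Because every segment refers to entities by the ids in the shared list, the dog cannot silently become "a puppy" in one clip and "an animal" in another.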

📄 Original Abstract

Current automated pipelines for audio-visual Question Answering (QA) generally adopt a "video-caption-QA" paradigm. However, these methods typically segment videos into short clips and generate separate descriptions for audio and visual modalities. This decoupled processing severs inherent associations between sounds and their visual sources, while independent clip processing often causes inconsistent descriptions of the same entity across segments. Furthermore, coupling long-text comprehension and QA synthesis into a single step often restricts models to localized events, yielding questions lacking long-term temporal connections and deep cross-modal reasoning. To address these issues, we propose an automated data engine featuring two mechanisms: (1) Entity-Anchored Video Scripting transforms videos into structured scripts, comprising summaries, main entity lists, and segment-wise audio-visual descriptions. The entity list serves as a global prior to ensure cross-segment referential consistency and reconstruct audio-visual associations. (2) Clue-Guided QA Generation prompts models to first mine cross-segment, multimodal clues from the script, and subsequently generate QA pairs based on these high-value clues. Leveraging this pipeline, we construct the instruction-tuning dataset OmniVideo-100K and a human-verified test set, OmniVideo-Test. Fine-tuning VITA-1.5, Qwen2.5-Omni-7B and Qwen3-Omni-30B on OmniVideo-100K yields performance gains of up to 20.59% on OmniVideo-Test, demonstrating strong generalization (up to 12.64% improvements) across established benchmarks like Daily-Omni and JointAVBench.

Original paper on arXiv
