VideoRAG: Not Every Chunk Should Be Retrieved the Same Way
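AI question-answering systems are starting to handle long video, but there is a counterintuitive problem: many questions in existing benchmarks can be answered correctly without looking at the video at all, which hides retrieval errors. This paper does two things. First, it builds a new benchmark, V-RAGBench, in which every question can only be answered by relying on a specific chunk of the video. Second, it proposes a method, CARVE, that lets the system apply different retrieval configurations to different chunks of a video (for example, some retrieved by text and some by visual content, some viewed as short clips and some as long ones) and then adaptively picks the best configuration for each chunk. CARVE outperforms the eight recent VideoRAG baselines it is compared against. It is not something you can put to use tomorrow, but it points at a key direction for VideoRAG: stop retrieving every chunk with a single one-size-fits-all configuration.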
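To make the per-chunk idea concrete, here is a minimal Python sketch of "run one retriever per configuration, then keep each chunk's best configuration". It is an illustration under stated assumptions, not the paper's implementation: the names `Candidate`, `retrieve_all`, and `chunk_adaptive_select` are hypothetical, a configuration is modeled as a (modality, granularity) pair, chunk IDs are assumed to be shared across configurations, and the reranker is a placeholder callable.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Assumption: a configuration is a (modality, granularity) pair.
Config = Tuple[str, str]


@dataclass(frozen=True)
class Candidate:
    chunk_id: int
    config: Config
    score: float


def retrieve_all(
    query: str,
    retrievers: Dict[Config, Callable[[str], Dict[int, float]]],
    top_n: int,
) -> List[Candidate]:
    """Query every per-configuration retriever and pool the top_n of each.

    The retrievers are independent, so they could run in parallel;
    they are called one after another here to keep the sketch short.
    """
    pooled: List[Candidate] = []
    for config, retriever in retrievers.items():
        scored = retriever(query)  # maps chunk_id -> retrieval score
        best = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
        pooled.extend(Candidate(cid, config, s) for cid, s in best)
    return pooled


def chunk_adaptive_select(
    query: str,
    candidates: List[Candidate],
    rerank: Callable[[str, Candidate], float],
    k: int,
) -> List[Candidate]:
    """Keep, for each chunk, the configuration with the best rerank score,
    then return the k highest-scoring chunks."""
    winners: Dict[int, Candidate] = {}
    for cand in candidates:
        rescored = Candidate(cand.chunk_id, cand.config, rerank(query, cand))
        current = winners.get(cand.chunk_id)
        if current is None or rescored.score > current.score:
            winners[cand.chunk_id] = rescored
    return sorted(winners.values(), key=lambda c: c.score, reverse=True)[:k]


# Toy data: two configurations with made-up scores for three chunks.
toy_retrievers = {
    ("text", "short"): lambda q: {1: 0.9, 2: 0.2, 3: 0.1},
    ("visual", "long"): lambda q: {1: 0.3, 2: 0.8, 3: 0.4},
}
# Placeholder reranker that simply reuses the retrieval score.
toy_rerank = lambda q, cand: cand.score

pool = retrieve_all("toy query", toy_retrievers, top_n=2)
evidence = chunk_adaptive_select("toy query", pool, toy_rerank, k=2)
for cand in evidence:
    print(cand.chunk_id, cand.config, cand.score)
```

In the toy data, chunk 1 wins under the text/short configuration and chunk 2 under visual/long, so the evidence list mixes configurations. A query-level method would have to pick one configuration for both chunks.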
📄 Original abstract (English)
Retrieval-augmented generation is moving beyond text into long, egocentric video, where systems must select query-relevant chunks across multiple modalities and temporal granularities. Yet progress in VideoRAG is limited by two gaps: existing benchmarks allow queries to be answered without the video, obscuring retrieval errors, and prior methods apply a single modality-granularity configuration per query, ignoring chunk-level variability. We address both by introducing V-RAGBench, a benchmark of ⟨query, evidence chunk, answer⟩ triplets that enables faithful, decoupled evaluation of retrieval and generation, and CARVE, a simple method that runs parallel retrievers across configurations and employs chunk-adaptive reranking to identify the winning configuration for each chunk. Each chunk then enters the generator under its winning configuration selected during retrieval, yielding an interleaved evidence form where the chunk-level decision propagates across both stages. CARVE outperforms eight recent VideoRAG baselines, with the chunks supplied to the generator interleaving multiple configurations rather than sharing a single one, a behavior unattainable by query-level methods.
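As a rough illustration of what decoupled evaluation over such triplets could look like, the sketch below scores retrieval against the annotated evidence chunk and generation against the reference answer, as two separate numbers. Everything here is an assumption made for illustration: the abstract does not name the benchmark's metrics, so hit rate at k and case-insensitive exact match are stand-ins, and `Triplet`, `retrieval_hit_rate`, and `answer_accuracy` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Triplet:
    query: str
    evidence_chunk: int  # ID of the chunk the answer depends on
    answer: str


def retrieval_hit_rate(
    triplets: Sequence[Triplet],
    retrieved: Sequence[List[int]],
    k: int,
) -> float:
    """Fraction of queries whose evidence chunk is among the top-k retrieved IDs."""
    hits = sum(t.evidence_chunk in r[:k] for t, r in zip(triplets, retrieved))
    return hits / len(triplets)


def answer_accuracy(
    triplets: Sequence[Triplet],
    predictions: Sequence[str],
) -> float:
    """Fraction of predictions matching the reference answer (case-insensitive)."""
    correct = sum(
        p.strip().lower() == t.answer.strip().lower()
        for t, p in zip(triplets, predictions)
    )
    return correct / len(triplets)


# Toy data, invented for this example.
toy_triplets = [
    Triplet("What color is the mug?", evidence_chunk=3, answer="red"),
    Triplet("How many eggs are cracked?", evidence_chunk=7, answer="two"),
]
toy_retrieved = [[3, 5], [1, 2]]  # ranked chunk IDs returned per query
toy_predictions = ["Red", "two"]  # generator outputs per query

print(retrieval_hit_rate(toy_triplets, toy_retrieved, k=2))
print(answer_accuracy(toy_triplets, toy_predictions))
```

In the toy data the second query is answered correctly even though its evidence chunk was never retrieved. An answer-only score would hide that miss, while the separate retrieval score exposes it.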