📄 Paper Digest

Video Embedding Evaluation: No All-Around Champion, and Audio Is a Double-Edged Sword


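You might assume that evaluating video AI is simply a matter of seeing which model is most accurate. The new large-scale benchmark MVEB shows otherwise: no single model wins across every task. It evaluates 33 models on 23 tasks and finds that MLLM-based embeddings lead on classification, clustering, pair classification, and question answering, while multimodal binding models are stronger on retrieval and zero-shot classification. Generative MLLMs without contrastive adaptation collapse on cross-modal tasks. The more counterintuitive finding concerns audio: when a dataset's labels were produced from both visuals and audio, adding audio improves performance; when the labels were produced from visuals alone, adding audio hurts. The gap between the two cases is six points and holds across model families. One plausible reading is that audio unrelated to the labels distracts the model. This is not a tool you can put to use tomorrow, but it captures where video AI stands today: pick a model for the specific task, and do not assume that adding a modality helps.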

📄 Original Abstract

We introduce the Massive Video Embedding Benchmark (MVEB), a 23-task benchmark for video embeddings spanning classification, zero-shot classification, clustering, pair classification, retrieval, and video-centric question answering. We evaluate 33 models and find that no single model dominates: MLLM-based embeddings lead on classification, clustering, pair classification, and QA; multimodal binding leads on retrieval and zero-shot classification; generative MLLMs without contrastive adaptation collapse on cross-modal tasks. Paired video-only vs. audio+video evaluations show that audio's contribution depends on dataset annotation provenance: audio helps when labels were produced from both modalities and hurts when they were produced from visuals alone, a six-point gap consistent across model families. MVEB is derived from MVEB+, a 184-task pool, and is designed to maintain task diversity while reducing evaluation cost. It integrates into the MTEB ecosystem for unified evaluation across text, image, audio, and video. We release MVEB and all 184 tasks along with code and a leaderboard at https://github.com/embeddings-benchmark/mteb.

Original paper on arXiv
