AI can recognize objects in video, but it can't reason about what happens
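Today's multimodal large language models can watch a video and pick out the "cat" and the "sofa". But ask one "the cat jumped onto the sofa, then jumped off; where did it end up?" and it will most likely get the answer wrong. The problem is not poor eyesight. The model lacks temporal-logical reasoning: it cannot link frames from different moments and draw causal inferences the way a person does.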
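Researchers built a benchmark specifically for this ability, Video-MME-Logical, which breaks the reasoning down into five basic operations: tracking state changes, counting in sequence, judging temporal order, understanding dynamic spatial relations, and composing multiple logical steps. In one task, for example, a red ball appears on screen, then a blue ball moves from left to right, and finally the red ball disappears. The model has to answer: "Was the red ball still there while the blue ball was moving?" Questions like this are easy for people, yet the strongest models score below 30% accuracy on the complex tasks, while humans are close to 90%.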
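To make the ball example concrete, here is a minimal Python sketch of what "state tracking" means for that question. It is not the benchmark's code: the `Event` record, the frame numbers, and the helper names are all hypothetical, chosen only to show that the answer follows from replaying events in time order rather than from recognizing any single frame.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    frame: int  # hypothetical frame index at which the event happens
    obj: str    # object name, e.g. "red_ball"
    kind: str   # "appear", "disappear", "move_start" or "move_end"


def visible_at(events, obj, frame):
    """Replay events up to and including `frame`; return whether `obj` is on screen."""
    visible = False
    for e in sorted(events, key=lambda e: e.frame):
        if e.frame > frame:
            break
        if e.obj == obj and e.kind == "appear":
            visible = True
        elif e.obj == obj and e.kind == "disappear":
            visible = False
    return visible


def move_interval(events, obj):
    """Return the (start, end) frames of the object's movement."""
    start = next(e.frame for e in events if e.obj == obj and e.kind == "move_start")
    end = next(e.frame for e in events if e.obj == obj and e.kind == "move_end")
    return start, end


# Assumed timeline: red ball appears, blue ball moves, then red ball disappears.
timeline = [
    Event(0, "red_ball", "appear"),
    Event(10, "blue_ball", "appear"),
    Event(10, "blue_ball", "move_start"),
    Event(20, "blue_ball", "move_end"),
    Event(30, "red_ball", "disappear"),
]

start, end = move_interval(timeline, "blue_ball")
# "Was the red ball still there while the blue ball was moving?"
answer = all(visible_at(timeline, "red_ball", f) for f in range(start, end + 1))
print(answer)  # True
```

A model has to do the equivalent of this replay from pixels alone, keeping the red ball's state in mind across the frames in which only the blue ball is changing.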
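The researchers also fine-tuned models on up to 500K synthetic samples. Performance improved, but it still fell well short of human level. This suggests that what current AI calls "reasoning" is closer to pattern matching than to genuine logical deduction.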
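This is not technology you can put to use tomorrow, but it does draw a line: don't be fooled by demos of AI "understanding video". It is still a long way from understanding what actually happened.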
📄 Original abstract
Recent interest in multimodal large language models (MLLMs) raises a central question: can they reason over dynamic visual evidence rather than merely recognize objects or events in individual frames? This ability, which we refer to as video temporal-logical reasoning, requires models to maintain, update, and compose evidence as visual states evolve across frames. Existing video benchmarks often conflate this capability with scene complexity, static recognition, or uncontrolled temporal variation. To isolate this capability, we introduce Video-MME-Logical, a controlled benchmark organized around five temporal-logical operations: state tracking, sequential counting, temporal ordering, dynamic spatiality, and structural composition. The benchmark contains 25 fine-grained task categories generated with controlled object states, transitions, temporal dependencies, and logical compositions. It enables difficulty-controlled final-answer evaluation by varying temporal horizon and reasoning complexity, and supports intermediate-state diagnostics by verifying whether models recover the required logical reasoning trace before producing the final answer. Experiments with state-of-the-art MLLMs reveal a substantial human-model gap, especially as temporal-logical complexity increases. Supervised fine-tuning on up to 500K generated samples improves performance but remains insufficient to close the reasoning gap, positioning Video-MME-Logical as a scalable testbed for analyzing and improving temporal-logical reasoning in MLLMs.
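The abstract distinguishes final-answer evaluation from intermediate-state diagnostics. The Python sketch below illustrates the difference on a toy sequential-counting item. The dictionary layout (`trace`, `answer`) and the `diagnose` helper are assumptions made for illustration, not the benchmark's actual data format or scoring code.

```python
def diagnose(reference, prediction):
    """Compare a predicted reasoning trace and final answer against a reference.

    Both arguments are dicts with a "trace" list (one state per step) and an
    "answer" value. This layout is hypothetical.
    """
    ref_trace, pred_trace = reference["trace"], prediction["trace"]
    final_correct = prediction["answer"] == reference["answer"]
    trace_correct = pred_trace == ref_trace

    # Index of the first step where the two traces disagree, if any.
    first_error = next(
        (i for i, (p, r) in enumerate(zip(pred_trace, ref_trace)) if p != r),
        None,
    )
    if first_error is None and len(pred_trace) != len(ref_trace):
        first_error = min(len(pred_trace), len(ref_trace))

    return {
        "final_correct": final_correct,
        "trace_correct": trace_correct,
        "first_error_step": first_error,
    }


# Toy item: the running count after each of three events should be 1, 2, 3.
reference = {"trace": [1, 2, 3], "answer": 3}
# A model that lands on the right answer but missed the second event on the way.
prediction = {"trace": [1, 1, 3], "answer": 3}

report = diagnose(reference, prediction)
print(report)
```

Under final-answer scoring alone this prediction counts as correct. The trace check shows that the intermediate state was wrong at step 1, which is the kind of case a trace-level diagnostic is meant to surface.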