AI Watching Video: Watch, Remember, and Think Like a Human
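Today's video AI either looks at only a few seconds of footage or cannot remember what happened earlier. This survey breaks the problem into three abilities: watching (capturing detail), remembering (preserving context), and thinking (reasoning about cause and effect). It is not just a tally of how strong the models are; it makes a counterintuitive point: an AI's "memory" should not store every frame, but should learn to forget irrelevant information and keep only the key clues. When watching a film, for example, the AI has to know which shot is foreshadowing and which is filler. The survey does not hand you a tool you can use tomorrow, but it does offer a framework: future competition in video AI will be less about who watches more and more about who remembers smartly and reasons correctly.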
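To make the "forget the filler, keep the clues" idea concrete, here is a minimal Python sketch of a fixed-size memory that keeps only the most salient observations. It is an illustration under stated assumptions, not a method from the survey: `BoundedMemory`, `MemoryItem`, and the hand-written salience scores are all hypothetical, and a real system would have to learn or estimate salience.

```python
import heapq
from dataclasses import dataclass, field
from typing import List


@dataclass(order=True)
class MemoryItem:
    # Only salience takes part in ordering; the other fields are payload.
    salience: float
    frame_index: int = field(compare=False)
    note: str = field(compare=False)


class BoundedMemory:
    """Keeps at most `capacity` observations, evicting the least salient one."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._heap: List[MemoryItem] = []  # min-heap keyed on salience

    def observe(self, frame_index: int, note: str, salience: float) -> None:
        item = MemoryItem(salience, frame_index, note)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        elif item.salience > self._heap[0].salience:
            # The new observation beats the weakest memory, so replace it.
            heapq.heapreplace(self._heap, item)
        # Otherwise the observation is simply forgotten.

    def recall(self) -> List[MemoryItem]:
        # Return surviving memories in chronological order.
        return sorted(self._heap, key=lambda m: m.frame_index)


# Hypothetical salience scores for five shots of a film.
shots = [
    (0, "establishing shot", 0.1),
    (1, "character hides a key", 0.9),
    (2, "small talk", 0.2),
    (3, "door turns out to be locked", 0.8),
    (4, "landscape", 0.1),
]

memory = BoundedMemory(capacity=2)
for index, note, salience in shots:
    memory.observe(index, note, salience)

kept = [m.frame_index for m in memory.recall()]
print(kept)  # [1, 3]
```

With room for only two items, the memory ends up holding the hidden key and the locked door, which are the two shots a later question about the plot would need.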
📄 Original abstract (English)
Video understanding is being rapidly transformed by multimodal large language models (MLLMs), as research moves from short clips to long, multimodal, and knowledge-intensive video scenarios. These scenarios require models to handle sparse evidence, long-range dependencies, multimodal alignment, and reliable inference under limited computational budgets. This work presents a human-view perspective on LLM-based video understanding, organized around three functional abilities: watching, remembering, and reasoning. Rather than treating video tasks as isolated benchmarks, this view provides a unified structure for analyzing how video MLLMs acquire evidence, preserve context, and produce grounded outputs. We introduce a formulation that characterizes video understanding systems by their perceptual representations, memory states, reasoning traces, and final predictions. Based on this formulation, we identify challenges in spatio-temporal perception, efficient long-video processing, memory modeling, streaming understanding, and faithful reasoning. Representative methods are organized by their roles in video MLLM systems. Watching covers fine-grained, comprehensive, audio-visual, and efficient perception. Remembering includes offline and streaming memory, while reasoning covers text-only reasoning and thinking with videos. We further examine application domains such as egocentric, sports, instructional, medical, and narrative videos, and cover training datasets and evaluation benchmarks across task types, supervision formats, modalities, and capability dimensions. Finally, we outline open problems and future directions for scalable, memory-aware, and evidence-grounded video intelligence. Related works will be continuously traced at https://github.com/marinero4972/Awesome-HumanView-VideoUnderstanding.
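The abstract describes a formulation built from perceptual representations, memory states, reasoning traces, and final predictions, but does not spell it out. The toy Python sketch below shows one way those four pieces could fit together. Everything in it is an assumption made for illustration: frames are stood in for by caption strings, and `VideoState`, `watch`, `remember`, `reason`, and `run` are hypothetical names, not the paper's notation or any library's API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class VideoState:
    memory: Dict[str, int] = field(default_factory=dict)  # token -> first step seen
    trace: List[str] = field(default_factory=list)        # reasoning trace
    prediction: Optional[str] = None                      # final answer


def watch(frame: str) -> Set[str]:
    """Perception: turn a frame (here, a caption string) into a set of tokens."""
    return set(frame.lower().split())


def remember(state: VideoState, step: int, percept: Set[str]) -> None:
    """Memory update: record only the first step at which each token appears."""
    for token in percept:
        state.memory.setdefault(token, step)


def reason(state: VideoState, first: str, second: str) -> str:
    """Answer 'did `first` appear before `second`?' and log the evidence used."""
    a = state.memory.get(first)
    b = state.memory.get(second)
    if a is None or b is None:
        state.trace.append(f"missing evidence for {first!r} or {second!r}")
        state.prediction = "unknown"
    else:
        state.trace.append(f"{first!r} first seen at step {a}")
        state.trace.append(f"{second!r} first seen at step {b}")
        state.prediction = "yes" if a < b else "no"
    return state.prediction


def run(frames: List[str], first: str, second: str) -> VideoState:
    """Process frames one at a time (streaming style), then answer the query."""
    state = VideoState()
    for step, frame in enumerate(frames):
        remember(state, step, watch(frame))
    reason(state, first, second)
    return state


frames = ["chef cracks egg", "chef whisks batter", "chef pours batter"]
result = run(frames, "egg", "batter")
print(result.prediction)  # yes
print(result.trace)
```

The point of the sketch is the separation of roles: the prediction is derived only from what memory retained, and the trace records which remembered evidence supported it, which is the sense in which an output can be called grounded.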