AI finally stops "forgetting" long videos: with just 2% of the context, accuracy rises by 12.5 points
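Today's AI copes reasonably well with videos a few minutes long, but give it an hours-long film or surveillance recording and it "forgets": either the flood of frames overwhelms its compute budget, or its attention is diluted until it retains nothing useful. This paper's fix, a framework called MemDreamer, is counterintuitive. Rather than trying to make the model remember every frame, it has the model work like a detective: first scan through the video and chart the key events and their causal relations into a "knowledge map" (a Hierarchical Graph Memory), then, only when a question needs answering, use tools to look up precise clues on that map. As a result, with a reasoning context just 2% the size of the full video input, accuracy on long-video question answering rises by 12.5 percentage points, and the gap with human experts narrows to only 3.7 points. More importantly, the researchers found that the stronger a model's logical reasoning, the better it performs on long videos. That suggests getting AI to "understand" long video may depend less on piling on compute and more on improving its ability to "think".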
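To make the "knowledge map plus tool lookup" idea concrete, here is a minimal Python sketch. It is not the paper's implementation: the tier names (`video`, `scene`, `event`), the tool names (`list_children`, `search_nodes`, `find_causes`), the toy kitchen footage, and the scripted loop that stands in for a reasoning model are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    tier: str   # hypothetical tier names: "video", "scene", "event"
    text: str   # short caption standing in for stored visual content
    children: list = field(default_factory=list)


class GraphMemory:
    """Toy hierarchical memory: a tree of summaries over a graph of events."""

    def __init__(self):
        self.nodes = {}
        self.edges = []  # (source_id, relation, target_id)

    def add_node(self, node_id, tier, text, parent=None):
        self.nodes[node_id] = Node(node_id, tier, text)
        if parent is not None:
            self.nodes[parent].children.append(node_id)

    def add_edge(self, source, relation, target):
        self.edges.append((source, relation, target))

    # --- tools a reasoning model could call (names are assumptions) ---
    def list_children(self, node_id):
        """Navigate one level down the hierarchy."""
        return list(self.nodes[node_id].children)

    def search_nodes(self, keyword, tier="event"):
        """Keyword search over the captions of one tier."""
        kw = keyword.lower()
        return [n.node_id for n in self.nodes.values()
                if n.tier == tier and kw in n.text.lower()]

    def find_causes(self, node_id):
        """Traverse causal edges backwards from an event."""
        return [s for s, r, t in self.edges if r == "causes" and t == node_id]


def build_toy_memory():
    m = GraphMemory()
    m.add_node("video", "video", "A day of kitchen surveillance footage")
    m.add_node("scene1", "scene", "Morning: cooking breakfast", parent="video")
    m.add_node("scene2", "scene", "Afternoon: alarm incident", parent="video")
    m.add_node("e1", "event", "A pan is left on the lit stove", parent="scene1")
    m.add_node("e2", "event", "Smoke fills the kitchen", parent="scene2")
    m.add_node("e3", "event", "The smoke alarm goes off", parent="scene2")
    m.add_edge("e1", "causes", "e2")
    m.add_edge("e2", "causes", "e3")
    return m


def answer_why(memory, keyword, max_steps=10):
    """Scripted Observation-Reason-Action loop.

    A real agent would let the model choose each tool call; here the policy
    is fixed: search for the event, then walk causal edges to a root cause.
    """
    trace = []
    hits = memory.search_nodes(keyword)        # Action: search nodes
    if not hits:                               # Observation: nothing found
        return None, trace
    current = hits[0]
    trace.append(current)
    for _ in range(max_steps):
        causes = memory.find_causes(current)   # Action: traverse an edge
        if not causes:                         # Reason: no earlier cause, stop
            break
        current = causes[0]
        trace.append(current)
    return memory.nodes[current].text, trace


if __name__ == "__main__":
    answer, trace = answer_why(build_toy_memory(), "alarm")
    print(answer, trace)
```

The point of the sketch is that the loop reads only the handful of nodes it visits rather than the whole video. That is the intuition behind keeping the reasoning context small, although this toy does not reproduce the 2% figure.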
📄 Original abstract (English)
Current Vision-Language Models struggle with hours-long videos because processing full-length visual sequences induces prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer to decouple perception and reasoning, shifting long-video understanding into an agentic exploration process. As a plug-and-play framework, it incrementally streams videos to construct a Hierarchical Graph Memory, a top-down three-tier architecture for semantic abstraction, anchored by a foundational graph capturing spatiotemporal and causal relations. During inference, the reasoning model employs agentic tool-augmented retrieval, navigating hierarchies, searching nodes, and traversing logical edges via an Observation-Reason-Action loop. Experiments show MemDreamer achieves SOTA results across four mainstream benchmarks, narrowing the gap with human experts to only 3.7 points. It constrains the reasoning context window to merely 2% of full-context ingestion while delivering a 12.5 point absolute accuracy gain. Furthermore, statistical analysis uncovers a strong positive linear correlation between a VLM's performance on logic reasoning and long-video understanding benchmarks, establishing agentic capability scaling as a new paradigm for multimodal comprehension.