📄 Paper Digest

AI Finally Stops "Forgetting" Long Videos: Using Only 2% of the Context, Accuracy Rises by 12.5 Points

Tags: long-video understanding, hierarchical graph memory, agentic retrieval, decoupling reasoning and perception, context compression

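Today's AI can handle a video that runs a few minutes, but give it an hours-long film or surveillance recording and it "forgets": either the sheer volume of frames blows up the compute budget, or its attention is diluted until nothing sticks. This paper's solution is counterintuitive. Instead of trying to make the model remember every frame, it has the model work like a detective: first scan through the video once and chart the key events and their causal relations as a "knowledge map" (a Hierarchical Graph Memory), then, only when a question needs answering, use tools to look up precise clues on that map. As a result, with a reasoning context only 2% the size of full-video ingestion, it improves accuracy on long-video question answering by 12.5 points and narrows the gap with human experts to just 3.7 points. More importantly, the researchers find that the stronger a model's logical reasoning, the better it performs on long videos. This suggests that getting AI to "understand" long videos may depend less on piling on compute and more on improving its ability to "think".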

📄 Original Abstract (English)

Current Vision-Language Models struggle with hours-long videos because processing full-length visual sequences induces prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer to decouple perception and reasoning, shifting long-video understanding into an agentic exploration process. As a plug-and-play framework, it incrementally streams videos to construct a Hierarchical Graph Memory, a top-down three-tier architecture for semantic abstraction, anchored by a foundational graph capturing spatiotemporal and causal relations. During inference, the reasoning model employs agentic tool-augmented retrieval, navigating hierarchies, searching nodes, and traversing logical edges via an Observation-Reason-Action loop. Experiments show MemDreamer achieves SOTA results across four mainstream benchmarks, narrowing the gap with human experts to only 3.7 points. It constrains the reasoning context window to merely 2% of full-context ingestion while delivering a 12.5 point absolute accuracy gain. Furthermore, statistical analysis uncovers a strong positive linear correlation between a VLM's performance on logic reasoning and long-video understanding benchmarks, establishing agentic capability scaling as a new paradigm for multimodal comprehension.
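
The retrieval flow in the abstract (navigating hierarchies, searching nodes, and traversing logical edges inside an Observation-Reason-Action loop) can be illustrated with a toy sketch. This is not MemDreamer's implementation: the names `Node`, `GraphMemory`, and `answer_why`, the tier labels, the `causes` relation, and the sample events are all assumptions invented for illustration, and a hand-written rule stands in for the reasoning model.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    tier: str   # hypothetical tier labels: "video", "scene", "event"
    text: str   # short textual summary written during perception
    children: list = field(default_factory=list)


class GraphMemory:
    """Toy hierarchical memory: a tree of summaries plus relation edges."""

    def __init__(self):
        self.nodes = {}   # node_id -> Node
        self.edges = []   # (source_id, relation, target_id)

    def add_node(self, node, parent=None):
        self.nodes[node.node_id] = node
        if parent is not None:
            self.nodes[parent].children.append(node.node_id)

    def add_edge(self, source, relation, target):
        self.edges.append((source, relation, target))

    # --- tools exposed to the reasoning loop (hypothetical) ---
    def expand(self, node_id):
        """Navigate the hierarchy: list the children of a node."""
        return list(self.nodes[node_id].children)

    def search(self, keyword):
        """Search nodes: ids whose summary mentions the keyword."""
        key = keyword.lower()
        return [n.node_id for n in self.nodes.values() if key in n.text.lower()]

    def causes_of(self, node_id):
        """Traverse logical edges: ids with a 'causes' edge into node_id."""
        return [s for (s, rel, t) in self.edges if rel == "causes" and t == node_id]


def answer_why(memory, keyword, max_steps=5):
    """Toy Observation-Reason-Action loop with a rule-based stand-in policy."""
    action = ("search", keyword)
    trace = []
    for _ in range(max_steps):
        tool, arg = action
        observation = getattr(memory, tool)(arg)   # Action, then Observation
        trace.append((tool, arg, observation))
        if not observation:                        # Reason: dead end, give up
            return None, trace
        if tool == "search":                       # Reason: event found, ask for its cause
            action = ("causes_of", observation[0])
        else:                                      # Reason: cause found, answer with it
            return memory.nodes[observation[0]].text, trace
    return None, trace


def build_demo_memory():
    """Invented sample data; a real system would build this while streaming video."""
    m = GraphMemory()
    m.add_node(Node("v", "video", "Three hours of home security footage"))
    m.add_node(Node("s1", "scene", "Kitchen in the evening"), parent="v")
    m.add_node(Node("s2", "scene", "Hallway at night"), parent="v")
    m.add_node(Node("e1", "event", "A pot is left on the lit stove"), parent="s1")
    m.add_node(Node("e2", "event", "The smoke alarm goes off"), parent="s2")
    m.add_edge("e1", "causes", "e2")
    return m


memory = build_demo_memory()
answer, trace = answer_why(memory, "smoke alarm")
print(answer)
for step in trace:
    print(step)
```

In this sketch the loop only ever sees the handful of node summaries returned by each tool call, never the raw frames, which is one plausible reading of how the reasoning context can stay at a small fraction of the full video.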

Original paper on arXiv
