📄 Paper Digest

AI Video Models Finally Get Long-Term Memory

Tags: video generation · long-term memory · world models · scene consistency

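Today's AI video generation models share a serious flaw: they cannot remember what they have already generated. Ask one for a video of a character walking around a room, and within a few seconds the furnishings may look different, because the model has no "long-term memory". This paper has the model learn for itself which past frames to recall: instead of relying on hand-written retrieval rules, it uses "query tokens" to look up the key earlier frames. In difficult scenes with occlusion and moving objects, the authors report significantly better scene consistency than prior methods. It is not a tool you can use tomorrow, but it addresses a core obstacle to making video world models practical.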

📄 Original Abstract (English)

Video World Models are interactive video generation models that predict future world states based on user actions and history video frames. A critical challenge in video world models is the lack of memory, causing inconsistent generated scenes over extended durations. Previous methods explored rule-based context frame retrieval as memory, but they fail to generalize in scenarios with scene occlusions and dynamic objects. We propose MemLearner, a learning-based adaptive context query method using query tokens to bridge context and predicted tokens. By leveraging the video generation model itself for context querying, MemLearner exploits pre-trained visual priors without training additional modules from scratch, and incorporates efficient strategies for training and inference. We collect a dataset of long videos with scene occlusions and dynamic objects, paired with camera pose annotations, and propose a multi-dataset training strategy leveraging both annotated rendered and unannotated real-world videos. Extensive experiments demonstrate that MemLearner significantly outperforms prior video world models in terms of scene consistency and memory, particularly under challenging occlusion and dynamic scenarios.

Original paper on arXiv
