📄 Paper Digest

Video World Models: Hiding 3D Memory in Latent Space

Trending ▲ 31 | video world models | latent space | 3D consistency | diffusion models | memory optimization

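To generate coherent 3D scenes, video world models traditionally convert frames into a point cloud and store it. Every new viewpoint then requires rendering the point cloud to a pixel image and encoding it back into latent space, which is both slow and lossy. This paper instead stores the 3D scene information directly in the diffusion model's latent space: it uses depth to project latent tokens to 3D positions, and when a new view is needed it warps and composites them directly in latent space, skipping the round trip through pixel space. The reported results: end-to-end video generation is up to about 10 times faster (10.57×), the memory footprint drops to 1/55 of that of explicit 3D baselines, and because the rich latent features are preserved, generation quality actually improves. It is not something you can use tomorrow, but it points to an efficient path for video generation models to move from “2D collage” to “3D understanding”.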
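To make the lift-then-warp idea concrete, here is a minimal sketch. It is not the paper's implementation: it assumes a pinhole camera with intrinsics `fx, fy, cx, cy`, one depth value per latent token, and a new camera that differs from the old one only by a translation. The function names `back_project` and `warp_to_view` are hypothetical, and plain Python lists stand in for the tensors a real system would use.

```python
from typing import List, Optional, Tuple

Feature = List[float]
Point = Tuple[float, float, float]


def back_project(latents, depth, fx, fy, cx, cy):
    """Lift each latent token at grid cell (u, v) to a 3D point (assumed pinhole model)."""
    memory: List[Tuple[Point, Feature]] = []
    for v, row in enumerate(latents):
        for u, feat in enumerate(row):
            z = depth[v][u]
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            memory.append(((x, y, z), feat))
    return memory


def warp_to_view(memory, translation, fx, fy, cx, cy, width, height):
    """Project stored tokens into a translated camera, keeping the nearest token per cell."""
    tx, ty, tz = translation
    out: List[List[Optional[Feature]]] = [[None] * width for _ in range(height)]
    zbuf = [[float("inf")] * width for _ in range(height)]
    for (x, y, z), feat in memory:
        # Express the point in the new camera frame (same orientation, shifted origin).
        xc, yc, zc = x - tx, y - ty, z - tz
        if zc <= 0:
            continue  # behind the new camera
        u = round(fx * xc / zc + cx)
        v = round(fy * yc / zc + cy)
        if 0 <= u < width and 0 <= v < height and zc < zbuf[v][u]:
            zbuf[v][u] = zc  # simple z-buffer: closer tokens win
            out[v][u] = feat
    return out


# Toy 2x2 latent grid with one channel per token and constant depth.
latents = [[[1.0], [2.0]], [[3.0], [4.0]]]
depth = [[2.0, 2.0], [2.0, 2.0]]
fx = fy = 2.0
cx = cy = 0.0

memory = back_project(latents, depth, fx, fy, cx, cy)
same_view = warp_to_view(memory, (0.0, 0.0, 0.0), fx, fy, cx, cy, 2, 2)
shifted_view = warp_to_view(memory, (1.0, 0.0, 0.0), fx, fy, cx, cy, 2, 2)
```

In this toy setup, warping back to the original camera reproduces the latent grid, while moving the camera one unit to the right shifts the tokens one cell to the left and leaves empty cells (`None`) where nothing was observed. In a generative pipeline those holes would be left for the diffusion model to fill. The point of the sketch is that no pixel rendering or VAE encoding happens anywhere in the loop.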

📄 Original Abstract (English)

Video world models that maintain 3D spatial consistency across generated frames typically rely on explicit point cloud memory constructed in RGB space. This design is both computationally expensive, requiring repeated rendering and VAE encoding, and inherently lossy, as the round trip through pixel space discards rich features of the learned latent representation. In this paper, we introduce latent spatial memory for video world models, a persistent 3D cache that stores scene information directly in the diffusion latent space, avoiding pixel-space reconstruction. Building on this, we propose Mirage, a latent-space spatial memory framework that constructs the memory by lifting latent tokens into 3D via depth-guided back-projection and queries it by synthesizing novel views through direct latent-space warping. This unified formulation eliminates both the information loss of pixel-space reconstruction and the computational burden of repeated encoding and rendering. Experiments show that latent spatial memory achieves up to 10.57× faster end-to-end video generation and 55× reduction in memory footprint relative to explicit 3D baselines. Leveraging the geometric prior of the diffusion model, Mirage attains state-of-the-art performance on WorldScore and strong reconstruction quality on RealEstate10K.

Original paper on arXiv
