📄 Paper Digest

Curing AI Video of “Shaky Hands”: Locking Down Motion with Point Tracking

Tags: 4D video generation, novel-view synthesis, point tracking, motion consistency, geometric consistency

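Today's AI can take an ordinary video and generate a “novel-view” video from it, for example turning street footage shot on a phone into a drone-style orbiting shot. The catch is that the model often gets object motion wrong, or places objects inconsistently across viewpoints, so the result looks like a continuity error. This paper finds that certain attention layers inside the model already encode which point corresponds to which; the signal is just not being put to good use. The authors therefore add an auxiliary “multi-view point tracking” task, so that during training the model additionally learns to track each point's correspondences across time and across views. As a result, object motion in the generated videos is more coherent, and object positions line up better across viewpoints. It is not a tool you can use directly, but it points to a key direction for improving consistency in video generation: rather than brute-forcing explicit 3D reconstruction, let the model learn to “keep its eyes on the points.”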

📄 Original Abstract

Synthesizing a novel-view video from a monocular reference video along a target camera trajectory requires both geometric consistency and motion fidelity with respect to the reference video. Existing methods based on explicit 3D representations are limited by the accuracy of off-the-shelf reconstruction modules, which often produce inaccurate geometry for dynamic objects in monocular videos. In contrast, camera-conditioning-only methods can achieve high visual quality but often struggle to preserve geometric and motion consistency. In this work, we introduce MVTrack4Gen (Multi-View point Tracking for Novel-View Generation), a motion-aware training framework that leverages multi-view point tracking as an additional geometric and motion supervision signal for camera-conditioning-only novel-view video diffusion models. Our key finding is that specific attention layers encode strong correspondence cues, where query features attend to key features at geometrically corresponding locations across views and over time, and the misalignment of these correspondences causes motion inconsistency. Based on this observation, we route these features into an auxiliary multi-view tracking head and jointly train the diffusion model with a point-tracking objective. By explicitly strengthening these motion-aware correspondences, MVTrack4Gen improves existing models to better follow the motion in the reference view and maintain cross-view geometric consistency. Across diverse benchmarks, our method achieves state-of-the-art geometric consistency and competitive camera accuracy.
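The abstract's core idea, reading point correspondences out of attention and supervising them with a tracking objective alongside the diffusion loss, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names (`attention_track`, `tracking_loss`, `joint_loss`), the soft-argmax readout, the L1 loss, and the `track_weight` value are hypothetical and are not taken from the paper, which routes features into a learned multi-view tracking head rather than using this simple readout.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_track(query, keys, key_positions):
    """Soft-argmax readout (illustrative stand-in for a tracking head).

    The query feature of one point attends to key features from another
    view or frame; the predicted corresponding location is the
    attention-weighted average of the key positions.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    x = sum(w * p[0] for w, p in zip(weights, key_positions))
    y = sum(w * p[1] for w, p in zip(weights, key_positions))
    return (x, y)

def tracking_loss(pred_tracks, gt_tracks):
    """Mean L1 distance between predicted and reference point locations."""
    total = 0.0
    for (px, py), (gx, gy) in zip(pred_tracks, gt_tracks):
        total += abs(px - gx) + abs(py - gy)
    return total / len(pred_tracks)

def joint_loss(diffusion_loss, track_loss, track_weight=0.1):
    """Joint objective: diffusion loss plus a weighted tracking term.

    track_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    return diffusion_loss + track_weight * track_loss

# Toy example: two key locations; the query matches the second key's feature.
keys = [[10.0, 0.0], [0.0, 10.0]]
key_positions = [(0.0, 0.0), (4.0, 2.0)]
pred = attention_track([0.0, 10.0], keys, key_positions)
loss = joint_loss(diffusion_loss=1.0,
                  track_loss=tracking_loss([pred], [(4.0, 2.0)]))
```

In this toy setup, a query whose attention lands on the wrong key would yield a predicted location far from the reference track, raising the tracking term; that is the kind of misaligned correspondence the abstract links to motion inconsistency.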

Original paper on arXiv
