Taking the "ghosting" out of AI video generation: using tracked points to teach the model to understand motion
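Today's AI models can generate a "novel-view" video from an ordinary single-camera clip: film a street dance from the front, for example, and the model can render how it would look from the side. The catch is that these models often get object motion wrong, which shows up as warped frames or "ghosting". This paper finds that specific attention layers inside the model already encode where each point should go, both across views and over time, and that motion becomes inconsistent when those correspondences are misaligned. The authors therefore add point-tracking supervision: features from those layers are routed into an auxiliary multi-view tracking head, which predicts where tracked points (say, a dancer's shoulders and knees) land in other frames and in the other view. During training, a point-tracking objective pulls these predictions toward the true motion while the diffusion model is trained jointly. According to the paper, this yields state-of-the-art geometric consistency and competitive camera accuracy across its benchmarks. It is not a tool you can use directly, but it points to "letting the model learn to track motion itself" as a key route to fixing jitter in generated video.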
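To make the idea of joint training concrete, here is a minimal sketch of how a point-tracking term could be added to a generation loss. The paper's abstract does not give the exact loss form or weighting, so everything here is an assumption for illustration: the function names, the `[view][frame][point]` layout, the Euclidean distance, the visibility mask, and the `track_weight` value.

```python
import math

def tracking_loss(pred_tracks, gt_tracks, visible):
    """Mean Euclidean distance between predicted and ground-truth points.

    Hypothetical layout: each argument is indexed [view][frame][point].
    Track entries are (x, y) tuples; `visible` entries are booleans.
    Points marked not visible are skipped.
    """
    total, count = 0.0, 0
    for v_pred, v_gt, v_vis in zip(pred_tracks, gt_tracks, visible):
        for f_pred, f_gt, f_vis in zip(v_pred, v_gt, v_vis):
            for (px, py), (gx, gy), vis in zip(f_pred, f_gt, f_vis):
                if vis:
                    total += math.hypot(px - gx, py - gy)
                    count += 1
    return total / count if count else 0.0

def joint_loss(diffusion_loss, pred_tracks, gt_tracks, visible, track_weight=0.1):
    """Generation loss plus a weighted tracking term (the weight is an assumed value)."""
    return diffusion_loss + track_weight * tracking_loss(pred_tracks, gt_tracks, visible)

# One view, one frame, two tracked points (e.g. a shoulder and a knee).
pred = [[[(0.0, 0.0), (3.0, 4.0)]]]
gt = [[[(0.0, 0.0), (0.0, 0.0)]]]
vis = [[[True, True]]]
print(joint_loss(1.0, pred, gt, vis))
```

In a real training loop both terms would be differentiable tensors, and the tracking term would be backpropagated through the tracking head into the shared features; this plain-Python version only shows the arithmetic.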
📄 Original abstract
Synthesizing a novel-view video from a monocular reference video along a target camera trajectory requires both geometric consistency and motion fidelity with respect to the reference video. Existing methods based on explicit 3D representations are limited by the accuracy of off-the-shelf reconstruction modules, which often produce inaccurate geometry for dynamic objects in monocular videos. In contrast, camera-conditioning-only methods can achieve high visual quality but often struggle to preserve geometric and motion consistency. In this work, we introduce MVTrack4Gen (Multi-View point Tracking for Novel-View Generation), a motion-aware training framework that leverages multi-view point tracking as an additional geometric and motion supervision signal for camera-conditioning-only novel-view video diffusion models. Our key finding is that specific attention layers encode strong correspondence cues, where query features attend to key features at geometrically corresponding locations across views and over time, and the misalignment of these correspondences causes motion inconsistency. Based on this observation, we route these features into an auxiliary multi-view tracking head and jointly train the diffusion model with a point-tracking objective. By explicitly strengthening these motion-aware correspondences, MVTrack4Gen improves existing models to better follow the motion in the reference view and maintain cross-view geometric consistency. Across diverse benchmarks, our method achieves state-of-the-art geometric consistency and competitive camera accuracy.
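The abstract's key observation, that a query feature attends to key features at geometrically corresponding locations, can be illustrated with a toy example. The sketch below turns one row of scaled dot-product attention into a predicted point location with a soft-argmax. This is a generic technique chosen for illustration, not the tracking head described in the paper; the function names, the `temperature` parameter, and the 2x2 grid are all assumptions.

```python
import math

def attention_weights(query, keys, temperature=1.0):
    """Softmax over scaled dot products between one query and each key vector."""
    scale = math.sqrt(len(query)) * temperature
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]

def soft_argmax(weights, positions):
    """Attention-weighted average of key positions: a soft estimate of the match."""
    x = sum(w * p[0] for w, p in zip(weights, positions))
    y = sum(w * p[1] for w, p in zip(weights, positions))
    return x, y

# Toy 2x2 grid of key locations in another view; only the key at (1, 0)
# has a feature that matches the query.
positions = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
keys = [[0.0, 0.0], [10.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
query = [10.0, 0.0]

weights = attention_weights(query, keys)
print(soft_argmax(weights, positions))
```

When the attention is sharply peaked on the right key, the estimate lands on the true correspondence. When it is spread out or peaked in the wrong place, the estimate drifts, which is the kind of misalignment the abstract links to motion inconsistency.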