📄 Paper Digest

Predicting 3D Object Motion Trajectories from Language Instructions

Tags: 3D motion forecasting · language instructions · robot manipulation · video generation

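Show the AI a video and say "push the cup to the left," and it predicts how every point on the cup will move next: not as pixels in a 2D frame, but as trajectories in real-world 3D coordinates. The researchers built a million-scale dataset, annotated from 1.16M videos, so the model learns to follow motion instructions such as "push," "pull," and "rotate," along with a benchmark spanning 61 motion types. The predicted trajectories can be used directly in robot manipulation and video generation. It is not something you can use tomorrow, but it marks a key step for AI from "understanding the image" to "understanding physical motion."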

📄 Original Abstract

Motion forecasting is central to visual intelligence: agents must anticipate how objects will move in order to plan actions, reason about physical interactions, and synthesize realistic futures. We argue that 3D points in world coordinates provide a general representation that is class-agnostic, view-stable, compact, and directly useful for downstream tasks. We formalize the task of goal-conditioned 3D point motion forecasting: given a short visual history, a set of 3D query points on an object of interest, and a language description of the intended goal, the model predicts the future 3D trajectory of each point. We introduce a full stack to study this task at scale: (1) MolmoMotion-1M is a large corpus of action-described, object-grounded 3D point trajectories annotated from 1.16M unconstrained videos; (2) PointMotionBench is a human-verified benchmark spanning 111 object categories and 61 motion types; and (3) MolmoMotion is a general motion forecasting model that supports both autoregressive coordinate prediction and flow-matching-based trajectory generation. MolmoMotion accurately predicts diverse motion patterns with different language instructions, and significantly outperforms existing motion prediction baselines on PointMotionBench. Finally, we show that the learned 3D motion prior transfers well to downstream applications: it improves training efficiency and generalization for robot manipulation, and its predicted trajectories provide effective motion guidance for generative models to synthesize videos with more realistic object motion.
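To make the task definition in the abstract concrete, here is a minimal sketch of its inputs and outputs. Everything in it is an illustrative assumption rather than the paper's code: the `ForecastQuery` container, the `constant_offset_forecast` toy predictor (which ignores both the video and the language goal), and the average displacement error, a common trajectory metric that is not necessarily the one PointMotionBench uses.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Trajectory = List[Point3D]  # future positions of one point, one entry per step


@dataclass
class ForecastQuery:
    """Hypothetical container for one query; field names are assumptions."""
    frames: List[str]            # short visual history (placeholder frame ids)
    query_points: List[Point3D]  # 3D query points on the object, world coordinates
    goal: str                    # language description of the intended goal


def constant_offset_forecast(
    query: ForecastQuery, offset_per_step: Point3D, horizon: int
) -> List[Trajectory]:
    """Toy stand-in for a model: shifts every query point by a fixed offset
    at each future step. It ignores query.frames and query.goal entirely."""
    dx, dy, dz = offset_per_step
    trajectories = []
    for x, y, z in query.query_points:
        trajectories.append(
            [(x + dx * t, y + dy * t, z + dz * t) for t in range(1, horizon + 1)]
        )
    return trajectories


def average_displacement_error(
    pred: List[Trajectory], target: List[Trajectory]
) -> float:
    """Mean Euclidean distance between predicted and target positions,
    averaged over all points and all future steps."""
    distances = [
        math.dist(p, q)
        for pred_traj, target_traj in zip(pred, target)
        for p, q in zip(pred_traj, target_traj)
    ]
    return sum(distances) / len(distances)


query = ForecastQuery(
    frames=["frame_0", "frame_1"],
    query_points=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    goal="push the cup to the left",
)
# One trajectory per query point, each with `horizon` future 3D positions.
predicted = constant_offset_forecast(query, (-0.1, 0.0, 0.0), horizon=3)
```

The point of the sketch is the shape of the problem: one language goal, a set of 3D query points, and one future trajectory per point. A learned model would replace `constant_offset_forecast` and condition on the frames and the goal text.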

Original paper on arXiv
