📄 Paper Digest

No More "Clipping" in Robot Videos: Training for Physical Consistency

Trend Channel ▲ 17 | Robot manipulation · Video generation · Physical consistency · World model · DiT

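AI-generated robot manipulation videos today often contain physical errors, such as objects jumping abruptly between frames or a hand passing straight through an object. The researchers trace these errors to two causes: moving objects deform, and the spatio-temporal relations between interacting entities break down, especially during contact. They propose PhysisForcing, a training framework that adds two supervision signals: a pixel-level trajectory alignment loss (keeping object motion trajectories continuous) and a semantic-level relational alignment loss (keeping the relations between interacting regions, such as the hand and the object, plausible). On R-Bench, PAI-Bench, and EZS-Bench, generation quality improves consistently over strong baselines. More importantly, when the model is used as a world model for robot planning, the closed-loop success rate rises from 16.0% to 24.0%. This is not a tool you can use tomorrow, but it points to a trend: AI simulation of the physical world is moving from "looks right" to "physically right", which matters a great deal for future autonomous robot manipulation.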
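To put the closed-loop figure in perspective, the short sketch below separates the absolute gain from the relative one. The function name `success_rate_gain` is hypothetical and used only for illustration; the two input numbers are the ones reported in the abstract.

```python
def success_rate_gain(before_pct, after_pct):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after_pct - before_pct
    relative = absolute / before_pct * 100.0
    return absolute, relative


# Closed-loop success rates reported in the abstract (WorldArena protocol).
abs_gain, rel_gain = success_rate_gain(16.0, 24.0)
print(f"+{abs_gain:.1f} percentage points, {rel_gain:.0f}% relative")
# prints "+8.0 percentage points, 50% relative"
```

In other words, the gain is large in relative terms, yet roughly three out of four closed-loop attempts still do not succeed.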

📄 Original Abstract (English)

Video generation models have emerged as a promising paradigm for embodied world simulation. However, both general-domain video generators and robot-specific data fine-tuned models can still produce physically implausible manipulations, including discontinuous motion trajectories and inconsistent robot-object interactions, which limits their reliability as world simulators. Through extensive experiments, we find that such physical instability mainly arises from two factors: deformation of moving objects and implausible spatio-temporal correlations among interacting entities, particularly during contact. Building on this observation, we propose PhysisForcing, a scalable training framework that strengthens physical consistency by focusing supervision on physics-informative regions through joint optimization of pixel-level and semantic-level features. The framework consists of a pixel-level trajectory alignment loss, which supervises DiT features using reference point trajectories, and a semantic-level relational alignment loss, which aligns DiT features with inter-region relations extracted from a frozen video understanding encoder. Extensive experiments on R-Bench, PAI-Bench, and EZS-Bench show that PhysisForcing consistently improves embodied video generation over strong baselines, improving the Wan2.2-I2V-A14B and Cosmos3-Nano base models on R-Bench by 22.3% and 9.2% (7.1% and 3.7% over vanilla finetuning), with the Cosmos3-Nano variant attaining the best overall score. Beyond generation, as a world model under the WorldArena action-planner protocol it raises the closed-loop success rate from 16.0% to 24.0% and further improves downstream policy success, indicating that physically aligned video models yield stronger representations for robotic manipulation.
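The abstract names two auxiliary losses but does not give their formulas. The sketch below is one plausible reading, not the authors' implementation. It assumes the trajectory loss penalizes feature drift along each reference point track, and the relational loss matches pairwise cosine similarities between region features from the DiT and from the frozen encoder. All function names, the feature layout, and the weights `w_traj` and `w_rel` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def trajectory_alignment_loss(features, tracks):
    """Assumed pixel-level loss.

    features[t][y][x] is a feature vector at frame t, row y, column x.
    tracks is a list of point tracks, each a list of (x, y) per frame.
    The feature sampled at each later track position is pulled toward
    the feature at the track's first position.
    """
    total, count = 0.0, 0
    for track in tracks:
        x0, y0 = track[0]
        anchor = features[0][y0][x0]
        for t in range(1, len(track)):
            x, y = track[t]
            total += 1.0 - cosine(anchor, features[t][y][x])
            count += 1
    return total / count if count else 0.0


def relation_matrix(region_feats):
    """Pairwise cosine similarities between region feature vectors."""
    return [[cosine(a, b) for b in region_feats] for a in region_feats]


def relational_alignment_loss(dit_regions, encoder_regions):
    """Assumed semantic-level loss: mean squared difference between the
    DiT relation matrix and the frozen encoder's relation matrix."""
    r_dit = relation_matrix(dit_regions)
    r_enc = relation_matrix(encoder_regions)
    n = len(r_dit)
    sq = sum((r_dit[i][j] - r_enc[i][j]) ** 2 for i in range(n) for j in range(n))
    return sq / (n * n)


def total_loss(base_loss, l_traj, l_rel, w_traj=0.1, w_rel=0.1):
    """Base generation loss plus the two weighted auxiliary terms (weights are placeholders)."""
    return base_loss + w_traj * l_traj + w_rel * l_rel


# Toy example: 2 frames, a 1x2 feature grid, one point moving from x=0 to x=1.
feats = [[[[1.0, 0.0], [0.0, 1.0]]], [[[0.0, 1.0], [1.0, 0.0]]]]
l_traj = trajectory_alignment_loss(feats, [[(0, 0), (1, 0)]])
l_rel = relational_alignment_loss([[1.0, 0.0], [1.0, 0.0]], [[2.0, 0.0], [0.0, 3.0]])
print(l_traj, l_rel, total_loss(1.0, l_traj, l_rel))
```

In the toy example the feature follows the moving point, so the trajectory term is zero, while the two DiT region features are identical even though the encoder treats the regions as unrelated, so the relational term is positive. A real training setup would compute both terms on batched tensors inside an autodiff framework.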

Original paper on arXiv
