AI Pulse
📄 Paper Digest

Giving Robots a "World Simulator": They Can Finally Guess What the Scene Looks Like a Second From Now

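Today's robot manipulation models (Vision-Language-Action models, or VLAs) can understand instructions and recognize objects, but their "world knowledge" comes from static image-text pairs, so they have no idea how an object moves once it is touched: a pushed cup tips over, a pinched towel deforms. This paper, World Pilot, augments the policy with priors from a "World-Action Model" (WAM), in effect plugging two antennas into the decision chain. One lets the model "see" in advance how the scene will evolve (for example, how object positions change once the hand reaches in); the other directly supplies an anticipated trajectory as a motion reference. The result is a total success rate of 84.7% on the LIBERO-Plus zero-shot out-of-distribution benchmark, with the largest gains when viewpoint, object geometry, or deformable state shifts. It is not something you can use tomorrow, but the direction is clear: teach robots to "guess the physics" rather than memorize actions by rote.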

📄 Original Abstract

Vision-Language-Action (VLA) models inherit semantic grounding from large-scale pretraining and perform competently across in-distribution manipulation tasks. This grounding, however, is built on static image-text pairs, whereas manipulation is a continuous, contact-rich process whose dynamics this pretraining cannot capture. We present World Pilot, a VLA framework that augments the policy with priors from a World-Action Model (WAM), routed into the decision chain through two complementary pathways. Latent Steering conditions the perception layer on a scene-evolution latent, and Action Steering supplies an anticipated trajectory as a motion prior to the action generator. Together the two priors equip the VLA with an anticipated view of the scene and a trajectory-level motion hint alongside its semantic conditioning, and the scene-evolution prior remains effective even when supplied by a video-pretrained world model that has not been action-post-trained. World Pilot attains a state-of-the-art Total success rate of 84.7% on the LIBERO-Plus zero-shot OOD benchmark and the highest success rate on every real-robot setting across four manipulation tasks, with the largest margins under shifts in viewpoint, geometry, deformable state, and pose. Project Website: https://world-pilot.github.io/
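To make the two pathways concrete, here is a toy NumPy sketch of how a scene-evolution latent and an anticipated trajectory could be routed into a policy. It is not the paper's implementation: every function name, shape, and weight below is hypothetical, the additive conditioning and the residual-on-trajectory form are assumptions chosen for simplicity, and the "world model" is a placeholder rather than a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT_DIM, HORIZON, ACTION_DIM = 16, 4, 7  # illustrative sizes, not from the paper


def world_action_model(obs_feat):
    """Hypothetical stand-in for a WAM: returns two priors.

    scene_latent: a guess at how the scene will evolve (placeholder transform).
    anticipated_traj: a coarse (HORIZON, ACTION_DIM) motion hint (placeholder ramp).
    """
    scene_latent = np.tanh(obs_feat)
    anticipated_traj = np.linspace(0.0, 1.0, HORIZON)[:, None] * np.ones(ACTION_DIM)
    return scene_latent, anticipated_traj


def latent_steering(obs_feat, scene_latent, w_cond):
    """Condition perception features on the scene-evolution latent.

    Additive conditioning is an assumption made for this sketch.
    """
    return obs_feat + scene_latent @ w_cond


def action_steering(steered_feat, anticipated_traj, w_head):
    """Use the anticipated trajectory as a motion prior.

    Here the action generator predicts a bounded residual around the prior;
    this residual form is an assumption made for this sketch.
    """
    residual = np.tanh(steered_feat @ w_head).reshape(anticipated_traj.shape)
    return anticipated_traj + residual


# Random weights stand in for learned parameters.
w_cond = rng.normal(scale=0.1, size=(FEAT_DIM, FEAT_DIM))
w_head = rng.normal(scale=0.1, size=(FEAT_DIM, HORIZON * ACTION_DIM))

obs_feat = rng.normal(size=FEAT_DIM)  # pretend output of the VLA perception layer
scene_latent, anticipated_traj = world_action_model(obs_feat)
steered_feat = latent_steering(obs_feat, scene_latent, w_cond)
actions = action_steering(steered_feat, anticipated_traj, w_head)

print(actions.shape)
```

In this sketch the two priors enter at different points: the latent changes what the policy perceives, while the trajectory changes where action generation starts from. That separation mirrors the abstract's description, but the specific arithmetic is only one plausible choice.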

