📄 Paper Digest

Robots learn actions better by watching people work


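Robots usually learn manipulation skills from demonstrations in which a human teleoperates the robot through the task, which is expensive and yields little data. This paper takes a different route: people wear head-mounted cameras while doing their own work, the videos are converted into "pseudo-action" trajectories in a format the robot can use, and these are mixed with real robot data for training. The key ideas are to express human hand and body motion and robot arm motion in one shared form, namely actions in the camera's frame of reference, and to down-weight unreliable human-derived labels. The model reaches state-of-the-art results on the RoboCasa GR1 TableTop and RoboTwin 2.0 manipulation benchmarks and also transfers to a real bimanual robot. It is not a tool you can use tomorrow, but it points to a way of lowering robot training costs: replacing part of the teleoperation data with the large volume of human video that already exists.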
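To make the "down-weight unreliable human data" idea concrete, here is a minimal sketch of a reliability-weighted training loss. It is not the paper's objective: the `Sample` fields, the per-sample `reliability` score in [0, 1], the squared-error loss, and the `human_weight` coefficient are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    pred: List[float]         # predicted action chunk, flattened
    target: List[float]       # robot action, or pseudo-action from human video
    is_human: bool            # True if the target is a pseudo-label
    reliability: float = 1.0  # hypothetical confidence in [0, 1] for pseudo-labels


def mse(pred: List[float], target: List[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def mixed_loss(batch: List[Sample], human_weight: float = 0.5) -> float:
    """Robot loss plus a reliability-weighted auxiliary loss on human samples."""
    robot_terms = [mse(s.pred, s.target) for s in batch if not s.is_human]
    robot_loss = sum(robot_terms) / len(robot_terms) if robot_terms else 0.0

    human = [s for s in batch if s.is_human]
    total_rel = sum(s.reliability for s in human)
    if total_rel > 0:
        # Each human sample contributes in proportion to its reliability,
        # so a sample with reliability 0 is ignored entirely.
        human_loss = sum(s.reliability * mse(s.pred, s.target) for s in human) / total_rel
    else:
        human_loss = 0.0

    return robot_loss + human_weight * human_loss


if __name__ == "__main__":
    batch = [
        Sample(pred=[1.0, 1.0], target=[0.0, 0.0], is_human=False),
        Sample(pred=[2.0, 2.0], target=[0.0, 0.0], is_human=True, reliability=0.5),
        Sample(pred=[10.0, 10.0], target=[0.0, 0.0], is_human=True, reliability=0.0),
    ]
    print(mixed_loss(batch))
```

In this sketch the third sample has a large error but zero reliability, so it does not affect the result. That is the intended effect of concentrating supervision on signals that can be trusted.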

📄 Original abstract (English)

Vision-Language-Action (VLA) models benefit from large-scale and diverse embodied data, yet scaling robot trajectory collection is costly and labor-intensive. Recent advances show that large-scale egocentric human videos provide complementary real-world supervision in pretraining. However, joint training on human and robot data remains challenging due to divergences in action spaces, embodiment structures, temporal dynamics, and supervision quality. We introduce ACE-EGO-0, a unified VLA pretraining framework jointly leveraging heterogeneous data sources. To extract large-scale pretraining supervision from egocentric human videos, we build a scalable egocentric video-to-action pipeline that converts raw human videos into robot-format pseudo-action trajectories. To make these labels comparable with robot demonstrations, ACE-EGO-0 uses a unified action representation based on camera-space actions, morphology conditioning, and time-aligned action chunking. To robustly leverage noisy pseudo-action supervision from egocentric human videos, we formulate a reliability-aware training objective with a human auxiliary loss that concentrates supervision on reliable signals. We instantiate ACE-EGO-0 on 4.53K hours of robot and simulation data, together with 1.48K hours of pseudo-action-labeled egocentric human data. Experiments show that incorporating large-scale human supervision under reliability-aware weighting consistently improves both unified joint pretraining and supervised fine-tuning. ACE-EGO-0 achieves state-of-the-art performance on RoboCasa GR1 TableTop and RoboTwin 2.0, while demonstrating strong transfer to real-world bimanual manipulation.
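The abstract mentions "camera-space actions" without giving details. As a rough illustration of what expressing motion in the camera's frame can involve, the sketch below re-expresses a world-frame point (for example a hand or gripper position) in camera coordinates using the standard rigid transform p_cam = Rᵀ(p_world − t). The function name, the pose convention (columns of `cam_rotation` are the camera axes in world coordinates), and the example numbers are assumptions, not the paper's pipeline.

```python
from typing import List

Vec3 = List[float]
Mat3 = List[List[float]]


def to_camera_frame(point_world: Vec3, cam_rotation: Mat3, cam_position: Vec3) -> Vec3:
    """Express a world-frame point in the camera frame.

    cam_rotation: 3x3 rotation whose columns are the camera axes in world
                  coordinates (camera-to-world rotation).
    cam_position: camera origin in world coordinates.
    """
    # Offset of the point from the camera origin, still in world axes.
    diff = [p - c for p, c in zip(point_world, cam_position)]
    # Multiply by the transpose of the rotation: result[i] = sum_j R[j][i] * diff[j].
    return [sum(cam_rotation[j][i] * diff[j] for j in range(3)) for i in range(3)]


if __name__ == "__main__":
    # Camera at (1, 0, 0), rotated 90 degrees about z so its x axis points along world y.
    rotation = [
        [0.0, -1.0, 0.0],
        [1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0],
    ]
    position = [1.0, 0.0, 0.0]
    hand_world = [1.0, 2.0, 0.0]
    print(to_camera_frame(hand_world, rotation, position))
```

Because the same transform applies whether the tracked point is a human hand or a robot end effector, both data sources can be described in one coordinate system, which is the motivation the abstract gives for a unified action representation.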

Original paper on arXiv
