Robots learn actions better by watching humans work
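Training robots takes a huge amount of manipulation data, but having the robot collect it hands-on is expensive and slow. This paper finds that first-person human video (say, you cooking with a camera strapped on) can serve as essentially free training material. The authors built a framework called ACE-EGO-0 that first converts human videos into "pseudo-action" trajectories in a format robots can use, then trains on them together with real robot data. The key trick: some of the actions extracted from human video are unreliable, so the training objective weighs how trustworthy each one is and concentrates on the reliable ones. The result is state-of-the-art performance on two benchmarks, RoboCasa GR1 TableTop and RoboTwin 2.0, plus strong transfer to a real two-armed robot. You won't be using this tomorrow, but it points to a new way for robots to pick up skills: instead of grinding through costly data collection, let them watch humans work.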
📄 Original abstract (English)
Vision-Language-Action (VLA) models benefit from large-scale and diverse embodied data, yet scaling robot trajectory collection is costly and labor-intensive. Recent advances show that large-scale egocentric human videos provide complementary real-world supervision in pretraining. However, joint training on human and robot data remains challenging due to divergences in action spaces, embodiment structures, temporal dynamics, and supervision quality. We introduce ACE-EGO-0, a unified VLA pretraining framework jointly leveraging heterogeneous data sources. To extract large-scale pretraining supervision from egocentric human videos, we build a scalable egocentric video-to-action pipeline that converts raw human videos into robot-format pseudo-action trajectories. To make these labels comparable with robot demonstrations, ACE-EGO-0 uses a unified action representation based on camera-space actions, morphology conditioning, and time-aligned action chunking. To robustly leverage noisy pseudo-action supervision from egocentric human videos, we formulate a reliability-aware training objective with a human auxiliary loss that concentrates supervision on reliable signals. We instantiate ACE-EGO-0 on 4.53K hours of robot and simulation data, together with 1.48K hours of pseudo-action-labeled egocentric human data. Experiments show that incorporating large-scale human supervision under reliability-aware weighting consistently improves both unified joint pretraining and supervised fine-tuning. ACE-EGO-0 achieves state-of-the-art performance on RoboCasa GR1 TableTop and RoboTwin 2.0, while demonstrating strong transfer to real-world bimanual manipulation.
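To make the "reliability-aware" idea concrete, here is a minimal sketch of what down-weighting noisy pseudo-actions could look like. It is not the paper's actual objective, which the abstract does not spell out. Everything here is an assumption for illustration: the function names, the use of mean squared error, reliability scores in [0, 1], and the `human_weight` mixing factor are all hypothetical.

```python
from typing import Sequence


def reliability_weighted_loss(
    predicted: Sequence[Sequence[float]],
    target: Sequence[Sequence[float]],
    reliability: Sequence[float],
) -> float:
    """Per-sample MSE averaged with reliability weights (hypothetical form).

    Each sample is one action vector. A reliability of 0 removes the sample
    from the loss entirely; a reliability of 1 counts it in full.
    """
    if not (len(predicted) == len(target) == len(reliability)):
        raise ValueError("inputs must have the same number of samples")

    weighted_sum = 0.0
    weight_total = 0.0
    for pred, tgt, weight in zip(predicted, target, reliability):
        if len(pred) != len(tgt) or len(pred) == 0:
            raise ValueError("action vectors must be non-empty and equal length")
        if not 0.0 <= weight <= 1.0:
            raise ValueError("reliability scores must lie in [0, 1]")
        mse = sum((p - t) ** 2 for p, t in zip(pred, tgt)) / len(pred)
        weighted_sum += weight * mse
        weight_total += weight

    # If nothing is trusted, the human data contributes no loss at all.
    return weighted_sum / weight_total if weight_total > 0 else 0.0


def joint_loss(robot_loss: float, human_loss: float, human_weight: float = 0.5) -> float:
    """Robot loss plus a scaled auxiliary human loss (assumed mixing scheme)."""
    return robot_loss + human_weight * human_loss


# Two pseudo-actions from human video: the first is trusted, the second is not.
predicted = [[0.0, 0.0], [0.0, 0.0]]
target = [[1.0, 1.0], [3.0, 3.0]]

trusted_only = reliability_weighted_loss(predicted, target, [1.0, 0.0])
equal_trust = reliability_weighted_loss(predicted, target, [1.0, 1.0])
```

With the unreliable sample zeroed out, only the first sample's error counts. With equal trust, the noisy second sample dominates the average, which is the failure mode the weighting is meant to avoid.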