📄 Paper Explainer

Teaching Robots to Work by Watching First-Person Human Videos

Tags: robot training, egocentric video, vision-language-action models, pretraining, human data

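Training robots takes large amounts of manipulation data, but collecting it by having robots perform the tasks themselves is slow and expensive. This paper's idea is to let robots learn actions from egocentric (first-person) human videos, such as footage from a head-mounted camera while you cook or repair something. The researchers built a pipeline that automatically converts human videos into robot-format "pseudo-action trajectories" and trains on them together with real robot data. Two techniques make this work: a unified action representation (camera-space actions, morphology conditioning, and time-aligned action chunking) lets the two kinds of data be mixed, and a reliability-aware weighting concentrates supervision on the trustworthy parts of the noisy human-video labels. In experiments, adding human video data consistently improved results, with state-of-the-art performance on the RoboCasa GR1 TableTop (tabletop manipulation) and RoboTwin 2.0 (bimanual manipulation) benchmarks. It is not something you can use tomorrow, but it points to one way of cutting robot training costs: replacing part of the hands-on robot data collection with large volumes of human video.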
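To make the reliability weighting concrete, here is a minimal sketch of one way a loss could mix robot data with noisy human pseudo-labels. It is an illustration under assumptions, not the paper's objective: the `Sample` fields, the per-sample `reliability` score in [0, 1], the mean-squared-error term, and the `human_loss_scale` factor are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Sample:
    predicted: Sequence[float]  # model's predicted action vector
    target: Sequence[float]     # robot action, or pseudo-action from human video
    is_human: bool              # True if the label came from a human video
    reliability: float = 1.0    # hypothetical confidence in the label, in [0, 1]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def reliability_weighted_loss(batch: List[Sample],
                              human_loss_scale: float = 0.5) -> float:
    """Robot loss plus a down-scaled, reliability-weighted human auxiliary loss.

    Assumed form for illustration only: unreliable human labels (weight near 0)
    contribute almost nothing, so supervision concentrates on reliable ones.
    """
    robot_terms = [mse(s.predicted, s.target) for s in batch if not s.is_human]
    robot_loss = sum(robot_terms) / len(robot_terms) if robot_terms else 0.0

    human = [s for s in batch if s.is_human]
    weight_sum = sum(s.reliability for s in human)
    if weight_sum > 0:
        human_loss = sum(
            s.reliability * mse(s.predicted, s.target) for s in human
        ) / weight_sum
    else:
        human_loss = 0.0

    return robot_loss + human_loss_scale * human_loss


batch = [
    Sample(predicted=[0.0, 0.0], target=[1.0, 1.0], is_human=False),
    Sample(predicted=[0.0, 0.0], target=[2.0, 2.0], is_human=True, reliability=1.0),
    # A badly wrong pseudo-label with zero reliability is ignored entirely.
    Sample(predicted=[0.0, 0.0], target=[10.0, 10.0], is_human=True, reliability=0.0),
]
loss = reliability_weighted_loss(batch)
```

In this toy batch the third sample carries a large error but zero weight, so it does not affect the result. That is the intuition behind weighting rather than hard filtering: partially reliable labels can still contribute in proportion to how much they are trusted.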

📄 Original Abstract (English)

Vision-Language-Action (VLA) models benefit from large-scale and diverse embodied data, yet scaling robot trajectory collection is costly and labor-intensive. Recent advances show that large-scale egocentric human videos provide complementary real-world supervision in pretraining. However, joint training on human and robot data remains challenging due to divergences in action spaces, embodiment structures, temporal dynamics, and supervision quality. We introduce ACE-EGO-0, a unified VLA pretraining framework jointly leveraging heterogeneous data sources. To extract large-scale pretraining supervision from egocentric human videos, we build a scalable egocentric video-to-action pipeline that converts raw human videos into robot-format pseudo-action trajectories. To make these labels comparable with robot demonstrations, ACE-EGO-0 uses a unified action representation based on camera-space actions, morphology conditioning, and time-aligned action chunking. To robustly leverage noisy pseudo-action supervision from egocentric human videos, we formulate a reliability-aware training objective with a human auxiliary loss that concentrates supervision on reliable signals. We instantiate ACE-EGO-0 on 4.53K hours of robot and simulation data, together with 1.48K hours of pseudo-action-labeled egocentric human data. Experiments show that incorporating large-scale human supervision under reliability-aware weighting consistently improves both unified joint pretraining and supervised fine-tuning. ACE-EGO-0 achieves state-of-the-art performance on RoboCasa GR1 TableTop and RoboTwin 2.0, while demonstrating strong transfer to real-world bimanual manipulation.
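The abstract mentions "time-aligned action chunking" without giving details. The sketch below shows one plausible reading, assuming it means resampling trajectories recorded at different rates onto a shared control rate and then cutting them into fixed-length chunks. The function names `resample` and `chunk`, the linear interpolation, and the rates used are assumptions for illustration, not the paper's procedure.

```python
from typing import List, Sequence


def resample(times: Sequence[float], values: Sequence[float],
             rate_hz: float) -> List[float]:
    """Linearly interpolate a 1-D signal onto a uniform grid at rate_hz.

    Assumes `times` is strictly increasing and has at least two entries.
    """
    if len(times) < 2 or len(times) != len(values):
        raise ValueError("need two or more samples with matching lengths")
    step = 1.0 / rate_hz
    # Number of grid points that fit inside the recorded time span.
    n = int((times[-1] - times[0]) * rate_hz + 1e-9) + 1
    out: List[float] = []
    j = 0  # index of the segment [times[j], times[j + 1]] containing t
    for i in range(n):
        t = times[0] + i * step
        while j + 1 < len(times) - 1 and times[j + 1] <= t:
            j += 1
        alpha = (t - times[j]) / (times[j + 1] - times[j])
        out.append(values[j] + alpha * (values[j + 1] - values[j]))
    return out


def chunk(seq: Sequence[float], horizon: int) -> List[List[float]]:
    """Split a sequence into non-overlapping chunks; drop a short remainder."""
    return [list(seq[i:i + horizon])
            for i in range(0, len(seq) - horizon + 1, horizon)]


# A hypothetical human-video signal sampled at 2 Hz, aligned to a 4 Hz grid.
human_times = [0.0, 0.5, 1.0]
human_values = [0.0, 1.0, 2.0]
aligned = resample(human_times, human_values, rate_hz=4.0)
chunks = chunk(aligned, horizon=2)
```

After resampling, one chunk spans the same wall-clock duration whether it came from a human video or a robot log, which is what would let the two sources share one action-prediction head.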

Original paper on arXiv
