Teaching robots human dexterity: the key is "translation"
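Robots learning fine manipulation from people have usually tried to copy the human hand's 6D pose directly, but human fingers and a parallel gripper make contact in fundamentally different ways, and the hand-pose estimates are noisy. This paper takes a different approach: instead of learning how the hand rotates, it learns only the relative translation of the wrist in the initial head-camera frame, a motion that both humans and robots can perform. The authors use a π_0-like vision-language-action model, split the different action components into optional tokens, and use attention masking to handle whichever components are missing. On bi-manual robot tasks, this "bridging action" works far better than learning human hand poses directly, and it keeps improving as more human data is added. It is not something you can put to use tomorrow, but it points to a more reliable direction for transferring skills to robots.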
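To make the "bridging action" concrete, here is a minimal sketch of how a relative wrist translation in the initial head-camera frame might be computed. It is an illustration under stated assumptions, not the paper's code: the function name `bridging_action`, the inputs (wrist positions in a world frame plus the camera's orientation at the first step), and the choice to measure displacement from the first wrist position are all hypothetical.

```python
import numpy as np

def bridging_action(wrist_world, R_world_from_cam0):
    """Wrist displacement expressed in the initial head-camera axes.

    wrist_world:        (T, 3) wrist positions in a world frame (assumed input).
    R_world_from_cam0:  (3, 3) rotation taking camera-frame vectors at t=0
                        into the world frame (assumed input).
    Returns (T, 3) translations relative to the first wrist position.
    """
    wrist_world = np.asarray(wrist_world, dtype=float)
    R = np.asarray(R_world_from_cam0, dtype=float)
    # Displacement from the first step. The camera's position cancels out
    # here, so only its orientation is needed.
    disp_world = wrist_world - wrist_world[0]
    # Rotate into the initial camera frame: R^T @ d, written for row vectors.
    return disp_world @ R

# Example: camera yawed 90 degrees, so its x-axis points along world y.
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
wrist = [[0.0, 0.0, 0.0], [0.0, 0.1, 0.0]]
actions = bridging_action(wrist, R)
```

Because no hand orientation or finger pose enters the computation, the same quantity can be read off a human wrist track or a robot end-effector track, which is the sense in which the action space is shared.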
📄 Original abstract
We study whether we can learn novel manipulation skills from human actions to a bi-manual robot with parallel grippers. Human action data is cheap, abundant, and diverse, making it one of the most promising resources for scaling up robot learning. Yet transferring skills from humans to robots remains hard: most prior work treats humans as just another bi-manual 6DoF embodiment, where hand-pose estimates are noisy and the contact patterns of human fingers differ fundamentally from those of a parallel gripper. We argue that learning rotation-inclusive action signals from human data is therefore sub-optimal, and instead propose a bridging action representation: the relative wrist translation within the initial head-camera frame, an action space shared by humans and robots. To handle the potential absence of certain action components in different embodiments, we build a π_0-like vision-language-action model with interleaved action tokens and attention masking. On a suite of novel bi-manual manipulation tasks, our bridging action transfers human manipulation knowledge to robots far more effectively than noisy 6DoF human actions and scales with the amount of human data.
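The abstract's other ingredient, interleaved action tokens with attention masking for components an embodiment lacks, could look roughly like the sketch below. The abstract does not describe the token layout or mask structure, so everything here is assumed for illustration: the component names in `COMPONENTS`, the per-step interleaving, and the simple rule that absent tokens are never attended to.

```python
import numpy as np

# Hypothetical per-step token layout; the paper's actual components may differ.
COMPONENTS = ("translation", "rotation", "gripper")

def token_presence(num_steps, available):
    """Boolean vector marking which interleaved action tokens exist."""
    per_step = np.array([c in available for c in COMPONENTS])
    return np.tile(per_step, num_steps)

def attention_mask(present):
    """mask[i, j] is True when query i may attend to key j (key is present)."""
    n = present.shape[0]
    return np.broadcast_to(present[None, :], (n, n)).copy()

def masked_softmax(scores, mask):
    """Softmax over keys with masked-out keys given zero weight.

    Assumes every row has at least one allowed key; here the translation
    token is always present, so that holds.
    """
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

# Human clips carry only wrist translation; robot clips carry everything.
human = token_presence(2, {"translation"})
robot = token_presence(2, set(COMPONENTS))
weights = masked_softmax(np.zeros((6, 6)), attention_mask(human))
```

With this layout one model can consume both data sources: human samples contribute only through their translation tokens, while robot samples use the full token set. A training setup along these lines would presumably also exclude the absent tokens from the loss.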