📄 Paper digest

What robots lack isn't data, it's a translator

Trending ▲ 23 · Robotics · Data utilization · Interfaces · VLA · World models

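The bottleneck in robot intelligence is not a shortage of data but that the data can't be used. Video of people walking, carrying things, and opening doors is everywhere, yet robots can't learn from it directly: the footage carries no robot-specific action labels such as joint angles, no task semantics, and no reward signal, so nothing says which motion serves which goal. This position paper names four missing interfaces: automatically labelling unstructured video, retargeting human motion onto a robot's body, physics-grounded 3D reasoning through world models, and inferring task progress and success from video and language. It's not something you can use tomorrow, but it explains why robots that have seen so much video are still clumsy: what's missing is the translation layer from data to action.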

📄 Original abstract

Generalist robot intelligence is often framed as a policy-scaling problem: collect more robot demonstrations, train larger Vision-Language-Action (VLA) models, and expect broader generalisation. In this position paper, we argue that this framing is incomplete. The central bottleneck is not only policy learning, but the absence of mechanisms that convert the world's abundant unstructured behavioural data into grounded robot supervision. Human motion, internet video, simulation rollouts, and interactive demonstrations contain rich information about tasks, goals, contacts, failures, and physical constraints, yet most of this information is not directly usable by robot policies because it lacks embodiment-specific action labels, task semantics, and reward structure. We identify four missing components for the next generation of robotics: data interfaces for autolabelling unstructured behaviour, embodiment interfaces for retargeting human motion to robot actions, world-model interfaces for physics-grounded 3D reasoning, and reward interfaces for inferring task progress and success from video and language. We survey recent progress in robot foundation models, cross-embodiment datasets, learning from video, world models, and reward modelling, and propose a research agenda for building robotics systems that can learn not only from robot demonstrations, but from the broader physical world.

Original paper on arXiv
