A GPT Moment for Robotics? Aligning Heterogeneous Data to Achieve Generalization
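Robot manipulation has long been stuck on one knot: the data is too messy. Different robots, different tasks, and different action formats make it hard to pile up data the way large language models do. This paper cuts straight through that knot. The authors build a unified alignment framework that brings manipulation data into line along three dimensions (representation, motion, and behavior), so that training on many sources at once is coherent rather than conflicting. They then assemble a pretraining corpus of roughly 38,100 hours from open-source datasets and egocentric human hand videos, with no proprietary data collection; a synthesis pipeline converts the hand demonstrations into robot trajectories across 15 platforms. The result? The trained model, Qwen-RobotManip, built on Qwen-VL, follows instructions zero-shot, stays robust under perturbations, transfers across robot embodiments, and recovers from its own errors, say by re-grasping an object it has dropped. Across all of the out-of-distribution (OOD) settings the authors evaluate, it substantially outperforms the previous state of the art, including π0.5, ranks first in RoboChallenge with a 20% relative improvement, and is validated on real robots (AgileX ALOHA, Franka, UR, and ARX). This is not a product you can use tomorrow, but it is evidence that robot manipulation can follow the foundation-model recipe: align heterogeneous data under a unified formulation, train at scale, and generalization capabilities begin to emerge.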
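To make the "different action formats" problem concrete, here is a minimal Python sketch of what bringing two differently formatted datasets into one record type might look like. It is purely illustrative and is not the paper's framework: the `UnifiedStep` record, both converter functions, and the two dataset conventions (absolute positions in metres versus per-step deltas in centimetres) are assumptions invented for this example.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class UnifiedStep:
    """Hypothetical shared record; field names are illustrative only."""
    instruction: str
    embodiment: str
    delta_position: List[float]  # end-effector displacement, metres
    gripper_open: float          # 1.0 = fully open, 0.0 = fully closed


def from_absolute_poses(
    instruction: str,
    embodiment: str,
    positions_m: Sequence[Sequence[float]],
    gripper_open: Sequence[float],
) -> List[UnifiedStep]:
    """Assumed dataset style A: absolute end-effector positions in metres.

    Consecutive positions are differenced to obtain per-step deltas, so a
    trajectory of N poses yields N - 1 unified steps.
    """
    steps = []
    for prev, cur, grip in zip(positions_m, positions_m[1:], gripper_open[1:]):
        delta = [c - p for p, c in zip(prev, cur)]
        steps.append(UnifiedStep(instruction, embodiment, delta, float(grip)))
    return steps


def from_delta_centimetres(
    instruction: str,
    embodiment: str,
    deltas_cm: Sequence[Sequence[float]],
    gripper_closed: Sequence[float],
) -> List[UnifiedStep]:
    """Assumed dataset style B: per-step deltas in centimetres, with the
    gripper stored as a 'closed' fraction (the inverse convention)."""
    return [
        UnifiedStep(
            instruction,
            embodiment,
            [d / 100.0 for d in delta],  # centimetres -> metres
            1.0 - float(closed),         # closed fraction -> open fraction
        )
        for delta, closed in zip(deltas_cm, gripper_closed)
    ]


if __name__ == "__main__":
    a = from_absolute_poses(
        "pick up the cup", "franka",
        [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.1, 0.2, 0.0]],
        [1.0, 1.0, 0.0],
    )
    b = from_delta_centimetres("pick up the cup", "ur", [[10.0, 0.0, 0.0]], [0.25])
    for step in a + b:
        print(step)
```

Once both sources emit the same record type, they can be mixed in a single training stream without the model having to guess which units or gripper convention a given sample uses. A real system would also have to reconcile camera views, control frequencies, and rotation representations, which this sketch leaves out.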
📄 Original Abstract (English)
Foundation models in language and multimodality achieve strong generalization by aligning heterogeneous data under a unified formulation and training at scale. In this report, we investigate whether this scaling recipe can be applied to robotic manipulation to achieve genuine generalization. This is challenging because, unlike text, manipulation data is heterogeneous by nature, expensive to collect, and narrow in diversity, making alignment and scale simultaneously difficult. We present Qwen-RobotManip, a generalizable Vision-Language-Action foundation model built on Qwen-VL. Qwen-RobotManip introduces a unified alignment framework across the representation, motion, and behavioral dimensions of manipulation, making large-scale multi-source training coherent rather than conflicting. This alignment capability in turn enables Qwen-RobotManip to absorb manipulation data at a scale that prior training regimes could not sustain. A human-to-robot synthesis pipeline converts egocentric hand demonstrations into robot trajectories across 15 platforms, and a rigorous curation pipeline harmonizes heterogeneous datasets. Using only open-source datasets and human videos without proprietary data collection, Qwen-RobotManip constructs a ~38,100-hour pretraining corpus and exhibits emergent generalization capabilities, including zero-shot instruction following, robustness to perturbations, reactive error recovery, and cross-embodiment transfer. We find that standard benchmarks fail to capture pretraining quality and instead adopt OOD settings including RoboCasa365, LIBERO-Plus, EBench, RoboTwin-Clean2Rand, RoboTwin-IF, and RoboTwin-XE. Qwen-RobotManip substantially outperforms prior state-of-the-art models, including π0.5, across all OOD settings, ranks 1st in RoboChallenge with a 20% relative improvement, and is validated on real-robot platforms including AgileX ALOHA, Franka, UR, and ARX.