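Give a robot a universal remote, and even a small model can do big jobs
Getting a robot to understand plain language and act on it usually means giving it an enormous brain. This paper goes the other way: it equips the robot with a "universal remote" that splits perception, planning, and control into separate modules, so the language model only has to call the shots. Three things do the heavy lifting: looping through look, think, and act; using semantic actions (for example "grasp the cup" rather than raw coordinates); and feeding the model both images and text. The result: with fewer than 2,000 simulated trajectories, a small 4-billion-parameter model was trained to go toe to toe with GPT-4o, and it can also handle unseen objects and long-horizon tasks. It's not something you'll be using tomorrow, but it hints at a path: robot intelligence may depend less on bigger models and more on smarter interfaces.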
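To make the three ingredients concrete, here is a minimal Python sketch of a look-think-act loop driven by semantic actions. It is not the paper's implementation: `ToyWorld`, `Observation`, `SemanticAction`, `rule_based_policy`, and `run_episode` are all hypothetical names invented for illustration, the "image" is a placeholder string, and a hand-written rule stands in for the language model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set


@dataclass
class Observation:
    image: str  # placeholder for a camera frame (assumption: no real vision here)
    text: str   # textual scene description given to the model alongside the image


@dataclass
class SemanticAction:
    name: str          # e.g. "grasp", "place", "done" (no raw coordinates)
    target: str = ""


class ToyWorld:
    """Hypothetical tabletop that hides low-level control behind semantic actions."""

    def __init__(self) -> None:
        self.on_table: Set[str] = {"cup", "plate"}
        self.in_bin: Set[str] = set()
        self.holding: Optional[str] = None

    def observe(self) -> Observation:
        text = "on_table={};holding={}".format(
            ",".join(sorted(self.on_table)), self.holding or "nothing"
        )
        return Observation(image="<frame>", text=text)

    def execute(self, action: SemanticAction) -> bool:
        # A real harness would call perception/planning/control modules here.
        if action.name == "grasp" and self.holding is None and action.target in self.on_table:
            self.on_table.remove(action.target)
            self.holding = action.target
            return True
        if action.name == "place" and self.holding is not None:
            self.in_bin.add(self.holding)
            self.holding = None
            return True
        return False


def rule_based_policy(instruction: str, obs: Observation) -> SemanticAction:
    """Stand-in for the language model: reads the text observation, picks one action."""
    fields = dict(part.split("=", 1) for part in obs.text.split(";"))
    on_table = [name for name in fields["on_table"].split(",") if name]
    if fields["holding"] != "nothing":
        return SemanticAction("place", "bin")
    for name in on_table:
        if name in instruction:
            return SemanticAction("grasp", name)
    return SemanticAction("done")


def run_episode(
    instruction: str,
    world: ToyWorld,
    policy: Callable[[str, Observation], SemanticAction],
    max_steps: int = 10,
) -> List[SemanticAction]:
    """Iterate perceive -> reason -> act until the policy says it is done."""
    trace: List[SemanticAction] = []
    for _ in range(max_steps):
        obs = world.observe()              # look
        action = policy(instruction, obs)  # think
        if action.name == "done":
            break
        world.execute(action)              # act
        trace.append(action)
    return trace


if __name__ == "__main__":
    world = ToyWorld()
    for step in run_episode("put the cup in the bin", world, rule_based_policy):
        print(step.name, step.target)
```

The point of the sketch is the division of labor: the policy only ever emits named actions and re-reads the scene after every step, while everything below `execute` could be swapped out without touching the model.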
📄 Original abstract (English)
Language models trained on large-scale vision-language data have demonstrated strong potential for embodied agents. Harnessing models through embodied tools use offers a promising alternative to end-to-end vision-language-action systems by combining high-level reasoning with external modules for perception, planning, and control. However, it remains unclear what makes an effective harness for embodied manipulation, and to what extent such a harness can unlock embodied capabilities in a wide range of reasoning models. In this work, we present Guava, a harness framework for embodied tool use developed through systematic exploration of the design space of agent workflows, action spaces, and observation spaces. Our study identifies three key ingredients for effective embodied agents: iterative perception-reasoning-action loops, semantic action abstractions, and multimodal observations. To understand whether these design principles are universal even to small models, we develop an end-to-end training pipeline that distills embodied manipulation capabilities into a 4B open-source model using fewer than 2K trajectories collected entirely in simulation. Experimental results in both simulation and real-world environments show performance comparable to frontier proprietary models while exhibiting strong generalization to unseen objects, novel instructions, and long-horizon tasks. Results suggest that a well-designed harness can serve as a scalable, model-agnostic interface for embodied manipulation, enabling strong emergent embodied capabilities in compact open-source models with minimal training data.