📄 Paper digest

Give AI a universal controller, and even small models can do big jobs

Trusted channel | Embodied AI, tool use, small models, manipulation tasks

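Getting an AI to manipulate objects the way a person does usually calls for a huge end-to-end model. The Guava framework goes the other way: rather than having the model produce low-level actions end to end, it defines a "tool interface" (a harness) in which perception, planning, and control are separate external modules that the model calls, much like pressing buttons on a remote control. The study identifies three key ingredients: an iterative perceive-reason-act loop, semantic actions (for example "grasp the cup") in place of raw coordinates, and observations that combine images and text. With this harness, a 4-billion-parameter open-source model trained on fewer than 2,000 trajectories, all collected in simulation, handles manipulation tasks in both simulation and the real world, with performance the authors report as comparable to frontier proprietary models. It is not something you can use tomorrow, but it points in a direction: future gains in AI's hands-on ability may come less from ever-larger models and more from smarter "add-ons".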
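As a rough illustration of the three ingredients, here is a minimal Python sketch of a perceive-reason-act loop that calls semantic tools and reads a combined image-and-text observation. Everything in it (`ToyWorld`, `Observation`, `TOOLS`, `grasp`, `place`, `toy_policy`, `run_episode`) is hypothetical and invented for this digest; it is not Guava's code or API, and the rule-based policy only stands in for the reasoning model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Observation:
    """Multimodal observation: an image placeholder plus a text description."""
    image: str
    text: str


@dataclass
class ToyWorld:
    """Hypothetical toy environment standing in for a simulator or robot."""
    holding: str = ""
    cup_on_shelf: bool = False
    log: List[str] = field(default_factory=list)

    def observe(self) -> Observation:
        text = (f"holding={self.holding or 'nothing'}; "
                f"cup_on_shelf={self.cup_on_shelf}")
        return Observation(image="<camera frame>", text=text)


# Semantic actions: the policy names a tool and a target, never raw coordinates.
def grasp(world: ToyWorld, target: str) -> None:
    world.holding = target
    world.log.append(f"grasp({target})")


def place(world: ToyWorld, target: str) -> None:
    if world.holding == "cup" and target == "shelf":
        world.cup_on_shelf = True
    world.holding = ""
    world.log.append(f"place({target})")


TOOLS: Dict[str, Callable[[ToyWorld, str], None]] = {
    "grasp": grasp,
    "place": place,
}


def toy_policy(obs: Observation) -> Tuple[str, str]:
    """Rule-based stand-in for the reasoning model (assumption, not a real model)."""
    if "cup_on_shelf=True" in obs.text:
        return ("done", "")
    if "holding=cup" in obs.text:
        return ("place", "shelf")
    return ("grasp", "cup")


def run_episode(world: ToyWorld,
                policy: Callable[[Observation], Tuple[str, str]],
                max_steps: int = 10) -> List[str]:
    """Iterate: perceive, reason, act, then perceive again."""
    for _ in range(max_steps):
        obs = world.observe()        # perceive
        name, arg = policy(obs)      # reason
        if name == "done":
            break
        TOOLS[name](world, arg)      # act through a semantic tool
    return world.log


if __name__ == "__main__":
    print(run_episode(ToyWorld(), toy_policy))
```

In this toy setup the loop issues two tool calls, `grasp(cup)` then `place(shelf)`, and stops once the observation reports the cup on the shelf. The point of the design is that the policy re-observes after every action instead of committing to a full plan up front, so swapping in a different reasoning model would only change `toy_policy`.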

📄 Original abstract (English)

Language models trained on large-scale vision-language data have demonstrated strong potential for embodied agents. Harnessing models through embodied tool use offers a promising alternative to end-to-end vision-language-action systems by combining high-level reasoning with external modules for perception, planning, and control. However, it remains unclear what makes an effective harness for embodied manipulation, and to what extent such a harness can unlock embodied capabilities in a wide range of reasoning models. In this work, we present Guava, a harness framework for embodied tool use developed through systematic exploration of the design space of agent workflows, action spaces, and observation spaces. Our study identifies three key ingredients for effective embodied agents: iterative perception-reasoning-action loops, semantic action abstractions, and multimodal observations. To understand whether these design principles are universal even to small models, we develop an end-to-end training pipeline that distills embodied manipulation capabilities into a 4B open-source model using fewer than 2K trajectories collected entirely in simulation. Experimental results in both simulation and real-world environments show performance comparable to frontier proprietary models while exhibiting strong generalization to unseen objects, novel instructions, and long-horizon tasks. Results suggest that a well-designed harness can serve as a scalable, model-agnostic interface for embodied manipulation, enabling strong emergent embodied capabilities in compact open-source models with minimal training data.

Original paper on arXiv
