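Robots adapt to a new setup by watching themselves move
Today's robot Vision-Language-Action (VLA) models are rigid: change the camera angle or swap in a different arm and the policy stops working, so it has to be retrained on a large amount of new data. This paper has the robot first make a few aimless, task-agnostic movements (turning in place or reaching out, for example), infer from those movements how the new setup behaves (how far the camera has shifted or how long the arm is, for example), and then go straight to the task without any parameter updates. In simulation and on real robots, it significantly outperforms standard VLA baselines when the camera viewpoint changes. It is not something you can use tomorrow, but the direction is clear: let a robot do what a person does in an unfamiliar place, which is to probe a little before acting.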
📄 Original abstract
Modern Vision-Language-Action (VLA) models often fail to generalize to novel setups, such as altered camera viewpoints or robot morphologies, because they are typically conditioned only on current observations and language instructions. By ignoring the underlying system configuration as a variable, these models implicitly assume a fixed execution context encountered during training, necessitating data-intensive fine-tuning for any new environment. In this work, we introduce In-Context World Modeling (ICWM), a framework that treats system identification as an in-context adaptation problem. ICWM enables robot policies to autonomously infer essential system variables from a short history of self-generated, task-agnostic interactions. Unlike traditional In-Context Learning that uses demonstrations to specify what task to perform, ICWM leverages the context window to understand how the system operates. By processing these interactions before task execution, the model implicitly captures the world dynamics of the current system, enabling adaptation to novel configurations without parameter updates. Extensive experiments in simulation and on real-world robot platforms demonstrate that ICWM significantly outperforms standard VLA baselines on novel camera viewpoints.
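To make the "probe first, then act" idea concrete, here is a toy sketch. It is not the paper's method: ICWM infers system variables implicitly inside the model's context window, whereas this example swaps in an explicit least-squares estimate of a single variable, a 2D camera rotation. Every name in it (`collect_probe_interactions`, `estimate_camera_angle`, `act_toward`) and the 2D world itself are assumptions made for illustration.

```python
import math
import random


def rotate(v, theta):
    """Rotate a 2D vector v by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])


def collect_probe_interactions(camera_angle, n_probes=8, noise=0.01, seed=0):
    """Make random, task-agnostic moves and record how each looks on camera.

    Toy assumption: the camera shows every robot-frame displacement
    rotated by an unknown angle, plus a little Gaussian noise.
    """
    rng = random.Random(seed)
    history = []
    for _ in range(n_probes):
        action = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        seen = rotate(action, camera_angle)
        observed = (seen[0] + rng.gauss(0, noise), seen[1] + rng.gauss(0, noise))
        history.append((action, observed))
    return history


def estimate_camera_angle(history):
    """Least-squares 2D rotation that maps the actions onto the observations."""
    cross = sum(a[0] * o[1] - a[1] * o[0] for a, o in history)
    dot = sum(a[0] * o[0] + a[1] * o[1] for a, o in history)
    return math.atan2(cross, dot)


def act_toward(target_in_image, theta_hat):
    """Undo the estimated camera rotation to get a robot-frame action."""
    return rotate(target_in_image, -theta_hat)


# The camera is rotated 40 degrees away from the training setup.
true_angle = math.radians(40)
history = collect_probe_interactions(true_angle)
theta_hat = estimate_camera_angle(history)

target = (1.0, 0.0)                             # desired motion as seen in the image
naive_action = target                           # assumes the training camera (angle 0)
adapted_action = act_toward(target, theta_hat)  # uses the probe history

print("estimated angle (deg):", math.degrees(theta_hat))
print("naive motion as seen on camera:  ", rotate(naive_action, true_angle))
print("adapted motion as seen on camera:", rotate(adapted_action, true_angle))
```

The naive action ends up well off target in the image because it assumes the training viewpoint, while the adapted action lands close to the target using only the short probe history and no change to any learned parameters. In the paper's framing, a learned policy would consume that history as context rather than run an explicit estimator like this one.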