Robots Learn Geometry: One Model Handles Perception, Prediction, and Action
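Most current robot manipulation models reason over 2D images, but the real world is 3D. This paper takes a pretrained 3D geometric model (originally built to understand object shape and spatial relationships) and repurposes it directly as a robot manipulation policy: the model's shallow layers perceive the world, its remaining layers decode the predicted future and output actions, and the only addition in between is a language-conditioned temporal prediction module. In both simulation and real-robot tasks, the result is more accurate, more robust, faster, and lighter than today's mainstream foundation-model-scale baselines. It is not something you can use tomorrow, but it points to a direction: rather than training from scratch or stitching large models together, reuse an existing 3D understanding model for robot control.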
📄 Original abstract
Generalist robot policies must follow user instructions while reasoning about how objects, cameras, and robot actions interact in the 3D physical world. Recent vision-language-action models (VLAs) and video world-action models (WAMs) inherit strong semantic or temporal priors from large-scale foundation models, but they still operate primarily on 2D image frames or 2D-derived latent spaces, leaving implicit the 3D geometry required for contact-rich manipulation. We propose the Geometric Action Model (GAM), a language-conditioned manipulation policy that directly repurposes a pretrained geometric foundation model (GFM) as a shared substrate for perception, temporal prediction, and action decoding. GAM splits the GFM at an intermediate layer: the shallow layers serve as an observation encoder, and a causal future predictor inserted at the split layer forecasts future latent tokens conditioned on language, proprioception, and action history. The predicted future tokens are then routed through the remaining GFM blocks for feature propagation and decoding, allowing a single backbone to produce both future geometry and actions. This design equips the GFM with language-conditioned temporal world modeling through minimal architectural modification while preserving its rich geometric priors. Across a broad suite of simulation and real-robot manipulation benchmarks, GAM is more accurate, more robust, faster, and lighter than current foundation-model-scale baselines.
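The data flow the abstract describes (encode with the shallow blocks, predict future latents at the split, decode with the remaining blocks) can be sketched in a few lines. This is only a toy illustration under stated assumptions, not the paper's implementation: the names `make_block`, `future_predictor`, and `gam_step` are hypothetical, tokens are plain lists of floats, each "block" is a simple arithmetic stand-in for a pretrained layer, and conditioning is reduced to scalars. The abstract does not specify the split index, the predictor architecture, or the output heads, so all of those are placeholders here.

```python
from typing import Callable, List, Sequence, Tuple

Tokens = List[float]
Block = Callable[[Tokens], Tokens]


def make_block(scale: float) -> Block:
    """Hypothetical stand-in for one pretrained backbone block."""
    def block(tokens: Tokens) -> Tokens:
        return [scale * t + 1.0 for t in tokens]
    return block


def run_blocks(blocks: Sequence[Block], tokens: Tokens) -> Tokens:
    """Apply a stack of blocks in order."""
    for block in blocks:
        tokens = block(tokens)
    return tokens


def future_predictor(latents: List[Tokens], language: float,
                     proprio: float, prev_action: float) -> Tokens:
    """Toy causal predictor (assumption): it only sees latents up to the
    current step and shifts the latest one by the conditioning signals."""
    context = language + proprio + prev_action
    return [t + context for t in latents[-1]]


def gam_step(blocks: Sequence[Block], split: int, observations: List[Tokens],
             language: float, proprio: float,
             prev_action: float) -> Tuple[Tokens, float]:
    """One policy step: the same backbone is used before and after the split."""
    encoder, decoder = blocks[:split], blocks[split:]
    # 1. Shallow blocks encode each observed frame into latent tokens.
    latents = [run_blocks(encoder, obs) for obs in observations]
    # 2. The inserted predictor forecasts future latent tokens at the split.
    future_latent = future_predictor(latents, language, proprio, prev_action)
    # 3. Remaining blocks propagate and decode the predicted tokens.
    features = run_blocks(decoder, future_latent)
    # 4. Placeholder heads: features double as "future geometry",
    #    and their mean stands in for an action.
    action = sum(features) / len(features)
    return features, action


backbone = [make_block(2.0), make_block(1.0), make_block(0.5)]
geometry, action = gam_step(backbone, split=1, observations=[[1.0, 2.0]],
                            language=1.0, proprio=0.0, prev_action=0.0)
print(geometry, action)
```

The point of the sketch is the routing: no second network is stacked on top of the backbone, so the predicted tokens pass through the same later blocks that would normally process encoded observations.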