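Getting AI to Think About Spatial Problems Like a Programmer
Today's AI looks at a 3D scene as if through frosted glass: it can recognize objects, but it cannot work out how they are arranged or how they move. The researchers trace the problem to the "action interface" through which the AI invokes its tools. Existing agents either write all of their code in a single pass, with no view of intermediate results, or go through a structured tool-call interface that leaves little room to compose operations freely or adapt the analysis to the task. SpatialClaw takes a different approach: the AI works the way a programmer does in a notebook, writing one executable cell per step, running it, and reading the result before writing the next. Behind it sits a stateful Python kernel, pre-loaded with the input frames and a set of perception and geometry tools, and the agent can look back at all earlier text and image outputs. Across 20 spatial reasoning benchmarks it reaches 59.9% average accuracy, 11.2 percentage points above a recent spatial agent. This is not a product you can use tomorrow, but it points in a direction: teach AI to "think while looking" rather than "finish thinking, then look".
📄 Original abstract (English)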
Spatial reasoning, the ability to determine where objects are, how they relate, and how they move in 3D, remains a fundamental challenge for vision-language models (VLMs). Tool-augmented agents attempt to address this by augmenting VLMs with specialist perception modules, yet their effectiveness is bounded by the action interface through which those tools are invoked. In this work, we study how the design of this interface shapes the agent's capacity for open-ended spatial reasoning. Existing spatial agents either employ single-pass code execution, which commits to a full analysis strategy before any intermediate result is observed, or rely on a structured tool-call interface that often offers less flexibility for freely composing operations or tailoring the analysis to each task. Both designs offer limited flexibility for open-ended, complex 3D/4D spatial reasoning. We therefore propose SpatialClaw, a training-free framework for spatial reasoning that adopts code as the action interface. SpatialClaw maintains a stateful Python kernel pre-loaded with input frames and a suite of perception and geometry primitives, letting a VLM-backed agent write one executable cell per step conditioned on all prior outputs, enabling the agent to flexibly compose and manipulate perception results and adapt its analysis to both intermediate text and visual observations and the demands of each problem. Evaluated across 20 spatial reasoning benchmarks spanning a broad range of static and dynamic 3D/4D spatial reasoning tasks, SpatialClaw achieves 59.9% average accuracy, outperforming the recent spatial agent by +11.2 points, with consistent gains across six VLM backbones from two model families without any benchmark- or model-specific adaptation.
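To make the "one executable cell per step" idea concrete, here is a minimal Python sketch of a stateful kernel loop. It is an illustration under stated assumptions, not SpatialClaw's actual code: `StatefulKernel`, `fake_detect`, `scripted_agent`, and `solve` are hypothetical names, the detector returns made-up coordinates, and a small scripted function stands in for the VLM. The only point is to show how state persists between cells and how each new cell can depend on earlier outputs.

```python
import contextlib
import io


class StatefulKernel:
    """Toy stand-in for a persistent Python kernel (hypothetical, not the paper's API)."""

    def __init__(self, preloaded):
        # Every cell shares this namespace, so variables survive across steps.
        self.namespace = dict(preloaded)

    def run_cell(self, code):
        """Execute one cell; return what it printed, or a short error message."""
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, self.namespace)
        except Exception as exc:
            # Errors are returned as observations so the agent could react to them.
            buffer.write(f"{type(exc).__name__}: {exc}")
        return buffer.getvalue().strip()


def fake_detect(frame):
    """Hypothetical perception primitive returning made-up 3D object centers."""
    return {"chair": (0.0, 0.0, 2.0), "table": (1.5, 0.0, 2.0)}


def scripted_agent(history):
    """Stands in for the VLM: chooses the next cell after reading prior outputs."""
    if not history:
        return "objects = detect(frames[0])\nprint(sorted(objects))"
    last_output = history[-1][1]
    if len(history) == 1 and "chair" in last_output:
        # The first observation listed a chair, so the next cell measures a distance.
        return (
            "import math\n"
            "dist = math.dist(objects['chair'], objects['table'])\n"
            "print(f'{dist:.2f}')"
        )
    return None  # Nothing left to do.


def solve(kernel, agent, max_steps=5):
    """Run the write-execute-observe loop and return the (cell, output) history."""
    history = []
    for _ in range(max_steps):
        cell = agent(history)
        if cell is None:
            break
        history.append((cell, kernel.run_cell(cell)))
    return history


kernel = StatefulKernel({"frames": ["frame_0"], "detect": fake_detect})
history = solve(kernel, scripted_agent)
for _, output in history:
    print(output)
```

Running the sketch prints `['chair', 'table']` and then `1.50`. The second cell reuses `objects` from the first without recomputing it, which is what a stateful kernel buys over single-pass execution: the plan for step two is only fixed after the output of step one has been observed.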