📄 Paper Digest

AI Finally Learns to "Think" in the 3D World

Tags: spatial intelligence · 3D reasoning · vision-language models · tool augmentation · temporal memory

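Today's AI is very good at reading a photo, but ask it to understand a continuously changing 3D space, for example to judge "how far to my left is this object, and which way is it facing," and it is stumped, because it reasons over single static images one at a time. This paper lets the AI work more like a person: it first looks at several photos taken from different angles, or at a video, then calls different "tools" to measure, count, and judge orientation, and finally pieces the results together. It is not a feature you can use tomorrow, but it is a key step in moving AI from "describing pictures" to "spatial reasoning."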

📄 Original Abstract

Real-world spatial intelligence requires reasoning over a continuous and evolving 3D world, yet existing VLMs and tool-augmented agents largely remain tied to static, stateless inference from isolated visual observations. We introduce S-Agent, a spatial tool-use agentic paradigm for understanding and reasoning over continuous multi-view images and videos. By formulating spatial reasoning as spatio-temporal evidence accumulation rather than isolated frame-level prediction, S-Agent reshapes spatial perception into scene-centric understanding beyond frame-centric recognition. Specifically, S-Agent casts the VLM as a semantic planner that decides what evidence is needed, while a hierarchy of spatial tools and experts grounds objects in 2D, lifts them into 3D geometric evidence, and aggregates this evidence into high-level spatial knowledge (e.g., counting, measurement, orientation, and relative position). Additionally, a temporal memory mechanism, including Scene Memory for maintaining the evolving scene state and Agent Memory for accumulating reasoning context, enables evidence integration across frames and reasoning steps. Comprehensive experiments on multi-view and video spatial reasoning benchmarks show that S-Agent consistently improves both open-source and closed-source VLMs in a training-free manner. Beyond inference-time augmentation, supervised fine-tuning (SFT) on S-Agent-generated spatial trajectories S-300K yields S-Agent-8B, a compact spatial agent that significantly surpasses similar-scale baselines (e.g., Qwen3-VL-8B) and performs comparably to advanced closed-source models (e.g., GPT-5.4 and Gemini 3).
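
To make the division of labor in the abstract more concrete, here is a minimal toy sketch of the loop it describes: a planner picks a tool, the tool reads from a scene state accumulated across frames, and the call is logged as reasoning context. This is not the paper's implementation. Every name here (`SceneMemory`, `AgentMemory`, `plan`, `TOOLS`, `FRAMES`) is hypothetical, the planner is a keyword rule standing in for a VLM, and the sketch assumes each frame already provides object positions in a shared world frame, which skips the 2D grounding and 3D lifting stages entirely.

```python
from dataclasses import dataclass, field
from math import dist


@dataclass
class SceneMemory:
    """Hypothetical scene state: object label -> 3D positions seen so far."""
    objects: dict = field(default_factory=dict)

    def update(self, label, xyz):
        self.objects.setdefault(label, []).append(xyz)

    def position(self, label):
        # Fuse repeated observations of one object by averaging them.
        pts = self.objects[label]
        return tuple(sum(axis) / len(pts) for axis in zip(*pts))


@dataclass
class AgentMemory:
    """Hypothetical reasoning context: a log of (tool, args, result)."""
    steps: list = field(default_factory=list)


def count_objects(scene, cls):
    # Labels look like "chair_1"; the part before "_" is the class.
    return sum(1 for label in scene.objects if label.split("_")[0] == cls)


def measure_distance(scene, a, b):
    return dist(scene.position(a), scene.position(b))


TOOLS = {"count": count_objects, "measure": measure_distance}


def plan(question, scene):
    """Toy stand-in for the VLM planner: pick a tool and its arguments."""
    q = question.lower()
    if "how many" in q:
        classes = sorted({label.split("_")[0] for label in scene.objects})
        return "count", (next(c for c in classes if c in q),)
    if "how far" in q:
        a, b = [l for l in scene.objects if l.split("_")[0] in q][:2]
        return "measure", (a, b)
    raise ValueError("no tool available for this question")


def answer(question, frames):
    scene, agent = SceneMemory(), AgentMemory()
    for frame in frames:  # accumulate evidence across all frames first
        for label, xyz in frame.items():
            scene.update(label, xyz)
    tool, args = plan(question, scene)
    result = TOOLS[tool](scene, *args)
    agent.steps.append((tool, args, result))
    return result, agent


# Made-up observations: no single frame shows both chairs,
# and the lamp and the sofa only appear together in the second frame.
FRAMES = [
    {"chair_1": (0.0, 0.0, 2.0), "lamp_1": (1.0, 0.0, 0.0)},
    {"chair_2": (2.0, 0.0, 2.0), "lamp_1": (1.0, 0.0, 0.0), "sofa_1": (4.0, 0.0, 4.0)},
    {"chair_1": (0.0, 0.0, 2.0), "sofa_1": (4.0, 0.0, 4.0)},
]

chairs, _ = answer("How many chairs are there?", FRAMES)
gap, log = answer("How far is the lamp from the sofa?", FRAMES)
```

The point of the toy data is that the chair count cannot be read off any single frame: it only comes out right because observations are merged into one scene state before a tool is called, which is the "scene-centric rather than frame-centric" idea in miniature.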

Original paper on arXiv
