📄 Paper Digest

How AI Sees Space: Two Paths, One That Talks It Through and One That Looks First

spatial reasoning, vision-language models, reinforcement learning, 3D perception, dual-path reasoning

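Today's AI can look at an image and recognize that "the cup is to the left of the table", but ask it how far the cup is from the table's edge, or whether a hand could reach it, and it gets stuck. This paper argues that spatial reasoning calls for two fundamentally different strategies. Some questions can be settled by linguistic logic alone (for example, "A is left of B, and B is left of C, so A is left of C"). Others require first working out where the objects sit in 3D space and then reasoning quantitatively. The researchers give the model both paths: Language-Only Reasoning (LOR), which deduces step by step in language, and Detect-Then-Reason (DTR), which first detects 3D cues such as object centers or bounding boxes and then reasons over them. A single model is trained with reinforcement learning to support both. The result is a model that significantly outperforms spatial VLM baselines across several spatial benchmarks, and the two paths reinforce each other: training the model to ground objects in 3D also improves its language-only reasoning, and vice versa. This is not a feature you can use tomorrow, but it explains why today's AI still falls a step short on spatial understanding, and how that gap might be closed.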
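To make the contrast between the two kinds of question concrete, here is a minimal Python sketch. It is not the paper's model or code: `left_of_closure`, `center_distance`, `within_reach`, the coordinates, and the reach values are all hypothetical, chosen only to illustrate a qualitative deduction versus a computation over 3D centers.

```python
import math

# Language-only style: chain qualitative "left of" facts by transitivity.
def left_of_closure(facts):
    """Return every (a, b) pair implied by transitivity of 'a is left of b'."""
    closure = set(facts)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

# Detect-then-reason style: start from 3D centers, then compute a quantity.
def center_distance(p, q):
    """Euclidean distance between two 3D points (same unit as the inputs)."""
    return math.dist(p, q)

def within_reach(hand, obj, reach_m):
    """True if the object center lies within reach_m of the hand center."""
    return center_distance(hand, obj) <= reach_m

# Qualitative question: is A left of C?
implied = left_of_closure({("A", "B"), ("B", "C")})
print(("A", "C") in implied)

# Quantitative question: can the hand reach the cup?
# Hypothetical centers in metres; a real system would get these from a detector.
centers = {"hand": (0.0, 0.0, 0.0), "cup": (0.3, 0.0, 0.4)}
print(within_reach(centers["hand"], centers["cup"], reach_m=0.6))
```

The first question never needs a coordinate, while the second cannot be answered without one, which is the distinction the paper builds its two paths around.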

📄 Original Abstract (English)

Spatial VLMs have made substantial progress in geometric perception, yet complex spatial reasoning requiring multi-step inference over depth, distance, and scene relations remains challenging. Moreover, different spatial queries call for fundamentally different strategies: some are best addressed through purely linguistic, step-by-step deduction, while others require explicit 3D grounding before quantitative inference. We present Dual-Path Spatial Reasoning via Reinforcement Learning for Spatial VLMs (SR-REAL), a unified framework that equips a spatial VLM with two complementary reasoning paths: Language-Only Reasoning (LOR), which performs step-by-step linguistic deduction, and Detect-Then-Reason (DTR), which detects 3D geometric cues (e.g., centers or bounding boxes) via region tokens before explicit geometric inference. SR-REAL begins with a cold-start supervised fine-tuning stage that constructs LOR and DTR chain-of-thought supervision and exposes a region-to-3D interface, followed by RL that optimizes the policy model with accuracy and format rewards; for DTR, a discrete center-based detection reward further refines geometric alignment. Across diverse spatial benchmarks, SR-REAL significantly outperforms spatial VLM baselines: (i) a single RL-trained model supports both reasoning paths, with DTR excelling in region-aware tasks through precise 3D localization and LOR enhancing general spatial reasoning; (ii) jointly training both paths fosters mutual reinforcement; (iii) high-quality, blended cold-start data is crucial for stable RL optimization; and (iv) the model generalizes across datasets and domains without per-task tuning, demonstrating positive transfer between LOR and DTR.
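The abstract names three reward signals for the RL stage (accuracy, format, and a discrete center-based detection reward for DTR) but does not spell out their definitions. The following is a rough sketch of how such rewards could be combined, under loudly stated assumptions: the `<think>`/`<answer>` output format, the exact-match accuracy check, the 0.25 m tolerance, and the unweighted sum are all hypothetical and are not taken from the paper.

```python
import math
import re

# ASSUMPTION: outputs look like <think>...</think><answer>...</answer>.
# The abstract only says a "format reward" exists, not what the format is.
PATTERN = re.compile(r"^<think>.*?</think>\s*<answer>(.*?)</answer>$", re.DOTALL)

def format_reward(output):
    """1.0 if the output follows the assumed template, else 0.0."""
    return 1.0 if PATTERN.match(output.strip()) else 0.0

def accuracy_reward(output, gold_answer):
    """1.0 if the extracted answer matches the reference (case-insensitive)."""
    m = PATTERN.match(output.strip())
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip().lower() == gold_answer.strip().lower() else 0.0

def center_reward(pred_center, gold_center, tol_m=0.25):
    """Discrete reward: 1.0 if the predicted 3D center is within tol_m
    of the reference center, else 0.0. The tolerance is an assumed value."""
    return 1.0 if math.dist(pred_center, gold_center) <= tol_m else 0.0

def total_reward(output, gold_answer, pred_center=None, gold_center=None):
    """Unweighted sum (assumed). The center term applies only when 3D
    centers are supplied, i.e. for detect-then-reason style samples."""
    reward = accuracy_reward(output, gold_answer) + format_reward(output)
    if pred_center is not None and gold_center is not None:
        reward += center_reward(pred_center, gold_center)
    return reward

sample = "<think>The cup center is about 0.5 m from the hand.</think><answer>yes</answer>"
print(total_reward(sample, "yes"))
print(total_reward(sample, "yes", (0.3, 0.0, 0.4), (0.3, 0.0, 0.5)))
```

A thresholded reward like `center_reward` is one plausible reading of "discrete"; the paper may bin or score centers differently.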

Original paper on arXiv
