📄 Paper Digest

How Bad Is AI at Spatial Reasoning? The Strongest Model Succeeds Only 17% of the Time

Tags: spatial reasoning, multimodal AI, benchmarks, GPT-5, Qwen-3.5

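You ask an AI to help find your keys, plan a route, or arrange a room. Does it actually understand space? The new benchmark SpatialWorld puts AI agents through 760 human-annotated tasks running on eight simulation backends, covering domains such as household routines, travel, and social collaboration. Agents must work the way a person would: relying only on visual observation, exploring actively, and acting through text commands. The result: the strongest model, GPT-5, averages a task success rate of just 17.4%, and the best open-source model, Qwen-3.5, reaches only 14.1%. The models struggle badly with active exploration and long-horizon planning, and performance varies widely from one domain to another. This is not a tool you can use tomorrow, but it does tell you that AI is still far from truly understanding the physical world.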

📄 Original Abstract

Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and operate within the physical world. However, existing benchmarks predominantly rely on passive evaluation (e.g., static VQA) or simulator-specific pipelines, failing to assess general interactive spatial understanding. We introduce SpatialWorld, a unified benchmark designed specifically for evaluating the interactive spatial understanding of multimodal agents in complex real-world tasks. Integrating eight heterogeneous simulation backends under a shared, simulator-agnostic protocol, SpatialWorld features 760 human-annotated tasks across diverse domains (e.g., household routines, travel, social collaboration). Agents must solve tasks under vision-only partial observability, actively gathering egocentric visual evidence and expressing decisions via a unified, text-based action interface native to MLLMs. For reliable evaluation, each task includes a human-validated initial state, a reference trajectory, and a terminal-state verifier. Evaluating 15 advanced agents reveals that robust spatial task solving remains challenging: the strongest model, GPT-5, achieves an average task success rate (TSR) of only 17.4%, while the leading open-source model, Qwen-3.5, reaches 14.1%. Further analysis exposes a clear mismatch between task success and execution efficiency, alongside substantial domain-specific performance variations. These bottlenecks in active exploration and long-horizon planning position SpatialWorld as a rigorous testbed for future spatial agents.
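To make the headline metric concrete, here is a minimal Python sketch of how a task success rate (TSR) could be computed from verifier outcomes. It is an illustration under stated assumptions, not the paper's evaluation code: the `TaskResult` type, the domain labels, and the per-domain grouping are hypothetical, and the abstract does not say whether the reported average is taken over tasks or over domains.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskResult:
    """One evaluated task (hypothetical structure, not from the paper)."""
    domain: str    # illustrative label such as "household" or "travel"
    success: bool  # assumed to be the terminal-state verifier's verdict


def task_success_rate(results: List[TaskResult]) -> float:
    """Fraction of tasks whose final state was accepted by the verifier."""
    if not results:
        raise ValueError("cannot compute TSR over zero tasks")
    return sum(r.success for r in results) / len(results)


def per_domain_tsr(results: List[TaskResult]) -> Dict[str, float]:
    """TSR computed separately for each domain label."""
    by_domain: Dict[str, List[TaskResult]] = {}
    for r in results:
        by_domain.setdefault(r.domain, []).append(r)
    return {d: task_success_rate(rs) for d, rs in by_domain.items()}


if __name__ == "__main__":
    # Made-up outcomes purely for illustration.
    demo = [
        TaskResult("household", True),
        TaskResult("household", False),
        TaskResult("travel", False),
    ]
    print(task_success_rate(demo))
    print(per_domain_tsr(demo))
```

Splitting the score by domain is one way to surface the domain-specific variation the abstract mentions, since a single overall number can hide a domain where an agent almost never succeeds.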

Original paper on arXiv
