📄 Paper Digest

AI that infers spatial layout from video can now go back and check its answer

spatial reasoning · egocentric video · multimodal LLMs · 3D geometry · inference framework

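When an AI model watches an egocentric (first-person) video and has to say where objects are, it has so far had to answer in a single pass from what it saw, and a wrong answer stayed wrong. This paper has the model answer once, then renders a new-viewpoint video (for example, an elevated view looking down on the scene) from the predicted 3D geometry, and lets the model re-examine its answer and revise it if it was wrong. On two spatial reasoning benchmarks, VSI-Bench and STI-Bench, open-source models using the framework rival the proprietary state of the art. It is not something you can use tomorrow, but it points to a trend: AI reasoning should not be a one-way ticket, and being able to look back and check is closer to how people reason.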
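The two-phase flow described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function `reason_then_rereason`, the three callables passed into it, the prompt wording, and the choice to show only the novel-view video in the second phase are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ReReResult:
    first_answer: str   # hypothesis from the original video
    final_answer: str   # answer after looking at the novel view
    revised: bool       # True if the second pass changed the answer


def reason_then_rereason(
    video: Any,
    question: str,
    ask_mllm: Callable[[Any, str], str],        # hypothetical: (video, prompt) -> answer
    predict_geometry: Callable[[Any], Any],     # hypothetical: video -> 3D geometry
    render_novel_view: Callable[[Any], Any],    # hypothetical: geometry -> new video
) -> ReReResult:
    # Reason phase: form a spatial hypothesis from the original video.
    first = ask_mllm(video, question)

    # Geometry-to-video step: predict 3D geometry, then render a
    # complementary viewpoint as an ordinary video the model can watch.
    geometry = predict_geometry(video)
    novel_view = render_novel_view(geometry)

    # Re-reason phase: verify or revise the hypothesis against the new view.
    # The prompt text here is an assumption, not taken from the paper.
    followup = (
        f"{question}\n"
        f"Earlier answer: {first}\n"
        "This video shows the same scene from a new viewpoint. "
        "Confirm the earlier answer or revise it."
    )
    final = ask_mllm(novel_view, followup)
    return ReReResult(first_answer=first, final_answer=final, revised=final != first)
```

Because the model, the geometry predictor, and the renderer are passed in as plain callables, the sketch mirrors the training-free framing: nothing about the model itself changes, only what it is shown and asked at inference time.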

📄 Original abstract (English)

Spatial reasoning from egocentric videos is inherently challenging because the observable evidence is constrained by the camera trajectory. Existing methods rely on single-turn inference, forcing models to resolve geometric ambiguity through semantic priors rather than verifiable evidence. We argue that spatial reasoning should be revisitable: conclusions formed under limited evidence should remain open to revision when complementary viewpoints become available. Building on this insight, we propose Reason, then Re-reason (ReRe), a training-free, inference-time framework with two phases: in the Reason Phase, an MLLM forms a spatial hypothesis from the original video; in the Re-reason Phase, it verifies or revises the hypothesis by observing a synthesized novel-view video. To enable effective cross-view revisiting, we design a Geometry-to-Video pipeline that renders strategically complementary novel views from predicted 3D geometry. These views feature an elevated, oblique perspective with scene-spanning coverage, while preserving the MLLM's native video interface without architectural modifications. Extensive evaluations on VSI-Bench and STI-Bench demonstrate that ReRe substantially boosts open-source MLLMs to rival proprietary state-of-the-art performance. Project page: https://zhenjiemao.github.io/ReRe/
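To make "an elevated, oblique perspective with scene-spanning coverage" concrete, here is one way such a camera pose could be computed from a predicted point cloud. It is a hedged sketch only: the abstract does not say how ReRe chooses its views, and the function `elevated_oblique_camera`, its parameters, and the z-up world convention are all assumptions for illustration.

```python
import math


def _normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def elevated_oblique_camera(points, elevation_deg=45.0, azimuth_deg=0.0, margin=1.5):
    """Return (position, forward, right, up) for a camera looking down at a scene.

    Assumes a z-up world and points given as (x, y, z) tuples.
    """
    if not 0.0 < elevation_deg < 90.0:
        raise ValueError("elevation must be strictly between 0 and 90 degrees")

    # Bounding box of the scene; its center is the look-at target.
    lo = [min(p[i] for p in points) for i in range(3)]
    hi = [max(p[i] for p in points) for i in range(3)]
    center = tuple((a + b) / 2.0 for a, b in zip(lo, hi))
    diagonal = math.dist(lo, hi)
    if diagonal == 0.0:
        raise ValueError("scene has no spatial extent")

    # Stand back far enough that the whole bounding box is in view.
    radius = margin * 0.5 * diagonal
    el = math.radians(elevation_deg)
    az = math.radians(azimuth_deg)
    position = (
        center[0] + radius * math.cos(el) * math.cos(az),
        center[1] + radius * math.cos(el) * math.sin(az),
        center[2] + radius * math.sin(el),
    )

    # Orthonormal camera basis via the standard look-at construction.
    forward = _normalize(tuple(c - p for c, p in zip(center, position)))
    right = _normalize(_cross(forward, (0.0, 0.0, 1.0)))
    up = _cross(right, forward)
    return position, forward, right, up
```

Sweeping `azimuth_deg` over a range while holding the elevation fixed would give a sequence of poses, which is one plausible way to turn a static reconstruction into a short video that the model can consume through its usual video input.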

Original paper on arXiv
