📄 Paper Digest

The AI Video Model Triathlon: Failing Physics, Geometry, and Interaction

Trending channel ▲ 29 · Tags: world models, video generation, physical reasoning, 3D consistency, interaction evaluation

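Today's AI video generation models can produce short clips that pass for real footage, but they come apart as soon as you test them as "world models." This paper sets up a triathlon-style benchmark with three tracks: physics (do falling objects and thermal phenomena behave plausibly?), geometry (does the 3D structure hold together, and do multiple views agree?), and interaction (can the model follow complex instructions and keep generating coherent actions over time?). The result: even the strongest models score poorly on physics and geometry, and they fail almost across the board on interaction tasks. This is not something you can put to use tomorrow, but it shows how far AI still is from truly understanding the physical world: generating pretty frames and simulating the real world are two different things.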

📄 Original Abstract (English)

We introduce WorldOlympiad, a benchmark for diagnosing video-based world models across physical faithfulness, geometric consistency, and interaction fidelity. While existing benchmarks often focus on visual quality, semantic alignment, or short-term temporal coherence, they provide limited insight into whether generated videos obey physical rules, preserve coherent 3D structure, and sustain controllable interactions over long horizons. To address this gap, WorldOlympiad decomposes world-model evaluation into three complementary dimensions. The physical track uses object segmentation and MLLM-as-judge to assess whether generated videos follow interpretable rules in mechanics, thermal phenomena, and material properties. The geometry track reconstructs generated videos with Gaussian splatting and evaluates structural consistency, cross-view coherence, and camera-trajectory alignment. The interaction track assesses whether generated rollouts follow complex action prompts and maintain smooth, coherent transitions across consecutive video chunks. WorldOlympiad further covers three major downstream scenarios, including gaming, robotics, and general real-world videos, capturing diverse challenges from interactive control and embodied manipulation to open-domain motion and camera dynamics. Together, these tracks and scenarios form a scalable and interpretable evaluation suite that exposes failure modes beyond generic video quality. Experiments on state-of-the-art models reveal substantial gaps in physical reasoning, 3D consistency, and long-horizon interaction, underscoring the need for more structured evaluation protocols for generative world models.

Original paper on arXiv
