📄 Paper Digest

AI Predicts the Future, but Only Looks at the Useful Parts

Tags: world models, action prediction, video generation, AI decision-making, survey

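AI prediction of the future no longer aims for full-fidelity video; instead, it generates only the minimal future that decision-making needs. This survey maps out "World Action Models" (WAMs), AI systems that predict the future and act on that prediction. It finds that current mainstream approaches fall into two camps: one uses video generation models to render future frames directly, while the other uses language or vision-language models to skip video and reason about actions directly. The real trend, however, is that models are deliberately "generating less": they keep only the part of the future that control requires, which saves compute and reduces latency. This is not a tool you can use tomorrow, but it reveals a design shift in AI from "simulating everything" to "simulating only what matters."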

📄 Original Abstract (English)

World Action Models (WAMs) are embodied predictive-action models that make a forecast of the future available to action. Recent WAMs repurpose large video generation models, and a parallel line relies on language or vision-language backbones without a video-generation core. This rapid expansion has blurred the boundary among broad world models, video generation models, action-grounded video world models, Vision-Language-Action policies, and WAMs. This survey gives the field a common account. It first clarifies these boundaries, then organizes existing works through two complementary views. The first view asks what each method is required to generate, spanning rendered futures, latent futures, and video-generation-free action reasoning. The second view decomposes each method by predictive substrate, backbone, action coupling, and deployment regime. This anatomy supports a unified discussion of interactability, causality, persistence, physical plausibility, and generalization, followed by data, evaluation, and open challenges. Across these axes, a consistent design pattern emerges: WAMs are not simply video generators with action heads, but predictive-action methods whose design choices trade representational richness against compute, memory, latency, and action-label cost. The field is moving toward methods that generate less of the future while preserving what control requires. The survey homepage is available at https://world-action-models.github.io/.

Original paper on arXiv
