📄 Paper Digest

AI Predicts the Future, but Only the Parts That Matter

Tags: World Action Models · predictive models · video generation · action reasoning · AI decision-making

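World Action Models (WAMs) are not simply video generators; they are predictive models built for action. Rather than rendering a complete picture of the future, they generate only the part of it that a decision needs. A robot, for example, needs to know whether reaching out will hit the cup, not a 4K video of the whole room. This "less is more" design makes the models more efficient in compute, memory, and latency. The mainstream approach today repurposes large video generation models, while a parallel line reasons directly with language or vision-language models and skips video generation altogether. The survey sorts out which methods genuinely predict for action and which are only elaborate video generation. It is not a tool you can use tomorrow, but if you want to understand how AI is moving from "watching the world" to "acting on the world", it is the clearest roadmap currently available.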

📄 Original Abstract (English)

World Action Models (WAMs) are embodied predictive-action models that make a forecast of the future available to action. Recent WAMs repurpose large video generation models, and a parallel line relies on language or vision-language backbones without a video-generation core. This rapid expansion has blurred the boundary among broad world models, video generation models, action-grounded video world models, Vision-Language-Action policies, and WAMs. This survey gives the field a common account. It first clarifies these boundaries, then organizes existing works through two complementary views. The first view asks what each method is required to generate, spanning rendered futures, latent futures, and video-generation-free action reasoning. The second view decomposes each method by predictive substrate, backbone, action coupling, and deployment regime. This anatomy supports a unified discussion of interactability, causality, persistence, physical plausibility, and generalization, followed by data, evaluation, and open challenges. Across these axes, a consistent design pattern emerges: WAMs are not simply video generators with action heads, but predictive-action methods whose design choices trade representational richness against compute, memory, latency, and action-label cost. The field is moving toward methods that generate less of the future while preserving what control requires. The survey homepage is available at https://world-action-models.github.io/.

Original paper on arXiv
