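Poisoning AI World Models: One Image Makes the Future Fall Apart
Give an AI a single photo and it can predict what happens next: an autonomous-driving system, for example, sees the road and simulates the frames a few seconds ahead. Researchers found that adding tiny noise to the original photo, imperceptible to the human eye, is enough to wreck the predicted future: frames are left only partly denoised, scene structure collapses, and the rollout stops following the control inputs. Worse, the attack needs neither the real future frames nor any guess about what the user will do next; it searches for the hardest control sequences itself and builds a perturbation that works regardless of the controls. This exposes how fragile world models are in safety-critical settings such as autonomous driving and robotics, but it also points to a privacy-protection idea: a similar perturbation could be used to keep an AI from accurately simulating your behavior.
📄 Original abstract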
Visual world models (VWMs) synthesize interactive, action-conditioned rollouts from a single context image. However, it remains an open question how robust these models are to adversarial perturbations. Standard adversarial attacks fail to assess this vulnerability because attackers lack ground-truth future videos and cannot predict subsequent user controls. We introduce BadWorld, a label-free adversarial framework tailored for autoregressive VWMs that systematically overcomes both constraints. First, to bypass the need for future supervision, we propose a self-supervised velocity attack that directly disrupts the early denoising dynamics of the model. Second, to ensure the attack generalizes across unpredictable user actions, we formulate a trajectory-adaptive bi-level optimization that actively mines hard control sequences to forge control-agnostic perturbations. Evaluated on representative VWMs with continuous and discrete controls, BadWorld exposes severe structural fragility. Visually indistinguishable adversarial images reliably trigger catastrophic degradation in future rollouts, leading to incomplete denoising, structural collapse, and control inconsistency. These findings reveal critical risks for deploying VWMs in safety-critical systems while highlighting a practical mechanism for privacy protection.
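To make the two ideas in the abstract concrete, here is a toy sketch in NumPy. It is not the BadWorld implementation. The abstract gives neither the loss nor the optimizer, so everything below is an assumption made for illustration: a small `tanh` map (`velocity`) stands in for the model's velocity prediction at an early denoising step, the label-free objective is the squared difference between the clean and perturbed predictions, the outer loop is sign-gradient ascent projected onto an L∞ ball of radius `eps`, and the "hard" control is taken to be the candidate action on which the current perturbation does the least damage.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IMG, D_ACT, D_OUT = 16, 4, 8  # toy sizes, chosen arbitrarily

# Hypothetical stand-in for a world model's velocity head (not a real VWM).
W_c = rng.normal(size=(D_OUT, D_IMG))  # context-image weights
W_a = rng.normal(size=(D_OUT, D_ACT))  # action weights
W_x = rng.normal(size=(D_OUT, D_OUT))  # noisy-latent weights


def velocity(context, action, x_t):
    """Toy velocity prediction for one early denoising step."""
    return np.tanh(W_c @ context + W_a @ action + W_x @ x_t)


def loss_and_grad(delta, context, action, x_t):
    """Label-free loss: squared gap between perturbed and clean velocity.

    Needs no ground-truth future frames. Returns the loss and its
    analytic gradient with respect to delta.
    """
    v_clean = velocity(context, action, x_t)
    v_adv = velocity(context + delta, action, x_t)
    diff = v_adv - v_clean
    loss = float(diff @ diff)
    # Chain rule through tanh: d tanh(z)/dz = 1 - tanh(z)^2
    grad = W_c.T @ (2.0 * diff * (1.0 - v_adv ** 2))
    return loss, grad


def attack(context, actions, eps=0.03, step=0.005, iters=200):
    """Max-min loop: pick the hardest action, then ascend on delta."""
    # Random start: the gradient is exactly zero at delta = 0.
    delta = rng.uniform(-eps, eps, size=context.shape)
    x_t = rng.normal(size=D_OUT)  # one fixed early-step noisy latent
    for _ in range(iters):
        # Inner step (assumed): the action where the attack is weakest.
        losses = [loss_and_grad(delta, context, a, x_t)[0] for a in actions]
        hard = actions[int(np.argmin(losses))]
        # Outer step: sign-gradient ascent, projected back into the eps ball.
        _, grad = loss_and_grad(delta, context, hard, x_t)
        delta = np.clip(delta + step * np.sign(grad), -eps, eps)
    return delta, x_t
```

Calling `attack(context, actions)` with a flattened image vector and a list of candidate action vectors returns a perturbation bounded by `eps` in every coordinate, plus the noisy latent it was optimized against. A real attack would have to backpropagate through the actual model and would also clip `context + delta` to the valid pixel range, which this sketch skips.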