AI Video Generation: Fine With the Everyday, Falls Apart When Things Get Weird
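Today's AI video generation models look convincing, but they are mostly good at simulating everyday scenes, such as a person driving a nail with a hammer. Swap in an unconventional tool (driving the nail with a brick, say) or a physically impossible pairing (driving it with a sponge), and performance drops sharply. The researchers designed a new benchmark, Tailor-Bench, and found that success rates in the "Impossible" scenarios were close to zero. More importantly, the models have not actually internalized physical principles and are only imitating surface visual patterns: image models get state changes wrong, and video models additionally fail to stay consistent over time. This is not a tool you can use tomorrow, but it is a reminder: do not be fooled by how realistic AI-generated everyday footage looks. These models are still far from genuinely understanding the world.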
📄 Original abstract
Physical interactions follow a long-tailed distribution: a set of common and regular interactions dominates human experience and visual data, while a broad spectrum of rare and irregular interactions remains underrepresented. Although recent visual world models, including image and video generation models, achieve impressive realism on existing benchmarks, they primarily focus on simulating common physical interactions. This raises a central question: Do current visual world models internalize and generalize physical principles? In this work, we introduce Tailor-Bench, a benchmark that challenges world models to simulate irregular physical interactions. To enable systematic evaluation, we design three scenario modes that progressively challenge model reasoning: Regular scenarios reflect common tool-task pairs, Unconventional scenarios replace conventional tools with attribute-compatible substitutes to test affordance generalization, and Impossible scenarios introduce attribute-violating tools to probe constraint awareness. Additionally, we design two complementary settings under a unified evaluation protocol: predictive generation requires inferring outcomes without guidance, while descriptive generation specifies the target outcome for faithful realization. Our experimental results reveal a clear long-tail gap in physical world modeling: performance degrades from Regular to Unconventional and Impossible scenarios, indicating limited generalization beyond common interactions. Failure analysis further shows that models rely on superficial visual patterns: image models fail to realize correct state changes, while video models further suffer from temporal inconsistencies.
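To make the evaluation design in the abstract concrete, here is a minimal Python sketch of how results could be tallied across the three scenario modes and two generation settings. It is an illustration only, not the authors' code: the `Trial` record, the `success_rates` and `long_tail_gap` helpers, the pass/fail scoring, and the toy data are all assumptions, and the abstract does not say how Tailor-Bench actually scores outputs.

```python
from collections import defaultdict
from dataclasses import dataclass

# Mode and setting names follow the abstract; everything else is hypothetical.
MODES = ("regular", "unconventional", "impossible")
SETTINGS = ("predictive", "descriptive")


@dataclass(frozen=True)
class Trial:
    """One judged generation (hypothetical record layout)."""
    mode: str      # which scenario mode the prompt belongs to
    setting: str   # predictive or descriptive generation
    passed: bool   # True if the output was judged physically correct;
                   # for an impossible tool, "correct" means the task fails


def success_rates(trials):
    """Return {(mode, setting): fraction of trials that passed}."""
    counts = defaultdict(lambda: [0, 0])  # key -> [passed, total]
    for t in trials:
        if t.mode not in MODES or t.setting not in SETTINGS:
            raise ValueError(f"unknown mode/setting: {t.mode}/{t.setting}")
        cell = counts[(t.mode, t.setting)]
        cell[0] += int(t.passed)
        cell[1] += 1
    return {key: passed / total for key, (passed, total) in counts.items()}


def long_tail_gap(rates, setting):
    """Drop in success rate from Regular to Impossible (assumed gap measure)."""
    return rates[("regular", setting)] - rates[("impossible", setting)]


# Toy placeholder data, NOT results from the paper.
toy_trials = [
    Trial("regular", "predictive", True),
    Trial("regular", "predictive", True),
    Trial("unconventional", "predictive", True),
    Trial("unconventional", "predictive", False),
    Trial("impossible", "predictive", False),
    Trial("impossible", "predictive", False),
]

rates = success_rates(toy_trials)
for mode in MODES:
    print(mode, rates[(mode, "predictive")])
print("gap:", long_tail_gap(rates, "predictive"))
```

Keeping the mode and the setting as separate keys means the same tally can be read two ways: along the Regular, Unconventional, Impossible axis to see the long-tail gap, or across predictive and descriptive generation to see how much specifying the target outcome helps.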