AI Pulse
📄 Paper Digest

You Can Be the Director Too: Steering the Camera in AI-Generated Video Like Playing a Game

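Have you seen AI-generated video? Usually you type a sentence and watch the model improvise: you have no say over how the camera moves or how the scene changes. DreamX-World 1.0 changes that. It generates video from a text or image prompt and lets you steer the camera as you would in a game: move in for a closer look at a detail, circle behind an object, or return to a place you saw earlier and find it still there.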

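The researchers did three things. First, they trained the model on game-engine renders, gameplay recordings, and real-world videos, so that it learns how camera motion relates to scene changes. Second, they introduced E-PRoPE, a lightweight variant of projective positional encoding, so that generation follows your camera instructions closely. Third, a mechanism called Memory-Conditioned Scene Persistence retrieves views the model generated earlier, so that the scene stays consistent when you look back.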
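To make the third point concrete, here is a minimal Python sketch of what retrieving earlier views by camera geometry could look like. It is an illustration only, not the paper's method: the abstract says only that earlier views are retrieved "through camera-geometry-based retrieval", so the names `MemoryEntry`, `view_similarity`, and `retrieve_views`, the scoring formula, and the `distance_scale` parameter are all assumptions made for this example.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class MemoryEntry:
    """Hypothetical record of one previously generated view."""
    frame_id: int
    position: Vec3  # camera center in world coordinates
    forward: Vec3   # viewing direction (need not be unit length)


def _dot(a: Vec3, b: Vec3) -> float:
    return sum(x * y for x, y in zip(a, b))


def _normalize(v: Vec3) -> Vec3:
    norm = math.sqrt(_dot(v, v))
    if norm == 0.0:
        raise ValueError("viewing direction must be non-zero")
    return tuple(x / norm for x in v)


def view_similarity(query: MemoryEntry, entry: MemoryEntry,
                    distance_scale: float = 5.0) -> float:
    """Assumed score in [0, 1]: high when two cameras are close together
    and look in a similar direction."""
    distance = math.dist(query.position, entry.position)
    proximity = math.exp(-distance / distance_scale)
    # Cosine of the angle between viewing directions, clamped at 0 so that
    # cameras facing opposite ways contribute nothing.
    alignment = max(0.0, _dot(_normalize(query.forward),
                              _normalize(entry.forward)))
    return proximity * alignment


def retrieve_views(query: MemoryEntry, memory: List[MemoryEntry],
                   top_k: int = 2) -> List[MemoryEntry]:
    """Return the top_k stored views most similar to the query camera."""
    ranked = sorted(memory, key=lambda e: view_similarity(query, e),
                    reverse=True)
    return ranked[:top_k]


memory = [
    MemoryEntry(0, (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)),   # same spot, same direction
    MemoryEntry(1, (10.0, 0.0, 0.0), (0.0, 0.0, 1.0)),  # far away
    MemoryEntry(2, (0.5, 0.0, 0.0), (0.0, 0.0, -1.0)),  # nearby, facing backwards
    MemoryEntry(3, (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)),   # nearby, same direction
]
query = MemoryEntry(99, (0.0, 0.0, 0.0), (0.0, 0.0, 1.0))
print([e.frame_id for e in retrieve_views(query, memory)])  # [0, 3]
```

In this toy setup the far-away view and the backward-facing view are ranked below the two nearby views that look the same way. A real system would presumably compare view frustums or projected overlap rather than a simple distance-times-angle score.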

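On the paper's 5-second basic evaluation, DreamX-World 1.0 scores 73.75 on camera control and 84.76 overall, ahead of the comparable models HY-WorldPlay 1.5 (80.79 overall) and LingBot-World (80.45 overall). It also reaches up to 16 frames per second on eight RTX 5090 GPUs.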

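This is not a tool you can use tomorrow, but it points to a trend: AI video generation is moving from "random generation" toward "controllable creation." In the future, you may be able to direct AI-generated footage with something like an editor's frame-by-frame precision.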

📄 Original Abstract

DreamX-World 1.0 is a general-purpose interactive text/image-to-video world model for controllable long-horizon generation. It supports camera navigation, revisits to previously observed regions, and promptable events across photorealistic, game-style, and stylized domains. Our data engine combines camera-accurate Unreal Engine rendering, action-rich gameplay recordings, and real-world videos with recovered camera geometry. For camera control, we introduce E-PRoPE, a lightweight variant of projective positional encoding that retains PRoPE's projective camera geometry while applying camera-aware attention to spatially reduced tokens. We convert a bidirectional video generator into a few-step autoregressive world model using causal forcing, DMD-style distillation, and long-rollout training. Training on self-generated long-horizon contexts exposes the model to its own generated history and reduces the style and color drift that accumulates across autoregressive chunks. Memory-Conditioned Scene Persistence retrieves earlier views through camera-geometry-based retrieval, while residual recycling makes the conditioning path less sensitive to imperfect memory latents. Event Instruction Tuning adds composable event control, and reinforcement learning alignment recovers camera control and visual quality after distillation. With mixed-precision DiT execution, residual reuse, 75%-pruned VAE decoding, and asynchronous pipeline parallelism, DreamX-World 1.0 reaches up to 16 FPS on eight RTX 5090 GPUs. On our 5-second basic evaluation, DreamX-World 1.0 achieves a camera-control score of 73.75 and an overall score of 84.76, outperforming HY-WorldPlay 1.5 and LingBot-World in overall score, which achieve 80.79 and 80.45, respectively.

Original paper on arXiv
