📄 Paper Digest

Getting a Robot to Imagine the Future from a Single Sentence

Tags: world models, video generation, robotics, language control, unified framework

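Tell a robot "push the cup to the left" and the model generates a short video of what happens over the next few seconds: the cup slides, the hand moves, the scene changes. This is not real footage; it is the model's "imagined" prediction of how the physical world will evolve. Qwen-RobotWorld brings robotic manipulation, autonomous driving, indoor navigation, and other scenarios into a single model, with natural language acting as the remote control. Under the hood, a 60-layer double-stream diffusion transformer fuses language and visual information layer by layer, trained on a corpus of 8.6 million video-text pairs covering 20+ robot embodiments and 500+ action categories. It ranks first overall on EWMBench and DreamGen Bench, and outperforms all open-source models on WorldModelBench and PBench. But it is not something you can put to work tomorrow. It is better understood as a "world simulator" that helps robots train, get evaluated, and plan, rather than a controller that drives a robot directly.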
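To make "fusing language and vision layer by layer" more concrete, here is a minimal NumPy sketch of a double-stream block with joint attention, the mechanism the abstract names. It is an illustration under stated assumptions, not the paper's implementation: the class `DoubleStreamBlock`, the helper `run_stack`, the single attention head, the random weights, and the absence of normalization, MLPs, and diffusion-timestep conditioning are all simplifications introduced here. The one idea it shows is that each stream keeps its own projections while attention runs over the concatenated text and video tokens.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class DoubleStreamBlock:
    """Hypothetical block: per-stream weights, one shared (joint) attention."""

    def __init__(self, dim, rng):
        scale = dim ** -0.5
        # Separate Q/K/V and output projections for each stream (assumed shapes).
        self.txt_qkv = rng.normal(0.0, scale, (dim, 3 * dim))
        self.vid_qkv = rng.normal(0.0, scale, (dim, 3 * dim))
        self.txt_out = rng.normal(0.0, scale, (dim, dim))
        self.vid_out = rng.normal(0.0, scale, (dim, dim))

    def __call__(self, txt, vid):
        n_txt = txt.shape[0]
        tq, tk, tv = np.split(txt @ self.txt_qkv, 3, axis=-1)
        vq, vk, vv = np.split(vid @ self.vid_qkv, 3, axis=-1)
        # Joint attention: every token attends over text AND video tokens.
        q = np.concatenate([tq, vq], axis=0)
        k = np.concatenate([tk, vk], axis=0)
        v = np.concatenate([tv, vv], axis=0)
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        mixed = attn @ v
        # Split back and apply each stream's own output projection (residual).
        txt = txt + mixed[:n_txt] @ self.txt_out
        vid = vid + mixed[n_txt:] @ self.vid_out
        return txt, vid


def run_stack(txt, vid, num_layers=4, seed=0):
    # The paper reports 60 layers; a small default keeps this toy stable
    # because the sketch has no normalization layers.
    rng = np.random.default_rng(seed)
    for _ in range(num_layers):
        txt, vid = DoubleStreamBlock(txt.shape[-1], rng)(txt, vid)
    return txt, vid


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    text_tokens = rng.normal(size=(8, 32))     # stand-in for instruction embeddings
    video_latents = rng.normal(size=(64, 32))  # stand-in for video-VAE latents
    txt_out, vid_out = run_stack(text_tokens, video_latents)
    print(txt_out.shape, vid_out.shape)
```

Because the video tokens attend to the text tokens in every block, changing the instruction changes the predicted video latents, which is the sense in which language works as a "remote control" here.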

📄 Original Abstract

We introduce Qwen-RobotWorld, a language-conditioned video world model for embodied intelligence. With natural language as a unified action interface, it predicts physically grounded future visual trajectories from current observations across robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer. This unified formulation provides three promising application directions: synthetic data generation for policy training augmentation, scalable virtual environments for policy evaluation, and language-guided planning signals for downstream robot control. This is achieved through a three-part design: a) Double-Stream MMDiT with MLLM Action Encoding, where a 60-layer double-stream diffusion transformer couples frozen Qwen2.5-VL semantics with video-VAE latents through layer-wise joint attention; b) Embodied World Knowledge (EWK), an 8.6M video-text corpus (200M+ frames) with action-language mapping over 20+ embodiments and 500+ action categories; and c) General+Expert Progressive Curriculum, a two-stage training strategy that first learns general visual priors and then injects embodied specialization under a shared language interface. Extensive results show strong competitiveness: ranks 1st overall on EWMBench and DreamGen Bench, outperforms all open-source models on WorldModelBench and PBench. Additional zero-shot analyses on RoboTwin-IF benchmark further support robust generalization and multi-view consistency.

Original paper on arXiv
