AI Learns a "World Model": Understanding the State of the World, Not Just Predicting the Next Frame
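Most AI models do one thing: predict the next token, the next frame, or the next action. Orca breaks down that separation. It learns a unified "world latent space" that compresses video, language, action, and other signals into a single internal representation, then uses lightweight decoders to handle different tasks such as text generation, image prediction, and robot action generation. Its training recipe is also interesting. One part, called unconscious learning, captures dense state changes from continuous video. The other, called conscious learning, learns sparse but meaningful state transitions from language-described events and visual question answering (VQA) supervision. Pre-training used 125K hours of video and 160M event annotations, but once trained, the backbone is frozen and downstream tasks only need to train small modality-specific decoders. The experiments indicate that a stronger world latent yields stronger downstream performance, and that Orca outperforms similar-sized baselines specialized for individual tasks. This is not a tool you can use tomorrow, but it points in a direction: AI that goes beyond pattern matching and starts to build a general understanding of how the world works.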
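To make the "frozen backbone, small trainable decoders" idea concrete, here is a minimal toy sketch in plain Python. It is not Orca's architecture or training code: the backbone is a fixed random linear map standing in for a pretrained encoder, the two tasks (`toy_action`, `toy_score`) and all dimensions are hypothetical, and the readouts are simple linear heads trained by stochastic gradient descent. It only illustrates the pattern of several decoders sharing one latent while the backbone's weights never change.

```python
import random

random.seed(0)

INPUT_DIM = 3    # hypothetical size of a raw observation
LATENT_DIM = 4   # hypothetical size of the shared latent


def make_backbone():
    """Stand-in for a pretrained backbone: a fixed random linear map."""
    return [[random.uniform(-1, 1) for _ in range(INPUT_DIM)]
            for _ in range(LATENT_DIM)]


def encode(backbone, x):
    """Map an observation to the shared latent; no weights are updated here."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in backbone]


def readout(head, z):
    """Lightweight linear decoder on top of the latent."""
    return sum(h * zi for h, zi in zip(head, z))


def mse(backbone, head, data):
    """Mean squared error of a head over a dataset of (x, y) pairs."""
    errs = [(readout(head, encode(backbone, x)) - y) ** 2 for x, y in data]
    return sum(errs) / len(errs)


def train_head(backbone, data, steps=2000, lr=0.05):
    """Fit only the head with SGD on squared error; the backbone stays frozen."""
    head = [0.0] * LATENT_DIM
    for _ in range(steps):
        x, y = random.choice(data)
        z = encode(backbone, x)
        err = readout(head, z) - y
        head = [h - lr * err * zi for h, zi in zip(head, z)]
    return head


backbone = make_backbone()
frozen_copy = [row[:] for row in backbone]  # snapshot to verify nothing changes

# Two hypothetical downstream tasks that read from the same latent.
tasks = {
    "toy_action": lambda x: 2.0 * x[0] - x[1],
    "toy_score": lambda x: x[0] + 0.5 * x[2],
}

observations = [[random.uniform(-1, 1) for _ in range(INPUT_DIM)]
                for _ in range(50)]

results = {}
for name, target in tasks.items():
    data = [(x, target(x)) for x in observations]
    before = mse(backbone, [0.0] * LATENT_DIM, data)
    head = train_head(backbone, data)
    results[name] = (before, mse(backbone, head, data))

for name, (before, after) in results.items():
    print(f"{name}: loss before={before:.4f}, after={after:.4f}")
```

Each task gets its own small head, while `backbone` is only ever read, never written. In a real system the backbone would be a large pretrained network and the decoders would be modality-specific, but the division of labor is the same.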
📄 Original abstract
We introduce Orca, an initial instantiation of a general world foundation model. Orca learns a unified world latent space from multimodal world signals and exposes it through multimodal readout interfaces. Rather than optimizing isolated next-token, next-frame, or next-action prediction, we are centered on Next-State-Prediction modeling, offering a unified state-transition modeling route toward understanding, predicting, and acting upon the world. Orca learns through two complementary paradigms: unconscious learning captures dense natural state transitions from continuous videos, and conscious learning models sparse meaningful state transitions by language-described events and VQA supervision. For pre-training, we construct a large-scale world-learning inventory data, including 125K hours of video data and 160M event annotations. After pre-training, Orca learns a unified world latent space. To examine whether the learned latent supports downstream, we evaluate it by three representative downstream readouts: text generation, image prediction, and embodied action generation. Orca's backbone is frozen, and only the lightweight modality-specific decoders are trainable. Experiments show the scalability of the proposed paradigm and verify that stronger world latent enables stronger downstream readouts. Orca outperforms similar-sized specialized baselines. These results show that Orca, as a general world foundation model, presents a promising approach to understanding, predicting, and acting upon the world. Finally, we discuss the current limitations, aiming to provide useful insights and inspiration for the community.