AI Teamwork: Letting Agents for Different Modalities Each Do Their Own Job
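Today's AI agents often handle only one kind of information well, such as text or images. Real-world tasks, however, frequently require understanding text, images, audio, and video together. The Orchestra-o1 framework proposed in this paper acts like a capable “conductor”: it breaks a complex task into sub-tasks, assigns them to sub-agents specialized in different modalities, and lets those sub-agents work in parallel and cooperate to finish the job. On the OmniGAIA benchmark, it beats the second-best approach by 10.3% in accuracy. You won't be using it tomorrow, but it shows how future AI systems could divide up work and collaborate like a human team to handle more complex real-world problems.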
📄 Original abstract (English)
The recent success of agent swarms has shifted the paradigm of large language model (LLM)-based agents from single-agent workflows to multi-agent systems, highlighting the importance of agent orchestration for task decomposition and collaboration. However, existing orchestration frameworks are limited to a narrow set of modalities and struggle to generalize to more complex settings where heterogeneous modalities coexist and interact. This limitation becomes particularly pronounced in omnimodal scenarios, where tasks require the unified understanding and coordination of diverse inputs such as text, image, audio, and video. In this work, we propose Orchestra-o1, an omnimodal agent orchestration framework designed to support efficient agent collaboration across multiple modalities. Orchestra-o1 introduces a unified orchestration mechanism that enables modality-aware task decomposition, online sub-agent specialization, and parallel sub-task execution. This scalable design allows agent systems to effectively tackle complex real-world tasks involving heterogeneous information sources, surpassing the second-best approach by 10.3% accuracy on the OmniGAIA benchmark. Furthermore, we introduce decision-aligned group relative policy optimization (DA-GRPO), an efficient agentic reinforcement learning approach for training Orchestra-o1-8B, which also achieves state-of-the-art performance against all existing open-source omnimodal agents.
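As a rough illustration of the decompose-then-run-in-parallel idea described in the abstract, here is a minimal Python sketch. It is not the paper's implementation: `SubTask`, `decompose`, `run_sub_agent`, and `orchestrate` are hypothetical names, the decomposition is a trivial one-sub-task-per-modality split, and the sub-agents are stubs that only sleep and return a string. Online sub-agent specialization and DA-GRPO training are not modeled here.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class SubTask:
    modality: str  # e.g. "text", "image", "audio", "video"
    payload: str   # hypothetical reference to the input, such as a file name


def decompose(task: dict[str, str]) -> list[SubTask]:
    """Toy modality-aware decomposition: one sub-task per input modality."""
    return [SubTask(modality, payload) for modality, payload in task.items()]


async def run_sub_agent(sub_task: SubTask) -> str:
    """Stand-in for a modality-specific sub-agent; a real one would call a model."""
    await asyncio.sleep(0.01)  # simulate I/O-bound model latency
    return f"[{sub_task.modality}] processed {sub_task.payload}"


async def orchestrate(task: dict[str, str]) -> list[str]:
    """Split the task, then run all sub-agents concurrently."""
    sub_tasks = decompose(task)
    # gather() schedules the coroutines concurrently and keeps input order
    return await asyncio.gather(*(run_sub_agent(s) for s in sub_tasks))


if __name__ == "__main__":
    demo_task = {"text": "question.txt", "image": "chart.png", "audio": "call.wav"}
    for line in asyncio.run(orchestrate(demo_task)):
        print(line)
```

Because the stub sub-agents only wait on simulated I/O, `asyncio.gather` is enough to overlap them; a system with compute-heavy local models would likely need processes or separate services instead.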