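One model for every navigation task, with mode switching on the fly
Today's robot navigation models typically do just one thing: follow instructions, search for objects, or drive autonomously. In real deployments, though, a robot has to switch tasks at any moment: for example, first find a target object, then track it, and finally drive itself to a destination. This paper has a single model handle all of these tasks, with no model swap and no code changes.
The researchers designed a parameterised interface that splits navigation behaviour into two adjustable dimensions: task mode (what to do right now) and observation parameters (how much of the visual history to take in and how, e.g. token budget and per-camera weights). These parameters are randomly combined during training, so the model learns to work under any configuration. More importantly, they train on 15.6M samples and mix in vision-language data, which keeps the model from collapsing into a purely reactive "action-sequence mapper".
Results: new state-of-the-art numbers on multiple navigation benchmarks, performance that keeps improving as the model scales from 2B to 8B parameters, and zero-shot transfer to real robots.
This is not a technology you can use tomorrow, but it points to an important trend: future robots may no longer need a separately trained model for each task, and could instead rely on one general model plus dynamic configuration to handle every scenario.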
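To make the two-dimensional interface concrete, here is a minimal Python sketch of what a configuration object and its training-time randomisation might look like. It is not the paper's code: `TaskMode`, `NavConfig`, `sample_config`, the budget values, and the camera count are all hypothetical names and numbers chosen for illustration. Only the four task families and the two example parameters (token budget, per-camera weights) come from the abstract.

```python
import random
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class TaskMode(Enum):
    # The four task families named in the abstract.
    INSTRUCTION_FOLLOWING = "instruction_following"
    OBJECT_SEARCH = "object_search"
    TARGET_TRACKING = "target_tracking"
    AUTONOMOUS_DRIVING = "autonomous_driving"


@dataclass(frozen=True)
class NavConfig:
    """Hypothetical container for the two configurable dimensions."""
    task_mode: TaskMode                # what the model should do
    token_budget: int                  # assumed: cap on visual-history tokens
    camera_weights: Tuple[float, ...]  # assumed: per-camera share, sums to 1


def sample_config(rng, num_cameras=4, budgets=(256, 512, 1024, 2048)):
    """Draw one random configuration, as one might per training sample."""
    mode = rng.choice(list(TaskMode))
    budget = rng.choice(budgets)
    # Random positive weights, normalised so they sum to 1.
    raw = [rng.uniform(0.1, 1.0) for _ in range(num_cameras)]
    total = sum(raw)
    weights = tuple(w / total for w in raw)
    return NavConfig(mode, budget, weights)


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(sample_config(rng))
```

The idea this sketch tries to capture is that the model sees a different (mode, budget, weights) combination on each training sample, so no single configuration is special at inference time.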
📄 Original abstract
Agentic navigation systems require a base navigation model whose observation strategy can be externally reconfigured at inference time, because instruction following, object search, target tracking, and autonomous driving share the same perception-planning backbone yet demand fundamentally different strategies for consuming the visual stream. We present Qwen-RobotNav, a scalable navigation model built on Qwen-RobotNav that addresses it through a parameterised interface with two complementary dimensions: multiple task modes that select the navigation behaviour, and controllable observation parameters (e.g., token budget, per-camera weights) that govern how visual history is encoded. With training-time randomization over all parameters, Qwen-RobotNav is robust to any inference-time configuration requiring zero architectural modification to the Qwen-RobotNav backbone. We train Qwen-RobotNav on 15.6M samples; co-training with vision-language data prevents the collapse into reactive action-sequence mappers observed in trajectory-only training. The parameterised interface also makes Qwen-RobotNav a natural building block for agentic systems: for long-horizon scenarios, an upper-level planner decomposes goals into sub-tasks and dynamically switches Qwen-RobotNav's task mode and context strategy mid-episode, composing complex behaviours from repeated calls to the same model. Extensive experiments show that Qwen-RobotNav sets new state-of-the-art results across major navigation benchmarks. The model exhibits favourable scaling from 2B to 8B parameters, with joint multi-task training developing a shared spatial-planning substrate that transfers across task families, and demonstrates strong zero-shot generalisation to real-world robots across diverse environments.
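The agentic use described in the abstract, where an upper-level planner switches the task mode and context strategy mid-episode while calling the same model repeatedly, can be sketched roughly as follows. Everything here is an assumption made for illustration: `SubTask`, `run_episode`, and `StubModel` are hypothetical, the model is replaced by a stub that returns dummy actions, and the mode names and token budgets are placeholders rather than the paper's actual interface.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SubTask:
    """Hypothetical sub-task emitted by an upper-level planner."""
    task_mode: str      # which navigation behaviour to select
    token_budget: int   # assumed context-strategy knob
    goal: str           # free-form goal for this sub-task


class StubModel:
    """Stand-in for the navigation model: alternates 'forward' and 'stop'."""

    def __init__(self):
        self.calls = 0

    def __call__(self, task_mode, token_budget, goal):
        self.calls += 1
        return "stop" if self.calls % 2 == 0 else "forward"


def run_episode(model, plan: List[SubTask], max_steps=3) -> List[Tuple[str, str]]:
    """Run each sub-task in order, reusing one model with a new config each time."""
    log = []
    for sub in plan:
        for _ in range(max_steps):
            action = model(sub.task_mode, sub.token_budget, sub.goal)
            log.append((sub.task_mode, action))
            if action == "stop":  # sub-task finished, move to the next mode
                break
    return log


if __name__ == "__main__":
    plan = [
        SubTask("object_search", 512, "find the red backpack"),
        SubTask("target_tracking", 256, "follow the person carrying it"),
        SubTask("autonomous_driving", 1024, "drive to the loading dock"),
    ]
    for mode, action in run_episode(StubModel(), plan):
        print(mode, action)
```

The point of the sketch is only that the loop never swaps models: the planner changes the arguments, and the same callable serves every sub-task.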