AI agents planning across more than a thousand tools break down when something fails
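Today's AI assistants can call large numbers of tools to complete complex tasks, but they struggle as soon as a tool misbehaves. A new benchmark, PlanBench-XL, builds an interactive retail environment of 327 tasks over 1,665 tools and tests whether agents can adjust their plans on their own when tools are missing, failing, or distracting. The headline result: GPT-5.4, one of ten leading models tested, reaches 51.90% accuracy with no blocking but falls to 11.36% under the most severe blocking condition. Agents are especially vulnerable when a failure produces no explicit error signal, or when recovery requires a longer alternative tool-use path. This is not something you can put to use tomorrow, but it exposes how fragile current AI agents are in realistic settings: in large, imperfect tool environments, an unexpected failure is often enough to derail the whole plan.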
📄 Original abstract
LLM agents increasingly operate in large tool ecosystems, where real-world tasks require discovering relevant tools, inferring implicit sub-goals, and adapting to dynamic environments over long horizons. However, existing benchmarks rarely evaluate planning under retrieval-limited tool visibility. To address this gap, we introduce PlanBench-XL, an interactive benchmark of 327 retail tasks over 1,665 tools that tests whether agents can iteratively retrieve usable tools, invoke them to uncover intermediate evidence for subsequent calls toward the final goal. PlanBench-XL further features an optional blocking mechanism that simulates real-world unpredictability through missing, failing, or distracting tool functions, forcing agents to detect disrupted paths and adapt at runtime. Experiments on ten leading LLMs show that massive-tool planning remains challenging: while GPT-5.4 achieves 51.90% accuracy in block-free settings, it collapses to 11.36% under the most severe blocking condition. Further analysis shows that agents are especially vulnerable when failures lack explicit error signals or when recovery requires longer alternative tool-use paths. These results establish PlanBench-XL as a testbed for diagnosing agentic planning failures and highlight the need for robust adaptive planning in long-horizon tasks with large, imperfect tool environments.