AI agents stumble in large tool libraries: a new benchmark exposes a critical weakness
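AI agents are being dropped into ever-larger tool libraries to book flights, check inventory, or arrange logistics, but large tool ecosystems make it easy for them to get lost. The newly released PlanBench-XL benchmark tested ten leading large language models on 327 retail tasks spanning 1,665 tools, and the results are striking: the strongest model, GPT-5.4, reached 51.90% accuracy in the block-free setting, yet fell to 11.36% under the most severe blocking condition, which simulates real-world unpredictability through missing, failing, or distracting tools. The core problem is that agents do not know how to "detour": when the tool chain they depend on breaks, they struggle to improvise an alternative path the way a person would, and they are especially vulnerable when a failure produces no explicit error signal or when recovery requires a longer alternative path. PlanBench-XL is a diagnostic testbed rather than a product you can use tomorrow, but it points to an important trend: the tool-use ability of AI agents is far less reliable than it looks, particularly in complex and unpredictable real-world settings.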
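To put the two reported figures in perspective, the short calculation below derives the absolute and relative drop in accuracy. Only the two accuracy values come from the abstract; the variable names are illustrative, and the derived numbers are simple arithmetic rather than results reported by the authors.

```python
# Accuracy figures for GPT-5.4 as reported in the PlanBench-XL abstract (percent).
block_free_accuracy = 51.90
severe_blocking_accuracy = 11.36

# Absolute drop in percentage points.
absolute_drop = block_free_accuracy - severe_blocking_accuracy

# Relative drop: the share of block-free accuracy that is lost under blocking.
relative_drop = absolute_drop / block_free_accuracy

print(f"Absolute drop: {absolute_drop:.2f} percentage points")  # 40.54
print(f"Relative drop: {relative_drop * 100:.1f}%")             # 78.1%
```

In other words, roughly four out of five tasks that the model solved in the block-free setting are no longer solved under the most severe blocking condition, assuming the two accuracies are measured over the same task set.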
📄 Original abstract (English)
LLM agents increasingly operate in large tool ecosystems, where real-world tasks require discovering relevant tools, inferring implicit sub-goals, and adapting to dynamic environments over long horizons. However, existing benchmarks rarely evaluate planning under retrieval-limited tool visibility. To address this gap, we introduce PlanBench-XL, an interactive benchmark of 327 retail tasks over 1,665 tools that tests whether agents can iteratively retrieve usable tools, invoke them to uncover intermediate evidence for subsequent calls toward the final goal. PlanBench-XL further features an optional blocking mechanism that simulates real-world unpredictability through missing, failing, or distracting tool functions, forcing agents to detect disrupted paths and adapt at runtime. Experiments on ten leading LLMs show that massive-tool planning remains challenging: while GPT-5.4 achieves 51.90% accuracy in block-free settings, it collapses to 11.36% under the most severe blocking condition. Further analysis shows that agents are especially vulnerable when failures lack explicit error signals or when recovery requires longer alternative tool-use paths. These results establish PlanBench-XL as a testbed for diagnosing agentic planning failures and highlight the need for robust adaptive planning in long-horizon tasks with large, imperfect tool environments.
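The kind of runtime adaptation the abstract describes can be illustrated with a toy sketch. This is not PlanBench-XL's code or API: the tool functions, the `is_usable` check, and `call_with_fallback` are all hypothetical names invented for illustration. The sketch assumes three interchangeable stock-lookup tools, one that fails with an explicit error, one that fails silently by returning no usable data, and one that works.

```python
from typing import Callable, Optional

# Hypothetical toy tools; none of these exist in PlanBench-XL.
def stock_primary(sku: str) -> dict:
    # Explicit failure: the caller sees an exception.
    raise RuntimeError("service unavailable")

def stock_mirror(sku: str) -> Optional[dict]:
    # Silent failure: no error is raised, but nothing usable is returned.
    return None

def stock_backup(sku: str) -> dict:
    # Working alternative tool.
    return {"sku": sku, "in_stock": 7}

def is_usable(result: object) -> bool:
    """Validate the result instead of trusting the absence of an error."""
    return isinstance(result, dict) and "in_stock" in result

def call_with_fallback(
    tools: list[tuple[str, Callable[[str], Optional[dict]]]], sku: str
) -> tuple[Optional[dict], list[str]]:
    """Try each candidate tool in order until one returns a usable result."""
    trace: list[str] = []
    for name, tool in tools:
        try:
            result = tool(sku)
        except Exception as exc:
            trace.append(f"{name}: explicit error ({exc})")
            continue
        if not is_usable(result):
            trace.append(f"{name}: silent failure")
            continue
        trace.append(f"{name}: ok")
        return result, trace
    return None, trace

TOOLS = [
    ("stock_primary", stock_primary),
    ("stock_mirror", stock_mirror),
    ("stock_backup", stock_backup),
]

result, trace = call_with_fallback(TOOLS, "A1")
print(result)
for line in trace:
    print(line)
```

The explicit check in `is_usable` is the point of the sketch: a silent failure can only be caught if the agent validates what a tool returned rather than assuming that no exception means success. A real agent faces a harder version of this problem, since it must also discover the alternative tools through retrieval instead of receiving them as a fixed list.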