📄 Paper Digest

AI Leaderboards Are Misleading You: A High Score Does Not Mean a Useful Agent

Tags: AI evaluation, agents, leaderboards, predictive validity, deployment

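The agent leaderboards you rely on may be misleading you. This paper finds that no single agent benchmark covers more than four or five of the dimensions that real deployment exposes, and that rankings built on aggregate scores do not transfer to out-of-distribution settings: an agent that sits near the top on a public test set can drop sharply on hidden tasks. The authors propose ranking by "predictive validity" instead of mean score, that is, by how well a configuration's in-sample rank correlates with its out-of-sample rank. They also report a twelve-tier measurement apparatus that exposes deployment-relevant dimensions which current evaluations collapse; the underlying studies span multi-modal extensions, reasoning modes, and infrastructure optimizations, among others. The authors themselves note that existing evidence only partly supports their position and is too thin to confirm it. This is not a tool you can use tomorrow, but it helps explain why the AI assistant you are trying out seems sharp one moment and clumsy the next: the leaderboard that recommended it may not predict how it behaves on your tasks.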
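To make "predictive validity" concrete, here is a minimal sketch that computes a rank correlation between in-sample and out-of-sample scores for a handful of agent configurations. The abstract only says "correlation between in-sample and out-of-sample rank"; the choice of Spearman's rank correlation, the function names, and the toy scores below are all assumptions for illustration, not the authors' implementation or data.

```python
from typing import Dict, List


def ranks(scores: Dict[str, float]) -> Dict[str, float]:
    """Rank configurations by score (1 = best); ties share the average rank."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    result: Dict[str, float] = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of configurations tied with ordered[i].
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[ordered[k]] = avg_rank
        i = j + 1
    return result


def pearson(xs: List[float], ys: List[float]) -> float:
    """Plain Pearson correlation; applied to ranks it yields Spearman's rho."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all ranks are equal")
    return cov / (var_x * var_y) ** 0.5


def predictive_validity(in_sample: Dict[str, float],
                        out_of_sample: Dict[str, float]) -> float:
    """Spearman correlation between in-sample and out-of-sample ranks (assumed metric)."""
    if set(in_sample) != set(out_of_sample):
        raise ValueError("both score tables must cover the same configurations")
    names = sorted(in_sample)
    r_in, r_out = ranks(in_sample), ranks(out_of_sample)
    return pearson([r_in[n] for n in names], [r_out[n] for n in names])


# Hypothetical scores: configuration "A" tops the public set but is last on hidden tasks.
in_sample_scores = {"A": 0.90, "B": 0.80, "C": 0.70, "D": 0.60}
out_of_sample_scores = {"A": 0.40, "B": 0.70, "C": 0.65, "D": 0.50}

print(predictive_validity(in_sample_scores, out_of_sample_scores))  # about -0.2
```

A value near 1 would mean the public ranking carries over to unseen tasks; a value near 0 or below, as in this toy case, means the leaderboard order says little about deployment. With real data, `scipy.stats.spearmanr` would be the more usual way to compute the same quantity.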

📄 Original Abstract (English)

Agent benchmarks are growing fast, but no single benchmark touches more than four or five of the dimensions that deployment exposes. This paper aggregates the largest coordinated deep-dive of one MCP-based industrial-agent benchmark to date: fourteen parallel implementation studies covering new asset classes (including a multi-modal visual extension), alternative orchestrations, retrieval strategies, reasoning modes, infrastructure optimizations, and evaluation-methodology probes. Consolidating those studies with seven prior agent benchmarks, we argue that aggregate-score leaderboards systematically underspecify deployed-agent evaluation. Rankings derived from aggregate scores do not transfer to out-of-distribution settings; recent public-to-hidden competition retrospectives provide direct empirical evidence of this rank instability. We propose ranking configurations by predictive validity, the correlation between in-sample and out-of-sample rank, rather than in-sample mean, and report a twelve-tier measurement apparatus that exposes the deployment-relevant dimensions HELM and its agent-era successors collapse. The position is operationalized through three falsifiable out-of-distribution criteria with explicit thresholds; existing evidence partly supports it but is too thin to confirm. We close with a pre-registered pilot design and a field-level vision for what the next generation of agentic benchmarks should report.

Original paper on arXiv
