📄 Paper Explainer

AI aces every exam but gets exposed on day one at work? A new benchmark tests "real work"

Trending ▲ 68 · AI benchmarks · Real-world tasks · Economic value · Industry applications · Performance gap

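AI scores high on all kinds of exams, then falls flat in real work settings. This paper argues the problem lies in the exams themselves: existing benchmarks measure quick, short tasks that are disconnected from real economic value. Working with more than 250 industry experts, the authors built a new benchmark, Agents' Last Exam (ALE), covering 13 industry clusters and 1,000+ real workflow tasks, from drafting legal documents to financial analysis, all of them jobs that take many steps over a long horizon to finish. The result? On the hardest tier, the average full pass rate across mainstream harness and backbone configurations is just 2.6%. This is not a tool you can start using tomorrow, but it does send a signal: don't be fooled by leaderboard scores, because AI is still a long way from actually doing your job for you.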

📄 Original Abstract (English)

Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows. This paper introduces Agents' Last Exam (ALE), a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. Developed in collaboration with 250+ industry experts, ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy). It is organized around a task taxonomy with 55 subfields grouped into 13 industry clusters covering 1K+ tasks. Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is 2.6%. ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded. More broadly, ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP-relevant impact.

Original paper on arXiv
