📄 Paper Digest

AI agents succeed on only 19% of complex tasks; humans exceed 80%

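AI agents already score close to perfect on simple tasks, but they fall apart as soon as a scenario calls for temporal perception, graphical understanding, or 3D reasoning. The researchers built a benchmark called GauntletBench that covers five professional applications (Video Editor, Workflow Builder, 3D Modeller, Flight Analyser, and Circuit Designer), each with 20 vision-intensive tasks, 100 in total. The strongest AI agent achieved a success rate of only 19.1%, while non-expert human annotators exceeded 80%. This is not something you can use tomorrow, but it draws a line: don't be fooled by how well AI does at chat and coding. In scenarios that require genuinely understanding the world and processing complex visual information, it is still far behind humans.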

📄 Original Abstract (English)

As agentic systems continue to evolve and are widely deployed in real-world scenarios, there is a growing demand to faithfully evaluate their capabilities. However, current benchmarks are typically built on popular applications with relatively simple tasks and focus on a narrow set of capabilities while overlooking broader dimensions, resulting in saturated performance on modern agents and failing to probe their limitations. To this end, we introduce GauntletBench, a web-based benchmark for evaluating agent generalisation in challenging scenarios, focusing on three underexplored capabilities (temporal perception, graphical understanding, and 3D reasoning), across five less-covered professional applications (Video Editor, Workflow Builder, 3D Modeller, Flight Analyser, and Circuit Designer), each with 20 vision-intensive tasks (100 in total). Our benchmark provides a modular pipeline that comprises an environment compatible with both open- and closed-source agent frameworks, a controlled web-based application, a well-structured task suite, and an automated evaluation engine with diverse metrics. Contrary to widespread expectations, our empirical results reveal that frontier agentic systems remain far from achieving human-level performance. Even the state-of-the-art agent achieves only a 19.1% success rate on our GauntletBench, highlighting the limitations in these overlooked capabilities and generalisation. By comparison, non-expert human annotators achieve over 80% success on our challenging yet feasible tasks, revealing the substantial gap between current agent capabilities and those required for complex real-world scenarios.

Original paper on arXiv
