AI Operating a Computer: Is Clicking the Mouse Worse Than Writing Commands?
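We tend to assume that having an AI use a graphical interface (GUI) and click the mouse like a person is the more natural approach. A new study complicates that picture: on matched tasks with identical goals, initial states, and final-state verifiers, the strongest GUI agent reaches a 59.1% full pass rate, while the strongest command-line (CLI) agent using its original skills reaches only 48.2%. Once the CLI agent's missing "skills" (for example, commands for a specific application) are filled in through verifier-guided augmentation, its success rate rises to 69.3%, overtaking the GUI agent. The takeaway: the GUI bottleneck is clicking reliably across long workflows, whereas the CLI bottleneck is an incomplete skill library, and the latter looks easier to fix by extending coverage. This is not a trick you can use tomorrow, but it exposes a key trade-off in AI office automation: teach the AI to click, or teach it to write commands?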
📄 Original abstract (English)
Computer-use agents can execute software tasks through either graphical interfaces or programmatic command interfaces, but existing evaluations confound interaction modality with differences in tasks, initial states, verifiers, and permitted actions. We introduce a matched execution-layer benchmark of 440 desktop tasks across 18 applications and 12 workflow categories, where screen-only GUI agents and skill-mediated CLI agents receive identical goals, states, and final-state verifiers while being restricted to modality-native actions. In this controlled setting, the strongest GUI agent reaches a 59.1% full pass rate, outperforming the strongest original-skill CLI agent at 48.2%; however, verifier-guided skill augmentation raises CLI success to 69.3%, showing that much of the CLI deficit comes from incomplete skill coverage rather than model capability alone. These results suggest that GUI and CLI expose different execution bottlenecks: GUI agents are limited by reliable grounded interaction over long-horizon workflows, whereas CLI agents are limited by the coverage and scalability of their skill interfaces.
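To put the three pass rates in the abstract on a common footing, here is a minimal Python sketch that converts them into percentage-point gaps and approximate task counts. The rates and the 440-task total come from the abstract; the task counts are an inference obtained by rounding rate × 440, not numbers the paper reports, and the dictionary keys and function names are made up for this illustration.

```python
# Back-of-the-envelope arithmetic on the pass rates quoted in the abstract.
# ASSUMPTION: task counts are inferred by rounding rate * 440; the abstract
# itself reports percentages only.

TOTAL_TASKS = 440  # benchmark size stated in the abstract

# Full pass rates (%) stated in the abstract; key names are illustrative.
PASS_RATES = {
    "gui_best": 59.1,
    "cli_original_skills": 48.2,
    "cli_augmented_skills": 69.3,
}


def implied_task_count(rate_percent: float, total: int = TOTAL_TASKS) -> int:
    """Nearest whole number of tasks consistent with a percentage pass rate."""
    return round(rate_percent / 100 * total)


def point_gap(a: str, b: str) -> float:
    """Difference (a - b) in percentage points, rounded to one decimal."""
    return round(PASS_RATES[a] - PASS_RATES[b], 1)


if __name__ == "__main__":
    for name, rate in PASS_RATES.items():
        count = implied_task_count(rate)
        print(f"{name}: {rate}% is roughly {count} of {TOTAL_TASKS} tasks")
    print("GUI lead over original-skill CLI (points):",
          point_gap("gui_best", "cli_original_skills"))
    print("CLI gain from skill augmentation (points):",
          point_gap("cli_augmented_skills", "cli_original_skills"))
    print("Augmented CLI lead over GUI (points):",
          point_gap("cli_augmented_skills", "gui_best"))
```

Read this way, skill augmentation moves the CLI agent by about 21 percentage points, roughly twice the size of the gap it originally had to the GUI agent. The inferred counts are only as precise as the one-decimal percentages allow.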