AI writes the code, but does finding the bug still take a human? A new benchmark exposes the gap
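Today's AI coding assistants can edit code and fix bugs directly, but there is a hidden prerequisite: the assistant first has to find where the bug is. This paper argues that exploring a repository for the relevant code is a critical capability, and that pass/fail benchmarks such as SWE-bench do not measure it on its own. The researchers built a benchmark, SWE-Explore, that gives an AI a repository and an issue and asks it to return a ranked list of relevant code regions within a fixed line budget. The result: modern methods already do well at file-level localization, but line-level coverage and efficient ranking are where the best explorers still differ. A tool may find the right file and still miss the key lines. Because these exploration metrics closely track downstream repair behavior, this may help explain why AI bug fixes sometimes land in the wrong place. For you, this means: do not assume an AI can debug a complex project unaided. Check that it is actually looking at the right lines.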
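To make the gap between file-level and line-level localization concrete, here is a minimal Python sketch of scoring a ranked list of code regions under a fixed line budget. It is an illustration only, not the paper's actual metric definitions: the `Region` type, the function names, the truncation rule, and the example file path are all assumptions made for this sketch.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """A hypothetical code region: a file path plus an inclusive 1-indexed line range."""
    path: str
    start: int
    end: int


def lines_within_budget(ranked: list[Region], budget: int) -> list[tuple[str, int]]:
    """Expand regions in rank order into (path, line) pairs, stopping at `budget` lines.

    Assumption: duplicate lines are counted once, and the last region is
    truncated if it would exceed the budget.
    """
    seen: set[tuple[str, int]] = set()
    kept: list[tuple[str, int]] = []
    for region in ranked:
        for line_no in range(region.start, region.end + 1):
            key = (region.path, line_no)
            if key in seen:
                continue
            if len(kept) >= budget:
                return kept
            seen.add(key)
            kept.append(key)
    return kept


def file_coverage(ranked: list[Region], gold: set[tuple[str, int]], budget: int) -> float:
    """Fraction of ground-truth files touched by the budgeted prediction."""
    kept_files = {path for path, _ in lines_within_budget(ranked, budget)}
    gold_files = {path for path, _ in gold}
    return len(kept_files & gold_files) / len(gold_files) if gold_files else 0.0


def line_coverage(ranked: list[Region], gold: set[tuple[str, int]], budget: int) -> float:
    """Fraction of ground-truth lines contained in the budgeted prediction."""
    kept = set(lines_within_budget(ranked, budget))
    return len(kept & gold) / len(gold) if gold else 0.0


# Hypothetical example: the relevant code is lines 40-44 of one file.
gold_lines = {("app/cache.py", n) for n in range(40, 45)}
prediction = [
    Region("app/cache.py", 1, 20),   # right file, wrong lines, ranked first
    Region("app/cache.py", 38, 42),  # partially overlaps the relevant lines
]

print(file_coverage(prediction, gold_lines, budget=25))  # 1.0
print(line_coverage(prediction, gold_lines, budget=25))  # 0.6
```

In this made-up case the explorer gets full credit at the file level while covering only three of the five relevant lines, and it spends most of its budget on a low-value region ranked first. That is the kind of difference a line-level, budget-aware evaluation is meant to surface.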
📄 Original abstract (English)
Repository-level coding benchmarks such as SWE-bench have driven a rapid surge in the capabilities of coding agents. Yet they usually treat coding tasks as a holistic, binary prediction problem (e.g., resolved or unresolved), neglecting fine-grained agent capabilities such as repository understanding, context retrieval, code localization, and bug diagnosis. In this paper, we introduce SWE-Explore, a benchmark that isolates the evaluation of repository exploration, a critical capability of coding agents. Given a repository and an issue, SWE-Explore asks an explorer to return a ranked list of relevant code regions under a fixed line budget. SWE-Explore covers 848 issues across 10 programming languages and 203 open-source repositories. For each instance, we derive line-level ground truth from independent agent trajectories that successfully solved the same issue, distilling the specific code regions their solution paths actually consulted. We evaluate exploration along coverage, ranking, and context-efficiency dimensions, showing that these metrics strongly track downstream repair behavior. Across a broad set of retrieval methods, general coding agents, and specialized localizers, we find that agentic explorers form a clear tier above classical retrieval. While file-level localization is already strong for modern methods, line-level coverage and efficient ranking remain the key axes differentiating state-of-the-art explorers.