AI Pulse
📄 Paper Digest

Verifying AI Bug Fixes Without Running Code, and Outscoring the Strongest Open-Source Verifier

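Training an AI to write code usually means building Docker environments to run tests and check whether its generated patches are correct, which is slow and expensive. This paper goes the other way: its verifier, Dockerless, judges a patch by exploring the repository and gathering evidence from the surrounding code, without executing anything. On a verifier evaluation benchmark it beats the strongest open-source verifier by 14.3 AUC points, and a model post-trained with it reaches resolve rates on SWE-bench Verified, Multilingual, and Pro that match environment-based post-training, the conventional approach that does run tests. It is not a tool you can pick up tomorrow, but it points to a trend: AI may increasingly judge code by understanding it rather than by “running it to see”.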
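The 14.3-point gap is reported in AUC, which measures how well a verifier ranks correct patches above incorrect ones. The sketch below computes AUC from scratch on made-up scores and labels; the data is purely illustrative and not from the paper, and reading "14.3 AUC points" as a difference on a 0 to 100 scale is an assumption.

```python
def roc_auc(scores, labels):
    """Probability that a randomly chosen correct patch gets a higher
    verifier score than a randomly chosen incorrect one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect patch")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Illustrative data only: verifier scores and ground-truth labels (1 = correct).
scores = [0.9, 0.4, 0.7, 0.6, 0.2]
labels = [1, 1, 0, 1, 0]
auc = roc_auc(scores, labels)
print(f"AUC = {auc:.3f}  ({auc * 100:.1f} points on a 0-100 scale)")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect one, so the metric does not depend on picking a single accept/reject threshold.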

📄 Original Abstract

Program verifiers play a central role in training coding agents, including selecting trajectories for supervised fine-tuning (SFT) and providing rewards for reinforcement learning (RL). Standard execution-based verification requires running unit tests inside per-repository environments such as Docker images, incurring substantial environment setup costs. We propose Dockerless, an environment-free agentic patch verifier that evaluates generated code patches without executing them. Rather than simply matching candidate patches to references, Dockerless judges patch correctness using evidence gathered through agentic repository exploration. On a verifier evaluation benchmark, Dockerless outperforms the strongest open-source verifier by 14.3 AUC points. Using Dockerless as both the SFT trajectory filter and the RL reward enables a fully environment-free post-training pipeline. The resulting model reaches 62.0%, 50.0%, and 35.2% resolve rate on SWE-bench Verified, Multilingual, and Pro, respectively. It surpasses the Qwen3.5-9B baseline by 2.4, 8.7, and 2.9 points, matching environment-based post-training.
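To make the two roles in the abstract concrete, here is a minimal sketch of one verifier serving as both the SFT trajectory filter and the RL reward. Every name in it (`Trajectory`, `filter_for_sft`, `rl_reward`, `toy_verifier`), the 0.5 threshold, and the choice to use the raw score as the reward are assumptions made for illustration; the abstract does not specify the paper's actual interface, and the toy verifier is a placeholder for agentic repository exploration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    issue: str   # task description given to the coding agent
    patch: str   # diff the agent produced


# Hypothetical interface: a verifier maps a trajectory to a score in [0, 1].
Verifier = Callable[[Trajectory], float]


def filter_for_sft(trajs: List[Trajectory], verify: Verifier,
                   threshold: float = 0.5) -> List[Trajectory]:
    """Keep only trajectories the verifier scores at or above the threshold."""
    return [t for t in trajs if verify(t) >= threshold]


def rl_reward(traj: Trajectory, verify: Verifier) -> float:
    """Use the verifier score directly as the scalar reward (an assumption)."""
    return verify(traj)


def toy_verifier(traj: Trajectory) -> float:
    # Stand-in only: a real environment-free verifier would explore the
    # repository and reason about the patch instead of matching a string.
    return 0.9 if "None" in traj.patch else 0.1


trajs = [
    Trajectory("crash on empty input", "+ if data is None: return []"),
    Trajectory("crash on empty input", "+ pass"),
]
kept = filter_for_sft(trajs, toy_verifier)
rewards = [rl_reward(t, toy_verifier) for t in trajs]
print(len(kept), rewards)
```

Because neither function runs tests, no per-repository Docker image appears anywhere in this loop, which is the property the abstract calls a fully environment-free post-training pipeline.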

Original paper on arXiv
