AI Pulse
📄 Paper Breakdown

AI agents learn to iterate on their failures instead of starting from scratch every time

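Today's AI agents (the kind that can browse the web and operate tools) treat every task as a one-off: whatever they work out is forgotten when the task ends, and the next similar problem means starting the search over. This paper introduces SkillHone, which lets an agent do what people do: keep the reasoning and the trial-and-error record behind each decision, then reuse that experience later. In practice, a "diagnosis" subagent analyzes why a skill failed, a "revision" subagent proposes improvements based on the recorded history, and "practice probes" check whether the change actually helps. On the deep-research benchmarks GAIA and WebWalkerQA-EN, it beats a commercially backed deep-research agent by 15.8 and 3.2 points respectively, and in internal tool-mediated analysis scenarios it raises accuracy by an average of 18.8 points across seven settings. This is not something you can use tomorrow, but it points to a key path for AI's shift from "one-shot tool" to "partner that keeps learning".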
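To make the idea of a persistent decision history concrete, here is a minimal sketch in Python. It is not SkillHone's implementation: the paper's abstract only says that diagnoses, revisions, evidence, and outcomes are recorded in structured form. The class names, field names, JSONL storage format, and example strings below are all assumptions made for illustration.

```python
from __future__ import annotations

import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class DecisionRecord:
    """One entry in a skill's decision history (hypothetical schema)."""
    skill: str       # which skill the decision concerns
    diagnosis: str   # why the previous version failed
    revision: str    # the change that was proposed
    evidence: str    # practice feedback behind the decision
    accepted: bool   # outcome: was the revision kept?


class DecisionHistory:
    """Append-only JSONL log, so the history outlives a single session."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def append(self, record: DecisionRecord) -> None:
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(asdict(record)) + "\n")

    def load(self, skill: str) -> list[DecisionRecord]:
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as f:
            records = [DecisionRecord(**json.loads(line)) for line in f if line.strip()]
        return [r for r in records if r.skill == skill]

    def rejected_revisions(self, skill: str) -> set[str]:
        """Revisions already tried and dropped, so they are not rediscovered."""
        return {r.revision for r in self.load(skill) if not r.accepted}


def demo() -> set[str]:
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "history.jsonl"

        # Session 1: one rejected and one accepted revision are logged.
        session_1 = DecisionHistory(path)
        session_1.append(DecisionRecord(
            skill="web_search",
            diagnosis="answers cited stale pages",
            revision="retry the same query three times",
            evidence="practice accuracy unchanged",
            accepted=False,
        ))
        session_1.append(DecisionRecord(
            skill="web_search",
            diagnosis="answers cited stale pages",
            revision="prefer pages with a recent date",
            evidence="practice accuracy improved",
            accepted=True,
        ))

        # Session 2: a fresh object reads the same file and sees prior decisions.
        session_2 = DecisionHistory(path)
        return session_2.rejected_revisions("web_search")


if __name__ == "__main__":
    print(demo())
```

The point of the sketch is the second session: because the log stores rejected alternatives alongside accepted ones, a later run can skip ideas that were already tried and dropped, rather than seeing only the final version of the skill.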

📄 Original abstract

Agent skills extend language-model agents with task-specific procedures, scripts, and references, but the tasks and environments they target continually change. Existing methods improve skills in bounded runs and retain only the final artifact, discarding the decision history that later agents need to interpret prior revisions, evaluations, and rejected alternatives. We introduce SkillHone, a harness for continual agent skill evolution grounded in persistent decision history. SkillHone pairs skill revisions with evaluation-side evidence that supplies practice feedback, recording structured histories of diagnoses, revisions, evidence, and outcomes. Role-separated subagents run candidate skills on practice probes with redacted reporting and propose revisions informed by prior decisions, enabling cross-session refinement without rediscovering past rationale. On deep-research benchmarks, SkillHone runs without a pre-integrated search stack and outperforms the commercially backed deep-research agent by 15.8 points on GAIA and 3.2 points on WebWalkerQA-EN, while also exceeding prior skill-evolution methods. We further deploy SkillHone on internal tool-mediated analysis scenarios, where it improves accuracy by an average of 18.8 points across seven settings.

Original paper on arXiv
