AI Pulse
📄 Paper Digest

AI learns to pick its own experience, write notes, and upgrade itself: a 9B model challenges a 397B one

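Today's AI agents can remember conversation history, but they do not learn from experience, much like taking piles of notes and never rereading them. This paper trains an agent in a complete self-evolution workflow: judge which experience is useful, act on it, write it up as reusable knowledge, and keep the resulting repository organized. The researchers implement this as a "slow-fast dual loop". In the fast loop, the agent reads and updates its memory in real time while working on a task. In the slow loop, after-the-fact review distills the good experience into the model itself. With this mechanism, a 9-billion-parameter model (OPD-Evolver-9B) can challenge far larger models such as the 397-billion-parameter Qwen3.5-397B-A17B. It is not something you can use tomorrow, but it points to a shift from "memory-augmented" AI to AI that genuinely learns.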
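
To make the fast loop concrete, here is a toy Python sketch of the four abilities the paper names (read, use, write, maintain). It is not the authors' implementation: `ToyMemory`, `MemoryItem`, `fast_step`, the word-overlap retrieval, and the smoothed success-rate score are all assumptions chosen for illustration. The paper's four-level memory hierarchy and its slow loop (on-policy self-distillation) are not modeled here.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryItem:
    """One piece of stored experience (hypothetical structure)."""
    text: str
    uses: int = 0
    wins: int = 0

    @property
    def utility(self) -> float:
        # Laplace-smoothed success rate, so an unused item starts at 0.5.
        return (self.wins + 1) / (self.uses + 2)


@dataclass
class ToyMemory:
    capacity: int = 3
    items: list = field(default_factory=list)

    def read(self, task: str, k: int = 1) -> list:
        """Select: rank items by word overlap with the task, then by utility."""
        words = set(task.lower().split())
        scored = []
        for i, item in enumerate(self.items):
            overlap = len(words & set(item.text.lower().split()))
            if overlap > 0:
                scored.append((overlap, item.utility, i))
        scored.sort(reverse=True)
        return [self.items[i] for _, _, i in scored[:k]]

    def credit(self, used: list, success: bool) -> None:
        """Attribute the task outcome to the items that were used."""
        for item in used:
            item.uses += 1
            item.wins += int(success)

    def write(self, text: str) -> None:
        """Store a new lesson unless an identical one already exists."""
        if all(item.text != text for item in self.items):
            self.items.append(MemoryItem(text))

    def maintain(self) -> None:
        """Keep the highest-utility items; on ties, prefer newer ones."""
        ranked = sorted(
            enumerate(self.items),
            key=lambda pair: (pair[1].utility, pair[0]),
            reverse=True,
        )
        self.items = [item for _, item in ranked[: self.capacity]]


def fast_step(memory: ToyMemory, task: str, solve) -> bool:
    """One fast-loop iteration: read, use, write, maintain.

    `solve(task, hints)` is a stand-in for the agent and must return
    a `(success, lesson)` pair.
    """
    used = memory.read(task)
    success, lesson = solve(task, [item.text for item in used])
    memory.credit(used, success)
    if success and lesson:
        memory.write(lesson)
    memory.maintain()
    return success
```

In this sketch, an item that keeps contributing to successful tasks accumulates utility and survives pruning, while unused or unhelpful items are eventually dropped. The slow loop described in the paper would go further and train the policy itself on such outcomes, which a few lines of code cannot reproduce.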

📄 Original Abstract (English)

Memory has become a standard substrate for self-evolving agents, yet retaining experience is not the same as learning how to evolve through it. Existing memory agents can store trajectories, retrieve reflections, or accumulate skills, but often lack the holistic competence to select useful experience, act on it, write reusable knowledge, and maintain a growing repository. We introduce OPD-Evolver, a slow-fast co-evolution framework that cultivates such an agent evolver through on-policy self-distillation. In the fast loop, OPD-Evolver interacts with a four-level memory hierarchy to read, use, write, and maintain experience for rapid test-time evolution. In the slow loop, outcome-calibrated memory attribution and privileged hindsight distill these four abilities into the deployable policy. Across multi-domain benchmarks, OPD-Evolver surpasses memory systems such as ReasoningBank by up to 11.5%, and training-based methods such as Skill0 by ~5.8%. Further analysis shows that OPD-Evolver internalizes high-value experience and memory management, enabling OPD-Evolver-9B to challenge giant counterparts such as Qwen3.5-397B-A17B and Step-3.5-Flash, pointing beyond memory-augmented agents toward genuinely qualified agent evolvers.

Original paper on arXiv
