📄 Paper Digest

AI Learns to Pick Its Own Experience, Take Notes, and Upgrade Itself: a 9B Model Challenges a 397B One

AI agents · self-evolution · memory management · knowledge distillation · small models vs. large models

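Today's AI agents can store conversation history, but storing experience is not the same as learning how to use it. This paper has the agent learn four things through rapid trial and error, much as a person would: select the useful experience, apply it right away, write it up as reusable knowledge, and keep the memory store organized. A slower training loop then internalizes these abilities as habit. The result is a 9B-parameter model that, with this method, can challenge a 397B giant across multi-domain benchmarks. It is not a tool you can use tomorrow, but it points in a direction: AI that does not just passively remember, but actively learns how to learn.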

📄 Original Abstract (English)

Memory has become a standard substrate for self-evolving agents, yet retaining experience is not the same as learning how to evolve through it. Existing memory agents can store trajectories, retrieve reflections, or accumulate skills, but often lack the holistic competence to select useful experience, act on it, write reusable knowledge, and maintain a growing repository. We introduce OPD-Evolver, a slow-fast co-evolution framework that cultivates such an agent evolver through on-policy self-distillation. In the fast loop, OPD-Evolver interacts with a four-level memory hierarchy to read, use, write, and maintain experience for rapid test-time evolution. In the slow loop, outcome-calibrated memory attribution and privileged hindsight distill these four abilities into the deployable policy. Across multi-domain benchmarks, OPD-Evolver surpasses memory systems such as ReasoningBank by up to 11.5%, and training-based methods such as Skill0 by ~5.8%. Further analysis shows that OPD-Evolver internalizes high-value experience and memory management, enabling OPD-Evolver-9B to challenge giant counterparts such as Qwen3.5-397B-A17B and Step-3.5-Flash, pointing beyond memory-augmented agents toward genuinely qualified agent evolvers.

Original paper on arXiv
