AI Pulse
📄 Paper Digest

AI Learns to Pick Its Own Experience, Take Notes, and Upgrade Itself: A 9B Model Challenges a 397B One

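Today's AI agents can remember their interaction history, but they are poor at judging which experiences are useful, how to apply them, and how to write them down. This paper proposes OPD-Evolver, which teaches an agent a complete self-evolution routine built from two loops: a fast loop that accumulates experience through rapid trial and error at test time, and a slow loop that distills that experience into reusable knowledge. Concretely, the agent works with a four-level memory hierarchy and practices four abilities over it (reading, using, writing, and maintaining experience), and on-policy self-distillation then internalizes those abilities into the model's parameters. The headline result: according to the abstract, the 9B-parameter model can challenge far larger models such as Qwen3.5-397B-A17B on multi-domain benchmarks. This is not a feature you can use tomorrow, but it points in a direction: future AI may not need frequent version updates and could instead get stronger on its own while it runs.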
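To make the four memory operations concrete, here is a minimal toy sketch in Python. It is not OPD-Evolver's implementation: the names `ToyMemory` and `MemoryEntry`, the keyword-overlap retrieval, and the smoothed success-rate score are all assumptions made for illustration, and a flat list stands in for the paper's four-level hierarchy. The slow loop, which distills these abilities into model weights, is not shown.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    """One stored piece of experience (hypothetical structure)."""
    text: str
    uses: int = 0       # how often the entry was retrieved and acted on
    successes: int = 0  # how many of those episodes ended in success

    @property
    def value(self) -> float:
        # Laplace-smoothed success rate, so an unused entry starts at 0.5.
        return (self.successes + 1) / (self.uses + 2)


@dataclass
class ToyMemory:
    """Flat memory store illustrating read, use, write, and maintain."""
    capacity: int = 3
    entries: list[MemoryEntry] = field(default_factory=list)

    def read(self, query: str, k: int = 2) -> list[MemoryEntry]:
        """Read: return up to k entries that share a word with the query."""
        words = set(query.lower().split())
        scored = [(len(words & set(e.text.lower().split())), e)
                  for e in self.entries]
        hits = [(overlap, e) for overlap, e in scored if overlap > 0]
        hits.sort(key=lambda pair: (pair[0], pair[1].value), reverse=True)
        return [e for _, e in hits[:k]]

    def use(self, retrieved: list[MemoryEntry], success: bool) -> None:
        """Use: credit the episode outcome to the entries that were applied."""
        for e in retrieved:
            e.uses += 1
            e.successes += int(success)

    def write(self, text: str) -> None:
        """Write: store a new note unless an identical one already exists."""
        if all(e.text != text for e in self.entries):
            self.entries.append(MemoryEntry(text))

    def maintain(self) -> None:
        """Maintain: keep only the `capacity` highest-value entries."""
        self.entries.sort(key=lambda e: e.value, reverse=True)
        del self.entries[self.capacity:]


# One pass through the toy fast loop.
memory = ToyMemory(capacity=2)
memory.write("login form needs username before password")
memory.write("sort table by date first")
memory.write("retry search with fewer keywords")

hits = memory.read("fill login form")  # only the login note overlaps
memory.use(hits, success=True)         # the episode succeeded
memory.maintain()                      # prune down to two entries
```

After this pass the login note has the highest score and survives pruning, which is the behavior the summary describes in miniature: experience that helped is kept and ranked up, and the rest is eventually dropped.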

📄 Original Abstract (English)

Memory has become a standard substrate for self-evolving agents, yet retaining experience is not the same as learning how to evolve through it. Existing memory agents can store trajectories, retrieve reflections, or accumulate skills, but often lack the holistic competence to select useful experience, act on it, write reusable knowledge, and maintain a growing repository. We introduce OPD-Evolver, a slow-fast co-evolution framework that cultivates such an agent evolver through on-policy self-distillation. In the fast loop, OPD-Evolver interacts with a four-level memory hierarchy to read, use, write, and maintain experience for rapid test-time evolution. In the slow loop, outcome-calibrated memory attribution and privileged hindsight distill these four abilities into the deployable policy. Across multi-domain benchmarks, OPD-Evolver surpasses memory systems such as ReasoningBank by up to 11.5%, and training-based methods such as Skill0 by ~5.8%. Further analysis shows that OPD-Evolver internalizes high-value experience and memory management, enabling OPD-Evolver-9B to challenge giant counterparts such as Qwen3.5-397B-A17B and Step-3.5-Flash, pointing beyond memory-augmented agents toward genuinely qualified agent evolvers.

Original paper on arXiv
