AI Pulse
📄 Paper Explained

AI Runs Its Own Research: One Tree-Shaped Brain, Best Results on All Six Tasks

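At its core, research is repeated trial and error with accumulated experience. Yet when AI agents do research, each attempt often starts from scratch, and the lessons of one run are not carried into the next. The team behind this paper designed a framework called Arbor, built around a persistent "hypothesis tree" (Hypothesis Tree Refinement, HTR). A long-lived coordinator plans research strategy over the tree, while short-lived executors implement and test individual hypotheses in isolated worktrees. As results return, the tree is updated: reusable lessons, including those from failed attempts, are attached to it so that later decisions can draw on them. On six real research tasks spanning model training, harness engineering, and data synthesis, Arbor achieves the best held-out result on every task, with more than 2.5x the average relative held-out gain of Codex and Claude Code under the same task interface and resource budget. On MLE-Bench Lite, it reaches an 86.36% Any Medal rate with GPT-5.5, the strongest result in the paper's comparison. This is not something you can use tomorrow, but it points in a direction: AI research that is no longer a series of one-off attempts, but a cumulative, iterative process with long-term memory.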
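To make the coordinator / executor / tree loop concrete, here is a minimal toy sketch in Python. It is not Arbor's implementation: the `Node` class, `frontier`, `run`, and `toy_executor` are all hypothetical names, the "coordinator" is reduced to picking the first untested leaf, and the scores are made-up numbers used only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    """One hypothesis in the tree (hypothetical schema, not the paper's)."""
    hypothesis: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    score: Optional[float] = None                        # evidence from an executor
    insights: List[str] = field(default_factory=list)    # distilled lessons

    def add_child(self, hypothesis: str) -> "Node":
        child = Node(hypothesis, parent=self)
        self.children.append(child)
        return child

    def inherited_insights(self) -> List[str]:
        """Lessons visible here: this node's own plus every ancestor's."""
        node, seen = self, []
        while node is not None:
            seen = node.insights + seen
            node = node.parent
        return seen


def frontier(node: Node) -> List[Node]:
    """Untested leaves, in left-to-right order."""
    if not node.children:
        return [node] if node.score is None else []
    leaves: List[Node] = []
    for child in node.children:
        leaves.extend(frontier(child))
    return leaves


Executor = Callable[[str, List[str]], Tuple[float, str]]


def run(root: Node, executor: Executor, baseline: float, rounds: int) -> float:
    """Toy coordinator loop: test one open hypothesis per round."""
    best = baseline
    for _ in range(rounds):
        open_nodes = frontier(root)
        if not open_nodes:
            break
        node = open_nodes[0]  # stand-in for a real search strategy
        score, lesson = executor(node.hypothesis, node.inherited_insights())
        node.score = score
        # Attach the lesson to the parent so sibling branches can see it.
        (node.parent or node).insights.append(lesson)
        if score > best:      # admit only improvements over the current best
            best = score
    return best


lessons_seen: List[List[str]] = []  # what each executor call could see


def toy_executor(hypothesis: str, lessons: List[str]) -> Tuple[float, str]:
    """Deterministic stand-in for a real experiment (made-up scores)."""
    lessons_seen.append(list(lessons))
    scores = {"larger batch size": 0.71, "cosine LR schedule": 0.78}
    score = scores.get(hypothesis, 0.0)
    return score, f"{hypothesis}: scored {score:.2f}"


root = Node("improve baseline model")
a = root.add_child("larger batch size")
b = root.add_child("cosine LR schedule")
best = run(root, toy_executor, baseline=0.70, rounds=5)
```

In this toy run, the second executor call receives the lesson recorded by the first, which is the "carry lessons forward" idea in miniature. A real system would replace the first-leaf rule with an actual strategy and would verify improvements on held-out data before admitting them.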

📄 Original Abstract

Scientific progress depends on a repeated loop of exploration, experimentation, and abstraction. Researchers test candidate directions, interpret the evidence, and carry the resulting lessons into later attempts. We study how an AI agent can run this loop autonomously over long horizons. We introduce Arbor, a general framework for autonomous research that combines a long-lived coordinator, short-lived executors, and Hypothesis Tree Refinement (HTR), a persistent tree that links hypotheses, artifacts, evidence, and distilled insights across time. The coordinator manages global research strategy over the tree, while executors implement and test individual hypotheses in isolated worktrees. As results return, Arbor updates the tree, propagates reusable lessons, refines the search frontier, and admits verified improvements. This design turns autonomous research from a sequence of local attempts into a cumulative process in which strategy, execution, and evidence are carried across time. We evaluate Arbor under Autonomous Optimization (AO), an operational setting where an agent improves an initial research artifact through iterative experimentation without step-level human supervision. Across six real research tasks in model training, harness engineering, and data synthesis, Arbor achieves the best held-out result on all six tasks, attaining more than 2.5x the average relative held-out gain of Codex and Claude Code under the same task interface and resource budget. On MLE-Bench Lite, Arbor reaches 86.36% Any Medal with GPT-5.5, the strongest result in our comparison.

Original paper on arXiv
