📄 Paper Digest

AI runs its own research, with more than 2.5x the average gain of Codex and Claude Code

autonomous research · hypothesis tree · AI agents · long-horizon planning

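Ask an AI to tune a model or write code today, and it takes things one step at a time. This paper has the AI act as its own "research lead": a long-lived coordinator paired with a pool of short-lived "executors", tied together by a "hypothesis tree" that links every attempt, result, and lesson learned. When an experiment finishes, the tree is updated so that later attempts can reuse what was learned. Across 6 real research tasks (model training, harness engineering, data synthesis), it gets the best held-out result on all six, with more than 2.5x the average relative held-out gain of Codex and Claude Code under the same task interface and resource budget. On MLE-Bench Lite, it reaches an 86.36% Any Medal rate with GPT-5.5, the strongest result in the authors' comparison. You won't be using it tomorrow, but it shows that AI is a big step closer to running a research project end to end on its own.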
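To make the coordinator / executor / hypothesis tree split concrete, here is a minimal Python sketch. It is not Arbor's implementation: the names `Hypothesis`, `frontier`, and `coordinate` are hypothetical, the "executor" is a plain function that returns a score, and the selection policy (take the first untested leaf) is a placeholder for whatever strategy the real coordinator uses.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass(eq=False)
class Hypothesis:
    """One tree node: an idea plus whatever evidence came back for it."""
    idea: str
    parent: Optional["Hypothesis"] = field(default=None, repr=False)
    children: List["Hypothesis"] = field(default_factory=list)
    score: Optional[float] = None  # held-out metric reported by an executor
    insight: str = ""              # lesson distilled from the result

    def add_child(self, idea: str) -> "Hypothesis":
        child = Hypothesis(idea=idea, parent=self)
        self.children.append(child)
        return child


def frontier(root: Hypothesis) -> List[Hypothesis]:
    """Return the untested leaves, i.e. the hypotheses still worth running."""
    stack, open_nodes = [root], []
    while stack:
        node = stack.pop()
        if node.score is None and not node.children:
            open_nodes.append(node)
        stack.extend(node.children)
    return open_nodes


def coordinate(root: Hypothesis,
               executor: Callable[[str], float],
               budget: int) -> Hypothesis:
    """Long-lived loop: dispatch one hypothesis at a time to a short-lived
    executor, record the evidence on the tree, and keep the best verified
    result. Assumes root.score holds the baseline and higher is better."""
    best = root
    for _ in range(budget):
        open_nodes = frontier(root)
        if not open_nodes:
            break
        node = open_nodes[0]               # placeholder selection policy
        node.score = executor(node.idea)   # executor runs in isolation
        node.insight = f"{node.idea}: scored {node.score:.2f}"
        if node.score > best.score:        # admit only real improvements
            best = node
    return best


# Toy executor: a lookup table stands in for a real training run.
fake_results = {"larger batch": 0.62, "cosine schedule": 0.71, "more dropout": 0.55}
root = Hypothesis(idea="baseline", score=0.60)
for idea in fake_results:
    root.add_child(idea)
best = coordinate(root, executor=fake_results.__getitem__, budget=3)
print(best.idea, best.score)
```

The point of the design is that the tree outlives any single executor call: scores and insights stay attached to nodes, so the coordinator's next decision can draw on everything tried so far.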

📄 Original abstract

Scientific progress depends on a repeated loop of exploration, experimentation, and abstraction. Researchers test candidate directions, interpret the evidence, and carry the resulting lessons into later attempts. We study how an AI agent can run this loop autonomously over long horizons. We introduce Arbor, a general framework for autonomous research that combines a long-lived coordinator, short-lived executors, and Hypothesis Tree Refinement (HTR), a persistent tree that links hypotheses, artifacts, evidence, and distilled insights across time. The coordinator manages global research strategy over the tree, while executors implement and test individual hypotheses in isolated worktrees. As results return, Arbor updates the tree, propagates reusable lessons, refines the search frontier, and admits verified improvements. This design turns autonomous research from a sequence of local attempts into a cumulative process in which strategy, execution, and evidence are carried across time. We evaluate Arbor under Autonomous Optimization (AO), an operational setting where an agent improves an initial research artifact through iterative experimentation without step-level human supervision. Across six real research tasks in model training, harness engineering, and data synthesis, Arbor achieves the best held-out result on all six tasks, attaining more than 2.5x the average relative held-out gain of Codex and Claude Code under the same task interface and resource budget. On MLE-Bench Lite, Arbor reaches 86.36% Any Medal with GPT-5.5, the strongest result in our comparison.
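The "2.5x" figure is a ratio of average relative gains, not a claim that final scores are 2.5x higher. The short Python sketch below shows the arithmetic under two assumptions that the abstract does not spell out: relative gain is taken as (final - initial) / |initial| with higher scores being better, and the average is a plain mean over tasks. Every number in it is made up for illustration and is not taken from the paper.

```python
from typing import List, Tuple


def relative_gain(initial: float, final: float) -> float:
    """Assumed definition: improvement as a fraction of the starting score."""
    return (final - initial) / abs(initial)


def mean_relative_gain(runs: List[Tuple[float, float]]) -> float:
    """Plain mean of per-task relative gains over (initial, final) pairs."""
    gains = [relative_gain(initial, final) for initial, final in runs]
    return sum(gains) / len(gains)


# Made-up (initial, final) held-out scores for two hypothetical tasks.
baseline_agent = [(0.50, 0.52), (0.40, 0.42)]   # gains: 4% and 5%
tree_agent = [(0.50, 0.55), (0.40, 0.45)]       # gains: 10% and 12.5%

ratio = mean_relative_gain(tree_agent) / mean_relative_gain(baseline_agent)
print(f"ratio of average relative gains: {ratio:.2f}")
```

In this toy case the second agent's final scores are only a few points higher, yet its average relative gain is about 2.5 times the first agent's, which is the kind of comparison the abstract reports.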

Original paper on arXiv
