A Training Data Recipe for AI Agents Is Made Public for the First Time
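The data recipe for training an AI agent that can handle many kinds of tasks has long been a black box. The OpenThoughts-Agent (OT-Agent) project is the first to publish its complete data curation pipeline, and after more than 100 controlled ablation experiments it reports that task sources and diversity matter more than sheer data volume. A Qwen3-32B model fine-tuned on 100K examples from this pipeline reaches an average accuracy of 44.8% across seven agentic benchmarks, 3.9 percentage points above the strongest previous open-data agentic model (Nemotron-Terminal-32B, 40.9%). More importantly, in compute-controlled comparisons their data outperforms alternative open datasets at every training set size. This is not a tool you can pick up and use tomorrow, but it gives the open-source community a reproducible blueprint for agent training, so that more people can help improve it.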
📄 Original abstract
Agentic language models dramatically expand the applications of AI yet little is publicly known about how to curate training data for broadly capable agents. Existing open efforts such as SWE-Smith, SERA, and Nemotron-Terminal typically target a single benchmark, leaving open the question of how to train models that generalize across diverse agentic tasks. The OpenThoughts-Agent (OT-Agent) project addresses this gap with a fully open data curation pipeline for training agentic models. We conduct more than 100 controlled ablation experiments to systematically investigate each stage of the pipeline, yielding insights on the importance of task sources and diversity. We then assemble a training set of 100K examples from our pipeline and fine-tune Qwen3-32B on this dataset, which yields an average accuracy of 44.8% across seven agentic benchmarks and a 3.9 percentage point improvement over the strongest existing open data agentic model (Nemotron-Terminal-32B, 40.9%). Moreover, our training data exhibits strong scaling properties, outperforming alternative open datasets at every training set size in compute-controlled comparisons. We publicly release our training sets, data pipeline, experimental data, and models at openthoughts.ai to support future open research on agentic model training.