AI Pulse
📄 Paper Digest

AI terminal agents short on good data? This paper takes an exam-setter's approach to build 6,000 high-quality training trajectories

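The biggest bottleneck in teaching AI to operate a terminal (the command line, for example) is not model size but the lack of reliable training data. Existing synthesis pipelines tend to patch tasks together from surface-level artifacts, which yields ambiguous instructions, shallow execution paths, and brittle tests. This paper works like a rigorous examiner: it first drafts tasks along four dimensions (domain, skill type, capability, and engineering pillar), then checks each candidate against real-world technical material to confirm it is realistic, and finally runs it in a Docker container, keeping only tasks that execute, pose a real challenge, and are unambiguous. About two-thirds of the candidates are discarded along the way. The resulting 6,000 high-quality trajectories let Qwen3-32B, a 32B-parameter model, reach 33.4% on Terminal-Bench 2.0 and outperform several models roughly ten times its size. It is not a tool you can use tomorrow, but it points to a trend in which data quality matters more than quantity: AI training may shift from piling up data to curating tasks.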
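To make the first step concrete, here is a minimal sketch of sampling candidate tasks across a four-dimension taxonomy. Only the four dimension names come from the paper; the values inside each dimension, the `sample_candidates` helper, and the uniform sampling strategy are assumptions made for illustration, not the authors' implementation.

```python
import itertools
import random

# Hypothetical taxonomy values. The paper names the four dimensions,
# but the entries listed here are made up for this example.
TAXONOMY = {
    "domain": ["networking", "databases", "build systems"],
    "skill_type": ["debugging", "configuration", "automation"],
    "capability": ["log analysis", "dependency resolution"],
    "engineering_pillar": ["reliability", "security"],
}


def sample_candidates(taxonomy, k, seed=0):
    """Draw k distinct combinations, taking one value per dimension."""
    rng = random.Random(seed)
    dims = list(taxonomy)
    # Enumerate the full cross product, then sample without replacement.
    all_combos = list(itertools.product(*(taxonomy[d] for d in dims)))
    picked = rng.sample(all_combos, k)
    return [dict(zip(dims, combo)) for combo in picked]


candidates = sample_candidates(TAXONOMY, k=5)
for candidate in candidates:
    print(candidate)
```

Each sampled combination would then serve as the seed for a task blueprint, which later stages ground in real documentation and verify by execution.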

📄 Original Abstract (English)

While recent LLM-based terminal agents have demonstrated promising capabilities, the scarcity of high-quality, executable training data remains a critical bottleneck. Existing synthesis pipelines typically scale by retrofitting surface-level artifacts into tasks, frequently yielding ambiguous instructions, shallow execution paths, and brittle tests that provide weak learning signals. To overcome this, we introduce CLI-Universe, a principled synthesis engine that constructs terminal-agent tasks. CLI-Universe generates candidate tasks by sampling combinations across a multi-dimensional capability taxonomy (domain, skill type, capability, and engineering pillar), then grounds each candidate through evidence-guided deep research over real-world technical materials. To ensure rigorous supervision, validated blueprints are instantiated into Dockerized environments and subjected to a multi-stage executable verification pipeline featuring rubric-gated test construction, hint-conditional filtering, and strict fail-to-pass checking. Across the full pipeline, from candidate generation to verification, approximately two-thirds of candidates are discarded, retaining only those that are genuine, verifiable, and non-trivially challenging. To validate our framework, we instantiate a highly distilled dataset of 6,000 trajectories called CLI-Universe-6K. Remarkably, fine-tuning Qwen3-32B on CLI-Universe-6K achieves 33.4% on Terminal-Bench 2.0. This sets a new state-of-the-art for models trained on open-source data at or below 32B parameters, and outperforms several models an order of magnitude larger, demonstrating the profound data efficiency of structured, high-fidelity synthesis.
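The abstract names "strict fail-to-pass checking" without defining it. The sketch below assumes the usual reading: a task is kept only if every test fails on the untouched environment and every test passes once the reference solution has been applied. The `Task` class, the dictionary standing in for a container's file system, and the example tasks are all hypothetical; a real pipeline would run the tests inside Docker.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for a container's file system: path -> contents.
Env = Dict[str, str]


@dataclass
class Task:
    name: str
    initial_env: Env
    solution: Callable[[Env], Env]      # reference solution (assumed to exist)
    tests: List[Callable[[Env], bool]]  # each test returns True on success


def passes_fail_to_pass(task: Task) -> bool:
    """Keep a task only if all tests fail before the fix and pass after it."""
    before = [test(dict(task.initial_env)) for test in task.tests]
    solved_env = task.solution(dict(task.initial_env))
    after = [test(solved_env) for test in task.tests]
    return not any(before) and all(after)


def set_port(env: Env) -> Env:
    return {**env, "app.conf": "port=8080"}


def port_is_8080(env: Env) -> bool:
    return env.get("app.conf") == "port=8080"


tasks = [
    # Tests fail first, then pass: a valid task.
    Task("fix-config", {"app.conf": "port=80"}, set_port, [port_is_8080]),
    # Tests already pass before any work is done: trivial, rejected.
    Task("already-done", {"app.conf": "port=8080"}, set_port, [port_is_8080]),
    # The reference solution does not satisfy the tests: rejected.
    Task("broken-solution", {"app.conf": "port=80"}, lambda env: env, [port_is_8080]),
]

kept = [task.name for task in tasks if passes_fail_to_pass(task)]
print(kept)
```

Under this reading, the check removes both tasks that need no work and tasks whose tests cannot be satisfied, which is one way a pipeline could end up discarding a large share of its candidates.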

Original paper on arXiv
