📄 Paper Digest

AI Designs Its Own Training Problems and Patches Its Weak Spots Better Than Humans Can

Trends · Reinforcement learning · LLM training · Automatic environment design · Self-improvement

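Training an AI is like coaching a student through practice problems, except that the teacher (a human) has to keep guessing what to drill next. This paper lets the AI set its own problems: the model first analyzes the cases it got wrong, then adjusts the environment parameters for the next training stage (for example, map difficulty or the number of obstacles) to target its weak spots. In tests with Qwen3-4B as the backbone, this "self-assigned" approach achieved the strongest aggregate performance on the paper's benchmarks, ahead of both larger proprietary LLMs such as GPT and Gemini and training on a fixed environment. More surprisingly, a partially trained checkpoint designs better environments than the original base model, which suggests that training sharpens the model's sense of where it still falls short. This is not a tool you can use tomorrow, but it points to a trend: AI may soon be able to design its own learning path rather than relying on human coaches.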

📄 Original Abstract

Reinforcement learning pipelines for Large Language Model (LLM) training often rely on manually redesigned environments between stages, requiring practitioners to heuristically infer which configuration will best improve the current policy. To automate this process, we propose the LLM-as-Environment-Engineer framework in which the current policy model analyzes failure trajectories together with contextual information and proposes modifications to the next-stage training environment configuration. We also introduce MAPF-FrozenLake, a controllable testbed whose generator exposes multi-dimensional environment configurations, making it suitable for studying and benchmarking environment redesign. On this testbed, we condition the environment engineer on structured summaries of policy behavior, failure cases, and environment statistics, from which it produces the configuration for the next training stage. With Qwen3-4B as the backbone, our framework achieves the strongest aggregate performance on our benchmarks, outperforming larger proprietary LLMs (e.g., GPT, Gemini) and fixed-environment training baselines. We further analyze which forms of context are most effective, finding that successful environment updates rely on failure evidence and preserve configurations that already work. Interestingly, the current RL checkpoint serves as a better environment engineer than the original base model, suggesting that policy learning improves the model's ability to diagnose its remaining weaknesses.
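The staged loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `EnvConfig` fields, the `StageSummary` structure, the rule-based `toy_engineer` (a stand-in for the LLM call), and the fabricated numbers in `fake_train_and_evaluate` are all assumptions made for the example.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class EnvConfig:
    """Hypothetical configuration knobs; the real generator's fields may differ."""
    grid_size: int = 4
    num_agents: int = 2
    hole_ratio: float = 0.10


@dataclass(frozen=True)
class StageSummary:
    """Structured context handed to the environment engineer (assumed shape)."""
    success_rate: float
    failure_tags: Tuple[str, ...]  # e.g. ("fell_in_hole",)


Engineer = Callable[[EnvConfig, StageSummary], EnvConfig]
Trainer = Callable[[EnvConfig], StageSummary]


def toy_engineer(config: EnvConfig, summary: StageSummary) -> EnvConfig:
    """Stand-in for the LLM call: a hand-written rule, not the paper's prompt."""
    if summary.success_rate >= 0.8:
        # The policy copes with this setting, so enlarge the map.
        return replace(config, grid_size=config.grid_size + 1)
    if "fell_in_hole" in summary.failure_tags:
        # Failure evidence points at holes: add practice on that one
        # dimension and leave every other field unchanged.
        new_ratio = round(min(0.5, config.hole_ratio + 0.05), 2)
        return replace(config, hole_ratio=new_ratio)
    # No clear failure evidence: preserve the configuration that works.
    return config


def run_stages(initial: EnvConfig, engineer: Engineer,
               train_and_evaluate: Trainer, num_stages: int) -> List[EnvConfig]:
    """Alternate training with environment redesign; return every config used."""
    history = [initial]
    config = initial
    for _ in range(num_stages):
        summary = train_and_evaluate(config)  # train, then summarize failures
        config = engineer(config, summary)    # propose the next-stage config
        history.append(config)
    return history


def fake_train_and_evaluate(config: EnvConfig) -> StageSummary:
    """Fabricated outcomes for illustration only; no real training happens."""
    if config.hole_ratio < 0.2:
        return StageSummary(success_rate=0.5, failure_tags=("fell_in_hole",))
    return StageSummary(success_rate=0.9, failure_tags=())


if __name__ == "__main__":
    for stage, cfg in enumerate(
            run_stages(EnvConfig(), toy_engineer, fake_train_and_evaluate, 3)):
        print(stage, cfg)
```

In the framework the abstract describes, the engineer role would be played by the current policy checkpoint itself, conditioned on summaries of policy behavior, failure cases, and environment statistics; the rule above only mimics the reported pattern of acting on failure evidence while keeping settings that already work.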

Original paper on arXiv
