📄 Paper Digest

AI Learns from Its Own Mistakes

Self-distillation · Reasoning · Error correction · Reinforcement learning · Large language models

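Today's AI training resembles having a student copy the answer key: the model never learns where it went wrong. This paper has the model make its own mistakes, diagnose them, and fix them. When the model produces both a correct and an incorrect response to the same question, the system keeps the incorrect reasoning up to the point of failure, then inserts a natural-language diagnosis followed by corrected reasoning, forming a new training trajectory. According to the paper, this works better than simply imitating correct answers: on the AIME 2024, AIME 2025, and HMMT 2025 math competition benchmarks, the new method (TAPO) consistently outperforms the GRPO baseline under the same number of training steps. It is not something you can use tomorrow, but it points to a new direction for AI self-improvement: from "imitating answers" to "diagnosing errors".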
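To make the trajectory construction concrete, here is a minimal Python sketch. It is an illustration only, not the paper's implementation: the `Rollout` class, the `build_corrective_trajectory` function, the step-list representation, and the choice to keep the faulty step itself in the prefix are all assumptions made for this example. In the paper, the diagnosis and correction are generated by the model with a correct rollout as reference; here they are simply passed in as strings.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    """Hypothetical container for one sampled response."""
    steps: List[str]  # reasoning steps, in order
    correct: bool     # whether the final answer was verified as correct


def build_corrective_trajectory(
    wrong: Rollout,
    error_index: int,
    diagnosis: str,
    corrected_steps: List[str],
) -> List[str]:
    """Keep the wrong rollout up to and including the faulty step (an
    assumption of this sketch), then append a natural-language diagnosis
    and the corrected continuation."""
    if wrong.correct:
        raise ValueError("expected an incorrect rollout")
    if not 0 <= error_index < len(wrong.steps):
        raise ValueError("error_index out of range")
    prefix = wrong.steps[: error_index + 1]
    return prefix + [diagnosis] + corrected_steps


# Toy example: the second step contains an arithmetic slip.
wrong = Rollout(steps=["2 + 3 = 5", "5 * 4 = 24", "answer: 24"], correct=False)
traj = build_corrective_trajectory(
    wrong,
    error_index=1,
    diagnosis="Wait, 5 * 4 is 20, not 24.",
    corrected_steps=["5 * 4 = 20", "answer: 20"],
)
print(traj)
```

The resulting trajectory begins with the model's own (flawed) reasoning and discards only what follows the faulty step, which is the sense in which the training signal stays close to what the model would actually generate.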

📄 Original Abstract (English)

Self-distillation improves reasoning in large language models by using the model's own rollouts as training signal, typically through implicit logit-level alignment that minimizes KL divergence toward a privileged target distribution. However, because this supervision is generated via uncontrolled sampling, it provides no diagnostic insight into the model's specific errors or corrective guidance for its individual failure patterns. Consequently, the model learns to imitate a privileged distribution rather than receiving fine-grained corrections that pinpoint where and why its reasoning fails. In this paper, we propose Trajectory-Augmented Policy Optimization (TAPO), which advances self-distillation from implicit distributional alignment to explicit trajectory construction. During RL training, the model produces both correct and incorrect rollouts to the same query, and TAPO leverages this contrastive structure to construct micro-reflective corrections, new training trajectories that retain the model's erroneous reasoning up to the point of failure, then insert a natural-language diagnosis and corrected reasoning guided by a correct reference from the same sampling group. Since each trajectory is anchored in the learner's own prefix and solutions, the corrective signal preserves the model's on-policy distribution to a greater extent than the position-wise alignment imposed by KL-based methods. To integrate these trajectories, TAPO introduces difficulty-aware candidate selection at the model's capability boundary and decoupled advantage estimation to prevent gradient contamination. Experiments on AIME 2024, AIME 2025, and HMMT 2025 show that TAPO achieves consistent improvements over GRPO under the same number of training steps. Further analysis demonstrates that TAPO strengthens both first-pass reasoning and error-correction effectiveness.
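The abstract names difficulty-aware candidate selection and decoupled advantage estimation without spelling out either one. The sketch below shows one plausible reading, under loudly stated assumptions. Only `group_advantages` follows the commonly described GRPO recipe of standardizing rewards within a sampling group (population standard deviation is used here; implementations differ). The pass-rate band in `at_capability_boundary`, the fixed `constructed_advantage`, and all function names are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: each reward standardized within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def at_capability_boundary(
    rewards: List[float], low: float = 0.2, high: float = 0.8
) -> bool:
    """Assumed selection rule: keep queries the model solves sometimes but
    not always, so both correct and incorrect rollouts are available.
    The thresholds are illustrative, not from the paper."""
    pass_rate = mean(rewards)
    return low <= pass_rate <= high


def decoupled_advantages(
    on_policy_rewards: List[float],
    n_constructed: int,
    constructed_advantage: float = 1.0,
) -> Tuple[List[float], List[float]]:
    """Assumed form of decoupling: the group baseline is computed from the
    on-policy rollouts only, and constructed trajectories receive their own
    advantage so they do not shift that baseline."""
    on_policy = group_advantages(on_policy_rewards)
    constructed = [constructed_advantage] * n_constructed
    return on_policy, constructed


rewards = [1.0, 0.0, 0.0, 1.0]  # binary correctness of four rollouts
if at_capability_boundary(rewards):
    on_policy_adv, constructed_adv = decoupled_advantages(rewards, n_constructed=2)
    print(on_policy_adv, constructed_adv)
```

The point of keeping the two sets separate in this sketch is that adding constructed trajectories to the same group would change the mean and standard deviation used for the on-policy rollouts, which is one way the "gradient contamination" mentioned in the abstract could arise.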

Original paper on arXiv
