📄 Paper Explainer

AI Learns to Crib from Its "Future Self"

Trusted channel ▲ 70 · Diffusion language models · Self-distillation · Self-learning · Reasoning

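Training large models usually takes a lot of human-labeled data, but this paper has the AI teach itself, and learn from its "future self" at that. The researchers designed a new method for diffusion language models (models that do not write text one token at a time, but "develop" the whole passage at once, like a photograph). The model first generates its own answer, then feeds that answer back as "future experience" to guide the current version of itself. It is like having a student sit the exam once, then study with their own finished answers in hand. Across four reasoning benchmarks, the method outperformed both reinforcement learning (RLVR) and supervised fine-tuning baselines while needing only about 10% of the optimization steps RLVR uses. It is not something you can use tomorrow, but it points in a direction: AI may need far less human feedback and could improve iteratively on its own.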

📄 Original Abstract (English)

On-policy self-distillation (OPSD) has proven effective for post-training large language models (LLMs), yet its application to diffusion LLMs (dLLMs) remains unexplored. Existing OPSD methods are inherently autoregressive-centric. They inject privileged information via left-to-right prefix conditioning with token-level divergence supervision, a design that fundamentally conflicts with the arbitrary-order generation of dLLMs. We introduce d-OPSD, the first OPSD framework tailored for dLLMs. Our approach makes two core contributions. First, we reframe self-teacher construction by using self-generated answers as suffix conditioning, enabling the student model to learn from "self future-experience" rather than privileged prefixes. Second, we shift supervision from token-level to step-level, aligning training with the iterative denoising process of dLLMs. Experiments across four reasoning benchmarks show that d-OPSD consistently outperforms RLVR and SFT baselines with superior sample efficiency, requiring only around 10% of the optimization steps by RLVR and opening a promising pathway for dLLM post-training. The code is available at https://github.com/xingzhejun/d-OPSD.
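
For readers who want a concrete picture of the abstract's two ideas, here is a toy sketch in plain Python. It is an illustration under loud assumptions, not the authors' implementation: the `[MASK]` token, the helper names (`build_inputs`, `kl`, `step_level_loss`), the choice of KL divergence, and the averaging scheme are all hypothetical, and the real objective in the d-OPSD repository may differ. The sketch only shows (1) a teacher input that sees the model's own earlier answer as a suffix while the student does not, and (2) a loss grouped by denoising step rather than by individual token.

```python
import math

MASK = "[MASK]"  # hypothetical mask token, not taken from the paper


def build_inputs(prompt, response_len, self_answer):
    """Return (student_input, teacher_input) as token lists.

    Student: prompt followed by a fully masked response.
    Teacher: the same sequence, plus the model's own earlier answer
    appended as a suffix (the "self future-experience").
    """
    masked = [MASK] * response_len
    student = prompt + masked
    teacher = prompt + masked + self_answer
    return student, teacher


def kl(p, q):
    """KL(p || q) for two distributions given as dicts over the same vocab.

    Assumes q[t] > 0 wherever p[t] > 0.
    """
    return sum(p[t] * math.log(p[t] / q[t]) for t in p if p[t] > 0)


def step_level_loss(teacher_steps, student_steps):
    """Average, over denoising steps, of the summed per-position KL.

    Each element of teacher_steps / student_steps is the list of
    distributions predicted at one denoising step. Grouping by step
    (instead of one flat token-level sum) is this sketch's reading of
    "step-level supervision" and is an assumption.
    """
    per_step = [
        sum(kl(t, s) for t, s in zip(t_dists, s_dists))
        for t_dists, s_dists in zip(teacher_steps, student_steps)
    ]
    return sum(per_step) / len(per_step)


if __name__ == "__main__":
    student, teacher = build_inputs(["What", "is", "6*7?"], 2, ["42"])
    print(student)
    print(teacher)
    # One denoising step, one masked position, made-up distributions.
    teacher_steps = [[{"42": 0.5, "41": 0.5}]]
    student_steps = [[{"42": 0.9, "41": 0.1}]]
    print(step_level_loss(teacher_steps, student_steps))
```

In a real setup both sets of distributions would come from the same diffusion model run on the two inputs, with gradients flowing only through the student side; here they are hand-written numbers so the sketch stays self-contained.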

Original paper on arXiv
