A new method for AI distillation: teachers that teach more precisely, students that learn for real
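Model distillation (using a large model to teach a small one) has a hidden trap: when the teacher relies on "privileged information" that the student does not have (for example, seeing the answer in advance), the student ends up learning to imitate the privilege rather than acquiring real capability. This paper proposes DOPD, which decides dynamically, token by token, whether supervision should come from the privileged teacher or from the student itself under privileged input, so that the student does not mistake an information gap for a capability gap. Experiments on large language models and vision-language models show that it outperforms vanilla on-policy distillation and is more stable and more robust. It is not something you can put to use tomorrow, but it explains why some distilled models "learn the wrong thing".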
📄 Original abstract (English)
On-policy distillation (OPD) offers superior capacity transfer by supervising student-sampled trajectories with dense token-level signals. To furnish high-quality supervision sources and thereby elevate the performance frontier of distillation, an intuitive direction is to infuse privileged information to either teacher or student itself. However, this additional input induces a potential failure mode we dub privilege illusion: a pattern that conflates the transferable capability gap that students are meant to close, and the information asymmetry gap that can only be mimicked but never replicated. This issue is further amplified by the inherent non-uniformity of token-level supervision, where only a small subset of tokens carries pivotal capability-bearing signals. To this end, we propose DOPD, an advantage-aware dual distillation paradigm that dynamically routes token-level supervision between privileged teacher and privileged student policies based on their advantage gap and relative probabilities. Each token receives supervision of different strength, objective, and strategy from either teacher or student itself, which transfers credible capability while simultaneously receiving auxiliary signals, to alleviate privilege illusion. Extensive experiments on both large language model (LLM) and vision-language model (VLM) settings demonstrate that DOPD consistently outperforms Vanilla OPD and other counterparts. Further results on stability, robustness, continual learning, and out-of-distribution tasks validate its superiority.
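To make the routing idea concrete, here is a toy sketch of token-level routing between two supervision sources. It is not the paper's algorithm: the abstract only says that routing depends on the advantage gap and relative probabilities, without giving formulas. The function names, the single threshold rule, the use of reverse KL as the only objective, and the per-token advantage values are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Turn a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(student_probs, target_probs):
    """KL(student || target) at a single token position."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(student_probs, target_probs)
        if p > 0.0
    )

def route_token(adv_teacher, adv_student, threshold=0.0):
    """Hypothetical routing rule (an assumption, not from the paper):
    use the privileged teacher only when its advantage exceeds the
    privileged student's advantage by more than `threshold`."""
    return "teacher" if adv_teacher - adv_student > threshold else "student"

def routed_loss(student_logits, teacher_logits, priv_student_logits,
                adv_teacher, adv_student, threshold=0.0):
    """Average per-token reverse KL, where each token's target
    distribution comes from whichever source the router selects."""
    losses = []
    for s, t, ps, a_t, a_s in zip(student_logits, teacher_logits,
                                  priv_student_logits,
                                  adv_teacher, adv_student):
        source = route_token(a_t, a_s, threshold)
        target = softmax(t) if source == "teacher" else softmax(ps)
        losses.append(reverse_kl(softmax(s), target))
    return sum(losses) / len(losses)

if __name__ == "__main__":
    # Two token positions over a three-word vocabulary (made-up numbers).
    student = [[0.2, 0.1, 0.0], [1.0, 0.5, 0.0]]
    teacher = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
    priv_student = [[0.5, 0.3, 0.0], [1.2, 0.4, 0.0]]
    print(routed_loss(student, teacher, priv_student,
                      adv_teacher=[0.9, 0.1], adv_student=[0.2, 0.4]))
```

In this toy version the first token is supervised by the teacher and the second by the privileged student, because only the first has a positive advantage gap. The abstract describes something richer, where each token also receives a different supervision strength, objective, and strategy; none of that is modeled here.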