The teacher leaves the gradient alone and only rewrites the questions, and the small model learns better for it
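Traditional knowledge distillation makes a large model the teacher and forces the small model to imitate its outputs (logits). When the teacher is much larger, though, the small model latches onto only the teacher's sharpest modes, its most pronounced "accent", and then stumbles on benchmark families outside the training data. Reinforcement learning (RL) instead lets the small model learn from its own attempts, but when every attempt on a question fails, the advantage is zero and the question is silently dropped; injecting a stronger teacher's answer into the policy gradient at that point breaks the on-policy assumption and causes drift. This paper's fix, Zone of Proximal Policy Optimization (ZPPO), is counterintuitive: the teacher never touches the gradient and appears only inside the prompt. On a hard question, ZPPO turns one question into two. The first (BCQ) presents one correct teacher answer and one incorrect student answer as anonymized candidates and asks the small model to tell them apart. The second (NCQ) bundles the small model's wrong answers into a single prompt so that their shared failure modes surface. A replay buffer keeps feeding each hard question back until it "graduates", meaning the small model's mean rollout accuracy on it reaches one half, or until it is evicted first-in-first-out because the buffer is full. On the Qwen3.5 family (four student sizes from 0.8B to 9B) with a 27B teacher, post-trained as vision-language models and evaluated on 31 benchmarks, ZPPO outperforms off-policy and on-policy distillation as well as GRPO, with the largest gains at the smallest scale. It is not something you can use tomorrow, but it offers a new idea: teaching an AI need not mean feeding it answers; it can mean designing the questions.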
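The two reformulated prompts can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names `build_bcq` and `build_ncq`, the candidate labels, and all instruction wording are assumptions made up for this example. Only the overall shape comes from the description above: one correct teacher answer and one wrong student answer shown anonymously, and the student's wrong answers gathered into one prompt.

```python
import random
from typing import List, Optional


def build_bcq(question: str, teacher_correct: str, student_wrong: str,
              rng: Optional[random.Random] = None) -> str:
    """Binary-candidate prompt: one correct and one incorrect response,
    shuffled and relabeled so the prompt does not reveal which is which.
    The wording below is hypothetical."""
    rng = rng or random.Random()
    candidates = [teacher_correct, student_wrong]
    rng.shuffle(candidates)  # random order, so position does not leak the answer
    return (
        f"{question}\n\n"
        f"Candidate A:\n{candidates[0]}\n\n"
        f"Candidate B:\n{candidates[1]}\n\n"
        "Exactly one candidate is correct. Decide which, then answer the question."
    )


def build_ncq(question: str, student_wrong: List[str]) -> str:
    """Negative-candidate prompt: all wrong rollouts for one question
    collected into a single prompt. The wording below is hypothetical."""
    listed = "\n\n".join(
        f"Incorrect attempt {i}:\n{resp}"
        for i, resp in enumerate(student_wrong, 1)
    )
    return (
        f"{question}\n\n{listed}\n\n"
        "All attempts above are incorrect. Identify what they get wrong, "
        "then answer the question."
    )
```

Note that neither prompt puts a teacher response into the training target. In this sketch the teacher's answer appears only as input text, so whatever the small model generates in reply is still its own rollout.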
📄 Original abstract (English)
Knowledge distillation transfers a teacher's competence to a small student but is brittle in the small-student regime: forcing the student to imitate logits from a much larger teacher concentrates it on the teacher's sharpest modes, hurting generalization on benchmark families beyond the training corpus. Reinforcement learning (RL) avoids logit imitation by training on the student's own rollouts. However, on questions where every rollout fails (yielding zero advantage and being silently discarded), injecting a stronger teacher's response into the policy gradient breaks the on-policy assumption and induces drift. We introduce Zone of Proximal Policy Optimization (ZPPO), inspired by Vygotsky's zone of proximal development, which keeps the teacher inside the prompt rather than the policy gradient. On hard questions, ZPPO constructs two reformulated prompts: a Binary Candidate-included Question (BCQ) pairs one correct teacher response with one incorrect student response as anonymized candidates the student must discriminate, and a Negative Candidate-included Question (NCQ) aggregates the student's wrong rollouts into a single prompt to surface their shared failure modes. A prompt replay buffer recirculates each hard question until it either graduates (the student's mean rollout accuracy on it reaches half) or is FIFO-evicted under finite capacity, amplifying BCQ and NCQ inside the student's current zone of proximal development. On the Qwen3.5 family at four student scales (0.8B-9B) with a 27B teacher, post-trained as vision-language models and evaluated on a 31-benchmark suite (16 VLM, 10 LLM, 5 Video), ZPPO outperforms off/on-policy distillation and GRPO, with the largest gains at the smallest scale.
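Two mechanics in the abstract are easy to make concrete: why an all-fail group gives no learning signal, and how a replay buffer with graduation and FIFO eviction might behave. The sketch below is illustrative only. `group_advantages` uses the standard group-normalized form (reward minus group mean, divided by group standard deviation) associated with GRPO. The `PromptReplayBuffer` class, its method names, and the choice to treat "reaches half" as `>= 0.5` are assumptions for this example, not details confirmed by the abstract.

```python
from collections import OrderedDict
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


class PromptReplayBuffer:
    """Hypothetical buffer of hard questions with graduation and FIFO eviction."""

    def __init__(self, capacity: int, graduate_at: float = 0.5):
        self.capacity = capacity
        self.graduate_at = graduate_at  # assumed threshold: mean accuracy >= 0.5
        self._items = OrderedDict()     # question id -> prompt, oldest first

    def add(self, qid: str, prompt: str) -> None:
        if qid in self._items:
            return                           # already queued; keep its position
        if len(self._items) >= self.capacity:
            self._items.popitem(last=False)  # FIFO: evict the oldest entry
        self._items[qid] = prompt

    def report(self, qid: str, rollout_correct: List[bool]) -> bool:
        """Record one round of rollouts; return True if the question graduates."""
        if not rollout_correct:
            raise ValueError("need at least one rollout")
        accuracy = sum(rollout_correct) / len(rollout_correct)
        if accuracy >= self.graduate_at:
            self._items.pop(qid, None)       # graduated: stop recirculating
            return True
        return False

    def pending(self) -> List[str]:
        return list(self._items)


# Every rollout failed: all rewards equal, so every advantage is zero.
print(group_advantages([0.0, 0.0, 0.0, 0.0]))

buf = PromptReplayBuffer(capacity=2)
for qid in ("q1", "q2", "q3"):
    buf.add(qid, f"prompt for {qid}")
print(buf.pending())  # q1 was evicted when q3 arrived
print(buf.report("q2", [True, True, False, False]))  # accuracy 0.5: graduates
```

With all rewards equal, each reward minus the mean is zero, so a group-normalized update has nothing to push on. This is the case where the abstract says the question is silently discarded, and where ZPPO instead recirculates reformulated prompts.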