📄 Paper explainer

The teacher rewrites the questions instead of touching the gradient, and small models learn better

Tags: knowledge distillation, small models, prompt engineering, reinforcement learning, education-inspired

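Traditional knowledge distillation has a large model act as the teacher while a small model imitates its outputs, but the small student tends to lock onto the teacher's sharpest modes and generalizes worse as a result. This paper (ZPPO) takes a different route: the teacher stays out of the policy gradient and appears only in the prompt, where hard questions are rewritten into two new formats. One (BCQ) presents a correct teacher response and an incorrect student response as anonymized candidates for the student to tell apart; the other (NCQ) gathers the student's own wrong rollouts into a single prompt so that their shared failure modes become visible. A replay buffer keeps feeding each hard question back until the student's mean rollout accuracy on it reaches one half, or until the question is evicted because the buffer is full, so the student keeps practicing near the edge of its own ability. Across four Qwen3.5 student sizes from 0.8B to 9B with a 27B teacher, the method beats off-policy distillation, on-policy distillation, and GRPO, with the largest gains at the smallest scale. It is not something you can use tomorrow, but it points in a direction: teaching an AI need not mean force-feeding answers; it can mean designing better questions.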
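To make the two question formats concrete, here is a minimal Python sketch of how such prompts could be assembled. The function names `build_bcq` and `build_ncq`, the candidate labels, and all instruction wording are hypothetical; the summary does not give the paper's actual templates, so treat this as an illustration of the idea rather than the authors' implementation.

```python
import random


def build_bcq(question, teacher_correct, student_wrong, rng=None):
    """Sketch of a Binary Candidate-included Question (BCQ).

    Pairs one correct teacher response with one incorrect student response
    and shuffles them so the prompt does not reveal who wrote which.
    The prompt wording is an assumption, not the paper's template.
    Returns (prompt_text, label_of_the_correct_candidate).
    """
    rng = rng or random.Random()
    candidates = [("teacher", teacher_correct), ("student", student_wrong)]
    rng.shuffle(candidates)  # anonymize by randomizing the order
    labels = ["A", "B"]
    lines = [f"Question: {question}", ""]
    for label, (_, text) in zip(labels, candidates):
        lines.append(f"Candidate {label}: {text}")
    lines.append("")
    lines.append("Exactly one candidate is correct. Decide which, then answer the question.")
    sources = [source for source, _ in candidates]
    correct_label = labels[sources.index("teacher")]
    return "\n".join(lines), correct_label


def build_ncq(question, wrong_rollouts):
    """Sketch of a Negative Candidate-included Question (NCQ).

    Collects the student's own failed attempts into a single prompt so that
    their shared failure modes are visible together. Wording is an assumption.
    """
    lines = [f"Question: {question}", "", "Previous attempts, all incorrect:"]
    for i, text in enumerate(wrong_rollouts, start=1):
        lines.append(f"Attempt {i}: {text}")
    lines.append("")
    lines.append("Identify what these attempts get wrong, then answer the question.")
    return "\n".join(lines)
```

In both cases the teacher's influence enters only through the prompt text; the student would still be trained on its own rollouts for the reformulated question.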

📄 Original abstract (English)

Knowledge distillation transfers a teacher's competence to a small student but is brittle in the small-student regime: forcing the student to imitate logits from a much larger teacher concentrates it on the teacher's sharpest modes, hurting generalization on benchmark families beyond the training corpus. Reinforcement learning (RL) avoids logit imitation by training on the student's own rollouts. However, on questions where every rollout fails (yielding zero advantage and being silently discarded), injecting a stronger teacher's response into the policy gradient breaks the on-policy assumption and induces drift. We introduce Zone of Proximal Policy Optimization (ZPPO), inspired by Vygotsky's zone of proximal development, which keeps the teacher inside the prompt rather than the policy gradient. On hard questions, ZPPO constructs two reformulated prompts: a Binary Candidate-included Question (BCQ) pairs one correct teacher response with one incorrect student response as anonymized candidates the student must discriminate, and a Negative Candidate-included Question (NCQ) aggregates the student's wrong rollouts into a single prompt to surface their shared failure modes. A prompt replay buffer recirculates each hard question until it either graduates (the student's mean rollout accuracy on it reaches half) or is FIFO-evicted under finite capacity, amplifying BCQ and NCQ inside the student's current zone of proximal development. On the Qwen3.5 family at four student scales (0.8B-9B) with a 27B teacher, post-trained as vision-language models and evaluated on a 31-benchmark suite (16 VLM, 10 LLM, 5 Video), ZPPO outperforms off/on-policy distillation and GRPO, with the largest gains at the smallest scale.
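The replay-buffer behavior described in the abstract (recirculate a hard question until it graduates or is FIFO-evicted) can be sketched in a few lines of Python. The class name `PromptReplayBuffer`, its method names, and the reading of "reaches half" as mean accuracy >= 0.5 are assumptions made for illustration; the abstract does not specify the data structure or the exact comparison.

```python
from collections import OrderedDict


class PromptReplayBuffer:
    """Hypothetical sketch of a prompt replay buffer with graduation and FIFO eviction."""

    def __init__(self, capacity, graduate_at=0.5):
        self.capacity = capacity
        self.graduate_at = graduate_at  # assumed threshold: mean accuracy "reaches half"
        self._items = OrderedDict()     # question id -> question, oldest first

    def add(self, qid, question):
        """Insert a hard question; return the id evicted to make room, if any."""
        if qid in self._items:
            return None
        evicted = None
        if len(self._items) >= self.capacity:
            evicted, _ = self._items.popitem(last=False)  # FIFO: drop the oldest entry
        self._items[qid] = question
        return evicted

    def update(self, qid, rollout_correct):
        """Graduate the question if mean rollout accuracy reaches the threshold.

        rollout_correct is a list of booleans, one per student rollout.
        Returns True if the question graduated and left the buffer.
        """
        if qid not in self._items or not rollout_correct:
            return False
        accuracy = sum(rollout_correct) / len(rollout_correct)
        if accuracy >= self.graduate_at:
            del self._items[qid]
            return True
        return False

    def pending(self):
        """Ids of questions still being recirculated, oldest first."""
        return list(self._items)
```

A short usage example: with `capacity=2`, adding a third question evicts the oldest one, and a question whose rollouts come back `[True, True, False, False]` graduates because its mean accuracy is exactly one half.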

Original paper on arXiv
