📄 Paper Digest

Small models as scouts: large models learn faster and better

Tags: small models, large models, training efficiency, mathematical reasoning, GRPO

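Training large models usually relies on adding randomness so that the model tries different lines of reasoning, but this tends to make the reasoning incoherent. This paper finds that, within the same model family, smaller models are inherently more exploratory than larger ones: their answers are more diverse, yet each answer stays logically coherent. The researchers therefore use a small model as a "scout" that first produces a range of sound solution paths, and then let the large model learn from them, gradually shifting to the large model's own samples. On mathematical reasoning benchmarks, using a 1.7B model to guide an 8B model improved accuracy by 8.8% on AIME 24, with faster convergence and less rollout compute. This is not a tool you can use tomorrow, but it points to a new idea: sometimes a small model is not dead weight but the best coach.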
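The "scout first, learner later" hand-off can be pictured as a schedule that decides, at each training step, how many rollouts in a group come from the small model and how many from the large one. The sketch below is only an illustration: the abstract does not state the actual annealing schedule, so the linear decay, the function names `small_model_fraction` and `split_group`, and the parameter `anneal_steps` are all assumptions.

```python
def small_model_fraction(step: int, anneal_steps: int) -> float:
    """Share of a rollout group drawn from the small explorer at `step`.

    Assumption: linear decay from 1.0 at step 0 to 0.0 at `anneal_steps`.
    The paper's real schedule may differ.
    """
    if anneal_steps <= 0:
        return 0.0
    return max(0.0, 1.0 - step / anneal_steps)


def split_group(group_size: int, step: int, anneal_steps: int) -> tuple:
    """Return (n_small, n_large): rollouts per prompt from each model."""
    n_small = round(group_size * small_model_fraction(step, anneal_steps))
    return n_small, group_size - n_small


if __name__ == "__main__":
    # Hypothetical settings: groups of 8 rollouts, annealing over 100 steps.
    for step in (0, 25, 50, 100, 150):
        print(step, split_group(8, step, 100))
```

Under this assumed schedule, early groups consist entirely of offline small-model rollouts, and after `anneal_steps` the large learner samples everything itself, which mirrors the exploration-to-exploitation transition the abstract describes.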

📄 Original abstract

We identify a new dimension for enhancing rollout diversity in Group Relative Policy Optimization (GRPO) for LLMs. While GRPO relies on diverse rollouts, prevailing strategies primarily increase diversity by injecting more token-level randomness, which may introduce step-wise noise and lead to incoherent trajectories. We uncover that smaller models within the same model family inherently exhibit higher policy-level diversity, indicated by their superior pass@k relative to larger counterparts as sample counts increase. Unlike token-level noise, this diversity is temporally correlated, preserves logical consistency, and provides structured exploration signals for gradient estimation. We thus propose S2L-PO (Small-to-Large Policy Optimization), a framework that leverages fixed small models as natural explorers to train larger models. To balance exploration and exploitation, we design a progressive annealing strategy that transitions from offline small-model rollouts to the large learner's own sampling. This shift elegantly avoids mid-training performance drops caused by the small model's capacity limits, achieving faster convergence and unlocking a higher performance ceiling. S2L-PO improves accuracy on diverse mathematical reasoning benchmarks (e.g., +8.8% on AIME 24 using a 1.7B explorer to guide the 8B model) while reducing rollout compute.
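The abstract uses pass@k as its indicator of policy-level diversity. As a rough illustration, the sketch below uses the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. Whether the paper computes pass@k in exactly this way is an assumption, and the per-problem correct counts are made up purely to show how a model with lower pass@1 can still reach higher pass@k.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n samples with c correct ones."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    # Probability that a random size-k subset contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average pass@k over problems, given correct counts per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


if __name__ == "__main__":
    n = 8
    # Hypothetical counts of correct samples (out of 8) on two problems.
    large = [8, 0]  # always solves problem A, never solves problem B
    small = [3, 1]  # solves both occasionally
    for k in (1, 8):
        print(k, mean_pass_at_k(large, n, k), mean_pass_at_k(small, n, k))
```

In this toy setup the large model wins at k = 1 (0.5 versus 0.25) but the small model wins at k = 8 (1.0 versus 0.5), which is the kind of crossover the abstract points to when it says smaller models show superior pass@k as sample counts increase.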

Original paper on arXiv
