📄 Paper Digest

Letting AI explore bolder solution paths without drifting off course

Tags: AI reasoning, math problem solving, policy optimization, semantic mixing, GRPO

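Want an AI to come up with more ways to solve a math problem without making things up? Existing methods either vary only the surface wording (which is redundant) or add random noise to the embedding vectors (which easily drifts off course). This paper proposes N-GRPO: in the model's embedding space, its "thinking space," it mixes the vector of the current token with the vectors of its most similar neighbors to produce a new input. This keeps the semantics intact while letting the model explore genuinely different solution paths. On several math reasoning benchmarks it improves consistently over existing methods, and it generalizes to problem types it has not seen. It is not a tool you can use tomorrow, but it points to a sound direction for getting AI to "think more broadly."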
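To make the mixing idea concrete, here is a minimal sketch in Python with NumPy. It is not the paper's implementation: the function name `neighbor_mix`, the use of cosine similarity, the fixed `anchor_weight`, and the Dirichlet-sampled neighbor weights are all assumptions chosen for illustration, since the summary does not specify how neighbors are selected or weighted.

```python
import numpy as np

def neighbor_mix(embeddings, token_id, k=3, anchor_weight=0.7, rng=None):
    """Blend one token's embedding with its k nearest neighbors.

    Hypothetical helper: cosine similarity, anchor_weight, and Dirichlet
    weights are illustrative assumptions, not details from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    embeddings = np.asarray(embeddings, dtype=float)

    # Cosine similarity between the anchor token and every row of the table.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sims = unit @ unit[token_id]
    sims[token_id] = -np.inf  # the anchor is not its own neighbor

    # Indices of the k most similar tokens.
    neighbors = np.argsort(-sims)[:k]

    # Random convex weights: the anchor keeps anchor_weight, and the
    # neighbors share the remainder, so the result stays inside the
    # convex hull of the anchor and its neighbors.
    neighbor_weights = rng.dirichlet(np.ones(k)) * (1.0 - anchor_weight)
    mixed = anchor_weight * embeddings[token_id] + neighbor_weights @ embeddings[neighbors]
    return mixed, neighbors

# Toy embedding table: tokens 1 and 2 point in nearly the same direction as token 0.
table = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [-1.0, 0.0], [0.0, -1.0]])
mixed, neighbors = neighbor_mix(table, token_id=0, k=2, rng=np.random.default_rng(0))
```

Because the weights are non-negative and sum to one, the mixed vector can only land between the anchor and its neighbors, which is one simple way to read "diversity without leaving the local semantic neighborhood." Unconstrained Gaussian noise offers no such guarantee.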

📄 Original Abstract (English)

The success of Large Language Models in mathematical reasoning relies heavily on the generation of diverse and valid solution paths during the rollout phase. However, current rollout techniques face a fundamental trade-off: token-level sampling often yields redundant trajectories that differ only in rephrasing, while embedding-level methods utilizing random noise frequently disrupt semantic consistency. To resolve this, we introduce N-GRPO, a novel exploration strategy integrated into the Group Relative Policy Optimization (GRPO) framework. Rather than relying on token-level sampling or native embedding-level noise, our approach leverages Semantic Neighbor Mixing. This mechanism dynamically constructs input representations by mixing the embeddings of an anchor token and its nearest semantic neighbors, thereby injecting diversity while strictly adhering to the local semantic manifold. Experimental evaluations on the DeepSeek-R1-Distill-Qwen models across different sizes show that N-GRPO not only achieves consistent improvements over strong baselines on math reasoning benchmarks but also exhibits robust generalization capabilities on out-of-distribution tasks.
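For readers unfamiliar with GRPO, the sketch below shows the group-relative scoring step as it is commonly described: each rollout's reward is standardized against the other rollouts sampled for the same prompt. The function name `group_relative_advantages` and the `eps` constant are illustrative assumptions, and the abstract does not say whether N-GRPO modifies this step.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for a single prompt.

    Illustrative helper: eps only guards against division by zero when
    every rollout in the group receives the same reward.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four rollouts for one problem: two correct (reward 1), two wrong (reward 0).
mixed_group = group_relative_advantages([1, 0, 0, 1])

# Four rollouts that all reach the same outcome.
uniform_group = group_relative_advantages([1, 1, 1, 1])
```

In the second group every advantage is zero, so those rollouts contribute no learning signal. This is one reason rollouts that differ only in rephrasing are a problem, and why the diversity of solution paths matters during the rollout phase.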

Original paper on arXiv
