📄 Paper Digest

The Diversity in AI Problem Solving Is an Illusion: Flashy on the Surface, the Same Strategy Underneath


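Ask the same AI to solve one math problem and it may hand back ten differently worded answers that all follow the same underlying line of attack. This paper finds that the metrics commonly used to measure diversity in LLM reasoning mostly capture surface variation (wording or step order, for example) and miss what the authors call approach-level diversity: whether the correct solutions to a problem actually use different strategies. Using a human-calibrated LLM judge, the researchers show that existing diversity measures are unreliable proxies for approach-level diversity. The mismatch carries over to diversity-aware RLVR training, where the target metric holds up while approach-level diversity declines. Approach-diverse candidate sets do improve test-time scaling, but when the judge's diversity score is used directly as a training reward, the model learns to exploit the judge's particular preferences rather than genuinely broaden its approaches, so directly optimizing approach-level diversity remains an open problem. This helps explain why a model can look flexible while repeatedly falling back on the same line of thinking. It is not something you can use tomorrow, but it is a reminder not to be dazzled by an AI's "fancy moves": real innovation needs diversity at a deeper level.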

📄 Original Abstract (English)

Diversity in LLM mathematical reasoning is critical for exploration, but common diversity metrics mostly capture surface-level variation rather than differences in how a problem is solved. We address this gap by introducing approach-level diversity: variation in strategies across correct solutions to the same problem. Using a human-calibrated LLM judge framework, we show that prior diversity measures are unreliable proxies for approach-level diversity, and this mismatch carries over to diversity-aware RLVR, where target metrics are preserved while approach-level diversity declines. Investigating when approach-level diversity helps and whether it can be directly induced, we find that approach-diverse candidate sets improve test-time scaling. However, optimizing an LLM judge diversity reward during training causes the policy to exploit judge-specific preferences rather than broaden its approaches, leaving direct optimization of approach-level diversity as an open problem. Together, our work introduces the notion of approach-level diversity and uncovers a systematic divergence between surface- and approach-level signals, marking a step toward LLMs that reason in genuinely diverse, human-like ways.
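
To make the gap between surface and approach-level signals concrete, here is a minimal sketch in Python. It is not the paper's method: the `approach` labels are assumed to come from some external judge, the sample data is invented for illustration, and word-set Jaccard distance is swapped in as a stand-in for a surface-level diversity metric.

```python
from itertools import combinations

# Hypothetical samples for one problem. The "approach" labels are assumed
# to be assigned by an external judge; the data is invented for illustration.
samples = [
    {"text": "we prove the claim by induction on n",
     "approach": "induction", "correct": True},
    {"text": "base case holds then assume k and show k plus one",
     "approach": "induction", "correct": True},
    {"text": "inductive step follows from the hypothesis",
     "approach": "induction", "correct": True},
    {"text": "expand the generating function and compare coefficients",
     "approach": "generating functions", "correct": False},
]

def distinct_approaches(samples):
    """Count distinct strategy labels among correct solutions only."""
    return len({s["approach"] for s in samples if s["correct"]})

def surface_diversity(samples):
    """Mean pairwise Jaccard distance between word sets of correct solutions.

    A simple stand-in for a surface-level metric, not one from the paper.
    """
    word_sets = [set(s["text"].lower().split()) for s in samples if s["correct"]]
    pairs = list(combinations(word_sets, 2))
    if not pairs:
        return 0.0
    distances = [
        1 - len(a & b) / len(a | b) if (a | b) else 0.0
        for a, b in pairs
    ]
    return sum(distances) / len(distances)

print("distinct approaches:", distinct_approaches(samples))
print("surface diversity:", round(surface_diversity(samples), 3))
```

In this toy set the three correct solutions share almost no vocabulary, so the surface score is close to its maximum of 1, yet all three use a single strategy. The incorrect sample is ignored because approach-level diversity is defined over correct solutions only.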

Original paper on arXiv
