The diversity of AI problem solving is an illusion: flashy on the surface, same strategy underneath
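Ask the same AI to solve a math problem and it may hand back ten answers that are worded differently yet all rest on the same line of attack. This paper finds that the metrics commonly used to measure diversity in AI reasoning mostly capture surface variation (wording, order of steps) and miss strategy-level diversity, that is, whether the correct solutions actually use different methods. Using an LLM judge calibrated against human judgments, the researchers show that existing diversity measures are unreliable proxies for strategy-level diversity. The mismatch carries over to diversity-aware RLVR (reinforcement learning with verifiable rewards) training: the targeted metrics hold steady while strategy-level diversity declines. Candidate sets that contain genuinely different approaches do improve test-time scaling, so the distinction matters in practice. Rewarding the judge's diversity score directly during training does not fix the problem, though: the model learns to exploit the judge's particular preferences rather than broaden its approaches, and the authors leave direct optimization as an open problem. This helps explain why AI can look flexible while falling back on the same few strategies. It is not something you can put to use tomorrow, but it is a reminder not to be taken in by AI's surface variety: genuine innovation needs diversity at a deeper level.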
📄 Original abstract (English)
Diversity in LLM mathematical reasoning is critical for exploration, but common diversity metrics mostly capture surface-level variation rather than differences in how a problem is solved. We address this gap by introducing approach-level diversity: variation in strategies across correct solutions to the same problem. Using a human-calibrated LLM judge framework, we show that prior diversity measures are unreliable proxies for approach-level diversity, and this mismatch carries over to diversity-aware RLVR, where target metrics are preserved while approach-level diversity declines. Investigating when approach-level diversity helps and whether it can be directly induced, we find that approach-diverse candidate sets improve test-time scaling. However, optimizing an LLM judge diversity reward during training causes the policy to exploit judge-specific preferences rather than broaden its approaches, leaving direct optimization of approach-level diversity as an open problem. Together, our work introduces the notion of approach-level diversity and uncovers a systematic divergence between surface- and approach-level signals, marking a step toward LLMs that reason in genuinely diverse, human-like ways.
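
To make the surface-versus-approach distinction concrete, here is a minimal sketch under stated assumptions. It is not the paper's method: the function names, the toy solutions, and the hand-assigned strategy labels are all hypothetical, the word-overlap score is only a stand-in for a surface-level metric, and the paper obtains approach judgments from a human-calibrated LLM judge rather than from manual labels.

```python
from itertools import combinations


def surface_diversity(solutions):
    """Mean pairwise Jaccard distance over word sets.

    A stand-in surface-level metric (an assumption for illustration),
    not one of the measures evaluated in the paper.
    """
    token_sets = [set(s.lower().split()) for s in solutions]
    pairs = list(combinations(token_sets, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        union = a | b
        # Two empty solutions are treated as identical (distance 0).
        total += 1 - len(a & b) / len(union) if union else 0.0
    return total / len(pairs)


def approach_diversity(strategy_labels):
    """Number of distinct strategies among correct solutions.

    Labels are assigned by hand here; in the paper this role is played
    by a human-calibrated LLM judge.
    """
    return len(set(strategy_labels))


# Toy correct solutions to "sum the integers from 1 to 100" (hypothetical).
solutions = [
    "Pair 1 with 100 and 2 with 99, giving 50 pairs of 101, so 5050.",
    "Group terms from both ends into fifty couples each totalling 101, hence 5050.",
    "Match the smallest and largest numbers repeatedly; fifty sums of 101 make 5050.",
]
labels = ["pairing", "pairing", "pairing"]

surface_score = surface_diversity(solutions)
approach_score = approach_diversity(labels)
```

In this toy set the three solutions share few words, so the word-overlap score is well above zero, yet all three use the same pairing strategy and the approach count is 1. That is the kind of divergence between surface and approach signals the abstract describes.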