📄 Paper Digest

Is an LLM's Final Layer Actually Dumber? Picking the Right Intermediate Layer Makes Reasoning Stronger

LLM reasoning · alignment tax · decoding strategy · layer selection

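When a large language model generates an answer, it decodes from the final layer by default. Recent research finds, however, that the final layer can be skewed by alignment training, which can hurt reasoning. The paper describes a recurring pattern inside the model: early layers make rough guesses, intermediate layers do the careful reasoning, and the final layer can "correct" a right answer into a safer, more generic token. The researchers propose a training-free method: for each generated token, the model automatically picks the most reliable of the last few layers, judged by entropy, and so sidesteps interference from the final layer. On hard benchmarks such as GPQA-Diamond and the Olympiad-level Omni-MATH, reasoning accuracy improves, with zero memory overhead and under 2% added latency. This is not a feature you can use tomorrow, but it exposes a counterintuitive point: alignment training may make a model better behaved yet worse at reasoning, and bypassing the final layer can release its underlying reasoning ability.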
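To make the per-token layer choice concrete, here is a minimal sketch of an entropy-guided backward search. It is an illustration under stated assumptions, not the paper's actual rule: the names `select_layer`, `max_back`, and `margin` are hypothetical, and the stopping criterion (step back only while the earlier layer's entropy is lower by at least `margin`) is one plausible reading of "conservative". In a real model, each layer's logits would come from projecting that layer's hidden state through the output head; here they are hand-written toy values.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; lower means a more confident distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_layer(layer_logits, max_back=3, margin=0.1):
    """Pick a near-final layer by searching backward from the final one.

    layer_logits is ordered from earliest to final layer. The stopping rule
    below is an assumption for illustration: move one layer back only while
    that layer is more confident (lower entropy) by at least `margin`.
    """
    entropies = [entropy(softmax(logits)) for logits in layer_logits]
    final = len(layer_logits) - 1
    chosen = final
    for step in range(1, max_back + 1):
        candidate = final - step
        if candidate < 0:
            break
        if entropies[candidate] < entropies[chosen] - margin:
            chosen = candidate
        else:
            break  # conservative: stop at the first layer that is not clearly better
    return chosen


def decode_token(layer_logits, max_back=3, margin=0.1):
    """Return (chosen layer index, greedy token id from that layer)."""
    idx = select_layer(layer_logits, max_back=max_back, margin=margin)
    logits = layer_logits[idx]
    token = max(range(len(logits)), key=logits.__getitem__)
    return idx, token


# Toy 4-token vocabulary: the final layer drifts toward token 0,
# while the penultimate layer is confident in token 1.
toy_layers = [
    [0.1, 0.0, 0.2, 0.1],  # early layer: near-uniform guess
    [0.5, 3.0, 0.2, 0.1],  # intermediate layer: leaning toward token 1
    [0.2, 6.0, 0.1, 0.0],  # penultimate layer: confident in token 1
    [2.4, 2.0, 0.1, 0.0],  # final layer: perturbed toward token 0
]

if __name__ == "__main__":
    print(decode_token(toy_layers))
```

With these toy values the search steps back once, to the penultimate layer, and then stops because the layer before it is less confident. Setting `max_back=0` recovers ordinary final-layer decoding.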

📄 Original Abstract (English)

Autoregressive generation in large language models (LLMs) conventionally decodes from the final layer, assuming that deeper representations yield more reliable next-token predictions. We revisit this assumption by revealing a recurring Guess-Refine-Perturb dynamic: early layers form coarse guesses, intermediate layers refine reasoning-relevant semantics, and final layers can perturb these refined predictions toward generic or alignment-preferred tokens. We introduce Confident Decoding, a training-free decoding strategy that dynamically selects the most reliable near-final layer through entropy-guided conservative backward search. We further provide a theoretical formulation of layer selection as an optimal stopping problem, showing that under bounded projection noise and dominant late-stage alignment perturbation, our search rule filters perturbation while bounding the loss relative to the oracle refinement layer. Experiments across dense and Mixture-of-Experts LLMs demonstrate consistent gains on challenging reasoning benchmarks, including GPQA-Diamond, Omni-MATH, and HLE, with zero memory overhead and less than 2% latency increase. These results suggest dynamically bypassing final-layer perturbations can unlock stronger reasoning behavior from aligned LLMs.

Original paper on arXiv
