Is an LLM's Final Layer Actually Dumber? Picking the Right Intermediate Layer Makes Reasoning Stronger
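When a large language model generates an answer, it decodes from its final layer by default. Recent research, however, finds that the final layer can be skewed by alignment training, which can actually make reasoning worse. The model shows a recurring internal pattern: early layers make rough guesses, intermediate layers do the careful reasoning, and the final layer may then "correct" the right answer into a safer, more generic token. The researchers propose a training-free method: for each generated token, the model automatically picks the most reliable of its last few layers (judged by entropy), bypassing the final layer's interference. On hard benchmarks such as GPQA-Diamond, Omni-MATH, and HLE, reasoning accuracy improves consistently, with zero memory overhead and less than a 2% increase in latency. This is not a feature you can use tomorrow, but it reveals a counterintuitive fact: alignment training may make a model more "well-behaved" yet "dumber", and bypassing the final layer can unlock its real reasoning ability.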
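To make the layer-selection idea concrete, here is a minimal toy sketch. It is not the paper's implementation: the exact search rule is not given in this summary, so the function `select_layer` and its parameters `max_back` and `margin` are hypothetical. The sketch assumes each layer's hidden state has already been projected to vocabulary logits, starts at the final layer, and steps back to an earlier layer only if that layer's next-token distribution has clearly lower entropy.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats; lower means a more confident distribution.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_layer(layer_logits, max_back=3, margin=0.1):
    """Pick a layer index for decoding one token.

    layer_logits: per-layer next-token logits, ordered from earliest to
    final layer. max_back and margin are illustrative assumptions, not
    values taken from the paper.
    """
    final = len(layer_logits) - 1
    best = final
    best_h = entropy(softmax(layer_logits[final]))
    # Walk backward from the final layer, at most max_back steps.
    for idx in range(final - 1, max(final - max_back, 0) - 1, -1):
        h = entropy(softmax(layer_logits[idx]))
        if h < best_h - margin:
            best, best_h = idx, h  # clearly more confident: step back
        else:
            break  # conservative: stop at the first layer that is not better
    return best

# Toy logits over a 3-token vocabulary (made-up numbers).
layers = [
    [0.1, 0.0, 0.2],  # early layer: near-uniform guess
    [0.5, 4.0, 0.3],  # intermediate layer: confident in token 1
    [2.0, 2.2, 0.1],  # final layer: less confident, pulled toward token 0
]
chosen = select_layer(layers)
probs = softmax(layers[chosen])
token = probs.index(max(probs))
print(chosen, token)
```

On this toy input the search steps back from the final layer to the intermediate one and then stops, because the early layer is far less confident. In a real model the per-layer logits would come from passing each layer's hidden state through the model's output projection, which this sketch does not cover.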
📄 Original abstract (English)
Autoregressive generation in large language models (LLMs) conventionally decodes from the final layer, assuming that deeper representations yield more reliable next-token predictions. We revisit this assumption by revealing a recurring Guess-Refine-Perturb dynamic: early layers form coarse guesses, intermediate layers refine reasoning-relevant semantics, and final layers can perturb these refined predictions toward generic or alignment-preferred tokens. We introduce Confident Decoding, a training-free decoding strategy that dynamically selects the most reliable near-final layer through entropy-guided conservative backward search. We further provide a theoretical formulation of layer selection as an optimal stopping problem, showing that under bounded projection noise and dominant late-stage alignment perturbation, our search rule filters perturbation while bounding the loss relative to the oracle refinement layer. Experiments across dense and Mixture-of-Experts LLMs demonstrate consistent gains on challenging reasoning benchmarks, including GPQA-Diamond, Omni-MATH, and HLE, with zero memory overhead and less than 2% latency increase. These results suggest dynamically bypassing final-layer perturbations can unlock stronger reasoning behavior from aligned LLMs.