LLM "Thinking" Is Shallower Than You Might Expect
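We tend to assume that large models are "thinking": that somewhere inside they hold abstract concepts and reasoning steps. A new paper checks this with an axiomatic framework and finds that these so-called "latent thoughts" are quite shallow. The researchers define four axioms: Causality (the representation should reflect real causal relationships), Minimality (no redundancy), Separability (different tasks can be told apart), and Stability (consistency within a task). They tested several open-weight models on 23 reasoning tasks, and no representation satisfied all four axioms at once. The part that stings most: the models can tell task types apart, but for two different questions within the same task, the internal representations are almost indistinguishable. On top of that, most of the information these representations carry is already present in the input embedding. The failure shows up across dense, reasoning-distilled, and RL-trained model families, which suggests it is structural rather than a matter of model size or training procedure. This is not something you can put to use tomorrow, but it is a reminder not to imagine an LLM's "internal reasoning" as more sophisticated than it is.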
📄 Original abstract
We introduce an axiomatic evaluation framework for latent thought representations in LLMs, comprising metrics that are independent of downstream benchmark scores and reveal representational failures that benchmark accuracy masks. Existing evaluations conflate representation quality with model capacity. Therefore, failures cannot be attributed to the representation rather than to the model that processes it. We formalize four functional axioms (Causality, Minimality, Separability, and Stability) and define a quantitative measure for each, computed directly on the representation independently of downstream accuracy. We audit open-weight LLMs across 23 reasoning tasks (e.g., Spatial Reasoning, Factual QA). We find that no candidate satisfies all four axioms simultaneously, that the representations distinguish task type reliably but cannot distinguish between two questions within the same task, and that the representations encode little information beyond what is already present in the input embedding. The failure is consistent across dense, reasoning-distilled, and RL-trained model families, indicating that the gap is structural rather than a property of model size or training procedure.
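To make the "task type yes, individual question no" finding concrete, here is a toy sketch in Python. It is not the paper's metric: the abstract does not spell out how each axiom is measured, so `separation_gap` is a hypothetical helper and the data is synthetic. Each vector is built as a task centroid plus a small amount of per-question noise, which is roughly the situation the abstract describes.

```python
import numpy as np

def separation_gap(reps, task_ids):
    """Return (between_task, within_task) mean Euclidean distances.

    Hypothetical helper for illustration only; not a metric from the paper.
    """
    reps = np.asarray(reps, dtype=float)
    task_ids = np.asarray(task_ids)
    # Pairwise distance between every pair of question representations.
    dists = np.linalg.norm(reps[:, None, :] - reps[None, :, :], axis=-1)
    same_task = task_ids[:, None] == task_ids[None, :]
    not_self = ~np.eye(len(reps), dtype=bool)
    # Average distance between different questions of the same task.
    within = dists[same_task & not_self].mean()
    # Average distance between questions of different tasks.
    between = dists[~same_task].mean()
    return float(between), float(within)

# Synthetic data (assumption): 3 tasks, 20 questions each, 16-dim vectors.
rng = np.random.default_rng(0)
task_ids = np.repeat(np.arange(3), 20)
centroids = rng.normal(scale=5.0, size=(3, 16))
# Each representation is its task centroid plus tiny per-question noise.
reps = centroids[task_ids] + rng.normal(scale=0.1, size=(60, 16))

between, within = separation_gap(reps, task_ids)
print(f"between-task: {between:.2f}, within-task: {within:.2f}")
```

When the between-task distance dwarfs the within-task distance, a probe can easily recover which task a question belongs to, yet has very little signal for telling two questions in that task apart. Running the same helper on real hidden states would require extracting them from a model first, which this sketch deliberately leaves out.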