Teaching AI to "know what it doesn't know"
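Large language models often state falsehoods with full confidence because they lack metacognition, the ability to monitor and regulate their own cognitive processes. This paper proposes a new training method: after the model produces an answer, it judges for itself whether that answer is correct, and the quality of that self-judgment is then used to optimize the model. Concretely, during preference optimization, completions are ranked not only by how good the answer is but also by how accurately the model assessed its own answer; the same kind of self-assessment is also used to select more valuable training data. Experiments show that the method makes models express uncertainty more faithfully and achieves state-of-the-art calibration across diverse tasks without sacrificing accuracy. It is not a feature you can use tomorrow, but it points to a key direction for making AI more trustworthy: not making it smarter, but making it clearer about how smart it actually is.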
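To make the idea concrete, here is a minimal Python sketch of how self-judgment quality could feed into preference ranking and data selection. It is an illustration under stated assumptions, not the paper's implementation: the `Completion` fields, the squared-error measure of self-judgment quality, the `weight` and `margin` values, and the "highest mean error" selection rule are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Completion:
    prompt_id: str
    text: str
    correct: bool           # graded against a reference answer
    self_confidence: float  # model's own estimate that it is right, in [0, 1]


def self_judgment_error(c):
    """Squared gap between stated confidence and the actual outcome."""
    outcome = 1.0 if c.correct else 0.0
    return (c.self_confidence - outcome) ** 2


def score(c, weight=0.5):
    """Hypothetical rule: reward correctness, penalize poor self-judgment."""
    return (1.0 if c.correct else 0.0) - weight * self_judgment_error(c)


def build_preference_pairs(completions, margin=0.1):
    """Return (chosen, rejected) pairs for completions of a single prompt.

    Pairs whose scores differ by less than `margin` are skipped as too
    ambiguous to be a useful preference signal (an assumption of this sketch).
    """
    pairs = []
    for a, b in combinations(completions, 2):
        gap = score(a) - score(b)
        if abs(gap) < margin:
            continue
        pairs.append((a, b) if gap > 0 else (b, a))
    return pairs


def select_prompts(completions, k):
    """Pick the k prompts where the model misjudges itself the most.

    Assumed selection rule: highest mean self-judgment error per prompt.
    """
    errors = {}
    for c in completions:
        errors.setdefault(c.prompt_id, []).append(self_judgment_error(c))
    mean_error = {p: sum(v) / len(v) for p, v in errors.items()}
    return sorted(mean_error, key=mean_error.get, reverse=True)[:k]
```

In this sketch, a wrong answer that the model flagged as probably wrong is preferred over a wrong answer it was sure about, and the resulting pairs could then be passed to any standard preference-optimization trainer. Keeping correctness in the score is one simple way to avoid rewarding a model for being well calibrated but wrong.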
📄 Original abstract (English)
Metacognition is a critical component of intelligence that describes the ability to monitor and regulate one's own cognitive processes. Yet LLMs exhibit systemic deficiencies in key metacognitive faculties: they hallucinate with high confidence, fail to recognize knowledge boundaries, and misrepresent their internal uncertainty--undermining trustworthiness and reliability. Since monitoring task performance and adapting behavior accordingly are central to metacognition, we posit that models capable of accurately judging their own performance are better positioned to improve it. We operationalize this idea via two novel mechanisms: reinforcement learning with metacognitive feedback (RLMF), a paradigm to refine completion rankings during preference optimization based on the quality of a model's self-judgments of performance, and metacognitive data selection, which uses similar self-judgments to identify high-value training examples, outperforming naive active learning. We apply these innovations to the problem of faithful calibration (FC), a task that is itself fundamentally metacognitive: the goal is to align expressed with intrinsic uncertainty, difficult even for frontier LLMs. We adopt a two-stage, decoupled approach, first using these methods to calibrate the faithfulness of models' self-reported confidence scores, then mapping to natural, context-adaptable linguistic uncertainty via targeted output editing. Extensive experiments show RLMF achieves generalizable, state-of-the-art FC on diverse tasks while preserving accuracy. Further, RLMF surpasses standard RL by up to 63% while enhancing models' ability to assess and express their own capability limits. This positions RLMF as a promising paradigm to enhance LLM metacognition toward improved abilities and alignment, and suggests metacognitive performance as an effective RL signal to overcome limits of prior intrinsic feedback methods.