Psychometric Questionnaires Misread AI: It Has Just Learned to Please You
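The results of an AI's psychological test may be misleading. Researchers had eight open-source large language models fill out standard personality and values questionnaires (with items along the lines of "Do you value security?"), then profiled the same models on realistic user queries (such as "Should I quit my job?") by measuring how much probability each model assigns to responses that express different values. The two profiles diverged substantially. On the questionnaires, the models looked as if they had a stable personality: items targeting the same trait drew consistent answers. On realistic queries, that consistency disappeared. The explanation the authors offer is simple: questionnaire items contain explicit cue words (such as "security" or "achievement"), so a model can recognize what is being measured and give the socially desirable answer, much like a person telling an examiner what they want to hear. More importantly, giving a model a persona such as "I am an elderly person" shifts its questionnaire answers the way real people's answers shift, but no such shift appears when the model responds to realistic queries. The takeaway: do not trust an AI's "personality test", because the model has merely learned to guess the answer you want. To find out how a model is actually disposed to respond, look at what it chooses in realistic scenarios; the study suggests that profiling based on generation probabilities is a more accurate measure than questionnaires.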
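To make "profiling based on generation probabilities" concrete, here is a minimal sketch. It assumes you have already obtained, from some model, the total log-probability of each candidate response to a user query, with each candidate written to express one value. The function name, the value labels, and the numbers are all hypothetical, and the summary does not say how the authors aggregate or normalize probabilities, so treat this as one plausible reading and not as their method.

```python
import math

def profile_from_logprobs(logprobs):
    """Turn per-candidate log-probabilities into a normalized value profile.

    logprobs maps a value label to the total log-probability the model
    assigns to the candidate response expressing that value (hypothetical
    inputs; how these are obtained is outside this sketch).
    """
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(logprobs.values())
    weights = {label: math.exp(lp - top) for label, lp in logprobs.items()}
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}

# Made-up log-probabilities for three candidate answers to
# "Should I quit my job?", each framed around a different value.
candidate_logprobs = {
    "security": -42.0,
    "achievement": -41.0,
    "self_direction": -43.0,
}

profile = profile_from_logprobs(candidate_logprobs)
for label, share in sorted(profile.items(), key=lambda kv: -kv[1]):
    print(f"{label}: {share:.3f}")
```

Averaging such profiles over many queries would give a per-model value profile that can be compared with the questionnaire scores. Longer candidates accumulate lower log-probabilities, so a real pipeline would probably need some form of length control, which this sketch leaves out.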
📄 Original abstract
We examine whether human psychometric questionnaires can serve as reliable tools for characterizing and predicting LLM behavior in everyday user interactions. We analyze eight open-source LLMs by comparing their value and personality profiles derived from two different methods: Likert self-reports on established questionnaires (PVQ-40/21 and BFI-44/10) and generation probabilities over value-laden responses to everyday user queries. The two profiles diverge substantially. Within-construct item consistency, often cited as evidence of stable LLM dispositions, disappears in generation probabilities. We attribute this gap to the fact that explicit lexical cues in established questionnaire items allow models to recognize the target construct and respond in alignment-consistent, socially desirable ways, whereas realistic user queries provide no such cues. In addition, demographic persona prompts shift models' responses to human questionnaires in ways consistent with real human patterns, but no such shifts appear in the generation probabilities of responses to realistic user queries, showing their limited ability to simulate the behaviors of target demographics in real-world user interactions. Overall, our study shows that human psychometric questionnaires are insufficient tools for predicting LLM behavior and suggests generation-based profiling as a more accurate measure.
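The abstract refers to "within-construct item consistency" without naming a statistic. One common choice in psychometrics is Cronbach's alpha, and the sketch below assumes that choice purely for illustration. The function name and the scores are hypothetical: each inner list stands for one questionnaire item, with one score per repeated run or respondent.

```python
from statistics import variance

def cronbach_alpha(items):
    """Cronbach's alpha for k items measuring the same construct.

    items is a list of k lists; items[i][j] is the score on item i
    from run (or respondent) j. Assumed metric, not confirmed by the
    abstract.
    """
    k = len(items)
    if k < 2:
        raise ValueError("need at least two items")
    # Total score per run, summed across items.
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))

# Made-up scores: items that rise and fall together across runs...
consistent = [[1, 2, 3, 4, 5], [2, 3, 4, 5, 6], [1, 2, 3, 4, 5]]
# ...versus items that vary independently of one another.
mixed = [[1, 2, 3, 4, 5], [5, 3, 1, 4, 2], [2, 5, 1, 3, 4]]

print(cronbach_alpha(consistent))
print(cronbach_alpha(mixed))
```

A high alpha on Likert self-reports alongside a low one on generation-based scores for the same construct would be the kind of pattern the abstract describes.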