📄 Paper Digest

AI in Psychiatric Consultation: Fluent Conversation, Inaccurate Diagnosis

Trusted channel · Psychiatric diagnosis · Large language models · Multi-turn dialogue · Chinese medical AI · Benchmark

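In psychiatric consultation, conversational skill and diagnostic ability are two different things for AI. This paper has large language models play the doctor in simulated conversations with patients and reports that models reach up to 92.3% accuracy on the simple binary classification of depression versus anxiety, but accuracy falls to 43.0% when they must recognize depression-anxiety comorbidity (both conditions present at once) and to 28.5% on 12-way differential diagnosis across ICD-10 psychiatric categories. More counterintuitively, models often do worse in dynamic, multi-turn consultation than when handed the record and asked to diagnose directly, which the authors attribute to ineffective information gathering: poor questioning strategies impair the diagnostic reasoning that follows. The researchers also find that consultation quality scored by an LLM judge (for example, how thorough the questioning is) correlates only moderately with final diagnostic accuracy, so well-structured questioning alone does not ensure a correct diagnosis. The benchmark, built on 16,000 synthetic consultation dialogues aligned with electronic medical records, is positioned as a large-scale tool for evaluating Chinese-language psychiatric AI, and what it exposes is a weakness of current models: they can imitate a doctor's conversation but have not yet learned to actually diagnose.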
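To make the third finding concrete, here is a minimal sketch of how diagnostic accuracy and its correlation with judge-assigned consultation scores could be computed. It is not the benchmark's evaluation code: the function names, the ICD-10-style labels, and every number below are invented for illustration, and the paper's actual correlation measure may differ from the Pearson coefficient used here.

```python
from math import sqrt


def accuracy(predicted, gold):
    """Fraction of cases where the predicted label equals the gold label."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("need two non-empty lists of equal length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy values, invented purely for illustration (NOT taken from the paper).
gold = ["F32", "F41", "F32", "F20", "F41", "F31"]   # hypothetical gold labels
pred = ["F32", "F41", "F41", "F20", "F32", "F32"]   # hypothetical model output
judge_scores = [4.5, 4.0, 4.2, 3.0, 3.8, 2.5]       # hypothetical judge ratings

# 1.0 where the diagnosis was right, 0.0 where it was wrong.
correct = [1.0 if p == g else 0.0 for p, g in zip(pred, gold)]

diag_accuracy = accuracy(pred, gold)
quality_vs_correct = pearson(judge_scores, correct)

print(f"diagnostic accuracy: {diag_accuracy:.2f}")
print(f"judge score vs. correctness (Pearson r): {quality_vs_correct:.2f}")
```

A coefficient well below 1 on real data would correspond to the pattern the paper describes: a consultation can be rated highly by the judge and still end in the wrong diagnosis.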

📄 Original Abstract (English)

Mental disorders are highly prevalent worldwide, but the shortage of psychiatrists and the inherent subjectivity of interview-based diagnosis create substantial barriers to timely and consistent mental-health assessment. Progress in AI-assisted psychiatric diagnosis is constrained by the absence of benchmarks that simultaneously provide realistic patient simulation, clinician-verified diagnostic labels, and support for dynamic multi-turn consultation. We present LingxiDiagBench, a large-scale multi-agent benchmark that evaluates LLMs on both static diagnostic inference and dynamic multi-turn psychiatric consultation in Chinese. At its core is LingxiDiag-16K, a dataset of 16,000 EMR-aligned synthetic consultation dialogues designed to reproduce real clinical demographic and diagnostic distributions across 12 ICD-10 psychiatric categories. Through extensive experiments across state-of-the-art LLMs, we establish key findings: (1) although LLMs achieve high accuracy on binary depression-anxiety classification (up to 92.3%), performance deteriorates substantially for depression-anxiety comorbidity recognition (43.0%) and 12-way differential diagnosis (28.5%); (2) dynamic consultation often underperforms static evaluation, indicating that ineffective information-gathering strategies significantly impair downstream diagnostic reasoning; (3) consultation quality assessed by LLM-as-a-Judge shows only moderate correlation with diagnostic accuracy, suggesting that well-structured questioning alone does not ensure correct diagnostic decisions. We release LingxiDiag-16K and the full evaluation framework to support reproducible research at https://github.com/Lingxi-mental-health/LingxiDiagBench.

Original paper on arXiv
