📄 Paper Digest

AI doctors stumble as soon as they are misled: accuracy plunges from 71% to 38%

Evaluating LLM robustness to misleading medical context

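Large language models can post expert-level scores on medical exams, but once deliberately misleading context is injected into a question, mean accuracy across 11 model configurations drops from 71.1% to 38.0%. The researchers built MedMisBench, which pairs 10,932 medical questions with 48,889 misleading context-option pairs, and used it to attack the models. The most damaging attacks were fabricated rules delivered in an authoritative voice, which succeeded 69.5% of the time. A 14-member clinical panel from 7 countries found serious potential harm in 38.2% of the cases it reviewed. This is not a tool you can use tomorrow, but the takeaway is clear: do not mistake an AI's exam score for clinical judgment.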

📄 Original abstract

Large language models (LLMs) now reach expert-level scores on medical licensing exams, encouraging the assumption that high scores imply safe medical judgment while patients increasingly use them for health advice. We show this assumption is fragile: when misleading context is injected into questions that LLMs originally answer correctly, they abandon the correct answer. We call the ability to maintain correct judgment under adversarial context epistemic resilience, and introduce MedMisBench to measure it. MedMisBench contains 10,932 medical question items and 48,889 misleading context-option pairs spanning medical reasoning, agentic capability, and patient-journey evaluation. Across 11 model configurations, mean accuracy falls from 71.1% on original questions to 38.0% under focused misleading context, with 51.5% attack success. The most damaging injections are formal, rule-like fabrications: authority-framed falsehoods reach 69.5% attack success and exception-poisoning claims reach 64.1%. A 14-member clinical panel from 7 countries identified serious potential harm in 38.2% of reviewed cases. MedMisBench exposes a structural blind spot in LLM evaluation in medical settings: existing benchmarks measure what models know, but not whether they preserve correct medical judgment under misleading context.
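To make the three headline numbers concrete, here is a minimal Python sketch of how original accuracy, accuracy under misleading context, and attack success could be computed from paired model answers. The `Trial` record, the `summarize` function, and the toy data are all hypothetical, and the attack-success definition used here (the share of originally correct answers that the model abandons) is an assumption drawn from the abstract's wording; the paper's exact metric may differ.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One question answered twice: as written, and with misleading context added."""
    gold: str             # correct option label
    answer_original: str  # model answer on the original question
    answer_misled: str    # model answer after misleading context is injected


def summarize(trials: list[Trial]) -> dict[str, float]:
    """Compute accuracy before and after injection, plus attack success."""
    if not trials:
        raise ValueError("need at least one trial")
    n = len(trials)
    originally_correct = [t for t in trials if t.answer_original == t.gold]
    # An attack "succeeds" when an originally correct answer is abandoned.
    flipped = [t for t in originally_correct if t.answer_misled != t.gold]
    return {
        "original_accuracy": len(originally_correct) / n,
        "misled_accuracy": sum(t.answer_misled == t.gold for t in trials) / n,
        # Assumed definition; not confirmed by the abstract.
        "attack_success": (
            len(flipped) / len(originally_correct) if originally_correct else 0.0
        ),
    }


# Toy data for illustration only; these are not results from the paper.
trials = [
    Trial(gold="A", answer_original="A", answer_misled="B"),
    Trial(gold="C", answer_original="C", answer_misled="C"),
    Trial(gold="B", answer_original="B", answer_misled="D"),
    Trial(gold="D", answer_original="A", answer_misled="A"),
]
print(summarize(trials))
```

Conditioning attack success on the originally correct subset keeps it separate from questions the model would have missed anyway, which matches the abstract's focus on questions that LLMs originally answer correctly.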

Original paper on arXiv
