📄 Paper Digest

The Emperor's New Clothes of AI Evaluation: High-Scoring Models Collectively Stumble on Dense Detail

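The high scores you see on AI leaderboards may be an illusion. This paper finds that on information-dense images, current multimodal models can get most individual elements right yet fail far more often once several precise details must hold at the same time (for example, "a dog wearing a blue hat next to a red car"). The researchers built an "atomic" evaluation called PerceptionRubrics: 1,038 images paired with more than 12,000 image-specific rubrics, roughly a dozen per image, split into Must-Right rubrics (essential facts) and Easy-Wrong rubrics (fine-grained details). Scoring is gated rather than averaged: missing a mandatory visual fact triggers a sharp binary penalty instead of costing a fraction of a point. The result: models whose scores on conventional benchmarks are saturated drop sharply under this all-or-nothing test. More surprising, a persistent 8% perception gap remains between open-source and proprietary models, contrary to the recent trend of open models catching up to closed ones on reasoning. This is not a tool you can use tomorrow, but it is a reminder not to be fooled by leaderboard numbers: for AI, a gap still separates "seeing" from "understanding".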
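
A back-of-the-envelope calculation shows why conjunctive requirements are so punishing. It is an illustration only, not a result from the paper: it assumes every rubric is answered correctly with the same probability $p$ and that errors are independent, which real models will not satisfy exactly. The value $n = 12$ is roughly the average number of rubrics per image implied by 12,000 rubrics over 1,038 images; $p = 0.95$ is a hypothetical per-rubric accuracy.

```latex
% Assumption: n independent rubrics, each satisfied with probability p.
\[
  P(\text{all } n \text{ rubrics satisfied}) = p^{\,n}
\]
% Hypothetical numbers: p = 0.95 per rubric, n = 12 rubrics per image.
\[
  0.95^{12} \approx 0.54
\]
```

Under these assumptions, a model that is right 95% of the time on each individual detail gets an entire image fully right only about half the time.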

📄 Original Abstract (English)

We introduce PerceptionRubrics, a rubric-based evaluation framework that addresses the gap between saturated benchmark scores and real-world brittleness. Shifting evaluation from holistic semantic matching to rigorous atomic auditing, PerceptionRubrics pairs 1,038 information-dense images with over 12,000 instance-specific rubrics. These criteria are derived from golden captions constructed via a novel Circular Peer-Review consensus pipeline and then distilled into a dual-stream system of Must-Right (essential facts) and Easy-Wrong (fine-grained details) rubrics. Crucially, PerceptionRubrics implements a Gated Scoring mechanism: unlike linear averages, failure on mandatory visual facts triggers sharp binary penalties. Extensive evaluation yields critical insights: (1) The Reliability Gap: models often verify fragmented elements correctly yet fail strict conjunctive constraints, exposing brittleness in dense domains; (2) Open-Closed Stratification: contrary to reasoning trends, we reveal a persistent 8% perception deficit between open-source and proprietary frontiers; and (3) Human-Aligned Rigor: our gated metrics substantially out-align conventional benchmarks, validating that strict perceptual fidelity is the prerequisite for reliable generation.
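
The contrast between a linear average and gated scoring can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the abstract does not give the exact formula, so the gate used here (any failed Must-Right rubric zeroes the image, otherwise the score is the pass rate on Easy-Wrong rubrics) is a guess at one plausible form. The names `RubricResult`, `linear_score`, and `gated_score` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RubricResult:
    kind: str     # "must_right" or "easy_wrong" (labels assumed for this sketch)
    passed: bool  # whether the model's output satisfied the rubric


def linear_score(results):
    """Conventional linear average: fraction of all rubrics passed."""
    if not results:
        raise ValueError("no rubric results to score")
    return sum(r.passed for r in results) / len(results)


def gated_score(results):
    """Assumed gate: one failed must_right rubric zeroes the whole image.

    If the gate is passed, the score is the pass rate on easy_wrong rubrics.
    This is an illustrative choice, not the formula from the paper.
    """
    if not results:
        raise ValueError("no rubric results to score")
    must = [r for r in results if r.kind == "must_right"]
    easy = [r for r in results if r.kind == "easy_wrong"]
    if not all(r.passed for r in must):
        return 0.0  # binary penalty: a mandatory fact was missed
    if not easy:
        return 1.0  # gate passed and nothing else to check
    return sum(r.passed for r in easy) / len(easy)


# Hypothetical image with 12 rubrics: one mandatory fact is missed.
mixed = (
    [RubricResult("must_right", True)] * 5
    + [RubricResult("must_right", False)]
    + [RubricResult("easy_wrong", True)] * 4
    + [RubricResult("easy_wrong", False)] * 2
)

# Same image, but every mandatory fact is correct.
clean = (
    [RubricResult("must_right", True)] * 6
    + [RubricResult("easy_wrong", True)] * 4
    + [RubricResult("easy_wrong", False)] * 2
)

print(linear_score(mixed), gated_score(mixed))
print(linear_score(clean), gated_score(clean))
```

In the first case the linear average still reports 9 of 12 rubrics passed, while the gated score collapses to zero because one mandatory fact was missed. That divergence is the kind of gap between fragmented correctness and strict conjunctive constraints that the abstract calls the Reliability Gap.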

Original paper on arXiv
