AI Pulse
📄 Paper Digest

The Emperor's New Clothes of AI Evaluation: High-Scoring Models Collectively Stumble on Dense Details

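The high scores you see on AI benchmarks may be an illusion. This paper finds that on information-dense images, current multimodal models can verify most individual elements correctly yet fail once several precise details must hold at the same time (for example, "a dog wearing a blue hat next to the red car"). The researchers built an "atomic" evaluation called PerceptionRubrics: 1,038 images paired with more than 12,000 image-specific rubrics, roughly a dozen per image, split into Must-Right rubrics (essential facts) and Easy-Wrong rubrics (fine-grained details). Scoring is gated rather than averaged: failing a mandatory visual fact triggers a sharp binary penalty. The result: models whose scores on conventional benchmarks look saturated drop sharply under this all-or-nothing test. More surprisingly, a persistent 8% perception gap separates open-source from proprietary models, which runs against the recent trend of open models catching up on reasoning. This is not a tool you can use tomorrow, but it is a reminder not to be fooled by leaderboard numbers: a gap still separates what AI "sees" from what it "understands".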
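To see why strict conjunctive checks are so much harsher than per-element accuracy, here is a minimal sketch. It assumes that errors on individual details are independent, which is a simplification made for illustration and not a claim from the paper; the function name `conjunctive_success` and the 95% figure are likewise hypothetical.

```python
def conjunctive_success(per_detail_accuracy: float, num_details: int) -> float:
    """Chance that ALL details are right, assuming independent errors.

    Illustrative assumption only: real model errors are likely correlated.
    """
    if not 0.0 <= per_detail_accuracy <= 1.0:
        raise ValueError("per_detail_accuracy must be between 0 and 1")
    if num_details < 0:
        raise ValueError("num_details must be non-negative")
    return per_detail_accuracy ** num_details


if __name__ == "__main__":
    # A model that gets each detail right 95% of the time looks strong,
    # but the chance of getting every detail right shrinks as details pile up.
    for n in (1, 5, 10, 20):
        print(f"{n:>2} details -> {conjunctive_success(0.95, n):.3f}")
```

Under this assumption, a model that is right on 95% of individual details clears ten details at once only about 60% of the time.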

📄 Original Abstract (English)

We introduce PerceptionRubrics, a rubric-based evaluation framework that addresses the gap between saturated benchmark scores and real-world brittleness. Shifting evaluation from holistic semantic matching to rigorous atomic auditing, PerceptionRubrics pairs 1,038 information-dense images with over 12,000 instance-specific rubrics. These criteria are derived from golden captions constructed via a novel Circular Peer-Review consensus pipeline and then distilled into a dual-stream system of Must-Right (essential facts) and Easy-Wrong (fine-grained details) rubrics. Crucially, PerceptionRubrics implements a Gated Scoring mechanism: unlike linear averages, failure on mandatory visual facts triggers sharp binary penalties. Extensive evaluation yields critical insights: (1) The Reliability Gap: models often verify fragmented elements correctly yet fail strict conjunctive constraints, exposing brittleness in dense domains; (2) Open-Closed Stratification: contrary to reasoning trends, we reveal a persistent 8% perception deficit between open-source and proprietary frontiers; and (3) Human-Aligned Rigor: our gated metrics substantially out-align conventional benchmarks, validating that strict perceptual fidelity is the prerequisite for reliable generation.
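The contrast between a linear average and gated scoring can be sketched as follows. The abstract does not give the exact formula, so this is one plausible reading under stated assumptions: any failed Must-Right rubric zeroes the score for that image, and otherwise the score is the pass rate on Easy-Wrong rubrics. The names `RubricResult`, `linear_score`, and `gated_score` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RubricResult:
    kind: str     # "must_right" or "easy_wrong" (labels assumed for this sketch)
    passed: bool  # whether the model's output satisfied this rubric


def linear_score(results: List[RubricResult]) -> float:
    """Conventional average: every rubric counts equally."""
    if not results:
        raise ValueError("need at least one rubric result")
    return sum(r.passed for r in results) / len(results)


def gated_score(results: List[RubricResult]) -> float:
    """Assumed gating rule: one failed Must-Right rubric gives 0.0."""
    if not results:
        raise ValueError("need at least one rubric result")
    if any(r.kind == "must_right" and not r.passed for r in results):
        return 0.0  # binary penalty: the gate is closed
    easy = [r for r in results if r.kind == "easy_wrong"]
    if not easy:
        return 1.0  # only mandatory facts were present, and all passed
    return sum(r.passed for r in easy) / len(easy)


# Hypothetical image: 4 essential facts (one missed), 8 fine details (all right).
example = (
    [RubricResult("must_right", True)] * 3
    + [RubricResult("must_right", False)]
    + [RubricResult("easy_wrong", True)] * 8
)

if __name__ == "__main__":
    print(f"linear: {linear_score(example):.3f}")
    print(f"gated:  {gated_score(example):.3f}")
```

In this made-up case the linear average rewards 11 of 12 correct rubrics, while the gated score is zero because one essential fact was missed, which is the kind of divergence the abstract calls the Reliability Gap.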

Original paper on arXiv
