📄 Paper Digest

AI on Olympiad Combinatorics: Constructing Is Not the Same as Proving

Tags: Olympiad math, combinatorics, AI reasoning, benchmarks, proof ability

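Frontier AI models show a split on Olympiad combinatorics: producing a valid construction and writing a rigorous proof turn out to be separate skills. The new benchmark ComBench tests models on 100 human-annotated, competition-level problems. The strongest model reaches an overall average of only 65.4% (75.3% Best@4), and existence and construction problems are consistently the hardest. More surprisingly, Kimi-K2.6 beats GPT-5.5 on construction-centric problems (Best@4) but trails it on analysis-centric proof grading, which suggests that "building" and "arguing" are two distinct capabilities. This is not a tool you can use tomorrow, but it marks a boundary of AI reasoning: getting the answer right is not the same as being able to explain why.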

📄 Original Abstract

Combinatorics is central to Olympiad-level mathematical problem solving, requiring deep discrete reasoning, creative constructions, and rigorous structural insight. Recent evidence suggests that even today's strongest frontier models remain uneven on Olympiad combinatorics, revealing a gap in creative mathematical reasoning. We introduce ComBench, an Olympiad-level combinatorics benchmark for evaluating and diagnosing the combinatorial reasoning capabilities of large language models. ComBench contains 100 human-annotated competition-level problems organized around two complementary settings: analysis-centric problems, which primarily require rigorous mathematical arguments, and construction-centric problems, which require explicit constructions in addition to correctness justifications. The evaluation protocol combines rubric-guided proof grading with deterministic construction verification, exposing cases where proof quality and construction validity diverge. Experiments on frontier open- and closed-source models show that ComBench is far from saturated: the strongest model reaches 65.4% overall Avg. and 75.3% overall Best@4. We further find that Rigorous Proof Reasoning and Constructive Realization are distinct capabilities: Kimi-K2.6 trails GPT-5.5 on analysis-centric proof grading but surpasses it on construction-centric Best@4, while Existence and Construction problems remain consistently hardest across representative frontier models.

Original paper on arXiv
