📄 Paper Digest

AI Reasoning Stops Guessing Words and Starts Drawing Pictures

Tags: multimodal reasoning · continuous space · train-inference mismatch · bidirectional calibration · visual understanding

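Today's multimodal large language models (the kind that answer questions about images) have an odd habit: they turn the image into text before reasoning about it. That is like describing a painting in a few hundred words and then reasoning from the description, so fine perceptual detail can get lost. This paper takes a different route: the model reasons directly in a "continuous space", with no detour through text. That creates a new problem. During training the model can see the correct answer; at inference time it cannot, so it learns answer-dependent shortcuts that do not carry over. The researchers address this with "bidirectional calibration": one term pulls the inference-time guess toward the path found during training, while a second term works in the opposite direction and keeps the training-time path from leaning too heavily on the answer. On the BLINK complex visual reasoning benchmark the average score improves by 10.83 points, with gains of up to 32 points on individual tasks. This is not something you can use tomorrow, but the direction is clear: AI reasoning should not be a word-guessing game; it should be drawing a map.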

📄 Original Abstract (English)

Multimodal Large Language Models (MLLMs) are often constrained by a language-space bottleneck, forcing complex visual reasoning into discrete tokens which can lose perceptual nuance. A promising alternative is continuous latent reasoning, where the goal is to discover implicit reasoning pathways that bridge the multimodal query and the final answer. However, this introduces a severe train-inference mismatch: a training-time posterior, conditioned on the ground-truth answer, can exploit answer-dependent shortcuts. Standard variational training then forces the inference-time prior to mimic a posterior that has access to information unavailable at test time, leading to poor performance. To address this, we propose Asymmetric Mutual Variational Learning (AMVL), a framework that resolves this mismatch via a bidirectional calibration objective. A forward KL divergence trains the target-agnostic prior to match the posterior, while a novel reverse KL divergence simultaneously regularizes the posterior, preventing it from collapsing into inference-incompatible regions and mitigating this "answer leakage". We provide theoretical analysis formalizing this leakage as prior contamination and prove that our dual-KL objective reduces it. We instantiate AMVL in a latent-integrated MLLM and show that it consistently outperforms strong discrete and latent-reasoning baselines, improving the average score on the complex BLINK benchmark by +10.83 and achieving gains of up to +32.00 on individual reasoning tasks, with analyses confirming improved latent-space stability.
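To make the dual-KL idea more concrete, here is a minimal sketch of a two-direction KL penalty between a posterior and a prior. It is an illustration under loud assumptions, not the paper's implementation: the abstract does not give the form of the latent distributions, so diagonal Gaussians are assumed here; the function names `kl_diag_gaussian` and `dual_kl` and the weights `alpha` and `beta` are hypothetical; and the abstract does not specify which argument order it labels "forward" versus "reverse", or which term's gradient reaches which network, so the code simply computes both directions.

```python
import math


def kl_diag_gaussian(mu_a, sd_a, mu_b, sd_b):
    """KL( N(mu_a, sd_a^2) || N(mu_b, sd_b^2) ), summed over dimensions.

    All arguments are equal-length lists of floats; standard deviations
    must be positive. Uses the closed form for univariate Gaussians.
    """
    total = 0.0
    for ma, sa, mb, sb in zip(mu_a, sd_a, mu_b, sd_b):
        total += math.log(sb / sa) + (sa ** 2 + (ma - mb) ** 2) / (2 * sb ** 2) - 0.5
    return total


def dual_kl(posterior, prior, alpha=1.0, beta=1.0):
    """Weighted sum of both KL directions (assumed form, hypothetical weights).

    posterior, prior: (mean_list, std_list) tuples.
    Returns (total, KL(posterior || prior), KL(prior || posterior)).
    """
    (mu_q, sd_q), (mu_p, sd_p) = posterior, prior
    kl_qp = kl_diag_gaussian(mu_q, sd_q, mu_p, sd_p)
    kl_pq = kl_diag_gaussian(mu_p, sd_p, mu_q, sd_q)
    return alpha * kl_qp + beta * kl_pq, kl_qp, kl_pq


if __name__ == "__main__":
    # A sharply peaked posterior (as if it had locked onto the answer)
    # against a broad prior that never sees the answer.
    sharp_posterior = ([0.0], [0.1])
    broad_prior = ([0.0], [1.0])
    total, kl_qp, kl_pq = dual_kl(sharp_posterior, broad_prior)
    print(f"KL(q||p)={kl_qp:.3f}  KL(p||q)={kl_pq:.3f}  total={total:.3f}")
```

In this toy case the two directions behave very differently: KL(q‖p) is about 1.8, while KL(p‖q) is about 47.2, because the broad prior puts mass where the narrow posterior has almost none. That asymmetry gives some intuition for why adding the second direction could discourage a posterior from collapsing into a region the prior cannot reach, which is the behavior the abstract attributes to its regularizing term.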

Original paper on arXiv
