📄 Paper Digest

No Answer Labels Needed: the AI Teaches Itself to See the Details

Visual reasoning · Label-free training · Contrastive learning · Multimodal large language models

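Training an AI for fine-grained visual reasoning usually requires large amounts of human-annotated answers, which is costly and slow. This paper proposes V-Zero, which uses no answer labels at all: the model generates its own responses, and during training a question-relevant crop of the image is paired with a negative visual view to judge whether each sampled reasoning trajectory is grounded in the right evidence. That judgment then gates the token-level distillation that corrects the model. The result is better performance on multiple visual reasoning benchmarks, with training more than 5× faster than previous supervised fine-tuning methods and more than 10× faster than reinforcement learning baselines. It is not a tool you can pick up and use tomorrow, but it points to a new direction for escaping expensive annotation.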
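To make the gating idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the scoring rule or the loss, so the mean log-likelihood gap, the hard 0/1 gate, the `margin` parameter, the KL direction, and every function name below are hypothetical choices made for this example.

```python
import math
from typing import Sequence


def trajectory_evidence_score(logp_pos: Sequence[float],
                              logp_neg: Sequence[float]) -> float:
    """Mean per-token log-likelihood gap of one sampled trajectory.

    logp_pos: log-probs of the trajectory's tokens when the scorer sees
              the question-relevant crop (assumed input).
    logp_neg: log-probs of the same tokens when it sees the negative
              view (assumed input).
    """
    if len(logp_pos) != len(logp_neg) or not logp_pos:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(p - n for p, n in zip(logp_pos, logp_neg)) / len(logp_pos)


def token_kl(student: Sequence[float], teacher: Sequence[float]) -> float:
    """KL(student || teacher) at one token position (assumed loss form)."""
    return sum(s * math.log(s / t) for s, t in zip(student, teacher) if s > 0)


def gated_distillation_loss(student_dists: Sequence[Sequence[float]],
                            teacher_dists: Sequence[Sequence[float]],
                            logp_pos: Sequence[float],
                            logp_neg: Sequence[float],
                            margin: float = 0.0) -> float:
    """Token-level distillation, switched on or off per trajectory.

    The gate is 1 only when the trajectory is better explained by the
    positive crop than by the negative view (hypothetical rule).
    """
    score = trajectory_evidence_score(logp_pos, logp_neg)
    gate = 1.0 if score > margin else 0.0
    per_token = [token_kl(s, t) for s, t in zip(student_dists, teacher_dists)]
    return gate * sum(per_token) / len(per_token)


if __name__ == "__main__":
    # Toy numbers: two token positions, a vocabulary of two tokens.
    student = [[0.5, 0.5], [0.9, 0.1]]
    teacher = [[0.5, 0.5], [0.5, 0.5]]
    print(gated_distillation_loss(student, teacher, [-0.1, -0.2], [-1.0, -1.5]))
    print(gated_distillation_loss(student, teacher, [-1.0, -1.5], [-0.1, -0.2]))
```

In this toy setup, a trajectory that fits the crop better than the negative view passes the gate and receives the dense token-level signal, while one that fits the negative view better contributes zero loss. The point of the sketch is only that a trajectory-level decision and a token-level correction can be combined without any answer label.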

📄 Original Abstract (English)

Fine-grained visual reasoning requires multimodal large language models (MLLMs) to identify task-relevant visual evidence and ground their reasoning in local image regions. Existing agentic methods typically rely on reinforcement learning with verifiable rewards or supervised fine-tuning on large-scale annotated reasoning traces, leading to costly exploration, hand-designed verification rules, or heavy dependence on textual supervision. A natural way to avoid such external answer labels is to learn from trajectories sampled by the student itself, which points to On-Policy Distillation (OPD). To understand what OPD can and cannot provide for visual reasoning, we revisit it as negative-free stop-gradient alignment. This perspective shows that, although OPD provides effective token-level correction, its ceiling is constrained by the absence of trajectory-level discrimination. Motivated by these observations, we propose V-Zero, an answer-label-free framework for visual reasoning with contrastive evidence gating. V-Zero uses no annotated textual answer labels; instead, during training it pairs a question-relevant regional crop with a negative visual view to evaluate student-sampled trajectories and gate dense token-level distillation. Experiments on multiple visual reasoning benchmarks show that V-Zero consistently improves fine-grained visual reasoning while preserving strong generalization. Notably, V-Zero is more than 5× faster than previous supervised fine-tuning methods and more than 10× faster than reinforcement learning baselines. Code and dataset will be released at https://github.com/eVI-group-SCU/V-Zero

Original paper on arXiv
