AI Pulse
📄 Paper Explainer

AI Tool Calls: Helpful or a Hindrance? A New Method Scores Each One

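Today's AI agents can call tools (such as code) to process images, but a tool call is sometimes useful, sometimes redundant, and sometimes misleading. Traditional methods check only whether the final answer is correct, so they cannot separate out what each individual tool call contributed. This paper proposes TACO, which scores every tool call on its own through two mechanisms. First, it inserts "probes" that make the model predict its answer both with and without the tool; the difference between the two outcomes is taken as the call's real value. Second, it routes the final reward only to the tool calls that actually made a difference, so the credit is not diluted by ineffective calls. Experiments show that the model learns to call tools only when they are needed, with consistent accuracy gains. This is not a feature you can use tomorrow, but it illustrates a frontier approach to making AI use tools more efficiently and more reliably.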

📄 Original Abstract (English)

Agentic multimodal models perform diverse operations on an image via code and reason over the returned view, an effective paradigm for fine-grained visual question answering. However, code operations can be useful, redundant, or misleading. Outcome-only rewards cannot precisely distinguish these cases, and existing process rewards either fail to attribute final correctness to individual tool calls, or require an external judge model. To address this, we introduce Tool-Augmented Credit Optimization (TACO), a GRPO variant for code-tool agents built on two coupled advantage channels. The first, Differential Answer-Probe Reward (DAPR), is a self-supervised, judge-free tool-contribution advantage that credits each tool call by its own effect on answering correctly. Probe tokens inserted into the model's reasoning elicit its predictions with and without the tool, and the difference in outcome reward is taken as the call's value: positive for a useful call, negative for a misleading one, and zero for one that changes nothing. This reuses the existing answer checker with no auxiliary judge, and, being a difference rather than an absolute probe score, is naturally robust to probe-hacking. The second is the outcome advantage from the final answer, distributed by Outcome-Gated Advantage Routing (OGAR): a parameter-free rule that, conditioned on the call's outcome, delivers this credit only to the responsible segments, suppressing wasted tool calls without any cost term. We train TACO through a two-stage SFT+RL pipeline. Extensive experiments across perception, reasoning, and general multimodal benchmarks show that it yields consistent accuracy gains and learns to invoke its tools only when they help.
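
The abstract describes DAPR as a difference of outcome rewards between the model's probed prediction with the tool and without it. The snippet below is a minimal sketch of that arithmetic only, not the authors' implementation: the function names, the exact-match answer checker, and the string-valued predictions are all assumptions made for illustration. How probe tokens are inserted and how the resulting value enters the GRPO advantage are not shown.

```python
def outcome_reward(prediction: str, gold: str) -> float:
    """Hypothetical answer checker: 1.0 on a case-insensitive exact match, else 0.0.

    The paper reuses its existing answer checker; this stand-in is an assumption.
    """
    return 1.0 if prediction.strip().lower() == gold.strip().lower() else 0.0


def differential_probe_reward(pred_with_tool: str,
                              pred_without_tool: str,
                              gold: str) -> float:
    """Value of one tool call: reward with the tool minus reward without it."""
    return outcome_reward(pred_with_tool, gold) - outcome_reward(pred_without_tool, gold)


def describe_call(value: float) -> str:
    """Map the sign of the difference to the three cases named in the abstract."""
    if value > 0:
        return "useful"
    if value < 0:
        return "misleading"
    return "no change"


if __name__ == "__main__":
    # Hypothetical probe outputs for three tool calls on a question whose answer is "red".
    calls = [
        ("red", "blue", "red"),   # tool fixed a wrong answer
        ("blue", "red", "red"),   # tool broke a right answer
        ("red", "red", "red"),    # tool changed nothing
    ]
    for with_tool, without_tool, gold in calls:
        value = differential_probe_reward(with_tool, without_tool, gold)
        print(with_tool, without_tool, value, describe_call(value))
```

Because the score is a difference, a call made when the model already answers correctly earns zero rather than a positive score, which matches the abstract's point that the measure rewards the tool's effect and not the probe's absolute accuracy.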

Original paper on arXiv
