📄 Paper Digest

AI Learns to Find Key Clues in Long Contexts

Tags: context-aware reinforcement learning, long-horizon reasoning, multimodal, fine-grained grounding

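Large models often miss a decisive detail in a long text or a complex image, such as a single line in a tool log or a tiny feature in a picture. The researchers propose ContextRL: rather than supervising only the final answer, it shows the model a question-answer pair alongside two highly similar contexts and rewards it for picking the one that supports the pair, which trains it to locate key information at a fine-grained level. For coding agents, tool-call trajectories serve as the contexts; for multimodal tasks, images do. Compared with standard GRPO, it gains 2.2% on average across 5 long-horizon reasoning benchmarks and 1.8% across 12 visual question answering benchmarks. Ablations against data-augmentation baselines show that the gains come from the context-selection objective rather than from the extra data itself. It is not something you can use tomorrow, but it points toward AI that handles complex information more reliably.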
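To make the selection objective concrete, here is a minimal sketch of how a binary reward for context selection could be computed. The `ContrastivePair` class, the A/B prompt wording, the random slot assignment, and the parsing rule are all illustrative assumptions; the paper summary does not specify these details.

```python
import random
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ContrastivePair:
    """Hypothetical container for one training example (not from the paper)."""
    query: str
    answer: str
    supporting: str   # context that supports the query-answer pair
    distractor: str   # highly similar context that does not


def build_prompt(pair: ContrastivePair, rng: random.Random) -> tuple[str, str]:
    """Return (prompt, correct_label).

    The slot holding the supporting context is randomized; this is an
    assumption made here so that position alone cannot predict the label.
    """
    supporting_first = rng.random() < 0.5
    if supporting_first:
        first, second = pair.supporting, pair.distractor
    else:
        first, second = pair.distractor, pair.supporting
    prompt = (
        f"Question: {pair.query}\n"
        f"Answer: {pair.answer}\n\n"
        f"Context A:\n{first}\n\n"
        f"Context B:\n{second}\n\n"
        "Which context supports the answer? Reply with A or B."
    )
    return prompt, "A" if supporting_first else "B"


def selection_reward(completion: str, correct_label: str) -> float:
    """Binary reward: 1.0 if the last standalone A/B in the completion is correct."""
    labels = re.findall(r"\b([AB])\b", completion)
    if not labels:
        return 0.0  # unparseable completions earn nothing
    return 1.0 if labels[-1] == correct_label else 0.0
```

In this sketch the reward depends only on which context was chosen, so the model is never scored on regenerating the answer itself.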

📄 Original Abstract (English)

Large language models (LLMs) often fail when answering requires identifying a small but decisive piece of evidence within a long or complex context, such as a single line in a tool trace or a subtle detail in an image. We propose ContextRL, a context-aware reinforcement learning (RL) method that improves long-horizon reasoning and multimodal performance through an indirect auxiliary objective. Instead of supervising only the final answer, ContextRL presents the model with a query, an answer, and two highly similar contexts, and rewards it for selecting the context that supports the query-answer pair, thereby encouraging fine-grained grounding. We construct contrastive context data in two domains: for coding agents, trajectories serve as contexts, yielding 1K pairs built via condition filtering; for multimodal reasoning, images serve as contexts, yielding 7K pairs built via generative editing and similarity search. ContextRL achieves average gains of +2.2% over standard GRPO on 5 long-horizon benchmarks, and +1.8% across 12 diverse visual question answering benchmarks. To disentangle the effect of the proposed objective from that of additional data, we compare against data-augmentation baselines that repurpose the same contrastive contexts as standard query-context-answer examples. These baselines provide little to no improvement, showing that the gains arise from the proposed context-selection objective rather than from the contrastive data alone.
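The abstract uses standard GRPO as its baseline. As background, GRPO scores each sampled completion relative to the other completions drawn for the same prompt. The sketch below shows that group normalization applied to binary rewards such as a correct/incorrect context choice. The function name, the use of the population standard deviation, and the `eps` value are assumptions made for illustration; the abstract does not say how ContextRL combines the selection reward with the task reward.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of completions for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps). The eps term
    (an assumed value) avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt: two picked the supporting context, two did not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that chose correctly receive positive advantages and the others negative ones; when every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.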

Original paper on arXiv