📄 Paper Digest

AI scoring moves beyond rigid rules and draws on evidence as flexibly as a human judge

Channel: Trends · Tags: reward models, AI training, agents, dynamic evaluation, reinforcement learning

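In today's AI training, reward models (the judges that score a model's outputs) often rely on rigid rules: checking whether the answer is correct, whether the format is right, or whether all the steps are present. Real tasks, however, call for weighing several kinds of evidence together. A math solution, for example, should have the correct answer, reasonable steps, and no redundancy. This paper turns scoring itself into an agentic task. Instead of applying a fixed rule, the proposed model (Skill-RM) first decides which evidence the current task needs (such as rule-based verifiers, reference answers, procedural checklists, or rubrics), then dynamically calls on that evidence to produce a combined score. On reward benchmarks and in downstream uses such as best-of-N selection and reinforcement learning, the paper reports that this flexible scorer outperforms traditional judge baselines. It is not a tool you can drop into your workflow tomorrow, but it points to a trend: feedback signals in AI training are moving from "fixed rules" to "live judgment".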
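The select-then-aggregate idea described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: every name here (`Sample`, `rule_check`, `reference_check`, `checklist_check`, `select_evidence`, `evaluate`) is hypothetical, the selection step is a hand-written heuristic standing in for what the paper describes as an agentic decision, and the equal-weight average is an assumption made only to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Sample:
    """One response to score, plus whatever evidence happens to be available."""
    question: str
    response: str
    reference: Optional[str] = None        # ground-truth answer, if any
    checklist: Optional[List[str]] = None  # required steps, if any


# Each evidence source maps a sample to a score in [0, 1].
def rule_check(s: Sample) -> float:
    # Hypothetical format rule: the response must contain an "Answer:" marker.
    return 1.0 if "Answer:" in s.response else 0.0


def reference_check(s: Sample) -> float:
    # Compare the text after the last "Answer:" marker with the reference.
    final = s.response.rsplit("Answer:", 1)[-1].strip()
    return 1.0 if final == s.reference else 0.0


def checklist_check(s: Sample) -> float:
    # Fraction of checklist items mentioned in the response (case-insensitive).
    text = s.response.lower()
    hits = sum(1 for step in s.checklist if step.lower() in text)
    return hits / len(s.checklist)


def select_evidence(s: Sample) -> Dict[str, Callable[[Sample], float]]:
    """Pick the evidence sources that apply to this particular sample.

    Assumption: a simple availability heuristic replaces the learned,
    agent-driven selection described in the paper.
    """
    selected: Dict[str, Callable[[Sample], float]] = {"rule": rule_check}
    if s.reference is not None:
        selected["reference"] = reference_check
    if s.checklist:
        selected["checklist"] = checklist_check
    return selected


def evaluate(s: Sample) -> Tuple[float, Dict[str, float]]:
    """Return the overall score and the per-source breakdown."""
    breakdown = {name: fn(s) for name, fn in select_evidence(s).items()}
    overall = sum(breakdown.values()) / len(breakdown)  # equal weights (assumed)
    return overall, breakdown


print(evaluate(Sample("What is 2 + 3?", "Answer: 6", reference="5")))
# → (0.5, {'rule': 1.0, 'reference': 0.0})
```

Returning the per-source breakdown alongside the overall score keeps the judgment inspectable, which is the kind of transparency the paper attributes to a unified evaluation interface.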

📄 Original Abstract (English)

Reward models (RMs) provide critical feedback signals for LLM post-training, notably in reinforced fine-tuning (RFT) and reinforcement learning (RL) pipelines. However, current reward evaluation relies on heterogeneous criteria such as rule-based verifiers, ground-truth references, procedural checklists, and complex rubrics, where a unified mechanism to integrate all types of evidence remains unexplored. To this end, we propose Skill Reward Model (Skill-RM), a unified framework that reformulates reward modeling as the execution of a reusable Reward-Evaluation Skill. By treating reward computation as a structured agentic task, Skill-RM provides a consistent interface to orchestrate heterogeneous resources, dynamically selecting and aggregating evidence tailored to the specific requirements of each input. This approach enables the reward model to move beyond static evaluation, ensuring consistency and transparency across diverse tasks. Extensive experiments on reward benchmarks and downstream applications, including best-of-N selection and reinforcement learning, demonstrate that Skill-RM consistently outperforms traditional judge baselines. Our findings suggest that Skill-RM not only provides a unified solution for reward modeling but also achieves superior performance through the strategic and dynamic orchestration of evidence. The code is at https://github.com/Qwen-Applications/Skill-RM.
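For readers unfamiliar with best-of-N selection, one of the downstream applications the abstract mentions, here is a minimal sketch of the general technique. It does not use Skill-RM itself: `best_of_n` and `toy_reward` are hypothetical names, and the toy reward is an invented stand-in for whatever reward model supplies the scores.

```python
from typing import Callable, List


def best_of_n(candidates: List[str], reward_fn: Callable[[str], float]) -> str:
    """Return the candidate with the highest reward (first one wins on ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=reward_fn)


def toy_reward(response: str) -> float:
    # Invented stand-in for a reward model: reward responses that end in the
    # expected answer "5", with a small penalty for length.
    correct = 1.0 if response.strip().endswith("5") else 0.0
    return correct - 0.001 * len(response)


candidates = [
    "The answer is 6",
    "2 + 3 = 5",
    "After a long detour, the result is 5",
]
print(best_of_n(candidates, toy_reward))
# → 2 + 3 = 5
```

The selection step is only as good as the scores it is given, which is why improvements in the reward model show up directly in best-of-N results.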

Original paper on arXiv
