📄 Paper Digest

A New Approach to AI Safety: Swapping Rules In and Out Like Plugins


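Today's AI safety moderation works like a rigid blacklist: it recognizes only a preset list of violation types, and when the rules change, the model typically has to be retrained. SingGuard turns the rules into a runtime input. You tell it in natural language "do not mention a certain brand" or "do not discuss a certain topic," and it checks the conversation against the new rules one by one and reports which rule was triggered. It also supports three inference modes along a fast-to-slow spectrum: fast direct judgment, hybrid reasoning, and slow policy-grounded deliberation. The authors tune this behavior with fast-slow decoupled reinforcement learning to balance efficiency and interpretability. They also introduce SingGuard-Bench, a benchmark of 56,340 examples. Across six benchmark families (35 datasets), SingGuard reports the best average F1 in every family, and in the dynamic-rule evaluation, policy-following accuracy under runtime policy shifts rises from 64.65% to 74.15%. This is not a tool you can use tomorrow, but it points to a trend: future AI safety will be dynamically updatable like legal statutes rather than baked into model parameters.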
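To make the "rules as runtime input" idea concrete, here is a minimal Python sketch of the interface shape only. It is not SingGuard's code or API. The names `Rule`, `Verdict`, and `check` are hypothetical, and a simple keyword match (the `banned_terms` field) stands in for the model's judgment, which in the paper is made by a vision-language model reading the natural-language rule.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Rule:
    rule_id: str
    text: str                      # natural-language rule, as an operator would write it
    banned_terms: tuple[str, ...]  # hypothetical stand-in for the model's judgment


@dataclass(frozen=True)
class Verdict:
    label: str                     # "safe" or "unsafe"
    triggered_rule: Optional[str]  # id of the first violated rule, if any


def check(content: str, policy: Sequence[Rule]) -> Verdict:
    """Check content against the active policy, one rule at a time."""
    lowered = content.lower()
    for rule in policy:
        if any(term.lower() in lowered for term in rule.banned_terms):
            return Verdict("unsafe", rule.rule_id)
    return Verdict("safe", None)


# The policy is plain data passed in at call time, so changing it needs no retraining.
policy_v1 = [Rule("R1", "Do not mention the brand Acme.", ("acme",))]
policy_v2 = policy_v1 + [Rule("R2", "Do not discuss layoffs.", ("layoff",))]

message = "Are layoffs coming next quarter?"
print(check(message, policy_v1))  # no rule in v1 covers this message
print(check(message, policy_v2))  # the newly added rule R2 is reported
```

The point of the sketch is the signature: the same message gets a different verdict, along with the id of the triggered rule, as soon as the policy passed in changes.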

📄 Original Abstract (English)

Vision-language models (VLMs) are increasingly deployed in consumer, medical, financial, and enterprise applications. This broad deployment expands the safety surface: risks can arise from multimodal question answering, assistant responses, and cross-modal composition, while moderation policies may vary across products, regions, and deployment stages. Most existing guardrails either rely on fixed taxonomies or target only a narrow set of interaction settings, which limits their adaptability when safety rules change at deployment time. We present SingGuard, a policy-adaptive multimodal guardrail model family for safety assessment in multimodal conversations. SingGuard treats the active policy as a runtime input: given natural-language rules, it checks the target content against the active policy rule by rule and predicts both the safety label and the triggered rule. To balance efficiency and interpretability, SingGuard supports fast, hybrid, and slow inference regimes along a fast-to-slow reasoning spectrum, ranging from direct safety judgments to policy-grounded deliberation. We further optimize this behavior with fast-slow decoupled reinforcement learning. We also introduce SingGuard-Bench, a multimodal guardrail benchmark with 56,340 examples spanning 80+ fine-grained risk types across multimodal QA, adversarial attack, and dynamic-rule evaluation settings, including cross-modal joint-risk cases where each modality is harmless in isolation but their composition implies unsafe intent. Across six benchmark families (35 datasets), SingGuard achieves state-of-the-art average F1 in every family. Dynamic-rule evaluation further shows improved policy-following accuracy from 0.6465 to 0.7415 under runtime policy shifts. Our code is available at https://github.com/inclusionAI/Sing-Guard.

Original paper on arXiv
