📄 Paper Digest

The grammar constraints that make AI-written code more reliable turn out to be a jailbreak vulnerability

Tags: grammar-constrained decoding, jailbreak attacks, code generation, safety alignment, large language models

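To keep AI-generated code syntactically valid, developers often use a technique called grammar-constrained decoding (GCD). Researchers have found that this reliability-oriented technique can itself be used to bypass safety restrictions: simply applying a benign code grammar constraint is enough to induce the model to generate malicious code. Across 10 popular LLMs and 4 benchmarks, the attack outperforms representative jailbreak baselines and raises the attack success rate by more than 30 percentage points on average. The attack is called CodeSpear. The proposed defense, CodeShield, teaches the model to generate "honeypot code" under grammar constraints: code that looks plausible but is semantically harmless, while natural-language refusals are preserved whenever natural language is available. This is not a trick you can put to use tomorrow, but it reveals a counterintuitive risk: the tool you use to make AI more obedient may be exactly the channel through which it gets exploited.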
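To make the mechanism concrete, here is a minimal toy sketch of grammar-constrained decoding. It is not the paper's implementation: the vocabulary, the fixed scores in `toy_scores`, and the arithmetic grammar in `allowed_tokens` are all hypothetical stand-ins for a real LLM and a real code grammar. It only shows the general idea that tokens outside the grammar are masked out before the next token is chosen, so a natural-language token can never be emitted even when the model scores it highest.

```python
import math

# Toy vocabulary: a few arithmetic tokens plus one natural-language token.
VOCAB = ["1", "2", "+", "Sorry", "<eos>"]


def toy_scores(prefix):
    """Hypothetical stand-in for an LLM's next-token scores (fixed numbers)."""
    # This fake model always rates the natural-language token "Sorry" highest.
    scores = {"Sorry": 3.0, "1": 2.0, "2": 1.5, "+": 1.0, "<eos>": 0.5}
    # After a few tokens, ending the sequence becomes more attractive.
    if len(prefix) >= 3:
        scores["<eos>"] = 2.5
    return scores


def allowed_tokens(prefix):
    """Toy grammar: expr := NUM ('+' NUM)*, then <eos>."""
    if not prefix or prefix[-1] == "+":
        return {"1", "2"}       # a number must come next
    return {"+", "<eos>"}       # after a number: continue or stop


def decode(constrained, max_steps=8):
    """Greedy decoding, optionally masking tokens the grammar forbids."""
    out = []
    for _ in range(max_steps):
        scores = toy_scores(out)
        if constrained:
            ok = allowed_tokens(out)
            # Masking step: forbidden tokens get a score of -infinity.
            scores = {t: (s if t in ok else -math.inf) for t, s in scores.items()}
        token = max(scores, key=scores.get)
        if token == "<eos>":
            break
        out.append(token)
    return out


if __name__ == "__main__":
    print("unconstrained:", decode(constrained=False))
    print("constrained:  ", decode(constrained=True))
```

In this toy setup the unconstrained decoder keeps choosing the natural-language token, while the constrained decoder can only produce strings the grammar accepts. That is the property the digest describes: when the grammar admits only code, the decoder has no way to emit a natural-language response.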

📄 Original Abstract (English)

Large Language Models (LLMs) are increasingly used for code generation, raising concerns that they may be misused to produce malicious code. Meanwhile, Grammar-Constrained Decoding (GCD) has been widely adopted to improve the reliability of LLM-generated code by enforcing syntactic validity. In this paper, we reveal a counterintuitive risk: this reliability-oriented technique can itself become an attack surface. We uncover a new jailbreak attack, termed CodeSpear, that exploits GCD to induce LLMs into generating malicious code. Our experiments show that simply applying a benign code grammar constraint can effectively jailbreak LLMs. To address this vulnerability, we propose CodeShield, a safety alignment approach that robustly preserves safe behavior even under attacker-controlled grammar constraints. CodeShield aligns the model in the code modality by teaching it to generate honeypot code under GCD. Such code is semantically harmless, so it does not implement the malicious request, and structurally diverse, so it is difficult to suppress through grammar tightening. At the same time, CodeShield still preserves natural-language refusals when natural language is available. Experiments on 10 popular LLMs across 4 benchmarks show that CodeSpear outperforms representative jailbreak baselines and increases the attack success rate by more than 30 percentage points on average. CodeShield also restores safety under CodeSpear while preserving benign utility. Our findings reveal a fundamental risk of GCD and call for greater attention to its potential security implications.

Original paper on arXiv
