AI Pulse
📄 Paper Breakdown

AI Safety Interventions: They Look Successful, but the Behavior Recovers at a Nudge

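Suppose you fit an AI with a "safety switch": press it and the model stops saying harmful things. New research finds that this switch may only hide the harmful behavior rather than turn it off. The researchers use sparse autoencoders (SAEs) to decompose the model's residual-stream activations into interpretable features, then clamp the "unsafe" features to block the unwanted behavior. On the surface it works: the model no longer produces the dangerous output. But by optimizing small perturbations to the residual stream, they can bring the original behavior back while keeping the clamped features at their post-intervention values. In the safety-critical refusal-steering setting, recovery succeeds on 95.8% of valid samples, with a relative drift of only 0.131 in the defended features. The model has not "learned" to evade monitoring; rather, the clamp closes one visible route to the behavior, and the behavior can still be reached through what the SAE fails to explain, namely its reconstruction residual. Current feature-level safety interventions may therefore be a case of plugging your ears while stealing the bell: the monitor looks clean while the behavior remains available. This is not a technique you can put to use tomorrow, but it is a reminder that feature-level control alone may be far from sufficient to guarantee AI safety.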

📄 Original Abstract

Sparse Autoencoders (SAEs) decompose residual-stream activations into interpretable features. Recent latent-space defenses increasingly rely on these decompositions, assuming that identified "unsafe" SAE features serve as actionable handles for monitoring and intervention. In this paradigm, clamping a specific harmful feature is expected to reliably prevent model misbehavior. However, we show that this success may hide a recoverable failure mode: the clamp may block one visible route to a behavior without eliminating the behavior itself. We formulate this vulnerability as post-intervention recovery, a constrained residual-space optimization problem. Starting from the post-intervention residual state, we optimize residual perturbations to recover the pre-intervention behavior while preserving the post-intervention values of the targeted SAE features. Even under a strong threat model where the intervention remains active throughout optimization and generation, recovery remains possible. To rule out that recovery simply undoes the intervention, we use encoder-orthogonal updates for single-layer interventions and the corresponding feature-map Jacobian in the cross-layer setting. Across TPP, unlearning, IOI, and refusal steering experiments, this stress test reveals recoverable behavior despite successful feature-level intervention. Especially in the safety-critical refusal-steering setting, we achieve a 95.8% recovery rate on valid samples while keeping defended-feature relative drift to 0.131, substantially below suffix-based baselines. A recovery-path attribution analysis further localizes this recovery to the SAE reconstruction residual, the component left unexplained by the SAE. These results expose a gap between feature-level control and behavioral completeness: SAE features can support causal intervention, but controlling them does not guarantee control over the underlying behavior.
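The "encoder-orthogonal updates" mentioned in the abstract can be illustrated with a toy example. The following is a minimal sketch, not the paper's implementation: it assumes a linear SAE encoder with ReLU activation, and every name in it (`W_def`, `b_def`, `target`, `orthogonal_projector`) is hypothetical. A simple quadratic loss stands in for the real behavioral objective, which in the paper is computed through the model. The point is only that a gradient step projected onto the orthogonal complement of the defended features' encoder rows can reduce a loss without moving those features.

```python
import numpy as np

def orthogonal_projector(W_def: np.ndarray) -> np.ndarray:
    """Projector onto the null space of W_def, so that W_def @ (P @ v) == 0."""
    # W_def has shape (k, d): one encoder row per defended feature (assumed layout).
    # pinv(W_def) @ W_def projects onto the row space; subtracting from I removes it.
    return np.eye(W_def.shape[1]) - np.linalg.pinv(W_def) @ W_def

rng = np.random.default_rng(0)
d, k = 16, 3
W_def = rng.normal(size=(k, d))   # toy encoder rows of the defended features
b_def = rng.normal(size=k)        # toy encoder biases
x = rng.normal(size=d)            # toy post-intervention residual state
target = rng.normal(size=d)       # toy stand-in for the pre-intervention behavior

def defended_features(state: np.ndarray) -> np.ndarray:
    # ReLU SAE encoder restricted to the defended features (an assumption).
    return np.maximum(W_def @ state + b_def, 0.0)

P = orthogonal_projector(W_def)
before = defended_features(x)
loss_before = 0.5 * np.sum((x - target) ** 2)

x_new = x.copy()
for _ in range(50):
    grad = x_new - target             # gradient of the toy quadratic loss
    x_new = x_new - 0.1 * (P @ grad)  # step only in encoder-orthogonal directions

after = defended_features(x_new)
loss_after = 0.5 * np.sum((x_new - target) ** 2)

print("max change in defended features:", np.max(np.abs(after - before)))
print("loss before / after:", loss_before, loss_after)
```

Because the projected step leaves the pre-activations `W_def @ x + b_def` unchanged, the defended features stay fixed for any activation function that depends only on those pre-activations. The cross-layer setting described in the abstract uses the feature-map Jacobian instead of the encoder rows, which this sketch does not cover.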

Original paper on arXiv
