📄 Paper digest

AI Safety Interventions: They Look Successful, but the Behavior Comes Back at a Touch

Sparse autoencoders · AI safety · Feature intervention · Behavior recovery · Reliability

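Suppose you fit an AI model with a "safety switch": flip it and the model stops producing harmful output. New research finds that the switch may only hide the behavior rather than turn it off. The researchers use sparse autoencoders (SAEs) to decompose the model's internal activations into interpretable features, then clamp the "unsafe" features to block the unwanted behavior. On the surface, the model does stop producing dangerous content. But with an optimization procedure that perturbs the model's residual stream, they can restore the original behavior while keeping the clamped features close to their post-intervention values. In the safety-critical refusal-steering setting, recovery succeeds on 95.8% of valid samples. This suggests that current feature-based safety interventions may block one visible route to a behavior without removing the behavior itself. The model has not learned to evade monitoring. Rather, the behavior stays reachable through the part of the activation that the SAE does not explain. This is not a technique you can put to use tomorrow, but it is a reminder that feature-level control alone may be far from sufficient to guarantee AI safety.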

📄 Original abstract

Sparse Autoencoders (SAEs) decompose residual-stream activations into interpretable features. Recent latent-space defenses increasingly rely on these decompositions, assuming that identified "unsafe" SAE features serve as actionable handles for monitoring and intervention. In this paradigm, clamping a specific harmful feature is expected to reliably prevent model misbehavior. However, we show that this success may hide a recoverable failure mode: the clamp may block one visible route to a behavior without eliminating the behavior itself. We formulate this vulnerability as post-intervention recovery, a constrained residual-space optimization problem. Starting from the post-intervention residual state, we optimize residual perturbations to recover the pre-intervention behavior while preserving the post-intervention values of the targeted SAE features. Even under a strong threat model where the intervention remains active throughout optimization and generation, recovery remains possible. To rule out that recovery simply undoes the intervention, we use encoder-orthogonal updates for single-layer interventions and the corresponding feature-map Jacobian in the cross-layer setting. Across TPP, unlearning, IOI, and refusal steering experiments, this stress test reveals recoverable behavior despite successful feature-level intervention. Especially in the safety-critical refusal-steering setting, we achieve a 95.8% recovery rate on valid samples while keeping defended-feature relative drift to 0.131, substantially below suffix-based baselines. A recovery-path attribution analysis further localizes this recovery to the SAE reconstruction residual, the component left unexplained by the SAE. These results expose a gap between feature-level control and behavioral completeness: SAE features can support causal intervention, but controlling them does not guarantee control over the underlying behavior.
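The encoder-orthogonal update mentioned in the abstract can be illustrated with a toy example. The sketch below is not the paper's implementation. It assumes a hypothetical three-dimensional residual state, a two-feature SAE whose decoder directions equal its encoder directions, and a linear "behavior score" `v @ x` standing in for the model's behavior. All names (`sae_features`, `orthogonal_projector`, `recover`) and all numbers are made up for illustration. The idea it shows: if every update is projected onto the subspace orthogonal to the defended feature's encoder direction, the feature's pre-activation cannot move, yet the behavior score can still be pushed back toward its pre-intervention value.

```python
import numpy as np

def sae_features(x, W_enc, b_enc):
    """Toy SAE encoder: ReLU(W_enc @ x + b_enc)."""
    return np.maximum(W_enc @ x + b_enc, 0.0)

def orthogonal_projector(W_def):
    """Projector onto the subspace orthogonal to the rows of W_def."""
    # pinv(W_def) @ W_def projects onto the row space of W_def,
    # so subtracting it from the identity removes those directions.
    return np.eye(W_def.shape[1]) - np.linalg.pinv(W_def) @ W_def

def recover(x_post, v, target, W_def, lr=1.0, steps=100):
    """Projected gradient descent on 0.5 * (v @ x - target) ** 2."""
    P = orthogonal_projector(W_def)
    x = x_post.copy()
    for _ in range(steps):
        grad = (v @ x - target) * v   # gradient of the squared error w.r.t. x
        x -= lr * (P @ grad)          # move only in encoder-orthogonal directions
    return x

# Hypothetical toy setup (all values are illustrative assumptions).
W_enc = np.array([[1.0, 0.0, 0.0],    # feature 0: the defended "unsafe" feature
                  [0.0, 0.0, 1.0]])   # feature 1: an unrelated feature
b_enc = np.zeros(2)
v = np.array([0.6, 0.8, 0.0])         # linear stand-in for the behavior readout
x_pre = np.array([2.0, 1.0, 0.5])     # residual state before the intervention

# Intervention: clamp feature 0 to zero by subtracting its activation
# times its decoder direction (assumed equal to the encoder direction here).
f_pre = sae_features(x_pre, W_enc, b_enc)
x_post = x_pre - f_pre[0] * W_enc[0]
f_post = sae_features(x_post, W_enc, b_enc)

# Recovery: restore the pre-intervention score without moving feature 0.
score_pre = v @ x_pre
score_post = v @ x_post
x_rec = recover(x_post, v, target=score_pre, W_def=W_enc[[0]])
f_rec = sae_features(x_rec, W_enc, b_enc)
score_rec = v @ x_rec

print("behavior score (pre, post, recovered):", score_pre, score_post, score_rec)
print("defended feature (pre, post, recovered):", f_pre[0], f_post[0], f_rec[0])
```

In this toy the clamp lowers the behavior score from about 2.0 to about 0.8, and the projected updates bring it back to about 2.0 while the defended feature stays at zero. The constraint holds exactly here only because the feature's pre-activation is linear in the residual state. The abstract notes that the cross-layer setting uses the feature-map Jacobian instead, which this sketch does not attempt to reproduce.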

Original paper on arXiv
