📄 Paper Digest

A Hybrid Engine for Attention: Full Attention Only for the Critical Heads

Tags: attention mechanisms, long context, hybrid architectures, interpretability, large language models

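When large models process long text, the compute cost of full attention (FA) grows quadratically with sequence length, while linear attention (LA) is fast but less accurate. Earlier hybrid designs swapped out whole layers, but the authors find that heads within the same layer specialize: some retrieve key information, others handle routine content. HydraHead uses interpretability analysis to keep full attention only for the retrieval-critical heads and runs linear attention on the rest, then applies a scale-normalized fusion to reconcile the gap between the output distributions of the two head types. At a 512K context length, with only 15B training tokens, it improves on the baseline by more than 69% and approaches Qwen3.5, a leading model of comparable size. This is not something you can use tomorrow, but it points hybrid attention in a new direction: a head-level division of labor rather than a one-size-fits-all swap of whole layers.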

📄 Original Abstract (English)

The quadratic complexity of attention poses a critical bottleneck for long-context processing, spurring interest in hybrid attention designs. Most open-source hybrid models adopt a layer-wise strategy. Yet, prior work has noted the inherent difficulty of integrating Linear Attention (LA) with Full Attention (FA), suggesting that the design space of attention hybridization remains underexplored. To probe this space, we conduct interpretability analysis and observe that layers exhibit block-wise functional similarity, while individual heads within the same layer display distinct functional specialization despite sharing input features. This head-level heterogeneity suggests that the head dimension provides a natural and principled granularity for fusing heterogeneous attention signals. Building on this insight, we introduce HydraHead, a novel architecture that hybridizes FA and LA along the head axis. HydraHead features two key innovations: (1) an interpretability-driven selection strategy that identifies retrieval-critical heads and preserves FA only for them, and (2) a scale-normalized fusion module that reconciles the distributional gap between FA and LA head outputs. By leveraging a three-stage transfer pipeline with parameter reuse and distillation, we achieve high-performance hybrid models with minimal training overhead. Under a unified training setup, HydraHead outperforms other hybrid designs in long-context tasks while maintaining strong general reasoning. With interpretability-driven head selection, it matches a 3:1 layer-wise hybrid's long-context performance at a 7:1 LA-to-FA ratio. Crucially, trained on only 15B tokens, HydraHead achieves over 69% improvement over the baseline at 512K context length, approaching Qwen3.5, a leading model of comparable size with a native context length of 256K. This highlights the significant scaling potential of head-level hybridization.
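The ratio claim in the abstract is easier to read as a fraction of heads. The short sketch below works through that arithmetic. It assumes the ratios count attention units of equal size (heads, or whole layers of equally many heads), that each full-attention head caches one key and one value vector per token, and that linear-attention heads keep only a fixed-size state that is ignored here. The head count, head dimension, and function names are hypothetical, and grouped-query sharing of keys and values is not modeled.

```python
from fractions import Fraction


def fa_fraction(la_parts, fa_parts):
    """Share of attention units that keep full attention at an LA:FA ratio."""
    return Fraction(fa_parts, la_parts + fa_parts)


def kv_cache_elements(context_len, total_heads, head_dim, fa_frac):
    """Cached key and value elements, counting only the full-attention heads."""
    fa_heads = total_heads * fa_frac  # exact, thanks to Fraction
    return int(2 * context_len * fa_heads * head_dim)


layerwise = fa_fraction(3, 1)  # 3:1 hybrid: one unit in four uses full attention
headwise = fa_fraction(7, 1)   # 7:1 hybrid: one unit in eight uses full attention

# Hypothetical model shape: 32 layers of 32 heads, head dimension 128.
context_len = 512 * 1024
cache_3_to_1 = kv_cache_elements(context_len, 32 * 32, 128, layerwise)
cache_7_to_1 = kv_cache_elements(context_len, 32 * 32, 128, headwise)
print(layerwise, headwise, cache_7_to_1 / cache_3_to_1)
```

Under these assumptions, a 7:1 hybrid keeps half as many full-attention heads as a 3:1 hybrid, so the part of the key-value cache that grows with context length is halved as well.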

Original paper on arXiv
