📄 Paper Digest

A hybrid engine for attention: full attention only for the heads that matter

Trends · attention mechanisms · long context · hybrid architectures · model efficiency

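When a large model processes long text, the cost of full attention (FA) grows quadratically with sequence length. Linear attention (LA) is far cheaper but tends to perform worse, especially on tasks that require retrieving specific information from a long context. Most existing hybrid designs mix the two layer by layer, replacing whole layers at a time. The authors' interpretability analysis finds, however, that heads within the same layer specialize in different functions: some are responsible for retrieving key information, while others do other work. HydraHead exploits this by keeping full attention only for the retrieval-critical heads and using linear attention for the rest, then fusing the two kinds of head output with a scale-normalized fusion module. Trained on only 15B tokens, it improves on the baseline by over 69% at a 512K context length, approaching Qwen3.5, a leading model of comparable size with a native context length of 256K. This is not a tool you can use tomorrow, but it points to a trend: future models may stop treating every component alike and, like people, focus their effort where it really matters.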
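To make the head-level idea concrete, here is a minimal NumPy sketch, not the paper's implementation. It assumes the set of retrieval-critical heads (`fa_heads`) is already known, whereas the paper selects them through interpretability analysis. It also uses a simple per-head RMS normalization as a stand-in for the scale-normalized fusion module, whose exact form the abstract does not give. All function names and the `elu(x) + 1` feature map are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(q, k, v):
    """Causal softmax attention for one head; q, k, v have shape (T, d). Cost is O(T^2)."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v

def linear_attention(q, k, v, eps=1e-6):
    """Causal linear attention for one head, computed with running sums. Cost is O(T).
    The feature map elu(x) + 1 is an assumption, not taken from the paper."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))
    q, k = phi(q), phi(k)
    T, d = q.shape
    S = np.zeros((d, v.shape[1]))  # running sum of outer(k_t, v_t)
    z = np.zeros(d)                # running sum of k_t
    out = np.empty_like(v)
    for t in range(T):
        S += np.outer(k[t], v[t])
        z += k[t]
        out[t] = (q[t] @ S) / (q[t] @ z + eps)
    return out

def rms_normalize(x, eps=1e-6):
    """Rescale each token vector to roughly unit RMS (stand-in for the fusion module)."""
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)

def hybrid_heads(q, k, v, fa_heads):
    """q, k, v: (H, T, d). Heads in fa_heads use full attention, the rest linear attention.
    Each head output is normalized before concatenation so both kinds share a scale."""
    outs = []
    for h in range(q.shape[0]):
        attn = full_attention if h in fa_heads else linear_attention
        outs.append(rms_normalize(attn(q[h], k[h], v[h])))
    return np.concatenate(outs, axis=-1)  # (T, H * d)

# Toy usage: 8 heads, one of them full attention, i.e. a 7:1 LA-to-FA ratio.
rng = np.random.default_rng(0)
H, T, d = 8, 16, 4
q, k, v = (rng.standard_normal((H, T, d)) for _ in range(3))
out = hybrid_heads(q, k, v, fa_heads={0})
print(out.shape)
```

In this toy setup only one head in eight pays the quadratic cost, which is where the potential savings at long context lengths would come from. The normalization step reflects the abstract's point that FA and LA head outputs follow different distributions and have to be reconciled before they are mixed.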

📄 Original Abstract (English)

The quadratic complexity of attention poses a critical bottleneck for long-context processing, spurring interest in hybrid attention designs. Most open-source hybrid models adopt a layer-wise strategy. Yet, prior work has noted the inherent difficulty of integrating Linear Attention (LA) with Full Attention (FA), suggesting that the design space of attention hybridization remains underexplored. To probe this space, we conduct interpretability analysis and observe that layers exhibit block-wise functional similarity, while individual heads within the same layer display distinct functional specialization despite sharing input features. This head-level heterogeneity suggests that the head dimension provides a natural and principled granularity for fusing heterogeneous attention signals. Building on this insight, we introduce HydraHead, a novel architecture that hybridizes FA and LA along the head axis. HydraHead features two key innovations: (1) an interpretability-driven selection strategy that identifies retrieval-critical heads and preserves FA only for them, and (2) a scale-normalized fusion module that reconciles the distributional gap between FA and LA head outputs. By leveraging a three-stage transfer pipeline with parameter reuse and distillation, we achieve high-performance hybrid models with minimal training overhead. Under a unified training setup, HydraHead outperforms other hybrid designs in long-context tasks while maintaining strong general reasoning. With interpretability-driven head selection, it matches a 3:1 layer-wise hybrid's long-context performance at a 7:1 LA-to-FA ratio. Crucially, trained on only 15B tokens, HydraHead achieves over 69% improvement over the baseline at 512K context length, approaching Qwen3.5, a leading model of comparable size with a native context length of 256K. This highlights the significant scaling potential of head-level hybridization.

Original paper on arXiv
