Slimming down AI models: keep only the key layers for faster long-text processing
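When large models process long texts, full attention is computationally very expensive. This paper finds that not every layer needs full attention: keep a small number of key layers, replace the rest with lighter linear attention, and the model gets much faster with almost no loss in performance. The hard part is deciding which layers to keep. Earlier approaches relied on rules of thumb or on scoring layers one at a time, which ignores how layers influence one another. The researchers propose FlashMorph, which first attaches a "linear-attention backup branch" to every layer and then uses an optimization procedure to find the best set of layers to keep automatically. Experiments show that the configurations it finds are more effective than hand-designed ones, and that the selection step costs far less than existing selection methods. This is not a tool you can use tomorrow, but it points to a new way of thinking about model compression: rather than bluntly cutting parameters, allocate computation wisely.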
📄 Original abstract
Hybrid attention models improve long-context efficiency by retaining only a subset of full-attention layers and replacing the remaining layers with linear attention. However, the effectiveness of Transformer-to-hybrid conversion critically depends on which layers preserve full attention. Existing hybrid layer selection methods typically rely on heuristic strategies such as fixed placement patterns or layerwise scoring, implicitly treating layer importance as isolated and overlooking the interdependent layer effect under a global hybrid configuration. In this work, we formulate hybrid layer selection as a budget-constrained subset optimization problem. We further propose FlashMorph (Fast LAyer Selection for Hybrid MORPHing), an effective, efficient and scalable layer selection method for Transformer-to-hybrid conversion. FlashMorph first constructs a morphable model by equipping each full-attention layer with a converted linear-attention branch. It then freezes all model weights and jointly optimizes layerwise gates on synthetic long-context retrieval data, with a linearization regularization that encourages the model to rely on linear attention for efficiency. The learned gates are discretized under a preset full-attention budget to instantiate the hybrid architecture, followed by standard logits distillation and long-context finetuning. Extensive experiments show that FlashMorph discovers more effective hybrid configurations, preserves strong long-context recall and general benchmark performance while substantially reducing layer selection cost compared with existing layer selection methods, demonstrating its effectiveness, efficiency, and scalability.
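To make the gate-then-discretize idea in the abstract concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify the gate parameterization, the form of the linearization regularizer, or the discretization rule. The sketch assumes a sigmoid gate that blends the two branches, a regularizer equal to the mean gate value scaled by a hypothetical weight `lam`, and a top-k rule that keeps the `budget` layers with the largest gates. All function names are hypothetical.

```python
import math


def sigmoid(x):
    """Map a real-valued gate logit to a gate value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def mix_outputs(gate_logit, full_out, linear_out):
    """Blend the outputs of one layer's two branches (assumed gate form).

    A gate near 1 relies on full attention, a gate near 0 on linear attention.
    """
    g = sigmoid(gate_logit)
    return [g * f + (1.0 - g) * l for f, l in zip(full_out, linear_out)]


def linearization_penalty(gate_logits, lam=0.1):
    """Assumed regularizer: penalize the average reliance on full attention."""
    gates = [sigmoid(x) for x in gate_logits]
    return lam * sum(gates) / len(gates)


def select_full_attention_layers(gate_values, budget):
    """Assumed discretization: keep the `budget` layers with the largest gates."""
    if not 0 <= budget <= len(gate_values):
        raise ValueError("budget must be between 0 and the number of layers")
    ranked = sorted(range(len(gate_values)),
                    key=lambda i: gate_values[i], reverse=True)
    return sorted(ranked[:budget])


def layer_plan(gate_values, budget):
    """Return the attention type assigned to each layer of the hybrid model."""
    keep = set(select_full_attention_layers(gate_values, budget))
    return ["full" if i in keep else "linear" for i in range(len(gate_values))]


if __name__ == "__main__":
    # Made-up gate values for a 6-layer model, with a budget of 2 full layers.
    learned_gates = [0.1, 0.9, 0.3, 0.8, 0.2, 0.05]
    print(layer_plan(learned_gates, budget=2))
```

Because the gates are trained jointly rather than scored one layer at a time, the value each gate settles on already reflects what the other layers are doing, which is the interdependence the abstract says layerwise scoring overlooks. In the pipeline described in the abstract, the resulting plan would then be followed by logits distillation and long-context finetuning, which this sketch does not cover.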