AI Pulse
📄 Paper Digest

Attention Heads No Longer All On: Half the Query Heads Active, No Drop in Accuracy

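Self-attention is the compute sink of Transformer models: every token interacts with every other token, and every attention head is activated for every token regardless of content. This paper goes the other way: each token gets only half of the query heads, and the model learns which half to pick. The method, Grouped Query Experts (GQE), adds a router inside each group of grouped-query attention (GQA) that selects the k most relevant query heads based on the token's content, while all key-value heads stay active. Trained at the 250M-parameter scale on 30B tokens, the model with half its query heads active matched the all-active GQA baseline on downstream tasks. It is not something you can use tomorrow, but it points to a trend: instead of spending the same compute on every token, models learn to allocate it on demand, which matters most in long-context settings.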
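To make the routing step concrete, here is a minimal sketch of per-token top-k selection inside each GQA group. It is an illustration under stated assumptions, not the paper's implementation: the function names `route_query_heads` and `active_fraction`, the example logits, and the choice to renormalise gate weights with a softmax over the selected heads are all hypothetical. The abstract only says that a router picks k query-head experts per token while the KV heads stay dense.

```python
import math

def route_query_heads(router_logits, k):
    """Pick the top-k query heads per GQA group for a single token.

    router_logits: one list of per-head router scores for each KV group.
    Returns one dict per group mapping selected head index -> gate weight.
    The softmax over the selected heads is an assumption borrowed from
    common mixture-of-experts practice, not a detail taken from the paper.
    """
    routed = []
    for group_scores in router_logits:
        if not 0 < k <= len(group_scores):
            raise ValueError("k must be between 1 and the number of heads per group")
        # Indices of the k highest-scoring query heads in this group.
        top = sorted(range(len(group_scores)),
                     key=lambda h: group_scores[h], reverse=True)[:k]
        # Numerically stable softmax restricted to the selected heads.
        peak = max(group_scores[h] for h in top)
        exps = [math.exp(group_scores[h] - peak) for h in top]
        total = sum(exps)
        routed.append({h: e / total for h, e in zip(top, exps)})
    return routed

def active_fraction(routed, heads_per_group):
    """Share of query heads that actually run for this token."""
    active = sum(len(gates) for gates in routed)
    return active / (len(routed) * heads_per_group)

# Hypothetical token: 2 KV groups, 4 query heads per group, k = 2.
example_logits = [[0.1, 2.0, -1.0, 0.5], [1.5, -0.3, 0.2, 0.9]]
routed = route_query_heads(example_logits, k=2)
for group_id, gates in enumerate(routed):
    print(group_id, sorted(gates))  # group 0 -> [1, 3], group 1 -> [0, 3]
print(active_fraction(routed, heads_per_group=4))  # 0.5
```

In this sketch only the selected query heads would compute attention for the token. The key-value heads are untouched, which is why the KV cache size stays the same as in plain GQA.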

📄 Original Abstract (English)

Self-attention is central to Transformer performance and is often the most expensive part of the Transformer at long context lengths because its pairwise token interactions scale quadratically with sequence length. Standard dense attention also applies the same set of attention heads to every token regardless of token difficulty or information content. This uniform activation can waste compute, especially as sequences grow longer and attention cost increases rapidly. We propose Grouped Query Experts (GQE), a mixture-of-experts layer on top of grouped-query attention (GQA). Within each GQA group, a router selects k query-head experts per token while all key-value (KV) heads remain dense and unchanged. Thus, GQE keeps the KV cache benefits of GQA and reduces only the active query-head computation. On a fixed 30B token budget at the 250M parameter scale, GQE matches the all-active GQA baseline in downstream accuracy while activating half the query heads per token.

Original paper on arXiv
