AI Pulse
📄 Paper Explainer

An Upgraded Router for MoE Models: Picking Experts More Precisely

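Mixture-of-Experts (MoE) models are one of the mainstream architectures for today's large models. An MoE layer contains several "expert" sub-networks, and a router decides which experts handle each input. Each row of the router matrix is supposed to act as a compact stand-in for one expert, but existing designs have no principle that forces a row to actually summarize its expert, so expert selection can be imprecise. This paper proposes a fix: align each router row with the principal singular direction of its expert. Put simply, the router learns what each expert is best at. The authors achieve this alignment with a technique called Manifold Power Iteration (MPI), and they pretrain MoE models from 1B to 11B parameters to show that the alignment produces more effective models. This is not a tool you can pick up and use tomorrow, but it points to a key direction for MoE optimization that could make future large models more efficient and more capable.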
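To make the routing step concrete, here is a minimal sketch of how a router could score one token against every expert and pick a subset. The function name `route_top_k`, the top-k selection rule, and the softmax weighting are assumptions chosen for illustration; the summary above only says that the router decides which experts handle each input.

```python
import numpy as np

def route_top_k(router, token, k=2):
    """Score one token against every router row and keep the k best experts.

    router: (num_experts, d_model) matrix, one row per expert (assumed layout).
    token:  (d_model,) input vector.
    """
    scores = router @ token                 # one dot-product per expert
    top = np.argsort(scores)[::-1][:k]      # indices of the k highest scores
    # Softmax over the selected scores only (an assumed gating choice).
    weights = np.exp(scores[top] - scores[top].max())
    return top, weights / weights.sum()

# Toy usage with 4 hypothetical experts and a 3-dimensional token.
router = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.5, 0.5, 0.0],
                   [-1.0, 0.0, 0.0]])
token = np.array([2.0, 0.0, 0.0])
experts, gate = route_top_k(router, token, k=2)
```

The dot-product is only a useful affinity signal if each router row really represents its expert, which is the gap the paper targets.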

📄 Original Abstract

Router is the cornerstone component to the Mixture-of-Experts models. Serving as expert proxies, the rows of the router matrix compute their similarity to the MoE inputs to determine which subset of experts is activated. Ideally, each router row is designed to encode the expert matrix into this representative vector, such that its dot-product with token can better reflect token-expert affinity. However, there exists no design principles to enforce this condensation. In this paper, we propose to align each router row with the principal singular direction of the associated expert, as this direction provides the most expressive mathematical description of a matrix. Based on this principle, we propose a router redesign with Manifold Power Iteration (MPI). Specifically, it introduces a "Power-then-Retract" paradigm, where a power iteration step is performed on the router weights, followed by a retraction to impose a norm constraint to ensure both efficiency and stability. Theoretically, we show that MPI drives router rows to converge toward the principal singular directions of associated experts. Empirically, we pretrain MoE model across scales from 1B to 11B parameters to confirm that this alignment facilitates more effective MoE models.
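The "Power-then-Retract" idea in the abstract can be illustrated with textbook power iteration. The sketch below is not the paper's MPI update, whose exact form the abstract does not give. It assumes an expert weight matrix `W` of shape `(d_ff, d_model)`, applies the power step `W.T @ W @ r` to a router row `r`, and uses renormalization to unit length as the retraction. The helper names `power_then_retract`, `make_expert`, and `alignment` are hypothetical.

```python
import numpy as np

def power_then_retract(expert_w, router_row, steps=50):
    """Textbook power iteration on W^T W, renormalizing after every step."""
    r = router_row / np.linalg.norm(router_row)
    for _ in range(steps):
        r = expert_w.T @ (expert_w @ r)   # power step
        r = r / np.linalg.norm(r)         # retraction: back to unit norm
    return r

def make_expert(rng, d_out=16, d_in=8):
    """Hypothetical toy expert with well-separated singular values."""
    u, _ = np.linalg.qr(rng.standard_normal((d_out, d_in)))
    v, _ = np.linalg.qr(rng.standard_normal((d_in, d_in)))
    s = 5.0 * 0.5 ** np.arange(d_in)      # 5, 2.5, 1.25, ...
    return (u * s) @ v.T                  # equals u @ diag(s) @ v.T

def alignment(expert_w, router_row):
    """Absolute cosine between the row and the top right singular vector."""
    _, _, vt = np.linalg.svd(expert_w, full_matrices=False)
    return abs(vt[0] @ router_row) / np.linalg.norm(router_row)

rng = np.random.default_rng(0)
W = make_expert(rng)
r0 = rng.standard_normal(W.shape[1])      # random initial router row
r = power_then_retract(W, r0)
before, after = alignment(W, r0), alignment(W, r)
```

Comparing `before` and `after` shows how far the row has moved toward the expert's principal singular direction. The norm constraint keeps the iterate from growing or vanishing across steps, which matches the stability role the abstract assigns to the retraction.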

Original paper on arXiv
