AI Pulse
📄 Paper Explainer

Faster LLM Inference: Routing Each Request to the Right Experts

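During inference, a mixture-of-experts (MoE) model activates a different set of "expert" modules for each request. Existing decode routers only balance load, but two equally loaded workers can still differ noticeably in latency, because each decode step has to load the weights of every distinct expert its batch activates. ELDR uses the experts a request activates during prefill to predict which experts it will activate during generation, then routes the request to the least-loaded decode worker among those that best match that prediction. On deployments of up to 40 GPUs, it cuts median time per output token (TPOT) by 5.9-13.9% relative to the strongest of four load-balancing baselines, with model outputs unchanged. This is not a tool you can drop in tomorrow, but it points to a new direction for MoE serving optimization: from "watch the load" to "watch expert affinity".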

📄 Original Abstract

In prefill-decode (PD) disaggregated LLM serving, each request is assigned to a decode worker after prefill. Existing decode routers balance only load; for mixture-of-experts (MoE) models this is incomplete: equally loaded workers can differ in latency, since each decode step loads the weights of every distinct expert its batch activates. We present ELDR, an expert-locality-aware decode router for PD-disaggregated MoE serving. From a request's prefill expert activations, ELDR builds an expert signature predicting the experts it will activate during generation. Offline, balanced K-means partitions signature space across decode workers; online, locality-band routing sends each request to the least-loaded worker among those best matching its signature. A signature cache, co-indexed with the KV cache at KV-block granularity, keeps signatures exact under prefix caching. Implemented in vLLM and evaluated on deployments of up to 40 GPUs, ELDR reduces median TPOT by 5.9-13.9% over the strongest of four load-balancing baselines across three MoE models and two workloads, with model outputs unchanged.
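The online routing step described in the abstract can be illustrated with a small sketch. This is not ELDR's implementation: the signature is assumed here to be a normalized histogram of prefill expert activations, similarity is assumed to be cosine similarity, the band is assumed to be a fixed similarity margin, and `centroids` stands in for whatever the offline balanced K-means step would produce. All function and parameter names are hypothetical.

```python
from collections import Counter
from math import sqrt


def build_signature(prefill_expert_ids, num_experts):
    """Assumed signature: normalized histogram of experts activated in prefill."""
    counts = Counter(prefill_expert_ids)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * num_experts
    return [counts.get(e, 0) / total for e in range(num_experts)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def route(signature, centroids, loads, band=0.05):
    """Pick the least-loaded worker among those within `band` of the best match.

    centroids[w] is the (assumed) signature-space centroid owned by worker w;
    loads[w] is that worker's current load, e.g. number of running requests.
    """
    sims = [cosine(signature, c) for c in centroids]
    best = max(sims)
    # The "locality band": every worker whose match is close enough to the best.
    candidates = [w for w, s in enumerate(sims) if s >= best - band]
    # Least-loaded candidate wins; ties are broken by higher similarity.
    return min(candidates, key=lambda w: (loads[w], -sims[w]))


# Three decode workers, four experts (toy numbers, not from the paper).
centroids = [
    [0.5, 0.5, 0.0, 0.0],
    [0.45, 0.45, 0.1, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]
loads = [8, 3, 0]

sig = build_signature([0, 1, 0, 1, 0], num_experts=4)
print(route(sig, centroids, loads, band=0.05))
```

With `band=0.0` the router degenerates to pure locality (always the best-matching worker), and with a very wide band it degenerates to pure load balancing. In this sketch the band width is the knob that trades one off against the other.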

Original paper on arXiv
