📄 Paper digest

AI inference acceleration: giving large models “tiered security screening”

Trending ▲ 32 · LLM inference · Speculative decoding · Acceleration · Tiered verification

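When a large model generates text, speculative decoding has a small “draft model” quickly write a first pass, which a larger “verifier model” then checks token by token. The conventional approach is binary: a draft token is either accepted or fully recomputed by the large model. It is like airport security that either waves you through or empties your whole bag. This paper finds that many rejected draft tokens can be verified correctly without the full large model. The authors derive a slim submodel from inside the verifier via intra-model routing and use it to handle medium-confidence tokens; high-confidence tokens are accepted directly, and only the most uncertain ones go to the full model. Across four representative tasks and multiple model families, this tiered scheme lowers rejection rates by 0.10-0.22 and runs 10-20% faster than strong speculative decoding baselines, and 2.5-3x faster than decoding without a drafter. It is also compatible with existing speculative decoding frameworks without changing their training procedures. This is not a tool you can use tomorrow, but it points to a trend: AI inference is moving from “one size fits all” to “allocation on demand”, so that compute is spent where it matters.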
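The three-tier routing described above can be illustrated with a toy sketch. This is not the paper's implementation: the per-token confidence scores, the two thresholds, and the names `route_token`, `verify_draft`, `slim_verifier`, and `full_verifier` are all hypothetical, and the loop handles tokens one at a time rather than verifying them in parallel as real speculative decoding does.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical thresholds: the source does not say how the tiers are chosen.
ACCEPT_THRESHOLD = 0.9   # at or above: accept the draft token directly
SLIM_THRESHOLD = 0.5     # at or above (but below ACCEPT): use the slim verifier

# A verifier takes the tokens emitted so far and the draft token,
# and returns the token to emit. Both verifiers are stand-ins here.
Verifier = Callable[[List[str], str], str]


def route_token(confidence: float) -> str:
    """Pick a tier for one draft token from its (assumed) confidence score."""
    if confidence >= ACCEPT_THRESHOLD:
        return "accept"
    if confidence >= SLIM_THRESHOLD:
        return "slim"
    return "full"


def verify_draft(
    draft: List[Tuple[str, float]],
    slim_verifier: Verifier,
    full_verifier: Verifier,
) -> Tuple[List[str], Dict[str, int]]:
    """Route each (token, confidence) pair to a tier and count tier usage."""
    output: List[str] = []
    calls = {"accept": 0, "slim": 0, "full": 0}
    for token, confidence in draft:
        tier = route_token(confidence)
        calls[tier] += 1
        if tier == "accept":
            output.append(token)                         # no verifier call
        elif tier == "slim":
            output.append(slim_verifier(output, token))  # cheap submodel
        else:
            output.append(full_verifier(output, token))  # full large model
    return output, calls


if __name__ == "__main__":
    # Stub verifiers that only mark which tier produced each token.
    slim = lambda prefix, tok: tok.upper()
    full = lambda prefix, tok: "<" + tok + ">"
    draft = [("the", 0.95), ("cat", 0.7), ("sat", 0.2)]
    tokens, calls = verify_draft(draft, slim, full)
    print(tokens)
    print(calls)
```

The point of the sketch is the cost structure: only tokens that fall below both thresholds trigger the expensive full-model call, which is where the savings described in the summary would come from.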

📄 Original abstract

Speculative decoding (SD) addresses the high inference costs of LLMs by having lightweight drafters generate candidates for large verifiers to validate in parallel. Existing draft-verify methods use binary decisions: accept or fully recompute. Yet we find that many rejected tokens can be verified correctly by a slim submodel derived from the full verifier via intra-model routing, instead of the full verifier. This motivates our slim-verifier to handle tokens requiring moderate verification resources, reducing expensive large-model calls. We propose Verification via Intra-Model Routing for Speculative Decoding (VIA-SD), a multi-tier framework using a routed slim-verifier. Draft tokens are processed hierarchically: direct acceptance for high-confidence cases, slim-verifier regeneration for medium-confidence cases, and full-model verification for uncertain cases. Across four representative tasks and multiple model families, VIA-SD reduces rejection rates by 0.10-0.22 and delivers 10-20% speedups over strong SD baselines, while achieving 2.5-3x acceleration over non-drafting decoding. Moreover, VIA-SD is compatible with existing SD frameworks without modifying their training procedures. Our results suggest multi-tier SD as a general paradigm for scalable and efficient LLM inference. Project page: https://zju-xyc.github.io/VIA-SD-Project-Page/

Original paper on arXiv
