AI Inference Nearly 10x Faster: New Method Breaks the Speculative Decoding Ceiling
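When a large language model generates an answer, it normally writes one token at a time, which is slow. Speculative decoding speeds this up by having a lightweight drafter guess a run of tokens that the main model then verifies in a single pass. Until now, guessing more did not reliably help. Drafters that guess token by token get slower as they guess further ahead, while drafters that guess every position at once can produce tokens that each look plausible but contradict one another, so much of the guess is thrown away. JetSpec addresses this trade-off: it drafts in a single pass, yet each guessed token is conditioned on the tokens guessed before it on the same branch, so every candidate sequence stays internally consistent. As a result, a larger guessing budget turns into a larger speedup. The authors report up to 9.64x faster generation on math problems (MATH-500) and 4.58x on open-ended conversation, measured on H100 GPUs with Qwen3 models. This is not a feature you can switch on tomorrow, but it points to faster AI responses in the future, especially for reasoning-heavy work such as math and coding.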
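The draft-then-verify loop described above can be sketched in a few lines. This is a minimal toy illustration of plain greedy speculative decoding, not JetSpec itself: the "models" are hypothetical functions (`toy_target`, `toy_drafter`) that map a list of integer tokens to the next token, and the verification loop calls the target once per position where a real system would use one batched forward pass.

```python
from typing import Callable, List

# A "model" here is just a function from a token context to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, drafter: NextToken,
                     context: List[int], k: int) -> List[int]:
    """Draft k tokens with the cheap drafter, keep the longest prefix the
    target agrees with, and add one token chosen by the target itself."""
    # 1) Draft: each guess is conditioned on the guesses before it.
    draft: List[int] = []
    ctx = list(context)
    for _ in range(k):
        tok = drafter(ctx)
        draft.append(tok)
        ctx.append(tok)

    # 2) Verify: a real system scores all k positions in one batched target
    #    pass; this toy version queries the target position by position.
    accepted: List[int] = []
    ctx = list(context)
    for tok in draft:
        expected = target(ctx)
        if tok != expected:
            accepted.append(expected)  # replace the first wrong guess and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target(ctx))       # every guess accepted: one bonus token
    return accepted


def generate(target: NextToken, drafter: NextToken,
             prompt: List[int], n_new: int, k: int) -> List[int]:
    """Repeat speculative steps until n_new tokens have been produced."""
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(target, drafter, out, k))
    return out[:len(prompt) + n_new]


# Hypothetical toy models: the target counts upward modulo 10; the drafter
# agrees with it everywhere except right after token 4, where it guesses 0.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_drafter(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_step(toy_target, toy_drafter, [1], 4))  # [2, 3, 4, 5]
```

In the example call, the drafter proposes `[2, 3, 4, 0]`; the target accepts the first three guesses and corrects the fourth, so one verification step yields four tokens. The output always matches what the target alone would have produced, which is why speculative decoding changes speed but not the result.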
📄 Original abstract (English)
Speculative decoding (SD) accelerates autoregressive Large Language Models (LLMs) by drafting multiple tokens and verifying them in parallel, but it faces a scaling limitation: increasing the draft budget improves speed only when acceptance remains high and drafting overhead stays low. This ceiling has been difficult to break because prior head-based SD methods face a causality-efficiency dilemma. Autoregressive drafters produce path-conditioned candidates that are effective for tree speculative decoding with higher acceptance length, but their drafting cost grows with tree depth. Bidirectional block-diffusion drafters generate all positions in one pass, but their branch-agnostic marginals can form individually plausible yet mutually inconsistent trees, wasting budget and reducing acceptance. We propose JetSpec, a head-based SD framework that combines one-forward drafting efficiency with branch-wise causal conditioning. JetSpec trains a causal parallel draft head over fused hidden states from the frozen target model, producing candidate trees whose scores align with the target model's autoregressive factorization. This enables JetSpec to convert larger draft budgets into longer accepted prefixes and higher end-to-end speedup. Across math, coding, and chat benchmarks on dense and MoE Qwen3 models, JetSpec consistently outperforms bidirectional-head and tree-based SD baselines. On H100 GPUs, JetSpec achieves up to 9.64x speedup on MATH-500 and 4.58x on open-ended conversational workloads, with further latency gains demonstrated through vLLM integration under realistic serving loads. Our code and models are available at https://github.com/hao-ai-lab/JetSpec.
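The scaling limitation in the abstract can be made concrete with a rough back-of-the-envelope model. The sketch below is an illustration under simplifying assumptions, not the paper's analysis: it drafts a single chain rather than a tree, treats each drafted token as accepted independently with a fixed probability `alpha`, and assumes verifying all drafted positions costs the same as one target forward pass. The values `ALPHA` and `C` are hypothetical. Under these assumptions, the expected number of tokens per step for a chain of `k` drafts is `(1 - alpha**(k + 1)) / (1 - alpha)`.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens produced per verification step for a chain of k drafted
    tokens, assuming each is accepted independently with probability alpha."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def modeled_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Tokens per step divided by the step cost, where cost is measured in
    units of one target forward pass (1 for verification plus drafting)."""
    return expected_tokens_per_step(alpha, k) / (1.0 + draft_cost)


ALPHA = 0.8  # hypothetical per-token acceptance rate
C = 0.1      # hypothetical cost of one drafter pass relative to one target pass


def autoregressive_speedup(k: int) -> float:
    # An autoregressive drafter runs once per drafted token: cost grows with k.
    return modeled_speedup(ALPHA, k, C * k)


def one_pass_speedup(k: int) -> float:
    # A one-forward drafter pays a single drafter pass regardless of k.
    return modeled_speedup(ALPHA, k, C)


if __name__ == "__main__":
    for k in (1, 2, 4, 6, 8, 16):
        print(f"k={k:2d}  autoregressive={autoregressive_speedup(k):.2f}  "
              f"one-pass={one_pass_speedup(k):.2f}")
```

With these numbers the autoregressive curve peaks at a moderate `k` and then declines as drafting cost dominates, while the one-pass curve keeps rising. The model holds `alpha` fixed for both drafters, which is exactly what the abstract says bidirectional one-pass drafters fail to achieve; keeping acceptance high at one-pass cost is the gap JetSpec is described as targeting.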