AI Pulse
📄 Paper Breakdown

AI Inference Sped Up 4x: An Adaptive Dynamic Block Size Policy

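In diffusion-based speculative decoding, existing methods generate tokens in batches using a fixed-size "block," yet the best block size varies widely from one input to the next. This paper finds that the optimal block size clusters around the block size used in training, which gives the decision space a low-dimensional structure. Building on that, the authors design a lightweight policy, BlockPilot, that picks the best block size for the current input with a single prediction, made right after the model has finished reading the input (the prefill stage). In experiments on Qwen3-4B at temperature T=1, the method reaches an acceptance length of 5.92 and a 4.2x inference speedup with almost no added overhead. This is not a tool you can pick up and use tomorrow, but it points to a new direction for accelerating AI inference: let the strategy adapt to each input instead of applying one setting to everything.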
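To make the one-shot selection step described above more concrete, here is a minimal sketch of how such a policy might look. It is not the authors' implementation: the candidate set, the assumed training block size of 16, the mean pooling, and the linear scoring head are all hypothetical choices made for illustration, and the names `CANDIDATE_BLOCK_SIZES`, `mean_pool`, and `select_block_size` do not come from the paper.

```python
import math

# Hypothetical candidates clustered around an assumed training block size of 16.
CANDIDATE_BLOCK_SIZES = [8, 12, 16, 20, 24]


def mean_pool(hidden_states):
    """Average the per-token prefill vectors into a single feature vector."""
    dim = len(hidden_states[0])
    count = len(hidden_states)
    return [sum(tok[i] for tok in hidden_states) / count for i in range(dim)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_block_size(hidden_states, weights, biases):
    """Choose a block size once, right after prefill.

    weights holds one weight vector per candidate block size and biases one
    scalar per candidate (an assumed linear policy head, not the paper's).
    """
    pooled = mean_pool(hidden_states)
    logits = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weights, biases)
    ]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CANDIDATE_BLOCK_SIZES[best], probs


if __name__ == "__main__":
    # Toy prefill output: 3 tokens with 2-dimensional hidden states.
    hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    # Toy weights that favor the third candidate.
    weights = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
    size, probs = select_block_size(hidden, weights, [0.0] * 5)
    print(size, [round(p, 3) for p in probs])
```

Because the candidates sit in a small neighborhood of the training block size, the policy only has to solve a few-way classification problem, and because it runs once per input, its cost is paid a single time rather than at every decoding step.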

📄 Original Abstract (English)

Speculative decoding accelerates inference by using a lightweight draft model to generate candidate tokens in parallel, which are then verified by the target model, enabling lossless acceleration. Recently, diffusion-based speculative decoding further improves parallelism by generating multiple tokens per forward pass via block-level diffusion, achieving state-of-the-art (SOTA) performance. However, existing methods adopt a fixed inference block size and assume a uniform optimal decoding strategy across all inputs. In this paper, we show that this assumption is suboptimal, as the optimal block size varies across samples and plays a critical role in speculative decoding performance. Moreover, these values exhibit a clear local structure, concentrating around the training block size, which reduces the problem to a low-dimensional and structured decision space. Based on these insights, we propose BlockPilot, a sample-adaptive policy that predicts the optimal block size from the prefilling representation. Specifically, we formulate block size selection as a lightweight policy learning problem and propose an instance-adaptive decision mechanism that predicts the optimal block size based on the representation of the prefilling stage. The prediction is performed only once after prefilling, allowing for seamless integration. Extensive experiments demonstrate that our method is plug-and-play, introduces minimal overhead, and consistently improves efficiency, achieving an acceptance length of 5.92 and a 4.20× speedup on Qwen3-4B under temperature T=1.
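The acceptance length quoted in the abstract is the average number of tokens emitted per verification step. The sketch below shows one way to compute it, under two simplifying assumptions that are not taken from the paper: verification is greedy (a draft token is accepted only if it matches the target model's token at that position), and each step also counts one token supplied by the target model itself. The reported T=1 setting would use sampling-based verification instead, and the function names here are hypothetical.

```python
def accepted_tokens(draft, target):
    """Tokens emitted in one verification step under greedy prefix matching.

    Counts the longest matching prefix of the draft, plus one token that the
    target model supplies (the correction at the first mismatch, or a bonus
    token when the whole draft matches). The +1 convention is an assumption.
    """
    matched = 0
    for draft_tok, target_tok in zip(draft, target):
        if draft_tok != target_tok:
            break
        matched += 1
    return matched + 1


def mean_acceptance_length(steps):
    """Average emitted tokens over a list of (draft, target) verification steps."""
    return sum(accepted_tokens(d, t) for d, t in steps) / len(steps)


if __name__ == "__main__":
    steps = [
        ([1, 2, 3, 4], [1, 2, 9, 4, 5]),  # 2 matches, then a correction
        ([5, 6], [5, 6, 7]),              # full match plus a bonus token
        ([1], [2, 3]),                    # immediate mismatch
    ]
    print(mean_acceptance_length(steps))
```

A larger block gives the draft more chances to extend the accepted prefix but wastes more work when an early token is rejected, which is one way to see why the best block size can differ between inputs.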

Original paper on arXiv
