AI Pulse
📄 Paper Digest

Halve AI Inference Memory, With No More Manual Threshold Tuning

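During large-model inference, the KV cache consumes a large share of GPU memory. Existing compression methods require you to set a "threshold" in advance (for example, how much of the cache to keep), but the best threshold varies widely across inputs: short versus long texts and simple versus complex questions call for very different settings. Set it wrong and performance drops sharply. This paper removes the threshold altogether and lets the method allocate the cache budget dynamically, keeping more where it matters and less where it does not. Across 13 datasets, it achieves lossless compression, cuts memory use by half, and runs faster. It is not a tool you can use directly tomorrow, but it points to a direction: future AI inference will need less hand-holding, with no manual parameter tuning for each scenario.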
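To make the input-sensitivity problem concrete, here is a minimal Python sketch. It is not ReFreeKV and is not taken from the paper. The integer "attention weights", the function names `retained_mass` and `min_budget_for_mass`, and the 85% coverage target are all hypothetical choices made for illustration. It only shows why one fixed cache budget cannot fit every input: the same budget preserves most of the attention mass for one input and very little for another.

```python
# Illustrative only: toy attention weights, not real model outputs.
# All names and numbers here are assumptions made for this sketch.

def retained_mass(scores, budget):
    """Fraction of total attention mass kept when only the
    `budget` highest-scoring cache entries are retained."""
    total = sum(scores)
    kept = sorted(scores, reverse=True)[:budget]
    return sum(kept) / total


def min_budget_for_mass(scores, target):
    """Smallest number of top-scoring entries whose combined
    mass reaches the `target` fraction of the total."""
    total = sum(scores)
    running = 0
    for count, score in enumerate(sorted(scores, reverse=True), start=1):
        running += score
        if running / total >= target:
            return count
    return len(scores)


# Input A: attention concentrated on two tokens (total weight 100).
peaked = [70, 20] + [1] * 10
# Input B: attention spread evenly over ten tokens (total weight 100).
flat = [10] * 10

if __name__ == "__main__":
    for name, scores in (("peaked", peaked), ("flat", flat)):
        print(name,
              "mass kept with budget 2:", retained_mass(scores, 2),
              "| entries needed for 85% mass:", min_budget_for_mass(scores, 0.85))
```

With a fixed budget of 2 entries, the peaked input keeps 90% of its attention mass while the flat input keeps only 20%. To reach 85% coverage, the peaked input needs 2 entries and the flat one needs 9. Note that the coverage target in this sketch is itself a hand-set parameter, so it is not threshold-free in the paper's sense. It only illustrates why a budget that adapts to the input is attractive.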

📄 Original Abstract

To reduce memory consumption during LLM inference, a handful of methods have been proposed for KV cache pruning. While these techniques can accomplish lossless memory reduction on many datasets, they often hinge on an under-emphasized condition: an input/domain-specific threshold for KV cache budget needs to be pre-determined to achieve the optimal performance. However, such input-sensitive design may be considerably limited in real-world scenarios, as open-domain inputs span diverse domains, lengths and difficulty levels, without clear boundaries for threshold selection. As a result, the dependence of such input-sensitive threshold can be a fundamental limitation that causes large degradation on arbitrary inputs. In this work, we propose a new objective that lifts the threshold constraints for robust KV compression, advocating for "threshold-free" methods that adaptively adjust budget allocation while preserving full-cache performance. We then propose a novel method, ReFreeKV, serving as the first instantiation of this objective. Extensive experiments across 13 datasets with diverse context lengths, task types, and model sizes demonstrate its efficacy and efficiency. Our code is publicly released at https://github.com/Patrick-Ni/ReFreeKV.

Original paper on arXiv
