📄 Paper Digest

AI Learns to "Compress" Long Texts, Remembering Only the Key Points Like a Human

Topics: long context · memory compression · encoder-decoder · latent variables · skimming

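Large language models hit a memory wall on long inputs, because every token adds its own entry to the KV cache (the model's "memory cache"). Existing compression methods either degrade model quality substantially or need considerable time and compute to compress a single long prompt. This paper takes a different route: it trains a dedicated "compressor" model that condenses a long text (a whole book, say) into a short sequence of latent embeddings, which is then fed to the main model. The authors first pre-trained many architecture variants from scratch, then used what they learned to build a family of models, each pairing a 0.6B encoder with a 4B decoder, at compression ratios of 1:4, 1:8, and 1:16. According to the abstract, these models improve the trade-off between task performance, compression speed, and peak memory usage. Better still, the compressor supports something like human skim reading: an agent scans the compressed version first and expands the original text only where it matters. It is not a tool you can use tomorrow, but it points to a direction: future AI may stop memorizing everything verbatim and learn to "pick out the key points" instead.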
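To make the memory argument concrete, here is a rough back-of-the-envelope sketch in Python. It is not taken from the paper: the layer count, KV-head count, head dimension, and context length are hypothetical placeholders, and it assumes that each latent embedding occupies exactly one position in the decoder's cache. It also ignores whatever memory the encoder itself needs while compressing.

```python
def kv_cache_bytes(num_positions, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size for a decoder-only transformer.

    Each position stores one key vector and one value vector per layer,
    hence the leading factor of 2. bytes_per_value=2 corresponds to fp16/bf16.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_positions


def compressed_positions(num_tokens, ratio):
    """Number of latent positions at a 1:ratio compression (ceiling division)."""
    return -(-num_tokens // ratio)


# Hypothetical decoder configuration, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
CONTEXT_TOKENS = 128_000

full = kv_cache_bytes(CONTEXT_TOKENS, LAYERS, KV_HEADS, HEAD_DIM)
print(f"uncompressed: {full / 2**30:.2f} GiB")

for ratio in (4, 8, 16):  # the ratios named in the abstract
    latents = compressed_positions(CONTEXT_TOKENS, ratio)
    small = kv_cache_bytes(latents, LAYERS, KV_HEADS, HEAD_DIM)
    print(f"1:{ratio:<2} -> {latents} latent positions, {small / 2**30:.2f} GiB")
```

Under these assumptions the cache shrinks linearly with the compression ratio, which is why shortening the sequence the decoder sees translates directly into lower peak memory.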

📄 Original Abstract (English)

Long-context language model inference is bottlenecked by memory, as the KV cache grows with context length. Recent techniques to compress the KV cache fall short: they either degrade model quality substantially or require considerable time and compute to compress a single long prompt. Furthermore, many methods require the input to fit within the target model's context window, and are generally incompatible with modern production inference engines. Encoder-decoder compressors, which map a long token sequence to a shorter sequence of latent embeddings consumed by a decoder, are an appealing alternative in principle. However, existing approaches are not competitive with KV cache compression on the accuracy-efficiency frontier. In this work, we revisit encoder-decoder compression and close this gap. We first perform an architecture search, pre-training many variants from scratch to determine how best to design and train encoder-decoder compressors. Guided by our findings, we continually pre-train a family of 0.6B-encoder, 4B-decoder models on over 350B tokens each, at compression ratios of 1:4, 1:8, and 1:16. We introduce Latent Context Language Models (LCLMs), a family of compressors that improve the Pareto frontier across general-task performance, compression speed, and peak memory usage. We demonstrate that LCLMs serve as efficient backbones for long-horizon agents, letting the agent skim through a compressed long context and adaptively expand relevant segments on demand.

Original paper on arXiv
