📄 Paper Digest

Reading a million tokens without choking: MiniMax's sparse attention cuts per-token attention compute by 28.4x

Tags: sparse attention, long context, LLM inference acceleration, MiniMax

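Attention compute in a large model grows with the square of the sequence length, so on very long inputs (a whole novel, an entire code repository) both memory use and speed break down. This MiniMax paper takes a direct approach: not every token deserves equal attention. Their design, MiniMax Sparse Attention (MSA), first makes a cheap pass that scores blocks of keys and values, keeps only the top-scoring blocks, and then runs exact attention over just those blocks. On a 109B-parameter natively multimodal model at a 1M-token context, per-token attention compute drops by 28.4x, and measured wall-clock inference on H800 is 14.2x faster for prefill and 7.6x faster for decoding. Crucially, the authors report quality on par with the dense GQA baseline. This is not a tool you can pick up tomorrow, but it points to where things are heading: an AI assistant that can read your entire chat history, a whole project codebase, or a full film script in one pass, without making things up because it "forgot" the earlier parts.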
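The "score blocks, keep the top few, attend exactly over those" idea can be sketched in a few lines of NumPy. This is a minimal illustration, not MiniMax's kernel: the function name `block_sparse_attention`, the mean-of-keys block summary used for scoring, and the single-query, single-group setup are all assumptions made for the example. In MSA the index branch is a separate lightweight component and each GQA group picks its own Top-k blocks.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def block_sparse_attention(q, k, v, block_size, top_k):
    """Attention for one query vector over a cache of n keys/values.

    q: (d,)   k, v: (n, d), with n divisible by block_size.
    Hypothetical helper for illustration only.
    """
    n, d = k.shape
    n_blocks = n // block_size
    k_blocks = k.reshape(n_blocks, block_size, d)
    v_blocks = v.reshape(n_blocks, block_size, d)

    # Index step (assumption): summarize each block by the mean of its keys
    # and score it against the query. One dot product per block.
    block_scores = k_blocks.mean(axis=1) @ q          # (n_blocks,)

    # Keep the top_k highest-scoring blocks.
    chosen = np.argpartition(block_scores, -top_k)[-top_k:]
    chosen.sort()

    # Main step: exact softmax attention, restricted to the chosen blocks.
    k_sel = k_blocks[chosen].reshape(-1, d)           # (top_k * block_size, d)
    v_sel = v_blocks[chosen].reshape(-1, d)
    weights = softmax(k_sel @ q / np.sqrt(d))
    return weights @ v_sel, chosen


rng = np.random.default_rng(0)
n, d = 64, 8
q = rng.normal(size=d)
k = rng.normal(size=(n, d))
v = rng.normal(size=(n, d))

# Attend over 2 of the 8 blocks instead of all 64 positions.
out, chosen = block_sparse_attention(q, k, v, block_size=8, top_k=2)
print(out.shape, chosen)
```

With `top_k` equal to the number of blocks the function reduces to ordinary dense attention, which is a handy sanity check. For a GQA model, the same routine would run once per query group, each with its own `chosen` set.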

📄 Original abstract (English)

Ultra-long-context capability is becoming indispensable for frontier LLMs: agentic workflows, repository-scale code reasoning, and persistent memory all require the model to jointly attend over hundreds of thousands to millions of tokens, yet the quadratic cost of softmax attention makes this untenable at deployment scale. We introduce MiniMax Sparse Attention (MSA), a blockwise sparse attention built upon Grouped Query Attention (GQA). A lightweight Index Branch scores key-value blocks and independently selects a Top-k subset for each GQA group, enabling group-specific sparse retrieval while maintaining efficient block-level execution; the Main Branch then performs exact block-sparse attention over only the selected blocks. Designed around a principle of simplicity and scalability, MSA is deliberately streamlined, making it straightforward to deploy efficiently across a broad range of GPUs. To translate sparsity into practical speedups, we co-design MSA with a GPU execution path that uses exp-free Top-k selection and KV-outer sparse attention to improve tensor-core utilization under block-granular access. On a 109B-parameter model with native multimodal training, MSA performs on par with GQA while reducing per-token attention compute by 28.4x at 1M context. Paired with our co-designed kernel, MSA achieves 14.2x prefill and 7.6x decoding wall-clock speedups on H800. Our inference kernel is available at: https://github.com/MiniMax-AI/MSA. A production-grade natively multimodal model powered by MSA has been publicly released at: https://huggingface.co/MiniMaxAI/MiniMax-M3.
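The abstract does not define "exp-free Top-k selection". One plausible reading, offered here as an assumption rather than the authors' stated method, is that block selection can skip the softmax exponential entirely: softmax is monotonic, so ranking raw scores picks the same blocks as ranking normalized probabilities. The helper `top_k_indices` and the random scores below are hypothetical.

```python
import numpy as np


def top_k_indices(scores, k):
    """Return the indices of the k largest entries as a set."""
    return set(np.argsort(scores)[-k:].tolist())


rng = np.random.default_rng(0)
block_scores = rng.normal(size=128)      # hypothetical raw index scores

# Softmax over the same scores (the step an exp-free selection would skip).
probs = np.exp(block_scores - block_scores.max())
probs /= probs.sum()

# Ranking by raw score and ranking by probability select the same blocks.
same = top_k_indices(block_scores, 16) == top_k_indices(probs, 16)
print(same)
```

If this reading is right, the saving is that the selection pass needs only comparisons on the raw scores, with no exponentials or normalization.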

Original paper on arXiv
