Replacing the vocabulary table with hash signatures: LLMs that spend fewer parameters yet perform better
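Large language models usually rely on a huge embedding matrix to represent every token, so the parameter count grows with the vocabulary. This paper takes the opposite approach: each token becomes a short hash signature (something like a fingerprint), a sequence of discrete hash IDs produced by multiple independent hash functions, so that the signature as a whole is unique even if individual IDs are shared. A Hash Encoder compresses the signature into a single vector for the Transformer decoder to process. At decoding time, a Hash Decoder generates the hash signature of the next token, which is then mapped back to text. At the 100M, 1B, and 3B parameter scales, the authors report that this model, MultiHashFormer, consistently outperforms standard Transformer LMs on multiple benchmarks while constraining the parameter footprint. More surprising, multilingual vocabulary expansion needs no extra parameters: according to the abstract, the footprint stays constant and the model itself is not modified. This is not a technique you can use tomorrow, but it points to a trend: future large models may no longer need to memorize a vocabulary table by rote and may instead rely on smarter encodings to compress knowledge.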
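To make the "hash signature" idea concrete, here is a minimal Python sketch. It is not the paper's implementation: the choice of seeded BLAKE2b as the hash family, the constants `NUM_HASHES` and `NUM_BUCKETS`, and the helper names `hash_signature` and `build_signature_table` are all assumptions made for illustration. The abstract does not say how uniqueness of signatures is guaranteed, so this sketch simply checks for collisions when building the reverse table.

```python
import hashlib

NUM_HASHES = 4      # assumed number of independent hash functions
NUM_BUCKETS = 1024  # assumed range of each hash ID


def hash_signature(token: str, num_hashes: int = NUM_HASHES,
                   num_buckets: int = NUM_BUCKETS) -> tuple[int, ...]:
    """Return a short tuple of hash IDs, one per seeded hash function."""
    ids = []
    for seed in range(num_hashes):
        # Salting with the seed yields a different hash function per position.
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8,
                                 salt=seed.to_bytes(8, "little")).digest()
        ids.append(int.from_bytes(digest, "little") % num_buckets)
    return tuple(ids)


def build_signature_table(vocab: list[str]) -> dict[tuple[int, ...], str]:
    """Map each signature back to its token; fail if two tokens collide."""
    table: dict[tuple[int, ...], str] = {}
    for token in vocab:
        sig = hash_signature(token)
        if sig in table:
            raise ValueError(f"signature collision: {token!r} vs {table[sig]!r}")
        table[sig] = token
    return table


# Hypothetical toy vocabulary, mixing scripts on purpose.
vocab = ["hash", "token", "模型", "transformer", "decoder"]
table = build_signature_table(vocab)

sig = hash_signature("token")   # what a decoder would have to emit
decoded = table[sig]            # mapping the signature back to text
print(sig, "->", decoded)
```

Two tokens may share an individual hash ID; only the full tuple has to be distinct. That is the difference from many-to-one hashing, where a shared bucket makes it impossible to tell which token was meant.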
📄 Original abstract (English)
Language models (LMs) represent tokens using embedding matrices that scale linearly with the vocabulary size. To constrain the parameter footprint, prior work proposes hashing many tokens into a single vector within encoder-only models. While this offers parameter efficiency, many-to-one collisions prevent its use in causal LMs. In this paper, we propose MultiHashFormer, a new framework that allows hash-based autoregression. Each token is represented as a unique hash signature, a short sequence of discrete hash IDs, generated by multiple independent hash functions. A Hash Encoder compresses this signature into a single latent vector for processing by a Transformer decoder. Then, a Hash Decoder generates the hash signature of the next token, which is then mapped back to text. We evaluate our approach at the 100M, 1B and 3B parameter scales, demonstrating that MultiHashFormer consistently outperforms standard Transformer LMs across multiple benchmarks. Furthermore, we show that our model handles multilingual vocabulary expansion with a constant parameter footprint without any modifications.
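The abstract's contrast between embeddings that "scale linearly with the vocabulary size" and a "constant parameter footprint" can be illustrated with some rough arithmetic. The sketch below assumes one small embedding table per hash function, each with `num_buckets` rows; that layout, the function names, and the concrete sizes are assumptions, and the real Hash Encoder and Hash Decoder have parameters this count ignores.

```python
def standard_embedding_params(vocab_size: int, d_model: int) -> int:
    """One row per token: grows linearly with the vocabulary."""
    return vocab_size * d_model


def hashed_embedding_params(num_hashes: int, num_buckets: int, d_model: int) -> int:
    """Assumed layout: one table per hash function, independent of vocab size."""
    return num_hashes * num_buckets * d_model


d_model = 1024  # assumed hidden size
hashed = hashed_embedding_params(num_hashes=4, num_buckets=1024, d_model=d_model)

for vocab_size in (32_000, 128_000, 256_000):
    standard = standard_embedding_params(vocab_size, d_model)
    print(f"vocab={vocab_size:>7}  standard={standard:>11}  hashed={hashed:>9}")

# Number of distinct signatures available under these assumed sizes.
capacity = 1024 ** 4
print("distinct signatures:", capacity)
```

Under these assumed sizes the hashed count does not move as the vocabulary grows, which matches the abstract's claim in spirit; the actual savings depend on design details the abstract does not give.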