📄 Paper Explainer

A Hidden Flaw in LLM Text Embeddings: High-Frequency Words Are Getting in the Way


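You ask a large language model to turn a passage into a vector (a text embedding), and it quietly gives too much weight to high-frequency words (function words such as "的", "是", and "了"), which dilutes the semantics. The researchers trace the problem to the model's unembedding matrix: its job is to map vectors back to tokens, but it also encodes a latent subspace that writes the features of high-frequency tokens into the embedding space. They design a linear transformation called EmbedFilter that filters out this "high-frequency subspace". Embedding quality goes up rather than down, and dimensionality reduction comes as a byproduct (smaller index storage, faster retrieval). This is not a tool you can use tomorrow, but it explains a counterintuitive phenomenon: when an LLM produces embeddings, the words it "knows" best are the ones that carry the least information.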
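To make the idea of "filtering out a subspace" concrete, here is a minimal NumPy sketch on synthetic data. It is not the paper's implementation: the summary above does not say how EmbedFilter identifies the high-frequency subspace, so the construction below (taking the top-k singular directions of the unembedding rows of a hand-picked set of frequent tokens) is an assumption made purely for illustration. The names `complement_basis`, `unembed`, `frequent_ids`, and `k` are all hypothetical.

```python
import numpy as np

def complement_basis(unembed, frequent_ids, k):
    """Orthonormal basis, shape (d, d-k), of what remains after dropping
    the top-k singular directions of the frequent-token rows of `unembed`.
    NOTE: an illustrative assumption, not the construction used in the paper."""
    freq_rows = unembed[frequent_ids]                         # (n_freq, d)
    _, _, vt = np.linalg.svd(freq_rows, full_matrices=True)   # vt: (d, d)
    return vt[k:].T                                           # keep the other d-k directions

# Toy setup (all synthetic): 50-token vocabulary, 16-dim hidden size.
rng = np.random.default_rng(0)
d, vocab = 16, 50
frequent_ids = np.array([0, 1, 2])                    # stand-ins for frequent tokens
unembed = rng.normal(size=(vocab, d))
shared = rng.normal(size=d)
unembed[frequent_ids] += 5.0 * shared                 # frequent rows share a strong direction
embeddings = rng.normal(size=(8, d)) + 3.0 * shared   # embeddings lean toward that direction

basis = complement_basis(unembed, frequent_ids, k=3)  # (16, 13)
reduced = embeddings @ basis                          # filtered and reduced: (8, 13)
filtered = reduced @ basis.T                          # the same vectors back in 16 dims

# Project onto the vocabulary space and compare the frequent-token logits.
freq_logits_before = embeddings @ unembed[frequent_ids].T
freq_logits_after = filtered @ unembed[frequent_ids].T
print(np.abs(freq_logits_before).mean(), np.abs(freq_logits_after).mean())
```

In this toy setup the filter is a single linear map, and because `basis` is orthonormal, the 13-dimensional `reduced` vectors have exactly the same pairwise inner products as the 16-dimensional `filtered` ones. That is one way to read the claim that dimensionality reduction comes for free once the subspace is removed; whether the paper realizes it this way is not stated in the summary.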

📄 Original Abstract

Large language models exhibit impressive zero-shot capabilities across a wide range of downstream tasks. However, they struggle to function as off-the-shelf embedding models, leading to suboptimal performance on massive text embedding benchmarks. In this paper, we identify a potential cause underlying this deficiency. Our motivation stems from an unexpected observation: text embeddings tend to align with frequent but uninformative tokens when projected onto the vocabulary space. We argue that this excessive expression of high-frequency tokens suppresses the model's ability to capture nuanced semantics. To address this, we introduce EmbedFilter, a simple linear transformation designed to refine text embeddings derived from LLMs directly. Specifically, we uncover that the unembedding matrix within LLMs encodes a latent space that is actively writing these frequent tokens into embedding space. By filtering out this subspace, EmbedFilter suppresses the influence of high-frequency tokens, thereby enhancing semantic representations. As a compelling byproduct, this enables an inherent dimensionality reduction, lowering index storage and speeding up retrieval while fully preserving the refined embedding quality. Our experiments across multiple LLM backbones demonstrate that LLMs equipped with EmbedFilter achieve superior zero-shot downstream performance even with significantly reduced embedding dimensions. We hope our findings provide deeper insights into the mechanisms of LLM-based representations and inspire more principled designs to improve text embedding training. Our code is available at https://github.com/CentreChen/EmbFilter.

Original paper on arXiv
