📄 Paper Explainer

Feeding Data to a Large Model Is Like Running an Experiment: Causal Inference Finds the Best Recipe


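When training a large model, how should data from different sources be mixed for the best results? Earlier methods assume the data distribution stays fixed, so any change to the data pool means recomputing the mixture from scratch. This paper treats data mixing as a causal inference problem: it first runs 512 experiments on a small model (Qwen2.5-0.5B), uses a causal model to estimate the "true contribution" of each data mixture to model capability, and then carries that relationship over to a larger model (7B). The resulting mixtures beat RegMix and other baselines, and the approach also works on long chain-of-thought data. It is not something you can use tomorrow, but it points to a trend: AI training is moving from trial and error toward scientific experimentation.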
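To make the "covariates, treatment, effect" framing concrete, here is a minimal sketch of the general idea on synthetic data. It is not the paper's implementation: the single pool statistic `x`, the two-domain mixture share `w`, the made-up ground truth in `simulate_runs`, and the quadratic regression (an S-learner-style model with a covariate-treatment interaction) are all assumptions chosen for illustration. The mixture is sampled at random here, so the sketch shows only how a state-dependent optimum can be read off an estimated effect; it does not demonstrate removal of confounding.

```python
import numpy as np

def features(x, w):
    """Quadratic response surface with a covariate-treatment interaction."""
    x, w = np.broadcast_arrays(np.asarray(x, float), np.asarray(w, float))
    return np.stack([np.ones_like(x), x, x**2, w, x * w, w**2], axis=-1)

def simulate_runs(n_runs=512, seed=0):
    """Synthetic stand-in for small-model runs (all values are made up)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, n_runs)   # hypothetical data-pool statistic (covariate)
    w = rng.uniform(0.0, 1.0, n_runs)   # share of domain A in the mixture (treatment)
    best = 0.2 + 0.6 * x                # assumed truth: the optimum shifts with x
    y = -(w - best) ** 2 + rng.normal(0.0, 0.01, n_runs)  # downstream score
    return x, w, y

def fit(x, w, y):
    """Least-squares fit of the outcome model mu(x, w)."""
    coef, *_ = np.linalg.lstsq(features(x, w), y, rcond=None)
    return coef

def predict(coef, x, w):
    return features(x, w) @ coef

def cate(coef, x, w, w_base=0.5):
    """Estimated effect of mixture w versus a baseline mixture, given pool state x."""
    return predict(coef, x, w) - predict(coef, x, w_base)

def best_mixture(coef, x, n_grid=101):
    """Pick the mixture with the largest estimated effect for pool state x."""
    grid = np.linspace(0.0, 1.0, n_grid)
    return float(grid[np.argmax(cate(coef, x, grid))])

x, w, y = simulate_runs()
coef = fit(x, w, y)
print(best_mixture(coef, 0.3), best_mixture(coef, 0.9))
```

In this toy setup the recommended mixture changes as the pool statistic changes, without refitting the model. That is the property the explainer highlights when it says the learned relationship can be carried over to a different data pool.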

📄 Original Abstract (English)

In Large Language Model (LLM) training, data mixing plays a pivotal role in determining model performance. Recent methods optimize mixture weights via proxy models, but they rely on the assumption of static data distributions. As a result, when the underlying data pool shifts, these methods require costly retraining from scratch. This limitation restricts their ability to scale seamlessly from small settings to larger data pools and model sizes. In this paper, we propose CausalMix to address this limitation by casting data mixture optimization as a causal inference problem. We formulate the statistical features of the data pool as covariates and the domain mixture as the treatment. After fitting a causal model on 512 runs of Qwen2.5-0.5B to estimate the Conditional Average Treatment Effect (CATE), we extrapolate the optimal mixture for an 800K data pool and apply it to train a 7B model. Furthermore, we successfully generalize the framework to long chain-of-thought data on Qwen3-4B-Base. By leveraging causal modeling to isolate confounding biases, CausalMix dynamically infers state-dependent optimal data mixtures. Extensive experiments show that the mixture guided by CausalMix consistently improves performance across multiple downstream tasks, outperforming RegMix and other baselines. In addition, we use the CATE Interpreter to provide visual analysis of the learned mixing strategy. Overall, CausalMix offers a causal and interpretable framework for optimizing LLM data mixtures.

Original paper on arXiv
