📄 Paper Digest

AI learns to read context in both directions, and gets better at code and math

Tags: diffusion language models · bidirectional attention · non-autoregressive · large-model training

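Mainstream large language models generate text left to right, one token at a time, like a typewriter. This paper goes the other way: during training the model sees the context on both sides, and random tokens are masked out and guessed back, like filling in a jigsaw puzzle. The authors use this approach to train an 8-billion-parameter model, iLLaDA, from scratch on 12 trillion tokens. Compared with LLaDA, it improves broadly on general, math, and code benchmarks: for example, the Instruct model gains 14.5 points on MATH and 16.5 points on HumanEval. Against the autoregressive Qwen2.5 7B, it remains competitive on several benchmarks. It is not something you can use tomorrow, but it suggests that "looking both ways" could become a new starting point for the next generation of language models.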
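The "mask some tokens at random, then guess them" idea can be sketched in a few lines. This is only an illustration of the masking step, not the authors' training code: the `MASK` symbol, the `mask_tokens` helper, and the choice to draw the mask ratio uniformly from [0, 1] are all assumptions made for the example, and the model that would fill the masked positions back in is left out entirely.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, chosen for this sketch


def mask_tokens(tokens, mask_ratio, rng):
    """Replace each token with MASK independently with probability mask_ratio.

    Returns the corrupted sequence and a dict {position: original token}
    holding the targets a model would be trained to predict.
    """
    masked = []
    targets = {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_ratio:
            masked.append(MASK)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets


rng = random.Random(0)
tokens = "the model sees context on both sides".split()

# Assumption: one mask ratio is drawn per sequence, uniformly from [0, 1].
ratio = rng.uniform(0.0, 1.0)
masked, targets = mask_tokens(tokens, ratio, rng)

print(masked)   # corrupted input; every unmasked token stays visible
print(targets)  # positions and tokens the model would have to recover
```

Because the visible tokens can sit on either side of a masked position, a model trained this way has to use context from both directions, which is the contrast with left-to-right generation drawn above.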

📄 Original abstract (English)

Modern large language models are predominantly trained with autoregressive factorization and causal attention. We present iLLaDA, an 8B masked diffusion language model trained from scratch with fully bidirectional attention. iLLaDA keeps the masked diffusion objective throughout pre-training and supervised fine-tuning (SFT), scaling pre-training to 12T tokens and fine-tuning on a 25B-token instruction corpus for 12 epochs. We further use variable-length generation for efficiency and introduce confidence-based scoring for multiple-choice evaluation. Compared with LLaDA, iLLaDA improves broadly across general, mathematical, and code benchmarks; for example, iLLaDA-Base improves by 21.6 points on BBH and 14.9 points on ARC-Challenge, while iLLaDA-Instruct improves by 14.5 points on MATH and 16.5 points on HumanEval. Despite its non-autoregressive training, iLLaDA also remains competitive with Qwen2.5 7B on several benchmarks. These results show that fully bidirectional diffusion training from scratch is a competitive path toward strong language models. Model weights and codes: https://github.com/ML-GSAI/LLaDA.

Original paper on arXiv
