📄 Paper digest

AI image generation can finally write and draw at the same time

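Frontier channel · Tags: interleaved text-image generation, multi-agent, image generation, reinforcement learning, reasoning enhancement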

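Today's AI image tools (such as Midjourney and FLUX) can generate a single photorealistic image and can edit images on instruction, but they cannot write text and illustrate it as they go the way a person does: for example, writing a story and pairing each sentence with an image, so that text and images alternate. This paper builds a multi-agent pipeline: a "planner" breaks the text-image sequence into steps and tells the image model what to draw at each step; a "critic" checks whether each generated image has drifted from the instruction and, if so, revises the instruction and has the image regenerated. To train the critic, the authors use reinforcement learning so that it learns step-by-step correction within a single generation run (which may call the image model more than 25 times). The result not only lets ordinary image models produce interleaved text-image content, with quality comparable to the latest closed-source models, but also unexpectedly improves the image model's performance on reasoning-type tasks. It is not a feature you can use tomorrow, but it means AI is one step closer to a "write as you draw" way of creating.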
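The planner/critic loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Step` type, the `run_interleaved` function, the `max_retries` cap, and the stub generator and critic are all hypothetical names and choices made for this example. In the real system the planner and critic are trained models and the generator is an image model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    """One planned step (hypothetical structure for this sketch)."""
    kind: str     # "text" or "image"
    content: str  # text to emit, or the instruction for the image generator


def run_interleaved(
    plan: List[Step],
    generate: Callable[[str], str],
    critique: Callable[[str, str], Optional[str]],
    max_retries: int = 2,
) -> List[str]:
    """Walk the plan; for image steps, let the critic revise and retry.

    `critique(instruction, image)` returns None when the image is accepted,
    or a revised instruction when the image deviates from the plan.
    """
    outputs: List[str] = []
    for step in plan:
        if step.kind == "text":
            outputs.append(step.content)
            continue
        instruction = step.content
        image = generate(instruction)
        for _ in range(max_retries):
            revised = critique(instruction, image)
            if revised is None:  # critic accepts the image
                break
            instruction = revised  # critic refined the instruction
            image = generate(instruction)
        outputs.append(image)
    return outputs


# Stub agents standing in for real models (assumptions for the demo only).
def fake_generate(instruction: str) -> str:
    return "<img:" + instruction + ">"


def fake_critique(instruction: str, image: str) -> Optional[str]:
    # Pretend the plan requires a dawn setting; ask for it if it is missing.
    return None if "at dawn" in instruction else instruction + " at dawn"


demo_plan = [Step("text", "A fox wakes up."), Step("image", "fox in a den")]
print(run_interleaved(demo_plan, fake_generate, fake_critique))
```

Because each rejected image costs another generator call, a story with many image steps can add up to a large number of calls, which is consistent with the "more than 25 calls" figure the paper mentions.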

📄 Original abstract (English)

Recent image generators have demonstrated impressive photorealism and instruction-following capabilities in single-image generation and editing. However, constrained by their architectures, they cannot achieve interleaved generation (text-image sequence), which has crucial applications in visual narratives, guidance, and embodied manipulation. Even the latest open-source Unified Multimodal Models (UMMs) exhibit limited performance in this regard. In this paper, we introduce InterleaveThinker, the first multi-agent pipeline designed to endow any existing image generator with interleaved generation capabilities. Specifically, we employ a planner agent to organize the image-text input sequence, instructing the image generator on the required execution at each step. Subsequently, we introduce a critic agent to evaluate the generator's outputs, identify samples that deviate from the planned instructions, and refine the instructions for regeneration. To implement this pipeline, we construct the Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k to perform a format cold-start. Then we develop Interleave-Critic-RL-13k to reinforce the step-wise instruction correction capability within a generation trajectory using GRPO. Since a single interleaved generation trajectory may involve over 25 generator calls, optimizing the entire trajectory is computationally impractical. Therefore, we propose accuracy reward and step-wise reward, allowing single-step RL to effectively guide the entire generation trajectory. The results show that InterleaveThinker improves performance across various image generators. On interleaved generation benchmarks, it achieves performance comparable to Nano Banana and GPT-5. Surprisingly, it also significantly enhances the base model on reasoning-based benchmarks; for example, on 4-step FLUX.2-klein, we observe substantial gains on WISE and RISE.
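For readers unfamiliar with GRPO, the core idea of scoring each sampled response relative to its own group can be sketched as below. This is a hedged illustration only: the abstract does not say how the accuracy reward and step-wise reward are combined, so the weighted sum in `combined_reward` and its `step_weight` parameter are assumptions made for this example, and the function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def combined_reward(accuracy: float, step_wise: float, step_weight: float = 0.5) -> float:
    """ASSUMPTION: a simple weighted sum; the paper's actual formula is not given here."""
    return accuracy + step_weight * step_wise


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so a sample
    is rewarded for beating its siblings, not for its absolute score.
    Population standard deviation is used here as a simplifying choice.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four critic outputs sampled for the same single step (made-up scores).
group = [
    combined_reward(1.0, 1.0),
    combined_reward(0.0, 0.0),
    combined_reward(1.0, 0.0),
    combined_reward(0.0, 1.0),
]
print(group_relative_advantages(group))
```

Scoring one step at a time in this way avoids rolling out a full trajectory for every sample, which is the computational concern the abstract raises.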

Original paper on arXiv
