📄 Paper Digest

AI image generators keep getting text wrong? This one learns from its own rejects

Text-to-image generation · Multi-agent · Data augmentation · OCR · Self-evolution

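Getting AI to generate images that contain text (posters, menus, and so on) has long been hard: the picture looks good, but the text often comes out blurry, misplaced, or semantically off. The standard data pipeline is "crawl images → filter out the bad ones → train on the good ones", yet the rejected samples carry valuable signal, such as where OCR failed or where the text did not match the image. This paper builds a multi-agent system, DataEvolver, that summarizes failure patterns from those rejects (for example, "this font is illegible on a busy background"), synthesizes targeted samples to fill the under-covered cases, and feeds them back into training. On PixArt-alpha at the 0.75M scale, it improves OCR-F1 over the strongest baseline by 85.3% on TextScenesHQ and 35.3% on LongTextBench. It is not a tool you can use tomorrow, but it points to a direction: training data does not have to be fixed once and for all, and waste can become fertilizer.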
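For readers unfamiliar with the metric, here is a minimal sketch of how an OCR-F1 score could be computed. It assumes a word-level definition (precision and recall over the words an OCR model reads back from the generated image); the paper's exact definition may differ, and the function name `ocr_f1` is hypothetical.

```python
from collections import Counter


def ocr_f1(target: str, recognized: str) -> float:
    """Word-level F1 between the intended text and the OCR read-back.

    Assumption: case-insensitive, whitespace-tokenized, multiset overlap.
    This is one common way to define such a score, not the paper's code.
    """
    target_words = Counter(target.lower().split())
    recognized_words = Counter(recognized.lower().split())

    # Multiset intersection: each word counts at most as often as it
    # appears in both strings.
    overlap = sum((target_words & recognized_words).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(recognized_words.values())
    recall = overlap / sum(target_words.values())
    return 2 * precision * recall / (precision + recall)


# One misread word out of three lowers both precision and recall to 2/3.
perfect = ocr_f1("grand opening sale", "grand opening sale")
one_typo = ocr_f1("grand opening sale", "grand openinq sale")
```

Under this definition a single garbled word in a short slogan already costs a third of the score, which is why legible rendering matters so much on text-heavy benchmarks.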

📄 Original abstract (English)

Text-rich image generation is one of the most challenging settings in image generation, since models must simultaneously produce visually realistic images and render legible, semantically aligned, and layout-consistent text. Existing data pipelines usually follow a static crawl-filter-freeze paradigm. They collect candidate samples, filter them once, and freeze the accepted data for training. However, rejected samples are usually discarded, although they often contain useful failure signals such as OCR errors and semantic mismatches. As a result, later construction rounds may repeat the same failure modes. To address these limitations, we propose DataEvolver, a self-evolving multi-agent framework for text-rich image data construction. DataEvolver treats data construction as feedback-driven construction policy evolution. A Retriever collects candidate samples, a Verifier assigns quality scores and rejection causes, a Critic summarizes round-level feedback into semantic feedback, and a Generator completes under-covered regions through targeted synthesis. The updated feedback memory then guides the next construction round. Experiments on text-rich image generation benchmarks show that DataEvolver produces more useful training data than fixed-dataset baselines under matched data budgets. At the 0.75M scale on PixArt-alpha, DataEvolver improves OCR-F1 over the strongest baseline by 85.3 percent on TextScenesHQ and 35.3 percent on LongTextBench. The improvements are consistent across both evaluated benchmarks and also transfer to Show-o2, indicating that the benefit of DataEvolver is not tied to a single downstream generator. These results suggest that rejected samples can provide actionable feedback for improving text-rich image data construction.
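The four roles in the abstract can be sketched as one construction round. This is a toy illustration under stated assumptions, not DataEvolver's implementation: the `Sample` fields, the exact-match verifier rule, the category-based notion of "under-covered regions", and every function name are hypothetical, and the Retriever is stubbed as a plain list of candidates.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Sample:
    target_text: str  # text the image is supposed to show
    ocr_text: str     # what a (hypothetical) OCR model read back
    category: str     # hypothetical coverage bucket, e.g. "menu"


@dataclass
class FeedbackMemory:
    rejection_counts: Counter = field(default_factory=Counter)
    under_covered: list = field(default_factory=list)


def verify(sample):
    """Verifier: accept or reject, and name the rejection cause."""
    if sample.ocr_text != sample.target_text:
        return False, "ocr_error"
    return True, None


def critique(accepted, rejected, categories, memory):
    """Critic: fold this round's outcomes into the feedback memory."""
    for sample, cause in rejected:
        memory.rejection_counts[(sample.category, cause)] += 1
    covered = Counter(s.category for s in accepted)
    memory.under_covered = [c for c in categories if covered[c] == 0]
    return memory


def generate(memory):
    """Generator: synthesize one clean sample per under-covered category."""
    return [
        Sample(f"synthetic {c} text", f"synthetic {c} text", c)
        for c in memory.under_covered
    ]


def run_round(candidates, categories, memory):
    """One round: verify candidates, update memory, fill the gaps."""
    accepted, rejected = [], []
    for sample in candidates:  # candidates stand in for the Retriever
        ok, cause = verify(sample)
        if ok:
            accepted.append(sample)
        else:
            rejected.append((sample, cause))
    memory = critique(accepted, rejected, categories, memory)
    return accepted + generate(memory), memory


candidates = [
    Sample("OPEN 24H", "OPEN 24H", "poster"),
    Sample("Soup $5", "Sou9 S5", "menu"),  # OCR mismatch, gets rejected
]
data, memory = run_round(candidates, ["poster", "menu"], FeedbackMemory())
```

The point of the sketch is that the rejected menu sample is not simply dropped: its rejection cause is recorded, the "menu" bucket is flagged as under-covered, and the next samples are synthesized for exactly that gap. Passing the same `memory` into a later `run_round` call is what would carry the feedback across rounds.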

Original paper on arXiv
