📄 Paper Digest

AI Code Generation Now Has to "Read the Picture"

Trends channel ▲ 21 · Multimodal · Code Generation · Visual Understanding · GUI · Scientific Visualization

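You ask an AI to build a web page and it does, but the layout is off and the colors are wrong, because you had no way to tell it "use the button style from the image on the left." This survey groups problems of this kind under the name "Multimodal Code Intelligence": the AI must understand not only textual requirements but also screenshots, charts, videos, and hand-drawn sketches, and then generate code that runs correctly. It divides the tasks into four categories: building interfaces (writing front-end code from a screenshot), producing scientific figures (writing visualization code from a chart), producing vector graphics (writing SVG from a sketch), and the more frontier "agent" tasks (the AI operates software itself, observes the feedback, and revises its code). The core challenge is verification: how do you know the code the AI generated is correct? Running without errors is not enough; the layout also has to be right, and the interactive behavior has to match expectations. This paper is not a tool. It helps explain why today's AI coding is "sometimes smart, sometimes blind": it is missing the ability to "see."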
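To make the verification point concrete, here is a minimal sketch of combining two signals: whether the generated code ran, and whether the rendered layout matches a reference. It is an illustration only, not the survey's method. The `Box` type, the element names, the IoU-based `layout_score`, and the `0.8` threshold are all assumptions introduced for this example, and it presumes the element bounding boxes have already been extracted from the reference image and from the rendered page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of one rendered element, in pixels (hypothetical type)."""
    x: float
    y: float
    w: float
    h: float


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes (0.0 = disjoint, 1.0 = identical)."""
    overlap_w = max(0.0, min(a.x + a.w, b.x + b.w) - max(a.x, b.x))
    overlap_h = max(0.0, min(a.y + a.h, b.y + b.h) - max(a.y, b.y))
    inter = overlap_w * overlap_h
    union = a.w * a.h + b.w * b.h - inter
    return inter / union if union > 0 else 0.0


def layout_score(reference: dict[str, Box], rendered: dict[str, Box]) -> float:
    """Mean IoU over the reference elements; an element missing from the render scores 0."""
    if not reference:
        return 1.0
    total = sum(
        iou(box, rendered[name]) if name in rendered else 0.0
        for name, box in reference.items()
    )
    return total / len(reference)


def validate(ran_ok: bool, reference: dict[str, Box], rendered: dict[str, Box],
             threshold: float = 0.8) -> dict:
    """Combine the execution signal with the layout signal (threshold is an assumed value)."""
    score = layout_score(reference, rendered) if ran_ok else 0.0
    return {"ran_ok": ran_ok, "layout_score": score, "passed": ran_ok and score >= threshold}


# Usage: the code runs, but the button landed far from where the screenshot shows it.
reference = {"button": Box(0, 0, 100, 40)}
shifted = {"button": Box(200, 0, 100, 40)}
result = validate(True, reference, shifted)  # ran_ok is True, yet passed is False
```

The example shows why "it runs" alone is a weak signal: the shifted render executes fine but fails the layout check. Interaction behavior, which the digest also mentions, would need a further signal that this sketch does not cover.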

📄 Original Abstract (English)

While Large Language Models (LLMs) have substantially advanced text-to-code synthesis, many real programming tasks specify intent through visual artifacts such as screenshots, charts, vector drawings, videos, and interactive states. These tasks require models to connect visual perception to executable programs, because correctness depends not only on syntax but also on layout, data semantics, interaction behavior, and domain-specific constraints that apply after execution. This survey examines Multimodal Code Intelligence, covering systems that generate, edit, refine, or reason with code under visually grounded inputs and outputs. We first formulate the field by the role that code plays in each task, distinguishing code as a rendered artifact, an editable symbolic structure, a scientific representation, an intermediate reasoning trace, or an executable policy or tool interface. We then organize benchmarks and methods into four domains: Graphical User Interface, Scientific Visualization, Structured Graphics, and Frontier Tasks and Frameworks. This taxonomy connects mature artifact-generation problems to emerging agentic and unified settings and allows us to compare how different tasks treat evidence of correctness. Looking ahead, we argue that future research may benefit from four verification-centered directions. Multi-signal validation can combine complementary evidence of correctness, multi-state verification can test behavior across execution trajectories, cross-task transfer testing can probe reusable visual-code skills, and verifiable agent traces can reveal whether agent actions are grounded in visual evidence. Together, these directions may move this field from single-output imitation toward evidence-grounded executable systems. An ongoing project and resources are available on GitHub: https://github.com/xjywhu/Awesome-Multimodal-LLM-for-Code

Original paper on arXiv
