📄 Paper Digest

AI Writes Code, but This Time It Looks at Your Picture First

Channel: Trends · Tags: Multimodal, Code Generation, Visual Input, GUI, Scientific Visualization

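Give an AI a screenshot, a chart, or a video clip, and it produces the matching code. That is not science fiction; it is the field of "Multimodal Code Intelligence" that this survey maps out. Until now, AI coding tools mostly accepted text instructions, yet in real work a programmer points at the screen and says "move this button 10 pixels to the right," or a designer hands over a UI mockup. The paper groups tasks into four domains: GUI (for example, generating web page code from a screenshot), Scientific Visualization (generating plotting code from a chart), Structured Graphics (generating SVG code from a vector drawing), and more frontier agentic tasks (for example, having an AI watch game footage and then write a control script). It is not a tool you can use tomorrow, but if you care about how AI can truly understand needs that people express visually, this survey draws the technical map.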

📄 Original Abstract (English)

While Large Language Models (LLMs) have substantially advanced text-to-code synthesis, many real programming tasks specify intent through visual artifacts such as screenshots, charts, vector drawings, videos, and interactive states. These tasks require models to connect visual perception to executable programs, because correctness depends not only on syntax but also on layout, data semantics, interaction behavior, and domain-specific constraints that apply after execution. This survey examines Multimodal Code Intelligence, covering systems that generate, edit, refine, or reason with code under visually grounded inputs and outputs. We first formulate the field by the role that code plays in each task, distinguishing code as a rendered artifact, an editable symbolic structure, a scientific representation, an intermediate reasoning trace, or an executable policy or tool interface. We then organize benchmarks and methods into four domains: Graphical User Interface, Scientific Visualization, Structured Graphics, and Frontier Tasks and Frameworks. This taxonomy connects mature artifact-generation problems to emerging agentic and unified settings and allows us to compare how different tasks treat evidence of correctness. Looking ahead, we argue that future research may benefit from four verification-centered directions. Multi-signal validation can combine complementary evidence of correctness, multi-state verification can test behavior across execution trajectories, cross-task transfer testing can probe reusable visual-code skills, and verifiable agent traces can reveal whether agent actions are grounded in visual evidence. Together, these directions may move this field from single-output imitation toward evidence-grounded executable systems. An ongoing project and resources are available on GitHub: https://github.com/xjywhu/Awesome-Multimodal-LLM-for-Code.

Original paper on arXiv
