📄 Paper Digest

Can AI Make Games? Even the Strongest Model Scores Below 50%

Tags: AI game generation · coding agents · Godot engine · benchmark · end-to-end game development

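Getting an AI to build a playable game from start to finish is much harder than getting it to write code. Researchers built a new benchmark, GameCraft-Bench, that asks AI agents to build 140 games in the Godot engine from text descriptions (from Snake to platformers). The strongest agent scores only 41.46%, and most score below 40%. Agents can implement parts of the gameplay but fall short of a complete game: they lack sufficient content, functional visual feedback, and a coherent experience. This is not something you can put to use tomorrow, but it is a new yardstick for AI coding ability: being able to write code is not the same as being able to ship something playable.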

📄 Original Abstract (English)

Game generation is an emerging application of coding agents, requiring models to transform natural-language specifications into playable interactive systems. Unlike traditional coding tasks, game generation takes place within a game engine, where scripts, scenes, assets, rendering, and runtime interactions must jointly produce coherent gameplay. We formalize end-to-end game generation as the problem of producing a complete game artifact that realizes a specification through observable player-game interaction in a target environment. We argue that evaluating this setting requires three desiderata: Engine Grounding, Artifact Completeness, and Interactive Verification. We propose an interaction-grounded evaluation framework that assesses executable gameplay through replayed demonstrations and rubric-guided multimodal judging. We instantiate this framework as GameCraft-Bench, a benchmark comprising 140 Godot tasks across 15 game families. Evaluations of frontier coding agents show that end-to-end game generation remains highly challenging: the strongest agent achieves only 41.46%, and most agents score below 40%. Further analysis reveals that while agents often implement recognizable mechanics, they struggle to deliver complete games with sufficient content, functional visual feedback, and coherent presentation. See https://tongxuluo.github.io/gamecraft-bench-website for demos, code, and data.

Original paper on arXiv
