A phone AI that learns to operate apps by watching the screen, with no human teaching
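Teaching an AI to operate mobile apps usually takes a large amount of human annotation: someone has to tell it which button to tap and which page to swipe. But apps are numerous and update frequently, so annotation can never keep up. This paper presents MobileForge, a system that lets the AI watch the screen, learn by trial and error, and improve on its own. It first generates tasks automatically by interacting with real mobile apps (for example, "order takeout"), then executes them and records the screenshot and outcome of every step, and finally optimizes its policy with a hierarchical feedback mechanism that ranges from whether the whole task succeeded down to step-level feedback and corrective hints. Using only this automatically generated data, Qwen3-VL-8B adapted with MobileForge reaches 67.2% Pass@3 on AndroidWorld, close to GUI-Owl-1.5-8B, a GUI-specialized base model trained on closed data (69.0%), and the MobileForge-adapted ForgeOwl-8B reaches 77.6%. It is not a tool you can use tomorrow, but it suggests that future phone assistants may no longer need developers to adapt them to each app by hand; a few rounds of looking at an app could be enough to learn it.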
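The AndroidWorld scores quoted here are reported in the abstract as Pass@3. The abstract does not spell out the metric, so the sketch below assumes the common reading: a task counts as solved if at least one of its first k attempts succeeds. The function name `pass_at_k`, the task names, and the outcomes are all made up for illustration and are not taken from the paper.

```python
from typing import Dict, List


def pass_at_k(results: Dict[str, List[bool]], k: int = 3) -> float:
    """Fraction of tasks with at least one success among the first k attempts.

    Assumption: this is the usual "any of k attempts" reading of Pass@k,
    not a definition confirmed by the paper.
    """
    if not results:
        raise ValueError("results must contain at least one task")
    solved = sum(any(attempts[:k]) for attempts in results.values())
    return solved / len(results)


# Hypothetical per-task outcomes over three independent attempts.
runs = {
    "add_contact": [False, True, False],
    "set_alarm": [False, False, False],
    "send_sms": [True, True, True],
    "rename_file": [False, False, True],
}

score_at_3 = pass_at_k(runs, k=3)  # 3 of 4 tasks solved within 3 attempts
score_at_1 = pass_at_k(runs, k=1)  # only "send_sms" succeeds on the first try
```

Under this reading, Pass@3 is always at least as high as single-attempt success, which is worth keeping in mind when comparing the 67.2% and 77.6% figures with numbers reported under other protocols.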
📄 Original abstract (English)
MLLM-based mobile GUI agents have made substantial progress in UI understanding and action execution, but adapting them to real target apps remains costly because mobile apps are numerous, frequently updated, and hard to cover with human-written tasks, demonstrations, or reward labels. Existing annotation-free GUI learning reduces manual supervision, yet lacks a unified substrate connecting target-app exploration, curriculum mining, rollout execution, and feedback, while policy optimization often relies on isolated rollouts and coarse rewards that are hard to convert into reliable improvement signals. We present MobileForge, an annotation-free adaptation system for mobile GUI agents. MobileForge consists of MobileGym, which grounds task generation and rollout evaluation in real mobile app interaction, and Hierarchical Feedback-Guided Policy Optimization (HiFPO), which turns trajectory outcomes, step-level process feedback, and corrective hints into hint-contextualized step-level GRPO updates. Using only automatically generated annotation-free adaptation data, MobileForge adapts Qwen3-VL-8B to 67.2% Pass@3 on AndroidWorld, close to the closed-data GUI-specialized GUI-Owl-1.5-8B base model at 69.0%. The MobileForge-adapted ForgeOwl-8B further reaches 77.6% Pass@3 on AndroidWorld and 41.0% success on the out-of-domain MobileWorld GUI-only split, establishing the strongest open-data mobile GUI agent in our evaluation. Code, data, and trained models will be released at https://mobile-forge.github.io/.
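To make "step-level GRPO updates" a little more concrete, here is a minimal sketch of the group-relative advantage that GRPO-style methods are built on: several candidate actions are sampled for the same step, each is given a reward, and each reward is normalized against the group's mean and standard deviation. How HiFPO actually forms groups, injects hints, and scores steps is not described in the abstract, so the function name `group_relative_advantages` and the reward values are purely illustrative assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against its own sampling group.

    Assumption: the group holds candidate actions sampled for one step
    (same screen, same hint). This is the generic GRPO normalization,
    not the paper's exact formulation.
    """
    if not rewards:
        raise ValueError("rewards must not be empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    # eps keeps the division finite when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical step-level rewards for four sampled actions on one screen.
step_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(step_rewards)

# A group where every sample scores the same carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Actions that beat their group's average get a positive advantage and are reinforced, while below-average ones are pushed down. When all samples in a group receive the same reward, every advantage is zero, which is one way to see why the abstract describes coarse rewards as hard to convert into reliable improvement signals.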