AI Pulse
📄 Paper Explainer

Teaching AI to Tap the Screen: Not Every Teacher Deserves Trust

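Getting an AI to read a phone screen and tap the right button is harder than it sounds: buttons are small, screenshots are high-resolution, and a coordinate that is slightly off lands on the wrong element. Existing methods let the model teach itself (on-policy self-distillation, OPSD), but there is a pitfall. The teacher is evaluated on prefixes the student generated, so once those partial coordinates have drifted away from the target, the teacher's signal can mislead the student. This paper proposes a quality-aware gating mechanism. First, it checks whether the teacher's current coordinate-token prediction can still be completed into the ground-truth box; if not, that signal is down-weighted. Second, it uses the teacher's confidence to calibrate the strength of the remaining signal. Neither mechanism improves overall performance on its own, but combined they improve it consistently. Across six GUI grounding benchmarks, the method consistently improves the base model and outperforms strong baselines. It is not a tool you can use tomorrow, but it exposes a key trap in AI self-training: blindly trusting the teacher can be worse than not trusting it.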

📄 Original Abstract (English)

Graphical user interface (GUI) grounding requires vision-language models (VLMs) to identify small target elements in high-resolution screenshots and predict precise screen coordinates. On-policy self-distillation (OPSD) is a promising post-training approach for this coordinate-sensitive task, since it provides dense token-level teacher signals beyond hard coordinate labels. However, naive OPSD is not well suited to GUI grounding: because OPSD evaluates the teacher on student-generated prefixes, the quality of coordinate-token teacher signals can degrade when the prefix has already deviated from the target coordinate, leading to unreliable teacher signals. To mitigate this, we propose quality-aware self-distillation for VLM-based GUI grounding, which improves coordinate-token teacher-signal quality through soft correctness-aware gating and teacher-probability scaling. The soft correctness-aware gate checks whether the teacher's current coordinate-token prediction can still be completed into the ground-truth box under the student-generated prefix. If not, the corresponding teacher signal is down-weighted. Teacher-probability scaling then uses the teacher's confidence as a lightweight factor to further calibrate the strength of the gated supervision. A key empirical finding is that neither component alone improves overall performance, whereas combining them consistently improves performance. This suggests that the two mechanisms play complementary roles: correctness-aware gating suppresses unreliable coordinate-token supervision, while teacher-probability scaling calibrates the strength of the remaining signals. Experiments across six GUI grounding benchmarks show that our method consistently improves the base model and outperforms strong baselines.
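
To make the two mechanisms concrete, here is a minimal Python sketch of how a per-token supervision weight might be computed. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the exact formulas, so the fixed-width digit encoding of coordinates, the `soft_floor` value, the product `gate * confidence`, and the names `can_complete` and `token_weight` are all hypothetical.

```python
def can_complete(prefix: str, lo: int, hi: int, width: int = 4) -> bool:
    """Return True if some `width`-digit completion of `prefix` lies in [lo, hi].

    Assumption: a coordinate is emitted as a fixed-width string of digit
    tokens, e.g. "0435" for x = 435. [lo, hi] is one side of the target box.
    """
    remaining = width - len(prefix)
    if remaining < 0:
        return False
    scale = 10 ** remaining
    smallest = int(prefix) * scale if prefix else 0  # pad with 0s
    largest = smallest + scale - 1                   # pad with 9s
    return smallest <= hi and largest >= lo


def token_weight(student_prefix: str, teacher_probs: dict, lo: int, hi: int,
                 width: int = 4, soft_floor: float = 0.1) -> float:
    """Weight for the teacher signal at one coordinate-token position.

    teacher_probs maps each candidate digit token to the teacher's probability.
    Assumption: the soft gate is 1.0 when the teacher's top token keeps the
    coordinate reachable and `soft_floor` otherwise; the gated value is then
    scaled by the teacher's confidence (its top probability).
    """
    top_digit = max(teacher_probs, key=teacher_probs.get)
    confidence = teacher_probs[top_digit]
    reachable = can_complete(student_prefix + top_digit, lo, hi, width)
    gate = 1.0 if reachable else soft_floor
    return gate * confidence


# Target x-range of the box: [412, 460], written with 4 digits.
# The student prefix "04" is still on track, so the teacher's "3" is kept.
on_track = token_weight("04", {"3": 0.8, "1": 0.2}, lo=412, hi=460)
# The student prefix "07" has already left the box, so even a confident
# teacher token is down-weighted by the soft gate.
off_track = token_weight("07", {"1": 0.9, "3": 0.1}, lo=412, hi=460)
```

In this sketch the off-track position still contributes a small weight rather than zero, which is one way to read "soft" gating. How the paper actually combines the gate and the confidence factor in its distillation loss is not specified in the abstract.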

Original paper on arXiv
