📄 Paper Digest

Teaching AI to Click the Screen: Not Every Teacher Deserves Trust

GUI grounding · self-distillation · vision-language models · coordinate prediction · gating mechanism

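Getting an AI to read a phone screen and tap the right button is harder than it sounds: the buttons are small, the screenshots are high-resolution, and a coordinate that is slightly off lands on the wrong element. Existing methods let the model teach itself (self-distillation), but there is a catch. When the student's partially generated output has already drifted away from the target, the teacher's signal can pull the student further off course. This paper proposes a gating mechanism: first check whether the coordinate token the teacher currently predicts can still be completed into the correct answer (the ground-truth box), and down-weight the teacher signal if it cannot; then use the teacher's confidence to fine-tune the strength of the signal that remains. Neither mechanism improves overall performance on its own, but together they improve it consistently. Across six GUI grounding benchmarks, the method consistently improves on the base model and outperforms strong baselines. It is not a tool you can use tomorrow, but it exposes a key pitfall in AI self-training: blindly trusting the teacher can be worse than not trusting it.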
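To make the "can this still be corrected?" check concrete, here is a minimal sketch for a single coordinate axis. It is not the paper's implementation. It assumes coordinates are written as decimal integers without leading zeros, one digit per token, and that the ground-truth box spans `lo..hi` on that axis. The names `completable` and `soft_gate`, the `max_digits` limit, and the `floor` value are hypothetical and chosen only for illustration.

```python
def completable(prefix: str, lo: int, hi: int, max_digits: int = 4) -> bool:
    """True if some integer in [lo, hi] has a decimal form starting with `prefix`.

    Assumption: coordinates are non-negative integers of at most `max_digits`
    digits, written without leading zeros.
    """
    if not (prefix.isascii() and prefix.isdigit()) or len(prefix) > max_digits:
        return False
    if prefix[0] == "0":
        # Without leading zeros, "0" can only be the number 0 itself.
        return prefix == "0" and lo <= 0 <= hi
    p = int(prefix)
    for extra in range(max_digits - len(prefix) + 1):
        scale = 10 ** extra
        # All numbers reachable by appending `extra` more digits to the prefix.
        low, high = p * scale, p * scale + scale - 1
        if low <= hi and high >= lo:
            return True
    return False


def soft_gate(student_prefix: str, teacher_digit: str,
              lo: int, hi: int, floor: float = 0.1) -> float:
    """Weight for the teacher signal at one coordinate-digit position.

    Returns 1.0 when the teacher's predicted digit keeps the coordinate
    completable into the target range, and a small `floor` otherwise.
    `floor` is an illustrative value, not taken from the paper.
    """
    return 1.0 if completable(student_prefix + teacher_digit, lo, hi) else floor
```

For a target x-range of 412 to 468 and a student prefix of `"4"`, `soft_gate("4", "1", 412, 468)` keeps full weight because `41` can still become `412`, while `soft_gate("4", "7", 412, 468)` falls back to the floor because no number starting with `47` lies in the range. If the student has already written `"5"`, every teacher digit is down-weighted, which is the failure case the digest describes.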

📄 Original Abstract

Graphical user interface (GUI) grounding requires vision-language models (VLMs) to identify small target elements in high-resolution screenshots and predict precise screen coordinates. On-policy self-distillation (OPSD) is a promising post-training approach for this coordinate-sensitive task, since it provides dense token-level teacher signals beyond hard coordinate labels. However, naive OPSD is not well suited to GUI grounding: because OPSD evaluates the teacher on student-generated prefixes, the quality of coordinate-token teacher signals can degrade when the prefix has already deviated from the target coordinate, leading to an unreliable teacher signal. To mitigate this, we propose quality-aware self-distillation for VLM-based GUI grounding, which improves coordinate-token teacher-signal quality through soft correctness-aware gating and teacher-probability scaling. The soft correctness-aware gate checks whether the teacher's current coordinate-token prediction can still be completed into the ground-truth box under the student-generated prefix. If not, the corresponding teacher signal is down-weighted. Teacher-probability scaling then uses the teacher's confidence as a lightweight factor to further calibrate the strength of the gated supervision. A key empirical finding is that neither component alone improves overall performance, whereas combining them consistently improves performance. This suggests that the two mechanisms play complementary roles: correctness-aware gating suppresses unreliable coordinate-token supervision, while teacher-probability scaling calibrates the strength of the remaining signals. Experiments across six GUI grounding benchmarks show that our method consistently improves the base model and outperforms strong baselines.
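The abstract does not give the loss formula, so the following is only a rough sketch of how a gate and a teacher-confidence factor could combine into one per-token weight. It assumes the distillation term is a KL divergence from teacher to student, that "teacher confidence" means the teacher's highest token probability, and that the two factors are multiplied. All three are assumptions, and the function names are hypothetical.

```python
import math


def kl_divergence(teacher, student, eps=1e-12):
    """KL(teacher || student) for two probability lists over the same vocabulary."""
    return sum(
        p * math.log((p + eps) / (q + eps))
        for p, q in zip(teacher, student)
        if p > 0
    )


def weighted_distill_loss(teacher_dists, student_dists, gates):
    """Mean per-token distillation loss with quality-aware weights.

    teacher_dists / student_dists: one probability list per coordinate token.
    gates: one gate value per token (for example 1.0 when the teacher's
    prediction is still completable into the target box, a small value otherwise).
    The weight gate * max(teacher) is an illustrative choice, not the paper's formula.
    """
    total = 0.0
    for teacher, student, gate in zip(teacher_dists, student_dists, gates):
        confidence = max(teacher)  # assumed proxy for teacher confidence
        total += gate * confidence * kl_divergence(teacher, student)
    return total / len(gates)


# Usage: the first token disagrees with the teacher, the second matches it.
teacher = [[0.9, 0.1], [0.5, 0.5]]
student = [[0.6, 0.4], [0.5, 0.5]]
trusted = weighted_distill_loss(teacher, student, gates=[1.0, 1.0])
suppressed = weighted_distill_loss(teacher, student, gates=[0.1, 1.0])
```

In this toy case `suppressed` is smaller than `trusted`, because the gate scales down the only token that contributes a nonzero divergence. A gated-out position then pulls the student far less, while a low-confidence teacher distribution is softened by the second factor.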

Original paper on arXiv