📄 Paper Explainer

AI Learns to Do Things by Watching Videos, and Keyframe Extraction Is the Bottleneck

Trends channel · video understanding · keyframe extraction · GUI agents · multimodal large language models · benchmarks

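Today's AI models are already strong at answering questions about videos, but they struggle when asked to watch a video tutorial and then carry out the task themselves (for example, learning to use a piece of software by following a video). The researchers trace the problem to keyframe selection: a model that treats every frame as important ends up overloaded with information. They propose a new algorithm, TASKER, which weighs both task relevance and scene changes so that only the frames that are actually useful get picked. It improves results on both video question answering and GUI operation tasks, beating the best baseline by 2.0% on the EgoSchema full set and 1.8% on NExT-QA. You won't be using it tomorrow, but it points to a trend: as AI moves from "understanding" to "learning to do things", keyframe extraction may be a hurdle that has to be cleared.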
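
The idea of scoring frames on both task relevance and scene change can be illustrated with a minimal sketch. This is not TASKER itself, whose internals the abstract does not describe. The function names, the weight `alpha`, the mean-absolute-difference scene score, and the assumption that a per-frame relevance score is already available (for example from some text-image similarity model) are all hypothetical choices made for illustration.

```python
from typing import List, Sequence


def scene_change_scores(frames: Sequence[Sequence[float]]) -> List[float]:
    """Mean absolute difference between each frame's features and the previous frame's.

    The first frame has no predecessor, so its score is 0.0.
    """
    scores = [0.0]
    for prev, cur in zip(frames, frames[1:]):
        scores.append(sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur))
    return scores


def normalize(values: Sequence[float]) -> List[float]:
    """Min-max scale values to [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def select_keyframes(
    relevance: Sequence[float],
    frames: Sequence[Sequence[float]],
    k: int,
    alpha: float = 0.5,
) -> List[int]:
    """Return the indices of the k highest-scoring frames, in temporal order.

    Illustrative assumption: the combined score is
    alpha * relevance + (1 - alpha) * scene change, both min-max normalized.
    """
    if len(relevance) != len(frames):
        raise ValueError("relevance and frames must have the same length")
    if not 0 < k <= len(frames):
        raise ValueError("k must be between 1 and the number of frames")

    rel = normalize(relevance)
    dyn = normalize(scene_change_scores(frames))
    combined = [alpha * r + (1 - alpha) * d for r, d in zip(rel, dyn)]

    top = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)[:k]
    return sorted(top)


# Toy example: five frames with 2-dimensional features; the scene cuts at index 2.
toy_frames = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
toy_relevance = [0.1, 0.2, 0.3, 0.9, 0.1]
picked = select_keyframes(toy_relevance, toy_frames, k=2)
print(picked)
```

In this toy input, the frame at the scene cut (index 2) and the most task-relevant frame (index 3) are the two that get selected. A real system would replace the toy feature vectors with image embeddings and would likely need a smarter selection rule than a plain top-k.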

📄 Original Abstract (English)

Video understanding is a fundamental capability for multimodal intelligence, and recent Multimodal Large Language Models (MLLMs) have achieved remarkable performance on Video Question Answering (VideoQA) benchmarks. However, existing benchmarks primarily evaluate whether models can perceive shallow visual cues, while rarely examining whether MLLMs can learn deeper knowledge or procedural skills from video tutorials and generalize them to downstream long-horizon agentic tasks. To address this gap, we introduce VG-GUIBench (Video-Guided GUI Benchmark), a new benchmark designed to evaluate whether MLLM-based GUI agents can follow video tutorials to complete corresponding GUI interactive tasks. Furthermore, we observe that the performance of models on both VideoQA and video-guided agentic tasks critically depends on effective keyframe extraction. Based on this observation, we propose TASKER (Task-driven And Scene-aware Keyframe searchER), a keyframe extraction algorithm that jointly considers task relevance and scene dynamics to identify informative frames. Experimental results demonstrate that TASKER achieves significant performance improvements on both VideoQA and video-guided agentic task benchmarks, outperforming the best baseline by 2.0% on the EgoSchema fullset and 1.8% on the NExT-QA dataset, respectively. These results further highlight the potential of generalized keyframe extraction methods for video understanding tasks. Our code and data are available at https://github.com/VG-GUI-TASKER/VG-GUI-TASKER.

Original paper on arXiv
