📄 Paper Digest

AI Assistant Learns to Evolve From Failure, Cutting Costs by 98%

Trusted Channel · Vision AI · Video Understanding · Cost Optimization · Self-Evolution · Edge Computing

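Today's vision AI systems handle video in one of two ways: upload every frame (expensive and slow) or sample just a few frames (which can miss key information). VisualClaw uses a clever "two-stage gate": it first quickly filters out frames in which nothing has changed, then keeps only the few frames most relevant to the current question. Across four video-QA benchmarks, this cuts per-question API cost by about 98% on average compared with uploading every frame, while accuracy improves in most settings. For a 1-hour streaming session, uploads drop from roughly 3,600 to 5-20 calls. More importantly, it automatically updates its own "skill bank" after making mistakes, so the next time it meets a similar problem it can draw on that experience instead of reasoning from scratch. On VisualClawArena, a benchmark of 200 scenarios, this self-evolution raises macro accuracy by about 3% (+2.9% with Codex and +3.2% with Claude Code) over no-evolution baselines. It is not a product you can use tomorrow, but the idea of "learning while in use and getting cheaper over time" is a key step in moving AI from a tool toward a true assistant.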
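The two-stage gate described above can be illustrated with a minimal Python sketch. This is not the paper's implementation: the frame representation (a flat list of pixel intensities), the mean-absolute-difference change test, the `relevance` callback, and the names `cascaded_gate`, `change_threshold`, `top_k`, and `upload_reduction` are all assumptions chosen for illustration.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical representation: a frame is a flat list of pixel intensities.
Frame = Sequence[float]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average absolute per-pixel difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cascaded_gate(
    frames: List[Frame],
    relevance: Callable[[Frame], float],
    change_threshold: float = 0.1,
    top_k: int = 5,
) -> List[int]:
    """Return indices of frames that pass both gates, in temporal order."""
    # Stage 1 (assumed): drop frames that barely differ from the last kept frame.
    changed: List[int] = []
    last_kept: Optional[Frame] = None
    for i, frame in enumerate(frames):
        if last_kept is None or mean_abs_diff(frame, last_kept) > change_threshold:
            changed.append(i)
            last_kept = frame

    # Stage 2 (assumed): keep the top_k survivors ranked by question relevance.
    ranked = sorted(changed, key=lambda i: relevance(frames[i]), reverse=True)
    return sorted(ranked[:top_k])


def upload_reduction(total_frames: int, kept_frames: int) -> float:
    """Fraction of uploads avoided when only kept_frames of total_frames are sent."""
    return 1.0 - kept_frames / total_frames
```

With the session figures quoted above, `upload_reduction(3600, 20)` and `upload_reduction(3600, 5)` both exceed 0.99. Note that this counts uploads only; the 98% figure is an average per-question API cost measured on benchmarks, which is a separate quantity.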

📄 Original Abstract (English)

Vision language models are serving as general-purpose interfaces for complex multimodal tasks. However, deployment still faces three gaps: VLMs typically incur high latency and cost when processing dense video frames and long prompts, the agent scaffold remains static after deployment, and standard video-QA benchmarks do not test whether agents can use visual evidence inside tool-using workspaces. We present VisualClaw, a self-evolving multimodal agent built around two principles. First, hybrid encoding reduces deployment cost by filtering less informative streaming frames with a cascaded gate and compressing the text skill bank through hot/cold top-k injection. Second, skill evolution lets the agent learn from failures: retrieved memories condition an evolver as direct concatenated context or as guided evidence, producing skill-bank updates that help future questions. Across 4 video-QA benchmarks with 2 VLMs, VisualClaw cuts per-question API cost by an average -98% versus full-frame upload and by -25.9% over the offline uniform 8 frame baseline, while boosting accuracy in most settings, e.g., an average +3.85% and a peak +15.80% on EgoSchema with Gemini 3 Flash. To address the gap, we curate VisualClawArena, a 200-scenario multimodal agentic benchmark built through a strict five-stage pipeline; models must use video evidence, documents, dynamic updates, and executable checks inside a workspace. On VisualClawArena, the same framework with computer-use agent backends improves macro accuracy by +2.9% for Codex (GPT-5.5) and +3.2% for Claude Code (Sonnet 4.6) over no-evolution baselines, with a -9.5% cost reduction compared to the uniform-sampled baseline. These properties make VisualClaw a natural fit for edge applications, where the cascade reduces a 1-hour streaming session from ~3,600 API uploads down to only 5-20 calls and the self-evolution makes it a perfect personalized assistant.
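The abstract mentions "hot/cold top-k injection" for the skill bank without giving details. One possible reading, sketched below purely as an assumption, is that a small "hot" tier of frequently used skills is always placed in the prompt, while the remaining "cold" skills are injected only when they rank in the top k for the current question. The `Skill` class, the `uses` counter, the word-overlap score, and `select_skills` are hypothetical names, not the paper's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Skill:
    name: str
    text: str
    uses: int = 0  # hypothetical usage counter that decides hot vs. cold


def overlap(query: str, text: str) -> int:
    """Count distinct lowercase words shared by the query and the skill text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def select_skills(
    bank: List[Skill], query: str, hot_n: int = 1, cold_k: int = 2
) -> List[Skill]:
    """Pick the skills to inject into the prompt for this query."""
    # Hot tier (assumed): the most-used skills are always injected.
    by_use = sorted(bank, key=lambda s: s.uses, reverse=True)
    hot, cold_pool = by_use[:hot_n], by_use[hot_n:]

    # Cold tier (assumed): inject only the top-k skills that match the query.
    scored = [(overlap(query, s.text), s) for s in cold_pool]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return hot + [skill for _, skill in scored[:cold_k]]
```

A real system would likely replace the word-overlap score with embedding similarity; the point of the sketch is only that the prompt carries a bounded number of skills regardless of how large the bank grows.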

Original paper on arXiv
