AI No Longer Waits to Be Asked: It Decides for Itself When to Speak
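Today's AI assistants answer only when asked, one question at a time, but the real world does not wait for a question: a fire breaks out on a security monitor, a product flashes past in a livestream, an expression changes on a video call, and each of these moments is gone in an instant. This paper presents an 8B-parameter vision-language model that keeps "watching" the current scene the way a person would and decides for itself whether to speak: stay silent, respond, or hand the problem off to a more capable background model. It does not need you to press a button or call its name. It raises the alarm when it sees a fire, and it asks whether you are okay when it sees you frown. Across six real-world scenarios, human raters preferred it by a wide margin over the video-call assistants built into Doubao and Gemini. This is not an app you can download tomorrow, but it points to a future in which AI shifts from an "answering machine" to a "presence".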
📄 Original abstract (English)
Many moments in the real world do not wait for a user to ask. A fire starts on a security monitor, an expression flickers across a video call, or a product a viewer wants flashes by in a livestream. Yet today's large models remain mostly turn-based by design: they answer only when addressed, and even video-call apps that appear interactive still operate as question-answer systems, reacting only when polled or prompted. We argue for a different paradigm: a model that is present in the world like a person. It continuously watches what is happening now, decides on its own whether to speak or stay silent, interacts in real time, and delegates to a background model when the problem is hard. To advance interaction models and their adoption across domains, we make two fully open-sourced contributions. First, we release JoyAI-VL-Interaction, an 8B-scale, vision-first VL-interaction model. The model makes the response decision internally, choosing each second to stay silent, respond, or delegate to a background model, and it excels at vision-triggered responsiveness and time awareness. We pair it with a transferable training recipe, from which capabilities we never trained for emerge, such as guiding a shopper through changing app screens or improvising a lecture from a slide deck. Second, we release a complete, deployable system built around that model. The system streams any ongoing video into the model, making it genuinely present in the world. All other components are pluggable, including ASR/TTS modules, memory, visualization UI, and a background brain that can connect to any API or agent. Across six real-world scenarios, human raters prefer JoyAI-VL-Interaction over the in-app video-call assistants of Doubao and Gemini by a wide margin. To our knowledge, this is the first open, vision-driven interaction model released together with its training recipe, data, and complete deployable system.
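The abstract describes a model that chooses each second whether to stay silent, respond, or delegate to a background model. As a rough illustration of that control flow only, here is a minimal Python sketch. Every name in it (`Action`, `Event`, `run_interaction_loop`, `toy_decide`) is hypothetical and not taken from the JoyAI-VL-Interaction release, frames are stood in for by plain strings, and the keyword rule replaces what in the paper is a decision made internally by the model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable, List, Optional


class Action(Enum):
    """The three per-second choices named in the abstract."""
    SILENT = "silent"
    RESPOND = "respond"
    DELEGATE = "delegate"


@dataclass
class Event:
    second: int
    action: Action
    text: Optional[str]  # None when the model stays silent


def run_interaction_loop(
    frames: Iterable[str],
    decide: Callable[[str], Action],
    respond: Callable[[str], str],
    delegate: Callable[[str], str],
) -> List[Event]:
    """Walk the stream one frame per second and record what was done.

    `decide`, `respond`, and `delegate` are placeholders (assumptions):
    in the paper the decision is made inside the model itself.
    """
    events: List[Event] = []
    for second, frame in enumerate(frames):
        action = decide(frame)
        if action is Action.RESPOND:
            text: Optional[str] = respond(frame)
        elif action is Action.DELEGATE:
            text = delegate(frame)
        else:
            text = None
        events.append(Event(second, action, text))
    return events


def toy_decide(frame: str) -> Action:
    """Toy keyword rule standing in for the learned decision."""
    if "fire" in frame:
        return Action.RESPOND
    if "hard" in frame:
        return Action.DELEGATE
    return Action.SILENT


events = run_interaction_loop(
    ["empty room", "fire on monitor", "hard math question"],
    toy_decide,
    respond=lambda f: f"Alert: {f}",
    delegate=lambda f: f"[background] {f}",
)
for e in events:
    print(e.second, e.action.value, e.text)
```

The point of the sketch is that silence is an explicit, first-class outcome of every step rather than the absence of a user prompt, which is what separates this loop from a turn-based question-answer system.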