📄 Paper Digest

AI Can Finally Watch, Listen, and Speak at the Same Time, Like a Real Person

Trending ▲ 32 · Real-time interaction · Multimodal · End-to-end · Low latency · Audio-visual dialogue

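Today's conversational AI typically handles only voice or only text, and video calling is stitched together from separate modules, which makes it slow and stiff. This paper removes every intermediate module (speech recognition, text generation, speech synthesis, facial animation, video generation) and folds them into a single model, so the AI can take in and produce video, audio, and text at the same time. Model-side latency is about 200 ms; adding 350 ms of bidirectional network latency brings total interaction latency to about 550 ms, which roughly matches the rhythm of a real conversation. It is not a product you can use tomorrow, but it points to where next-generation interactive AI is heading: no more stage-by-stage processing, but watching, listening, and responding at once, the way people do.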
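The latency figures are plain addition. A minimal sketch, using the approximate numbers reported in the abstract; the constant and function names are illustrative and do not come from the paper:

```python
# Approximate figures quoted in the abstract; the names are illustrative only.
MODEL_LATENCY_MS = 200     # model-side response latency (approximate)
NETWORK_LATENCY_MS = 350   # bidirectional network latency assumed by the authors


def total_interaction_latency_ms(model_ms: int, network_ms: int) -> int:
    """Total latency a user experiences: model time plus network round trip."""
    return model_ms + network_ms


total = total_interaction_latency_ms(MODEL_LATENCY_MS, NETWORK_LATENCY_MS)
print(f"total interaction latency: about {total} ms")  # about 550 ms
```

On these numbers the network accounts for more of the delay than the model does, so the total depends heavily on the connection.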

📄 Original Abstract (English)

We present Wan-Streamer, a native-streaming, end-to-end interactive foundation model designed from the ground up for real-time, low-latency, full-duplex audio-visual interaction. Wan-Streamer seamlessly models language, audio, and video as both input and output within a single Transformer, where the sequence is represented as interleaved visual, audio, and text input tokens together with visual, audio, and text output tokens, coordinated by block-causal attention for incremental streaming. Unlike cascaded interactive systems that rely on separate VAD, ASR, language, TTS, audio-driven animation, or video-generation modules, Wan-Streamer does not rely on external language, speech, avatar, or video-generation modules: perception, reasoning, generation, response timing, turn management, and cross-modal synchronization are learned jointly within one unified model, reducing pipeline latency and error accumulation. To support natural audio-visual responsiveness, we redesign the entire stack around streamability, including causal encoders, causal decoders, block-causal attention, and low-latency multimodal token scheduling, enabling streaming units as short as 160 ms at 25 fps. Wan-Streamer achieves approximately 200 ms model-side response latency and approximately 550 ms total interaction latency when combined with 350 ms bidirectional network latency, supporting sub-second duplex audio-visual communication. These results position Wan-Streamer as a unified, end-to-end, multimodal interactive foundation model for low-latency streaming interaction.
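Two details in the abstract can be made concrete: how many video frames fit in one 160 ms streaming unit at 25 fps, and what a block-causal attention mask looks like. The sketch below uses the common meaning of block-causal attention (full attention inside a block, causal attention across blocks). The abstract does not give the per-unit token layout, so the block sizes here are hypothetical, and this is not the authors' implementation.

```python
FPS = 25        # frame rate stated in the abstract
UNIT_MS = 160   # shortest streaming unit stated in the abstract


def frames_per_unit(unit_ms: int = UNIT_MS, fps: int = FPS) -> int:
    """Number of video frames covered by one streaming unit."""
    return unit_ms * fps // 1000


def block_causal_mask(block_sizes: list[int]) -> list[list[bool]]:
    """mask[i][j] is True when token i may attend to token j.

    A token sees every token in its own block and in all earlier blocks,
    but nothing from later blocks.
    """
    # Map each token position to the index of the block it belongs to.
    block_of = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(block_of)
    return [[block_of[j] <= block_of[i] for j in range(n)] for i in range(n)]


print(frames_per_unit())  # 4

# Hypothetical layout: three streaming units holding 2, 2, and 1 tokens.
mask = block_causal_mask([2, 2, 1])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xx...
# xx...
# xxxx.
# xxxx.
# xxxxx
```

Because no token attends to a later block, the model can emit output for one unit before the next unit of input arrives, which is what makes incremental streaming possible.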

Original paper on arXiv
