Real-time video editing: edit frame by frame without the background falling apart
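Today's AI video editors are either too slow for real-time interaction, or the background starts to flicker and objects begin to deform after a few edited frames. This paper proposes a new framework built around a three-stage distillation pipeline: editing capability is progressively transferred from a powerful but slow bidirectional foundation model to an efficient unidirectional streaming model that edits causally, one frame at a time, as the video arrives. A mask cache reuses region-related computation across frames, and the method reaches 12.66 FPS (roughly 79 ms per frame) while keeping the picture stable, which is fast enough for interactive uses such as live streaming or AR. It is not a tool you can use tomorrow, but it maps out a technical path for taking real-time video editing from "offline rendering" to "edit as you shoot".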
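To make the caching idea concrete, here is a minimal toy sketch in Python. It is not the paper's implementation: the abstract does not describe how the mask cache works internally, so the class `MaskTileCache`, the function `edit_stream`, the tile-based layout, and the per-pixel `edit_pixel` callback are all hypothetical names and assumptions introduced for illustration. The sketch only shows the general pattern: process frames causally, touch only the tiles the edit mask covers, pass everything else through unchanged, and reuse the mask-derived tile set across frames instead of recomputing it.

```python
from typing import Callable, Iterable, Iterator, List, Optional, Set, Tuple

Frame = List[List[int]]   # toy grayscale frame: rows of pixel values
Mask = List[List[bool]]   # True where the edit should apply


class MaskTileCache:
    """Hypothetical cache of which tiles the edit mask touches.

    Assumption for this sketch: the region-related work worth reusing is the
    mapping from mask to active tiles. The real method may cache something
    quite different (for example intermediate model features).
    """

    def __init__(self, tile: int) -> None:
        self.tile = tile
        self._mask_key: Optional[Tuple[Tuple[bool, ...], ...]] = None
        self._active: Set[Tuple[int, int]] = set()
        self.recomputations = 0  # how many times the tile set was rebuilt

    def active_tiles(self, mask: Mask) -> Set[Tuple[int, int]]:
        # Comparing the whole mask is itself O(H*W); a real system would
        # likely use a cheaper "mask changed" signal. Kept simple here.
        key = tuple(tuple(row) for row in mask)
        if key != self._mask_key:
            self._active = {
                (y // self.tile, x // self.tile)
                for y, row in enumerate(mask)
                for x, on in enumerate(row)
                if on
            }
            self._mask_key = key
            self.recomputations += 1
        return self._active


def edit_stream(
    frames: Iterable[Frame],
    mask: Mask,
    edit_pixel: Callable[[int], int],
    cache: MaskTileCache,
) -> Iterator[Frame]:
    """Edit frames one at a time, using only the current frame (causal)."""
    tile = cache.tile
    for frame in frames:
        height, width = len(frame), len(frame[0])
        out = [row[:] for row in frame]  # unedited pixels pass through as-is
        for ty, tx in cache.active_tiles(mask):
            for y in range(ty * tile, min((ty + 1) * tile, height)):
                for x in range(tx * tile, min((tx + 1) * tile, width)):
                    if mask[y][x]:
                        out[y][x] = edit_pixel(frame[y][x])
        yield out
```

In use, you would create one `MaskTileCache` per editing session and feed `edit_stream` an iterator of incoming frames. As long as the mask stays the same, `cache.recomputations` stays at 1, and pixels outside the mask are copied through untouched, which is the toy analogue of the content-preservation requirement.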
📄 Original abstract (English)
Streaming video editing has made rapid progress, yet practical deployment is still limited by two core issues: maintaining stable backgrounds and non-edited regions over time, and achieving the low latency required for real-time interactive scenarios. Meanwhile, recent streaming video generation methods are mostly developed for synthesis and cannot be directly applied to editing due to the strict preservation requirement and region-specific control. In this work, we present a novel streaming video editing framework that performs causal, frame-by-frame editing with strong content preservation and real-time responsiveness. Our key design is a three-stage distillation pipeline that progressively transfers editing capability from a powerful bidirectional foundation model to an efficient unidirectional streaming editor, enabling stable long-horizon edits without sacrificing visual fidelity. To further support real-time deployment, we introduce an AR-oriented mask cache that reuses region-related computation across frames, substantially reducing redundant processing and accelerating inference. Finally, we establish a dedicated benchmark for streaming video editing. Extensive evaluations demonstrate that our method achieves state-of-the-art visual quality among streaming baselines while drastically boosting inference speed to 12.66 FPS, making it suitable for interactive and augmented reality applications.