📄 Paper Explainer

Video restoration reaches consumer GPUs: real-time 1080p, with 4K within reach

Tags: video restoration, real-time, consumer GPUs, attention mechanisms, generative models

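Video restoration, such as denoising and deblurring, usually needs server-class hardware, but SwiftVR runs it in real time on a consumer GPU. Its key trick is in the attention mechanism: the frame is cut into small windows, each window is gathered into a dense tensor, and attention is computed only within it, which avoids the quadratic cost of global attention at high resolution. Because every call is a standard dense attention operation with no masks or custom sparse kernels, the trained model moves to consumer GPUs without retraining. SwiftVR also uses a lightweight, restoration-aware video autoencoder that decodes quickly chunk by chunk. The results: on a consumer RTX 5090 it restores 1080p video at 26 FPS; on a single H100 it sustains 31 FPS at 1440p and 14 FPS at 4K, a resolution where all compared diffusion-based baselines run out of memory. The authors describe it as the first generative video restoration model to reach real-time 1080p streaming on a consumer-grade GPU. It is not an app you can use tomorrow, but it suggests that picture-quality enhancement for live streams and video calls may eventually run locally instead of in the cloud.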
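To get a feel for why windowing helps, here is a back-of-the-envelope sketch that counts how many query-key pairs a single attention layer has to score with and without windows. The 128x256 token grid and the 16x16 window size are illustrative assumptions, not values taken from the paper, and the function names are hypothetical.

```python
def attention_pairs_global(height: int, width: int) -> int:
    """Query-key pairs when every token attends to every other token."""
    tokens = height * width
    return tokens * tokens


def attention_pairs_windowed(height: int, width: int, window: int) -> int:
    """Query-key pairs when tokens attend only inside non-overlapping windows."""
    if height % window or width % window:
        raise ValueError("this sketch assumes the grid divides evenly into windows")
    num_windows = (height // window) * (width // window)
    tokens_per_window = window * window
    return num_windows * tokens_per_window * tokens_per_window


# Assumed token grid and window size, chosen only for illustration.
GRID_H, GRID_W, WINDOW = 128, 256, 16

global_pairs = attention_pairs_global(GRID_H, GRID_W)
window_pairs = attention_pairs_windowed(GRID_H, GRID_W, WINDOW)

print(f"global:   {global_pairs:,} pairs")
print(f"windowed: {window_pairs:,} pairs")
print(f"reduction: {global_pairs // window_pairs}x")
```

Under these assumptions the saving equals the number of windows, so it grows with resolution: global attention scales with the square of the token count, while windowed attention scales linearly once the window size is fixed.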

📄 Original Abstract

Real-time video restoration (VR) for live streams requires high-resolution outputs under strict per-frame latency constraints. Existing one-step diffusion-based VR models remain difficult to deploy on consumer-grade GPUs due to two main bottlenecks: quadratic spatial attention at high resolutions and the latency-memory overhead of large video autoencoders. We present SwiftVR, a streaming one-step generative VR framework that reduces both bottlenecks under a causal chunk-wise protocol. For attention, mask-free shifted-window self-attention gathers each spatial window into a dense tensor via deterministic indexing, keeping all attention calls on the dense scaled dot-product attention path without masks, cyclic shifts, padding, or hardware-specific sparse kernels. Because SwiftVR uses only standard dense SDPA calls, the trained model transfers to consumer GPUs without retraining or custom kernels. For autoencoding, a lightweight Restoration-aware Autoencoder enables fast chunk-wise decoding while preserving reconstruction quality. On a single H100, SwiftVR sustains 31 FPS at 2560x1440 and 14 FPS at 3840x2160, whereas all compared diffusion-based VR baselines exceed the memory limit at 4K. On a consumer RTX 5090, SwiftVR reaches 26 FPS at 1920x1080. To our knowledge, SwiftVR is the first generative VR model to achieve real-time 1080p streaming on a consumer-grade GPU, while attaining strong no-reference perceptual quality with lower inference cost. Project is available at https://h-oliday.github.io/SwiftVR.
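The phrase "gathers each spatial window into a dense tensor via deterministic indexing" can be illustrated with a small pure-Python sketch. It covers only plain, unshifted windows on a toy grid. How SwiftVR builds its shifted windows without masks, cyclic shifts, or padding is not described in the abstract and is not reproduced here. The function names are hypothetical.

```python
from typing import List


def window_index_table(height: int, width: int, window: int) -> List[List[int]]:
    """For each non-overlapping window, list the flat row-major token indices it covers.

    The table depends only on the grid shape, so it can be computed once and reused.
    """
    if height % window or width % window:
        raise ValueError("this sketch assumes the grid divides evenly into windows")
    table = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            table.append([
                (top + dy) * width + (left + dx)
                for dy in range(window)
                for dx in range(window)
            ])
    return table


def gather_windows(tokens: list, table: List[List[int]]) -> List[list]:
    """Pack a flat token list into dense per-window batches using the index table."""
    return [[tokens[i] for i in row] for row in table]


# Toy example: a 4x4 grid of tokens split into 2x2 windows.
table = window_index_table(4, 4, 2)
tokens = list(range(16))  # stand-ins for feature vectors
batches = gather_windows(tokens, table)

for window_id, batch in enumerate(batches):
    print(window_id, batch)
```

Every batch has the same length, so in a real model the gathered windows would form one dense tensor that a standard scaled dot-product attention call can process as an ordinary batch, with no mask needed to separate the windows.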

Original paper on arXiv
