Make the subject in your video follow your instructions, and change its style too
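With today's AI video generators, putting a specific subject (your cat, say, or a toy) into a video usually goes one of two ways: either the subject stays rigidly locked to its reference image and falls apart as soon as you change the style, or it follows the text prompt and the subject itself drifts out of shape. DomainShuttle, the method proposed in this paper, handles these two conflicting requirements separately: it uses a domain-aware module (Domain-MoT with domain-aware AdaLN, in the abstract's terms) so that the subject keeps its core features while the background, style, and motion are free to follow the text instruction. For example, you upload a photo of your cat and type "the cat dancing on Mars"; the model still recognizes it as your cat, yet the cat actually dances and the background becomes Mars. Technically, it handles the positional encodings of the reference image and of the video separately (Video-Reference DualRoPE) so the two do not conflict, then uses a loss function (Cross-Pair Consistent Loss) to make sure only the subject's intrinsic features are extracted, ignoring irrelevant properties such as pose or lighting. This is not a tool you can use tomorrow, but it points to a key step in moving video generation from "pasting in an image" toward "controllable creation".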
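To make the positional-encoding idea concrete, here is a minimal sketch of rotary position embedding (RoPE) in plain Python. The abstract only says that reference tokens and video tokens live in "separate RoPE spaces"; it does not say how. The fixed offset used below (`REF_OFFSET`) and the helper names are illustrative assumptions, not the paper's scheme. The sketch shows only the general property that makes a separation possible: RoPE attention scores depend on relative position, so moving reference tokens into a disjoint position range changes how they relate to video tokens without disturbing video-to-video relations.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Apply standard 1D RoPE to an even-length vector at the given position."""
    dim = len(vec)
    assert dim % 2 == 0, "RoPE rotates pairs of channels"
    out = []
    for i in range(0, dim, 2):
        # Each channel pair rotates at its own frequency.
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def attention_score(q, q_pos, k, k_pos):
    """Dot product of a RoPE-rotated query and key (unnormalized score)."""
    rq = rope_rotate(q, q_pos)
    rk = rope_rotate(k, k_pos)
    return sum(a * b for a, b in zip(rq, rk))

# ASSUMPTION: a hypothetical way to keep the two token types apart is to
# give reference-image tokens positions from a range video tokens never use.
REF_OFFSET = 1000

def video_position(index):
    return index

def reference_position(index):
    return REF_OFFSET + index

q = [1.0, 0.5, -0.3, 0.8]
k = [0.2, -0.7, 0.9, 0.4]
# Video-to-video score between positions 3 and 5.
video_score = attention_score(q, video_position(3), k, video_position(5))
# Video-to-reference score: same vectors, but the key sits in the offset range.
cross_score = attention_score(q, video_position(3), k, reference_position(5))
```

Because the score depends only on the position difference, `video_score` is the same wherever the pair sits in the sequence, while `cross_score` sees a much larger relative distance. How DomainShuttle actually constructs its two spaces for spatio-temporal video tokens is something to check in the paper itself.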
📄 Original abstract
Open domain subject-driven text-to-video (S2V) generation has drawn significant interest in academia and industry. Open domain S2V mainly involves two scenarios: in-domain, which requires retaining the reference subject features as much as possible, and cross-domain, which preserves the intrinsic features of the subject while allowing subject-irrelevant properties to vary flexibly according to the text prompt. Existing methods primarily focus on maximizing subject fidelity in in-domain scenarios, which limits their editability and adaptability in cross-domain scenarios, such as novel styles, semantic combinations, or domain attributes. In this study, we propose that an ideal S2V method should flexibly shuttle between different domains, achieving strong performance in both in-domain and cross-domain scenarios. To this end, we propose DomainShuttle, which could achieve high fidelity and generative flexibility for open domain video personalization. Specifically, we introduce Domain-MoT, which decouples videos and reference features and introduces the domain-aware AdaLN for domain-specific modeling of reference images. We then introduce the Video-Reference DualRoPE scheme, which places reference image tokens and video tokens in separate RoPE spaces to enable precise subject-level spatial modeling, and Cross-Pair Consistent Loss, which aims to extract intrinsic subject features unaffected by irrelevant features. Extensive experiments demonstrate that DomainShuttle achieves significant performance improvements over existing methods, exhibiting high subject fidelity and generative flexibility across diverse open domain application scenarios.
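The "domain-aware AdaLN" mentioned in the abstract can be pictured with a small sketch of adaptive layer normalization in plain Python. In the usual AdaLN formulation, features are layer-normalized and then modulated by a scale and a shift that come from a conditioning signal. Everything specific below is an assumption made for illustration: the two domain labels, the hard-coded `DOMAIN_PARAMS` table (a real model would predict scale and shift from a learned embedding), and the four-channel feature size. It is not the paper's Domain-MoT implementation.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a feature vector to zero mean and (roughly) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

# ASSUMPTION: hypothetical per-domain modulation values. In a trained model
# these would be produced by a learned projection, not written by hand.
DOMAIN_PARAMS = {
    "in_domain": {"scale": [0.0, 0.0, 0.0, 0.0], "shift": [0.0, 0.0, 0.0, 0.0]},
    "cross_domain": {"scale": [0.5, -0.5, 0.25, 0.0], "shift": [0.1, 0.0, -0.1, 0.2]},
}

def domain_adaln(x, domain):
    """AdaLN-style modulation: LN(x) * (1 + scale) + shift, chosen per domain."""
    params = DOMAIN_PARAMS[domain]
    normed = layer_norm(x)
    return [n * (1.0 + s) + b
            for n, s, b in zip(normed, params["scale"], params["shift"])]

reference_features = [1.0, 2.0, 3.0, 4.0]
in_domain_out = domain_adaln(reference_features, "in_domain")
cross_domain_out = domain_adaln(reference_features, "cross_domain")
```

The same reference features come out differently depending on the domain label, which is the general mechanism by which a conditioning signal can tell the network whether to hold the subject fixed or let subject-irrelevant properties vary.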