Make the subject in your video do what you want, and restyle it too
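With today's AI video generation, if you want a specific subject (say, your cat or a plush toy) to appear in a video, you usually get one of two outcomes: either the subject stays rigidly identical to the reference and falls apart as soon as you change the style, or the style changes but the subject's features are lost. This paper argues that the ideal is to keep the subject's core features fixed while letting the background, style, and motion change freely according to the prompt. The authors propose a method called DomainShuttle. Its core idea is to process the subject's reference image and the video content separately (Domain-MoT), place reference tokens and video tokens in two independent positional coordinate systems (DualRoPE) for precise subject-level spatial modeling, and add a loss function (Cross-Pair Consistent Loss) that extracts the subject's "unchanging essence" while filtering out irrelevant features. The upshot: the cat in the video is still your cat, but it can be rendered as an oil painting, run across Mars, or wear a spacesuit, with both subject fidelity and scene flexibility preserved.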
📄 Original abstract
Open domain subject-driven text-to-video (S2V) generation has drawn significant interest in academia and industry. Open domain S2V mainly involves two scenarios: in-domain, which requires retaining the reference subject features as much as possible, and cross-domain, which preserves the intrinsic features of the subject while allowing subject-irrelevant properties to vary flexibly according to the text prompt. Existing methods primarily focus on maximizing subject fidelity in in-domain scenarios, which limits their editability and adaptability in cross-domain scenarios, such as novel styles, semantic combinations, or domain attributes. In this study, we propose that an ideal S2V method should flexibly shuttle between different domains, achieving strong performance in both in-domain and cross-domain scenarios. To this end, we propose DomainShuttle, which could achieve high fidelity and generative flexibility for open domain video personalization. Specifically, we introduce Domain-MoT, which decouples videos and reference features and introduces the domain-aware AdaLN for domain-specific modeling of reference images. We then introduce the Video-Reference DualRoPE scheme, which places reference image tokens and video tokens in separate RoPE spaces to enable precise subject-level spatial modeling, and Cross-Pair Consistent Loss, which aims to extract intrinsic subject features unaffected by irrelevant features. Extensive experiments demonstrate that DomainShuttle achieves significant performance improvements over existing methods, exhibiting high subject fidelity and generative flexibility across diverse open domain application scenarios.
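To make the "separate RoPE spaces" idea a bit more concrete, here is a minimal toy sketch in plain Python. It is not the paper's implementation: the abstract does not say how the two spaces are constructed, so the disjoint position ranges, the `ref_offset` value, and the helper names `rope_rotate`, `dual_positions`, and `attention_score` are all assumptions made purely for illustration. The rotation itself is the standard 1D rotary position embedding in its "rotate-half" form.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate one even-length vector by the standard RoPE angles for `position`."""
    half = len(vec) // 2
    out = [0.0] * len(vec)
    for i in range(half):
        angle = position * base ** (-i / half)
        c, s = math.cos(angle), math.sin(angle)
        a, b = vec[i], vec[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out

def dual_positions(num_video, num_ref, ref_offset=1000):
    """Hypothetical scheme (an assumption, not taken from the paper):
    video tokens use positions 0..num_video-1, while reference tokens
    use a disjoint range that starts at ref_offset."""
    video_pos = list(range(num_video))
    ref_pos = [ref_offset + i for i in range(num_ref)]
    return video_pos, ref_pos

def attention_score(q, q_pos, k, k_pos):
    """Dot product of a RoPE-rotated query and a RoPE-rotated key."""
    rq, rk = rope_rotate(q, q_pos), rope_rotate(k, k_pos)
    return sum(x * y for x, y in zip(rq, rk))

video_pos, ref_pos = dual_positions(num_video=4, num_ref=2)
q = [0.3, -1.2, 0.5, 0.8]   # toy query from a video token
k = [1.0, 0.1, -0.4, 0.7]   # toy key from a reference-image token

score = attention_score(q, video_pos[1], k, ref_pos[0])
# RoPE scores depend only on the relative offset between positions, so
# shifting both positions by the same amount should leave the score
# unchanged up to floating-point error.
shifted = attention_score(q, video_pos[1] + 7, k, ref_pos[0] + 7)
print(abs(score - shifted) < 1e-9)
```

Under this assumed layout, a reference token never shares a position index with a video token, so the model can tell the two streams apart from position alone. How DomainShuttle actually separates the two spaces may well differ.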