📄 Paper Digest

AI video generation can finally remember what the main character looks like

AI video generation · cross-shot consistency · memory mechanism · audio-video sync

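Today's AI video generators behave as if they lose their memory at every cut: the main character's face, voice, and setting all change from shot to shot. UnityShots gives the model two fixed-size memory slots: a long-term slot anchored to the opening shot and a short-term slot holding the tail of the previous shot. At every cut, a gate that fuses the visual cut probability with a beat-tracker signal updates both slots. On the audio side, a reference speaker token is injected at every shot so the voice does not drift. The model also learns a discrete cut-type prior (a hard cut versus a fade, for example), which lets you control transition strength by hand at generation time. On cross-shot coherence, it leads the open-source baselines on every metric and matches the strongest closed-source system on the multi-shot axes. You won't be using it tomorrow, but it is a key step in moving AI video from "single-shot magic" to "actually telling a story."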
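To make the two-slot idea concrete, here is a minimal toy sketch in plain Python. It is not the UnityShots implementation. The names `ShotMemory`, `boundary_gate`, `beat_weight`, and `ltm_rate` are hypothetical, each shot is reduced to a single feature vector, and the weighted-average fusion of the cut probability and the beat signal is an assumption made for illustration. The only properties taken from the paper's description are that there are two fixed-size slots and that a boundary-conditioned gate updates both at every cut.

```python
from dataclasses import dataclass
from typing import List, Optional

Vector = List[float]


def blend(old: Vector, new: Vector, gate: float) -> Vector:
    """Convex blend: gate=0 keeps `old`, gate=1 replaces it with `new`."""
    return [(1.0 - gate) * o + gate * n for o, n in zip(old, new)]


def boundary_gate(cut_prob: float, beat_strength: float,
                  beat_weight: float = 0.5) -> float:
    """Hypothetical fusion rule: weighted average of the visual cut
    probability and the beat-tracker signal, clipped to [0, 1]."""
    g = (1.0 - beat_weight) * cut_prob + beat_weight * beat_strength
    return min(1.0, max(0.0, g))


@dataclass
class ShotMemory:
    """Two fixed-size slots; their size never grows with the shot count."""
    ltm: Optional[Vector] = None   # long-term slot, anchored to the opening shot
    stm: Optional[Vector] = None   # short-term slot, tail of the previous shot
    ltm_rate: float = 0.1          # assumption: the LTM slot moves slowly

    def update(self, shot_tail: Vector, cut_prob: float,
               beat_strength: float) -> float:
        gate = boundary_gate(cut_prob, beat_strength)
        if self.ltm is None:
            # The opening shot initializes both slots.
            self.ltm = list(shot_tail)
            self.stm = list(shot_tail)
            return gate
        # Both slots are updated at every cut, at different speeds.
        self.ltm = blend(self.ltm, shot_tail, gate * self.ltm_rate)
        self.stm = blend(self.stm, shot_tail, gate)
        return gate


if __name__ == "__main__":
    mem = ShotMemory()
    mem.update([1.0, 0.0], cut_prob=1.0, beat_strength=1.0)  # opening shot
    mem.update([0.0, 1.0], cut_prob=1.0, beat_strength=1.0)  # a confident cut
    print("LTM:", mem.ltm)
    print("STM:", mem.stm)
```

In this toy version, a confident cut overwrites the short-term slot with the newest shot while the long-term slot stays close to the opening shot, and memory cost is constant no matter how many shots are generated.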

📄 Original Abstract (English)

Generating a coherent multi-shot video requires structured cross-shot memory. Subject appearance, scene context, and speaker identity must persist across cuts. Existing approaches either train end-to-end over fixed-length sequences and cannot scale, generate shot-by-shot with memory banks that grow linearly, or orchestrate pretrained generators under an LLM planner without a multi-shot-aware backbone. We present UnityShots, a memory-driven multi-shot audio-video generation system built on LTX-2.3, trained on annotated cinematic and music-video shots. The video stream maintains two fixed-size slots, a long-term memory (LTM) slot anchored to the opening shot and a short-term memory (STM) slot holding the immediately preceding tail, both updated at every cut by a boundary-conditioned gate that fuses visual cut probability and beat-tracker signals. The audio stream injects a reference speaker token at every shot to preserve vocal timbre without a sliding audio bank. A discrete cut-type prior, learned through AdaLN, becomes an inference-time control knob over transition strength. We release a benchmark of 200 multi-cultural multi-shot sequences spanning six ethnic regions and ten or more languages, with per-shot reference identities, reference audio, and per-boundary transition labels. Evaluated across I2V, T2V, and R2V conditioning modes, UnityShots leads open-source baselines on every cross-shot coherence metric and matches the strongest closed-source system on the multi-shot axes.
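The abstract says the cut-type prior is learned through AdaLN (adaptive layer normalization), where a conditioning signal sets the scale and shift applied after normalization. The sketch below shows that general pattern only, under loud assumptions: the cut-type names, the numeric values in `CUT_TYPE_MODULATION`, and the `strength` multiplier are all hypothetical, and a real model would predict per-channel scale and shift vectors from a learned embedding rather than look up hand-set scalars.

```python
import math
from typing import Dict, List, Tuple


def layer_norm(x: List[float], eps: float = 1e-5) -> List[float]:
    """Normalize a feature vector to zero mean and (roughly) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


# Hypothetical cut types with hand-set (scale, shift) pairs.
# In a trained model these would be learned, not written by hand.
CUT_TYPE_MODULATION: Dict[str, Tuple[float, float]] = {
    "hard_cut": (0.5, 0.1),
    "fade": (-0.2, 0.0),
}


def adaln(x: List[float], cut_type: str, strength: float = 1.0) -> List[float]:
    """AdaLN-style modulation: normalize, then scale and shift by values
    that depend on the discrete cut type. `strength` is an illustrative
    knob (an assumption) that fades the modulation in or out."""
    scale, shift = CUT_TYPE_MODULATION[cut_type]
    normed = layer_norm(x)
    return [n * (1.0 + strength * scale) + strength * shift for n in normed]


if __name__ == "__main__":
    features = [1.0, 2.0, 3.0]
    print(adaln(features, "hard_cut"))
    print(adaln(features, "fade", strength=0.5))
```

With `strength=0.0` the function reduces to plain layer normalization, which is one simple way to picture how a conditioning signal of this kind could be dialed up or down at inference time.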

Original paper on arXiv
