📄 Paper Digest

Compressing the 2D Grid into 1D Tokens for Smarter Image Fusion

Image fusion · Multimodal · Token editing · Global consistency · Local detail

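Image fusion, which merges images from different modalities (infrared and visible light, or medical CT and MRI) into a single image, faces a persistent tension: it must preserve local detail such as texture and edges while keeping the overall appearance natural, with consistent brightness and tone. Existing methods build shared representations on 2D feature grids, which capture local structure well but offer limited control over image-level appearance. This paper takes the opposite approach: it borrows a frozen pretrained image tokenizer (a model that, roughly speaking, splits an image into words) and compresses global information into a compact 1D token sequence that acts as a "global palette", while local detail still flows through the original 2D spatial pathway. The key contribution is Selective Token Editing (STE): sparsely updating or replacing a small set of critical tokens steers the global appearance of the whole image, while the fusion backbone stays unchanged and no extra loss terms are needed. On four commonly used benchmarks, the authors report the best overall performance, with consistent multi-metric gains in both global coherence and local fidelity. It is not a tool you can pick up tomorrow, but the idea is neat: the 1D space handles global appearance, the 2D space handles local structure, and the two stay out of each other's way.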
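To make the idea of sparse token editing concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names `token_distance` and `selective_token_edit`, the choice of squared Euclidean distance as the "criticality" score, and the top-k replacement rule are all hypothetical. The abstract only says that a small set of critical tokens is updated or replaced; it does not say how they are chosen.

```python
from typing import List, Sequence

Token = List[float]  # one 1D token embedding (hypothetical representation)


def token_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two token embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def selective_token_edit(base: List[Token], reference: List[Token], k: int) -> List[Token]:
    """Return a copy of `base` with at most k tokens replaced by `reference` tokens.

    Assumption: a token counts as "critical" when it differs most from the
    reference sequence. The paper may use a different selection criterion.
    """
    if len(base) != len(reference):
        raise ValueError("token sequences must have the same length")
    k = max(0, min(k, len(base)))
    scores = [token_distance(b, r) for b, r in zip(base, reference)]
    # Indices of the k largest scores; the stable sort keeps the earlier index on ties.
    chosen = set(sorted(range(len(base)), key=lambda i: -scores[i])[:k])
    # Every other token is copied through untouched, so the edit stays sparse.
    return [list(reference[i]) if i in chosen else list(base[i]) for i in range(len(base))]


# Toy example: four 2-dimensional tokens, edit the two most divergent ones.
base_tokens = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
reference_tokens = [[0.0, 0.1], [5.0, 5.0], [2.0, 2.0], [0.0, 0.0]]
edited = selective_token_edit(base_tokens, reference_tokens, k=2)
print(edited)  # [[0.0, 0.0], [5.0, 5.0], [2.0, 2.0], [0.0, 0.0]]
```

In this sketch only tokens 1 and 3 change, and nothing outside the token sequence is touched, which mirrors the claim that the fusion backbone itself is left unchanged. In a real pipeline the edited sequence would be passed back through the frozen tokenizer's decoder side to supply the global appearance, while the 2D pathway restores local structure; that step is omitted here because the abstract does not describe it.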

📄 Original Abstract (English)

Multimodal image fusion aims to integrate complementary information from different modalities into a fused image that preserves rich local details while maintaining globally consistent appearance. Existing approaches build shared representations on 2D feature grids, which excel at modeling local structures but offer limited leverage over image-level global appearance factors. To balance these objectives, we introduce a compact 1D token interface based on a frozen pretrained image tokenizer for modeling non-local appearance/base factors. Rather than using the tokenizer as a reconstruction backbone, our design uses the 1D token space as a global carrier while retaining the 2D spatial pathway for local structure restoration. Specifically, we introduce Selective Token Editing (STE), which sparsely updates/replaces a small set of critical tokens, providing a lightweight mechanism to steer global appearance coherence while keeping the fusion backbone unchanged and avoiding extra losses. Experiments on four commonly used benchmarks show that our method achieves the best overall performance, with consistent, multi-metric improvements in both global coherence and local fidelity. Project page: https://zju-xyc.github.io/1D-Fusion-Project-Page/

