Yuhao Wang, Lingjuan Miao, Zhiqiang Zhou, Lei Zhang, Yajun Qiao
- We first propose to use natural language to express the overall objective of IVIF, which avoids the complex, explicit mathematical modeling required by current fusion loss functions.
- A language-driven fusion model is derived in the CLIP embedding space, from which we develop a simple yet highly effective language-driven loss for IVIF. In particular, by introducing a novel regularization and a patch-filtering approach, we ensure high robustness of the trained model in practice and resolve the challenge of removing the textual artifacts induced by CLIP.
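The core idea of a language-driven loss can be sketched as a cosine distance between the CLIP embedding of the fused image and the embedding of a text prompt describing the desired fusion result. The snippet below is a minimal illustration only: the embeddings are random stand-ins for CLIP's image/text encoder outputs, and the prompt, regularization, and patch filtering from the paper are omitted.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def language_driven_loss(image_emb, text_emb):
    # Pull the fused image's embedding toward the text prompt's
    # embedding in the shared CLIP space: loss = 1 - cos(img, text).
    return 1.0 - cosine_similarity(image_emb, text_emb)

# Stand-in embeddings; in practice these would come from CLIP's image
# and text encoders applied to the fused image and a prompt such as
# "an image with salient targets and rich textures" (hypothetical prompt).
rng = np.random.default_rng(0)
img_emb = rng.standard_normal(512)
txt_emb = rng.standard_normal(512)

loss = language_driven_loss(img_emb, txt_emb)
```

Minimizing this distance during training steers the fused output toward the semantics of the prompt, instead of hand-designing pixel- or gradient-level fusion terms.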
- Experiments show that the proposed method greatly improves fusion quality, revealing the superiority of language in modeling the fusion output and the potential of pre-trained vision-language models for improving IVIF performance.
- Create a conda environment

```shell
conda create -n LDFusion python=3.9.12
conda activate LDFusion
```

- Install dependencies (CUDA 11.1 and torch 1.8.2 recommended)

```shell
pip install -r requirements.txt
```
Put the test data into the `test_imgs` directory (infrared images in the `ir` subfolder, visible images in the `vi` subfolder), then run:

```shell
python src/test.py
```

(Note: the weight files (`*.pt`) may need to be downloaded separately from the repository.)

The fused results will be saved in the `./results/` folder.
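The expected input layout described above can be prepared as follows (the copy commands are placeholders for your own image paths, and the inference step is shown commented out):

```shell
# Create the input layout that src/test.py expects.
mkdir -p test_imgs/ir test_imgs/vi

# Copy your registered image pairs in, e.g. (hypothetical paths):
#   cp /path/to/infrared/*.png test_imgs/ir/
#   cp /path/to/visible/*.png  test_imgs/vi/

# Then run inference; outputs are written to ./results/
#   python src/test.py
```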
From left to right are the infrared image, visible image, and the fused image generated by LDFusion.