TGFuse: An Infrared and Visible Image Fusion Approach Based on Transformer and Generative Adversarial Network

Background Introduction

Figure: The neural network framework of this study.

With the development of imaging devices and analysis methods, multimodal visual data is emerging rapidly and finding many practical applications. In these applications, image fusion plays a significant role in helping human observers perceive the information contained in multimodal data. In particular, the fusion of infrared and visible images is of paramount importance in fields such as military, security, and visual tracking, and has become a crucial part of the image fusion task. A natural and efficient image fusion algorithm can enhance perception at the whole-image level and thus meet the fusion needs of complex scenes. However, existing fusion methods based on Convolutional Neural Networks (CNNs) largely ignore long-range dependencies, hindering balanced perception across the entire image.

Traditional fusion algorithms based on multi-scale transformations achieved preliminary research results by extracting and fusing multi-scale representations of the source images and then reconstructing the fused image. However, these methods are limited in handling complex scenes, tend to introduce noise, and operate with low efficiency. With the development of deep learning, CNNs, with their powerful representation capabilities and flexible structures, have become the mainstream research approach. Nonetheless, since most image fusion tasks are unsupervised and lack ground-truth fused images, supervised end-to-end training frameworks are not directly applicable to fusion.

To address these issues, this paper proposes an infrared and visible image fusion algorithm based on Transformer modules and generative adversarial learning. The innovation lies in using the Transformer to learn effective global fusion relationships and incorporating adversarial learning during training to enforce competitive consistency with the inputs, thereby improving the discriminability of the output images. Experimental results show that the proposed method performs better in complex scenarios.

Paper Source

The paper is titled “An Infrared and Visible Image Fusion Approach Based on Transformer and Generative Adversarial Network,” authored by Dongyu Rao, Tianyang Xu, and Xiao-Jun Wu, all affiliated with the School of Artificial Intelligence and Computer Science at Jiangnan University. The paper was published in IEEE Transactions on Image Processing in 2023, DOI: 10.1109/TIP.2023.3273451.

Research Methods and Process

Research Process

The proposed framework is mainly divided into two parts: a Transformer-based generator and two discriminators. The generator is responsible for producing the fused images, while the discriminators are used to refine the perceptual quality of the fused images.

  1. Generator

    • The source images are concatenated along the channel dimension and fed into a CNN for initial feature extraction.
    • The mixed CNN features are input into the Transformer fusion module to learn global fusion relationships.
    • The Transformer module operates on downsampled features to reduce computational cost; the learned fusion relations are then upsampled to different scales and multiplied with the corresponding features to obtain preliminary fusion results.
    • The fusion features at the different scales are upsampled to the original image size and summed to produce the final fusion result (a simplified sketch of this pipeline follows the list below).
  2. Discriminator

    • Two discriminators are set up: a discriminator for the fused and infrared images (dis-ir) and a discriminator for the fused and visible images (dis-vis).
    • A pre-trained VGG-16 network serves as the backbone of both discriminators, computing an L1 loss at the feature level that pulls the fused image closer to the infrared or visible image, respectively.
    • During training, the source images are fed into the generator to obtain preliminary fused images; the two discriminators then feed their losses back to the generator, and this adversarial training ultimately drives the generator toward the desired fusion quality.
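The generator data flow described above can be summarized as follows. This is a minimal PyTorch-style sketch under simplifying assumptions (a single scale instead of the multi-scale design, and a simple stand-in for the Transformer fusion module); the module names are illustrative, not the authors' released implementation.

```python
# Minimal PyTorch-style sketch of the generator data flow described above.
# Module names and layer choices are illustrative assumptions; the Transformer
# fusion module is replaced by a simple stand-in that outputs fusion coefficients.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TGFuseGeneratorSketch(nn.Module):
    def __init__(self, feat_channels=64):
        super().__init__()
        # Initial CNN feature extraction on the channel-concatenated source images
        self.shallow_cnn = nn.Sequential(
            nn.Conv2d(2, feat_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Stand-in for the Transformer fusion module that learns global fusion relations
        self.fusion_relation = nn.Sequential(
            nn.Conv2d(feat_channels, feat_channels, 1), nn.Sigmoid(),
        )
        self.reconstruct = nn.Conv2d(feat_channels, 1, 3, padding=1)

    def forward(self, ir, vis):
        x = torch.cat([ir, vis], dim=1)            # combine sources along the channel dimension
        feat = self.shallow_cnn(x)                 # mixed CNN features
        # Learn fusion relations on a downsampled copy to save computation
        small = F.interpolate(feat, scale_factor=0.5, mode="bilinear", align_corners=False)
        relation = self.fusion_relation(small)
        # Upsample the learned relations and modulate the features with them
        relation = F.interpolate(relation, size=feat.shape[-2:], mode="bilinear", align_corners=False)
        fused_feat = feat * relation
        return torch.sigmoid(self.reconstruct(fused_feat))

# Usage (assumed 1-channel inputs in [0, 1]):
# fused = TGFuseGeneratorSketch()(ir_batch, vis_batch)
```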

Transformer Module

The Transformer fusion module consists of two parts: a conventional spatial Transformer (Spatial Transformer) and a cross-channel Transformer (Channel Transformer). Combining the two yields more comprehensive global fusion relationships.

  • Spatial Transformer: Images are divided into patches and flattened into vectors, then input into the Transformer model for relationship learning.
  • Channel Transformer: A new cross-channel Transformer model is proposed to learn information relationships across the channel dimension.
  • Combined Transformer: Applying the Channel Transformer first and then the Spatial Transformer learns coefficients better suited to infrared and visible image fusion (see the attention sketch after this list).
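To make the two attention directions concrete, the sketch below contrasts spatial self-attention (tokens are flattened spatial positions) with channel self-attention (tokens are feature channels), combined in the channel-first order described above. It is a simplified illustration under assumed tensor shapes and single-head dot-product attention, not the paper's exact module.

```python
# Simplified illustration of the two attention directions in the fusion module.
# Single-head dot-product attention is assumed; the actual module also includes
# projections, multi-head attention, and feed-forward layers omitted here.
import torch

def channel_attention(feat):
    # feat: (B, C, H, W); tokens are channels, so the similarity matrix is C x C.
    b, c, h, w = feat.shape
    tokens = feat.flatten(2)                                   # (B, C, H*W)
    attn = torch.softmax(tokens @ tokens.transpose(1, 2) / (h * w) ** 0.5, dim=-1)
    return (attn @ tokens).reshape(b, c, h, w)

def spatial_attention(feat):
    # feat: (B, C, H, W); tokens are spatial positions, similarity is (H*W) x (H*W).
    b, c, h, w = feat.shape
    tokens = feat.flatten(2).transpose(1, 2)                   # (B, H*W, C)
    attn = torch.softmax(tokens @ tokens.transpose(1, 2) / c ** 0.5, dim=-1)
    return (attn @ tokens).transpose(1, 2).reshape(b, c, h, w)

def combined_transformer_sketch(feat):
    # Channel Transformer first, then Spatial Transformer, as described above.
    return spatial_attention(channel_attention(feat))
```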

Loss Function

  • Generator Loss: An improved SSIM (Structural Similarity) loss is used, optimizing the fusion with a single loss term to avoid conflicts between multiple loss functions.
  • Discriminator Loss: Comprises the loss between the infrared and fused images (dis-ir) and the loss between the visible and fused images (dis-vis). Both are computed at the feature level as an L1 distance between features extracted by the pre-trained VGG-16 network (a sketch of both loss terms follows this list).
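A sketch of the two loss terms is given below. It assumes the third-party pytorch_msssim package for the SSIM term and a torchvision VGG-16 truncated at an arbitrary layer for the feature-level L1 losses; the exact SSIM variant, layer choice, and weighting in the paper may differ.

```python
# Sketch of the generator SSIM loss and the feature-level L1 discriminator losses.
# The pytorch_msssim package, the VGG-16 truncation point, and equal weighting of
# the two SSIM terms are assumptions, not the paper's exact settings.
import torch
import torch.nn.functional as F
from torchvision.models import vgg16
from pytorch_msssim import ssim

# Frozen pre-trained VGG-16 feature extractor shared by dis-ir and dis-vis
vgg_features = vgg16(weights="IMAGENET1K_V1").features[:16].eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)

def generator_loss(fused, ir, vis):
    # Single SSIM-based structural loss measured against both source images
    return (1 - ssim(fused, ir, data_range=1.0)) + (1 - ssim(fused, vis, data_range=1.0))

def feature_l1_loss(fused, source):
    # Feature-level L1 distance between the fused image and one source (dis-ir / dis-vis)
    fused_rgb = fused.repeat(1, 3, 1, 1)      # VGG-16 expects 3-channel input
    source_rgb = source.repeat(1, 3, 1, 1)
    return F.l1_loss(vgg_features(fused_rgb), vgg_features(source_rgb))
```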

Research Results

Experimental results on the TNO, Road Scene, and LLVIP datasets show that the proposed method achieves the best or second-best performance on multiple objective evaluation metrics. For example, on the TNO dataset, it achieved the best performance on 5 out of 9 evaluation metrics and the second-best performance on 3 metrics.

Subjective Evaluation

Visual comparison shows that the proposed method excels at retaining salient infrared information and low-noise background information. The generated fused images are more consistent with human visual perception than those of the compared methods.

Conclusion

This paper proposes a Transformer and generative adversarial learning-based method for infrared and visible image fusion, demonstrating excellent performance in the fusion task and providing a new exploration direction for image fusion tasks. Future research will further explore the application of Transformers in fusion tasks and attempt to apply them to downstream tasks.

Research Highlights

  1. Proposed a New Fusion Algorithm: Combines Transformer modules with generative adversarial learning, introducing adversarial training to improve the discriminability of the output images.
  2. Multi-Module Combination: Learning more comprehensive global fusion relationships through the combination of Spatial and Channel Transformers.
  3. Excellent Experimental Results: The proposed method achieved the best or second-best performance on multiple objective metrics across various datasets.