DRTN: Dual Relation Transformer Network with Feature Erasure and Contrastive Learning for Multi-Label Image Classification
New Breakthrough in Multi-Label Image Classification: Dual Relation Transformer Network
Academic Background
Multi-Label Image Classification (MLIC) is a fundamental yet highly challenging problem in computer vision. Unlike single-label classification, MLIC aims to assign multiple labels to a single image, one for each object or concept it contains. Because an image may contain multiple objects with complex spatial and semantic relationships among them, MLIC faces challenges such as scene complexity, varying object scales, and implicit correlations between objects. In recent years, with the rapid development of deep learning, particularly the introduction of Convolutional Neural Networks (CNNs) and Transformers, significant progress has been made on MLIC. However, existing Transformer methods often flatten 2D feature maps into 1D sequences, losing spatial information. In addition, current attention-based models tend to focus only on the most salient feature regions while ignoring other potentially useful features, which limits classification performance.
To address these issues, a research team from Sun Yat-sen University proposed a novel Dual Relation Transformer Network (DRTN), which significantly enhances the performance of multi-label image classification through feature erasure and contrastive learning techniques. The study aims to solve the problems of spatial information loss and the limitations of attention mechanisms in Transformer methods, thereby providing a more comprehensive solution for MLIC tasks.
Source of the Paper
This paper was co-authored by Wei Zhou, Kang Lin, Zhijie Zheng, Dihu Chen, Tao Su, and Haifeng Hu, all from the School of Electronics and Information Technology at Sun Yat-sen University. The paper was published in 2025 in the journal Neural Networks, titled “DRTN: Dual Relation Transformer Network with Feature Erasure and Contrastive Learning for Multi-Label Image Classification.”
Research Process and Details
1. Overview of the Research Process
The core design of the DRTN network lies in enhancing the performance of multi-label image classification through the Dual Relation Enhancement (DRE) module, the Feature Enhancement and Erasure (FEE) module, and the Contrastive Learning (CL) module. The specific process is as follows:
- Feature Extraction: Use a pre-trained CNN (e.g., ResNet-101) to extract feature maps from the input image.
- Dual Relation Enhancement (DRE) Module: Capture the correlations between different objects in the image by fusing grid features and pseudo-region features.
- Feature Enhancement and Erasure (FEE) Module: Discover salient feature regions through the attention mechanism and mine other potentially useful features through a region-level erasure strategy.
- Contrastive Learning (CL) Module: Use contrastive learning to bring the foregrounds of salient and potential features closer while pushing them away from background features.
- Model Training and Evaluation: Train and evaluate the model on multiple public datasets (e.g., MS-COCO 2014, Pascal VOC 2007, and NUS-WIDE) to validate its effectiveness.
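The five steps above can be traced as a shape-level walkthrough. Every function below is a hypothetical stand-in (the authors' actual implementation is not shown here); only the tensor shapes reflect the text:

```python
import numpy as np

# Shape-level sketch of the DRTN pipeline. Each function is a
# hypothetical stand-in; only the shapes follow the description.
def backbone(img):
    # CNN feature extraction (e.g. ResNet-101, stride 32): 448 -> 14
    return np.zeros((14, 14, 2048), dtype=np.float32)

def dre(F):
    # Dual Relation Enhancement: fuse grid and pseudo-region features
    # into a representative feature F_x (channel dim reduced here).
    return F[..., :512]

def fee(F_x):
    # Feature Enhancement and Erasure: salient F_e, potential F_s.
    return F_x, F_x * 0.0

img = np.zeros((448, 448, 3), dtype=np.float32)  # resized input
F = backbone(img)
F_x = dre(F)
F_e, F_s = fee(F_x)
print(F.shape, F_x.shape, F_e.shape, F_s.shape)
```

The contrastive-learning step then operates on embeddings pooled from `F_e` and `F_s`, as detailed in the sections below.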
2. Detailed Process and Experimental Design
a) Feature Extraction
The study first uses a pre-trained ResNet-101 network to extract feature maps from the input image. Specifically, the input image is resized to a resolution of 448×448, and after passing through the CNN, the generated feature map is represented as F∈R^H×W×C, where H and W are the height and width of the feature map, and C is the number of channels.
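For a stride-32 backbone such as ResNet-101, a 448×448 input yields a 14×14 spatial grid with 2048 channels. A minimal sketch of the resulting shape (the feature map itself is a random stand-in, not real CNN output):

```python
import numpy as np

# Input resized to 448x448; a stride-32 backbone such as ResNet-101
# produces a 14x14 grid (448 / 32 = 14) with C = 2048 channels.
H_in = W_in = 448
stride, C = 32, 2048
H = W = H_in // stride

# Stand-in for the CNN output F in R^{H x W x C}
F = np.random.rand(H, W, C).astype(np.float32)
print(F.shape)  # (14, 14, 2048)
```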
b) Dual Relation Enhancement (DRE) Module
The DRE module aims to capture the correlations between different objects in the image by fusing grid features and pseudo-region features. The specific steps are as follows:
- Grid Relation Encoder: Compress the channel dimension of the feature map F through a 1×1 convolutional layer, then flatten it into a grid feature sequence V_g. Next, use a Transformer encoder to capture the correlations between grid features.
- Pseudo-Region Relation Encoder: To compensate for the loss of spatial information in grid features, the study proposes a grid aggregation scheme that clusters grid features into N pseudo-region features V_r. These pseudo-region features capture the correlations between different regions through a Transformer encoder.
- Feature Fusion: Fuse the grid features and pseudo-region features to generate a more representative feature F_x, which serves as the input for subsequent modules.
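The grid-to-region flow can be sketched as follows. The paper's actual grid aggregation scheme is not reproduced here; block-wise average pooling into N regions is an illustrative stand-in, and the Transformer encoders are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C = 14, 14, 512   # channel dim already reduced by the 1x1 conv
N = 4                   # number of pseudo-regions (hyperparameter)

F = rng.standard_normal((H, W, C)).astype(np.float32)

# Grid feature sequence V_g: flatten the 2D map into H*W tokens.
V_g = F.reshape(H * W, C)

# Pseudo-region features V_r: here, average-pool the grid into N
# spatial blocks (an illustrative stand-in for the paper's grid
# aggregation scheme). N = 4 corresponds to a 2x2 block layout.
blocks = F.reshape(2, H // 2, 2, W // 2, C)
V_r = blocks.mean(axis=(1, 3)).reshape(N, C)

# Fusion sketch: combine grid and region tokens into one feature F_x.
F_x = np.concatenate([V_g, V_r], axis=0)
print(V_g.shape, V_r.shape, F_x.shape)
```

In the actual model, V_g and V_r each pass through their own Transformer encoder before fusion; the sketch only shows the tensor bookkeeping.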
c) Feature Enhancement and Erasure (FEE) Module
The FEE module aims to discover salient feature regions through the attention mechanism and mine other potentially useful features through a region-level erasure strategy. The specific steps are as follows:
- Feature Enhancement Branch: Generate a spatial attention map M_att through an attention head, and use the sigmoid function to generate an importance map M_imp. Multiply the importance map with the original feature to obtain the salient enhanced feature F_e.
- Feature Erasure Branch: Generate a region-level erasure mask M_e_r through a predefined erasure ratio θ_e, and multiply it with the original feature to obtain the erased potential feature F_s.
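The two branches can be sketched numerically. The attention map below is a random stand-in for the attention head's output, and the top-fraction thresholding is one plausible reading of "region-level erasure with ratio θ_e":

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C = 14, 14, 512
theta_e = 0.3                     # erasure ratio (hyperparameter)

F_x = rng.standard_normal((H, W, C)).astype(np.float32)

# Enhancement branch: a spatial attention map M_att (random stand-in
# for the attention head) squashed to (0, 1) by a sigmoid to give the
# importance map M_imp, which reweights the feature.
M_att = rng.standard_normal((H, W)).astype(np.float32)
M_imp = 1.0 / (1.0 + np.exp(-M_att))
F_e = F_x * M_imp[..., None]      # salient enhanced feature

# Erasure branch: zero out the top theta_e fraction of the most
# salient grid cells, forcing the network to mine the rest.
k = int(theta_e * H * W)
thresh = np.sort(M_imp.ravel())[-k]
M_er = (M_imp < thresh).astype(np.float32)   # erasure mask
F_s = F_x * M_er[..., None]       # erased potential feature
print(M_er.mean())                # roughly 1 - theta_e survives
```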
d) Contrastive Learning (CL) Module
The CL module aims to bring the foregrounds of salient and potential features closer while pushing them away from background features through contrastive learning. The specific steps are as follows:
- Foreground and Background Separation: Separate the foreground and background of salient and potential features through thresholding.
- Contrastive Loss Calculation: Design a contrastive loss L_cl to bring the foreground embedding vectors of salient and potential features closer while pushing them away from background embedding vectors.
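An InfoNCE-style loss illustrates the intended pull/push behavior; the paper's exact form of L_cl may differ, and the embeddings below are random stand-ins for the pooled foreground/background vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128
tau = 0.1   # temperature (hyperparameter)

def normalize(v):
    return v / np.linalg.norm(v)

# Foreground embeddings of the salient (z_e) and potential (z_s)
# features, plus a background embedding z_b -- random stand-ins.
z_e = normalize(rng.standard_normal(d))
z_s = normalize(rng.standard_normal(d))
z_b = normalize(rng.standard_normal(d))

# InfoNCE-style contrastive loss: the salient/potential foreground
# pair is the positive, the background embedding the negative.
pos = np.exp(z_e @ z_s / tau)
neg = np.exp(z_e @ z_b / tau)
L_cl = -np.log(pos / (pos + neg))
print(float(L_cl))
```

Minimizing this loss increases the z_e·z_s similarity (foregrounds pulled together) while decreasing z_e·z_b (background pushed away).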
e) Model Training and Evaluation
The study conducted experiments on three public datasets: MS-COCO 2014, Pascal VOC 2007, and NUS-WIDE. During training, the SGD optimizer was used with an initial learning rate of 10^-3, and the learning rate was reduced by a factor of 10 at the 25th and 35th epochs. The experimental results show that the DRTN model outperforms existing MLIC methods on multiple evaluation metrics.
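The stated schedule (initial rate 10^-3, divided by 10 at epochs 25 and 35) corresponds to a step decay, sketched here as a plain function:

```python
def learning_rate(epoch, base_lr=1e-3, milestones=(25, 35), gamma=0.1):
    """SGD learning rate, decayed 10x at each milestone epoch."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

# Epochs 0-24: 1e-3; epochs 25-34: 1e-4; epoch 35 onward: 1e-5.
print(learning_rate(0), learning_rate(30), learning_rate(40))
```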
3. Main Results and Conclusions
a) Experimental Results
On the MS-COCO 2014 dataset, the DRTN model achieved an mAP (mean Average Precision) of 84.7% at a resolution of 448×448, outperforming existing CNN, RNN, and GCN methods. When the resolution was increased to 576×576, the DRTN model’s mAP further improved to 86.2%, achieving the best performance among all compared methods.
On the Pascal VOC 2007 dataset, the DRTN model achieved an mAP of 94.7% at a resolution of 448×448, significantly outperforming existing CNN and GCN methods. When the resolution was increased to 576×576, the DRTN model’s mAP further improved to 94.9%.
On the NUS-WIDE dataset, the DRTN model achieved an mAP of 63.4%, outperforming existing GCN and Transformer methods.
b) Conclusions and Significance
The DRTN model significantly enhances the performance of multi-label image classification through the Dual Relation Enhancement module, the Feature Enhancement and Erasure module, and the Contrastive Learning module. The main contributions of the study include:
- Proposed a Dual Relation Enhancement module that captures the correlations between different objects in the image by fusing grid features and pseudo-region features.
- Designed a Feature Enhancement and Erasure module that discovers salient feature regions through the attention mechanism and mines other potentially useful features through a region-level erasure strategy.
- Introduced a Contrastive Learning module that brings the foregrounds of salient and potential features closer while pushing them away from background features.
This study provides a new solution for multi-label image classification tasks, with significant scientific and application value.
4. Research Highlights
- Novel Dual Relation Enhancement Module: Effectively captures the correlations between different objects in the image by fusing grid features and pseudo-region features.
- Innovative Feature Erasure Strategy: Mines other potentially useful features through a region-level erasure strategy, enhancing the model’s classification performance.
- Application of Contrastive Learning Mechanism: Brings the foregrounds of salient and potential features closer while pushing them away from background features, further enhancing the model’s discriminative ability.
5. Other Valuable Information
The study also explored the impact of different hyperparameters (e.g., the number of clusters N and the erasure ratio θ_e) on the model’s performance and validated the effectiveness of each module through ablation experiments. The experimental results show that the DRTN model achieved significant performance improvements on multiple public datasets, demonstrating its superiority in multi-label image classification tasks.
Summary
By combining the Dual Relation Enhancement, Feature Enhancement and Erasure, and Contrastive Learning modules, DRTN delivers a significant improvement in multi-label image classification. The study not only provides a new solution for MLIC tasks but also offers valuable insights for other tasks in computer vision.