Dynamic Attention Vision-Language Transformer Network for Person Re-Identification


In recent years, multimodal person re-identification (ReID) has gained increasing attention in the field of computer vision. Person ReID aims to identify specific individuals across different camera views, serving as a critical technology in security and surveillance applications such as locating missing persons and tracking criminals. However, integrating visual and textual information for multimodal ReID remains challenging: naive feature fusion can introduce irrelevant or noisy information, and the domain gap between pretraining data and downstream ReID datasets limits model performance.

This paper, authored by Guifang Zhang, Shijun Tan, Zhe Ji, and Yuming Fang from Jiangxi University of Finance and Economics and Newcastle University, was published in the International Journal of Computer Vision in 2024. It proposes a Dynamic Attention Vision-Language Transformer Network (DAVLT) to address these issues in multimodal ReID.


Background and Motivation

Person ReID technology has faced persistent challenges such as image blurring, low resolution, background interference, and occlusion. These challenges degrade recognition performance, particularly in cross-camera scenarios and complex environments. While earlier CNN-based ReID methods showed limitations in extracting local features, recent Transformer-based methods have gained prominence due to their superior fine-grained feature extraction capabilities. With the rise of large-scale pretrained multimodal models (e.g., CLIP, ViLT), researchers are leveraging multimodal information to enhance person ReID performance. However, naive feature fusion methods may introduce noise, limiting the effectiveness of such approaches.

To address these issues, this study introduces DAVLT, focusing on reducing irrelevant information during feature integration and mitigating the domain gap between pretraining and downstream datasets through an adapter module.


Methodology

Network Architecture

DAVLT consists of the following components:

1. Image Encoder: Utilizes a pretrained Vision Transformer (ViT) to extract discriminative image features.
2. Text Encoder: Employs the ViLT model to generate text features using predefined templates such as “a [mask] wears a pair of [mask] pants…” to ensure consistent textual descriptions.
3. Adapter Module: Addresses the domain gap between pretrained and downstream task datasets (a generic sketch of this kind of adapter follows this list).
4. Image-Text Dynamic Attention Module (ITDA): Dynamically assigns weights to features, emphasizing relevant information and suppressing noise for optimized fusion.
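The paper is summarized here without implementation code, but the adapter idea is well established. Below is a minimal PyTorch sketch of a bottleneck-style adapter of the kind commonly used to bridge pretraining and downstream domains; the dimensions, activation, and placement inside the DAVLT encoders are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Generic bottleneck adapter: down-project, non-linearity, up-project, residual add.

    Hypothetical sketch; DAVLT's actual adapter dimensions and insertion points may differ.
    """
    def __init__(self, dim: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)  # compress pretrained features
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)    # expand back to the encoder width

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual connection preserves the pretrained representation while the
        # small bottleneck learns a task-specific correction for the ReID domain.
        return x + self.up(self.act(self.down(x)))
```

Because only the small bottleneck layers are newly trained, such adapters can adapt the pretrained encoders to ReID data without overwriting the knowledge learned during pretraining.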

Image-Text Dynamic Attention Module (ITDA)

The ITDA module employs an attention weight control mechanism to compute Text-to-Image and Image-to-Text attention weights dynamically. This ensures effective feature fusion by highlighting meaningful features. For instance, when the description is “a woman wears red clothes,” the model accurately identifies the “red clothes” region in the image and assigns higher weights to those features.
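As an illustration only (not the paper's exact ITDA formulation), the sketch below implements bidirectional image-text cross-attention with learned gates that scale how much cross-modal context is injected; the module names, the gating scheme, and the zero initialization are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageTextDynamicAttention(nn.Module):
    """Illustrative bidirectional cross-attention with dynamic gating (not the official ITDA)."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.scale = dim ** -0.5
        # Text-to-Image branch: image tokens attend to the textual description.
        self.q_img, self.k_txt, self.v_txt = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        # Image-to-Text branch: text tokens attend to the image.
        self.q_txt, self.k_img, self.v_img = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        # Learned scalar gates control how much cross-modal context is injected;
        # zero initialization means fusion starts neutral and is learned during training.
        self.gate_img = nn.Parameter(torch.zeros(1))
        self.gate_txt = nn.Parameter(torch.zeros(1))

    def forward(self, img_tokens, txt_tokens):
        # img_tokens: (B, N, D) patch features; txt_tokens: (B, M, D) word features.
        attn_i = F.softmax(self.q_img(img_tokens) @ self.k_txt(txt_tokens).transpose(-2, -1) * self.scale, dim=-1)
        img_out = img_tokens + torch.tanh(self.gate_img) * (attn_i @ self.v_txt(txt_tokens))
        attn_t = F.softmax(self.q_txt(txt_tokens) @ self.k_img(img_tokens).transpose(-2, -1) * self.scale, dim=-1)
        txt_out = txt_tokens + torch.tanh(self.gate_txt) * (attn_t @ self.v_img(img_tokens))
        return img_out, txt_out
```

In this sketch, text tokens describing "red clothes" would produce high attention weights on the corresponding image patches, which is the behavior the ITDA module is designed to achieve.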

Loss Functions

The network is optimized using a combination of Cross-Entropy Loss (ID Loss) and Triplet Loss to enhance discrimination in the embedding space, pulling samples of the same identity closer together while pushing samples of different identities farther apart.
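A minimal sketch of how these two objectives are typically combined in PyTorch is shown below; the margin, loss weighting, and triplet sampling strategy are assumptions rather than the paper's reported settings.

```python
import torch
import torch.nn as nn

# Hypothetical hyperparameters; DAVLT's actual margin, weighting, and mining strategy may differ.
id_loss_fn = nn.CrossEntropyLoss()
triplet_loss_fn = nn.TripletMarginLoss(margin=0.3)

def reid_loss(logits, labels, anchor, positive, negative, w_tri: float = 1.0):
    """Combined objective: ID (cross-entropy) loss on identity logits plus triplet loss on embeddings."""
    id_loss = id_loss_fn(logits, labels)                      # identity classification term
    tri_loss = triplet_loss_fn(anchor, positive, negative)    # metric-learning term on the embedding space
    return id_loss + w_tri * tri_loss
```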


Experimental Results

The study conducted extensive experiments on three benchmark datasets—Market1501, MSMT17, and DukeMTMC—to validate the proposed method.

Performance Comparison

On the Market1501 dataset, DAVLT achieved a mAP of 91.1% and a Rank-1 accuracy of 96.3%, outperforming various existing methods such as TransReID and CLIP-ReID. On the MSMT17 dataset, DAVLT achieved a mAP of 71.7% and a Rank-1 accuracy of 87.6%, further demonstrating its efficacy.

Ablation Studies

Several ablation studies were conducted to evaluate the contributions of individual components:

1. ITDA Module: Adding the ITDA module improved mAP by 2.2% and Rank-1 accuracy by 1.1% on the Market1501 dataset.
2. Adapter Module: Incorporating the adapter module led to a 0.4% improvement in mAP and a 0.3% improvement in Rank-1 accuracy on Market1501.
3. Feature Combination: Concatenating the image-text attention features with the original features yielded the best results, outperforming both plain addition and weighted addition (the three fusion strategies are sketched after this list).
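For concreteness, the three fusion strategies compared in the feature-combination ablation can be written as follows; the tensor names and the 768-dimensional feature size are illustrative assumptions.

```python
import torch

# Illustrative tensors: image-text attention features and original image features.
f_attn = torch.randn(8, 768)   # batch of 8, assumed feature dimension 768
f_orig = torch.randn(8, 768)

fused_concat = torch.cat([f_attn, f_orig], dim=-1)       # concatenation, shape (8, 1536) (best in the ablation)
fused_add = f_attn + f_orig                              # element-wise addition, shape (8, 768)
alpha = 0.5                                              # hypothetical mixing weight
fused_weighted = alpha * f_attn + (1 - alpha) * f_orig   # weighted addition, shape (8, 768)
```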


Significance and Limitations

This study introduces several innovations:

1. ITDA Module: Dynamically integrates image and text features, effectively reducing noise.
2. Adapter Module: Mitigates domain discrepancies, improving model adaptability.
3. State-of-the-Art Performance: Achieves superior results on multiple datasets, demonstrating the potential of multimodal fusion in ReID tasks.

However, limitations persist. The model struggles with low-resolution images and highly similar appearances among individuals. For instance, when several individuals wear similar clothing, the model fails to identify distinctive features, leading to incorrect retrievals. Future work could focus on enhancing textual descriptions, employing more advanced neural architectures, and exploring multi-scale feature representations to improve performance further.


Conclusion

This paper proposed a novel Dynamic Attention Vision-Language Transformer Network (DAVLT) for person re-identification. By integrating an Image Encoder, Text Encoder, Adapter Module, and ITDA Module, the network captures rich semantic representations from images and text. Extensive experiments on Market1501, MSMT17, and DukeMTMC demonstrated the method’s superior performance, achieving state-of-the-art results. This study underscores the potential of leveraging language modality for feature integration in ReID tasks, providing a foundation for future research.

Acknowledgments: Supported by the National Natural Science Foundation of China (Grant Nos. 62361029, 62441203, 62311530101, and 62132006).

Data Availability: The datasets used are publicly available:
- Market1501: https://doi.org/10.1109/iccv.2015.133
- MSMT17: https://doi.org/10.1109/cvpr.2018.00016
- DukeMTMC: https://doi.org/10.1007/978-3-319-48881-3_2

Conflicts of Interest: None declared.