Adaptive Middle-Modality Alignment Learning for Visible-Infrared Person Re-Identification
Background and Problem Statement
Driven by the needs of intelligent surveillance systems, visible-infrared person re-identification (VIReID) has become a prominent research topic. The task aims to achieve around-the-clock person recognition by matching pedestrian images across the visible and infrared modalities. Because the two modalities originate from different spectral bands, their images differ significantly in illumination, texture, and color, which makes cross-modality matching highly challenging.
Traditional methods often address these issues using complex generative adversarial networks (GANs) or deep learning models, but these approaches face several limitations:
- Lack of adaptability to the varying modality differences between images.
- Significant discrepancies between generated and real images.
- High complexity, making these methods difficult to deploy in practice.
To tackle these challenges, this paper proposes an Adaptive Middle-Modality Alignment Learning (AMML) approach. By generating and aligning a middle modality at both image and feature levels, AMML dynamically reduces modality discrepancies and significantly improves VIReID performance.
Research Source and Publication Information
This research was conducted by Yukang Zhang, Yan Yan, Yang Lu, and Hanzi Wang, affiliated with the Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Xiamen University, and the Fujian Key Laboratory of Sensing and Computing for Smart City. The study was published in 2024 in the International Journal of Computer Vision, titled “Adaptive Middle-Modality Alignment Learning for Visible-Infrared Person Re-identification.” DOI: 10.1007/s11263-024-02276-4.
Methodology and Framework
1. Framework Overview
The AMML method consists of three main modules:
1. Adaptive Middle-Modality Generator (AMG): Generates middle-modality images at the image level to establish a unified middle-modality image space between visible and infrared images.
2. Adaptive Distribution Alignment Loss (ADA): Forces the distribution of visible and infrared features to align with the middle-modality features at the feature level.
3. Center-Based Diverse Distribution Learning Loss (CDDL): Learns diverse distributions among the three modalities while further reducing modality discrepancies.
The entire framework uses a ResNet50-based backbone and integrates these components into a lightweight and efficient end-to-end learning network.
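For concreteness, the sketch below shows one plausible way to wire these three components around a shared ResNet-50 encoder in PyTorch. It is a minimal illustration, not the authors' implementation: the generator interface, the classifier head, and the way features are returned for the ADA/CDDL losses are all assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class AMMLNet(nn.Module):
    """Illustrative wiring of the AMML pipeline (not the authors' code).

    `amg` is any image-to-image module that maps inputs into the middle-modality
    space; the ADA and CDDL losses are applied to the features that the shared
    ResNet-50 encoder extracts from the visible, infrared, and middle images.
    """

    def __init__(self, amg: nn.Module, num_ids: int, feat_dim: int = 2048):
        super().__init__()
        self.amg = amg
        backbone = models.resnet50(weights="IMAGENET1K_V1")
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # drop the FC head
        self.classifier = nn.Linear(feat_dim, num_ids)

    def forward(self, x_vis, x_ir):
        # Image level: generate middle-modality counterparts of both inputs.
        x_mid = self.amg(torch.cat([x_vis, x_ir], dim=0))
        # Feature level: a single shared encoder embeds all three modalities.
        feats = self.encoder(torch.cat([x_vis, x_ir, x_mid], dim=0)).flatten(1)
        f_vis, f_ir, f_mid = feats.split([x_vis.size(0), x_ir.size(0), x_mid.size(0)], dim=0)
        logits = self.classifier(feats)
        return (f_vis, f_ir, f_mid), logits
```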
2. Adaptive Middle-Modality Generator (AMG)
The AMG module generates middle-modality images through the following steps:
- Uses a set of 1×1 convolution layers to project visible and infrared images into a single-channel grayscale image space.
- Applies non-linear transformations to the grayscale images for closer alignment.
- Reconstructs the transformed grayscale images into three-channel middle-modality images (UMMI) using parameter-shared convolutions.
An adaptive MixUp strategy is also proposed to dynamically fuse modality factors, improving modality alignment.
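The short PyTorch sketch below illustrates these image-level steps under simple assumptions: plain 1×1 convolutions, a ReLU as the non-linear transformation, and a scalar modality factor for the MixUp. The actual layer configuration and how the factor is obtained are not specified here.

```python
import torch
import torch.nn as nn

class AdaptiveMiddleModalityGenerator(nn.Module):
    """Rough sketch of the AMG steps listed above (layer details are assumptions)."""

    def __init__(self):
        super().__init__()
        self.to_gray = nn.Conv2d(3, 1, kernel_size=1)    # project into a single-channel space
        self.transform = nn.Sequential(                  # non-linear transformation in that space
            nn.Conv2d(1, 1, kernel_size=1),
            nn.ReLU(inplace=True),
        )
        # The same reconstruction convolution is applied to both modalities,
        # which is how this sketch realises "parameter-shared" reconstruction.
        self.to_middle = nn.Conv2d(1, 3, kernel_size=1)

    def forward(self, x):
        g = self.transform(self.to_gray(x))
        return self.to_middle(g)


def adaptive_mixup(x, x_mid, m):
    """Toy version of the adaptive MixUp idea: blend an image with its middle-modality
    counterpart using a modality factor m in [0, 1] (how m is learned is not shown)."""
    return m * x_mid + (1.0 - m) * x
```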
3. Adaptive Distribution Alignment Loss (ADA)
At the feature level, ADA dynamically reduces modality discrepancies by aligning the distributions of visible and infrared features with the middle-modality features. The ADA loss is formulated as:
$$ L_{\text{ADA}} = \frac{1}{N} \sum_{i=1}^{N} \left[ m_v \cdot \left\| f_{vis}^{i} - f_{m}^{i} \right\| + m_n \cdot \left\| f_{nir}^{i} - f_{m}^{i} \right\| \right] $$
Here, $m_v$ and $m_n$ are modality factors measuring the discrepancies between each modality and the middle modality, $f_{vis}^{i}$, $f_{nir}^{i}$, and $f_{m}^{i}$ are the visible, infrared, and middle-modality features of the $i$-th sample, and $N$ is the batch size.
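A direct transcription of this loss into PyTorch might look as follows; the choice of the Euclidean norm and the way the modality factors are supplied are assumptions.

```python
import torch

def ada_loss(f_vis: torch.Tensor, f_ir: torch.Tensor, f_mid: torch.Tensor,
             m_v: float, m_n: float) -> torch.Tensor:
    """L_ADA as written above, for aligned feature batches of shape (N, D).

    m_v and m_n are the adaptive modality factors; |.| is interpreted here as
    the per-sample Euclidean norm.
    """
    dist_v = (f_vis - f_mid).norm(dim=1)   # distance of visible features to the middle modality
    dist_n = (f_ir - f_mid).norm(dim=1)    # distance of infrared features to the middle modality
    return (m_v * dist_v + m_n * dist_n).mean()
```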
4. Center-Based Diverse Distribution Learning Loss (CDDL)
CDDL learns diverse feature distributions by:
- Positive constraints: Minimizing intra-class distances between centers of different modalities for the same identity.
- Negative constraints: Maximizing inter-class distances between feature centers of different identities.
The positive component of the loss is defined as:
$$ L_{\text{CDDL}}^{\text{pos}} = \sum_{i=1}^{N} \left[ \max\left(0, \alpha + d(c_{v}^{i}, c_{n}^{i}) - d(c_{v}^{i}, c_{m}^{i})\right) + \max\left(0, \alpha + d(c_{n}^{i}, c_{v}^{i}) - d(c_{n}^{i}, c_{m}^{i})\right) \right] $$

Here, $c_{v}^{i}$, $c_{n}^{i}$, and $c_{m}^{i}$ denote the visible, infrared, and middle-modality feature centers of the $i$-th identity, $d(\cdot, \cdot)$ is a distance function, and $\alpha$ is a margin.
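The snippet below transcribes this positive term for per-identity feature centers; the distance metric and the margin value are assumptions.

```python
import torch
import torch.nn.functional as F

def cddl_positive(c_v: torch.Tensor, c_n: torch.Tensor, c_m: torch.Tensor,
                  alpha: float = 0.3) -> torch.Tensor:
    """Positive CDDL term transcribed from the formula above.

    c_v, c_n, c_m: (num_ids, D) visible, infrared, and middle-modality feature
    centers of the same identities; alpha is an assumed margin value.
    """
    def d(a, b):
        # Euclidean distance between paired centers (the metric is an assumption).
        return (a - b).norm(dim=1)

    # max(0, alpha + d(c_v, c_n) - d(c_v, c_m)) + max(0, alpha + d(c_n, c_v) - d(c_n, c_m))
    term_v = F.relu(alpha + d(c_v, c_n) - d(c_v, c_m))
    term_n = F.relu(alpha + d(c_n, c_v) - d(c_n, c_m))
    return (term_v + term_n).sum()
```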
5. Multi-Loss Joint Optimization
AMML optimizes the following combined loss function during training:
$$ L_{\text{total}} = L_{\text{global}} + \lambda_1 L_{\text{local}} $$
The global and local losses each integrate cross-entropy, triplet, ADA, and CDDL losses, and $\lambda_1$ balances the two branches.
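Put together, the overall objective can be assembled as in the small helper below; equal per-term weights inside each branch are an assumption, since only $\lambda_1$ is given in the text.

```python
def amml_total_loss(ce_g, tri_g, ada_g, cddl_g,
                    ce_l, tri_l, ada_l, cddl_l, lambda_1: float = 1.0):
    """L_total = L_global + lambda_1 * L_local, with each branch summing its
    cross-entropy, triplet, ADA, and CDDL terms (equal weights assumed)."""
    l_global = ce_g + tri_g + ada_g + cddl_g
    l_local = ce_l + tri_l + ada_l + cddl_l
    return l_global + lambda_1 * l_local
```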
Experiments and Results
Datasets and Evaluation Metrics
The proposed method was validated on three public datasets:
- SYSU-MM01: Contains 491 identities with images captured by six cameras (visible and infrared).
- RegDB: Includes 412 identities with paired visible and infrared images.
- LLCM: A nighttime dataset with low-light conditions.
Evaluation metrics include the Cumulative Matching Characteristics (CMC) curve and mean Average Precision (mAP).
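For reference, a minimal NumPy routine for these two metrics is sketched below; it assumes a plain single-gallery protocol and omits the camera-based gallery filtering and repeated gallery sampling used in the official SYSU-MM01 evaluation.

```python
import numpy as np

def cmc_map(dist: np.ndarray, q_ids: np.ndarray, g_ids: np.ndarray, max_rank: int = 20):
    """dist: (num_query, num_gallery) distance matrix; q_ids / g_ids: identity labels.
    Assumes every query has at least one correct gallery match."""
    num_q = dist.shape[0]
    cmc = np.zeros(max_rank)
    aps = []
    for i in range(num_q):
        order = np.argsort(dist[i])                          # gallery sorted by increasing distance
        matches = (g_ids[order] == q_ids[i]).astype(np.int32)
        # CMC: a query counts as a hit at every rank from its first correct match onward.
        first_hit = int(np.argmax(matches))
        if first_hit < max_rank:
            cmc[first_hit:] += 1
        # AP: mean of the precision values at each correct match position.
        hits = np.cumsum(matches)
        precision_at_hits = hits[matches == 1] / (np.nonzero(matches)[0] + 1)
        aps.append(precision_at_hits.mean())
    return cmc / num_q, float(np.mean(aps))
```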
Key Results
AMML outperforms state-of-the-art methods across all datasets:
- SYSU-MM01: Achieves a Rank-1 accuracy of 77.8% and an mAP of 74.8% in the all-search mode.
- RegDB: Obtains a Rank-1 accuracy of 94.9% and an mAP of 87.8% in the visible-to-infrared mode.
Compared with complex multi-branch models (e.g., MRCN) and methods that require additional pre-trained models (e.g., SEFEL), AMML achieves these results with a simpler architecture and stronger generalization.
Research Contributions and Significance
- Scientific Value: The AMML framework introduces a novel image-feature joint optimization strategy for cross-modality learning.
- Application Value: AMML has significant potential for use in intelligent surveillance systems, especially in diverse and challenging scenarios.
Conclusion
Centered on middle-modality learning, AMML effectively reduces modality discrepancies at both the image and feature levels, offering a lightweight solution for VIReID tasks. Future work could explore its extension to other cross-modality applications, such as multispectral imaging and multimodal semantic understanding.