Dual-Space Video Person Re-Identification

Background

Person Re-Identification (ReID) technology aims to identify specific individuals across images or video sequences captured by different cameras. In recent years, with the rapid development of deep learning, ReID has shown great application potential in areas such as urban security, missing-person searches, and suspect tracking. However, existing ReID methods rely primarily on Euclidean space for feature representation learning, which runs into difficulties in complex scenarios involving occlusion, background clutter, and the modeling of intricate spatiotemporal relationships.

To address these issues, a research team from Chongqing University of Posts and Telecommunications proposed a new framework called “Dual-Space Video Person Re-Identification” (DS-VReID). This framework is the first to introduce hyperbolic space into video person re-identification, combining the advantages of Euclidean and hyperbolic spaces to capture visual features and hierarchical relationships more effectively and thereby improve recognition performance. The significance of this research lies in exploring the value of non-Euclidean geometry for computer vision and in offering a new approach to person re-identification in complex scenarios.

Research Source

This study was completed by the research team from the Key Laboratory of Image Cognition, Chongqing University of Posts and Telecommunications, and the Chongqing Institute for Brain and Intelligence. The first author of the paper is Jiaxu Leng, and the corresponding author is Professor Xinbo Gao. The paper was published in the International Journal of Computer Vision (IJCV), received on January 6, 2025, with the DOI 10.1007/s11263-025-02350-5.

Research Content and Methods

a) Research Workflow

The DS-VReID framework mainly includes three modules: Dynamic Prompt Graph Construction (DPGC), Hyperbolic Disentangled Aggregation (HDA), and Dual-Space Fusion (DSF). Below are the specific workflows for each module:

1. Dynamic Prompt Graph Construction (DPGC)

The goal of the DPGC module is to extract human body regions from videos and construct a human skeleton graph. The specific workflow is as follows (a minimal sketch of the prompt-matching step is given after this list):

  - Input Data: The study uses video sequences from the MARS dataset, with each video clip containing 8 frames at a resolution of 256×128.
  - Feature Extraction: Video frames are first fed into a pre-trained CLIP model (Radford et al., 2021) to extract visual features. The CLIP model combines text descriptions of the whole body and of its parts (e.g., “person,” “head,” “torso”) with dynamic prompts to locate human body regions.
  - Coarse-to-Fine Strategy: The DPGC module adopts a coarse-to-fine feature extraction strategy: it first locates the entire human body using a global description (e.g., “a person”), then extracts specific local features using local descriptions (e.g., “a part of a person’s head”).
  - Graph Construction: The extracted local body regions serve as graph nodes, the relationships between nodes serve as edges, and together they form a human skeleton graph.
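
Below is a minimal sketch of the coarse-to-fine prompt-matching idea, not the authors’ implementation: CLIP text embeddings for one global description and several part-level descriptions are compared against patch-level visual features to obtain a foreground mask and per-part attention, whose weighted pooling yields the node features of the skeleton graph. The part list, the exact prompt wording, and the way patch features are obtained from the backbone are illustrative assumptions.

```python
# Sketch of DPGC-style coarse-to-fine prompt matching (illustrative, not the
# authors' code). Patch-level features are assumed to come from the patch
# tokens of CLIP's visual encoder, already projected to the text embedding dim.
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Global description locates the whole body; local descriptions locate parts.
parts = ["head", "torso", "arm", "leg"]              # illustrative part set
global_prompt = ["a photo of a person"]
local_prompts = [f"a part of a person's {p}" for p in parts]

with torch.no_grad():
    tokens = clip.tokenize(global_prompt + local_prompts).to(device)
    text_emb = model.encode_text(tokens)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)   # (1 + P, D)

# Hypothetical patch features for T frames with N patches each (random here).
T, N, D = 8, 49, text_emb.shape[-1]
patch_feat = torch.randn(T, N, D, device=device)
patch_feat = patch_feat / patch_feat.norm(dim=-1, keepdim=True)

# Coarse step: similarity with the global prompt gives a foreground mask.
sim = patch_feat.float() @ text_emb.float().t()                 # (T, N, 1 + P)
fg_mask = sim[..., 0].softmax(dim=-1)                           # person vs. background

# Fine step: part prompts, gated by the foreground mask, give per-part
# attention; attention-weighted pooling gives one node feature per part/frame.
part_attn = (sim[..., 1:] * fg_mask.unsqueeze(-1)).softmax(dim=1)         # (T, N, P)
node_feat = torch.einsum("tnp,tnd->tpd", part_attn, patch_feat.float())   # (T, P, D)
print(node_feat.shape)  # graph nodes: T frames x P parts x D dims
```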

2. Hyperbolic Disentangled Aggregation (HDA)

The HDA module aims to solve the problem of long-range dependency modeling in hyperbolic space. Its core idea is to decouple the adjacency matrix into submatrices of different orders and to progressively aggregate spatiotemporal information using a sliding time window strategy. The specific steps are as follows (a minimal sketch is given after this list):

  - Spatial Domain Processing: Based on the distance between nodes, compute the k-order adjacency matrices (A_k), assigning uniform weights to nodes at the same distance.
  - Temporal Domain Processing: Select frames within a specific time window for aggregation, gradually integrating information across the entire video sequence.
  - Hyperbolic GCN Operation: Perform graph convolution operations in hyperbolic space to capture detailed spatiotemporal hierarchical relationships.
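
The sketch below illustrates, under stated assumptions (curvature c = 1, a toy 5-node skeleton, simple average pooling over the time window; the names hop_matrices and hyperbolic_gcn are made up for illustration), how an adjacency matrix can be decoupled into hop-distance submatrices A_k and how a graph convolution can be routed through the tangent space of a Poincaré ball. It is not the authors’ implementation.

```python
# Hyperbolic disentangled aggregation, minimal sketch (illustrative).
import torch
import torch.nn.functional as F

def exp0(v, c=1.0, eps=1e-6):
    """Exponential map at the origin of a Poincare ball with curvature c."""
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(c**0.5 * norm) * v / (c**0.5 * norm)

def log0(x, c=1.0, eps=1e-6):
    """Logarithmic map at the origin (inverse of exp0)."""
    norm = x.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.atanh((c**0.5 * norm).clamp(max=1 - 1e-5)) * x / (c**0.5 * norm)

def hop_matrices(adj, max_k=3):
    """Decouple a 0/1 adjacency into A_1..A_max_k, where A_k connects nodes at
    hop distance exactly k; each order is row-normalized (uniform weights)."""
    n = adj.shape[0]
    reach, power, mats = torch.eye(n), torch.eye(n), []
    for _ in range(max_k):
        power = (power @ adj).clamp(max=1)
        a_k = ((power > 0) & (reach == 0)).float()
        a_k.fill_diagonal_(0)
        mats.append(a_k / a_k.sum(-1, keepdim=True).clamp_min(1))
        reach = (reach + power).clamp(max=1)
    return mats

def hyperbolic_gcn(x_hyp, adj, weight, c=1.0, window=3):
    """x_hyp: (T, V, D) node features on the ball. Aggregates k-order
    neighbours spatially, then a sliding window of `window` frames temporally."""
    x = log0(x_hyp, c)                                       # ball -> tangent space
    out = torch.zeros_like(x @ weight)
    for a_k in hop_matrices(adj):                            # disentangled orders
        out = out + torch.einsum("vw,twd->tvd", a_k, x @ weight)
    pad = window // 2                                        # temporal sliding window
    xp = F.pad(out.permute(1, 2, 0), (pad, pad), mode="replicate")
    out = F.avg_pool1d(xp, window, stride=1).permute(2, 0, 1)
    return exp0(out, c)                                      # tangent space -> ball

# Toy usage: 8 frames, a 5-node skeleton, 64-dim features.
adj = torch.tensor([[0, 1, 1, 0, 0], [1, 0, 0, 1, 0], [1, 0, 0, 0, 1],
                    [0, 1, 0, 0, 0], [0, 0, 1, 0, 0]], dtype=torch.float)
x = exp0(0.1 * torch.randn(8, 5, 64))
w = 0.05 * torch.randn(64, 64)
print(hyperbolic_gcn(x, adj, w).shape)  # torch.Size([8, 5, 64])
```

The `window` argument here plays the role of the time-window size τ discussed in the results below.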

3. Dual-Space Fusion (DSF)

The DSF module fuses feature representations from Euclidean and hyperbolic spaces to fully utilize the strengths of both. The specific steps are as follows (see the sketch after this list):

  - Map the features from hyperbolic space back to the tangent space.
  - Perform a weighted fusion of the two types of features in the tangent space to obtain the final feature representation.
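
A minimal sketch of this fusion step is given below, assuming the standard logarithmic map at the origin of a Poincaré ball and a single learnable weight for the fusion (the actual weighting scheme used in the paper is not detailed here); the class name DualSpaceFusion is made up for illustration.

```python
# Dual-space fusion, minimal sketch (illustrative, not the authors' code).
import torch
import torch.nn as nn

class DualSpaceFusion(nn.Module):
    def __init__(self, c=1.0):
        super().__init__()
        self.c = c
        self.alpha = nn.Parameter(torch.tensor(0.0))    # learnable fusion weight

    def log0(self, x, eps=1e-6):
        # Logarithmic map at the origin: hyperbolic (Poincare ball) -> tangent space.
        norm = x.norm(dim=-1, keepdim=True).clamp_min(eps)
        return torch.atanh((self.c**0.5 * norm).clamp(max=1 - 1e-5)) * x / (self.c**0.5 * norm)

    def forward(self, feat_euc, feat_hyp):
        feat_tan = self.log0(feat_hyp)                  # map hyperbolic features back
        a = torch.sigmoid(self.alpha)                   # keep the weight in (0, 1)
        return a * feat_euc + (1 - a) * feat_tan        # weighted fusion in tangent space

fuse = DualSpaceFusion()
f_euc = torch.randn(4, 512)                             # Euclidean-branch features
f_hyp = 0.02 * torch.randn(4, 512)                      # hyperbolic-branch features (on the ball)
print(fuse(f_euc, f_hyp).shape)                         # torch.Size([4, 512])
```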

b) Main Results

1. Effectiveness of the DPGC Module

Experiments show that the DPGC module significantly improves recognition performance. On the MARS dataset, the baseline model alone achieved an mAP of 82.1% and a Rank-1 accuracy of 88.5%; adding the DPGC module raised these metrics by 3.6 and 1.8 percentage points, respectively. This indicates that the DPGC module can effectively suppress background noise and focus on pedestrian-related regions.

2. Effectiveness of the HDA Module

The HDA module further enhances the model’s performance. On the MARS dataset, adding the HDA module improved mAP and Rank-1 accuracy by a further 1.7 and 1.0 percentage points, respectively. The experiments also found that smaller time windows (e.g., τ=[3,3]) better match the distance characteristics of hyperbolic space and thereby enhance performance.

3. Effectiveness of the DSF Module

The dual-space fusion module (DSF) combines feature representations from Euclidean and hyperbolic spaces, significantly improving the overall performance of the model. On the MARS dataset, the final mAP and Rank-1 accuracy of DS-VReID reached 87.6% and 92.3%, respectively, surpassing existing state-of-the-art methods.
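
Accumulating the increments reported above gives the following approximate progression on the MARS dataset (the intermediate rows are derived from the stated gains; the baseline and final rows are as reported):

  Configuration             mAP (%)   Rank-1 (%)
  Baseline                  82.1      88.5
  + DPGC                    85.7      90.3
  + DPGC + HDA              87.4      91.3
  DS-VReID (full, + DSF)    87.6      92.3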

c) Research Conclusions

The DS-VReID framework successfully addresses person re-identification in complex scenarios by combining the advantages of Euclidean and hyperbolic spaces. The method not only achieved the best performance on multiple datasets such as MARS, LS-VID, and DukeMTMC-VideoReID, but also performed well on the iLIDS-VID and PRID2011 datasets. These results demonstrate the superiority of DS-VReID in capturing visual features and hierarchical relationships.

d) Research Highlights

  1. Innovation: Hyperbolic space is introduced into video person re-identification for the first time, and the concept of dual-space fusion is proposed.
  2. Practicality: The DPGC module effectively reduces the impact of background noise through dynamic prompts and a coarse-to-fine strategy.
  3. Technical Breakthrough: The HDA module solves the challenge of long-range dependency modeling in hyperbolic space, significantly improving model performance.
  4. Comprehensiveness: Through multi-module collaboration, efficient modeling of complex scenarios is achieved.

e) Other Valuable Information

The research team conducted extensive ablation experiments to verify the effectiveness of each module. For example, different text prompt designs significantly affect performance, with the prompt “a {cls} part of a person” achieving the best results. Additionally, experiments showed that dynamic prompts play a key role in capturing subtle changes and dynamic information in videos.
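
For concreteness, the best-performing template can be instantiated for a few illustrative part names as follows (the full part vocabulary used in the paper is not listed here):

```python
# The "a {cls} part of a person" template reported as best in the ablation,
# filled with an illustrative set of part names.
template = "a {cls} part of a person"
parts = ["head", "torso", "leg"]                 # illustrative; not the full set
prompts = [template.format(cls=p) for p in parts]
print(prompts)  # ['a head part of a person', 'a torso part of a person', 'a leg part of a person']
```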

Research Significance and Value

The DS-VReID framework not only provides a new solution for the field of person re-identification but also demonstrates the potential application value of non-Euclidean geometry in computer vision. This method performs exceptionally well in handling complex scenarios such as occlusions and background clutter, offering broad application prospects in areas like urban security monitoring, intelligent transportation systems, and large-scale crowd analysis. Furthermore, this research lays the foundation for future exploration of hyperbolic space applications in other computer vision tasks.