Sample-Cohesive Pose-Aware Contrastive Facial Representation Learning

Enhancing Pose Awareness in Self-Supervised Facial Representation Learning

Research Background and Problem Statement

Facial representation learning is a core task in computer vision. Facial images carry information about identity, emotion, and pose, which supports downstream tasks such as facial expression recognition (FER), face recognition (FR), and head pose estimation (HPE). In recent years, deep convolutional neural networks (DCNNs) have achieved remarkable results on these tasks. However, such methods typically rely on supervised learning over large-scale annotated datasets, which are costly to label by hand, and the resulting models may not generalize well to other datasets.

To address this limitation, self-supervised learning (SSL) has emerged as a promising alternative. In particular, contrastive learning (CL) methods learn meaningful visual representations without explicit annotations by constructing positive and negative sample pairs. Existing CL methods nevertheless fall short when handling facial pose variations. First, they tend to learn pose-insensitive features, which can discard useful pose details. Second, because positive and negative pairs are selected imperfectly, they can introduce false-negative pairs, that is, samples that are semantically similar but are treated as negatives, which degrades what the model learns.

To tackle these issues, the authors propose a novel framework—Pose-Disentangled Contrastive Facial Representation Learning (PCFRL)—aiming to enhance pose awareness in self-supervised facial representation learning and improve the effectiveness of contrastive learning through more appropriate sample calibration strategies.


Source of the Paper

This paper was authored by Yuanyuan Liu, Shaoze Feng, Zhe Chen, et al., from institutions including China University of Geosciences (Wuhan), Yunnan United Vision Technology Co., Ltd., Yunnan University, and La Trobe University in Australia. It appeared in the International Journal of Computer Vision (IJCV); the manuscript was received on January 6, 2025, and its DOI is 10.1007/s11263-025-02348-z.


Research Content and Methods

a) Research Workflow and Methodology

The research process in this paper is divided into three main parts: Feature Disentanglement, False-Negative Pair Calibration, and Improved Contrastive Learning Loss Design.

1. Feature Disentanglement

The authors first propose a Pose-Decoupling Decoder (PDD) that separates pose-aware features from non-pose face-aware features. PDD enforces disentanglement through a reconstruction constraint: a facial image under a different pose must be reconstructable from the new pose features combined with the original non-pose features. The disentanglement is driven by two loss functions:
- Reconstruction Loss (\(L_{dis}\)): Measures the difference between the original image and its reconstructed version.
- Orthogonal Loss (\(L_{orth}\)): Ensures that the two types of disentangled features are orthogonal to each other, reducing redundant information.
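To make the two disentanglement terms concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the L1 pixel distance, the absolute-cosine form of the orthogonality penalty, the weight `lambda_orth`, and all function names are assumptions chosen for illustration, since the summary above does not give the exact formulas.

```python
import math

def reconstruction_loss(original, reconstructed):
    """Mean absolute difference between two flattened images (assumed L1 form)."""
    assert len(original) == len(reconstructed)
    return sum(abs(a - b) for a, b in zip(original, reconstructed)) / len(original)

def orthogonal_loss(pose_feat, face_feat):
    """Absolute cosine similarity; it is 0 when the two feature vectors are orthogonal."""
    dot = sum(p * f for p, f in zip(pose_feat, face_feat))
    norm_p = math.sqrt(sum(p * p for p in pose_feat))
    norm_f = math.sqrt(sum(f * f for f in face_feat))
    return abs(dot) / (norm_p * norm_f + 1e-12)

# Toy inputs (hypothetical values, not taken from the paper).
pose_feat = [1.0, 0.0, 0.0]     # pose-aware feature
face_feat = [0.0, 2.0, 0.0]     # non-pose face-aware feature
image = [0.2, 0.4, 0.6, 0.8]    # flattened target image
recon = [0.1, 0.4, 0.7, 0.8]    # flattened decoder output

l_dis = reconstruction_loss(image, recon)
l_orth = orthogonal_loss(pose_feat, face_feat)

lambda_orth = 0.5               # hypothetical trade-off weight
total = l_dis + lambda_orth * l_orth
print(l_dis, l_orth, total)
```

In this toy case the two features are exactly orthogonal, so only the reconstruction term contributes to the total.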

2. False-Negative Pair Calibration

After feature disentanglement, the authors observed that directly applying traditional CL methods could lead to false-negative pair issues. For instance, two images with the same pose but belonging to different individuals might be incorrectly selected as negative pairs. To resolve this issue, the authors proposed a method based on Neighborhood-Cohesive Pair Alignment (NPA) to identify and calibrate false-negative pairs. The NPA method combines cosine similarity and neighborhood sample consistency scores, dynamically adjusting the calibration of false-negative pairs through a threshold mechanism.
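The idea of combining pairwise similarity with neighborhood agreement can be sketched as follows. This is an illustrative assumption, not the NPA algorithm from the paper: the Jaccard overlap of top-k neighbor sets stands in for the neighborhood consistency score, the mixing weight `alpha` and the fixed `threshold` are hypothetical, and the paper's threshold mechanism is described as dynamic rather than fixed.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def top_k_neighbors(index, feats, k):
    """Indices of the k samples most similar to feats[index], excluding itself."""
    scored = [(cosine(feats[index], f), j) for j, f in enumerate(feats) if j != index]
    scored.sort(reverse=True)
    return {j for _, j in scored[:k]}

def neighborhood_consistency(i, j, feats, k):
    """Jaccard overlap of the two neighbor sets, ignoring i and j themselves."""
    ni = top_k_neighbors(i, feats, k) - {j}
    nj = top_k_neighbors(j, feats, k) - {i}
    union = ni | nj
    return len(ni & nj) / len(union) if union else 0.0

def find_false_negatives(anchor, feats, k=2, alpha=0.5, threshold=0.7):
    """Flag samples whose combined score with the anchor exceeds the threshold."""
    flagged = []
    for j in range(len(feats)):
        if j == anchor:
            continue
        score = (alpha * cosine(feats[anchor], feats[j])
                 + (1 - alpha) * neighborhood_consistency(anchor, j, feats, k))
        if score > threshold:
            flagged.append(j)
    return flagged

# Two toy clusters of 2-D features (hypothetical data).
feats = [[1.0, 0.0], [0.9, 0.1], [0.9, -0.1],
         [0.0, 1.0], [0.1, 0.9], [-0.1, 0.9]]
flagged = find_false_negatives(0, feats)
print(flagged)
```

Samples that are both close to the anchor and share its neighbors are flagged, while samples from the other cluster are kept as true negatives. A score based on cosine similarity alone would not use the shared-neighbor evidence.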

3. Improved Contrastive Learning Loss Design

To further optimize the learning of calibrated sample pairs, the authors designed two new contrastive learning loss functions:
- Calibrated Pose-Aware CL Loss (\(L'_p\))
- Calibrated Face-Aware CL Loss (\(L'_f\))

These two loss functions dynamically optimize the calibrated sample pairs through an adaptive weighting strategy, thereby enhancing the model’s robustness and generalization ability.
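One way a calibrated contrastive loss with per-pair weights could look is sketched below. This is a hedged illustration built on the standard InfoNCE loss, not the paper's formulation: the function name, the choice to drop flagged false negatives from the negative set and re-add them as weighted positives, and the weight values are all assumptions.

```python
import math

def calibrated_info_nce(sim_pos, sim_others, fn_weights, temperature=0.1):
    """InfoNCE variant for one anchor.

    sim_pos    : similarity between the anchor and its augmented view
    sim_others : similarities between the anchor and the other batch samples
    fn_weights : one weight per entry of sim_others; 0.0 marks a true negative,
                 a value in (0, 1] marks a flagged false negative that is
                 treated as an extra positive with that weight (assumed scheme)
    """
    # Only true negatives remain in the denominator.
    neg_sum = sum(math.exp(s / temperature)
                  for s, w in zip(sim_others, fn_weights) if w == 0.0)

    def term(s):
        e = math.exp(s / temperature)
        return -math.log(e / (e + neg_sum))

    loss = term(sim_pos)
    weight_total = 1.0
    for s, w in zip(sim_others, fn_weights):
        if w > 0.0:
            loss += w * term(s)
            weight_total += w
    return loss / weight_total

# Toy similarities (hypothetical): the first "other" sample is a false negative.
sim_pos = 0.9
sim_others = [0.8, 0.1]

standard = calibrated_info_nce(sim_pos, sim_others, [0.0, 0.0])
calibrated = calibrated_info_nce(sim_pos, sim_others, [0.8, 0.0])
print(standard, calibrated)
```

With all weights set to zero the function reduces to ordinary InfoNCE. Once the highly similar sample is recalibrated, the anchor is no longer penalized for being close to it, so the loss on this toy input drops.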


b) Main Results

1. Effectiveness of Feature Disentanglement

Experiments show that the PDD module can effectively separate pose-aware features from non-pose face-aware features. Using t-SNE visualization, the authors show that the features learned by the PCFRL framework are more discriminative than those of its predecessor, PCL.

2. Effectiveness of False-Negative Pair Calibration

Using the NPA method, the authors successfully identified and calibrated a significant number of false-negative pairs. Compared to methods relying solely on cosine similarity, the NPA method showed notable advantages in calibrating pose-aware and non-pose face-aware false-negative pairs.

3. Performance Improvement in Downstream Tasks

In four downstream tasks (FER, FR, facial action unit detection, and HPE), PCFRL outperformed existing state-of-the-art methods. For example, in the FER task on the RAF-DB dataset, PCFRL achieved an accuracy of 75.68%, an improvement of 1.21 percentage points over PCL; in the FR task on the CPLFW dataset, PCFRL reached an accuracy of 66.17%, surpassing PCL by 2.41 percentage points.


Conclusions and Value

c) Research Conclusions and Significance

The PCFRL framework proposed in this paper significantly enhances the performance of self-supervised facial representation learning through feature disentanglement, false-negative pair calibration, and improved contrastive learning loss design. The results indicate that enhancing pose awareness is crucial for robust facial representation learning.

From a scientific perspective, PCFRL offers a new approach to the false-negative pair problem in self-supervised learning and validates the effectiveness of the NPA method. From an application standpoint, the framework performs well across multiple face-related tasks, providing technical support for practical applications such as intelligent surveillance and human-computer interaction.


d) Research Highlights

  1. Innovative Workflow: PCFRL is the first to combine feature disentanglement with false-negative pair calibration, addressing the shortcomings of traditional CL methods in pose awareness.
  2. Novel NPA Method: By comprehensively considering neighborhood sample relationships, the NPA method can more accurately identify false-negative pairs.
  3. Improved Contrastive Learning Loss: The adaptive weighting strategy enables the model to optimize calibrated sample pairs more effectively.

Summary

This paper, authored by Yuanyuan Liu et al., proposes PCFRL, a novel self-supervised framework that improves facial representation learning by enhancing pose awareness. The research not only addresses the false-negative pair problem in traditional CL methods but also offers a useful reference for applying self-supervised learning to face-related tasks. In future work, the authors plan to explore how physical prior knowledge can be used to handle complex noise and thereby improve the model’s robustness.