Contrastive Decoupled Representation Learning in Speech-Preserving Facial Expression Manipulation
Background Introduction
In recent years, with the rapid development of virtual reality, film and television production, and human-computer interaction technologies, facial expression manipulation has become a research hotspot in computer vision and graphics. Within this area, Speech-Preserving Facial Expression Manipulation (SPFEM) aims to alter a speaker's facial emotional expression while keeping lip movements synchronized with the spoken content. The technology not only enhances the expressiveness of talking-face content but also provides important support for practical applications such as virtual character generation and film post-production.
However, the implementation of SPFEM faces numerous challenges. First, spoken content and emotional information are highly intertwined during natural conversations, making it difficult to effectively separate these two types of information from reference or source videos. Second, existing methods often rely on simple supervision signals (e.g., reference images or 3D face model parameters), which may contain biases that affect the realism and accuracy of the final generated results. Therefore, designing an effective algorithm capable of manipulating emotions while preserving audio-lip synchronization has become an urgent problem to solve.
To address these issues, Tianshui Chen et al. proposed an innovative Contrastive Decoupled Representation Learning (CDRL) algorithm, which learns independent content and emotion representations separately, providing more direct and accurate supervision signals for SPFEM.
Paper Source
This paper was co-first-authored by Tianshui Chen and Jianman Lin, with Zhijing Yang as the corresponding author; the authors are affiliated with Guangdong University of Technology, South China University of Technology, and Sun Yat-sen University. The paper, titled “Contrastive Decoupled Representation Learning and Regularization for Speech-Preserving Facial Expression Manipulation,” was accepted by the International Journal of Computer Vision (IJCV) in January 2025.
Research Details
a) Research Workflow
The core of this study is the design and implementation of a novel CDRL algorithm, which consists of two main modules: Contrastive Content Representation Learning (CCRL) and Contrastive Emotion Representation Learning (CERL). Below are the specific details of the research process:
1. Data Preparation
The research is trained and validated on the MEAD dataset (Multi-view Emotional Audio-Visual Dataset). MEAD contains video data from 60 speakers, each recording 30 videos in seven different emotional states. To construct paired data, the authors used Dynamic Time Warping (DTW) to align two videos that share the same spoken content but differ in emotion, yielding one-to-one training samples; a minimal alignment sketch follows.
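The paper's alignment code is not reproduced here; the following is a minimal NumPy sketch of how DTW can pair frames from two clips of the same sentence, assuming per-frame feature vectors (for instance MFCCs or lip landmarks, an illustrative choice) have already been extracted for both clips.

```python
import numpy as np

def dtw_align(feats_a: np.ndarray, feats_b: np.ndarray):
    """Align two per-frame feature sequences of shape (T_a, D) and (T_b, D).

    Returns a list of (i, j) frame-index pairs matching clip A to clip B.
    The feature choice (MFCCs, lip landmarks, ...) is an assumption made
    for illustration, not the authors' exact pipeline.
    """
    ta, tb = len(feats_a), len(feats_b)
    # Pairwise Euclidean cost between every frame of A and every frame of B.
    cost = np.linalg.norm(feats_a[:, None, :] - feats_b[None, :, :], axis=-1)

    # Accumulated cost with the standard DTW recurrence.
    acc = np.full((ta + 1, tb + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, ta + 1):
        for j in range(1, tb + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )

    # Backtrack from the end to recover the warping path.
    path, i, j = [], ta, tb
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```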
2. Contrastive Content Representation Learning (CCRL)
- Objective: Learn a representation that contains only spoken content information, excluding emotional interference.
- Method:
- Use audio as a content prior, extracting content features from source images through a cross-attention mechanism.
- Introduce an emotion-aware contrastive loss that pulls together positive samples (same spoken content but different emotions) and pushes apart negative samples (different spoken content but the same emotion).
- Audio features are extracted with the pre-trained XLSR model, while image features are obtained by combining ArcFace with a mapping operation.
- Experimental Setup: Training was performed on a GeForce RTX 4090 GPU with the Adam optimizer and an initial learning rate of 0.0001, for 10 epochs (a sketch of this module and its loss follows this step).
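The CCRL module is described above only at a high level. The PyTorch sketch below illustrates the two ingredients it names: a cross-attention step in which audio features (the content prior) query per-frame image features, and an InfoNCE-style emotion-aware contrastive loss that pulls together representations of the same spoken content under different emotions and pushes apart different content. Module structure, feature dimensions, and the temperature are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F
from torch import nn

class AudioVisualCrossAttention(nn.Module):
    """Audio features (content prior) attend over per-frame image features."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio_feats, image_feats):
        # audio_feats: (B, T_a, D), e.g. from a pre-trained XLSR encoder
        # image_feats: (B, T_i, D), e.g. ArcFace features after a mapping layer
        content, _ = self.attn(query=audio_feats, key=image_feats, value=image_feats)
        return content.mean(dim=1)  # (B, D) pooled content representation


def emotion_aware_contrastive_loss(anchor, positive, negatives, temperature: float = 0.07):
    """InfoNCE-style loss over content representations.

    anchor / positive: (B, D) representations of the same sentence spoken
    with different emotions; negatives: (B, N, D) representations of
    different sentences (possibly sharing the anchor's emotion).
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    pos_logit = (anchor * positive).sum(-1, keepdim=True) / temperature       # (B, 1)
    neg_logits = torch.einsum("bd,bnd->bn", anchor, negatives) / temperature  # (B, N)

    logits = torch.cat([pos_logit, neg_logits], dim=1)
    labels = torch.zeros(len(anchor), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)
```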
3. Contrastive Emotion Representation Learning (CERL)
- Objective: Learn a representation that contains only emotional information, excluding spoken content interference.
- Method:
- Utilize a pre-trained vision-language model (CLIP) with prompt tuning to extract emotion priors.
- Introduce an emotion-augmented contrastive loss, selecting images with clearly expressed emotions as training samples.
- Fuse the emotion priors with image features through a dot-product operation to obtain emotion representations.
- Experimental Setup: During training, only the prompt vectors t_i were updated, while all other parameters were kept frozen. The optimizer was Stochastic Gradient Descent (SGD) with an initial learning rate of 0.1, reduced at epochs 2, 4, and 6 (see the sketch below).
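Prompt tuning with a frozen CLIP is likewise described only briefly. The sketch below shows one common way such a module can be realized: a learnable prompt per emotion is passed through a frozen text encoder to produce an emotion prior, which is fused with frozen image features via a dot product, and only the prompts receive gradients, mirroring the SGD setup above. The encoder interfaces, prompt length, and training loss here are assumptions for illustration.

```python
import torch
import torch.nn.functional as F
from torch import nn

class PromptTunedEmotionPrior(nn.Module):
    """Learnable prompt vectors t_i (one per emotion) fed through a frozen
    text encoder; only the prompts receive gradients.

    `text_encoder` and `image_encoder` stand in for the frozen CLIP encoders;
    their exact interfaces here are assumptions for illustration.
    """

    def __init__(self, text_encoder, image_encoder,
                 num_emotions: int = 7, prompt_len: int = 8, dim: int = 512):
        super().__init__()
        self.text_encoder = text_encoder
        self.image_encoder = image_encoder
        for p in list(text_encoder.parameters()) + list(image_encoder.parameters()):
            p.requires_grad_(False)  # keep the pre-trained encoders frozen
        # t_i: learnable context tokens for each emotion class.
        self.prompts = nn.Parameter(torch.randn(num_emotions, prompt_len, dim) * 0.02)

    def emotion_priors(self):
        # The frozen text encoder maps each prompt sequence to one prior vector.
        return F.normalize(self.text_encoder(self.prompts), dim=-1)  # (E, D)

    def forward(self, images):
        img = F.normalize(self.image_encoder(images), dim=-1)         # (B, D)
        # Dot-product fusion of image features with every emotion prior.
        return img @ self.emotion_priors().t()                        # (B, E)


# Training sketch: only the prompt vectors are updated, with SGD at lr=0.1.
# model = PromptTunedEmotionPrior(frozen_text_enc, frozen_image_enc)
# opt = torch.optim.SGD([model.prompts], lr=0.1)
# loss = F.cross_entropy(model(images) / 0.07, emotion_labels)
```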
4. SPFEM Model Training
- During SPFEM model training, the content and emotion representations provided by CDRL were used as additional supervision signals.
- Content representations were used to constrain content consistency between the generated image and the source input, while emotion representations constrained emotion consistency between the generated image and the reference input.
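The paper's exact loss formulation is not reproduced here; below is a minimal sketch of how the two frozen CDRL encoders could supply the additional supervision terms, assuming cosine-similarity consistency losses and illustrative weights (both assumptions, not the authors' formulation).

```python
import torch.nn.functional as F

def cdrl_supervision(generated, source, reference, content_enc, emotion_enc,
                     w_content: float = 1.0, w_emotion: float = 1.0):
    """Auxiliary losses added to the SPFEM generator objective.

    content_enc / emotion_enc stand in for the frozen CCRL / CERL encoders;
    the cosine form and the weights are illustrative assumptions.
    """
    # Content consistency: generated frame vs. source frame.
    loss_content = 1 - F.cosine_similarity(
        content_enc(generated), content_enc(source), dim=-1).mean()
    # Emotion consistency: generated frame vs. reference frame.
    loss_emotion = 1 - F.cosine_similarity(
        emotion_enc(generated), emotion_enc(reference), dim=-1).mean()
    return w_content * loss_content + w_emotion * loss_emotion
```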
b) Main Results
1. Quantitative Comparison
Extensive quantitative evaluations were conducted on the MEAD and RAVDESS datasets, using the following three metrics to measure the quality of generated results:
- FAD (Fréchet ArcFace Distance): measures the realism of generated images; lower is better.
- CSIM (Cosine Similarity): measures the emotion similarity between generated images and reference images; higher is better.
- LSE-D (Lip Sync Error Distance): measures lip synchronization with the audio; lower is better.
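For readers reimplementing the evaluation, the sketch below computes a Fréchet distance over ArcFace embeddings and the mean cosine similarity; treating FAD as the standard Fréchet formulation over ArcFace features is an assumption, and LSE-D, which typically relies on a pre-trained SyncNet, is omitted.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """Fréchet distance between two sets of (N, D) ArcFace embeddings."""
    mu1, mu2 = feats_real.mean(0), feats_fake.mean(0)
    c1 = np.cov(feats_real, rowvar=False)
    c2 = np.cov(feats_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(c1 @ c2, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from sqrtm
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(c1 + c2 - 2 * covmean))

def csim(feats_gen: np.ndarray, feats_ref: np.ndarray) -> float:
    """Mean cosine similarity between paired (N, D) embeddings."""
    a = feats_gen / np.linalg.norm(feats_gen, axis=1, keepdims=True)
    b = feats_ref / np.linalg.norm(feats_ref, axis=1, keepdims=True)
    return float((a * b).sum(1).mean())
```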
The experimental results showed that, under the Cross-ID setting on the MEAD dataset, the CDRL algorithm significantly improved all metrics. For example, when applied to the NED baseline model, the average FAD decreased from 4.448 to 4.344, LSE-D decreased from 9.906 to 9.351, and CSIM increased from 0.773 to 0.792. Similarly, on the RAVDESS dataset, CDRL also demonstrated consistent performance improvements.
2. Qualitative Comparison
Qualitative analysis further validated the effectiveness of CDRL. For instance, the NED baseline model often distorted mouth shapes during emotion editing, whereas integrating CDRL allowed the generated images to better maintain lip synchronization with speech while achieving accurate emotion transfer.
3. User Study
The study also conducted a user survey, inviting 25 participants to rate the realism, emotion similarity, and lip synchronization of the generated results. The results showed that CDRL significantly outperformed the baseline model across all metrics. For example, on the MEAD dataset, CDRL increased realism ratings by 40%, emotion similarity ratings by 38%, and lip synchronization ratings by 48%.
c) Conclusions and Significance
The CDRL algorithm proposed in this study provides a novel and efficient solution for SPFEM. By separately learning independent content and emotion representations, CDRL can manipulate emotions more accurately while effectively preserving audio-lip synchronization. Additionally, CDRL demonstrated strong generalization capabilities, achieving excellent performance on new datasets (e.g., RAVDESS) even without retraining.
This research holds significant scientific value and application potential. On one hand, it offers a new approach to decoupled representation learning, applicable to multimodal data processing. On the other hand, it provides technical support for practical applications such as virtual character generation and film post-production.
d) Research Highlights
- Innovative Algorithm Design: First proposed the CDRL algorithm, using CCRL and CERL modules to learn content and emotion representations separately.
- Application of Contrastive Learning: Successfully achieved decoupling of content and emotion information using a contrastive learning framework.
- Multimodal Data Fusion: Combined audio and image data, fully leveraging the advantages of multimodal information.
- Validation via User Studies: Conducted large-scale user surveys to comprehensively evaluate the quality of generated results.
e) Other Valuable Information
The research team also explored the limitations of CDRL, such as its inability to perfectly transfer detailed features like teeth in some cases. Future work plans to further enhance the algorithm’s generalization ability through adversarial training.
Summary
This paper addresses the long-standing challenge of decoupling content and emotion information in SPFEM by proposing the CDRL algorithm. Its innovative workflow, rigorous experimental design, and outstanding performance make it a significant milestone in the field.