Phonetically-Anchored Domain Adaptation for Cross-Lingual Speech Emotion Recognition

Academic Background

Speech Emotion Recognition (SER) has broad application prospects in intelligent agents, social robots, voice assistants, and automated call center systems. As these applications become increasingly global, the demand for cross-lingual SER is growing. However, the main challenge in cross-lingual emotion recognition lies in the differences in emotional expression and acoustic features between languages. Traditional approaches tackle the problem mainly from a computational perspective, adapting features, domains, and labels across languages, but they often overlook the underlying commonalities between languages.

This study addresses the language adaptation problem in cross-lingual emotion recognition by introducing vowel phonemes as anchors. Specifically, the authors explore the commonalities of vowels associated with specific emotions across different languages and use these commonalities as bridges between languages. With this approach, the research team aims to improve cross-lingual emotion recognition performance, especially in unsupervised learning scenarios.

Paper Source

This paper was jointly completed by a research team from National Tsing Hua University, University of Texas at Dallas, and Carnegie Mellon University. The main authors include Shreya G. Upadhyay, Luz Martinez-Lucas, William Katz, Carlos Busso, and Chi-Chun Lee. The paper was published in the journal IEEE Transactions on Affective Computing in October 2024.

Research Process

1. Research Objectives and Framework

The objective of this study is to improve cross-lingual speech emotion recognition by exploiting the commonalities of vowel phonemes. The research framework has two parts: first, the researchers analyzed which vowels show emotion-related commonalities across languages, especially those that carry useful information for emotion recognition; second, they used these commonalities as anchors to design an unsupervised cross-lingual emotion recognition model.

2. Datasets and Preprocessing

The study used three naturalistic emotional speech datasets: MSP-Podcast (American English), BIIC-Podcast (Taiwanese Mandarin), and Dusha (Russian). These datasets were meticulously annotated by human raters to ensure the accuracy of emotion labels. For phoneme analysis, the research team used the Montreal Forced Aligner (MFA) tool to align the speech samples with phonemes and converted them into International Phonetic Alphabet (IPA) representations.
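
To make the preprocessing step concrete, the sketch below shows one way an aligner's phone tier could be reduced to IPA vowel segments. It assumes the phone labels are ARPAbet-style (as produced by an English MFA acoustic model) and that the alignment has already been read into (start, end, label) tuples; the mapping table and helper function are illustrative assumptions, not the paper's exact conversion.

```python
# Minimal sketch: map forced-alignment phone labels (ARPAbet-style) to IPA
# and keep only vowel segments. The mapping below is illustrative and not
# the paper's exact conversion table.

ARPABET_TO_IPA = {
    "AA": "ɑ", "AE": "æ", "AH": "ʌ", "AO": "ɔ", "EH": "ɛ",
    "ER": "ɝ", "IH": "ɪ", "IY": "i", "UH": "ʊ", "UW": "u",
    "AW": "aʊ", "AY": "aɪ", "EY": "eɪ", "OW": "oʊ", "OY": "ɔɪ",
}

def vowel_segments(phone_tier):
    """phone_tier: iterable of (start_sec, end_sec, label) from the aligner."""
    segments = []
    for start, end, label in phone_tier:
        base = label.rstrip("012")  # drop ARPAbet stress digits, e.g. "IY1" -> "IY"
        if base in ARPABET_TO_IPA:
            segments.append((start, end, ARPABET_TO_IPA[base]))
    return segments

# Example: vowel_segments([(0.10, 0.25, "IY1"), (0.25, 0.31, "T")])
# -> [(0.10, 0.25, "i")]
```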

3. Vowel Commonality Analysis

The researchers explored the commonalities of vowels across languages through formant analysis and Wav2Vec2.0 feature representations. Specifically, the research team computed the first (F1) and second (F2) formants of the vowels and used t-SNE to visualize the similarity of vowel features across languages. The analysis covered not only monophthongs but also diphthongs, giving a more comprehensive picture of vowel behavior in emotion recognition.
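
As a rough illustration of the Wav2Vec2.0 side of this analysis, the following sketch mean-pools frame-level Wav2Vec2.0 features over a vowel segment and feeds the pooled embeddings to t-SNE. The facebook/wav2vec2-base checkpoint, the 20 ms frame hop, and the pooling choice are assumptions made for illustration, not details confirmed by the paper.

```python
import torch
import torchaudio
from sklearn.manifold import TSNE
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Assumed checkpoint; the paper does not necessarily use this exact model.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

def vowel_embedding(wav_path, start, end):
    """Mean-pool wav2vec2 frame features over one vowel segment (seconds)."""
    wave, sr = torchaudio.load(wav_path)
    wave = torchaudio.functional.resample(wave, sr, 16_000).mean(dim=0)  # mono, 16 kHz
    inputs = extractor(wave.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        frames = model(**inputs).last_hidden_state[0]   # (T, 768)
    hop = 0.02                                          # ~20 ms per wav2vec2 frame
    seg = frames[int(start / hop): int(end / hop) + 1]
    return seg.mean(dim=0).numpy()

# Given `embeddings`, an (n_vowel_tokens, 768) array gathered from both languages:
# points_2d = TSNE(n_components=2, perplexity=30).fit_transform(embeddings)
```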

4. Anchor Selection

Based on the distances and similarities of vowel features, the research team selected vowels that performed consistently across different languages as anchors. Specific methods included calculating cosine similarity and Euclidean distance, and determining the best anchors through a combined score. The study also proposed a group-anchor-based method, selecting a set of vowels that performed well in emotion recognition as anchors.
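
A minimal sketch of such a combined score is shown below: for each vowel shared by the two languages, it mixes cosine similarity with (negated) Euclidean distance between the languages' mean vowel embeddings and ranks vowels by the result. The weighting `alpha` and the exact combination rule are hypothetical; the paper's scoring formula may differ.

```python
import numpy as np

def combined_score(src_vec, tgt_vec, alpha=0.5):
    """Illustrative combined score: high cosine similarity and low Euclidean
    distance between a vowel's mean embeddings in the source and target
    languages both raise the score. `alpha` is a hypothetical weight."""
    cos = np.dot(src_vec, tgt_vec) / (np.linalg.norm(src_vec) * np.linalg.norm(tgt_vec))
    euc = np.linalg.norm(src_vec - tgt_vec)
    return alpha * cos - (1 - alpha) * euc

def rank_anchor_vowels(src_means, tgt_means):
    """src_means / tgt_means: dict mapping IPA vowel -> mean embedding."""
    shared = set(src_means) & set(tgt_means)
    scores = {v: combined_score(src_means[v], tgt_means[v]) for v in shared}
    return sorted(scores, key=scores.get, reverse=True)  # best anchors first
```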

5. Cross-Lingual Emotion Recognition Model

The study proposed an Attention-based Group-vowel-anchored Cross-lingual SER (AGA-CL) model. This model includes two branches: an emotion classification branch and a phonetically-anchored domain adaptation branch. The emotion classification branch uses features extracted by Wav2Vec2.0 for emotion classification, while the phonetically-anchored domain adaptation branch aligns the vowel features of the source and target languages through a triplet loss function.
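
The toy PyTorch sketch below captures this two-branch idea under simplifying assumptions: a shared projection feeds a cross-entropy emotion classifier on labelled source data, while a triplet loss pulls source and target embeddings of the same anchor vowel together and pushes non-anchor vowels away. Layer sizes, the margin, and the equal loss weighting are illustrative, and the attention pooling of the actual AGA-CL model is omitted.

```python
import torch
import torch.nn as nn

class AnchoredSERHead(nn.Module):
    """Sketch of the two-branch idea, not the paper's exact architecture."""

    def __init__(self, feat_dim=768, num_emotions=4):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        self.classifier = nn.Linear(256, num_emotions)
        self.triplet = nn.TripletMarginLoss(margin=1.0)  # margin is illustrative

    def forward(self, src_feats, src_labels, anchor, positive, negative):
        # Branch 1: emotion classification on labelled source-language features.
        z = self.shared(src_feats)
        cls_loss = nn.functional.cross_entropy(self.classifier(z), src_labels)
        # Branch 2: phonetic anchoring. anchor = source-language anchor-vowel
        # embedding; positive = same vowel in the target language;
        # negative = a non-anchor vowel embedding.
        align_loss = self.triplet(self.shared(anchor),
                                  self.shared(positive),
                                  self.shared(negative))
        return cls_loss + align_loss  # equal weighting assumed for simplicity
```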

Main Results

1. Results of Vowel Commonality Analysis

The results show that specific vowels exhibit emotion-related commonalities across languages. For example, the vowels /i/ and /a/ show high cross-lingual similarity for the emotions happiness and anger. Through formant analysis and Wav2Vec2.0 feature representations, the research team found that these vowels carry information that is highly valuable for emotion recognition.

2. Results of Anchor Selection

Based on the combined score, the research team selected vowels that performed well across different languages as anchors. For example, in the case of happiness, the vowel /i/ was chosen as the best anchor, while the vowels /o/ and /u/ performed poorly. The study also found that using group anchors can significantly improve emotion recognition performance.

3. Model Performance

The proposed AGA-CL model performed strongly on cross-lingual emotion recognition tasks. On the MSP-Podcast to BIIC-Podcast task, the AGA-CL model reached an Unweighted Average Recall (UAR) of 58.14%, a 6.89% improvement over the baseline models. On the BIIC-Podcast to MSP-Podcast task, the AGA-CL model achieved a UAR of 55.49%, again clearly outperforming the baselines.
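
For reference, UAR is simply the macro-averaged per-class recall, so each emotion class contributes equally regardless of how often it occurs. A minimal computation with toy labels (not the paper's data) looks like this:

```python
from sklearn.metrics import recall_score

# UAR (unweighted average recall) = macro-averaged recall over emotion classes.
y_true = [0, 0, 1, 1, 2, 2, 3, 3]   # toy ground-truth labels, illustrative only
y_pred = [0, 1, 1, 1, 2, 0, 3, 2]   # toy predictions
uar = recall_score(y_true, y_pred, average="macro")
print(f"UAR = {uar:.2%}")
```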

Conclusion and Significance

This study proposes a novel unsupervised method for cross-lingual emotion recognition that uses vowel phonemes as anchors. The results indicate that specific vowels exhibit emotion-related commonalities across languages, and exploiting these commonalities significantly improves cross-lingual emotion recognition performance. Beyond its scientific value, the method offers new insights for practical cross-lingual emotion recognition applications.

Research Highlights

  1. Discovery of Vowel Commonalities: The study systematically analyzes the commonalities of vowels in emotion recognition across different languages for the first time, providing a new perspective for cross-lingual emotion recognition.
  2. Phonetic Anchoring Mechanism: The proposed phonetic anchoring mechanism aligns the vowel features of the source and target languages through a triplet loss function, significantly improving the performance of cross-lingual emotion recognition.
  3. Unsupervised Learning: The method performs well in unsupervised learning scenarios, reducing the dependency on labeled data in the target language, which gives it broad application prospects.

Future Work

The research team plans to further expand the analysis methods, including consonants and articulatory gestures, to comprehensively understand the commonalities in cross-lingual emotion recognition. Additionally, the team plans to integrate the phonetic anchoring mechanism with other advanced domain adaptation techniques to further enhance the model’s performance.