Speech Emotion Recognition in Conversations Using Artificial Intelligence: A Systematic Review and Meta-Analysis

Academic Background

Emotion Recognition is an important research direction in Artificial Intelligence (AI) and Affective Computing, with broad application prospects in areas such as healthcare, education, and Human-Computer Interaction (HCI). Speech is a major carrier of emotional expression, conveying rich affective information through cues such as tone, speaking rate, and volume. However, Speech Emotion Recognition (SER) in conversational contexts still faces numerous challenges, including the dynamic nature of emotions, the integration of multimodal data, and the accuracy of emotion annotations.

To better understand the latest advancements and existing issues in AI-based Speech Emotion Recognition in Conversation (SERC), the authors conducted a systematic review and meta-analysis. This study aims to reveal current trends, performance, biases, and limitations in the field of SERC through systematic review and quantitative analysis, providing guidance for future research.

Source of the Paper

This paper was co-authored by Ghada Alhussein, Ioannis Ziogas, Shiza Saleem, and Leontios J. Hadjileontiadis, who are affiliated with multiple research institutions, including the Aristotle University of Thessaloniki in Greece. The paper was accepted on March 7, 2025, and published in the journal Artificial Intelligence Review, with the DOI 10.1007/s10462-025-11197-8.

Topic and Main Points of the Paper

The topic of this paper is “A Systematic Review and Meta-Analysis of Artificial Intelligence in Speech Emotion Recognition in Conversation.” Through systematic review and meta-analysis, the authors examine the current applications, performance, and challenges of AI technologies in SERC. The main points of the paper are detailed below:

1. Choice of Emotion Modeling: Categorical vs. Dimensional Models

Emotion modeling is a core issue in SERC research. The paper points out that current research primarily adopts two emotion modeling approaches: Categorical Models and Dimensional Models. Categorical Models are based on Ekman’s six basic emotions (e.g., happiness, anger, sadness), while Dimensional Models describe emotional states through three dimensions: Valence, Arousal, and Dominance.

  • Supporting Evidence: The meta-analysis results show that Categorical Models dominate SERC research, especially in studies using the IEMOCAP and MELD datasets. However, Dimensional Models have advantages in capturing continuous changes in emotions, particularly in classification tasks involving Valence and Arousal.
  • Sub-point: The strength of Categorical Models lies in their intuitiveness and ease of annotation, while Dimensional Models are better suited to describing subtle, continuous emotional changes (a minimal illustration of the two schemes follows this list).
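
To make the two schemes concrete, the following Python sketch labels the same hypothetical utterance under both models. The label set follows Ekman’s six basic emotions; the VAD coordinates and the [-1, 1] scaling are illustrative assumptions, not values taken from the paper.

```python
from dataclasses import dataclass

# Categorical scheme: one label from a discrete set (Ekman's six basic emotions).
EKMAN_SIX = {"happiness", "sadness", "anger", "fear", "disgust", "surprise"}

@dataclass
class DimensionalLabel:
    """Dimensional scheme: continuous scores, assumed here to lie in [-1, 1]."""
    valence: float    # unpleasant (-1) .. pleasant (+1)
    arousal: float    # calm (-1) .. excited (+1)
    dominance: float  # submissive (-1) .. dominant (+1)

def categorical_label(label: str) -> str:
    """Validate a label against the categorical inventory."""
    if label not in EKMAN_SIX:
        raise ValueError(f"unknown category: {label}")
    return label

# The same utterance, annotated under both models (coordinates are made up).
utterance = "I can't believe we won!"
cat = categorical_label("happiness")
dim = DimensionalLabel(valence=0.9, arousal=0.8, dominance=0.6)
print(utterance, "->", cat, "|", dim)
```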

2. Multimodal vs. Unimodal Speech Emotion Recognition

The paper explores the performance differences between Multimodal and Unimodal Speech Emotion Recognition. Multimodal methods combine multiple data sources such as speech, video, and physiological signals, while Unimodal methods rely solely on speech data.

  • Supporting Evidence: The meta-analysis indicates that Unimodal Speech Emotion Recognition performs slightly better in terms of accuracy and F1-score, but Multimodal methods have an advantage in Recall. However, due to the small sample size, this conclusion requires further validation.
  • Sub-point: Multimodal methods have potential in handling complex emotional expressions, but their performance is heavily influenced by the data fusion technique (a late-fusion sketch follows this list).
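
One common fusion technique is score-level (late) fusion. The sketch below is a minimal illustration under assumed per-modality posteriors; the class list, weight, and probability values are hypothetical and not drawn from any specific reviewed study. Setting w_speech = 1.0 recovers the unimodal case.

```python
import numpy as np

EMOTIONS = ["happiness", "anger", "sadness", "neutral"]

def late_fusion(speech_probs: np.ndarray,
                video_probs: np.ndarray,
                w_speech: float = 0.6) -> str:
    """Score-level fusion: weighted average of per-modality class probabilities.

    `w_speech` is a hypothetical weight; in practice it would be tuned on a
    validation set. Unimodal SER corresponds to w_speech = 1.0.
    """
    fused = w_speech * speech_probs + (1.0 - w_speech) * video_probs
    return EMOTIONS[int(np.argmax(fused))]

# Illustrative per-modality posteriors for one utterance (made-up numbers).
speech = np.array([0.50, 0.30, 0.10, 0.10])  # speech branch output
video  = np.array([0.20, 0.55, 0.15, 0.10])  # video branch output
print(late_fusion(speech, video))            # fused decision
```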

3. Evolution of Feature Extraction Methods

The paper provides a detailed analysis of feature extraction methods in SERC research, including Hand-crafted Features, Deep-learned Features, Image Transformations, and Hybrid Approaches.

  • Supporting Evidence: In recent years, Deep Learning and Hybrid Approaches have gradually become mainstream, with a significant increase in Deep Learning-based feature extraction methods after 2019. Image Transformation methods (e.g., spectrograms) demonstrate high stability in processing speech signals.
  • Sub-point: Hybrid Approaches, which combine Hand-crafted and Deep-learned Features, can significantly improve emotion recognition accuracy, though their complexity also increases computational cost (the sketch after this list illustrates the three feature families).
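
Below is a minimal sketch of the three feature families using librosa, assuming a 16 kHz mono recording; the file path is a placeholder, and the pooled log-mel vector merely stands in for a deep-learned embedding (a real hybrid pipeline would use a pretrained encoder).

```python
import numpy as np
import librosa

# Load an utterance (the path is a placeholder).
y, sr = librosa.load("utterance.wav", sr=16000)

# Hand-crafted features: 13 MFCCs summarized over time (mean + std).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
handcrafted = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Image transformation: log-mel spectrogram, which CNN-style models
# can consume as a 2-D "image" of the signal.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
log_mel = librosa.power_to_db(mel, ref=np.max)

# Hybrid approach (sketch): concatenate hand-crafted statistics with a
# pooled summary standing in for a learned embedding.
pooled = log_mel.mean(axis=1)  # stand-in for a deep-learned embedding
hybrid = np.concatenate([handcrafted, pooled])
print(handcrafted.shape, log_mel.shape, hybrid.shape)
```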

4. Dataset Selection and Its Impact

The paper emphasizes the importance of datasets in SERC research, particularly the widespread use of the IEMOCAP and MELD datasets. However, these datasets are primarily based on Acted Conversations, which may not fully reflect emotional expressions in real-world scenarios.

  • Supporting Evidence: The meta-analysis results show that models evaluated on Acted Conversations datasets achieve higher accuracy and recall than those evaluated on Spontaneous Conversations datasets. Spontaneous Conversations datasets, however, are closer to real-world conditions and thus carry higher application value.
  • Sub-point: Future research should place more weight on Spontaneous Conversations datasets to improve the generalization ability of emotion recognition models in practical applications (a cross-corpus evaluation sketch follows this list).
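
Generalization of this kind is often probed with a cross-corpus protocol: train on one corpus, test on another. The sketch below uses synthetic features and labels as placeholders for an acted training corpus and a spontaneous test corpus; the classifier choice and feature dimensions are arbitrary assumptions, not the paper’s setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)

# Placeholder features/labels standing in for an acted corpus (train)
# and a spontaneous corpus (test); real dataset loaders would go here.
X_acted, y_acted = rng.normal(size=(200, 26)), rng.integers(0, 4, 200)
X_spont, y_spont = rng.normal(size=(100, 26)), rng.integers(0, 4, 100)

# Train on acted speech, evaluate on spontaneous speech: the gap between
# in-corpus and cross-corpus scores is one way to quantify generalization.
clf = LogisticRegression(max_iter=1000).fit(X_acted, y_acted)
pred = clf.predict(X_spont)
print("cross-corpus accuracy:", accuracy_score(y_spont, pred))
print("cross-corpus macro recall (UAR):",
      recall_score(y_spont, pred, average="macro"))
```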

5. Reliability of Emotion Annotations

The paper delves into the reliability of emotion annotations, particularly the impact of Inter-rater Reliability (IRR) on emotion recognition performance.

  • Supporting Evidence: Through Cronbach’s α coefficient analysis, the paper finds that Valence annotations are more reliable than Arousal annotations. The annotation consistency of the IEMOCAP dataset is significantly higher than that of the K-EmoCon dataset.
  • Sub-point: The accuracy of emotion annotations is crucial to AI model performance, and future research should optimize annotation processes to reduce annotation noise (a minimal Cronbach’s α computation follows this list).
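
For reference, Cronbach’s α for a ratings matrix with one row per utterance and one column per rater is α = k/(k−1) · (1 − Σᵢσᵢ²/σ_T²), where k is the number of raters, σᵢ² the variance of rater i’s scores, and σ_T² the variance of the per-utterance score totals. The NumPy sketch below implements this standard formula; the ratings matrix is made up for illustration and is not data from IEMOCAP or K-EmoCon.

```python
import numpy as np

def cronbach_alpha(ratings: np.ndarray) -> float:
    """Cronbach's alpha for a (n_utterances, n_raters) matrix of scores.

    alpha = k/(k-1) * (1 - sum of per-rater variances / variance of row sums),
    where k is the number of raters.
    """
    k = ratings.shape[1]
    rater_vars = ratings.var(axis=0, ddof=1)     # variance of each rater's scores
    total_var = ratings.sum(axis=1).var(ddof=1)  # variance of summed scores
    return (k / (k - 1)) * (1.0 - rater_vars.sum() / total_var)

# Illustrative 1-5 valence ratings from 3 raters over 5 utterances (made up).
ratings = np.array([[4, 5, 4],
                    [2, 2, 3],
                    [5, 5, 4],
                    [1, 2, 1],
                    [3, 3, 3]])
print(f"Cronbach's alpha: {cronbach_alpha(ratings):.2f}")
```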

Significance and Value of the Research

Through systematic review and meta-analysis, this paper comprehensively evaluates the latest advancements and challenges in AI-based Speech Emotion Recognition in Conversation. The main value of the research lies in:

  1. Scientific Value: The paper reveals key technological trends in the SERC field, providing direction for future research.
  2. Application Value: The research results offer theoretical support for developing more efficient emotion recognition systems, with broad application prospects in healthcare, education, and HCI.
  3. Methodological Contribution: The multi-subgroup meta-analysis method proposed in this paper provides a new quantitative analysis framework for emotion recognition research (a sketch of a standard pooling step appears below).
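
The paper’s exact statistical pipeline is not reproduced here, but meta-analyses of this kind typically pool per-study effect sizes with an inverse-variance random-effects model. The sketch below implements the standard DerSimonian-Laird estimator on made-up per-study accuracies; it is a generic illustration, not the authors’ code.

```python
import numpy as np

def random_effects_pool(effects: np.ndarray, variances: np.ndarray):
    """DerSimonian-Laird random-effects pooling of per-study effect sizes.

    Returns the pooled effect and the between-study variance tau^2.
    """
    w = 1.0 / variances                          # fixed-effect weights
    fixed = np.sum(w * effects) / np.sum(w)      # fixed-effect pooled estimate
    q = np.sum(w * (effects - fixed) ** 2)       # Cochran's Q heterogeneity
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(effects) - 1)) / c)
    w_star = 1.0 / (variances + tau2)            # random-effects weights
    return np.sum(w_star * effects) / np.sum(w_star), tau2

# Made-up per-study accuracies (as proportions) and their variances.
acc = np.array([0.62, 0.71, 0.68, 0.75])
var = np.array([0.004, 0.006, 0.003, 0.005])
pooled, tau2 = random_effects_pool(acc, var)
print(f"pooled effect: {pooled:.3f}, tau^2: {tau2:.4f}")
```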

Highlights of the Research

  1. Comprehensiveness: The paper covers 51 SERC studies from 2010 to 2023, conducting a systematic review and quantitative analysis.
  2. Innovation: Through multi-subgroup meta-analysis, the paper explores the impact of emotion modeling, multimodal fusion, feature extraction, and dataset selection on emotion recognition performance.
  3. Practicality: The research results provide practical guidance for developing more efficient emotion recognition systems, particularly in optimizing annotation processes and dataset selection.

Other Valuable Information

The paper also discusses biases and reporting quality issues in emotion recognition and proposes improvement suggestions. For example, future research should focus more on cross-language and cross-dataset emotion recognition capabilities to enhance model generalization. Additionally, the paper calls for the establishment of more open-access emotion annotation datasets to promote further development in the SERC field.


This report makes clear the current state of research, the open challenges, and the future directions of Artificial Intelligence in Speech Emotion Recognition in Conversation. The paper not only provides valuable references for the academic community but also offers important guidance for applying emotion recognition technologies in practice.