SigWavNet: Learning Multiresolution Signal Wavelet Network for Speech Emotion Recognition
Academic Background
Speech Emotion Recognition (SER) plays a crucial role in human-computer interaction and psychological assessment. It identifies a speaker's emotional state by analyzing speech signals, with wide applications in emergency call centers, healthcare, and virtual AI assistants. However, despite significant progress in the field, challenges such as system complexity, insufficient feature distinctiveness, and noise interference persist. To address these issues, a research team from the University of Québec, Concordia University, and the University of Québec at Montréal proposed a new end-to-end deep learning framework, SigWavNet, which extracts meaningful features directly from raw waveform speech signals and improves emotion recognition accuracy through multiresolution analysis.
Source of the Paper
This paper, titled "SigWavNet: Learning Multiresolution Signal Wavelet Network for Speech Emotion Recognition," was co-authored by Alaa Nfissi, Wassim Bouachir, Nizar Bouguila, and Brian Mishara from the University of Québec, Concordia University, and the University of Québec at Montréal, and was published in the journal IEEE Transactions on Affective Computing in 2025.
Research Process
1. Research Motivation and Problem
Existing speech emotion recognition systems have limitations when handling complex emotional expressions, especially in feature extraction and noise robustness. Traditional methods often rely on fixed-length speech segments and fail to fully capture how emotional information is distributed across an utterance. In addition, noise interference significantly degrades the practical performance of these systems. To address these problems, the authors propose SigWavNet, an end-to-end deep learning framework based on the fast discrete wavelet transform (FDWT), combined with one-dimensional dilated convolutional neural networks (1D dilated CNNs) and bidirectional gated recurrent units (Bi-GRUs) to capture the spatial and temporal features of speech signals.
2. Research Methods and Processes
a) Fast Discrete Wavelet Transform (FDWT)
The core of SigWavNet is the FDWT layer, which performs a multi-level decomposition of the raw speech signal. FDWT implements the low-pass and high-pass filters as learnable convolutional layers and decomposes the signal level by level. Each level produces approximation coefficients (the low-pass output) and detail coefficients (the high-pass output), with orthogonality maintained through the conjugate quadrature filter (CQF) relation. The advantage of FDWT lies in its localized analysis in both the time and frequency domains, which is crucial for capturing emotional features in speech.
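The following is a minimal PyTorch sketch of a single decomposition level, not the authors' implementation: the low-pass kernel is a learnable parameter, the high-pass kernel is derived from it via the CQF relation, and both are applied as strided 1-D convolutions. The class name `WaveletLevel` and the kernel size are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WaveletLevel(nn.Module):
    """One FDWT decomposition level sketched as strided 1-D convolutions.

    The low-pass kernel h is learnable; the high-pass kernel g is derived
    from it by the CQF relation g[k] = (-1)^k * h[L-1-k], which keeps the
    two filters orthogonal.
    """
    def __init__(self, kernel_size: int = 8):
        super().__init__()
        self.lowpass = nn.Parameter(torch.randn(kernel_size) / kernel_size)

    def forward(self, x):                                  # x: (batch, 1, time)
        h = self.lowpass
        signs = torch.tensor([(-1.0) ** k for k in range(h.numel())],
                             device=h.device)
        g = signs * h.flip(0)                              # CQF high-pass kernel
        h, g = h.view(1, 1, -1), g.view(1, 1, -1)
        approx = F.conv1d(x, h, stride=2, padding=h.shape[-1] // 2)  # approximation coefficients
        detail = F.conv1d(x, g, stride=2, padding=g.shape[-1] // 2)  # detail coefficients
        return approx, detail
```

Applying such a level recursively to the approximation branch yields the multi-level decomposition described above.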
b) Learnable Asymmetric Hard Thresholding (LAHT)
To enhance the sparsity of the feature representation, SigWavNet introduces a learnable asymmetric hard thresholding function. This function combines two opposing sigmoid functions with learnable thresholds, suppressing noise-dominated wavelet coefficients while retaining emotion-related ones.
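The sketch below shows one plausible form of such a gate in PyTorch: two opposed sigmoids define a smooth, asymmetric dead zone around zero. The exact parameterization and the names `t_pos`, `t_neg`, and `alpha` are assumptions for illustration, not the paper's definitions.

```python
import torch
import torch.nn as nn

class LearnableAsymmetricHardThreshold(nn.Module):
    """Sketch of a learnable asymmetric hard-thresholding gate.

    Two opposed sigmoids form a gate that is ~0 for coefficients inside the
    asymmetric dead zone [-t_neg, t_pos] and ~1 outside it, so small
    (noise-like) wavelet coefficients are suppressed while large ones pass
    through nearly unchanged. Thresholds and sharpness are learned.
    """
    def __init__(self, sharpness: float = 10.0):
        super().__init__()
        self.t_pos = nn.Parameter(torch.tensor(0.1))          # positive-side threshold
        self.t_neg = nn.Parameter(torch.tensor(0.1))          # negative-side threshold
        self.alpha = nn.Parameter(torch.tensor(sharpness))    # slope of the sigmoids

    def forward(self, x):
        gate = torch.sigmoid(self.alpha * (x - self.t_pos)) \
             + torch.sigmoid(-self.alpha * (x + self.t_neg))
        return x * gate
```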
c) One-Dimensional Dilated CNN and Spatial Attention Mechanism
On top of the multi-level features extracted by FDWT, SigWavNet uses one-dimensional dilated CNNs to capture local dependencies. Dilation expands the receptive field of the convolution kernel, allowing the network to cover long-range temporal context, while a spatial attention mechanism re-weights the features to highlight emotionally salient regions.
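The block below is an illustrative PyTorch sketch of stacked dilated convolutions followed by a simple per-time-step spatial attention gate; the channel count, dilation rates, and class name are assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class DilatedConvSpatialAttention(nn.Module):
    """Sketch of a 1-D dilated convolution block followed by spatial attention.

    Stacked dilated convolutions enlarge the receptive field; the attention
    branch produces one weight per time step that re-scales the feature map.
    """
    def __init__(self, channels: int = 32, kernel_size: int = 3):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size, dilation=1, padding=1),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size, dilation=2, padding=2),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size, dilation=4, padding=4),
            nn.ReLU(),
        )
        # Spatial attention: one score per time step, squashed to (0, 1).
        self.attn = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, x):                            # x: (batch, 1, time)
        feats = self.convs(x)                        # (batch, channels, time)
        weights = torch.sigmoid(self.attn(feats))    # (batch, 1, time)
        return feats * weights                       # emphasize emotion-salient frames
```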
d) Bidirectional GRU and Temporal Attention Mechanism
To capture temporal patterns in speech signals, SigWavNet incorporates a bidirectional GRU network. The bidirectional GRU processes forward and backward temporal information simultaneously, while the temporal attention mechanism identifies key regions that contribute most to emotion recognition.
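A minimal sketch of this stage, assuming a standard PyTorch Bi-GRU with additive temporal attention, is shown below; the hidden sizes and class name are illustrative assumptions, not the authors' settings.

```python
import torch
import torch.nn as nn

class BiGRUTemporalAttention(nn.Module):
    """Sketch of a bidirectional GRU with temporal attention.

    The GRU reads the frame sequence in both directions; the attention layer
    scores each time step, and the weighted sum becomes a fixed-size
    utterance-level vector.
    """
    def __init__(self, input_size: int = 32, hidden_size: int = 64):
        super().__init__()
        self.gru = nn.GRU(input_size, hidden_size,
                          batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * hidden_size, 1)        # one score per time step

    def forward(self, x):                    # x: (batch, time, input_size)
        out, _ = self.gru(x)                 # (batch, time, 2*hidden_size)
        alpha = torch.softmax(self.score(out), dim=1)     # temporal attention weights
        context = (alpha * out).sum(dim=1)   # (batch, 2*hidden_size)
        return context, alpha
```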
e) Channel Weighting and Global Average Pooling
In the final stage of feature extraction, SigWavNet dynamically re-weights the different frequency bands through a channel weighting layer, applies global average pooling (GAP) to compress each feature map into a scalar, and outputs emotion class probabilities through a Log-Softmax layer.
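The following PyTorch sketch illustrates this final stage under assumed dimensions (the numbers of frequency bands and emotion classes are placeholders, not taken from the paper).

```python
import torch
import torch.nn as nn

class ChannelWeightedClassifier(nn.Module):
    """Sketch of the final stage: per-band channel weighting, global average
    pooling, and a Log-Softmax classifier."""
    def __init__(self, num_bands: int = 7, num_classes: int = 4):
        super().__init__()
        self.band_weights = nn.Parameter(torch.ones(num_bands))   # learnable per-band weights
        self.gap = nn.AdaptiveAvgPool1d(1)                         # global average pooling
        self.classifier = nn.Linear(num_bands, num_classes)
        self.log_softmax = nn.LogSoftmax(dim=-1)

    def forward(self, x):                          # x: (batch, num_bands, time)
        x = x * self.band_weights.view(1, -1, 1)   # re-weight each frequency band
        x = self.gap(x).squeeze(-1)                # (batch, num_bands)
        return self.log_softmax(self.classifier(x))   # log class probabilities
```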
3. Experiments and Results
a) Datasets
The study used two publicly available speech emotion recognition datasets: IEMOCAP and Emo-DB. IEMOCAP contains about 12 hours of audio covering several emotion categories; Emo-DB contains 535 acted German recordings spanning seven emotional states. To ensure a fair evaluation, the study employed 10-fold cross-validation with stratified random sampling to split the training and test sets.
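As a sketch of this evaluation protocol, the snippet below uses scikit-learn's StratifiedKFold to produce class-balanced 10-fold splits; the `features` and `labels` arrays are dummy placeholders standing in for per-utterance inputs and emotion labels loaded from IEMOCAP or Emo-DB.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Dummy data standing in for loaded utterances and their emotion labels.
features = np.random.randn(535, 16000)          # placeholder waveforms
labels = np.random.randint(0, 7, size=535)      # placeholder emotion labels

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(features, labels)):
    x_train, x_test = features[train_idx], features[test_idx]
    y_train, y_test = labels[train_idx], labels[test_idx]
    # ... train and evaluate the model on this fold ...
    print(f"fold {fold}: {len(train_idx)} train / {len(test_idx)} test utterances")
```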
b) Experimental Results
SigWavNet performed strongly on both the IEMOCAP and Emo-DB datasets. On IEMOCAP, the model achieved an overall accuracy of 84.8% and an F1 score of 85.1%; on Emo-DB, the accuracy reached 90.1% with an F1 score of 90.3%. In particular, SigWavNet excelled at recognizing the "neutral" and "sad" emotions, with accuracies of 97% and 95.4%, respectively. The confusion matrix also showed that the model had some difficulty distinguishing "angry" from "sad".
c) Comparison with Existing Methods
SigWavNet outperformed various existing speech emotion recognition methods on the IEMOCAP and Emo-DB datasets, including models based on MFCC feature extraction and CNN classification. Its advantage lies in directly extracting multi-resolution features from raw speech signals and combining spatial and temporal attention mechanisms to capture emotional information.
4. Ablation Study
To validate the roles of SigWavNet’s components, the study conducted ablation experiments. The results indicated that learnable asymmetric hard thresholding and independently learned wavelet kernels across layers significantly improved model performance. Additionally, the introduction of bidirectional GRU and temporal attention mechanisms further enhanced the model’s ability to capture temporal information.
Conclusion and Significance
SigWavNet significantly improves the accuracy and robustness of speech emotion recognition by integrating multiresolution analysis, learnable thresholds, and attention mechanisms. Its end-to-end deep learning framework not only simplifies the feature extraction process but also effectively handles noise interference in practical applications. This research provides new insights into the field of speech emotion recognition and has broad application prospects in human-computer interaction and mental health assessment.
Research Highlights
- Multiresolution Analysis: SigWavNet utilizes fast discrete wavelet transform for multi-level decomposition of speech signals, effectively capturing the time and frequency information of emotional features.
- Learnable Asymmetric Hard Thresholding: By dynamically adjusting thresholds, the model can better remove noise and retain emotion-related features.
- Spatial and Temporal Attention Mechanisms: By combining one-dimensional dilated CNNs with bidirectional GRUs, SigWavNet captures both local and global characteristics of speech signals.
- End-to-End Framework: SigWavNet extracts features directly from raw speech signals, avoiding the complex manual feature extraction process required by traditional methods.
Outlook
Future research can further explore the applicability of SigWavNet in multi-language and multi-dialect environments and attempt to apply it to more complex real-world scenarios, such as real-time speech emotion recognition and multimodal emotion analysis.