Spectro-Temporal Modulations Incorporated Two-Stream Robust Speech Emotion Recognition
Academic Background
Speech Emotion Recognition (SER) is a technology that identifies a speaker's emotional state from their speech. It has broad application potential in areas such as human-computer interaction, customer service systems, and healthcare. However, although deep-learning-based SER models perform impressively in controlled environments, their performance degrades significantly in real-world noisy conditions. Noise such as traffic or fan noise can severely corrupt the speech signal and cause a sharp drop in recognition accuracy. Developing an SER system that remains robust in noisy environments has therefore become an important research direction.
Traditional SER systems typically rely on acoustic features such as Mel-Frequency Cepstral Coefficients (MFCCs) and mel-spectrograms. These features, however, are easily corrupted by noise, degrading recognition performance. In recent years, researchers have explored more robust alternatives such as Spectro-Temporal Modulation (STM) features. By simulating the processing mechanisms of the human auditory cortex, STM features capture emotional information in speech more reliably and exhibit stronger robustness in noisy environments.
Paper Source
This paper was co-authored by Yih-Liang Shen, Pei-Chin Hsieh, and Tai-Shih Chi from the Department of Electronics and Electrical Engineering at National Yang Ming Chiao Tung University, Taiwan. It was published in the Journal of LaTeX Class Files in August 2021. The research was supported by the Ministry of Science and Technology, Taiwan.
Research Process
1. Research Objectives
This paper proposes a two-stream SER model that combines spectro-temporal modulation features with traditional acoustic features, aiming to improve robustness in noisy environments. The model is validated under the “clean-train, noisy-test” paradigm on a German (EMODB) and an English (RAVDESS) dataset.
2. Data Preparation
The study utilized two publicly available SER datasets: EMODB and RAVDESS. EMODB contains 535 German speech samples covering seven emotions, while RAVDESS contains 1440 English speech samples covering eight emotions. All samples were brought to a uniform length of 3 seconds, with shorter samples zero-padded.
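The fixed-length preprocessing described above can be sketched in a few lines of NumPy. This is a minimal sketch, assuming a 16 kHz sampling rate (not stated in this summary) and that samples longer than 3 seconds are simply truncated:

```python
import numpy as np

SAMPLE_RATE = 16000           # assumed sampling rate; not specified in the summary
TARGET_LEN = 3 * SAMPLE_RATE  # fixed 3-second length used in the study

def fix_length(signal: np.ndarray) -> np.ndarray:
    """Trim or zero-pad a 1-D waveform to exactly 3 seconds."""
    if len(signal) >= TARGET_LEN:
        return signal[:TARGET_LEN]          # truncation is an assumption here
    return np.pad(signal, (0, TARGET_LEN - len(signal)))  # zero-pad short samples

short = np.ones(SAMPLE_RATE)      # 1-second dummy signal
long_ = np.ones(4 * SAMPLE_RATE)  # 4-second dummy signal
assert fix_length(short).shape == (TARGET_LEN,)
assert fix_length(long_).shape == (TARGET_LEN,)
```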
3. Feature Extraction
The study employed two types of features:
- Mel-Spectrogram: generated with a 40 ms window, a 10 ms hop, a 2048-point Fast Fourier Transform (FFT), and 128 mel-frequency bins.
- Spectro-Temporal Modulation Features: generated by applying modulation filters to the mel-spectrogram. The rate parameter (ω) of the modulation filters was set to ±2, ±4, ±8, ±16, ±32 Hz, and the scale parameter (Ω) to 0.5, 1, 2, 4 cycles/20 mel-bands.
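The mel-spectrogram settings above can be illustrated with a minimal NumPy sketch. The 16 kHz sampling rate and Hann window are assumptions not specified in this summary:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular mel filterbank of shape (n_mels, n_fft // 2 + 1)."""
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):          # rising slope of triangle i
            fb[i, k] = (k - l) / (c - l)
        for k in range(c, r):          # falling slope of triangle i
            fb[i, k] = (r - k) / (r - c)
    return fb

def mel_spectrogram(x, sr=16000, win=0.040, hop=0.010, n_fft=2048, n_mels=128):
    """Mel-spectrogram with the paper's settings: 40 ms window, 10 ms hop,
    2048-point FFT, 128 mel bins."""
    win_len, hop_len = int(sr * win), int(sr * hop)
    window = np.hanning(win_len)
    frames = np.array([x[s:s + win_len] * window
                       for s in range(0, len(x) - win_len + 1, hop_len)])
    spec = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2   # power spectrum per frame
    return mel_filterbank(sr, n_fft, n_mels) @ spec.T  # (128, n_frames)
```

For a 3-second signal at 16 kHz this yields a (128, 297) feature matrix, which then serves as the input to the modulation filters.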
4. Model Design
The study proposed a Two-Stream Attention-based Convolutional Recurrent Neural Network (TACRNN) consisting of two branches:
- Mel-Spectrogram Branch: convolutional layers extract mel-spectrogram features, followed by max-pooling and fully connected layers for feature consolidation.
- Modulation Branch: a similar architecture that extracts information from the spectro-temporal modulation features.
The features from the two branches are concatenated and fed into a Bi-directional Long Short-Term Memory (BiLSTM) network and an attention layer, and finally classified by a Softmax classifier.
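The attention pooling over the BiLSTM outputs can be sketched with simple additive attention in NumPy. The scoring function below is an assumption, since this summary does not give the paper's exact attention formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_pool(h, w, b, u):
    """Additive attention over BiLSTM outputs h of shape (T, D):
    score each frame, softmax over time, return the weighted sum."""
    scores = np.tanh(h @ w + b) @ u   # one scalar score per frame, shape (T,)
    alpha = softmax(scores)           # attention weights over the T frames
    return alpha @ h, alpha           # (D,) utterance-level embedding

T, D = 297, 256                       # frames x feature dim (D is an assumption)
h = rng.standard_normal((T, D))       # dummy BiLSTM output sequence
w = rng.standard_normal((D, D)) * 0.01
b = np.zeros(D)
u = rng.standard_normal(D) * 0.01
emb, alpha = attention_pool(h, w, b, u)
assert emb.shape == (D,) and np.isclose(alpha.sum(), 1.0)
```

The pooled embedding would then pass to the Softmax classifier; the weights `alpha` indicate which frames the model attends to.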
5. Experimental Setup
The study used 10-fold cross-validation, training the model with the Adam optimizer and cross-entropy loss function. Experiments were conducted under both clean and noisy conditions, with noise conditions including white noise and DNS Challenge noise. Signal-to-noise ratios (SNRs) were set at 5, 10, 15, and 20 dB.
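Mixing noise into clean speech at a target SNR, as in the noisy-test conditions, follows standard power-ratio scaling. A minimal NumPy sketch, with dummy signals standing in for real speech and noise:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the mixture speech + noise has the requested SNR in dB."""
    noise = noise[:len(speech)]
    p_s = np.mean(speech ** 2)                              # speech power
    p_n = np.mean(noise ** 2)                               # noise power
    scale = np.sqrt(p_s / (p_n * 10.0 ** (snr_db / 10.0)))  # noise gain
    return speech + scale * noise

rng = np.random.default_rng(1)
s = rng.standard_normal(48000)   # 3 s of dummy "speech" at an assumed 16 kHz
n = rng.standard_normal(48000)   # white noise
for snr in (5, 10, 15, 20):      # the paper's SNR levels
    noisy = mix_at_snr(s, n, snr)
    achieved = 10 * np.log10(np.mean(s**2) / np.mean((noisy - s)**2))
    assert abs(achieved - snr) < 1e-6
```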
Main Results
1. Performance Under Clean Conditions
Under clean conditions, the ACRNN model using only mel-spectrogram features performed better than the model using only STM features. However, the two-stream TACRNN model achieved comparable performance to baseline models on both the EMODB and RAVDESS datasets.
2. Robustness Under Noisy Conditions
Under noisy conditions, the TACRNN model proved markedly more robust. With both white noise and DNS Challenge noise, it outperformed the mel-spectrogram-only model and the other baselines at most SNR levels, and statistical tests confirmed that these improvements were significant.
3. Weight Analysis of Modulation Features
The study found that during training, the TACRNN model paid more attention to the outputs of certain specific modulation filters, such as those with a rate of ±2 Hz and a scale of 4 cycles/20 mel-bands. These filters were able to capture the harmonic structure and formant contours of speech, which are crucial for speech perception in noisy environments.
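A single rate-scale modulation filter of the kind analyzed here can be illustrated as a passband in the 2-D modulation domain of the mel-spectrogram. The Gaussian passband shape and bandwidths below are illustrative assumptions, not the paper's actual filter design:

```python
import numpy as np

def stm_filter_response(mel_spec, rate_hz, scale_cyc, hop_s=0.010, band_span=20.0):
    """Filter a mel-spectrogram (n_mels x n_frames) with one spectro-temporal
    modulation filter in the 2-D Fourier (modulation) domain.
    rate_hz: temporal modulation, e.g. 2 Hz; scale_cyc: spectral modulation
    in cycles per `band_span` mel bands, e.g. 4 cycles/20 mel-bands."""
    n_mels, n_frames = mel_spec.shape
    rates = np.fft.fftfreq(n_frames, d=hop_s)           # temporal axis in Hz
    scales = np.fft.fftfreq(n_mels, d=1.0 / band_span)  # cycles per 20 mel bands
    R, S = np.meshgrid(rates, scales)                   # (n_mels, n_frames)
    # Gaussian passband centred on the target (rate, scale) pair; the
    # bandwidths (1.0 Hz, 0.5 cycles) are arbitrary illustrative choices
    H = np.exp(-((R - rate_hz) ** 2) / (2 * 1.0 ** 2)
               - ((S - scale_cyc) ** 2) / (2 * 0.5 ** 2))
    # take the real part since this one-sided passband is not Hermitian
    return np.real(np.fft.ifft2(np.fft.fft2(mel_spec) * H))

spec = np.random.default_rng(2).standard_normal((128, 297))  # dummy mel-spectrogram
out = stm_filter_response(spec, rate_hz=2.0, scale_cyc=4.0)
assert out.shape == spec.shape
```

A filter tuned to a low rate (±2 Hz) and a scale of 4 cycles/20 mel-bands passes slow temporal fluctuations with fine spectral structure, which is consistent with the harmonic and formant cues the paper found the model attending to.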
Conclusion and Significance
By incorporating spectro-temporal modulation features into neural network models, this paper significantly improved the robustness of SER systems in noisy environments. The results indicate that STM features are more advantageous than traditional acoustic features in noisy conditions, providing new directions for future SER research.
Research Highlights
- Novel Feature Fusion Method: STM features are combined with mel-spectrogram features for the first time in a two-stream SER model.
- Significant Robustness Improvement: The TACRNN model demonstrated superior performance to baseline models under various noise conditions.
- In-depth Feature Analysis: By analyzing the weights of modulation features, the study revealed key speech features that the model focuses on in noisy environments.
Application Value
This research provides theoretical and technical support for developing robust SER systems applicable in real-world environments, with the potential to play a significant role in intelligent customer service and affective computing.
Other Valuable Information
The study also pointed out that future work could further optimize the parameter selection of modulation filters and explore the fusion of other acoustic features with STM features. Additionally, the researchers plan to extend the model to environments that include reverberation to assess its generalization capabilities.