CSE-GResNet: A Simple and Highly Efficient Network for Facial Expression Recognition
Academic Background
Facial Expression Recognition (FER) is an important research direction in the field of computer vision, with wide applications in social robots, healthcare, social psychology, customer service, and psychoanalysis. Facial expressions are natural and universal signals for conveying human emotional states and intentions. Therefore, accurate recognition of facial expressions is crucial for understanding human emotions. However, most existing FER methods focus on improving model performance while neglecting computational resource consumption. Achieving high recognition performance while maintaining efficiency on resource-constrained platforms remains a significant challenge.
To address this issue, this paper proposes a lightweight and highly efficient network, Channel Shift-Enhancement Gabor-ResNet (CSE-GResNet), which enhances key visual features in facial images through Gabor convolution (Gconv) and further improves the model’s expressive power through the newly proposed Channel Shift Module (CS-Module) and Channel Enhancement Module (CE-Module).
Source of the Paper
This paper was co-authored by Jiang Shaoping, Xing Xiaofen, Xu Xiangmin, Wang Lin, and Guo Kailing from South China University of Technology, together with Liu Fang from Guangdong University of Finance. The paper was published in the October 2023 issue of IEEE Transactions on Affective Computing, Volume 18, Issue 9.
Research Process
1. Research Problem and Objectives
The objective of this study is to design an efficient and lightweight FER model that maintains high recognition performance while reducing computational resources and memory consumption. To achieve this, the authors proposed CSE-GResNet, which combines Gabor convolution, channel shift module, and channel enhancement module to capture key features in facial images.
2. Network Architecture Design
The core of CSE-GResNet is an improved version of ResNet called GResNet, where traditional convolution operations are replaced by Gabor convolution (Gconv). Gabor convolution enhances the model’s robustness against scale changes and rotations by embedding Gabor filters into convolutional kernels, while also reducing the number of parameters. Specifically, each output channel of the Gabor convolution contains features from multiple Gabor filters with different orientations, capturing more detailed information.
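To make the idea concrete, below is a minimal PyTorch sketch of a Gabor-modulated convolution layer: a fixed bank of oriented Gabor filters element-wise modulates learnable kernels, and the responses of the different orientations are aggregated into each output channel, matching the description above. This is an illustrative reading of Gconv, not the authors’ implementation; the filter constants, default orientation count, and initialization are assumptions.

```python
# A minimal sketch of Gabor-modulated convolution (Gconv-style). All constants
# and parameter names here are illustrative assumptions, not the paper's code.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


def gabor_bank(kernel_size, num_orientations, scale=1.0):
    """Real part of a bank of Gabor filters with evenly spaced orientations."""
    half = kernel_size // 2
    ys, xs = torch.meshgrid(
        torch.arange(-half, half + 1, dtype=torch.float32),
        torch.arange(-half, half + 1, dtype=torch.float32),
        indexing="ij",
    )
    sigma, lam = 0.56 * math.pi * scale, math.pi * scale  # assumed constants
    filters = []
    for u in range(num_orientations):
        theta = u * math.pi / num_orientations
        x_r = xs * math.cos(theta) + ys * math.sin(theta)
        y_r = -xs * math.sin(theta) + ys * math.cos(theta)
        g = torch.exp(-(x_r ** 2 + y_r ** 2) / (2 * sigma ** 2)) \
            * torch.cos(2 * math.pi * x_r / lam)
        filters.append(g)
    return torch.stack(filters)  # shape (U, k, k)


class GaborConv2d(nn.Module):
    """Learnable kernels modulated element-wise by a fixed Gabor filter bank."""

    def __init__(self, in_ch, out_ch, kernel_size=3, num_orientations=4):
        super().__init__()
        self.weight = nn.Parameter(
            torch.randn(out_ch, in_ch, kernel_size, kernel_size) * 0.05)
        # Fixed filter bank: registered as a buffer, never updated by SGD.
        self.register_buffer("gabor", gabor_bank(kernel_size, num_orientations))

    def forward(self, x):
        num_orientations = self.gabor.shape[0]
        # Modulate every learned kernel with each orientation filter, so one
        # output channel aggregates responses from several orientations.
        modulated = self.weight.unsqueeze(1) * self.gabor[None, :, None]  # (O,U,I,k,k)
        out = 0
        for u in range(num_orientations):
            out = out + F.conv2d(x, modulated[:, u],
                                 padding=self.weight.shape[-1] // 2)
        return out


if __name__ == "__main__":
    layer = GaborConv2d(3, 16)
    print(layer(torch.randn(1, 3, 112, 112)).shape)  # torch.Size([1, 16, 112, 112])
```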
To further enhance the model’s expressive power, the authors proposed the Channel Shift Module (CS-Module) and the Channel Enhancement Module (CE-Module), sketched in code after this list:
- CS-Module: Promotes information exchange between neighboring channels by shifting some channels along the spatial dimension. The parameters of this module are fixed, making it computationally cheap during backpropagation.
- CE-Module: Enhances the model’s expressive power by aggregating complementary features from neighboring channels through local-region convolution. This module has few parameters and low computational cost, yet effectively improves model performance.
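A minimal sketch of how the two modules could be realized, assuming the CS-Module spatially shifts a fraction of the channel groups (a parameter-free operation) and the CE-Module mixes neighboring channels with a lightweight 1D convolution; the split ratio fold_div and the kernel size are illustrative assumptions rather than the paper’s exact design.

```python
# Illustrative sketches of a channel shift and a channel enhancement module.
import torch
import torch.nn as nn


def channel_shift(x, fold_div=8):
    """Parameter-free shift: a few channel groups move one pixel along H or W."""
    n, c, h, w = x.shape
    fold = c // fold_div
    out = x.clone()
    out[:, :fold, :, 1:] = x[:, :fold, :, :-1]                          # shift right
    out[:, fold:2 * fold, :, :-1] = x[:, fold:2 * fold, :, 1:]          # shift left
    out[:, 2 * fold:3 * fold, 1:, :] = x[:, 2 * fold:3 * fold, :-1, :]  # shift down
    out[:, 3 * fold:4 * fold, :-1, :] = x[:, 3 * fold:4 * fold, 1:, :]  # shift up
    return out


class ChannelEnhancement(nn.Module):
    """Aggregate complementary features from neighboring channels with a
    lightweight 1D convolution applied across the channel dimension."""

    def __init__(self, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x):
        n, c, h, w = x.shape
        # Treat channels as a sequence so each channel mixes with its neighbors.
        y = x.permute(0, 2, 3, 1).reshape(n * h * w, 1, c)
        y = self.conv(y).reshape(n, h, w, c).permute(0, 3, 1, 2)
        return x + y  # residual enhancement


if __name__ == "__main__":
    feat = torch.randn(2, 64, 28, 28)
    print(channel_shift(feat).shape, ChannelEnhancement()(feat).shape)
```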
3. Experimental Design and Datasets
The authors conducted extensive experiments on three publicly available FER datasets: RAF-DB, FER2013, and SFEW. These datasets contain facial expression images in natural scenes and present significant challenges such as varying head poses, lighting changes, and occlusions.
- RAF-DB: Contains 12,271 training images and 3,068 test images, labeled with the six basic expressions plus neutral (seven classes in total).
- FER2013: Contains 35,887 grayscale images divided into training, validation, and test sets.
- SFEW: Consists of key frames extracted from the AFEW 5.0 dataset, with 958 training images, 436 validation images, and 372 test images.
4. Training Strategy and Data Augmentation
To enhance the model’s robustness to pose variations, the authors employed various data augmentation techniques, including random cropping, horizontal flipping, and random rotation. Additionally, the model was pre-trained on the AffectNet dataset and fine-tuned on the RAF-DB, FER2013, and SFEW datasets. During training, the SGD optimizer was used with a learning rate of 0.005, which decayed exponentially after 30 epochs.
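For illustration, a minimal sketch of this training setup in PyTorch/torchvision with a stand-in backbone; the crop size, rotation angle, momentum, decay factor, and epoch count are assumptions, not values taken from the paper.

```python
# Sketch of the described training recipe: augmentation, SGD at lr 0.005,
# and exponential learning-rate decay after epoch 30. Constants are assumed.
import torch
from torchvision import models, transforms

train_transform = transforms.Compose([
    transforms.RandomRotation(10),          # random rotation (angle assumed)
    transforms.RandomCrop(224, padding=8),  # random cropping (size assumed)
    transforms.RandomHorizontalFlip(),      # horizontal flipping
    transforms.ToTensor(),
])

# Stand-in backbone; the paper fine-tunes CSE-GResNet pre-trained on AffectNet.
model = models.resnet18(num_classes=7)
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)

def lr_lambda(epoch):
    # Hold the base learning rate for 30 epochs, then decay it exponentially.
    return 1.0 if epoch < 30 else 0.9 ** (epoch - 30)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(60):  # number of epochs assumed
    # ... forward/backward passes over the fine-tuning dataset go here ...
    scheduler.step()
```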
Main Results
1. Effectiveness of GResNet
Experimental results show that GResNet, built on Gabor convolution, performs well across all three datasets. Compared to the traditional ResNet, GResNet improves recognition accuracy while using fewer parameters. For example, on the RAF-DB dataset, GResNet18 reached 85.59% accuracy, compared to 85.33% for ResNet18.
2. Effectiveness of CS-Module and CE-Module
The introduction of the channel shift module and channel enhancement module further enhanced the model’s performance. On the RAF-DB dataset, CSE-GResNet achieved a recognition accuracy of 89.02%, significantly outperforming state-of-the-art methods. Moreover, the computational cost and memory consumption of CS-Module and CE-Module are extremely low, allowing the model to run efficiently on resource-constrained platforms.
3. Comparison with Other Methods
Compared to existing efficient FER methods, CSE-GResNet demonstrates significant advantages in both recognition accuracy and computational efficiency. For instance, on the FER2013 dataset, CSE-GResNet achieved a recognition accuracy of 74.15%, while existing efficient models like EfficientFace achieved 73.59%. Furthermore, CSE-GResNet has only 2.80M parameters, far fewer than other models.
Conclusion and Significance
The proposed CSE-GResNet significantly enhances the performance of FER models by combining Gabor convolution, channel shift module, and channel enhancement module, while maintaining high computational efficiency. Experimental results demonstrate that CSE-GResNet achieves excellent recognition accuracy across multiple public datasets, with extremely low computational cost and memory consumption, making it suitable for resource-constrained application scenarios.
Research Highlights
- Efficiency and Lightweight Design: CSE-GResNet substantially reduces the number of parameters and the computational cost while maintaining high recognition performance.
- Innovative Module Design: The introduction of channel shift and channel enhancement modules further enhances the model’s expressive power.
- Extensive Experimental Validation: Extensive experiments on multiple public datasets validate the model’s superiority and robustness.
Other Valuable Information
The paper also discusses in detail the impact of orientation parameter (u) and scale parameter (v) selection in Gabor convolution on model performance, and validates the optimal parameter settings through experiments. Additionally, the authors explore the fusion methods of the channel shift module and channel enhancement module, proposing three different fusion strategies and validating their effectiveness through experiments.
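The paper’s three fusion strategies are not reproduced here; the sketch below merely illustrates two plausible ways of composing the modules (sequential vs. parallel) inside a residual-style block, reusing the channel_shift and ChannelEnhancement helpers from the earlier sketch.

```python
# Illustrative composition of the CS- and CE-style modules; NOT the paper's
# fusion strategies. Reuses channel_shift() and ChannelEnhancement from above.
import torch
import torch.nn as nn


class FusionBlock(nn.Module):
    """Residual-style block combining channel shift and channel enhancement."""

    def __init__(self, channels, mode="sequential"):
        super().__init__()
        self.mode = mode
        self.ce = ChannelEnhancement()  # hypothetical helper from earlier sketch
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False)

    def forward(self, x):
        if self.mode == "sequential":      # variant A: shift, then enhance
            y = self.ce(channel_shift(x))
        else:                              # variant B: fuse both branches in parallel
            y = channel_shift(x) + self.ce(x)
        return x + self.conv(y)


if __name__ == "__main__":
    block = FusionBlock(64, mode="sequential")
    print(block(torch.randn(1, 64, 28, 28)).shape)  # torch.Size([1, 64, 28, 28])
```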
Summary
The proposal of CSE-GResNet provides a new solution for efficient facial expression recognition, offering significant theoretical value in academia and broad application prospects in practical scenarios. Future research can further explore the applicability of this model in other computer vision tasks, such as face recognition and emotion analysis.