Facial 3D Regional Structural Motion Representation Using Lightweight Point Cloud Networks for Micro-Expression Recognition
Academic Background
Micro-expressions (MEs) are brief, subtle facial movements in human emotional expression, typically lasting between 1/25 and 1/5 of a second. Because they are spontaneous, rapid, and difficult to control voluntarily, micro-expressions often reveal an individual's true emotions, making them valuable in fields such as Human-Computer Interaction (HCI), psychology, criminal analysis, and business negotiation. However, their low intensity and transient nature make recognition highly challenging. Traditional micro-expression recognition methods rely primarily on motion features extracted from 2D RGB images, neglecting the critical role of facial structure and its movement in conveying emotion. To overcome this limitation, this paper proposes an innovative 3D facial motion representation that integrates 3D facial structure, regionalized RGB, and structural motion features, aiming to capture subtle facial dynamics more accurately.
Source of the Paper
This paper was co-authored by Ren Zhang, Jianqin Yin, Chao Qi, Yonghao Dang, Zehao Wang, Zhicheng Zhang, and Huaping Liu, from the School of Intelligent Engineering and Automation at Beijing University of Posts and Telecommunications and the Department of Computer Science and Technology at Tsinghua University. The paper has been accepted by IEEE Transactions on Affective Computing and is scheduled for official publication in 2025.
Research Workflow and Experimental Methods
1. 3D Spatiotemporal Facial Motion Representation
The study first extracts video sequences from the CAS(ME)3 dataset, including depth maps and corresponding RGB images. By generating 3D point clouds from the depth maps and combining them with optical flow computed from the RGB images, it captures the spatiotemporal dynamics of facial pixels. The specific steps are as follows:
- Conversion from Depth Map to 3D Point Cloud: using the camera intrinsic parameters (e.g., focal length and principal point coordinates), pixels in the depth map are back-projected into 3D space to generate point clouds carrying color information.
- Integration of Optical Flow and Structural Motion: by computing the optical flow and the depth change between the onset frame and the apex frame, motion information in the x, y, and z directions is obtained for each point.
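As a hedged illustration of this step (not the authors' code), the sketch below back-projects a depth map with assumed pinhole intrinsics fx, fy, cx, cy and combines OpenCV's Farneback optical flow with the onset-to-apex depth difference; the choice of flow estimator and the pixel-to-metric scaling are assumptions made for the example.

```python
import numpy as np
import cv2

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map (in meters) to an Nx3 point cloud using pinhole intrinsics."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

def structural_motion(rgb_onset, rgb_apex, depth_onset, depth_apex, fx, fy):
    """Per-pixel motion between onset and apex: x/y from dense optical flow, z from depth change."""
    gray_on = cv2.cvtColor(rgb_onset, cv2.COLOR_BGR2GRAY)
    gray_ap = cv2.cvtColor(rgb_apex, cv2.COLOR_BGR2GRAY)
    # Farneback flow is used here as a stand-in for the paper's flow estimator.
    flow = cv2.calcOpticalFlowFarneback(gray_on, gray_ap, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)  # (H, W, 2) pixel offsets
    dx = flow[..., 0] * depth_onset / fx   # approximate metric x motion from pixel offsets
    dy = flow[..., 1] * depth_onset / fy   # approximate metric y motion
    dz = depth_apex - depth_onset          # structural (depth) motion
    return np.stack([dx, dy, dz], axis=-1).reshape(-1, 3)
```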
2. Semantic Facial Region Segmentation
To capture emotional expressions in different facial areas more precisely, the study divides the face into eight semantic regions, including the left and right eyebrows, cheeks, mandible areas, the mouth, and the chin. Using the 68 facial landmarks detected by dlib, the boundary of each region is defined, and motion features are extracted from the point cloud of each region.
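A minimal sketch of the landmark-based partitioning, assuming dlib's standard 68-point predictor: the REGION_LANDMARKS grouping below is an illustrative assumption and does not reproduce the paper's exact eight-region boundaries.

```python
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# The 68-point model file must be downloaded separately from the dlib model zoo.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

# Illustrative grouping of the standard 68-point layout into semantic regions
# (assumed boundaries, not the paper's exact definition).
REGION_LANDMARKS = {
    "right_eyebrow":  list(range(17, 22)),
    "left_eyebrow":   list(range(22, 27)),
    "right_cheek":    [1, 2, 3, 31, 39, 40, 41],
    "left_cheek":     [13, 14, 15, 35, 42, 46, 47],
    "right_mandible": list(range(4, 7)),
    "left_mandible":  list(range(10, 13)),
    "mouth":          list(range(48, 68)),
    "chin":           list(range(7, 10)),
}

def region_landmark_coords(gray_image):
    """Return, for each region, the (x, y) landmark coordinates that define its boundary.
    Assumes at least one face is detected in the image."""
    face = detector(gray_image, 1)[0]
    shape = predictor(gray_image, face)
    pts = np.array([[shape.part(i).x, shape.part(i).y] for i in range(68)])
    return {name: pts[idx] for name, idx in REGION_LANDMARKS.items()}
```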
3. Lightweight Point Cloud Graph Convolutional Network (Lite-Point-GCN)
To address the issue of limited micro-expression samples, the study proposes a lightweight point cloud graph convolutional network (Lite-Point-GCN). The network performs feature extraction and modeling in two stages:
- Local Regional Motion Feature Extraction: a lightweight PointNet++ network extracts local features from each semantic region, integrating spatial and motion information.
- Global Motion Feature Relation Learning: a graph convolutional network (GCN) models interactions among the facial regions, capturing the relationship between emotional categories and motion features.
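A minimal PyTorch sketch of the two-stage idea: the local extractor is reduced to a shared point-wise MLP with max pooling per region (a strong simplification of the lightweight PointNet++ branch), and the global stage is a single graph convolution over a fully connected graph of region nodes; the layer sizes, input feature layout, and adjacency matrix are assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class RegionEncoder(nn.Module):
    """Simplified local extractor: shared point-wise MLP + max pooling per region
    (stand-in for the lightweight PointNet++ branch)."""
    def __init__(self, in_dim=9, feat_dim=64):  # xyz + rgb + per-point motion (dx, dy, dz), assumed layout
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                 nn.Linear(64, feat_dim), nn.ReLU())

    def forward(self, pts):                      # pts: (B, N_points, in_dim)
        return self.mlp(pts).max(dim=1).values   # (B, feat_dim)

class LitePointGCN(nn.Module):
    """Global stage: one graph convolution over a fully connected graph of region nodes."""
    def __init__(self, num_regions=8, feat_dim=64, num_classes=3):
        super().__init__()
        self.encoder = RegionEncoder(feat_dim=feat_dim)
        adj = torch.ones(num_regions, num_regions) / num_regions   # assumed fully connected adjacency
        self.register_buffer("adj", adj)
        self.gcn = nn.Linear(feat_dim, feat_dim)
        self.classifier = nn.Linear(num_regions * feat_dim, num_classes)

    def forward(self, region_points):            # (B, num_regions, N_points, in_dim)
        B, R = region_points.shape[:2]
        nodes = torch.stack([self.encoder(region_points[:, r]) for r in range(R)], dim=1)
        nodes = torch.relu(self.gcn(self.adj @ nodes))   # message passing among region nodes
        return self.classifier(nodes.flatten(1))
```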
4. Experiments and Evaluation
The study conducted extensive experiments on the CAS(ME)3 dataset, using Leave-One-Subject-Out (LOSO) cross-validation to evaluate the effectiveness of the proposed method. Experimental results demonstrate that the 3D facial motion representation method, which integrates depth information, significantly outperforms existing state-of-the-art methods in micro-expression recognition tasks.
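The LOSO protocol itself is standard and can be sketched with scikit-learn's LeaveOneGroupOut; the random features, labels, subject IDs, and the logistic-regression classifier below are placeholders standing in for the paper's actual pipeline.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.linear_model import LogisticRegression

# Placeholder data: per-sample feature vectors, emotion labels, and subject IDs (assumed shapes).
rng = np.random.default_rng(0)
features = rng.standard_normal((100, 64))        # e.g. pooled region features per sample
labels = rng.integers(0, 3, size=100)            # emotion classes
subjects = rng.integers(0, 20, size=100)         # subject ID for each sample

logo = LeaveOneGroupOut()
accuracies = []
for train_idx, test_idx in logo.split(features, labels, groups=subjects):
    # Hold out every sample from one subject; train on all remaining subjects.
    clf = LogisticRegression(max_iter=1000).fit(features[train_idx], labels[train_idx])
    accuracies.append(clf.score(features[test_idx], labels[test_idx]))

print("LOSO mean accuracy:", float(np.mean(accuracies)))
```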
Main Results
- Superiority of 3D Motion Representation: Combining optical flow with depth information, the 3D facial motion representation captures facial dynamics more accurately and shows stronger robustness under varying lighting and pose conditions.
- Effectiveness of Semantic Region Segmentation: Dividing the face into eight semantic regions and extracting motion features from each region significantly improves the accuracy and robustness of micro-expression recognition.
- Performance of Lite-Point-GCN: The lightweight point cloud graph convolutional network excels in both local and global feature modeling, effectively reducing the risk of overfitting and achieving excellent recognition performance on the CAS(ME)3 dataset.
Conclusion and Significance
This study proposes an innovative 3D facial motion representation method that integrates depth information with a lightweight point cloud graph convolutional network, significantly improving the accuracy and robustness of micro-expression recognition. This method not only holds significant application value in fields like HCI and psychology but also provides new ideas and approaches for future micro-expression recognition research.
Research Highlights
- Innovative 3D Facial Motion Representation: For the first time, depth information is combined with optical flow to build a more comprehensive facial motion representation.
- Lightweight Point Cloud Graph Convolutional Network: The designed Lite-Point-GCN network performs excellently under limited sample conditions, effectively reducing the risk of overfitting.
- Semantic Region Segmentation: By dividing the face into eight semantic regions, precise capture of emotional expressions in different regions is achieved.
Other Valuable Information
The study also examines the choice of global model, comparing GCN and Transformer architectures on the micro-expression recognition task. Experimental results show that the GCN has a clear advantage in global modeling, capturing the complex relationships among facial regions more accurately. Future research will validate the effectiveness and generalization ability of the method on larger-scale and more diverse datasets.