2.1. Expression Recognition Methods
Scholarly study of facial expressions dates back to the 19th century, but its modern, systematic research was pioneered by psychologist Paul Ekman [8]. Through extensive experiments, Ekman identified and classified six fundamental human expressions: happiness, surprise, fear, sadness, anger, and disgust, shown in
Figure 1. Ekman also introduced the Facial Action Coding System (FACS) [
9]. FACS divides the face into 46 distinct facial action units and describes an expression through the combination of the units that are active. Many researchers working on facial expressions accept this system and base their work on it. Facial expression recognition methods can broadly be divided into two categories: feature-based methods and deep learning-based methods.
Feature-based approaches, in turn, fall into three common types. The first is the hand-crafted feature method, a traditional classification approach in which features are designed in advance to extract effective expression information. The choice of classifier has a significant impact on recognition accuracy; commonly used classifiers include SVM [
10], AdaBoost [
11], and K-Means [
12]. The second is the geometry-based expression recognition method, which is based on FACS. In 1995, Cootes et al. [
13] proposed the active shape model, a statistical learning approach that detects the facial contour and extracts facial feature information so that expression features can be captured more comprehensively. Matthews et al. [
14] proposed the active appearance model, an improvement on the active shape model that strengthens its ability to detect face contours and locate facial features. Setyati et al. [15] combined an active shape model with a radial basis function network, performing facial expression classification through face reconstruction. Han et al. [
16] proposed the face mesh transformation method, which extracts expression-related local action units through mesh edge feature extraction, achieving an overall accuracy of 94.96% on the CK dataset. Finally, there is the texture-based expression recognition method, which represents facial expression information through statistics of the gray-level distribution of pixels in local regions of the face image. Most such methods build on classic algorithms such as Local Binary Pattern (LBP) and the Gabor wavelet transform. Fu Xiaofeng [
17] proposed a multi-scale center binary pattern based on LBP, which compares neighboring point pairs and weights the center pixel to reduce histogram dimensionality, and introduces an improved LBP sign function together with multi-scale analysis to address the noise sensitivity of the LBP operator in facial expression recognition, achieving good classification results. Bashyal et al. [18] extracted facial expression features with 18 Gabor filters, applied principal component analysis for dimensionality reduction, and combined the result with learning vector quantization, significantly improving recognition performance on the JAFFE dataset. Zhang et al. [
19] proposed a facial expression recognition algorithm based on the Gabor wavelet transform, LBP, and Local Phase Quantization (LPQ). Gabor filters extract facial image features at multiple orientations and scales to capture salient expression features; the filtered images are then encoded with the LBP and LPQ operators, and principal component analysis reduces the dimensionality of the fused LBP and LPQ features. Zhang Liang et al. [20] proposed a recognition algorithm based on the Gabor wavelet transform and a gradient-fused LBP feature, in which facial region features extracted by an improved LBP operator are combined with Gabor wavelet features through weighted fusion, making expression features more prominent and yielding good classification results.
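For illustration, the following is a minimal sketch of such a traditional texture-based pipeline, pairing grid-wise uniform-LBP histograms with PCA dimensionality reduction and an SVM classifier. The grid size, LBP parameters, and helper names are assumptions made for this sketch and do not reproduce the exact settings of the cited works.

```python
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def lbp_histogram(gray_face, grid=(7, 7), n_points=8, radius=1):
    """Concatenate uniform-LBP histograms over a grid of face patches (illustrative parameters)."""
    lbp = local_binary_pattern(gray_face, n_points, radius, method="uniform")
    n_bins = n_points + 2  # uniform patterns plus one bin for non-uniform codes
    h, w = lbp.shape
    feats = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            patch = lbp[i * h // grid[0]:(i + 1) * h // grid[0],
                        j * w // grid[1]:(j + 1) * w // grid[1]]
            hist, _ = np.histogram(patch, bins=n_bins, range=(0, n_bins), density=True)
            feats.append(hist)
    return np.concatenate(feats)

def train_expression_svm(face_images, labels):
    """Hand-crafted-feature pipeline: LBP histograms -> PCA -> RBF-kernel SVM."""
    features = np.stack([lbp_histogram(img) for img in face_images])
    clf = make_pipeline(StandardScaler(), PCA(n_components=0.95), SVC(kernel="rbf"))
    clf.fit(features, labels)
    return clf
```

A Gabor-based pipeline would follow the same pattern, replacing the LBP histogram stage with responses from a bank of Gabor filters.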
Traditional algorithms demand little computing power and are easy to implement, but they depend on hand-crafted features designed in advance, and feature extraction and expression classification cannot be optimized jointly. In addition, traditional algorithms are less robust and generalize poorly in complex scenarios and are difficult to scale to large datasets, so their practicality is limited compared with deep learning algorithms.
Yang et al. [
21] proposed De-expression Residue Learning (DeRL), which regards an expression as a combination of expressive and non-expressive components. A generative adversarial network maps every input image to a corresponding neutral expression image, and the expressive information retained in the intermediate layers of the generator is then learned for expression classification. In a similar vein to DeRL, Ruan et al. [22] introduced a feature decomposition and reconstruction approach, FDRL, for facial expression recognition. Unlike DeRL, FDRL first decomposes the expression features produced by the backbone network into a set of latent features related to facial action units. The model then learns importance weights for the individual features and for the relations between feature groups, and uses these weights to reconstruct the expression features, allowing it to capture discriminative expression information. Experiments show that FDRL achieves good facial expression recognition in both controlled and natural environments. Zhao et al. [
23] designed MA-Net, an attention network based on local and global features, to handle facial occlusion and head pose variation. The network uses ResNet-18 as its backbone and mitigates interference from occlusion. Test results indicate that MA-Net can effectively recognize facial expressions in natural scenarios. Wang et al. [
24] also acknowledged that occlusion and pose variation hinder facial expression recognition technology and proposed the Region Attention Network (RAN) to adaptively extract effective facial expression features. Additionally, Wang et al. [
25] approached deep learning-based facial expression recognition from another angle, arguing that uncertainty arising from low-quality images and subjective expression labels seriously misleads the learning of neural networks; they proposed the Self-Cure Network (SCN) to suppress this uncertainty. Rather than suppressing uncertainty, Zhang et al. [26] proposed a different solution to the same problem: their loss function encourages the model to learn more precise uncertainty values, helping it infer labels for ambiguous expression images from mixed images.
Numerous studies indicate that, in comparison to traditional expression recognition methods, deep learning-based approaches exhibit superior recognition capabilities. They are adept at learning intricate expression patterns and contextual information, and typically demonstrate high accuracy, particularly when trained on large-scale datasets and equipped with a sufficiently deep network structure. Moreover, these methods also demonstrate a certain level of robustness to image variations and noise. Even in the presence of image noise, deformations, and other distortions, deep learning models can effectively recognize facial expressions. They possess stronger generalization capabilities and capture a wider range of expression variations and sample diversity through extensive training data. This enables the method to achieve favorable recognition outcomes on unseen data.
The uniqueness and significance of this study lie in its global information association module and local feature enhancement. First, by introducing a multi-scale global association module, the study achieves in-depth mining and fusion of global facial expression information. Furthermore, the integrated fused convolutional self-attention mechanism (ACMix) dynamically captures contextual associations within facial expressions. Effective use of ACMix enables the model to allocate attention intelligently, focusing on key facial regions during expression recognition and thereby significantly enhancing its robustness and accuracy in complex natural scenarios.
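As an illustration of the idea behind such convolution-attention fusion, the sketch below combines a convolution branch and a self-attention branch on the same feature map with learnable branch weights. It is a simplified PyTorch approximation under our own assumptions (a single block, multi-head attention over flattened spatial positions), not the exact ACMix implementation used in this work.

```python
import torch
import torch.nn as nn

class ConvAttnMix(nn.Module):
    """Illustrative fusion of a convolution branch and a self-attention branch (not the exact ACMix)."""
    def __init__(self, channels, heads=4):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)
        # Learnable scalars that weight the two branches when they are summed.
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.ones(1))

    def forward(self, x):                                   # x: (B, C, H, W)
        conv_out = self.conv(x)                             # local features from the convolution branch
        b, c, h, w = x.shape
        tokens = self.norm(x.flatten(2).transpose(1, 2))    # (B, H*W, C): spatial positions as tokens
        attn_out, _ = self.attn(tokens, tokens, tokens)     # global context from self-attention
        attn_out = attn_out.transpose(1, 2).reshape(b, c, h, w)
        return self.alpha * conv_out + self.beta * attn_out

feat = torch.randn(2, 64, 28, 28)                           # toy feature map
mixed = ConvAttnMix(64)(feat)                               # output keeps the (2, 64, 28, 28) shape
```

Applying such a block at several feature-map scales and aggregating the outputs is one way the multi-scale global association described above could be realized.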
Echoing the global information association, this study also conducts meticulous processing in local feature enhancement. Through fine-grained segmentation of feature maps and the introduction of asymmetric convolution blocks, the local feature enhancement module precisely captures and effectively enhances the crucial local features of facial expressions. This meticulous treatment not only improves the expressiveness of local features but also seamlessly fuses them with the global feature map through residual connections, achieving complementarity and enhancement between local and global information.
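To make this concrete, below is a hedged sketch of how such local enhancement could be organized: the feature map is split into spatial patches, each patch is refined with an asymmetric convolution block (parallel 3×3, 1×3, and 3×1 convolutions summed before normalization), and the result is fused back with the input through a residual connection. The patch count and layer choices are illustrative assumptions, not the exact configuration of our module.

```python
import torch
import torch.nn as nn

class AsymmetricConvBlock(nn.Module):
    """Parallel 3x3, 1x3, and 3x1 convolutions whose outputs are summed and normalized."""
    def __init__(self, channels):
        super().__init__()
        self.square = nn.Conv2d(channels, channels, (3, 3), padding=(1, 1))
        self.hor = nn.Conv2d(channels, channels, (1, 3), padding=(0, 1))
        self.ver = nn.Conv2d(channels, channels, (3, 1), padding=(1, 0))
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        return torch.relu(self.bn(self.square(x) + self.hor(x) + self.ver(x)))

class LocalEnhancement(nn.Module):
    """Split the feature map into patches, enhance each patch, and fuse back residually (illustrative)."""
    def __init__(self, channels, splits=2):
        super().__init__()
        self.splits = splits
        self.blocks = nn.ModuleList([AsymmetricConvBlock(channels) for _ in range(splits * splits)])

    def forward(self, x):                      # x: (B, C, H, W), H and W divisible by `splits`
        b, c, h, w = x.shape
        ph, pw = h // self.splits, w // self.splits
        rows = []
        for i in range(self.splits):
            row = []
            for j in range(self.splits):
                patch = x[:, :, i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
                row.append(self.blocks[i * self.splits + j](patch))
            rows.append(torch.cat(row, dim=3))
        local = torch.cat(rows, dim=2)
        return x + local                       # residual connection fuses local and global information

feat = torch.randn(2, 64, 28, 28)
enhanced = LocalEnhancement(64)(feat)          # shape preserved: (2, 64, 28, 28)
```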
Table 1 compares the advantages and disadvantages of traditional methods and the proposed method.
The proposed method in this paper stands out for its ability to integrate global and local facial features, dynamically model facial contexts across multiple scales, and enhance salient local features. This unique combination enables robust facial expression recognition, even under challenging conditions like occlusion and pose variations.