Article

PH-CBAM: A Parallel Hybrid CBAM Network with Multi-Feature Extraction for Facial Expression Recognition

1 School of Software Engineering, Jiangxi University of Science and Technology, Nanchang 330000, China
2 Jiangxi Modern Polytechnic College, Nanchang 330000, China
3 Information Engineering College, Hebei University of Architecture, Zhangjiakou 075000, China
4 Big Data Technology Innovation Center of Zhangjiakou, Zhangjiakou 075000, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(16), 3149; https://doi.org/10.3390/electronics13163149
Submission received: 16 July 2024 / Revised: 6 August 2024 / Accepted: 8 August 2024 / Published: 9 August 2024
(This article belongs to the Special Issue Applied AI in Emotion Recognition)

Abstract

Convolutional neural networks have made significant progress in human Facial Expression Recognition (FER). However, they still face challenges in effectively focusing on and extracting facial features. Recent research has turned to attention mechanisms to address this issue, focusing primarily on local feature details rather than overall facial features. Building upon the classical Convolutional Block Attention Module (CBAM), this paper introduces a novel Parallel Hybrid Attention Model, termed PH-CBAM. This model employs split-channel attention to enhance the extraction of key features while maintaining a minimal parameter count. The proposed model enables the network to emphasize relevant details during expression classification. Heatmap analysis demonstrates that PH-CBAM effectively highlights key facial information. By employing a multimodal extraction approach in the initial image feature extraction phase, the network structure captures various facial features. The algorithm integrates a residual network and the MISH activation function to create a multi-feature extraction network, addressing issues such as gradient vanishing and negative gradient zero point in residual transmission. This enhances the retention of valuable information and facilitates information flow between key image details and target images. Evaluation on benchmark datasets FER2013, CK+, and Bigfer2013 yielded accuracies of 68.82%, 97.13%, and 72.31%, respectively. Comparison with mainstream network models on FER2013 and CK+ datasets demonstrates the efficiency of the PH-CBAM model, with comparable accuracy to current advanced models, showcasing its effectiveness in emotion detection.

1. Introduction

FER is a prominent research area in computer vision, with advancements in deep neural networks enabling researchers to improve their methods. Human facial expressions are a form of non-verbal communication that directly conveys emotions [1]. In interpersonal communication, facial expressions are crucial for quickly understanding others’ emotions, as they make up 70 percent of emotional expression compared to verbal communication’s 30 percent. Therefore, facial expressions play a significant role in deciphering human emotions. The use of facial expressions has expanded into fields such as mental health, psychiatry, and psychology over the past decade, making emotion extraction a vibrant research area [2,3]. With the rapid advancement of deep learning, the application of human-computer interaction technology in real life has become increasingly extensive, particularly in areas such as classroom teaching evaluation, drama performance training, figure skating expression training, and intelligent medical applications. However, as model complexity and computational demands increase, so too does the cost associated with deep learning. These sophisticated models often necessitate substantial amounts of data and computing resources for both training and inference, which poses challenges for deployment on resource-constrained mobile devices. Consequently, there is a pressing need for models that offer high accuracy, low response times, and robust resistance to interference for future evaluations of HRI experiments [4,5]. These emotions are typically categorized into seven main groups: neutral, happy, surprised, fearful, angry, sad, and disgusted [6]. As artificial intelligence advances, there is a growing demand for systems that can accurately and reliably identify and classify human emotions. Whether using traditional neural networks or advanced Transformer networks, the primary goal remains to enhance the accuracy of FER.
Expression recognition in image processing involves several key steps, including image preprocessing, feature extraction, and classification. One common issue faced by existing models is the presence of noise and the challenge of extracting adequate features, which is essential for accurate recognition. Facial feature extraction methods range from traditional techniques like SIFT and LBP to more modern deep learning approaches [7,8]. While traditional methods are effective in many applications, they often only capture superficial features and lack the ability to extract deep features for improved accuracy. In contrast, deep learning techniques excel at extracting intricate semantic features and integrating extraction and fusion processes [9]. Deep convolutional neural networks (DCNNs) are pivotal in this context. Despite the success of DCNNs in image classification, increasing the network depth can lead to issues such as increased complexity, higher memory requirements, and problems like gradient vanishing. To tackle these challenges, researchers have proposed innovative solutions like ResNet, which incorporates residual networks into CNNs to address gradient problems and maintain accuracy in deep networks.
Many studies have improved the accuracy of face recognition by modifying the network structure. Huang et al. [10] combined the residual model with the VGG model to enhance emotion recognition accuracy. Researchers have explored extracting more facial emotional features to improve accuracy, incorporating additional modalities such as audio, time, and body movements in recent studies [11,12]. Kuhnke et al. [13] introduced a dual-stream model that merges audio and image inputs with CNN networks, leveraging facial similarities across emotions for improved accuracy. Attention mechanisms have been integrated into various network models to reduce complexity and parameter accumulation [14]. Wang et al. [15] proposed an attention branch for modularized feature extraction in Facial Expression Recognition (FER), focusing on facial features.
There are also studies that analyze FER through attention mechanisms. ZiYu Huang et al. [16] integrated the SE attention module with a CNN for facial expression analysis, specifically targeting the mouth and nose regions. Arpita Vats et al. [17] presented an FER framework that combines Swin Vision Transformers and SE attention, achieving superior results compared to the winners of the in-the-wild affective behavior analysis (ABAW) competition at ECCV 2022 [18]. Yao et al. [19] proposed an attention mechanism that integrates spatial and channel attention to improve image feature extraction in facial expression recognition without increasing network complexity. Liu et al. emphasized the importance of attention in classifying emotional expressions by fine-tuning a pre-trained Vision Transformer model [20]. Wen et al. introduced the DAN network, utilizing multiple attention heads to focus on various facial regions simultaneously and merging them into a collective attention module for facial expression recognition [21]. Yan suggested integrating SGE attention with the SandGlass network to identify and categorize facial expressions, allowing each spatial position in the network to produce an attention factor that adjusts sub-features, enhancing learned expressions while suppressing noise [12,22]. These studies highlight the significance of attention mechanisms in facial expression recognition.
While previous studies have enhanced the accuracy of facial emotion recognition through network structure and attention mechanism improvements, the models still tend to have a large number of parameters and require further accuracy enhancements. Our model has maintained a similar number of parameters but has managed to enhance both calculation speed and accuracy.
This study introduces a network architecture based on hybrid attention to address the issue of locality in facial feature attention, offering significant value in the realm of facial expression recognition. The key contributions of this research are outlined as follows:
  • The network design incorporates a multimodal extraction approach to capture diverse facial features during initial image feature extraction. By utilizing depth-separable convolution and attention modules, the model reduces parameter count and computational burden, making it suitable for resource-constrained environments without compromising performance.
  • Through the combination of CBAM attention and sub-channel attention, the proposed model can automatically identify facial expressions. Parallel Hybrid attention enhances focus on critical facial information, improving key feature extraction for precise expression classification.
  • Integrating residual networks and the MISH activation function, the network structure effectively addresses issues like gradient vanishing and zero negative gradient during residual transmission. This integration enhances information retention, suppresses irrelevant data, and facilitates information flow between crucial image details and the network model.

2. Methods

2.1. Multi-Feature Extraction Module

A convolutional neural network with residual connections is developed to extract image features in a layered manner, enhancing accuracy in image classification tasks. The model consists of four feature extraction modules (Figure 1A). The initial module employs a 7 × 7 convolution kernel and a 3 × 3 convolution kernel to produce an 8-channel feature map from the input image, maintaining input-output dimensions with padding and enforcing regularization. Batch normalization is applied to normalize the output, aiding model convergence and generalization. Each feature extraction step is followed by a ReLU activation function for network expressiveness. The subsequent modules expand channel numbers using point convolution and further process feature maps with 3 × 3 convolution kernels, without bias. The second and third modules first utilize a 1 × 1 convolutional kernel to expand the number of channels, followed by the use of 5 × 5 and 3 × 3 convolution kernels to process the feature map. The second module concludes with a 3 × 3 convolution module. These branches execute distinct convolution operations and feature extraction on the input image. The final part of the network includes a 3 × 3 maximum pooling layer (MaxPooling2D) to decrease the spatial dimension of the feature map by selecting the maximum value in each region of the image. This step aids in reducing the number of parameters and computational load while retaining important features. Following this, a 1 × 1 convolutional kernel is applied to further extract features from the maximally pooled feature map. The integration of these four feature extraction modules is crucial for establishing the residual connection. Residual connections enable the network to directly capture the variances between inputs and outputs, addressing the issue of gradient vanishing and facilitating more effective training and optimization of deeper networks.
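The paper does not state which framework was used for implementation; because the text uses Keras-style terminology (Dense, MaxPooling2D, global pooling), the sketches in this section are written with TensorFlow/Keras. Below is a minimal, hypothetical sketch of the four-branch extraction stage described above; branch widths and the final merge by concatenation are assumptions where the text leaves them open.

```python
# A minimal Keras sketch of the multi-feature extraction stage (Figure 1A).
# Branch widths and the concatenation-based merge are assumptions.
import tensorflow as tf
from tensorflow.keras import layers


def conv_bn_relu(x, filters, kernel_size, strides=1):
    """Convolution -> BatchNorm -> ReLU, with 'same' padding and no bias."""
    x = layers.Conv2D(filters, kernel_size, strides=strides,
                      padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)


def multi_feature_extraction(inputs, filters=8):
    # Branch 1: 7x7 followed by 3x3 convolution (8-channel feature map).
    b1 = conv_bn_relu(inputs, filters, 7)
    b1 = conv_bn_relu(b1, filters, 3)

    # Branches 2 and 3: 1x1 point convolution to expand channels,
    # then 5x5 / 3x3 kernels on the expanded map.
    b2 = conv_bn_relu(inputs, filters, 1)
    b2 = conv_bn_relu(b2, filters, 5)
    b2 = conv_bn_relu(b2, filters, 3)   # branch 2 ends with a 3x3 block

    b3 = conv_bn_relu(inputs, filters, 1)
    b3 = conv_bn_relu(b3, filters, 3)

    # Branch 4: 3x3 max pooling followed by a 1x1 convolution.
    b4 = layers.MaxPooling2D(pool_size=3, strides=1, padding="same")(inputs)
    b4 = conv_bn_relu(b4, filters, 1)

    # Merge the four branches; the merged map later feeds the residual path.
    return layers.Concatenate()([b1, b2, b3, b4])


# Usage on a 48x48 grayscale face image.
inp = layers.Input(shape=(48, 48, 1))
out = multi_feature_extraction(inp)
model = tf.keras.Model(inp, out)
```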

2.2. MISH Activation Function

The Mish activation function [23], proposed by Diganta Misra in 2019, combines the hyperbolic tangent function (tanh) and the softplus function. It is a smooth and continuously differentiable function, allowing for better gradient propagation during neural network training. Being a non-monotonic, nonlinear function (Equation (1)), Mish captures complex input relationships more effectively than piecewise-linear activations such as ReLU. Unlike ReLU, Mish does not zero out gradients in the negative region, preventing ‘neuronal death’, as shown in Figure 2. Additionally, Mish offers a wider output range than tanh and sigmoid, helping to mitigate the gradient vanishing problem.
$\mathrm{Mish}(x) = x \times \tanh\big(\ln(1 + e^{x})\big)$ (1)
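Equation (1) translates directly into a one-line activation; the sketch below, under the same Keras assumption as in Section 2.1, shows Mish and how it can be plugged into a convolutional layer.

```python
# Mish activation matching Equation (1): x * tanh(softplus(x)).
import tensorflow as tf


def mish(x):
    """Mish: smooth, non-monotonic, keeps small negative values alive."""
    return x * tf.math.tanh(tf.math.softplus(x))


# Example: use Mish as the activation of a Keras layer.
layer = tf.keras.layers.Conv2D(8, 3, padding="same", activation=mish)
```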

2.3. Parallel Hybrid Attention

2.3.1. Channel Attention

Channel attention [24] enhances key features by weighting feature channels according to the correlations between them, thereby improving classification accuracy, as shown in Figure 3A. The channel attention schema consists of three main operations: global average pooling, global max pooling, and a two-layer neural network (MLP). Initially, the feature map undergoes global average pooling and global max pooling to create two 1 × 1 × C global feature descriptors. These descriptors are then fed into a shared two-layer network whose first layer has C/r neurons with a ReLU activation function and whose second layer has C neurons, squeezing the features and then restoring the original channel dimension, as shown in Equations (4)–(6). The outputs of the two branches are added and passed through a Sigmoid activation function σ to obtain a feature coefficient, converting each two-dimensional feature map into a real number, as shown in Equations (2) and (3). By extracting channel information through global average pooling and global max pooling and learning channel attention weights via the two fully connected layers, attention weights are generated with the Sigmoid activation function. Channel attention helps the network capture the feature correlation between different channels, thereby enhancing the global semantic perception of images.
$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big)$ (2)
$M_c \in \mathbb{R}^{C \times 1 \times 1}$ (3)
$\mathrm{MLP}(\mathrm{AvgPool}(F)) = W_1\big(W_0(F^{c}_{avg})\big)$ (4)
$\mathrm{MLP}(\mathrm{MaxPool}(F)) = W_1\big(W_0(F^{c}_{max})\big)$ (5)
$W_0 \in \mathbb{R}^{C/r \times C}, \quad W_1 \in \mathbb{R}^{C \times C/r}$ (6)
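A minimal sketch of this channel attention branch (Equations (2)–(6)), again assuming a Keras implementation; the reduction ratio r = 8 is an illustrative choice, not a value stated in the paper.

```python
# Channel attention: shared two-layer MLP over the average- and max-pooled
# descriptors, added and squashed by a sigmoid (Equations (2)-(6)).
import tensorflow as tf
from tensorflow.keras import layers


def channel_attention(feature_map, ratio=8):
    channels = feature_map.shape[-1]

    # Shared MLP: C -> C/r (ReLU) -> C, reused for both pooled descriptors.
    dense_reduce = layers.Dense(channels // ratio, activation="relu")
    dense_expand = layers.Dense(channels)

    avg_pool = layers.GlobalAveragePooling2D()(feature_map)   # (B, C)
    max_pool = layers.GlobalMaxPooling2D()(feature_map)       # (B, C)

    avg_out = dense_expand(dense_reduce(avg_pool))
    max_out = dense_expand(dense_reduce(max_pool))

    # Sigmoid of the sum gives one weight per channel, reshaped to 1x1xC.
    weights = layers.Activation("sigmoid")(layers.Add()([avg_out, max_out]))
    weights = layers.Reshape((1, 1, channels))(weights)

    # Rescale the input feature map channel-wise.
    return layers.Multiply()([feature_map, weights])
```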

2.3.2. Spatial Attention

Similar to channel attention, the input feature map of size H × W × C undergoes both maximum pooling (F^s_max) and average pooling (F^s_avg) along the channel dimension to produce two H × W × 1 feature maps, as shown in Equations (9)–(11). These two feature maps are then concatenated along the channel dimension. Subsequently, a 7 × 7 convolutional layer is applied, followed by the Sigmoid activation function σ, to obtain a weight coefficient M_s, as shown in Equations (7) and (8). Finally, the input feature map is multiplied by this coefficient to generate a rescaled feature map. Average pooling and maximum pooling extract spatial information from the feature map, and a convolution operation fuses the two pooled maps into spatial attention weights. Spatial attention [25] helps the network focus on crucial areas of the image, thereby enhancing the perception of local features, as shown in Figure 3B.
$M_s(F) = \sigma\big(f^{7 \times 7}([\mathrm{AvgPool}(F);\ \mathrm{MaxPool}(F)])\big)$ (7)
$M_s \in \mathbb{R}^{1 \times H \times W}$ (8)
$\mathrm{AvgPool}(F) = F^{s}_{avg}$ (9)
$\mathrm{MaxPool}(F) = F^{s}_{max}$ (10)
$F^{s}_{avg} \in \mathbb{R}^{1 \times H \times W}, \quad F^{s}_{max} \in \mathbb{R}^{1 \times H \times W}$ (11)
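Correspondingly, a minimal sketch of the spatial attention branch in Equations (7)–(11), under the same Keras assumption; the 7 × 7 kernel follows the text.

```python
# Spatial attention: channel-wise average and max pooling, concatenation,
# a 7x7 convolution, and a sigmoid giving an HxWx1 weight map (Eqs. (7)-(11)).
import tensorflow as tf
from tensorflow.keras import layers


def spatial_attention(feature_map, kernel_size=7):
    # Pool along the channel axis to obtain two HxWx1 maps.
    avg_pool = layers.Lambda(
        lambda x: tf.reduce_mean(x, axis=-1, keepdims=True))(feature_map)
    max_pool = layers.Lambda(
        lambda x: tf.reduce_max(x, axis=-1, keepdims=True))(feature_map)

    concat = layers.Concatenate(axis=-1)([avg_pool, max_pool])  # HxWx2

    # 7x7 convolution + sigmoid -> spatial weight map M_s of shape HxWx1.
    weights = layers.Conv2D(1, kernel_size, padding="same",
                            activation="sigmoid", use_bias=False)(concat)

    # Rescale the input feature map position-wise.
    return layers.Multiply()([feature_map, weights])
```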

2.3.3. Parallel Hybrid Attention Module

Because different attention mechanisms focus on different characteristics of the target, a parallel hybrid attention module, called PH-CBAM, is proposed, as shown in Figure 4. This module combines channel attention and spatial attention to improve the capacity for extracting information from feature maps. The module splits the input feature map into two branches for feature extraction, with coefficients α and β assigned to the two branches, respectively, ensuring that the sum of α and β equals 1. Channel attention primarily focuses on global image features, while spatial attention emphasizes local image features; the selection of α and β is therefore crucial. If the global characteristics of the subject are of greater importance, coefficient α is increased; otherwise, coefficient β is increased. Our research primarily focuses on FER analysis, requiring both overall control of the face and detailed expression analysis. Through multiple experiments, we determined that setting both α and β to 0.5 yielded the best results.
The first branch applies the channel attention mechanism to the input feature map to derive the channel weights. The original feature map is then merged with these channel weights to generate a new feature map, denoted as F′. Next, the spatial attention mechanism is applied to F′ to obtain the corresponding spatial attention weights. Ultimately, the fusion of F′ and these spatial attention weights results in the final feature map of the first branch, referred to as F″, as shown in Equations (12) and (13). The hybrid attention module is devised to leverage both channel and spatial information to effectively extract features from the input feature map. By combining channel attention and spatial attention, the model is able to concentrate more efficiently on significant channels and spatial positions within the input feature map [26].
$F' = M_c(F) \otimes F$ (12)
$F'' = M_s(F') \otimes F'$ (13)
The second branch of the proposed module first splits the original input feature tensor into two parts along the channel dimension, as shown in Equation (14). One part undergoes channel attention to derive feature1, while the other passes through spatial attention to obtain feature2. The channel attention and spatial attention functions extract two weighted feature maps from the respective parts, which are then concatenated and fused into a single attention map that serves as the branch output, as shown in Equations (15) and (16). This output attention map is then combined with the result of the first branch to obtain the final feature map, as shown in Equation (17). By applying the proposed attention mechanism to the input feature map, a more comprehensive and valuable set of features is extracted, thereby enhancing the model’s performance and generalization capabilities.
$F_{c/2} \in \mathbb{R}^{C/2 \times H \times W}$ (14)
$M_S = \mathrm{Concat}\big[M_c(F_{c/2});\ M_s(F_{c/2})\big]$ (15)
$M_S' = \sigma\big(f^{7 \times 7}(M_S)\big)$ (16)
$\mathrm{Output} = F \otimes M_S', \quad \mathrm{Output} \in \mathbb{R}^{C \times H \times W}$ (17)
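Putting the two mechanisms together, the sketch below assembles the parallel hybrid module of Figure 4 from the channel_attention and spatial_attention helpers sketched above. How the two branches are finally combined (here, a weighted sum with α = β = 0.5) and the channel count of the 7 × 7 fusion convolution are assumptions consistent with, but not fully pinned down by, the text.

```python
# Parallel hybrid attention (Figure 4): a serial CBAM branch plus a
# split-channel branch, blended with coefficients alpha and beta.
# channel_attention and spatial_attention are the sketches from
# Sections 2.3.1 and 2.3.2.
import tensorflow as tf
from tensorflow.keras import layers


def parallel_hybrid_attention(feature_map, alpha=0.5, beta=0.5):
    # Branch 1: CBAM-style serial channel then spatial attention (Eqs. (12)-(13)).
    branch1 = spatial_attention(channel_attention(feature_map))

    # Branch 2: split the channels in half (Eq. (14)).
    half = feature_map.shape[-1] // 2
    first_half = layers.Lambda(lambda x: x[..., :half])(feature_map)
    second_half = layers.Lambda(lambda x: x[..., half:])(feature_map)

    feat1 = channel_attention(first_half)      # channel attention on one half
    feat2 = spatial_attention(second_half)     # spatial attention on the other

    fused = layers.Concatenate(axis=-1)([feat1, feat2])                # Eq. (15)
    weights = layers.Conv2D(feature_map.shape[-1], 7, padding="same",
                            activation="sigmoid", use_bias=False)(fused)  # Eq. (16)
    branch2 = layers.Multiply()([feature_map, weights])               # Eq. (17)

    # Weighted blend of the two branches (alpha + beta = 1); the exact
    # combination rule is an assumption based on the description in the text.
    return layers.Add()([
        layers.Lambda(lambda x: alpha * x)(branch1),
        layers.Lambda(lambda x: beta * x)(branch2),
    ])
```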
In attention mechanisms, the relationship between channels is captured through global average pooling and global maximum pooling. Following the application of a Dense layer for nonlinear transformation, the results of these two pooling methods are combined to generate attention weights. These operations compress the feature map in the spatial dimension, resulting in a vector that corresponds to the number of channels. This approach significantly reduces the number of parameters, as subsequent fully connected layers process only this vector rather than the entire feature map. In the channel attention mechanism, two fully connected layers operate on the outputs of global average pooling and global max pooling, respectively, before adding the results together. This design leverages shared layers to minimize the number of parameters by reusing the same parameters across fully connected layers. Additionally, a smaller ratio is employed in the attention mechanism to decrease the channel dimension, further reducing the number of parameters in the fully connected layer and alleviating the computational burden. In spatial attention, average pooling and maximum pooling are utilized in the spatial dimension to capture the spatial information of the feature map while minimizing parameter calculations. Small convolutional layers are then used to generate spatial attention weights. Split-channel attention divides the input feature map along the channel dimension, applying channel attention and spatial attention independently before recombining the results. Due to this segmentation in the channel dimension, the parameter calculation for split-channel attention is halved compared to the overall channel attention and spatial attention.

2.4. The Proposed Model

The increase in depth, defined as the number of layers in the network, enables the model to capture more complex features; however, it can also lead to challenges such as gradient vanishing or exploding. These issues are often mitigated through techniques such as residual connections. In our study, we evaluated the impact of varying the number of layers on model performance using the CK+ dataset. The results indicate that a network with five layers yields the highest accuracy, as shown in Table 1. As the number of layers continues to increase, accuracy does not rise proportionally, despite an increase in the model’s computational complexity. By balancing computational complexity and accuracy, we ultimately determined that five layers represent the optimal configuration for our model. To enhance the robustness and flexibility of the proposed model for future applications, we incorporated depth-separable convolutions and residual connections in each layer type, along with attention mechanisms to emphasize essential features.
The proposed neural network architecture is well-suited for image classification tasks, boasting approximately 1,960,000 parameters. This parameter count is lower compared to current mainstream deep convolutional neural networks, yet it still achieves a commendable accuracy rate. The model features a deep convolutional neural network architecture with a PH-CBAM module specifically crafted to enhance image classification performance. Comprised of convolutional layers, batch normalization, activation functions, and pooling layers, this model effectively captures image features. The hybrid attention module elevates feature representation by incorporating both channel attention and spatial attention mechanisms, enabling the network to prioritize important feature regions while suppressing irrelevant features. Each module includes convolution operations and residual connections to facilitate information flow between convolutional blocks, ultimately enhancing the model’s ability to comprehend image content and boosting accuracy and generalization in image classification, shown in Figure 1.
In standard convolution, the convolution kernel size is denoted as D_k × D_k, the number of input channels as M, the number of output channels as N, and the output feature map size as D_w × D_h. The standard convolution then requires D_k × D_k × M × N parameters and D_k × D_k × M × N × D_w × D_h multiply–add operations. Depthwise separable convolution factors this operation into a depthwise convolution with M kernels of size D_k × D_k and a pointwise convolution with N kernels of size 1 × 1 × M, giving a total of D_k × D_k × M + M × N parameters. The depthwise stage applies its D_k × D_k × M kernel parameters at D_w × D_h positions, and the N sets of 1 × 1 × M pointwise parameters are likewise applied at D_w × D_h positions; summing the two stages yields a computation cost of D_k × D_k × M × D_w × D_h + M × N × D_w × D_h for the depthwise separable convolution. Equation (18) gives the ratio of parameters between depthwise separable convolution and standard convolution, while Equation (19) gives the corresponding ratio of computational effort.
$\dfrac{D_k \times D_k \times M + M \times N}{D_k \times D_k \times M \times N} = \dfrac{1}{N} + \dfrac{1}{D_k^{2}}$ (18)
$\dfrac{D_k \times D_k \times M \times D_w \times D_h + M \times N \times D_w \times D_h}{D_k \times D_k \times M \times N \times D_w \times D_h} = \dfrac{1}{N} + \dfrac{1}{D_k^{2}}$ (19)
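A quick numeric check of Equations (18) and (19) for an illustrative layer (D_k = 3, M = 64, N = 128; values chosen only for the example):

```python
# Numeric check of the parameter/FLOP ratio in Equations (18)-(19).
Dk, M, N = 3, 64, 128   # illustrative layer sizes, not values from the paper

standard_params = Dk * Dk * M * N
separable_params = Dk * Dk * M + M * N

ratio = separable_params / standard_params
closed_form = 1 / N + 1 / Dk ** 2

print(f"parameter ratio: {ratio:.4f}")        # ~0.1189
print(f"1/N + 1/Dk^2:    {closed_form:.4f}")  # ~0.1189, matching Eq. (18)
```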
It can be observed that, following the enhancement of the network structure, both the number of multiplication and addition operations and the total number of parameters are significantly reduced. Furthermore, for standard convolutions that do not place a 1 × 1 convolution kernel before the 3 × 3 convolution kernel within the network architecture, there is also a considerable decrease in the number of parameters and in the volume of multiplication and addition operations.
In the initial module (Module 1), a structure combining separable convolutions and residual connections is implemented. This structure comprises two separable convolutional layers, a 1 × 1 convolutional layer for adjusting channel numbers, and a residual connection to avoid vanishing gradients. Input features first undergo a lightweight convolution operation, followed by batch normalization and the ReLU activation function. The features are then convolved again with a 3 × 3 kernel in two stages to reduce the feature map size, again followed by batch normalization and ReLU activation. These convolution operations extract and transform features so that they align with the subsequent network layer. Simultaneously, another branch executes a series of convolution operations, including a 1 × 1 convolution, a 3 × 3 convolution, and two 3 × 3 separable convolutions, with batch normalization and ReLU activation at each stage; the feature map size is then halved via maximum pooling. The features from both paths are connected residually by adding their feature maps and activating them with the ReLU function. The merged features are then processed by an attention module to enhance image perception and are finally connected residually with features from another branch to achieve a more comprehensive and effective feature representation, providing valuable image features for the classification task, as sketched below. Modules 2 to 5 repeat this structure, each increasing the number of filters in the convolutional layers. This gradual extraction and abstraction of features from low to high levels involves additional convolutional layers, pooling layers, and branching structures. Following each module, the output is processed by the hybrid attention module to boost the importance of features and the network’s perception. Residual connections and branching structures are also incorporated in the model: residual connections aid the transfer of information and gradients across different levels, mitigating gradient vanishing and speeding up training and convergence, while the branching structure enables the network to learn feature representations at diverse levels, enhancing flexibility and expressive capacity. Finally, the feature map’s dimensionality is reduced using a global average pooling layer and mapped to category probabilities via the Softmax activation function for image classification, yielding an end-to-end network from the input layer to the output layer.
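As a rough illustration of one such module, the hypothetical sketch below combines a separable-convolution main path, a 1 × 1 shortcut, a residual addition, and the parallel_hybrid_attention helper from Section 2.3.3; filter counts, strides, and the exact branch layout are assumptions, since Figure 1B fixes them only partially.

```python
# A simplified sketch of one network module: separable-convolution main path,
# 1x1 shortcut that matches channels and resolution, residual add, then the
# parallel hybrid attention module re-weights the merged features.
# parallel_hybrid_attention is the sketch from Section 2.3.3.
import tensorflow as tf
from tensorflow.keras import layers


def ph_cbam_block(x, filters):
    # Shortcut: 1x1 convolution adjusts channels; stride 2 matches the
    # resolution of the pooled main path.
    shortcut = layers.Conv2D(filters, 1, strides=2, padding="same",
                             use_bias=False)(x)
    shortcut = layers.BatchNormalization()(shortcut)

    # Main path: two separable convolutions, then max pooling halves H and W.
    y = layers.SeparableConv2D(filters, 3, padding="same", use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.SeparableConv2D(filters, 3, padding="same", use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    y = layers.MaxPooling2D(3, strides=2, padding="same")(y)

    # Residual connection, then attention on the merged features.
    y = layers.ReLU()(layers.Add()([y, shortcut]))
    return parallel_hybrid_attention(y)
```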
A batch normalization layer is included after each convolutional layer in the model. No additional reduction of height and width is required, because a maximum pooling layer with a stride of 2 has already been applied. In the first residual block of each subsequent module, the number of channels doubles in comparison to the previous module, while the height and width are halved. The model incorporates a hybrid attention module that combines channel attention and spatial attention mechanisms, allowing the network to more accurately capture essential features in images and achieve outstanding performance in image classification tasks. Additionally, the inclusion of residual connections and a branch structure further enhances the training efficiency and expressive capability of the network, leading to improved generalization and robustness.

3. Experiment

The study begins by introducing three benchmark datasets (FER2013, CK+, and Bigfer2013). Advanced techniques were then used to conduct a comparative analysis to verify the performance of the proposed PH-CBAM model on the test set. Key performance indicators, namely accuracy, precision, recall, and F1 score, were utilized for evaluation, as shown in Equations (20)–(23).
$\mathrm{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$ (20)
$\mathrm{Precision} = \dfrac{TP}{TP + FP}$ (21)
$\mathrm{Recall} = \dfrac{TP}{TP + FN}$ (22)
$F1\,\mathrm{score} = \dfrac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ (23)
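Equations (20)–(23) can be computed directly from confusion-matrix counts; the short sketch below evaluates them for a single class (the per-class values are then macro-averaged over the seven emotions).

```python
# Evaluation metrics of Equations (20)-(23) from binary confusion-matrix counts.
def classification_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Example counts for a single emotion class (illustrative values only).
print(classification_metrics(tp=80, tn=90, fp=10, fn=20))
```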

3.1. Dataset Details

The model is initially trained on the training set and then validated on the test set to determine the best performing model. When the dataset is limited in size, a common practice is to split it into an 80–20 ratio, with 80% used for training and 20% for testing. Two databases are utilized in the training phase: CK and the Extended Cohn–Kanade Dataset (CK+) [27]. The CK+ dataset is extensively used in facial expression recognition and analysis research, offering a wider range of facial expressions, sequences, and higher image quality compared to the original CK database. It comprises 327 sequences from 123 participants, covering seven basic facial expressions such as happy, sad, surprised, fearful, disgusted, angry, and contemptuous. Each sequence captures the transition from a neutral expression to a specific one, making it ideal for studying expression dynamics. Details of the three datasets are shown in Table 2 and Figure 5.
The FER2013 dataset, short for Facial Expression Recognition 2013, is a publicly available dataset commonly used in facial expression recognition research. It was initially presented as part of the Challenges in Representation Learning competition at the 2013 International Conference on Machine Learning (ICML) and is tailored for automated facial expression recognition tasks. Notably, the dataset comprises a substantial number of facial expression images captured in natural, uncontrolled environments, commonly referred to as ‘in the wild’. This characteristic makes the dataset particularly valuable for the development and evaluation of facial expression recognition systems with practical applications. The dataset includes images sourced from the internet, encompassing various lighting conditions, facial obstructions (such as glasses and beards), diverse age groups, genders, and ethnicities. This diversity adds complexity to the dataset, thereby more accurately reflecting real-world scenarios.
The FER 2013 dataset categorizes facial expressions into seven basic emotions: anger, disgust, fear, joy, sadness, surprise, and neutrality. It consists of 35,887 grayscale images with a resolution of 48 × 48 pixels, divided into training, public test, and private test sets for developing and assessing facial expression recognition models.
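For reference, a minimal loading sketch for the FER2013 CSV; the 'emotion', 'pixels', and 'Usage' column names follow the commonly distributed version of the file and are assumptions insofar as the paper does not describe its preprocessing code.

```python
# Load the FER2013 CSV: each row stores a 48x48 grayscale image as a
# space-separated string of 2304 pixel values plus an integer emotion label.
import numpy as np
import pandas as pd

df = pd.read_csv("fer2013.csv")

images = np.stack([
    np.asarray(s.split(), dtype=np.float32).reshape(48, 48, 1)
    for s in df["pixels"]
]) / 255.0                                  # normalize to [0, 1]
labels = df["emotion"].to_numpy()           # 7 emotion classes

mask = (df["Usage"] == "Training").to_numpy()
x_train, y_train = images[mask], labels[mask]
x_test, y_test = images[~mask], labels[~mask]
```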
The Bigfer2013 dataset combines images from the FER2013 dataset with the ‘Muxspace’ database curated by Brian Lee Yung Rowe, creating a larger dataset focused on facial expression analysis. This dataset, available as a CSV file, consists of grayscale images with dimensions of 48 × 48 pixels. It merges 35,887 images from FER2013 and 13,681 images from the ‘Muxspace’ dataset, totaling 49,568 images. The dataset includes 14,685 happy expression images (29.63%), 13,066 neutral expression images (26.36%), 6345 sad expression images (12.8%), 5205 angry expression images (10.5%), 5142 fear expression images (10.37%), 4379 surprised expression images (8.82%), and 755 disgusted expression images (1.52%). The expansion of this dataset aims to offer a more diverse training resource, encompassing a wide range of facial attributes and their variations (such as pose, lighting, and occlusion) to facilitate the development and training of facial expression algorithms.

3.2. Result

3.2.1. The Performance of the Proposed Model

In this study, we utilized the CK+, FER2013, and Bigfer2013 datasets to train and evaluate our proposed model. The model underwent training for 150 epochs using the Adam optimizer and cross-entropy as the loss function, with an initial learning rate of 0.001. Each training batch consisted of 16 samples. Due to the absence of a distinct test set in the CK+ dataset, validation/test samples were utilized for assessing model performance. The FER2013 and Bigfer2013 datasets were randomly divided into 80% training and 20% validation sets. The experimental results are shown in Figure 6 and Figure 7 and Table 3.
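A sketch of this training configuration in Keras terms; build_ph_cbam_model is a hypothetical helper standing in for the full network assembled from the blocks sketched in Section 2, and x_train/y_train/x_test/y_test follow the loading sketch in Section 3.1.

```python
# Training configuration as described above: Adam (lr = 0.001), cross-entropy,
# 150 epochs, batch size 16.
import tensorflow as tf

# Hypothetical builder for the full PH-CBAM network (not defined in the paper).
model = build_ph_cbam_model(input_shape=(48, 48, 1), num_classes=7)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss="sparse_categorical_crossentropy",   # assumes integer emotion labels
    metrics=["accuracy"],
)

history = model.fit(
    x_train, y_train,
    validation_data=(x_test, y_test),
    epochs=150,
    batch_size=16,
)
```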
The model achieved a recognition accuracy of 97.13% on the CK+ dataset during training, with the loss value gradually approaching zero in the multi-class classification task. This performance surpasses the standard framework of current FER systems. Evaluation results on the test dataset demonstrate the effectiveness of the model in terms of accuracy, recall, and F1 score. The model effectively identifies most sample images as corresponding emotion categories on the CK+ dataset, as confirmed by the confusion matrix results. In the ROC plot, the AUC area for all eight categories reaches 1.00, with each individual category also exhibiting AUC values close to 1, indicating excellent performance of the network. Furthermore, the confusion matrix displays high classification accuracy for each facial expression. However, the neutral expression recognition accuracy in the confusion matrix reveals challenges in distinguishing it from other expressions, likely due to the limited number of images and the inherent difficulty in identification. In summary, our enhanced FER model exhibits remarkable effectiveness in recognition accuracy and performance evaluation metrics, particularly in accurately recognizing facial expressions across most samples, leading to its superior performance.
The model trained on the FER2013 dataset achieved a recognition accuracy of 68.82% and a loss rate of 0.94. Across the key performance indicators of recall, precision, and F1 score, the model demonstrated an average accuracy, recall, and F1 score of 68.21%, 68.39%, and 68.22%, respectively, when tested on the seven emotion categories of the FER2013 dataset. The model’s confusion matrix from the test samples further validated its effectiveness in FER, particularly in predicting happiness. In the confusion matrix, the classification performance of natural, happy, and surprised expressions was notable, showcasing variations in learning accuracy across different expressions due to the uneven distribution of images. Specifically, the accuracy rates for these expressions were 69%, 88%, and 77%, respectively. Moreover, the AUC value in the ROC curve was 0.93, indicating the high overall efficiency of the model. These findings underscore the model’s capability in FER, with the confusion matrix analysis revealing performance variations across emotion categories. While the model showed strong overall performance, there were disparities in classification accuracy among categories, indicating areas for potential enhancement.
Building upon the FER2013 dataset, the Bigfer2013 dataset expands the collection with additional annotated images, all standardized to a size of 48 × 48 pixels. This dataset features a diverse set of subjects, enhancing the model’s ability to generalize. During model evaluation, emphasis was placed on various metrics such as accuracy, average precision, recall, and F1 score on the validation set. The optimized model achieved notable results with 72.31% validation accuracy, 71.82% average precision, 71.66% recall, and 71.69% F1 score. The analysis of the confusion matrix reveals that the model’s performance improves significantly as the dataset size increases. Specifically, the accuracy of neutral expression recognition sees a notable enhancement due to the larger number of images featuring neutral expressions. This demonstrates that as the number of images grows, the model’s learning capability shows great potential, further confirming the efficiency and reliability of our model in handling extensive datasets. The AUC area under the ROC curve reached 0.95, a 0.02 increase over the FER2013 result, showcasing the network’s outstanding performance. Moreover, the AUC area for each expression surpasses 0.90, suggesting that as the number of images increases, the model’s ability to differentiate between different expressions also improves significantly.

3.2.2. Ablation Experiments

The ablation study for the PH-CBAM module aimed to assess the impact of various module configurations on model performance. By integrating multiple attention mechanisms, the hybrid attention module improves model performance by enabling effective focus on different key aspects of the input data. Initially, a hybrid attention model was developed by combining spatial and channel attention, along with separate spatial and channel attention modules. Subsequent ablation studies involved removing either the spatial or the channel attention module to evaluate their individual contributions to model performance. We tested various ablation strategies, such as removing all attention modules and using only the ReLU activation function, to determine the optimal fusion method. In addition, we summarized the best-suited modules for the model by observing the results obtained with different attention mechanisms.
The results of the attention ablation experiment provided valuable insights, as shown in Table 4. Removing the spatial attention module was found to have a negative impact on model performance, particularly in tasks requiring spatial information processing. This is because the spatial attention module enables the model to focus more accurately on the spatial dimension of the input data, thereby enhancing performance. Similarly, the channel attention module is crucial for enhancing the model’s comprehension of inter-channel relationships and the significance of individual channels in the input data. Therefore, retaining the channel attention module is vital for improving model performance. By comparing ablation experiments of different mixing protocols, the optimal combination of attention modules was identified. Increasing the number of attention modules was found to enhance performance, while excessive and redundant attention could lead to performance saturation or degradation. Overall, attention ablation experiments allow us to assess the impact of various attention modules on model performance, determine the best attention module configuration, and ultimately enhance model performance and robustness.
In Table 4, we observe that among the three different attention mechanisms, the proposed attention mechanism demonstrates the most significant impact. The OURS method achieved the highest performance with an accuracy of 68.82%, followed by CBAM at 68.07% and Split-channel attention at 67.92%. Additionally, the average accuracy, recall, and F1 score are also the highest among the proposed networks. Notably, the model introduced in this paper contains only 1.98 million parameters, indicating a relatively low parameter count, with fewer than five layers. This model does not employ large-scale convolution or a substantial number of parameters for feature extraction. Consequently, we incorporated various attention mechanisms into the benchmark model to validate that our proposed attention can yield superior results in a model with a limited number of parameters. Our model’s accuracy surpasses that of the benchmark model by 1.45%, and it outperforms CBAM and Split-channel attention by 0.75% and 0.9%, respectively. Furthermore, it is noteworthy that the proposed Split-channel attention exceeds CBAM by 0.15% in terms of performance, demonstrating its superior properties. The OURS method excels across all performance metrics, and while there is a slight increase in computational complexity and the number of parameters, this is deemed acceptable given the performance enhancement. Both CBAM and Split-channel attention methods also represent good trade-offs between performance and computational complexity. In scenarios where computational resources are constrained, a model that omits all attention mechanisms remains a viable option, albeit with slightly diminished performance. The RELU method is relatively less recommended, as it performs the worst across all indicators.
To further investigate the efficacy of PH-CBAM, a heat map was employed to visualize attention on facial expressions, as illustrated in Figure 8. This process involves upsampling the feature map to the input image size and then mapping it back onto the original image for visualization. The resulting diagram depicts the distribution of visual attention within the model. Each column in the figure corresponds to a distinct expression in the CK+ dataset. The first row displays the original expression image, the second row presents the attention features under CBAM, the third row highlights the attentional interest points of split-channel attention, and the fourth row showcases the visual attention map of PH-CBAM. To keep the benchmark model unchanged, these comparisons were obtained by substituting each attention mechanism in turn, following the same protocol as the ablation experiments reported above.
In the field of facial expression analysis, key facial features such as the eyes, nose, and mouth are crucial for differentiating expressions. Consequently, concentrating on these areas is essential for identifying changes in expression. The CBAM attention visualization, represented by a heat map, primarily targets the central region of the face, effectively extracting core facial features through the attention mechanism. However, this approach neglects the middle and lower regions of the face, where the oral cavity plays a significant role in sentiment analysis. In contrast, split-channel attention visualization directs focus towards the middle and lower parts of the face; however, it exhibits notable limitations and fails to fully extract facial features. Our proposed attention mechanism demonstrates a more comprehensive emphasis on key facial features. Compared to the other two attention mechanisms, our method encompasses a broader range of areas, thereby underscoring the reliability and effectiveness of our proposed mechanism in extracting facial features.
To further assess the reliability of our model, we conducted a t-test. The experimental results are presented in the table, where we evaluated the accuracy of the proposed model alongside the accuracy of each model from the ablation experiments. All t-test results yielded p-values below 0.05, with the lowest value being 0.0004. This indicates that our model demonstrates significant differences compared to the other models tested in the ablation experiments. These findings substantiate that our model enhances overall performance relative to the other models, thereby reflecting its reliability and stability.
The model was trained and compared with various typical models in the field of facial expression recognition. The results, displayed in Table 5, indicate that the proposed model achieved a higher accuracy rate compared to other state-of-the-art models. The CK+ dataset was collected in a controlled environment to minimize noise, and the results of the current methods presented in the table are close to 100%. However, the network we propose still achieves outstanding results. Compared to advanced lightweight models such as CBAM, improved MobileNetV2, and IE-DBN, our attention module emphasizes local areas, thereby enhancing the recognition accuracy of facial emotions. Our proposed network consistently outperforms these models due to its incorporation of both channel and spatial attention, allowing it to focus on a broader range of facial expressions. In contrast, the FER2013 dataset presents larger data volumes and more complex environmental conditions, which pose greater challenges for model evaluation. In this context, the model developed by Sidhom O. et al. utilizes a multi-stage hybrid feature extraction approach to enhance efficiency. Simultaneously, researchers utilizing MobileNetV2 and E-FCNN have improved recognition accuracy by extracting facial texture features while maintaining a lightweight model. However, these networks achieve their results at the expense of overlooking the correlation between overall features and detailed features. By concentrating solely on local features or subtle texture features, they may adversely affect the recognition of similar emotions. The attention module we propose addresses this issue by integrating details with overall features, resulting in excellent performance on the FER2013 dataset.

4. Discussion and Conclusions

In this study, a novel deep learning-based facial expression recognition model is introduced, incorporating a hybrid of spatial attention and channel attention. The model demonstrates high accuracy when tested on popular FER datasets such as CK+, FER2013, and Bigfer2013, and performance improvements are assessed through various metrics including accuracy, F1 score, recall, and precision. The ablation experiments on the sub-modules show that PH-CBAM is both accurate and computationally efficient, leading to a significant improvement in accuracy and a more focused representation of facial features, and they validate the effectiveness and progressiveness of the hybrid attention module. Additionally, our model achieved higher accuracy on the three datasets compared with other state-of-the-art models. The multi-modal hybrid feature extraction proves beneficial in capturing important features, as the combination of residual networks and the MISH activation function helps prevent the loss of crucial features that may occur with the ReLU activation function; this approach enhances gradient preservation and addresses the issue of vanishing gradients. Our model surpasses other advanced frameworks in the accurate recognition of facial emotions across diverse samples.
The PH-CBAM proposed in this study can be adapted to different scenarios by adjusting the coefficients α and β of the two branches. Increasing the α coefficient is recommended when focusing on global features, while increasing the β coefficient is recommended when focusing on local features. Further experiments are required to determine the specific α and β coefficients for each scenario.
The model we propose has fewer parameters, simplifying the training process. Our analysis of confusion matrices and ROC curves across various datasets revealed that the number of images per expression in each dataset impacts the model’s category recognition accuracy. Our study underscores the significance of attention modules in facial expression recognition and showcases their potential in shaping future network structures. Currently, our models employ basic cropping and rotation techniques; in the future, more advanced augmentation methods such as brightness adjustment and zooming in and out could be incorporated to enhance the model’s robustness under diverse lighting conditions, face orientations, and facial expressions. Extending hybrid attention to other areas could further verify its effectiveness. Our commitment lies in advancing multi-level network design strategies and exploring novel attention mechanisms to enhance recognition performance. Concurrently, we aim to research more efficient network models that enable quicker and more precise performance on visual computing devices.

Author Contributions

S.W. and J.F. wrote the main manuscript text and the main experiments. L.L. and C.S. prepared the translation and edited this paper. All authors wrote and reviewed the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This project was supported by the Science and Technology Research Project of Hebei Provincial Sports Bureau (2024QT01).

Data Availability Statement

Our datasets (CK+, FER2013, and Bigfer2013) are all public datasets. The datasets used in this study are publicly available from https://www.kaggle.com/datasets/davilsena/ckdataset (accessed on 12 December 2023), https://www.kaggle.com/datasets/deadskull7/fer2013 (accessed on 13 December 2023), and https://www.kaggle.com/datasets/uldisvalainis/fergit (accessed on 13 December 2023).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Pan, X.; Ying, G.; Chen, G.; Li, H.; Li, W. A deep spatial and temporal aggregation framework for video-based facial expression recognition. IEEE Access 2019, 7, 48807–48815. [Google Scholar] [CrossRef]
  2. Pham, T.-D.; Duong, M.-T.; Ho, Q.-T.; Lee, S.; Hong, M.-C. CNN-Based Facial Expression Recognition with Simultaneous Consideration of Inter-Class and Intra-Class Variations. Sensors 2023, 23, 9658. [Google Scholar] [CrossRef] [PubMed]
  3. Gaddam, D.K.R.; Ansari, M.D.; Vuppala, S.; Gunjan, V.K.; Sati, M.M. Human facial emotion detection using deep learning. In Proceedings of the ICDSMLA 2020: 2nd International Conference on Data Science, Machine Learning and Applications, Pune, India, 21–22 November 2020; Lecture Notes in Electrical Engineering; pp. 1417–1427. [Google Scholar]
  4. Hossain, S.; Umer, S.; Rout, R.K.; Tanveer, M. Fine-grained image analysis for facial expression recognition using deep convolutional neural networks with bilinear pooling. Appl. Soft Comput. 2023, 134, 109997. [Google Scholar] [CrossRef]
  5. Tamantini, C.; di Luzio, F.S.; Hromei, C.D.; Cristofori, L.; Croce, D.; Cammisa, M.; Cristofaro, A.; Marabello, M.V.; Basili, R.; Zollo, L. Integrating physical and cognitive interaction capabilities in a robot-aided rehabilitation platform. IEEE Syst. 2023, 17, 1–12. [Google Scholar] [CrossRef]
  6. Poulose, A.; Reddy, C.S.; Kim, J.H.; Han, D.S. Foreground Extraction Based Facial Emotion Recognition Using Deep Learning Xception Model. In Proceedings of the 2021 Twelfth International Conference on Ubiquitous and Future Networks (ICUFN), Jeju Island, Republic of Korea, 17–20 August 2021; pp. 356–360. [Google Scholar]
  7. Zhu, X.; Ye, S.; Zhao, L.; Dai, Z. Hybrid attention cascade network for facial expression recognition. Sensors 2021, 21, 2003. [Google Scholar] [CrossRef]
  8. Cheng, Y.; Kong, D. CSINet: Channel–Spatial Fusion Networks for Asymmetric Facial Expression Recognition. Symmetry 2024, 16, 471. [Google Scholar] [CrossRef]
  9. Alonazi, M.; Alshahrani, H.J.; Alotaibi, F.A.; Maray, M.; Alghamdi, M.; Sayed, A. Automated Facial Emotion Recognition Using the Pelican Optimization Algorithm with a Deep Convolutional Neural Network. Electronics 2023, 12, 4608. [Google Scholar] [CrossRef]
  10. Huang, C. Combining convolutional neural networks for emotion recognition. In Proceedings of the 2017 IEEE MIT Undergraduate Research Technology Conference (URTC), Cambridge, MA, USA, 3–5 November 2017; pp. 1–4. [Google Scholar]
  11. Rajan, V.; Brutti, A.; Cavallaro, A. Is cross-attention preferable to self-attention for multi-modal emotion recognition. In Proceedings of the ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual, 22–27 May 2022; pp. 4693–4697. [Google Scholar]
  12. Zhuang, X.; Liu, F.; Hou, J.; Hao, J.; Cai, X. Transformer-based interactive multi-modal attention network for video sentiment detection. Neural Process. Lett. 2022, 54, 1943–1960. [Google Scholar] [CrossRef]
  13. Kuhnke, F.; Rumberg, L.; Ostermann, J. Twostream aural-visual affect analysis in the wild. In Proceedings of the 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), Buenos Aires, Argentina, 16–20 November 2020; pp. 600–605. [Google Scholar]
  14. Savchenko, A.V.; Savchenko, L.V.; Makarov, I. Classifying emotions and engagement in online learning based on a single facial expression recognition neural network. IEEE Trans. Affect. Comput. 2022, 13, 2132–2143. [Google Scholar] [CrossRef]
  15. Wang, Z.; Zeng, F.; Liu, S.; Zeng, B. OAENet: Oriented attention ensemble for accurate facial expression recognition. Pattern Recognit. 2021, 112, 107694. [Google Scholar] [CrossRef]
  16. Huang, Z.Y.; Chiang, C.C.; Chen, J.H.; Chen, Y.C.; Chung, H.L.; Cai, Y.P.; Hsu, H.C. A study on computer vision for facial emotion recognition. Sci. Rep. 2023, 13, 8425. [Google Scholar] [CrossRef] [PubMed]
  17. Vats, A.; Chadha, A. Facial Expression Recognition using Squeeze and Excitation-powered Swin Transformers. arXiv 2023, arXiv:2301.10906. [Google Scholar]
  18. Kollias, D. Abaw: Learning from synthetic data & multi-task learning challenges. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 157–172. [Google Scholar]
  19. Yao, L.; He, S.; Su, K.; Shao, Q. Facial expression recognition based on spatial and channel attention mechanisms. Wirel. Pers. Commun. 2022, 125, 1483–1500. [Google Scholar] [CrossRef]
  20. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  21. Wen, Z.; Lin, W.; Wang, T.; Xu, G. Distract your attention: Multi-head cross attention network for facial expression recognition. Biomimetics 2023, 8, 199. [Google Scholar] [CrossRef] [PubMed]
  22. Zhang, X.; Yan, C. E-MobileNeXt: Face expression recognition model based on improved MobileNeXt. Optoelectron. Lett. 2024, 20, 122–128. [Google Scholar] [CrossRef]
  23. Misra, D. Mish: A self regularized non-monotonic activation function. arXiv 2019, arXiv:1908.08681. [Google Scholar]
  24. Guo, M.H.; Xu, T.X.; Liu, J.J.; Liu, Z.N.; Jiang, P.T.; Mu, T.J.; Zhang, S.H.; Martin, R.R.; Cheng, M.M.; Hu, S.M. Attention mechanisms in computer vision: A survey. Comput. Vis. Media 2022, 8, 331–368. [Google Scholar] [CrossRef]
  25. Zhu, X.; Cheng, D.; Zhang, Z.; Lin, S.; Dai, J. An empirical study of spatial attention mechanisms in deep networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6688–6697. [Google Scholar]
  26. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  27. Lucey, P.; Cohn, J.F.; Kanade, T.; Saragih, J.; Ambadar, Z.; Matthews, I. The extended cohn-kanade dataset (ck+): A complete dataset for action unit and emotion-specified expression. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, San Francisco, CA, USA, 13–18 June 2010; pp. 94–101. [Google Scholar]
  28. Zhang, X.; Chen, Z.; Wei, Q. Research and application of facial expression recognition based on attention mechanism. In Proceedings of the 2021 IEEE Asia-Pacific Conference on Image Processing, Electronics and Computers (IPEC), Dalian, China, 14–16 April 2021; pp. 282–285. [Google Scholar]
  29. Zhang, H.; Su, W.; Yu, J.; Wang, Z. Identity–expression dual branch network for facial expression recognition. IEEE Trans. Cogn. Dev. Syst. 2020, 13, 898–911. [Google Scholar] [CrossRef]
  30. Belmonte, R.; Allaert, B.; Tirilly, P.; Bilasco, I.M.; Djeraba, C.; Sebe, N. Impact of facial landmark localization on facial expression recognition. IEEE Trans. Affect. Comput. 2021, 14, 1267–1279. [Google Scholar] [CrossRef]
  31. Zhu, Q.; Zhuang, H.; Zhao, M.; Xu, S.; Meng, R. A study on expression recognition based on improved mobilenetV2 network. Sci. Rep. 2024, 14, 8121. [Google Scholar] [CrossRef]
  32. Sidhom, O.; Ghazouani, H.; Barhoumi, W. Three-phases hybrid feature selection for facial expression recognition. J. Supercomput. 2024, 80, 8094–8128. [Google Scholar] [CrossRef]
  33. Mukhopadhyay, M.; Dey, A.; Kahali, S. A deep-learning-based facial expression recognition method using textural features. Neural Comput. Appl. 2023, 35, 6499–6514. [Google Scholar] [CrossRef]
  34. Jiang, B.; Li, N.; Cui, X.; Liu, W.; Yu, Z.; Xie, Y. Research on Facial Expression Recognition Algorithm Based on Lightweight Transformer. Information 2024, 15, 321. [Google Scholar] [CrossRef]
Figure 1. The model structure we propose. The figure illustrates the precise integration of our attention module within the proposed network model. We apply our attention module to the convolutional output of each layer. Module (A) serves as a multi-feature extraction component, while Module (B) represents the five-layer data structure of the network.
Figure 2. Activation function.
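The MISH activation referenced in this work [23] is defined as mish(x) = x · tanh(softplus(x)). The snippet below is a minimal PyTorch sketch for illustration, not the authors' implementation; the comparison against the built-in F.mish assumes a recent PyTorch version.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    """Mish activation: x * tanh(softplus(x)), as defined in [23]."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.tanh(F.softplus(x))

# Sanity check against the built-in implementation (available in recent PyTorch versions).
x = torch.randn(4, 8)
assert torch.allclose(Mish()(x), F.mish(x), atol=1e-6)
```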
Figure 3. CBAM attention. The module consists of two sub-modules: the Channel Module and the Space Module. (A) The Channel Module. (B) The Space Module. The intermediate feature map is refined adaptively by the CBAM. The Channel submodule (A) passes the max-pooling and average-pooling outputs through a shared network, while the Space submodule (B) concatenates the two corresponding pooled maps along the channel axis and feeds them to a convolutional layer.
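For readers who want the caption above in code form, the following is a minimal PyTorch sketch of the standard CBAM channel and spatial submodules described in [26]. The reduction ratio of 16 and the 7 × 7 spatial convolution are the commonly used defaults from that paper, not values confirmed for the model in this article.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    # (A) Channel submodule: shared MLP applied to max-pooled and average-pooled descriptors.
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # average-pooled channel descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))    # max-pooled channel descriptor
        w = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * w

class SpatialAttention(nn.Module):
    # (B) Spatial submodule: channel-wise mean and max maps concatenated and convolved.
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * w

class CBAM(nn.Module):
    # Channel attention followed by spatial attention, as in [26].
    def __init__(self, channels: int):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        return self.sa(self.ca(x))
```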
Figure 4. The hybrid attention we proposed. The module consists of two branches: the first branch is the CBAM module, while the second branch is divided into two components based on the channel. These components are utilized for channel attention and spatial attention, respectively. The extracted feature maps from these two components are concatenated, followed by a convolution operation that fuses the channels. Ultimately, the resulting split-channel attention is integrated with the first branch to produce the final output feature map.
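To make the two-branch description concrete, the sketch below shows one plausible wiring of the parallel hybrid block: the first branch applies CBAM; the second splits the input along the channel axis, applies channel attention to one half and spatial attention to the other, concatenates the halves, and fuses them with a convolution before the two branch outputs are combined. The class names, the 1 × 1 fusion kernel, and the element-wise addition used to merge the branches are illustrative assumptions rather than details taken from the authors' code; the ChannelAttention, SpatialAttention, and CBAM classes are reused from the previous sketch.

```python
import torch
import torch.nn as nn

# Reuses ChannelAttention, SpatialAttention, and CBAM from the previous sketch.

class SplitChannelAttention(nn.Module):
    """Second branch: split channels, attend to each half differently, then fuse."""
    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2                # assumes an even channel count
        self.ca = ChannelAttention(half)    # channel attention on the first half
        self.sa = SpatialAttention()        # spatial attention on the second half
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)  # assumed 1x1 channel fusion

    def forward(self, x):
        x1, x2 = torch.chunk(x, 2, dim=1)   # split along the channel axis
        y = torch.cat([self.ca(x1), self.sa(x2)], dim=1)
        return self.fuse(y)

class ParallelHybridBlock(nn.Module):
    """Parallel hybrid attention: CBAM branch combined with the split-channel branch."""
    def __init__(self, channels: int):
        super().__init__()
        self.cbam = CBAM(channels)
        self.split = SplitChannelAttention(channels)

    def forward(self, x):
        # Element-wise addition is an assumption; the caption only states that the
        # split-channel result is "integrated" with the CBAM branch.
        return self.cbam(x) + self.split(x)
```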
Figure 5. Example images from the datasets.
Figure 6. ROC curves (a) and confusion matrices (b) for the CK+ dataset (classes 0–7 denote neutral, anger, contempt, disgust, fear, happiness, sadness, and surprise), the FER2013 dataset (classes 0–6 denote angry, disgust, scared, happy, sad, surprised, and neutral), and the Bigfer2013 dataset (classes 0–6 denote angry, disgusted, scared, happy, sad, surprised, and neutral).
Figure 7. (a) Training loss curves for the CK+, FER2013, and Bigfer2013 datasets; (b) training precision curves for the CK+, FER2013, and Bigfer2013 datasets.
Figure 8. Heatmap results for individual models.
Table 1. The impact of different numbers of layers on model performance.
Number of Layers | Accuracy | Parameter Quantity | FLOPs
3 | 91.85% | 0.14 M | 70.95 M
4 | 95.11% | 0.52 M | 88.52 M
5 | 97.13% | 1.98 M | 108.77 M
6 | 95.10% | 3.48 M | 116.44 M
Table 2. The number of images for each expression in the three datasets.
Dataset | Anger | Disgust | Fear | Happiness | Sadness | Surprise | Neutral | Contempt
CK+ | 45 | 59 | 25 | 69 | 28 | 83 | 593 | 18
BigFer2013 | 5202 | 755 | 5142 | 14,685 | 6345 | 4379 | 13,066 | —
Fer2013 | 4953 | 547 | 5121 | 8989 | 6077 | 4002 | 6198 | —
Table 3. A comparison of the accuracy of our models on each dataset.
Dataset | Validation Accuracy | Average Precision | Recall | F1 Score
FER2013 | 68.82 | 68.21 | 68.39 | 68.22
Bigfer2013 | 72.31 | 71.82 | 71.66 | 71.69
CK+ | 97.13 | 96.16 | 95.11 | 95.20
Table 4. Results of ablation experiments.
Configuration | Validation Accuracy | Average Precision | Recall | F1 Score | FLOPs | Params | T-Test
Removing all attention | 67.37 | 67.38 | 67.05 | 67.08 | 106.51 M | 1.92 M | 0.0004
ReLU | 66.62 | 66.57 | 66.38 | 66.35 | 108.26 M | 1.98 M | 0.0006
CBAM | 68.07 | 67.55 | 67.50 | 67.40 | 106.81 M | 1.95 M | 0.0093
Split-channel attention | 67.92 | 67.67 | 67.79 | 67.62 | 108.47 M | 1.96 M | 0.0075
Ours | 68.82 | 68.21 | 68.39 | 68.22 | 108.77 M | 1.98 M | —
Table 5. Comparison of the accuracy of each model.
Approach | Dataset | Accuracy (%)
CBAM [28] | CK+ | 95.1
IE-DBN [29] | CK+ | 96.02
CCFS+SVM [30] | CK+ | 96.05
Improved MobilenetV2 [31] | CK+ | 95.96
Model by Sidhom O et al. [32] | Fer2013 | 66.1
Self-Cure Net [33] | Fer2013 | 66.17
Improved MobileViT [34] | Fer2013 | 62.2
Improved MobilenetV2 [31] | Fer2013 | 68.62
Ours | CK+ | 97.13
Ours | Fer2013 | 68.82
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.