Article

CNN-Based Facial Expression Recognition with Simultaneous Consideration of Inter-Class and Intra-Class Variations

Trong-Dong Pham, Minh-Thien Duong, Quoc-Thien Ho, Seongsoo Lee and Min-Cheol Hong *

1 Department of Information and Telecommunication Engineering, Soongsil University, Seoul 06978, Republic of Korea
2 Department of Intelligent Semiconductor, Soongsil University, Seoul 06978, Republic of Korea
3 School of Electronic Engineering, Soongsil University, Seoul 06978, Republic of Korea
* Author to whom correspondence should be addressed.
Sensors 2023, 23(24), 9658; https://doi.org/10.3390/s23249658
Submission received: 25 October 2023 / Revised: 28 November 2023 / Accepted: 3 December 2023 / Published: 6 December 2023
(This article belongs to the Section Sensing and Imaging)

Abstract

Facial expression recognition is crucial for understanding human emotions and nonverbal communication. With the growing prevalence of facial recognition technology and its various applications, accurate and efficient facial expression recognition has become a significant research area. However, most previous methods have focused on designing unique deep-learning architectures while overlooking the loss function. This study presents a new loss function that allows simultaneous consideration of inter- and intra-class variations and can be applied to CNN architectures for facial expression recognition. More concretely, this loss function reduces the intra-class variations by minimizing the distances between the deep features and their corresponding class centers. It also increases the inter-class variations by maximizing the distances between deep features and their non-corresponding class centers, as well as the distances between different class centers. Numerical results from several benchmark facial expression databases, such as Cohn-Kanade Plus, Oulu-CASIA, MMI, and FER2013, are provided to prove the capability of the proposed loss function compared with existing ones.

1. Introduction

Facial expressions have been used as critical and natural signals to represent human emotions and intentions. Therefore, various facial expression recognition (FER) methods have been studied and applied to fields such as virtual reality (VR) [1], human-robot interaction (HRI) [2], advanced driver assistance systems (ADAS) [3], and disease prevention support systems (DPSS) [4].
Typical FER methods include three stages: (1) facial-component detection, (2) feature point extraction, and (3) facial expression classification. Facial-component detection involves extracting a facial region from an input image to obtain features such as the eyes and nose from the detected facial components. More recent studies have shown that feature extraction can be classified into spatial [5,6] and temporal [7] feature extraction. Generally, the expression classifier and feature extraction are vital for the accuracy of FER. Many classifiers have been developed for facial expression (FE) classification, including the Bayesian classifier [8], Hidden Markov Model (HMM) [9], AdaBoost [10], and Support Vector Machine (SVM) [11]. Figure 1 shows the details of the conventional FER process.
Recent developments in deep learning have achieved significant advancements in computer vision and image processing [12,13,14,15]. Among the deep learning methods, the Convolutional Neural Network (CNN) has been proven capable of reducing the dependency on analytical models and preprocessing techniques by enabling “end-to-end” direct learning from input images. For example, feature extraction and recognition are jointly learned using deep learning methods [16,17,18].
FER is highly sensitive to intra-class variation caused by age, gender, illumination, and facial pose [19]. In addition, because FER datasets are limited and small, training a CNN to extract the salient features that represent the facial expressions in a facial image is problematic. Several methods have been explored to overcome this problem [20]. Examples are the transfer learning method [21] for solving the overfitting problem in training datasets, and the ensemble architectures [22] and hybrid variant input approaches [23] for extracting discriminative features. Notably, most of these approaches primarily concentrated on designing new deep learning architectures and overlooked the loss function. Additionally, the limited training datasets remain a challenge in improving FER performance.
One method of extracting salient features from limited datasets is to change the traditional loss function of the CNN architecture to reduce the intra-class variation and increase the inter-class variation of the deep features, thereby creating discriminative features. Typically, CNN-based FER optimizes the softmax loss function, which seeks to penalize misclassified samples, encouraging the distinction of features between different classes. The softmax layer is crucial for ensuring that the learned features of various classes remain distinguishable. However, severe intra-class variation remains challenging. Advanced loss functions can be used to address this problem. Generally, advanced loss functions are divided into two categories: angular-distance-based methods (L-Softmax [24], AM-Softmax [25]) and Euclidean-distance-based methods (contrastive loss [26], triplet loss [27], and center loss [28]).
The angular-distance-based losses have made the learned features potentially separable with a larger angular/cosine distance. These losses were reformulated based on the original softmax loss, allowing inter-class separability and intra-class compactness between learned features. However, these loss functions were difficult to converge when trained with complex datasets such as that of FER.
Furthermore, the Euclidean-distance-based losses have embedded the input images in the Euclidean space to decrease intra-class variation and increase inter-class variation. Contrastive and triplet losses increased memory load and training time owing to the complex recombination of training samples. Center loss updated the class center by reducing the distance between the deep features and their corresponding class centers. Nevertheless, it disregarded inter-class variation, thus limiting the FER performance improvement.
To summarize, the existing loss functions for CNN-based FER have the following challenges: (1) the difficulty in convergence with the complex training dataset, (2) the high memory consumption and training time, and (3) the disregard of inter-class similarity.
Given the above analysis, this study presents a variant loss to minimize the distance between the deep features and their corresponding class centers as well as maximize the distances of deep features with their non-corresponding class centers and the distances between different class centers. Figure 2 illustrates the concept of the proposed loss function. Finally, the proposed loss function was assessed on four well-known benchmark facial expression databases: the Cohn-Kanade Plus (CK+) [29], the Oulu-CASIA [30], MMI [31], and FER2013 [32] databases. The contributions of this study can be summarized as follows:
  • A new loss function is proposed to simultaneously consider inter- and intra-class variations, which enables CNN-based FER methods to achieve impressive performance.
  • A new loss function can be easily optimized with various CNN architectures on diverse databases to learn the discriminative power of deep features for the FER problem.
  • Comprehensive experiments on benchmark databases are conducted to prove that the auxiliary CNN architectures trained with the proposed loss function performed much better than with existing loss functions.
The remainder of this paper is organized into four sections. Section 2 summarizes previous loss functions and auxiliary CNN architectures. Section 3 describes the proposed loss function that simultaneously considers intra- and inter-class variations. Section 4 analyzes the simulation results, and Section 5 states the conclusions.

2. Related Work

2.1. Previous Loss Functions

The softmax loss is good at increasing the inter-class variation but cannot decrease the intra-class variation. To tackle this problem, several loss functions have been introduced to reduce the intra-class variation. Most representatively, the L-Softmax loss [24] is an improvement over the conventional softmax loss, enabling inter-class separability and intra-class compactness between learned features. With an adjustable margin value, L-Softmax could define a learning task with flexible difficulty for CNNs. It also prevented overfitting, thereby leveraging the powerful learning capacity of deep and wide architectures. Nevertheless, when the training dataset contains various subjects, the convergence of L-Softmax is tougher than that of the softmax loss. The AM-Softmax loss [25] applied an additive margin strategy to the target logit of the softmax loss with normalized features and weights. Although it was intuitively appealing and more interpretable than L-Softmax [24], selecting the margin hyperparameter was challenging.
Contrastive [26] and triplet losses [27] adopted a pair-training technique. In particular, the contrastive loss included negative and positive pairs. Its gradients attracted positive pairs and repelled negative ones. Meanwhile, the triplet loss reduced the distance between an anchor and a positive sample and increased the distance between the anchor and a negative sample of a different identity. The training procedure for these losses was still challenging owing to the selection of effective training samples. Center loss [28] decreased intra-class variations during training by penalizing the distances between deep features and their corresponding class centers. If CNNs are trained with the center loss alone, the deep features and class centers may degrade to zero; the center loss then becomes very small, yet discriminative features are still not achieved. Thus, the center loss should be jointly supervised with the softmax loss during training. However, storing a center for each identity doubled the memory of the last CNN layer.
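For concreteness, a minimal PyTorch sketch of the center-loss idea described above is given below; the class and variable names are ours, and the centers are kept as learnable parameters for brevity, whereas [28] updates them with an explicit rule.

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Center loss: penalizes the distance between each deep feature
    and the center of its own class (intra-class term only)."""
    def __init__(self, num_classes: int, feat_dim: int):
        super().__init__()
        # One center per class; this extra table is why center loss
        # roughly doubles the memory of the last CNN layer.
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        batch_centers = self.centers[labels]              # (batch, feat_dim)
        return 0.5 * ((features - batch_centers) ** 2).sum(dim=1).mean()
```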
Range loss [33] was proposed to effectively use the whole long-tailed dataset in the training procedure. The range loss was optimized jointly with the softmax loss as supervisory signals to train CNNs. However, the optimization strategy could be challenging because the softmax loss requires a uniform distribution among all the classes, and the ability to improve inter-class differences within each mini-batch was restricted. Marginal loss [34] could decrease the intra-class variances and enlarge the inter-class distances by focusing on the marginal samples. The marginal loss, combined with a softmax loss to jointly supervise the learning of CNNs, could greatly improve the discriminative capacity of deep features for efficient facial recognition. Even so, the age-variance restriction in the training data could significantly reduce the performance when there was a large age gap.
According to Table 1, while existing loss functions achieved promising performance, there is still much room for improvement. To this end, this study proposes a variant loss to minimize the distance between the deep features and their corresponding class centers as well as maximize the distances of deep features with their non-corresponding class centers and the distances between different class centers. The proposed loss function is easy to adopt in CNN-based FER methods and achieves outstanding performance.

2.2. Auxiliary CNN Architectures

Given an input image or feature, classification models predict specific labels. In this study, six popular CNN architectures are trained using various loss functions to evaluate the feasibility of the proposed loss function. First, AlexNet [35] has eight layers comprising five convolutional layers and three fully connected layers combined with dropout techniques. Its simplicity and moderate depth made its training fast.
To improve the classification performance, InceptionNet [36] was designed based on the Inception module, which aggregated four parallel branches: three convolution branches with different kernel sizes (1 × 1, 3 × 3, and 5 × 5) and a max-pooling branch. InceptionNet contained 22 layers, including nine Inception modules stacked on top of each other. This design increased the width of the network and adaptability to various scales.
The deep learning networks also suffer from a vanishing gradient problem that impedes accuracy. ResNet [37] was proposed to add skip connections from the input to the output of the convolutional layer to address these problems. The residual block contained two 3 × 3 convolutional layers, each followed by the Batch Norm and ReLU activation function. ResNet-18 was selected to train with the comparative loss function in this study.
DenseNet [38] proposed dense blocks and transition layers. Dense blocks concatenated the output of the previous layer as the input of the next. In this way, a feed-forward nature could be maintained. However, the number of channels would be increased when concatenating layers. The transition layer was used to control the number of feature channels with a 1 × 1 convolution. Moreover, the height and width of the features were reduced through the average pooling layer.
Recently, MobileNetV3 [39] has been applied to mobile and embedded devices owing to its lightweight. It was based on a combination of hardware-aware network architecture search (NAS) algorithm and squeeze-and-excitation (SE) module [40]. The block-wise search algorithm (MnasNet [41]) was employed to identify global network structures, and then the layer-wise search algorithm (NetAdapt [42]) was sequentially employed to adjust individual layers. MobileNetV3 inserted the SE module to build channel-wise attention. The hard-sigmoid function was utilized to substitute the conventional sigmoid in the SE module for more efficient calculation. In addition, the hard-swish function was adopted instead of ReLU for non-linearity improvement.
Finally, ResNeSt [43] is an improved version of ResNet. It combined channel-wise attention with multi-path representation into a unified Split-Attention block. These Split-Attention blocks were stacked to follow the concept of residual learning from the ResNet model [37]. This architecture enhanced learned feature representations for multiple high-level vision tasks, including object detection, image classification, and semantic segmentation. Moreover, it was reported that ResNeSt enabled the acceleration of training and was computationally efficient.
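As a rough illustration of how these backbones could be instantiated for FER, the sketch below adapts the torchvision implementations so that the final classifier matches the number of expression classes; the exact configurations (input size, custom heads, ResNeSt variant) used in the paper are not specified, so everything here is an assumption.

```python
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 7  # e.g., CK+ uses seven expression labels

def build_backbone(name: str) -> nn.Module:
    """Return a CNN whose last fully connected layer outputs NUM_CLASSES logits."""
    if name == "alexnet":
        net = models.alexnet(weights=None)
        net.classifier[6] = nn.Linear(4096, NUM_CLASSES)
    elif name == "inception":                       # GoogLeNet, i.e., Inception v1
        net = models.googlenet(weights=None, aux_logits=False, init_weights=True)
        net.fc = nn.Linear(net.fc.in_features, NUM_CLASSES)
    elif name == "resnet18":
        net = models.resnet18(weights=None)
        net.fc = nn.Linear(net.fc.in_features, NUM_CLASSES)
    elif name == "densenet":
        net = models.densenet121(weights=None)
        net.classifier = nn.Linear(net.classifier.in_features, NUM_CLASSES)
    elif name == "mobilenet_v3":
        net = models.mobilenet_v3_small(weights=None)
        net.classifier[3] = nn.Linear(net.classifier[3].in_features, NUM_CLASSES)
    else:
        # ResNeSt is not part of torchvision; it would come from the authors'
        # separate ResNeSt package and is omitted from this sketch.
        raise ValueError(f"Unknown backbone: {name}")
    return net
```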

3. Proposed Method

As mentioned previously, a variant loss is proposed to minimize the distance between the deep features and their corresponding class centers as well as maximize the distances of deep features with their non-corresponding class centers, and the distances between different class centers. The new loss function is expressed as follows:
$$ L_v = \frac{1}{2}\sum_{i=1}^{M}\left( \left\| F(x_i) - c_{y_i} \right\|_2^2 + \frac{\lambda_1}{\epsilon_1 + \sum_{j=1, j \neq y_i}^{N} \left\| F(x_i) - c_j \right\|_2^2} \right) + \frac{\lambda_2}{\epsilon_2 + \sum_{m}^{N} \sum_{n \neq m}^{N} \left\| c_m - c_n \right\|_2^2}, \tag{1} $$
where $y_i$ and $x_i$ are the label and the input image of the $i$-th facial expression sample, respectively, and $d$ is the dimension of the deep features. $F(\cdot)$ denotes the feature extraction performed by the CNN, and $c_{y_i} \in \mathbb{R}^d$ denotes the class center of the deep features that share the label $y_i$. $M$ is the number of training samples in a batch; $N$ is the number of classes; $c_j$, $c_m$, and $c_n \in \mathbb{R}^d$ are the $j$-th, $m$-th, and $n$-th class centers of the deep features, respectively. $\epsilon_1$ and $\epsilon_2$ are tolerance parameters that keep the denominators strictly positive, and $\lambda_1$ and $\lambda_2$ are hyperparameters used for balancing these loss terms.
The first term is similar to the center loss and tends to reduce the distance between the deep features and their corresponding class centers. The second and third terms tend to increase the distance between the deep features and their non-corresponding class centers and between class centers, respectively. By minimizing the proposed loss function, the intra-class variations of the deep features are reduced, whereas the inter-class variations continue to increase.
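A minimal PyTorch sketch of the variant loss in Equation (1) follows; the class name, the random center initialization, and the use of a buffer for the centers are assumptions rather than details taken from the authors' implementation. The default values of $\lambda_1$, $\lambda_2$, $\epsilon_1$, and $\epsilon_2$ follow Section 4.1.

```python
import torch
import torch.nn as nn

class VariantLoss(nn.Module):
    """Proposed loss L_v of Eq. (1): pull each feature toward its own class center,
    push it away from the other centers, and spread the centers apart."""
    def __init__(self, num_classes, feat_dim, lambda1=0.4, lambda2=0.6, eps1=1e-3, eps2=1e-3):
        super().__init__()
        # Centers are kept as a buffer because they are updated explicitly (Eq. (4)),
        # not by the network optimizer.
        self.register_buffer("centers", torch.randn(num_classes, feat_dim))
        self.lambda1, self.lambda2 = lambda1, lambda2
        self.eps1, self.eps2 = eps1, eps2

    def forward(self, features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Squared distances between every feature and every class center: (M, N).
        sq_fc = ((features.unsqueeze(1) - self.centers.unsqueeze(0)) ** 2).sum(dim=-1)
        idx = torch.arange(features.size(0), device=features.device)

        intra = sq_fc[idx, labels]                                       # term 1: own-center distance
        mask = torch.ones_like(sq_fc)
        mask[idx, labels] = 0.0                                          # drop the own class
        inter = self.lambda1 / (self.eps1 + (sq_fc * mask).sum(dim=1))   # term 2

        # Term 3: summed squared distances between different centers (diagonal is zero).
        sq_cc = ((self.centers.unsqueeze(1) - self.centers.unsqueeze(0)) ** 2).sum(dim=-1)
        spread = self.lambda2 / (self.eps2 + sq_cc.sum())

        return 0.5 * (intra + inter).sum() + spread
```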
The softmax loss is obviously good at increasing the inter-class variation; it is also tractable, which makes the optimized solution easy to obtain. Therefore, the proposed loss function is applied to the batch data in each iteration and trained jointly with the softmax loss. In addition, the most powerful networks tend to combine specific loss functions such that the supervision signals are more successfully backpropagated, mitigating the training difficulty and improving the robustness of network training [44]. In our study, the overall loss function for training the CNN is computed as the weighted sum of the softmax and variant losses. In short, the overall loss function is expressed as follows:
$$ L = L_s + \lambda L_v, \tag{2} $$
where λ is a hyperparameter used for balancing the softmax and the variant losses. The overall system of the CNNs using the proposed loss is illustrated in Figure 3.
In this method, the network parameters, which comprise the CNN parameters $W$ and the softmax loss parameters $\theta$, are updated in mini-batches. Only the gradient of $L_s$ is needed to update the softmax loss parameters $\theta$ because $L_v$ does not affect them. The gradient of the variant loss is used to update $W$. The gradient of $L_v$ with respect to $F(x_i)$ is calculated as follows:
$$ \frac{d L_v}{d F(x_i)} = \left( F(x_i) - c_{y_i} \right) - \lambda_1 \frac{\sum_{j=1, j \neq y_i}^{N} \left( F(x_i) - c_j \right)}{\left( \epsilon_1 + \sum_{j=1, j \neq y_i}^{N} \left\| F(x_i) - c_j \right\|_2^2 \right)^2} - \lambda_2 \frac{\sum_{m = y_i} \sum_{n \neq m}^{N} \left( c_{y_i} - c_n \right) - \sum_{m}^{N} \sum_{n = y_i, n \neq m} \left( c_m - c_{y_i} \right)}{\left( \epsilon_2 + \sum_{m}^{N} \sum_{n \neq m}^{N} \left\| c_m - c_n \right\|_2^2 \right)^2}. \tag{3} $$
In addition, each class center is computed by averaging the deep features of the same class and is updated in each iteration. The centers are updated as follows:
$$ c_k^{t+1} = c_k^t - \alpha \, \Delta c_k, \tag{4} $$
where α is the learning rate of class centers.
The update term $\Delta c_k$ of the $k$-th class center is computed from the derivative of the variant loss with respect to the class center $c_k$:
$$ \Delta c_k = \frac{\sum_{i=1}^{M} \delta(y_i, k) \left( c_k - F(x_i) \right)}{1 + \sum_{i=1}^{M} \delta(y_i, k)} - \lambda_1 \sum_{i=1}^{M} \frac{\left( 1 - \delta(y_i, k) \right) \left( c_k - F(x_i) \right)}{\left( \epsilon_1 + \sum_{j=1, j \neq y_i}^{N} \left\| F(x_i) - c_j \right\|_2^2 \right)^2} - \lambda_2 \frac{\sum_{m}^{N} \delta(m, k) \sum_{n \neq m}^{N} \left( c_k - c_n \right)}{\left( \epsilon_2 + \sum_{m}^{N} \sum_{n \neq m}^{N} \left\| c_m - c_n \right\|_2^2 \right)^2 \left( 1 + \sum_{m}^{N} \delta(m, k) \right)}, \tag{5} $$
where $\delta(y_i, k)$ and $\delta(m, k)$ are defined as
$$ \delta(y_i, k) = \begin{cases} 1, & y_i = k \\ 0, & y_i \neq k, \end{cases} \tag{6} $$
$$ \delta(m, k) = \begin{cases} 1, & m = k \\ 0, & m \neq k. \end{cases} \tag{7} $$
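A vectorized sketch of how the explicit update of Equations (4)-(7) could be implemented for one mini-batch is given below; the function name, tensor layout, and default hyperparameter values (taken from Section 4.1) are our assumptions, not the authors' code.

```python
import torch

@torch.no_grad()
def update_centers(centers, features, labels, alpha,
                   lambda1=0.4, lambda2=0.6, eps1=1e-3, eps2=1e-3):
    """One update c_k <- c_k - alpha * delta_c_k (Eqs. (4)-(7)).
    centers: (N, d), features: (M, d), labels: (M,)."""
    n_cls, m = centers.size(0), features.size(0)
    one_hot = torch.zeros(m, n_cls, dtype=features.dtype, device=features.device)
    one_hot[torch.arange(m, device=features.device), labels] = 1.0   # delta(y_i, k)

    diff = centers.unsqueeze(0) - features.unsqueeze(1)              # (M, N, d): c_k - F(x_i)
    sq = (diff ** 2).sum(dim=-1)                                     # (M, N)

    # Term 1: count-averaged pull of each center toward its own samples.
    counts = one_hot.sum(dim=0)
    term1 = (one_hot.unsqueeze(-1) * diff).sum(dim=0) / (1.0 + counts).unsqueeze(-1)

    # Term 2: per-sample denominator (eps1 + sum_{j != y_i} ||F(x_i) - c_j||^2)^2.
    denom1 = (eps1 + ((1.0 - one_hot) * sq).sum(dim=1)) ** 2          # (M,)
    term2 = lambda1 * ((1.0 - one_hot).unsqueeze(-1) * diff / denom1.view(m, 1, 1)).sum(dim=0)

    # Term 3: repulsion among the centers; 1 + sum_m delta(m, k) equals 2.
    cdiff = centers.unsqueeze(1) - centers.unsqueeze(0)               # (N, N, d): c_m - c_n
    denom2 = (eps2 + (cdiff ** 2).sum()) ** 2
    term3 = lambda2 * cdiff.sum(dim=1) / (denom2 * 2.0)

    return centers - alpha * (term1 - term2 - term3)
```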
CNNs can be trained utilizing standard stochastic gradient descent (SGD) [45]. The hyperparameters of the CNNs comprise the batch size $M$, the number of training iterations $T$, the learning rate $\mu$ of the weight parameters, the learning rate $\alpha$ of the class centers, and the balancing terms $\lambda$, $\lambda_1$, and $\lambda_2$ of the loss function. First, the CNN parameters $W$, the softmax loss parameters $\theta$, and the class centers $c_k$ are initialized. In each iteration, $M$ training images $x_i$ are passed into the CNNs to obtain the output $F(x_i)$ of the last fully connected layer for each batch. The overall loss of the model and the derivatives of the loss functions with respect to $F(x_i)$ are calculated to update the parameters of the CNNs. $\theta$ is independent of the variant loss; therefore, only the softmax loss is considered for its update. Furthermore, the gradient of the variant loss is used to update $W$. The updates of $W$ and $\theta$ are thus performed separately with their respective derivatives. Finally, the derivative of the variant loss with respect to the class center $c_k$ is calculated to update $c_k$ with the learning rate $\alpha$. The training process for CNNs with the proposed loss is summarized in Algorithm 1.
Algorithm 1 Training process for CNNs with the proposed loss
     Input: Training images $x_i$, batch size $M$, number of training iterations $T$, learning rate $\mu$ of the weight parameters, learning rate $\alpha$ of the class centers, hyperparameters $\lambda$, $\lambda_1$, $\lambda_2$.
     Initialization: the CNN parameters $W$, the softmax loss parameters $\theta$, the class centers $c_k$, the iteration $t = 0$.
 1: while $t \leq T$ do
 2:     Calculate the deep features, i.e., the output $F(x_i)$ of the last fully connected layer, for the $M$ input images in one mini-batch.
 3:     Calculate the overall loss as in (2):
 4:         $L = L_s + \lambda L_v$
 5:     Calculate the gradients for each input $i$ by:
 6:         $\frac{d L^t}{d F(x_i)^t} = \frac{d L_s^t}{d F(x_i)^t} + \lambda \frac{d L_v^t}{d F(x_i)^t}$
 7:     Update the parameters $\theta$ by:
 8:         $\theta^{t+1} = \theta^t - \mu \frac{d L^t}{d \theta^t} = \theta^t - \mu \frac{d L_s^t}{d \theta^t}$
 9:     Update the parameters $W$ by:
10:         $W^{t+1} = W^t - \mu \frac{d L^t}{d W^t} = W^t - \mu \sum_i^M \frac{d L^t}{d F(x_i)^t} \frac{d F(x_i)^t}{d W^t}$
11:     Update $c_k$ for the $k$-th class center: $c_k^{t+1} = c_k^t - \alpha \, \Delta c_k$
12:     $t = t + 1$
13: end while
     End of the algorithm: the CNN parameters $W$, the softmax loss parameters $\theta$.
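A condensed PyTorch sketch of Algorithm 1 is shown below; it reuses the VariantLoss and update_centers sketches from earlier and assumes a model whose forward pass returns both the logits and the deep features of the last fully connected layer. The weight $\lambda = 0.001$ follows Section 4.1, while the learning rates and iteration count are placeholders.

```python
import torch
import torch.nn.functional as F

def train(model, loader, num_classes, feat_dim, iterations=10_000,
          lr=0.01, alpha=0.5, lam=1e-3, lambda1=0.4, lambda2=0.6):
    """Sketch of Algorithm 1; device handling and logging are omitted."""
    v_loss = VariantLoss(num_classes, feat_dim, lambda1, lambda2)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)     # updates W and theta

    batches = iter(loader)
    for t in range(iterations):                                # while t <= T do
        try:
            images, labels = next(batches)
        except StopIteration:                                  # restart the data loader
            batches = iter(loader)
            images, labels = next(batches)

        logits, features = model(images)                       # deep features F(x_i)
        loss = F.cross_entropy(logits, labels) + lam * v_loss(features, labels)   # Eq. (2)

        optimizer.zero_grad()
        loss.backward()                                        # gradients w.r.t. W and theta
        optimizer.step()

        # Explicit class-center update of Eq. (4); the centers are not in the optimizer.
        v_loss.centers.copy_(update_centers(v_loss.centers, features, labels,
                                            alpha, lambda1, lambda2))
    return model
```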

4. Experiments

4.1. Experimental Setup

The performance of the proposed method was evaluated based on four benchmark facial expression databases: three from a laboratory environment, namely, Cohn-Kanade Plus (CK+) [29], Oulu-CASIA [30], and MMI [31]; and one from a wild environment, FER2013 [32]. A 10-fold cross-validation strategy was employed for model evaluation, especially focusing on scenarios with small and imbalanced datasets, such as CK+, MMI, and Oulu-CASIA. The amount of data required for training depends on several factors, such as the task's complexity, the data's diversity, the desired output, the data quality, and the deep model architecture. In this study, each of these databases was strictly divided into a 90% training set and a 10% testing set. FER2013, in contrast, is a large-scale dataset, so the training and evaluation processes were conducted on its provided splits. To prevent overfitting, we carefully chose the appropriate weights based on the learning process of the model to achieve a satisfactory performance. Several sample images derived from these databases are illustrated in Figure 4. The details of the databases and the number of images for each emotion are presented in Table 2.
To minimize the variations in the face scale and in-plane rotation, the face was detected and aligned from the original database using the OpenCV library with Haar-cascade detection [46]. The aligned facial images were resized to 64 × 64 pixels. Moreover, intensity equalization was used to enhance the contrast of the facial images. A data augmentation technique was used to overcome the restricted number of training images in the FER problem. Specifically, each facial image was flipped, and both the original and flipped images were rotated by −15, −10, −5, 5, 10, and 15°. The training databases were thus augmented 14 times using the original, flipped, six-angle, and six-angle-flipped images. The rotated facial images are shown in Figure 5.
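The preprocessing and augmentation described above could be sketched with OpenCV roughly as follows; the specific Haar-cascade file, detector settings, and the use of histogram equalization for "intensity equalization" are assumptions.

```python
import cv2
import numpy as np

# Frontal-face Haar cascade shipped with OpenCV (assumed; the paper does not name the file).
detector = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def preprocess(gray: np.ndarray):
    """Detect the face, crop it, resize to 64x64, and equalize the intensity."""
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    face = cv2.resize(gray[y:y + h, x:x + w], (64, 64))
    return cv2.equalizeHist(face)

def augment(face: np.ndarray) -> list:
    """Produce the 14 variants: original, flipped, and both rotated by +/-5, +/-10, +/-15 degrees."""
    flipped = cv2.flip(face, 1)
    variants = [face, flipped]
    center = (face.shape[1] / 2, face.shape[0] / 2)
    for angle in (-15, -10, -5, 5, 10, 15):
        rot = cv2.getRotationMatrix2D(center, angle, 1.0)
        for img in (face, flipped):
            variants.append(cv2.warpAffine(img, rot, (face.shape[1], face.shape[0])))
    return variants
```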
The proposed loss function was compared with softmax, center [28], range [33], and marginal losses [34] using the same CNN architectures to demonstrate the effectiveness of the proposed loss function. Accuracy is a crucial quantitative metric to evaluate the performance of the proposed method, which can be calculated as follows:
$$ \mathrm{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}. $$
The experiment was conducted in a subject-independent scenario. The CNN architectures were processed with 64 images in each batch. The training was performed using the standard SGD technique to optimize the loss functions. The hyper-parameter λ was used to balance the softmax and variant losses. λ 1 and λ 2 were utilized to balance among these losses in the variant loss, and α controlled the learning rate of the class center c k . All of these factors affect the performance of our model. In this experiment, the values λ = 0.001, λ 1 = 0.4, λ 2 = 0.6, ϵ 1 = ϵ 2 = 0.001 were empirically selected for the proposed loss. For the center, marginal, and range losses, λ was set to 0.001. The detailed specifications of the implemented environment are shown in Table 3.
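As a small illustration, the accuracy defined above and the row-normalized confusion matrices reported in the next subsection could be computed as follows (a sketch with assumed array conventions):

```python
import numpy as np

def evaluate(y_true: np.ndarray, y_pred: np.ndarray, num_classes: int):
    """Return the overall accuracy and a confusion matrix in percent (rows = ground truth)."""
    accuracy = float((y_true == y_pred).mean())
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    cm_percent = 100.0 * cm / np.maximum(cm.sum(axis=1, keepdims=True), 1)
    return accuracy, cm_percent
```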

4.2. Experimental Results

(1) Results on Cohn-Kanade Plus (CK+) database: The CK+ is a representative laboratory-controlled database for FER. It comprises 593 image sequences collected from 123 participants. A total of 327 of these image sequences have one of seven emotion labels: anger, contempt, disgust, fear, happiness, sadness, and surprise, from 118 subjects. Each image sequence starts with a neutral face and ends with the peak emotion. To collect additional data, the last three frames of each sequence were collected and associated with the provided labels. Therefore, a database containing 981 experimental images was constructed. The images were primarily grayscale and digitized to a 640 × 490 or 640 × 480 resolution.
The average recognition accuracy of the methods based on the loss functions and CNN architectures is listed in Table 4. The accuracy of the proposed loss function was superior to that of the others for all six CNN architectures. For the same loss functions, the accuracy of ResNet was the highest, followed by those of MobileNetV3, ResNeSt, InceptionNet, AlexNet, and DenseNet. Overall, the proposed loss produced an average recognition accuracy of 94.89% for the seven expressions using ResNet.
Table 5 presents the confusion matrix [47] of the ResNet optimized using the proposed loss function. The accuracy of the contempt, disgust, happiness, and surprise labels was high. Notably, the happiness percentage was the highest at 99.5%, followed closely by surprise, disgust, and contempt at 98.4%, 97.7%, and 93.4%, respectively. The proportions of the anger, fear, and sadness labels were inferior to these emotions because of their visual similarity.
A receiver operating characteristic (ROC) curve [48] and the corresponding area under the curve (AUC) for all expression recognition performances are illustrated in Figure 6. A higher AUC signifies an improved ability of the model to differentiate between the various classes. The AUC values of the disgust, happiness, and surprise labels reach 100%. The other classes also achieve relatively high values of 97%, 94%, 89%, and 86% for anger, sadness, fear, and contempt, respectively.
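A one-vs-rest ROC/AUC computation of the kind plotted in Figure 6 could be obtained with scikit-learn along the following lines; the plotting itself is omitted, and the per-class scores are assumed to be softmax outputs.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve
from sklearn.preprocessing import label_binarize

def per_class_auc(y_true: np.ndarray, y_score: np.ndarray, num_classes: int) -> dict:
    """One-vs-rest AUC per class; y_score has shape (n_samples, num_classes)."""
    y_bin = label_binarize(y_true, classes=list(range(num_classes)))
    aucs = {}
    for k in range(num_classes):
        fpr, tpr, _ = roc_curve(y_bin[:, k], y_score[:, k])
        aucs[k] = auc(fpr, tpr)
    return aucs
```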
(2) Results on Oulu-CASIA database: The Oulu-CASIA database includes 2880 image sequences obtained from 80 participants using a visible light (VIS) imaging system under normal illumination conditions. Six emotion labels were assigned to each image sequence: anger, disgust, fear, happiness, sadness, and surprise. Like the Cohn-Kanade Plus database, the image sequence started with a neutral face and ended with the peak emotion. For each image sequence, the last three frames were collected as the peak frames of the labeled expression. The imaging hardware was operated at 25 fps with an image resolution of 320 × 240 pixels.
The average recognition accuracy of the methods is listed in Table 6. The performance of the proposed loss function compared favorably with that of the previous ones. Specifically, the proposed loss function achieved an average recognition accuracy of 77.61% for the six expressions using the ResNet architecture.
Table 7 presents the confusion matrix of ResNet trained with the proposed loss function. The accuracy of the happiness and surprise labels was high, with the former achieving 92.1% and the latter 84.0%. The accuracy for anger, disgust, fear, and sadness was lower, at 66.2%, 70.5%, 76.3%, and 76.5%, respectively.
Figure 7 shows the receiver operating characteristic (ROC) curves, which verify the performance of the recognition model. The AUC for all emotional labels was relatively high. Among them, the happiness, surprise, and sadness classes show AUCs over 90%, followed by disgust, fear, and anger at 89%, 86%, and 85%, respectively.
(3) Results on MMI database: The laboratory-controlled MMI database comprises 312 image sequences collected from 30 participants. A total of 213 image sequences were labeled with six facial expressions: anger, disgust, fear, happiness, sadness, and surprise. Moreover, 208 sequences from 30 participants were captured in frontal view. The spatial resolution was 720 × 576 pixels, and the videos were recorded at 24 fps. Unlike the Cohn-Kanade Plus and Oulu-CASIA databases, the MMI database features image sequences labeled by the onset-apex. Therefore, the sequences started with a neutral expression, peaked near the middle, and returned to a neutral expression. The location of the peak expression frame was not provided. Furthermore, the MMI database presented challenging conditions, particularly in the case of large interpersonal variations. Three middle frames were chosen as the peak expression frames in each image sequence to conduct a subject-independent cross-validation scenario.
Table 8 lists the average recognition accuracy of the methods. Our loss function outperformed all the other loss functions by a certain margin. Specifically, the proposed loss function achieved average recognition accuracy of 67.43% for the six expressions using the MobileNetV3 architecture.
Table 9 presents the percentages in the confusion matrix of the MobileNetV3 optimized with the proposed loss function. The accuracy for all emotions was under 80.0%, except for happiness and surprise, which obtained 89.7% and 81.3%, respectively. This may be due to the number of images in each class. An instance of this is fear, which had the fewest labels and whose accuracy was a low 31.0%. Similar results were also confirmed for the accuracy of anger, disgust, and sadness.
ROC curves and the corresponding AUC for all facial expression recognition performances are presented in Figure 8. The values of the fear and anger classes are the lowest, at 69% and 79%, respectively. The values for sadness, disgust, surprise, and happiness are higher, at 83%, 88%, 89%, and 98%, respectively.
(4) Results on FER2013 database: FER2013 is a large-scale, unconstrained database automatically collected by the Google image search API. It includes 35,887 images with a relatively low resolution of 48 × 48 pixels, which are labeled with one of seven emotion labels: anger, disgust, fear, happiness, sadness, surprise, and neutral. The training set comprises 28,709 examples. The public test set consists of 3589 examples; the remaining 3589 images are used as a private test set.
Table 10 lists the average recognition accuracy of all the methods. The accuracy of the proposed loss function exceeds that of the others for all CNN architectures except AlexNet. The proposed loss function achieved a peak average recognition accuracy of 61.05% for the seven expressions using the ResNeSt architecture.
The confusion matrix of ResNeSt, which was trained with the proposed loss function, is presented in Table 11. The happiness percentage was highest at 80.7%, followed by surprise at 77.4%. The others obtained relatively low prediction ratios.
Figure 9 depicts the ROC curves, where the AUC for all emotions was over 70%, except for sadness at 69%. Among the expression classes, surprise has the highest AUC at 92%, followed by happiness, disgust, anger, fear, and neutral at 88%, 82%, 74%, 72%, and 72%, respectively.

4.3. Training Time

The training time is essential for evaluating the computational complexity of deep learning networks with specific loss functions. This section compares the training time of the auxiliary CNN architectures with the existing and proposed loss functions. Notably, all loss functions were trained on a single GPU. Depending on the dataset and network architecture, the number of iterations was empirically set to achieve optimal convergence with the corresponding loss function. When the data have been pre-processed, we start measuring the training time T = T1 − T0 with the beginning time T0 at the start of the first iteration and ending time T1 at the finish of the final iteration. As presented in Table 12, the softmax loss trained the fastest because it only uses one term in the mathematical function, followed closely by the center and proposed loss functions. Furthermore, the range and marginal loss functions required longer training times among the compared methods because their complex mathematical functions produced a time-consuming backpropagation process. In summary, only softmax and center loss were marginally faster than the proposed method. However, the proposed method achieved superior performance compared with these loss functions. Therefore, the proposed method is computationally efficient and meets the practical requirements.
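The timing protocol described above (T = T1 − T0 measured around the iteration loop only, after preprocessing) amounts to something like the following sketch.

```python
import time

def timed_training(train_fn, *args, **kwargs) -> float:
    """Measure only the training iterations, excluding data preprocessing."""
    t0 = time.perf_counter()      # start of the first iteration
    train_fn(*args, **kwargs)     # e.g., the train() sketch from Section 3
    t1 = time.perf_counter()      # end of the final iteration
    return t1 - t0
```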
To summarize, the computational cost of the loss function in deep learning is critical. The loss function is used for evaluation during training, so a computationally expensive loss function slows down the training process and can cause bottlenecks, especially for large datasets. In addition, designing the loss function depends on the purpose of the output. Therefore, a good trade-off between computational cost and accuracy is desired.

5. Conclusions

Although a loss function can drive network learning, it has received little attention as a means of promoting facial expression recognition (FER) performance. This study presents a new loss function that allows simultaneous consideration of inter- and intra-class variations and can be applied to CNN architectures for FER. More specifically, this loss function minimizes the distance between the deep features and their corresponding class centers as well as maximizes the distances of deep features with their non-corresponding class centers and the distances between different class centers. In addition, the proposed loss function improves the testing accuracy on the benchmark FER databases compared with several other loss functions. Overall, this study demonstrates that choosing an optimal loss function strongly affects the performance of deep learning networks, even when their architecture is unchanged. While the proposed loss function achieved impressive performance, it has not completely solved the imbalanced-data problem. To overcome this issue, we plan to apply resampling methods, such as undersampling the majority classes and oversampling the minority classes, and to use cost-sensitive learning to focus on the minority classes. In addition, we would like to extend the experiments to facial expression recognition in real-time conditions with a wider variety of emotions (e.g., embarrassment, adoration, nostalgia, satisfaction, pride, etc.). Currently, the proposed loss function applied to facial expression recognition for masked faces is under investigation and is expected to achieve promising performance.

Author Contributions

Conceptualization, methodology, T.-D.P.; resources, investigation, analysis, writing-original draft preparation, M.-T.D.; software, Q.-T.H.; validation, project administration, S.L.; supervision, writing-review and editing, M.-C.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Korea Institute for Advancement of Technology (KIAT) grant funded by the Korean Government, Ministry of Trade, Industry and Energy (MOTIE) (HRD Program for Industrial Innovation) under Grant P0017011; in part by the Industrial Technology Challenge Track of MOTIE/Korea Evaluation Institute of Industrial Technology (KEIT) under Grant 20012624; in part by the Research and Development Program of MOTIE; and in part by KEIT under Grant RS-2023-00232192.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Jourabloo, A.; De la Torre, F.; Saragih, J.; Wei, S.E.; Lombardi, S.; Wang, T.L.; Belko, D.; Trimble, A.; Badino, H. Robust egocentric photo-realistic facial expression transfer for virtual reality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 20323–20332. [Google Scholar]
  2. Putro, M.D.; Nguyen, D.L.; Jo, K.H. A Fast CPU Real-Time Facial Expression Detector Using Sequential Attention Network for Human–Robot Interaction. IEEE Trans. Ind. Inf. 2022, 18, 7665–7674. [Google Scholar] [CrossRef]
  3. Xiao, H.; Li, W.; Zeng, G.; Wu, Y.; Xue, J.; Zhang, J.; Li, C.; Guo, G. On-road driver emotion recognition using facial expression. Appl. Sci. 2022, 12, 807. [Google Scholar] [CrossRef]
  4. Farkhod, A.; Abdusalomov, A.B.; Mukhiddinov, M.; Cho, Y.I. Development of Real-Time Landmark-Based Emotion Recognition CNN for Masked Faces. Sensors 2022, 22, 8704. [Google Scholar] [CrossRef] [PubMed]
  5. Zhao, G.; Pietikainen, M. Dynamic texture recognition using local binary patterns with an application to facial expressions. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 915–928. [Google Scholar] [CrossRef] [PubMed]
  6. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 20–25 June 2005; IEEE: New York, NY, USA, 2005; Volume 1, pp. 886–893. [Google Scholar]
  7. Niese, R.; Al-Hamadi, A.; Farag, A.; Neumann, H.; Michaelis, B. Facial expression recognition based on geometric and optical flow features in colour image sequences. IET Comput. Vis. 2012, 6, 79–89. [Google Scholar] [CrossRef]
  8. Moghaddam, B.; Jebara, T.; Pentland, A. Bayesian face recognition. Pattern Recognit. 2000, 33, 1771–1782. [Google Scholar] [CrossRef]
  9. Liu, J.; Zhang, L.; Chen, X.; Niu, J. Facial landmark automatic identification from three dimensional (3D) data by using Hidden Markov Model (HMM). Int. J. Ind. Ergon. 2017, 57, 10–22. [Google Scholar] [CrossRef]
  10. Chen, L.; Li, M.; Su, W.; Wu, M.; Hirota, K.; Pedrycz, W. Adaptive feature selection-based AdaBoost-KNN with direct optimization for dynamic emotion recognition in human–robot interaction. IEEE Trans. Emerg. Top. Comput. Intell. 2019, 5, 205–213. [Google Scholar] [CrossRef]
  11. Kotsia, I.; Pitas, I. Facial expression recognition in image sequences using geometric deformation features and support vector machines. IEEE Trans. Image Process. 2006, 16, 172–187. [Google Scholar] [CrossRef]
  12. Li, Z.; Liu, F.; Yang, W.; Peng, S.; Zhou, J. A Survey of Convolutional Neural Networks: Analysis, Applications, and Prospects. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 6999–7019. [Google Scholar] [CrossRef]
  13. Duong, M.T.; Hong, M.C. EBSD-Net: Enhancing Brightness and Suppressing Degradation for Low-light Color Image using Deep Networks. In Proceedings of the IEEE International Conference on Consumer Electronics-Asia (ICCE-Asia), Yeosu, Republic of Korea, 26–28 October 2022; pp. 1–4. [Google Scholar]
  14. Hoang, H.A.; Yoo, M. 3ONet: 3D Detector for Occluded Object under Obstructed Conditions. IEEE Sens. J. 2023, 23, 18879–18892. [Google Scholar] [CrossRef]
  15. Karnati, M.; Seal, A.; Bhattacharjee, D.; Yazidi, A.; Krejcar, O. Understanding deep learning techniques for recognition of human emotions using facial expressions: A comprehensive survey. IEEE Trans. Instrum. Meas. 2023, 72, 1–31. [Google Scholar] [CrossRef]
  16. Villanueva, M.G.; Zavala, S.R. Deep neural network architecture: Application for facial expression recognition. IEEE Latin Am. Trans. 2020, 18, 1311–1319. [Google Scholar] [CrossRef]
  17. Ge, H.; Zhu, Z.; Dai, Y.; Wang, B.; Wu, X. Facial expression recognition based on deep learning. Comput. Methods Progr. Biomed. 2022, 215, 106621. [Google Scholar] [CrossRef] [PubMed]
  18. Lee, D.H.; Yoo, J.H. CNN Learning Strategy for Recognizing Facial Expressions. IEEE Access 2023, 11, 70865–70872. [Google Scholar] [CrossRef]
  19. Wu, B.F.; Lin, C.H. Adaptive feature mapping for customizing deep learning based facial expression recognition model. IEEE Access 2018, 6, 12451–12461. [Google Scholar] [CrossRef]
  20. Li, S.; Deng, W. Deep facial expression recognition: A survey. IEEE Trans. Affect. Comput. 2020, 13, 1195–1215. [Google Scholar] [CrossRef]
  21. Akhand, M.; Roy, S.; Siddique, N.; Kamal, M.A.S.; Shimamura, T. Facial emotion recognition using transfer learning in the deep CNN. Electronics 2021, 10, 1036. [Google Scholar] [CrossRef]
  22. Renda, A.; Barsacchi, M.; Bechini, A.; Marcelloni, F. Comparing ensemble strategies for deep learning: An application to facial expression recognition. Expert Syst. Appl. 2019, 136, 1–11. [Google Scholar] [CrossRef]
  23. Liu, C.; Hirota, K.; Ma, J.; Jia, Z.; Dai, Y. Facial expression recognition using hybrid features of pixel and geometry. IEEE Access 2021, 9, 18876–18889. [Google Scholar] [CrossRef]
  24. Liu, W.; Wen, Y.; Yu, Z.; Yang, M. Large-margin softmax loss for convolutional neural networks. Proc. Int. Conf. Mach. Learn. 2016, 2, 507–516. [Google Scholar]
  25. Wang, F.; Cheng, J.; Liu, W.; Liu, H. Additive margin softmax for face verification. IEEE Signal Process. Lett. 2018, 25, 926–930. [Google Scholar] [CrossRef]
  26. Sun, Y.; Chen, Y.; Wang, X.; Tang, X. Deep learning face representation by joint identification-verification. In Advances in Neural Information Processing Systems; Curran: Red Hook, NY, USA, 2014; Volume 27. [Google Scholar]
  27. Schroff, F.; Kalenichenko, D.; Philbin, J. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823. [Google Scholar]
  28. Wen, Y.; Zhang, K.; Li, Z.; Qiao, Y. A discriminative feature learning approach for deep face recognition. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer International Publishing: Cham, Switzerland, 2016; pp. 499–515. [Google Scholar]
  29. Lucey, P.; Cohn, J.F.; Kanade, T.; Saragih, J.; Ambadar, Z.; Matthews, I. The extended cohn-kanade dataset (ck+): A complete dataset for action unit and emotion-specified expression. In Proceedings of the IEEE Computer Society Conference Computer Vision and Pattern Recognition Workshops, San Francisco, CA, USA, 13–18 June 2010; IEEE: New York, NY, USA, 2010; pp. 94–101. [Google Scholar]
  30. Zhao, G.; Huang, X.; Taini, M.; Li, S.Z.; PietikäInen, M. Facial expression recognition from near-infrared videos. Image Vis. Comput. 2011, 29, 607–619. [Google Scholar] [CrossRef]
  31. Pantic, M.; Valstar, M.; Rademaker, R.; Maat, L. Web-based database for facial expression analysis. In Proceedings of the IEEE International Conference Multimedia Expo, Amsterdam, The Netherlands, 6 July 2005; IEEE: New York, NY, USA, 2005; pp. 317–321. [Google Scholar]
  32. Goodfellow, I.J.; Erhan, D.; Carrier, P.L.; Courville, A.; Mirza, M.; Hamner, B.; Cukierski, W.; Tang, Y.; Thaler, D.; Lee, D.H.; et al. Challenges in representation learning: A report on three machine learning contests. In Proceedings of the International Conference Neural Information Processing (ICONIP 2013), Daegu, Republic of Korea, 3–7 November 2013; Part III 20. pp. 117–124. [Google Scholar]
  33. Zhang, X.; Fang, Z.; Wen, Y.; Li, Z.; Qiao, Y. Range loss for deep face recognition with long-tailed training data. In Proceedings of the IEEE/CVF International Conference Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5409–5418. [Google Scholar]
  34. Deng, J.; Zhou, Y.; Zafeiriou, S. Marginal loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 60–68. [Google Scholar]
  35. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; Volume 25, pp. 1097–1105. [Google Scholar]
  36. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  37. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  38. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  39. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  40. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  41. Tan, M.; Chen, B.; Pang, R.; Vasudevan, V.; Sandler, M.; Howard, A.; Le, Q.V. Mnasnet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2820–2828. [Google Scholar]
  42. Yang, T.J.; Howard, A.; Chen, B.; Zhang, X.; Go, A.; Sandler, M.; Sze, V.; Adam, H. Netadapt: Platform-aware neural network adaptation for mobile applications. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 285–300. [Google Scholar]
  43. Zhang, H.; Wu, C.; Zhang, Z.; Zhu, Y.; Lin, H.; Zhang, Z.; Sun, Y.; He, T.; Mueller, J.; Manmatha, R.; et al. Resnest: Split-attention networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, New Orleans, LA, USA, 19–20 June 2022; pp. 2736–2746. [Google Scholar]
  44. Duong, M.T.; Lee, S.; Hong, M.C. DMT-Net: Deep Multiple Networks for Low-light Image Enhancement Based on Retinex Model. IEEE Access 2023, 11, 132147–132161. [Google Scholar] [CrossRef]
  45. Robbins, H.; Monro, S. A Stochastic Approximation Method. Ann. Math. Stat. 1951, 22, 400–407. [Google Scholar] [CrossRef]
  46. Viola, P.; Jones, M. Rapid object detection using a boosted cascade of simple features. In Proceedings of the IEEE/CVF International Conference Computer Vision, Kauai, HI, USA, 8–14 December 2001; IEEE: New York, NY, USA, 2001; Volume 1, pp. 511–518. [Google Scholar]
  47. Susmaga, R. Confusion matrix visualization. In Proceedings of the Intelligent Information Processing and Web Mining, Zakopane, Poland, 17–20 May 2004; Springer: Berlin/Heidelberg, Germany, 2004; pp. 107–116. [Google Scholar]
  48. Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 2006, 27, 861–874. [Google Scholar] [CrossRef]
Figure 1. Pipeline of the FER system.
Figure 2. Visualization of the proposed loss for one batch of images in the Euclidean space. Supposing the three classes anger, happiness, and surprise in this batch, the proposed loss function aims to reduce the intra-distance $d_1$ and enhance the inter-distance $d_2$ and the class distance $d_3$ (best viewed in color).
Figure 3. Overall system CNNs using the proposed loss function.
Figure 4. Example face images from CK+ (top), Oulu-CASIA (center), and MMI (bottom) databases. The facial expressions from left to right convey anger, contempt, disgust, fear, happiness, sadness, and surprise. The contempt images of Oulu-CASIA and MMI are null.
Figure 5. Example rotated images from CK+ database. The facial expressions from left to right convey anger, contempt, disgust, fear, happiness, sadness, and surprise. The rotation degrees from top to bottom are −5, −10, −15, 5, 10, 15°.
Figure 6. Recognition performance portrayed as ROC curves and corresponding area under the curve (AUC) for all expression recognition performances with ResNet optimized with the proposed loss on the CK+ database.
Figure 7. Recognition performance portrayed as ROC curves and corresponding area under the curve (AUC) for all expression recognition performances with ResNet optimized with the proposed loss function on the Oulu-CASIA database.
Figure 8. Recognition performance portrayed as ROC curves and corresponding area under the curve (AUC) for all expression recognition performances with MobileNetV3 optimized with the proposed loss on the MMI database.
Figure 9. Recognition performance portrayed as ROC curves and corresponding area under the curve (AUC) for all expression recognition performances with ResNeSt optimized with the proposed loss function on the FER2013 database.
Table 1. The properties of previous loss functions in deep facial recognition.
Loss Functions   | Considers Intra-Class Variation | Considers Inter-Class Variation | Limitations
L-Softmax [24]   | No                              | Yes                             | The convergence is challenging
AM-Softmax [25]  | No                              | Yes                             | The hyperparameter selection is challenging
Contrastive [26] | Yes                             | Yes                             | The convergence is challenging
Triplet [27]     | Yes                             | Yes                             | The convergence is challenging
Center [28]      | Yes                             | No                              | Large memory storage is required
Range [33]       | Yes                             | Yes                             | The optimization strategy is challenging
Marginal [34]    | Yes                             | Yes                             | The optimization strategy is challenging
Table 2. Number of images for each emotion: anger (An), contempt (Co), disgust (Di), fear (Fe), happiness (Ha), sadness (Sa), surprise (Su), neutral (Ne).
        | An   | Co | Di  | Fe   | Ha   | Sa   | Su   | Ne   | All
CK+     | 135  | 54 | 177 | 75   | 207  | 84   | 249  | -    | 981
Oulu    | 240  | -  | 240 | 240  | 240  | 240  | 240  | -    | 1440
MMI     | 99   | -  | 96  | 84   | 126  | 96   | 123  | -    | 624
FER2013 | 4953 | -  | 547 | 5121 | 8989 | 6077 | 4002 | 6198 | 35,887
Table 3. Configuration information of the experimental environment.
Experimental Environment | Configuration Parameters
CPU                      | Intel® Xeon® CPU E5-2620 v2, 48 GB RAM
GPU                      | NVIDIA GeForce RTX 3090
Operating system         | Ubuntu 22.04
Deep learning framework  | PyTorch 1.13.1
Programming language     | Python 3.10
Table 4. Performance comparison on the CK+ database in terms of the seven expressions.
Method   | AlexNet | InceptionNet | ResNet | DenseNet | MobileNetV3 | ResNeSt
Softmax  | 87.38   | 87.18        | 90.65  | 83.68    | 91.60       | 85.58
Center   | 88.08   | 87.88        | 92.46  | 83.38    | 87.78       | 85.98
Range    | 90.59   | 88.08        | 91.79  | 85.28    | 91.50       | 88.68
Marginal | 89.18   | 86.68        | 87.78  | 84.18    | 89.68       | 86.38
Proposed | 90.79   | 89.18        | 94.89  | 85.98    | 91.90       | 89.28
Table 5. Confusion matrix of ResNet optimized with the proposed loss on the CK+ database. The labels in the leftmost column and on top represent the ground truth and the prediction results, respectively.
   | An    | Co    | Di    | Fe    | Ha    | Sa    | Su
An | 86.2% | 1.4%  | 6.5%  | 0%    | 0%    | 5.1%  | 0.8%
Co | 3.6%  | 93.4% | 0%    | 1.5%  | 0%    | 1.5%  | 0%
Di | 1.7%  | 0%    | 97.7% | 0%    | 0.6%  | 0%    | 0%
Fe | 0%    | 2.6%  | 0%    | 87.2% | 7.6%  | 2.6%  | 0%
Ha | 0%    | 0%    | 0%    | 0.5%  | 99.5% | 0%    | 0%
Sa | 7.5%  | 1.1%  | 1.1%  | 0%    | 0%    | 90.3% | 0%
Su | 0.8%  | 0%    | 0%    | 0.4%  | 0%    | 0.4%  | 98.4%
Table 6. Performance comparison on the Oulu-CASIA database in terms of the six expressions.
Method   | AlexNet | InceptionNet | ResNet | DenseNet | MobileNetV3 | ResNeSt
Softmax  | 70.52   | 65.16        | 72.46  | 68.67    | 73.24       | 70.23
Center   | 71.95   | 64.09        | 74.96  | 69.09    | 74.89       | 67.23
Range    | 72.17   | 64.38        | 74.11  | 69.52    | 69.09       | 63.80
Marginal | 70.24   | 68.09        | 71.88  | 68.67    | 73.74       | 68.95
Proposed | 72.96   | 69.17        | 77.61  | 69.88    | 76.46       | 70.24
Table 7. Confusion matrix of ResNet optimized with the proposed loss function on the Oulu-CASIA database. The labels in the leftmost column and on top represent the ground truth and the prediction results, respectively.
   | An    | Di    | Fe    | Ha    | Sa    | Su
An | 66.2% | 13.0% | 6.9%  | 0%    | 13.9% | 0%
Di | 12.4% | 70.5% | 7.3%  | 2.6%  | 6.8%  | 0.4%
Fe | 5.8%  | 1.3%  | 76.3% | 5.4%  | 5.0%  | 16.3%
Ha | 0%    | 2.1%  | 5.8%  | 92.1% | 0%    | 0%
Sa | 12.4% | 3.8%  | 5.1%  | 1.7%  | 76.5% | 0.4%
Su | 1.4%  | 0%    | 11.9% | 2.7%  | 0%    | 84.0%
Table 8. Performance comparison on the MMI database in terms of the six expressions.
Method   | AlexNet | InceptionNet | ResNet | DenseNet | MobileNetV3 | ResNeSt
Softmax  | 57.76   | 53.92        | 61.59  | 60.52    | 61.44       | 54.07
Center   | 58.98   | 58.52        | 61.92  | 59.29    | 64.36       | 57.29
Range    | 62.67   | 61.75        | 61.13  | 54.68    | 64.20       | 55.76
Marginal | 59.44   | 55.14        | 57.62  | 57.61    | 64.66       | 53.00
Proposed | 63.13   | 63.74        | 65.89  | 61.13    | 67.43       | 58.83
Table 9. Confusion matrix of MobileNetV3 optimized with the proposed loss function on the MMI database. The labels in the leftmost column and on top represent the ground truth and prediction results, respectively.
   | An    | Di    | Fe    | Ha    | Sa    | Su
An | 57.1% | 14.3% | 11.4% | 3.8%  | 12.4% | 1.0%
Di | 13.0% | 72.2% | 2.8%  | 4.6%  | 4.6%  | 2.8%
Fe | 11.5% | 5.8%  | 31.0% | 9.2%  | 11.5% | 31.0%
Ha | 0%    | 6.3%  | 1.6%  | 89.7% | 0%    | 2.4%
Sa | 14.7% | 13.8% | 8.8%  | 0%    | 59.8% | 2.9%
Su | 4.1%  | 0.8%  | 7.3%  | 1.6%  | 4.9%  | 81.3%
Table 10. Performance comparison on the FER2013 database in terms of the seven expressions.
Method   | AlexNet | InceptionNet | ResNet | DenseNet | MobileNetV3 | ResNeSt
Softmax  | 59.77   | 55.92        | 59.21  | 59.15    | 56.33       | 60.85
Center   | 58.48   | 57.42        | 56.70  | 59.82    | 50.43       | 60.93
Range    | 58.65   | 56.22        | 48.37  | 59.59    | 52.99       | 60.96
Marginal | 59.04   | 57.12        | 57.51  | 58.71    | 56.56       | 59.76
Proposed | 58.51   | 57.81        | 59.65  | 60.46    | 58.29       | 61.05
Table 11. Confusion matrix of ResNeSt optimized with the proposed loss function on the FER2013 database. The labels in the leftmost column and on top represent the ground truth and prediction results, respectively.
   | An    | Di    | Fe    | Ha    | Sa    | Su    | Ne
An | 55.7% | 0.4%  | 8.6%  | 6.6%  | 14.6% | 3.2%  | 10.9%
Di | 23.2% | 46.4% | 7.2%  | 1.8%  | 10.7% | 3.6%  | 7.1%
Fe | 8.9%  | 0.2%  | 42.5% | 4.2%  | 23.4% | 8.3%  | 12.5%
Ha | 3.6%  | 0%    | 1.5%  | 80.7% | 3.6%  | 2.8%  | 7.8%
Sa | 13.6% | 0.5%  | 11.8% | 6.9%  | 48.5% | 2.8%  | 15.9%
Su | 4.3%  | 0%    | 7.9%  | 3.9%  | 2.4%  | 77.4% | 4.1%
Ne | 9.9%  | 0.2%  | 7.1%  | 8.7%  | 16.8% | 2.3%  | 55.0%
Table 12. Training time (s) comparison of the auxiliary CNN architectures with different loss functions.
Methods    | AlexNet: CK+ | Oulu-CASIA | MMI    | FER2013 | InceptionNet: CK+ | Oulu-CASIA | MMI    | FER2013 | DenseNet: CK+ | Oulu-CASIA | MMI    | FER2013
Softmax    | 146          | 142        | 116    | 150     | 499               | 497        | 479    | 1529    | 393           | 504        | 394    | 2036
Center     | 150          | 143        | 121    | 157     | 502               | 493        | 494    | 1621    | 418           | 494        | 400    | 2050
Range      | 2871         | 2937       | 2219   | 2619    | 3432              | 3265       | 3219   | 9238    | 2679          | 3231       | 2460   | 12,113
Marginal   | 15,424       | 15,552     | 12,149 | 15,222  | 15,625            | 15,621     | 15,513 | 46,090  | 12,443        | 15,717     | 12,372 | 62,438
Proposed   | 157          | 156        | 127    | 163     | 520               | 518        | 522    | 1677    | 410           | 503        | 412    | 2177
Iterations | 10,000       | 10,000     | 8000   | 10,000  | 10,000            | 10,000     | 10,000 | 30,000  | 10,000        | 10,000     | 8000   | 10,000

Methods    | ResNet: CK+ | Oulu-CASIA | MMI    | FER2013 | MobileNetV3: CK+ | Oulu-CASIA | MMI    | FER2013 | ResNeSt: CK+ | Oulu-CASIA | MMI    | FER2013
Softmax    | 231         | 252        | 254    | 568     | 431              | 1144       | 1066   | 3215    | 2008         | 1619       | 2110   | 4108
Center     | 235         | 265        | 250    | 600     | 373              | 1055       | 1195   | 2415    | 2466         | 1779       | 2200   | 10,969
Range      | 2361        | 2533       | 2443   | 5774    | 5247             | 6445       | 6176   | 14,283  | 7368         | 5936       | 7856   | 64,143
Marginal   | 11,648      | 12,632     | 11,648 | 30,686  | 12,164           | 34,076     | 33,580 | 83,618  | 37,944       | 29,172     | 35,354 | 197,333
Proposed   | 243         | 271        | 266    | 601     | 456              | 1072       | 1108   | 2943    | 2570         | 1780       | 2273   | 14,004
Iterations | 7500        | 8000       | 8000   | 20,000  | 7500             | 20,000     | 20,000 | 50,000  | 20,000       | 15,000     | 20,000 | 100,000
