#### **1. Introduction**

Facial expressions are undoubtedly a dominant, natural, and effective channel used by people to convey their emotions and intentions during communication. Over the last few decades, automatic facial expression analysis has attracted significant attention, and has become one of the most challenging problems in the computer vision and artificial intelligence fields. Numerous studies have been conducted on developing reliable automated facial expression recognition (FER) systems for use over a wide range of applications, such as human–computer interaction, social robotics, medical treatment, virtual reality, augmented reality, and games [1–4]. In this work, we constructed deep convolutional neural network (CNN)-based FER models to recognize eight common facial expressions (happy, sad, surprise, fear, contempt, anger, disgust, and neutral) from the AffectNet database of facial expression, valence, and arousal computing in the wild [5]. This was motivated by an upcoming challenge on AffectNet's website [6].

The FER model is usually composed of three main stages: face detection, feature extraction, and emotion classification. First, the face and its components (e.g., eyes, mouth, nose, and eyebrows) are detected from images or video sequences. Then, features that are the most effective at distinguishing one expression from another are extracted from the face region. Finally, a classifier is constructed, given the extracted feature set for each target facial expression. The literature is rich with handcrafted face detection and feature extraction methods for FER that have achieved satisfactory results in laboratory-controlled settings [7–11]. However, these traditional methods have been reported to be incapable of handling the great diversity of expression-unrelated factors (e.g., subtle facial appearances, head poses, illumination intensity, and occlusions) that arise in FER tasks for in-the-wild settings [12,13].

Recently, the success of convolutional neural networks in both computer vision and pattern recognition has promoted a transition in FER from handcrafted feature-learning methods to deep learning technologies. A deep learning-based FER system commonly uses a CNN model to extract and learn high-level features directly from input images. Then, an output layer (which usually uses softmax as an activation function) is attached on top of the CNN model to distinguish the emotion to be detected. This allows for faster emotion recognition systems with higher accuracy in challenging, real-world environments [14–16]. In this paper, we make use of the powerful feature-learning capacity of deep CNNs to build FER models. However, a few significant problems arise when applying deep learning to FER systems.
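
As a minimal illustration of the softmax output layer described above, the following NumPy sketch converts eight class logits (one per expression; the logit values are made up for illustration) into a probability distribution:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical logits for the eight expression classes
# (happy, sad, surprise, fear, contempt, anger, disgust, neutral)
logits = np.array([2.0, 0.1, -1.3, 0.4, -0.8, 0.0, -2.1, 1.5])
probs = softmax(logits)
pred = int(np.argmax(probs))  # index of the predicted emotion
```

In a real FER model, `logits` would be produced by the final fully connected layer of the CNN; here they are hard-coded to keep the sketch self-contained.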

First, deep learning-based FER models require a large amount of data for training to acquire suitable values for model parameters. Directly training the FER model on small-scale datasets is prone to overfitting [17], which leads the model to be less generalized and incapable of handling FER tasks in real-world environments. Although a great effort has been made to collect facial expression training datasets, large-scale, annotated, facial expression databases are still limited [13]. Therefore, overfitting caused by a shortage of data remains a challenging issue for most FER systems.

Second, imbalances in the distribution of facial expression samples in real-world FER datasets may degrade the overall performance of the system [18]. Due to the nature of emotions, the number of collected facial images for the major classes (e.g., happiness, sadness, and anger) is much larger than for the minor classes (e.g., contempt, disgust, and fear). In the AffectNet dataset, the happy category comprises about 47% of all the images, whereas the contempt category comprises only 1.2%. FER systems trained on an imbalanced dataset may perform well on dominant emotions but poorly on the under-represented ones. Usually, the weighted-softmax loss approach [5] is used to handle this problem by weighting the loss term for each emotion class based on its relative proportion in the training set. However, this weighted-loss approach is based on the softmax loss function, which is reported to simply force features of different classes to remain apart without paying attention to intra-class compactness. One effective strategy to address this limitation of softmax loss is to train the neural network with an auxiliary loss. For instance, triplet loss [19] and center loss [20] introduce multi-loss learning to enhance the discriminating ability of CNN models. Although these loss functions do boost the discriminative ability of the conventional softmax loss, they usually come with limitations. Triplet loss requires a comprehensive process of choosing image pairs or triplet samples, which is impractical and extremely time-consuming owing to the huge number of pairs and samples in the training phase. Center loss does not consider inter-class similarity, which may lead to poor performance by the FER system. In addition, none of these auxiliary loss functions is able to address data-imbalance problems.
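
The inverse-frequency weighting used by weighted-loss approaches can be sketched as follows. The class counts below are illustrative numbers mirroring the skew described above (happy ≈ 47% and contempt ≈ 1.2% of a hypothetical 10,000-image set), not the actual AffectNet counts:

```python
import numpy as np

# Illustrative per-class counts for a hypothetical 10,000-image training set:
# happy, sad, surprise, fear, contempt, anger, disgust, neutral
counts = np.array([4700, 800, 600, 300, 120, 900, 200, 2380], dtype=float)

# Inverse-frequency weights: w_c = N / (K * n_c), where N is the total number
# of samples and K the number of classes; rare classes get large weights.
weights = counts.sum() / (len(counts) * counts)
```

With this scheme, misclassifying a contempt sample costs roughly forty times as much as misclassifying a happy sample, which is the intuition behind the weighted-softmax loss.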

To address the first problem (the shortage of data), in this work, the transfer learning technique is applied to build the FER system. Transfer learning is a machine learning technique by which a model trained on one task is repurposed for another related task. It not only helps to handle the shortage of data but also speeds up training and improves the performance of the prediction model. In this paper, we take a transfer learning approach by employing two recent CNN architectures in two-stage, supervised pre-training and fine-tuning. Specifically, a squeeze-and-excitation network (SENet) model [21], which was pre-trained for the face identification task on the VGGFace2 database [22] from the Visual Geometry Group at Oxford University, was fine-tuned on the AffectNet dataset [5] to recognize the above-mentioned eight common facial expressions.

Tackling the second problem of imbalanced data distribution in existing FER datasets, we propose a new loss function, called the weighted-cluster loss, which integrates the advantages of the weighted-softmax approach and the auxiliary loss approach. First, the weighted-cluster loss learns a class center for each emotion, which simultaneously reduces intra-class variations and increases inter-class differences. Next, the proposed loss weights each emotion class's loss terms based on the class's relative proportion of the total number of samples in the training dataset. In other words, the weighted-cluster loss penalizes networks more for misclassifying samples from minor classes while penalizing them less for misclassifying samples from major classes. Furthermore, the training process is simple because the weighted-cluster loss does not require preselected sample pairs or triplets.
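
A rough sketch of this idea, under our reading of the description above (the exact formulation appears in Section 3; the function name, the `lam` trade-off parameter, and all values are illustrative):

```python
import numpy as np

def weighted_cluster_loss(features, logits, labels, centers, class_weights, lam=0.5):
    """Illustrative sketch: class-weighted softmax cross-entropy plus a
    class-weighted center (cluster) term, balanced by `lam`."""
    n = len(labels)
    # Stable log-softmax
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    w = class_weights[labels]  # per-sample weight w_{y_i}
    # Weighted cross-entropy: minor-class mistakes are penalized more
    ce = -(w * log_probs[np.arange(n), labels]).mean()
    # Weighted center term: pull each feature toward its class center
    center_term = (w * ((features - centers[labels]) ** 2).sum(axis=1)).mean()
    return ce + 0.5 * lam * center_term

# Toy example: two classes, 2-D features sitting exactly on their centers
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
labels = np.array([0, 1])
features = centers[labels].copy()
logits = np.array([[2.0, 0.0], [0.0, 2.0]])
loss = weighted_cluster_loss(features, logits, labels, centers, np.ones(2))
```

When the features coincide with their class centers, the center term vanishes and only the weighted cross-entropy remains, which shows how the two terms interact.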

Experiments were conducted to show the effectiveness of the proposed method. In addition to widely used metrics for classification (accuracy, F1-score [23], area under the receiver operating characteristic [ROC] curve [AUC] [24], and area under the precision-recall curve [AUC-PR] [25]), two measures of inter-annotator agreement (Cohen's kappa [26] and Krippendorff's alpha [27]) are used to evaluate the models. The experimental results with the AffectNet dataset [5] show that our transfer learning-based model with weighted-cluster loss outperforms other models that use either weighted-softmax loss or center loss.
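
As one example of the agreement measures listed above, Cohen's kappa corrects raw accuracy for the agreement expected by chance; a minimal NumPy implementation:

```python
import numpy as np

def cohens_kappa(y_true, y_pred):
    """kappa = (p_o - p_e) / (1 - p_e): observed agreement p_o,
    corrected by the chance agreement p_e of the two label marginals."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(np.concatenate([y_true, y_pred]))
    p_o = np.mean(y_true == y_pred)
    p_e = sum(np.mean(y_true == c) * np.mean(y_pred == c) for c in classes)
    return (p_o - p_e) / (1 - p_e)
```

Perfect agreement yields kappa = 1, while a classifier whose predictions match the true labels only as often as chance would yields kappa ≈ 0, which is why kappa is more informative than raw accuracy on skewed emotion data.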

In summary, the main contributions of this paper are listed as follows:

- We build a transfer learning-based FER model by fine-tuning a SENet [21] model, pre-trained for face identification on VGGFace2 [22], to recognize eight common facial expressions on the AffectNet dataset [5].
- We propose a new loss function, the weighted-cluster loss, which simultaneously reduces intra-class variations, increases inter-class differences, and accounts for the imbalanced distribution of emotion classes.
- We evaluate the proposed method on AffectNet using both classification metrics and inter-annotator agreement measures, showing that it outperforms models trained with weighted-softmax loss or center loss.

The rest of this manuscript is organized as follows. Section 2 summarizes the existing literature related to facial expression recognition. Then, our proposed method is presented in detail in Section 3. The experiments and a discussion of the results are presented in Section 4. Conclusions drawn from this work, in addition to possible future work, are discussed in Section 5.

#### **2. Related Works**

This section summarizes recent studies in the literature that are related to facial expression recognition, deep transfer learning techniques used to solve FER tasks, and recent successful loss functions for training deep models.

#### *2.1. Facial Expression Recognition Approaches*

Over the last few decades, several approaches have been proposed to build FER models. Traditional methods mostly detect the face region and extract the geometry, appearance, texture, or other highlighted facial characteristics using handcrafted features and shallow learning, such as Gabor wavelet coefficients [7], Haar features [8], histograms of local binary patterns (LBPs) [9], LBPs on three orthogonal planes (LBP-TOP) [10], histograms of oriented gradients (HOG) [11], non-negative matrix factorization (NMF) [28], scale-invariant feature transform (SIFT) descriptors [29], and sparse learning [30]. Overall, these methods can solve FER tasks in laboratory settings, where emotion images are produced in a controlled manner. However, with the introduction of relatively large real-world databases from emotion recognition competitions such as FER2013 [31] and Emotion Recognition in the Wild [32–36], FER tasks have seen a big transition from laboratory-controlled setups to more challenging real-world scenarios. Traditional methods face many difficulties in handling inter-class similarity and intra-class differences, and are reported to be incapable of addressing the great diversity of factors unrelated to facial expressions.
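
As a concrete example of the handcrafted features listed above, a basic 3×3 local binary pattern (LBP) histogram can be sketched as follows (a minimal version; practical FER pipelines use tuned radii, neighborhoods, and per-region histograms):

```python
import numpy as np

def lbp_histogram(img):
    """Basic 3x3 LBP: compare each pixel's 8 neighbors to the center,
    pack the comparisons into an 8-bit code, and histogram the codes."""
    h, w = img.shape
    c = img[1:-1, 1:-1]                       # interior pixels (centers)
    codes = np.zeros_like(c, dtype=np.uint8)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        # Shifted view of the image: the neighbor at (dy, dx) of each center
        neigh = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes |= (neigh >= c).astype(np.uint8) << np.uint8(bit)
    hist = np.bincount(codes.ravel(), minlength=256).astype(float)
    return hist / hist.sum()                  # normalized 256-bin histogram

# A flat image yields code 255 everywhere (all neighbors >= center)
flat = np.full((5, 5), 7, dtype=np.uint8)
hist = lbp_histogram(flat)
```

Such histograms, typically computed over a grid of face regions and concatenated, were common inputs to the shallow classifiers mentioned above.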

Recently, the development of deep learning and the increase in the number of powerful CNN architectures [37–39] in the computer vision field have implicitly promoted the use of deep learning methods when building FER models. For example, the winner of the 2013 International Conference on Machine Learning challenge [40] combined a deep CNN with a support vector machine (SVM) classifier to distinguish common emotions. However, despite the impressive feature-learning ability of deep learning, difficulties remain when it is applied to FER, such as the shortage of training data, inter-class similarity, intra-class differences, and the high imbalance of emotion classes in most existing real-world facial expression databases. This work takes advantage of deep models to extract robust facial features and uses them to recognize facial emotions. Table 1 compares our approach with existing studies. In this table, human resource refers to the amount of human labor needed to construct the feature learning/extraction model, computing resource refers to the hardware resources needed to operate the model, and computational complexity refers to the number of operations per pixel.


**Table 1.** Comparison between FER approaches.

#### *2.2. Transfer Learning for Facial Expression Recognition*

Studies in FER have suffered from the lack of data for training deep CNN models, which may result in overfitting. To work around this problem, transfer learning has been widely used for facial recognition tasks. In fact, the use of auxiliary data can help FER models obtain a high capacity without overfitting, thus improving the overall performance of the system. Usually, the weights of the CNN are initialized and pre-trained on additional data from other related tasks (e.g., object detection and face recognition) before being fine-tuned using the target dataset. Applying transfer learning to the FER task has consistently achieved better results compared to directly training the network on a small-scale FER dataset. Some popular datasets can be used as auxiliary data, such as ImageNet [41] for object recognition tasks and VGGFace from the Visual Geometry Group [42] for the face recognition task. In the work of Ly et al. [43], an Inception-ResNetV1 model pre-trained on VGGFace2 [22] and AffectNet [5] was used to develop a multi-modal 2D and 3D model for the real-world FER task. However, the amount of existing 3D data for FER is limited, which makes it difficult to construct a multi-modal FER model. Do et al. [44] used a ResNet-50 model pre-trained on VGGFace2 [22] as a feature extraction model. The model was then integrated with an LSTM [45] model to analyze facial expressions in video data.

It is worthwhile to note that, in preliminary experiments, Ngo and Yoon showed an improvement in recognition performance when using ImageNet data [41] as auxiliary data for building a transfer learning-based FER model [46]. The authors fine-tuned a ResNet-50 [47] model that was pre-trained with ImageNet data. However, the ImageNet dataset was developed for the object detection task, which may not be sufficiently related to the FER task. In this paper, to build the transfer learning-based FER model, we employ a more advanced CNN architecture (i.e., SENet [21]) and then fine-tune it for the FER task. Note that this transferred model is pre-trained with VGGFace2 data [22] for the face identification task, which is more related to the FER task than object detection tasks. This may help improve the performance of the transfer learning-based FER system. Furthermore, in [46], the authors simply fine-tuned their models using softmax loss and did not consider the imbalanced data problem, which degraded the recognition performance of the FER system. In this paper, we focus on handling the shortage of data as well as tackling the imbalanced data problem by applying weighted loss and auxiliary loss approaches.

#### *2.3. Data Re-Sampling and Augmentation*

Due to the nature of emotions, facial expression data collected in real-world settings are highly imbalanced in the number of samples per class (e.g., the number of images in the happy class is much greater than the number of images that show disgust). Using imbalanced data for training may degrade the performance of FER models [18]. Data re-sampling and generating samples using generative adversarial networks (GANs) [48] are usually considered as solutions to mitigate the imbalanced data problem. Mollahoseini et al. [5] used down-sampling and up-sampling methods to balance the distribution of data in the training set, which alleviates the imbalanced data problem. However, these two re-sampling methods simply randomly reduce the number of samples in major classes or duplicate samples in minor classes without actually collecting further data. Under-sampling may discard samples that could be important for the model, while over-sampling significantly increases the model training time [49]. Lai et al. [50] proposed a GAN that generates frontal face images from input non-frontal face images. This model can be employed to augment more data for training FER models. Nonetheless, the reliability of the new data needs to be carefully verified before being used to train FER models; otherwise, the data may degrade the performance of FER systems. Our solution is based on a weighted loss approach that tackles the imbalanced data problem without the need for a data re-sampling or augmentation step.
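
Random up-sampling of the kind described above can be sketched as follows (an illustrative version that duplicates minor-class indices until every class matches the largest one):

```python
import numpy as np

rng = np.random.default_rng(0)

def oversample(labels):
    """Random over-sampling: for each class, draw (with replacement) as many
    sample indices as the largest class has, so all classes end up equal."""
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    idx = [rng.choice(np.flatnonzero(labels == c), size=target, replace=True)
           for c in classes]
    return np.concatenate(idx)

# Toy label array with a 50/5/10 class split
labels = np.array([0] * 50 + [1] * 5 + [2] * 10)
balanced = labels[oversample(labels)]
```

Note that, as the text points out, no new information is added: the minor-class samples are merely repeated, and the enlarged index set lengthens training.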

#### *2.4. Weighted Loss and Auxiliary Loss*

Weighted-softmax loss gives weights to the loss terms of each emotion class based on its relative number of samples in the training set. In this way, the loss function heavily penalizes the FER model for misclassifying examples from minor classes, while lightly penalizing the model for misclassifying examples from major classes. Mollahoseini et al. [5] showed that the weighted-loss methods achieve the highest performance and outperform the re-sampling methods. However, since the weighted-softmax loss function is built on conventional softmax loss, it inherits the limitations of softmax loss. For example, weighted-softmax loss simply forces features of different classes to remain apart, without paying attention to intra-class compactness.
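
A minimal sketch of the weighted-softmax idea (plain NumPy cross-entropy with per-class weights; names and values are illustrative):

```python
import numpy as np

def weighted_softmax_loss(logits, labels, class_weights):
    """Class-weighted cross-entropy: each sample's log-loss is scaled by
    the weight of its true class, so minor-class errors cost more."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    n = len(labels)
    return -(class_weights[labels] * log_probs[np.arange(n), labels]).mean()

# Two samples, two classes, uninformative logits (uniform predictions)
logits = np.zeros((2, 2))
labels = np.array([0, 1])
plain = weighted_softmax_loss(logits, labels, np.ones(2))       # unweighted
skewed = weighted_softmax_loss(logits, labels, np.array([1.0, 3.0]))
```

With uniform predictions, each sample contributes log 2; tripling the weight of class 1 doubles the mean loss, showing how weights redistribute the penalty.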

To tackle the limitations of softmax loss, several auxiliary loss approaches have been proposed, and they can be used to improve the discriminative ability of FER models. Contrastive loss [51] feeds the CNN model a pair of training samples, forcing the features of same-class pairs to be as similar as possible while requiring a pairwise distance longer than a predefined margin if the input samples belong to different classes. Similar to contrastive loss, triplet loss [19] encourages a distance constraint between samples that come from different classes. Specifically, triplet loss requires one positive sample to be closer to a preselected anchor than one negative sample by a fixed margin. Based on triplet loss, two variations have been proposed to support softmax loss: (N + M)-tuples cluster loss [52] mitigates the difficulty of anchor selection and threshold validation, whereas exponential triplet-based loss [53] gives more weight to difficult samples during an update of network parameters. In spite of some success, these loss functions still need a careful sample pre-selection process, which becomes impractical when the number of training samples is relatively large. Recently, center loss [20] has been proposed for the face recognition task; it penalizes the distance between deep features and their corresponding class centers in order to reduce intra-class differences. However, center loss does not take inter-class similarity into account. Motivated by this, island loss [54] was proposed to simultaneously force the deep features of a sample close to its cluster center and push class cluster centers far away from each other, enlarging the distance between samples of different expressions. Nonetheless, adding more loss terms usually comes with difficulty in the training phase caused by extra hyper-parameters that need to be adjusted.
In addition, none of these aforementioned loss functions take into consideration the imbalance in the number of training samples from each emotion class in the training dataset. This may lead FER models to perform poorly with minority classes, even though they perform well with majority classes.
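
For reference, center loss [20] and its center-update step can be sketched as follows (a simplified version: each used center moves toward its per-batch class mean at rate `alpha`, rather than following the exact scaled update of [20]):

```python
import numpy as np

def center_loss_and_update(features, labels, centers, alpha=0.5):
    """Center loss sketch: penalize the squared distance of each deep
    feature to its class center, then nudge each used center toward the
    batch mean of its class (learning rate alpha). Simplified for clarity."""
    diffs = features - centers[labels]
    loss = 0.5 * (diffs ** 2).sum(axis=1).mean()
    new_centers = centers.copy()
    for c in np.unique(labels):
        mask = labels == c
        delta = (features[mask] - centers[c]).mean(axis=0)
        new_centers[c] = centers[c] + alpha * delta
    return loss, new_centers

# Toy batch: two classes with 2-D features sitting exactly on their centers
centers = np.array([[0.0, 0.0], [2.0, 2.0]])
labels = np.array([0, 0, 1])
loss, new_centers = center_loss_and_update(centers[labels], labels, centers)
```

When every feature coincides with its class center the loss is zero and the centers stay put; note the absence of any term pushing different centers apart, which is exactly the gap that island loss and our weighted-cluster loss address.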

By analyzing the complementary nature of weighted loss and center loss, we propose a new loss function, named weighted-cluster loss, which not only takes highly skewed facial emotion data into consideration, but also uses multiple loss terms to improve the performance of FER models.
