**4. Experiments**

#### *4.1. Experimental Datasets*

#### 4.1.1. AffectNet Dataset

In this work, we consider the problem of recognizing eight common facial expressions from the AffectNet dataset [5]. The AffectNet dataset contains more than 1,000,000 images collected from the Internet by querying different search engines using emotion-related tags. AffectNet is by far the largest database that provides facial expressions in two different emotion models (a categorical model and a dimensional model), and it can be used for studies on the automated recognition of facial expressions, valence, and arousal in real-world scenarios. About 450,000 images have manually annotated labels for the eight basic expressions (neutral, happy, sad, surprise, fear, disgust, anger, and contempt), as well as for non-emotion-related classes such as none, uncertain, and non-face. Figure 2 shows some sample images from the dataset.

**Figure 2.** Sample images from the AffectNet dataset (0: neutral; 1: happy; 2: sad; 3: surprise; 4: fear; 5: disgust; 6: anger; 7: contempt).

In this work, only images of the eight common emotion classes (manually annotated) were used to train the FER model. From each of the eight emotion classes, we randomly selected 500 samples for a validation set and another 500 samples were selected for a test set. The remainder were used for fine-tuning the FER models. The numbers of samples in the training, validation, and test sets are shown in Table 3.


**Table 3.** Numbers of samples in training, validation, and test sets.
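For reproducibility, the following is a minimal sketch of how such a per-class split could be constructed. It is not the authors' code; the function and variable names are illustrative only.

```python
import random
from collections import defaultdict

def split_per_class(samples, n_val=500, n_test=500, seed=0):
    """Hold out n_val + n_test randomly chosen samples per class for validation
    and testing; the remaining samples form the training set.

    `samples` is a list of (image_path, label) pairs covering the eight
    manually annotated emotion classes.
    """
    random.seed(seed)
    by_class = defaultdict(list)
    for path, label in samples:
        by_class[label].append((path, label))

    train, val, test = [], [], []
    for label, items in by_class.items():
        random.shuffle(items)
        val.extend(items[:n_val])
        test.extend(items[n_val:n_val + n_test])
        train.extend(items[n_val + n_test:])
    return train, val, test
```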

Although AffectNet is considered one of the largest facial expression databases, it still has shortcomings when used for training FER models. The database is highly imbalanced, as can be seen in Figure 3. Specifically, as shown in Table 3, the number of images in the largest category (happy, with 134,915 images) is approximately 30 times larger than that in the smallest category (contempt, with 4250 images). Furthermore, the manual annotations are subjective, which may limit the reliability of the labels. Therefore, transfer learning is still needed to mitigate these drawbacks.

**Figure 3.** Distribution of the eight classes in the training set.

#### 4.1.2. VGGFace2 Dataset

VGGFace2 [22] is a new large-scale face dataset that contains 3.31 million images of 9131 subjects, with an average of 362 images for each subject. Images were downloaded from Google's image search function, and they have large variations in pose, age, illumination, ethnicity, and profession (e.g., actors, athletes, politicians). Figure 4 shows some sample images from the VGGFace2 dataset.

**Figure 4.** Sample images from the VGGFace2 dataset.

#### *4.2. Evaluation Metrics*

There are various evaluation metrics in the literature for measuring the discriminative performance of FER models. In addition to several widely used classification metrics, such as accuracy, F1-score [23], area under the ROC curve (AUC) [24], and area under the precision–recall curve (AUC-PR) [25], two measures of inter-annotator agreement (Cohen's kappa [26] and Krippendorff's alpha [27]) are used in our work. In statistics, Cohen's kappa measures inter-rater reliability, which is the degree of agreement among raters given the same data. Krippendorff's alpha (also called Krippendorff's coefficient) is an alternative to Cohen's kappa for determining inter-rater reliability. Table 4 lists acronyms used in this paper.


**Table 4.** List of acronyms.
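As an illustration of how these metrics can be computed for an eight-class FER model, the sketch below uses scikit-learn; the one-vs-rest averaging follows the evaluation protocol described in Section 4.4, and the function name and inputs are illustrative rather than taken from our implementation. Krippendorff's alpha is not part of scikit-learn; a third-party package such as `krippendorff` could be used for it.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             cohen_kappa_score, f1_score, roc_auc_score)

def evaluate(y_true, y_pred, y_score, n_classes=8):
    """y_true, y_pred: (N,) integer labels; y_score: (N, n_classes) class probabilities."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    results = {
        "accuracy": accuracy_score(y_true, y_pred),       # multi-class accuracy
        "f1": f1_score(y_true, y_pred, average="macro"),  # F1 averaged over classes
        "kappa": cohen_kappa_score(y_true, y_pred),       # agreement between predictions and labels
    }
    # AUC and AUC-PR computed one-vs-rest per class, then averaged over the eight classes.
    aucs, auc_prs = [], []
    for c in range(n_classes):
        y_bin = (y_true == c).astype(int)
        aucs.append(roc_auc_score(y_bin, y_score[:, c]))
        auc_prs.append(average_precision_score(y_bin, y_score[:, c]))
    results["auc"] = float(np.mean(aucs))
    results["auc_pr"] = float(np.mean(auc_prs))
    return results
```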

#### *4.3. Experiment Setups and Implementation Details*

We experimented with different schemes to demonstrate the effectiveness of transfer learning and of the proposed loss function for the FER task. In detail, we fine-tuned the transfer learning-based model (i.e., SE-ResNet-50 pre-trained on the VGGFace2 dataset) with different loss settings: the conventional softmax loss, center loss with softmax loss [20], weighted-softmax loss [5], center loss with weighted-softmax loss (i.e., softmax loss replaced by weighted-softmax loss), and the proposed weighted-cluster loss with weighted-softmax loss. In addition, to evaluate the effectiveness of transfer learning, we also trained the base model from scratch on only the AffectNet training set with the same loss settings as those used for the transfer learning-based models. In all experiments, we set the *λ* value (i.e., the scalar used to balance the loss terms) to 0.5 when using center loss, and to 1.0 when using weighted-cluster loss.
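The following sketch shows how the loss terms could be combined with the balancing scalar *λ* during fine-tuning. It is a simplified illustration rather than the exact implementation; the auxiliary term (center loss or the proposed weighted-cluster loss) is left abstract, and the function name is hypothetical.

```python
import torch.nn.functional as F

def total_loss(logits, features, labels, class_weights=None, aux_loss_fn=None, lam=1.0):
    """Weighted-softmax loss (cross-entropy with per-class weights) plus an optional
    auxiliary term (e.g., center loss or weighted-cluster loss) scaled by lambda."""
    ce = F.cross_entropy(logits, labels, weight=class_weights)  # plain softmax loss if weights are None
    if aux_loss_fn is None:
        return ce
    # lam = 0.5 when the auxiliary term is center loss, 1.0 for weighted-cluster loss.
    return ce + lam * aux_loss_fn(features, labels)
```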

The pre-trained model was fine-tuned using the stochastic gradient descent algorithm with momentum = 0.9 and weight decay = 0.0005. Note that we fine-tuned the pre-trained CNN model using a much smaller dataset (AffectNet compared with VGGFace2); thus, the initial learning rate was set to 0.001, which is lower than the typical value of 0.01, so as not to drastically alter the pre-trained weights. The learning rate was halved every 10 epochs of training. For the base models trained using only AffectNet data (i.e., no pre-training), the initial learning rate was set to 0.01, and all other settings were kept the same. All experiments were implemented using the PyTorch library and were run on a four-core Xeon CPU with a single Titan-XP GPU. The batch size for fine-tuning the transfer learning-based models was set to 36, while the batch size for training the base models from scratch was set to 30.
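In PyTorch, this training configuration could look roughly as follows; `train_one_epoch` and the other arguments are placeholders for a standard training loop, not code from our implementation.

```python
import torch

def fine_tune(model, train_loader, train_one_epoch, num_epochs, lr=0.001):
    """Fine-tune with the SGD settings described above; use lr=0.01 when training from scratch."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr,
                                momentum=0.9, weight_decay=0.0005)
    # Halve the learning rate every 10 epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
    for epoch in range(num_epochs):
        train_one_epoch(model, optimizer, train_loader)  # standard forward/backward/step loop
        scheduler.step()
```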

To enrich the scale of the dataset and mitigate overfitting, we applied data augmentation. During the fine-tuning phase, input images were randomly cropped and resized to 224 × 224 pixels, and horizontal flipping was applied randomly to the cropped images. In addition, before being fed into the FER model, all input samples were normalized using the ImageNet mean and standard deviation (mean = [0.485, 0.456, 0.406], std = [0.229, 0.224, 0.225]), which is common practice for deep CNN models working with RGB images.
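One possible torchvision realization of this augmentation and normalization pipeline is sketched below; the exact crop parameters in our implementation may differ.

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),   # random crop, resized to 224 x 224 pixels
    transforms.RandomHorizontalFlip(),   # random horizontal flip of the cropped image
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet mean
                         std=[0.229, 0.224, 0.225]),   # ImageNet std
])
```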

#### *4.4. Results and Discussions*

Table 5 shows the accuracy, F1-score, Cohen's kappa, Krippendorff's alpha, AUC, and AUC-PR of different FER models on the test set. These values are averages over the eight classes. All metrics except for accuracy were calculated in a binary-class manner, whereas accuracy is defined in a multi-class manner. From the results in Table 5, we have the following observations.


**Table 5.** Recognition performance of different FER models on the test set.

First, the proposed FER model achieved the highest performance in terms of all evaluation metrics, outperforming its counterparts. In detail, model 10 (the transferred SE-ResNet-50 model fine-tuned using the proposed joint weighted-cluster and weighted-softmax loss) achieved a recognition accuracy of 60.70%, leading the second-best model (model 8) by approximately 1%. As reported in [5], the average agreement over the eight emotion categories between two human annotators (randomly chosen out of a total of 12 annotators) on only a part of the AffectNet data was only 65.56%, which may be considered the maximum achievable recognition accuracy. This emphasizes that recognizing human emotions from facial expressions in a real-world scenario is a challenging task, and that the newly proposed weighted-cluster loss function is capable of addressing challenging factors such as subtle facial appearance, head pose, illumination intensity, and occlusions.

In terms of F1-score, the proposed model (model 10) also surpassed its counterparts. In imbalanced classification problems (e.g., facial expression recognition on an imbalanced dataset such as AffectNet), the F1-score, which is the weighted average of precision and recall, gives a better measure of incorrectly classified cases than the accuracy metric. In addition, model 10 achieved the best kappa, alpha, and AUC-PR values, outperforming the models that use other loss functions. Similar to the F1-score, these values provide alternatives to accuracy when measuring the reliability of automated FER systems, showing that the proposed FER model is dependable for solving facial expression recognition problems in a real-world scenario. It should be noted that model 8 achieves the best AUC value. This is because model 8 achieves good performance on the positive class (high AUC) at the cost of a high false-negative rate, whereas the proposed model (model 10) tries to reduce the false-negative rate while maintaining good performance on the positive class.

Second, under the supervision of the same loss functions, the transferred FER models performed better than the FER models trained from scratch (models 6, 7, 8, 9, and 10 vs. models 1, 2, 3, 4, and 5, respectively). This shows the benefit of transfer learning, where a pre-trained model that has learned features for face identification can be transferred to the facial expression recognition task to improve recognition performance. The proposed model (model 10), which integrates both the transfer learning and weighted loss approaches, thus achieved the highest performance.

Next, the performance of models 1 and 6 (which use the conventional softmax loss) is lower than that of models 5 and 10 (which use the weighted-cluster loss) by a large margin (50.65% and 52.22% vs. 56.27% and 60.70%, respectively). This is because the softmax loss is incapable of handling the imbalanced data problem, which leads to poor FER performance on the minor classes (e.g., contempt and disgust). In contrast, the proposed weighted-cluster loss alleviates the imbalanced data problem by weighting the loss term of each class, which helps improve the model's performance on the small classes. In addition, the center loss-based models (models fine-tuned using joint center loss with either softmax loss or weighted-softmax loss) performed worse than the models fine-tuned using only softmax loss or weighted-softmax loss. The recognition accuracy of model 7, which uses center loss with softmax loss, is only 46.07%, which is even lower than that of model 6 (softmax loss only). A similar phenomenon can be observed for model 9: using center loss with weighted-softmax loss, it achieves lower accuracy than model 8, which uses only weighted-softmax loss. This shows that the existing center loss is not suitable for tackling data imbalance problems, where the centers of the major emotion classes are updated more frequently than the centers of the minor emotion classes.
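As an illustration of the per-class weighting, one common choice (used here only as an example, not necessarily the exact scheme of [5]) is to weight each class by its inverse frequency in the training set:

```python
import torch

def inverse_frequency_weights(class_counts):
    """Per-class weights proportional to inverse class frequency, normalized so that
    the weights average to 1; rare classes (e.g., contempt) receive larger weights."""
    counts = torch.tensor(class_counts, dtype=torch.float)
    return counts.sum() / (len(counts) * counts)

# Hypothetical usage with the per-class training counts from Table 3:
# weights = inverse_frequency_weights([n_0, n_1, ..., n_7])
# criterion = torch.nn.CrossEntropyLoss(weight=weights)
```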

Last, model 10 (transferred model that was fine-tuned using the proposed joint weighted-cluster loss and weighted-softmax loss) achieved better performance compared to model 9 (transferred model that was fine-tuned using joint center loss and weighted-softmax loss) in terms of all evaluation metrics (e.g., in terms of accuracy: 60.70% vs. 59.60%, respectively). This shows that the proposed cluster loss effectively handles the limitations of center loss and weighted-softmax loss by not only taking the imbalanced data into consideration but by also simultaneously improving intra-class compactness and enlarging inter-class differences.

Figure 5 shows the training loss and the validation accuracy of different models during the training phase. We can see that the training loss of the auxiliary loss-based models (i.e., models 8, 9, and 10) is higher than that of the models using only softmax loss or weighted-softmax loss, because more loss terms are added to the total loss function. We can also observe that all models converge as training progresses.

To compare our model with existing models, we also conducted the experiments of [46], in which the authors used a ResNet-50 model [47] pre-trained on ImageNet data [41] for a generic object recognition task as their base model and then fine-tuned it on AffectNet data for the FER task. It is worth noting that, in [46], the authors fine-tuned their model using only the conventional softmax loss. In our experiments, we further fine-tuned the ResNet-50 model (pre-trained on ImageNet) using the same loss settings that were used to fine-tune the SE-ResNet-50 model (e.g., weighted-softmax loss, center loss with softmax loss, and so forth). The recognition performance of the ResNet-50-based models is shown in Table 6.
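For reference, the ImageNet-pre-trained ResNet-50 baseline can be prepared for fine-tuning as sketched below; this is an illustration based on torchvision, not the original code of [46].

```python
import torch.nn as nn
from torchvision import models

# Load a ResNet-50 pre-trained on ImageNet and replace its classifier head
# with an eight-way output for the AffectNet emotion classes.
resnet50 = models.resnet50(pretrained=True)
resnet50.fc = nn.Linear(resnet50.fc.in_features, 8)
# The model is then fine-tuned with the same loss settings as the SE-ResNet-50 model.
```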


**Table 6.** Recognition performance of ResNet-50-based models on the test set.

As can be seen in Table 6, the transferred ResNet-50 model fine-tuned using the proposed weighted-cluster loss (i.e., model 20) outperforms its counterpart models fine-tuned using other loss functions. This strengthens the point that the proposed weighted-cluster loss function is capable of handling the imbalanced data problem in the AffectNet dataset.

By comparing the performance of our models (Table 5) with that of the existing models in [46] (Table 6), we make the following observation. Under the same loss function, our models (models 6, 7, 8, 9, and 10), which use the SE-ResNet-50 architecture, achieve better recognition performance than the corresponding models of [46] (models 16, 17, 18, 19, and 20, respectively), which use the ResNet-50 model pre-trained for object recognition. For example, in terms of recognition accuracy, our best fine-tuned model (model 10) leads the best ResNet-50-based model (model 20) by about 1.25% (60.70% vs. 59.45%). This is because we used a more advanced CNN architecture (SE-ResNet-50) that was pre-trained for face identification rather than object recognition.

**Figure 5.** Learning curves of different models over the number of training epochs. (**a**) Training loss. (**b**) Validation loss. (**c**) Validation accuracy. Best viewed in color.

To investigate the effectiveness of the proposed weighted-cluster loss function in handling the imbalanced dataset problem, we further plotted in Figure 6 the confusion matrices of the transfer learning-based models fine-tuned using the conventional softmax loss (model 6 and model 16) and the proposed joint weighted-cluster loss and weighted-softmax loss (model 10 and model 20).
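The percentages discussed below correspond to row-normalized confusion-matrix entries, so each diagonal entry is a per-class recognition accuracy. A minimal sketch of how such a matrix can be computed is given here; the function name and inputs are illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def normalized_confusion_matrix(y_true, y_pred, n_classes=8):
    """Entry (i, j) is the fraction of class-i test samples predicted as class j;
    the diagonal therefore gives the per-class recognition accuracies."""
    cm = confusion_matrix(y_true, y_pred, labels=list(range(n_classes)))
    return cm.astype(float) / cm.sum(axis=1, keepdims=True)
```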

We can see that the weighted-cluster loss effectively addresses the highly imbalanced data problem of the AffectNet dataset. In particular, FER performance on the minor emotion classes improved dramatically. For example, the recognition accuracy of the transferred SE-ResNet-50 model for the contempt class improved by roughly a factor of 10, from 6% when fine-tuned using the softmax loss to 59% when fine-tuned using the proposed weighted-cluster loss. For the disgust class, it improved by a factor of about 1.5, from 36% to 54%. This is because the proposed loss function penalizes misclassified samples from these classes more heavily. However, FER performance on the major emotion classes (e.g., happy and neutral) decreased slightly, because the loss function may not sufficiently penalize misclassified samples from these classes. The same trend can be observed for the transferred ResNet-50 model, where the recognition accuracy of the contempt class increases by a factor of about 5 (from 10% to 53%) and that of the disgust class by a factor of about 1.5 (from 30% to 47%).

It is worth noting that, although the weighted-cluster loss tries to increase the inter-class difference, similarities between some emotion classes, such as happy vs. contempt, neutral vs. contempt, surprise vs. fear, and disgust vs. anger, still remain to some extent. This is due to the natural similarity between these emotion classes. For this reason, up to 16% of neutral-class samples and 15% of happy-class samples were misclassified as the contempt class by model 10. In the opposite direction, a considerable percentage of contempt-class samples (11% and 12%) were misclassified as the neutral and happy classes, respectively. Similarly, in model 10, 18% of surprise-class samples were misclassified as the fear class and 13% of fear-class samples were misclassified as the surprise class, while the disgust and anger classes were each falsely recognized as the other at rates of 15% and 12%, respectively.

**Figure 6.** Confusion matrices of the transfer learning-based models on the test set.

#### *4.5. Threats to Validity*

• Threats to internal validity: Threats to internal validity include errors in implementing the code and conducting the experiments. Although the implementations and experiments were carefully verified, errors are still possible.

