#### *3.2. Fine-Tuning*

In this stage, we add one more fully connected layer to the pre-trained SE-ResNet-50 model (before the output layer; see Table 2) and change the output layer to an eight-class classifier. In addition, we freeze the weights and biases of Stage 1 to Stage 3 of the pre-trained model; this practice is common in transfer learning-based systems when computing power is limited. Finally, we fine-tune the adjusted model on the AffectNet dataset [5] to recognize the eight facial expressions (happy, sad, surprise, fear, contempt, anger, disgust, and neutral).


**Table 2.** Detailed architectures of SE-ResNet-50 in the fine-tuning phase. Convolution blocks (SE-ResNet-50 uses the SE-ResNet block [21]) are shown in brackets, with the numbers of blocks stacked.
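As a rough illustration of the fine-tuning setup described above, the sketch below freezes the early stages of a pre-trained backbone, inserts one extra fully connected layer, and replaces the output layer with an eight-class classifier. It uses PyTorch with torchvision's ResNet-50 as a stand-in (SE-ResNet-50 is not bundled with torchvision); the stage-to-attribute mapping and the width of the added layer are assumptions for illustration only.

```python
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 8  # happy, sad, surprise, fear, contempt, anger, disgust, neutral

# Stand-in backbone: torchvision's ResNet-50 pre-trained on ImageNet.
# (SE-ResNet-50 would be loaded from a third-party implementation instead.)
backbone = models.resnet50(weights="IMAGENET1K_V1")

# Freeze Stage 1 to Stage 3 of the pre-trained model (assumed to map to the
# stem plus the first three residual stages).
for stage in (backbone.conv1, backbone.bn1, backbone.layer1,
              backbone.layer2, backbone.layer3):
    for param in stage.parameters():
        param.requires_grad = False

# Add one more fully connected layer before the output layer and replace
# the output layer with an eight-class classifier.
in_features = backbone.fc.in_features        # 2048 for (SE-)ResNet-50
backbone.fc = nn.Sequential(
    nn.Linear(in_features, 512),             # added FC layer (512 is an assumed width)
    nn.ReLU(inplace=True),
    nn.Linear(512, NUM_CLASSES),             # new 8-class output layer
)
```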

#### *3.3. Weighted-Cluster Loss*

#### 3.3.1. Review of Weighted-Softmax Loss

Weighted-softmax loss [5] is often used as a solution to the imbalanced dataset problem. It weights the cross-entropy loss of each emotion class according to that class's proportion of the total number of samples in the training dataset. The weighted-softmax loss over a mini-batch is defined as:

$$L\_{\text{Weighted-softmax}} = -\sum\_{i=1}^{m} W\_{y\_i} \log(\hat{p}\_i) \tag{1}$$

where *m* is the number of training samples in the mini-batch (i.e., the batch size), *yi* is the class label of the *ith* training sample, *Wyi* denotes the weight assigned to the class with label *yi*, and $\hat{p}_i$ is the softmax-predicted probability that the *ith* sample belongs to its ground-truth class.
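For concreteness, a minimal PyTorch sketch of Equation (1) is given below; the function and variable names are ours, and the class weights are assumed to be precomputed (e.g., with Equation (6)).

```python
import torch
import torch.nn.functional as F

def weighted_softmax_loss(logits, targets, class_weights):
    """Weighted-softmax loss of Equation (1), summed over a mini-batch.

    logits:        (m, k) raw class scores for a mini-batch of size m
    targets:       (m,)   ground-truth labels y_i
    class_weights: (k,)   per-class weights W
    """
    log_p = F.log_softmax(logits, dim=1)                            # log of softmax probabilities
    log_p_true = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)   # log(p_hat_i) of the true class
    return -(class_weights[targets] * log_p_true).sum()             # weighted sum over the mini-batch
```

The same value can also be obtained with `F.cross_entropy(logits, targets, weight=class_weights, reduction="sum")`.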

Although weighted-softmax loss is able to tackle the imbalanced dataset problem, it still has limitations. Since it simply adds class weights to the conventional softmax loss, it can handle neither high inter-class similarity nor high intra-class variations.

#### 3.3.2. Review of Center Loss

Center loss [20] can be combined with the conventional softmax loss to reduce intra-class variations by compressing samples towards their corresponding class centers in the feature space during training. The center loss over a mini-batch is defined as:

$$L\_{\text{Center}} = \frac{1}{2} \sum\_{i=1}^{m} \left\| \mathbf{x}\_i - \mathbf{c}\_{y\_i} \right\|\_2^2 \tag{2}$$

where *L*Center denotes the center loss, $\mathbf{x}_i \in \mathbb{R}^d$ denotes the deep feature of the *ith* training sample (i.e., taken from the last fully connected layer, before the output layer), $\mathbf{c}_{y_i} \in \mathbb{R}^d$ denotes the center of class *yi*, and *d* is the feature dimension.

To train the CNN model, joint supervision of softmax loss and center loss is employed as follows:

$$L = L\_{\text{Softmax}} + \lambda L\_{\text{Center}} \tag{3}$$

where *L* denotes the total loss, $L_{\text{Softmax}} = -\sum_{i=1}^{m} \log(\hat{p}_i)$ is the cross-entropy softmax loss, and *λ* is a scalar used to balance the two losses.
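A minimal PyTorch sketch of Equations (2) and (3) might look as follows; treating the centers as a learnable parameter (rather than using the separate update rule of [20]) is a simplification, and the feature dimension and *λ* value are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CenterLoss(nn.Module):
    """Center loss of Equation (2): 0.5 * sum_i ||x_i - c_{y_i}||^2."""

    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.zeros(num_classes, feat_dim))  # class centers c_j

    def forward(self, features, targets):
        diff = features - self.centers[targets]   # x_i - c_{y_i}
        return 0.5 * (diff ** 2).sum()

center_loss = CenterLoss(num_classes=8, feat_dim=512)   # placeholder feature dimension

def joint_loss(logits, features, targets, lam=0.01):    # lam is a placeholder value
    """Joint supervision of Equation (3): L = L_Softmax + lambda * L_Center."""
    softmax_loss = F.cross_entropy(logits, targets, reduction="sum")  # -sum_i log(p_hat_i)
    return softmax_loss + lam * center_loss(features, targets)
```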

Since inter-class similarities are ignored by the center loss, the class clusters may move closer to, or even overlap, each other. Furthermore, center loss was not designed to deal with the imbalanced dataset problem. Due to dataset imbalance, the centers of major emotion classes are updated more frequently than those of minor classes, which leads to poor performance of the FER model on the minor classes.

#### 3.3.3. The Proposed Weighted-Cluster Loss

To address the limitations of existing loss functions (e.g., weighted-softmax loss and center loss), we propose a new loss function called weighted-cluster loss. It tackles the imbalanced data problem by taking into account the imbalanced number of samples of each emotion class in the training set. Furthermore, weighted-cluster loss adds a new term to center loss that pulls the centers of the classes apart from one another. This may allow the model to simultaneously handle the high intra-class variation and the high inter-class similarity in FER datasets. The weighted-cluster loss over a mini-batch is given by Equation (4):

$$L\_{\text{Weighted-cluster}} = \frac{1}{2} \sum\_{i=1}^{m} W\_{y\_i} \frac{\left\| \mathbf{x}\_i - \mathbf{c}\_{y\_i} \right\|\_2^2}{\left( \sum\_{j=1, j \neq y\_i}^{k} \left\| \mathbf{c}\_j - \mathbf{c}\_{y\_i} \right\|\_2^2 \right) + \alpha} \tag{4}$$

where *Wyi* denotes the weight assigned to the class with label *yi*, *k* denotes the number of emotion classes, and the constant *α* prevents the denominator from equaling 0. In this paper, we set *α* = 1 by default.

In Equation (4), the numerator penalizes the distance between the deep feature of a training sample (i.e., taken from the last fully connected layer, before the output layer) and its corresponding class center, and the denominator penalizes the distance between that center and all other class centers. By minimizing the weighted-cluster loss, the deep features of training samples from the same class (i.e., the same cluster) are compacted in the feature space, whereas the distances between the clusters of different classes are enlarged. In addition, the weight *Wyi* is used to tackle the imbalanced dataset problem by penalizing the model less for misclassifying samples from majority classes while heavily penalizing it for misclassifying samples from minority classes.
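A sketch of Equation (4) under the same assumptions (PyTorch, function and variable names of our choosing) is shown below; note that the term for the sample's own class contributes zero to the denominator sum, so it does not need to be excluded explicitly.

```python
import torch

def weighted_cluster_loss(features, targets, centers, class_weights, alpha=1.0):
    """Weighted-cluster loss of Equation (4), summed over a mini-batch.

    features:      (m, d) deep features x_i from the last fully connected layer
    targets:       (m,)   ground-truth labels y_i
    centers:       (k, d) class centers c_j
    class_weights: (k,)   per-class weights W (Equation (6))
    alpha:         constant keeping the denominator away from 0 (1 by default)
    """
    c_true = centers[targets]                                   # (m, d) centers c_{y_i}
    numer = ((features - c_true) ** 2).sum(dim=1)               # ||x_i - c_{y_i}||^2
    # ||c_j - c_{y_i}||^2 for every class j; the j = y_i entry is zero.
    dists = ((centers.unsqueeze(0) - c_true.unsqueeze(1)) ** 2).sum(dim=2)   # (m, k)
    denom = dists.sum(dim=1) + alpha
    return 0.5 * (class_weights[targets] * numer / denom).sum()
```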

The overall loss function of FER training is given by:

$$\begin{split} L &= L\_{\text{Weighted-softmax}} + \lambda L\_{\text{Weighted-cluster}} \\ &= -\sum\_{i=1}^{m} W\_{y\_i} \left( \log(\hat{p}\_{i}) - \frac{\lambda}{2} \frac{\left\| \mathbf{x}\_{i} - \mathbf{c}\_{y\_{i}} \right\|\_{2}^{2}}{\sum\_{j=1, j \neq y\_i}^{k} \left\| \mathbf{c}\_{j} - \mathbf{c}\_{y\_{i}} \right\|\_{2}^{2} + \alpha} \right) \end{split} \tag{5}$$

where *λ* is used to balance the two losses.

In this work, we define *Wyi* as:

$$W\_{y\_i} = \frac{N\_{\min}}{N\_{y\_i}} \tag{6}$$

where *N*min denotes the number of samples in the smallest class (i.e., disgust), and *Nyi* denotes the number of samples in the class with label *yi*.
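The class weights of Equation (6) can be derived directly from the per-class sample counts, e.g. as in the sketch below (the counts shown are purely illustrative, not the actual AffectNet statistics).

```python
import torch

def class_weights_from_counts(class_counts):
    """Per-class weights of Equation (6): W_y = N_min / N_y."""
    counts = torch.as_tensor(class_counts, dtype=torch.float32)
    return counts.min() / counts

# Hypothetical counts for the eight classes: the largest class gets the smallest weight.
weights = class_weights_from_counts([50000, 20000, 12000, 5000, 3000, 18000, 2800, 40000])
```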

The joint weighted-cluster loss and weighted-softmax loss can be used directly for training deep neural networks. The class centers $\mathbf{c}_{y_i}$ are updated at each iteration of the training process. Specifically, the class centers are updated over mini-batches, with a scalar *γ* controlling the learning rate of the class centers (in our experiments, we set *γ* = 1). The partial derivative of the weighted-cluster loss, *L*Weighted-cluster, with respect to a sample's deep feature $\mathbf{x}_i$ can be calculated as follows:

$$\frac{\partial L\_{\text{Weighted-cluster}}}{\partial \mathbf{x}\_{i}} = W\_{y\_{i}} \frac{\mathbf{x}\_{i} - \mathbf{c}\_{y\_{i}}}{\left(\sum\_{l=1, l \neq y\_i}^{k} \left\|\mathbf{c}\_{l} - \mathbf{c}\_{y\_{i}}\right\|\_{2}^{2}\right) + \alpha} \tag{7}$$

The update of the *jth* class center can be calculated with Equation (8):

$$\Delta \mathbf{c}\_{j}^{t} = \sum\_{i=1}^{m} W\_{y\_{i}} \frac{\delta(y\_{i} = j)\,(\mathbf{c}\_{j} - \mathbf{x}\_{i})}{\left(\sum\_{l=1, l \neq y\_i}^{k} \left\|\mathbf{c}\_{l} - \mathbf{c}\_{y\_{i}}\right\|\_{2}^{2}\right) + \alpha} \tag{8}$$

where *δ*(*yi* = *j*) = 1 if *yi* = *j*, and *δ*(*yi* = *j*) = 0 otherwise.

Then, the class centers can be updated in each mini-batch with a learning rate *γ*:

$$\mathbf{c}\_{j}^{t+1} = \mathbf{c}\_{j}^{t} - \gamma \Delta \mathbf{c}\_{j}^{t} \tag{9}$$
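A sketch of the mini-batch center update of Equations (8) and (9) is given below (PyTorch, our own function names); since the denominator of Equation (8) depends only on the class of the sample, it is computed once per class *j*.

```python
import torch

def update_centers(centers, features, targets, class_weights, gamma=1.0, alpha=1.0):
    """Update the class centers with Equations (8) and (9)."""
    delta = torch.zeros_like(centers)                       # Delta c_j^t for every class j
    for j in range(centers.size(0)):
        mask = targets == j                                 # delta(y_i = j)
        if mask.any():
            # Sum of ||c_l - c_j||^2 over the other centers, plus alpha (Eq. (8) denominator).
            denom = ((centers - centers[j]) ** 2).sum() + alpha
            # W_j * sum_i delta(y_i = j) * (c_j - x_i), divided by the denominator.
            delta[j] = class_weights[j] * (centers[j] - features[mask]).sum(dim=0) / denom
    return centers - gamma * delta                          # Equation (9)
```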

In Algorithm 1, we summarize the learning process of the FER model with the joint loss functions.

**Algorithm 1** Learning algorithm of the FER model with the joint loss functions

**Input:** Training data {$\mathbf{x}_i$}, mini-batch size *m*, number of iterations *T*, learning rates *μ* and *γ*, and hyper-parameter *λ*


1: **for** *t* = 1, 2, . . . , *T* **do**
2: Compute the deep features $\mathbf{x}_i^t$ of the mini-batch by forward propagation
3: Calculate the joint loss:
4: $L^t = L^t_{\text{Weighted-softmax}} + \lambda L^t_{\text{Weighted-cluster}}$
5: Compute the backpropagation error for each *i*:
6: $\frac{\partial L^t}{\partial \mathbf{x}_i^t} = \frac{\partial L^t_{\text{Weighted-softmax}}}{\partial \mathbf{x}_i^t} + \lambda \frac{\partial L^t_{\text{Weighted-cluster}}}{\partial \mathbf{x}_i^t}$
7: Update the weighted-softmax loss parameters *θ*:
8: $\theta^{t+1} = \theta^t - \mu \frac{\partial L^t_{\text{Weighted-softmax}}}{\partial \theta^t}$
9: Update the weighted-cluster loss parameters $\mathbf{c}_j$ for each *j* as in Equation (9):
10: $\mathbf{c}_j^{t+1} = \mathbf{c}_j^t - \gamma \Delta \mathbf{c}_j^t$
11: Update the network parameters *W*:
12: $W^{t+1} = W^t - \mu \frac{\partial L^t}{\partial \mathbf{x}_i^t} \frac{\partial \mathbf{x}_i^t}{\partial W^t}$
13: **end for**

**Output:** Network layer parameters *W*
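Putting the pieces together, the loop below is a rough sketch of Algorithm 1 that reuses the functions sketched above (`weighted_softmax_loss`, `weighted_cluster_loss`, and `update_centers`); the model interface (returning both the deep features and the class scores), the feature dimension, and the hyper-parameter values are assumptions, not the authors' settings.

```python
import torch

def train_fer_model(model, loader, class_weights, num_iterations,
                    feat_dim=512, mu=0.01, gamma=1.0, lam=0.01, alpha=1.0):
    """A sketch of Algorithm 1. `model(images)` is assumed to return
    (features, logits): the deep features x_i and the class scores."""
    centers = torch.zeros(len(class_weights), feat_dim)          # class centers c_j
    optimizer = torch.optim.SGD(model.parameters(), lr=mu)       # updates theta and W
    data_iter = iter(loader)
    for t in range(num_iterations):
        try:
            images, targets = next(data_iter)
        except StopIteration:                                    # restart the loader if exhausted
            data_iter = iter(loader)
            images, targets = next(data_iter)
        features, logits = model(images)
        # Steps 3-4: joint loss L^t = L_Weighted-softmax + lambda * L_Weighted-cluster.
        loss = weighted_softmax_loss(logits, targets, class_weights) \
               + lam * weighted_cluster_loss(features, targets, centers, class_weights, alpha)
        # Steps 5-8 and 11-12: backpropagate and update the network parameters.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Steps 9-10: update the class centers with Equations (8) and (9).
        with torch.no_grad():
            centers = update_centers(centers, features.detach(), targets,
                                     class_weights, gamma, alpha)
    return model
```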
