#### *4.4. Confusion Matrices*

To analyze the experimental results further, the confusion matrices of SETFNet and SETFNet + global are shown in Tables 6 and 7, respectively. The labels on the left-hand side represent the actual classes and those at the bottom represent the predicted classes; each percentage in a matrix was calculated by dividing the number of samples assigned to a predicted class by the total number of samples in the corresponding actual class. After adding the global stream, the recognition rate of each expression increases by 1–2%. Tables 6 and 7 show that, whether or not the global face stream is added, happiness and surprise have high recognition rates, while fear and disgust have relatively low ones. The low rates for fear and disgust may be due to the slight movement of their AUs, which makes them more difficult to distinguish from other expressions. Moreover, disgust is confused with anger, fear, and sadness, and fear is confused with anger, disgust, happiness, and surprise, perhaps because their appearance and movements are similar.
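The row normalization described above can be sketched in a few lines. This is an illustrative helper, not code from the paper, and the 3 × 3 counts below are made-up numbers, not results from any of the tables.

```python
def row_normalize(counts):
    """Convert raw confusion counts to per-class percentages.

    counts[i][j] = number of samples of actual class i predicted as class j.
    Each row is divided by its actual-class total, so every row sums to 100.
    """
    result = []
    for row in counts:
        total = sum(row)
        result.append([100.0 * c / total for c in row])
    return result

counts = [
    [18, 1, 1],   # actual class 0
    [2, 16, 2],   # actual class 1
    [0, 3, 17],   # actual class 2
]
percent = row_normalize(counts)
print([round(v, 1) for v in percent[0]])  # -> [90.0, 5.0, 5.0]; the diagonal 90.0 is class 0's recognition rate
```

The diagonal of the normalized matrix then directly gives each class's recognition rate.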

**Table 6.** Confusion matrix of SETFNet. Labels on the left-hand side represent actual classes; those at the bottom represent predicted classes.



**Table 7.** Confusion matrix of SETFNet + global. Labels on the left-hand side represent actual classes; those at the bottom represent predicted classes.

SETFNet + global takes the entire face as input. More input features should, in general, increase the true prediction values (the values on the diagonal of the confusion matrix) and decrease the false prediction values (zero values remaining unchanged). Comparing Tables 6 and 7 shows that SETFNet + global does increase all true prediction values. However, more input does not always decrease the false prediction values: Table 7 contains increased false prediction values, indicated by up-pointing arrows. Because the database is small, prediction values can vary due to noise. To ensure that the identified false prediction values increased as a result of the additional input features rather than noise, we also located their paired false prediction values. Each false prediction value pair appears in the same color in Table 7; for example, 9.54% (fear predicted as anger) and 0% (anger predicted as fear) in green. Only when both paired values increase can the two expressions be considered more confused with each other in SETFNet + global.
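The pairing criterion can be stated compactly: expressions *i* and *j* count as "more confused" in the second model only if both off-diagonal cells (*i*, *j*) and (*j*, *i*) increased. A minimal sketch, using illustrative matrices rather than the paper's tables:

```python
def more_confused_pairs(m_before, m_after):
    """Return pairs (i, j), i < j, where BOTH paired false predictions grew."""
    n = len(m_before)
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            if m_after[i][j] > m_before[i][j] and m_after[j][i] > m_before[j][i]:
                pairs.append((i, j))
    return pairs

before = [[90.0, 3.5, 6.5],
          [2.5, 92.0, 5.5],
          [4.0, 1.0, 95.0]]
after  = [[91.0, 8.2, 0.8],   # cell (0, 1) increased ...
          [4.1, 93.0, 2.9],   # ... and so did its pair (1, 0)
          [3.0, 1.0, 96.0]]
print(more_confused_pairs(before, after))  # -> [(0, 1)]
```

A one-sided increase, such as cell (1, 2) here, is ignored as likely noise, matching the criterion in the text.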

Under this criterion, we can see that with SETFNet + global, sadness tends to be recognized as disgust more often (8.25% versus 3.52%), and disgust as sadness more often (4.08% versus 2.50%). The reason might be that, in sadness and disgust expressions, the lower cheek areas share an up-and-down movement pattern due to the movement of AU15 or AU10 [44]. When SETFNet + global takes these similar movement patterns as input, sadness is recognized as disgust more often.

Tables 8–11 show the confusion matrices of the comparison algorithms, with the labels on the left-hand side representing the actual classes and those at the bottom the predicted classes. The confusion matrix of NIRExpNet (Table 8) was adopted directly from [37]. The other matrices were obtained by implementing the algorithms in MATLAB on the database (tenfold cross-validation). Happiness and surprise again have higher recognition rates than the other expressions in all algorithms. Fear has the lowest average recognition rate, and disgust has an average recognition rate similar to that of anger and sadness. This trend is in accord with what SETFNet reveals.


**Table 8.** Confusion matrix of NIRExpNet.

**Table 9.** Confusion matrix of 3D CNN DAP.



**Table 10.** Confusion matrix of DTAGN.



**Table 11.** Confusion matrix of LBP-TOP.

To further analyze the discrimination ability of the different methods, we counted the number of zero false prediction values in each matrix. Each such zero indicates that the method never mistakes one of the corresponding pair of expressions for the other. NIRExpNet has 20 zero false prediction values, far more than the other methods; 3D CNN DAP, DTAGN, and LBP-TOP have similar numbers (approximately 12). These results indicate that NIRExpNet performs best at distinguishing one expression from the others, possibly because NIRExpNet is designed specifically for this dataset: the features it extracts are balanced, so the possibility of confusing one expression with another is small.
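Counting zero false prediction values amounts to counting (numerically) zero off-diagonal cells. A small sketch, again with a made-up matrix:

```python
def count_zero_false_predictions(matrix, tol=1e-9):
    """Count off-diagonal cells of a confusion matrix that are zero.

    A tolerance is used because percentage matrices may carry rounding noise.
    """
    n = len(matrix)
    return sum(1 for i in range(n) for j in range(n)
               if i != j and abs(matrix[i][j]) < tol)

m = [[95.0, 0.0, 5.0],
     [0.0, 100.0, 0.0],
     [2.0, 0.0, 98.0]]
print(count_zero_false_predictions(m))  # -> 4
```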

Some zero false prediction values do not have zero paired values, e.g., the values in red in Table 9: 4.51% of the surprise samples were recognized as anger, but 0% of the anger samples were recognized as surprise using 3D CNN DAP. This could be due to noise in the small dataset.

The F1 score and Matthews correlation coefficient (MCC), which take both precision and recall of the classification results into account and are therefore fairer measures for assessing a classifier, were calculated from the confusion matrices and are summarized in Table 12. SETFNet and SETFNet + global have the highest F1 and MCC, NIRExpNet the second highest, and 3D CNN DAP the third highest; LBP-TOP and DTAGN have the lowest F1 and MCC. This indicates that SETFNet outperforms the other methods even under more rigorous assessment. The ordering of the methods by F1 and MCC is in accord with their ordering by accuracy, which also indicates that the number of samples in each sub-category is well balanced.
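Both measures can be computed directly from a confusion matrix of raw counts. The sketch below uses macro-averaged F1 and the standard multiclass generalization of MCC (Gorodkin's formulation); it is an illustration of the metrics, not the paper's evaluation code.

```python
import math

def macro_f1_and_mcc(C):
    """Macro-averaged F1 and multiclass MCC from a confusion matrix of counts.

    C[i][j] = samples of actual class i predicted as class j.
    """
    n = len(C)
    s = sum(sum(row) for row in C)                           # total samples
    c = sum(C[k][k] for k in range(n))                       # correctly classified
    t = [sum(C[k]) for k in range(n)]                        # actual totals (rows)
    p = [sum(C[i][k] for i in range(n)) for k in range(n)]   # predicted totals (cols)

    f1s = []
    for k in range(n):
        prec = C[k][k] / p[k] if p[k] else 0.0
        rec = C[k][k] / t[k] if t[k] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    macro_f1 = sum(f1s) / n

    num = c * s - sum(pk * tk for pk, tk in zip(p, t))
    den = math.sqrt((s * s - sum(pk * pk for pk in p)) *
                    (s * s - sum(tk * tk for tk in t)))
    mcc = num / den if den else 0.0
    return macro_f1, mcc

f1, mcc = macro_f1_and_mcc([[10, 0], [0, 10]])
print(round(f1, 3), round(mcc, 3))  # a perfect classifier gives 1.0 1.0
```

For a perfect classifier both measures reach 1.0; any confusion pulls the MCC below 1.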


**Table 12.** Comparison of F1 score and MCC of different methods.

#### *4.5. Potential Application and Improvement*

SETFNet, which uses three regions of the face as input, can achieve higher recognition rates than NIRExpNet, which uses the entire face, because an SE block can automatically allocate weights to the different streams. These results suggest that automatically allocating weights to different features helps improve the recognition rate. This idea may have potential use in other recognition tasks: an SE block can always be added after a feature-fusion step to allocate weights to different features and further improve the recognition rate.
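The squeeze-and-excitation idea can be sketched without any deep-learning framework: pool each stream's feature vector to a scalar ("squeeze"), pass the scalars through a tiny gating network ("excitation"), and rescale each stream by its gate. This is a toy illustration of the mechanism only; the weights `W1`/`W2` are fixed made-up values, not the trained parameters of SETFNet's SE block.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def se_weight_streams(streams, W1, W2):
    """Rescale each feature stream by a learned gate in (0, 1).

    streams: list of feature vectors, one per facial-region stream.
    W1, W2:  toy excitation weights (hidden x n_streams, n_streams x hidden).
    """
    squeezed = [sum(s) / len(s) for s in streams]                  # squeeze: mean-pool
    hidden = [max(0.0, sum(w * z for w, z in zip(row, squeezed)))  # ReLU bottleneck
              for row in W1]
    gates = [sigmoid(sum(w * h for w, h in zip(row, hidden)))      # per-stream gate
             for row in W2]
    return [[g * v for v in s] for g, s in zip(gates, streams)]    # rescale streams

streams = [[1.0] * 4, [0.5] * 4, [2.0] * 4]      # three region streams (toy values)
W1 = [[0.1, 0.2, 0.3], [0.2, 0.1, 0.0]]
W2 = [[0.5, 0.5], [0.3, 0.7], [0.9, 0.1]]
weighted = se_weight_streams(streams, W1, W2)    # each stream scaled by its gate
```

Because the gates depend on the pooled content of all streams, the block can learn to emphasize the more informative facial regions, which is the behavior the text attributes to the SE block.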

SETFNet + global has a slightly higher recognition rate than SETFNet but consumes much more computation time. This indicates that a small part of the face can carry most of the expression information. For other facial expression recognition tasks, we may analyze only the parts of the face that carry expression information, which saves considerable computation time and makes real-time recognition feasible.

The highest recognition rate on the Oulu-CASIA NIR facial expression database (dark condition) is 98.6%, achieved by Rivera et al. [45], who proposed a directional number transitional graph (DNG) method. The confusion matrices achieved by the DNG method are summarized in Tables 13 and 14 (adopted directly from [45]), with the labels on the left-hand side representing the actual classes and those at the bottom the predicted classes. Table 13 is the confusion matrix of DNG using 3D Sobel (DNGS), and Table 14 is that of DNG using a nine-plane mask (DNGP). The recognition rate of each expression class is above 97%, with little variation between classes, which may indicate that DNG obtains features good enough to discriminate one expression from the others. In terms of zero false prediction values, DNGS has 21 and DNGP has 23, more than any of the other methods, indicating that the DNG method achieves the least confused matrix. The F1 and MCC of DNG are also higher than those of the other methods (DNGS: F1 0.9859, MCC 0.9830; DNGP: F1 0.9879, MCC 0.9856), indicating that DNG outperforms the other methods under more rigorous assessment.


**Table 13.** Confusion matrix of DNGS.


**Table 14.** Confusion matrix of DNGP.

DNG consists of carefully designed feature-extraction and feature-fusion methods, which make the extracted features robust under uneven illumination; this could be why DNG achieves the best performance. Two aspects of the DNG design could be considered in future versions of SETFNet. First, the uneven illumination in the database could be taken into account when designing the network, for example by feeding the features extracted by DNG into the network as an additional stream. Second, a more sophisticated fusion method could be adopted, e.g., the concatenation operation used in this paper could be replaced by the fusion method of DNG.

Unlike DNG, which relies on hand-crafted features, the SETFNet proposed in this paper extracts features automatically, so its design does not require background knowledge of the data. Specifically, feature extraction in this paper is performed by a 3D CNN. Since the dataset used for training is small, the proposed network is not very deep and may not extract high-level features. To further improve the recognition rate, transfer learning could be used, i.e., training a deeper CNN on a larger dataset and then fine-tuning the network on the NIR database.
