**4. Results**

#### *4.1. Recognition and Reliability Measures*

Recognition rate and Intra-Class Correlation (ICC) values were used to evaluate the proposed automated AU intensity measurement in our study. This statistical index ranges from 0 to 1.

$$ICC = \frac{BMS - EMS}{BMS + (k - 1) \times EMS} \tag{5}$$

$$EMS = \frac{ESS}{(k-1) \times (n-1)}\tag{6}$$

$$BMS = \frac{BSS}{n-1} \tag{7}$$

The ICC also serves as a measure of conformity for our data set, since it has multiple targets. The setting is one in which *n* participants (targets) are rated by *k* judges; in our study, we assume *n* = 6 and *k* = 2. The ICC is preferred over the Pearson correlation between measurements and judges because it reflects the proportion of the total variance that is due to differences between the targets. Here, BMS denotes the Between-target Mean Squares and EMS the Residual (Error) Mean Squares, both obtained from an ANOVA (Analysis of Variance); BSS and ESS are the corresponding sums of squares.
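As a sanity check, Equations (5)–(7) can be computed directly from the *n* × *k* matrix of judge ratings via a two-way ANOVA decomposition. The sketch below is a minimal numpy implementation; the `icc` helper name is ours, not part of any library:

```python
import numpy as np

def icc(ratings):
    """ICC per Eqs. (5)-(7): ratings is an (n targets x k judges) array."""
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)                    # one mean per target
    col_means = ratings.mean(axis=0)                    # one mean per judge
    bss = k * np.sum((row_means - grand) ** 2)          # between-target SS
    jss = n * np.sum((col_means - grand) ** 2)          # between-judge SS
    ess = np.sum((ratings - grand) ** 2) - bss - jss    # residual SS
    bms = bss / (n - 1)                                 # Eq. (7)
    ems = ess / ((k - 1) * (n - 1))                     # Eq. (6)
    return (bms - ems) / (bms + (k - 1) * ems)          # Eq. (5)
```

With *n* = 6 and *k* = 2 as in our study, two judges who agree perfectly on all six targets yield an ICC of exactly 1.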

#### *4.2. Result Analysis Based on Intensity of Emotions*

The previous section discussed the feature extraction techniques we implemented for measuring AU intensities in facial emotions: LBP, Histogram of Oriented Gradients (HOG), and Gabor features. These were combined with classification techniques such as the Support Vector Machine (SVM), Random Forest, and Nearest Neighbor classifiers. Given all the image observations, we apply the network to measure the intensity of emotion recognition for each AU. The results are presented in Table 3.
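To make the descriptor-plus-classifier pipeline concrete, the following numpy-only sketch implements one of the three descriptors (a radius-1, 8-neighbour LBP histogram) together with a simple nearest-neighbour rule. The function names and parameters are illustrative and are not the exact configuration used in our experiments:

```python
import numpy as np

def lbp_histogram(img):
    """Radius-1, 8-neighbour LBP: returns a 256-bin normalised histogram."""
    img = np.asarray(img, dtype=float)
    c = img[1:-1, 1:-1]                       # interior (center) pixels
    # neighbour offsets, clockwise from the top-left corner
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros(c.shape, dtype=int)
    for bit, (dy, dx) in enumerate(shifts):
        nb = img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
        code |= (nb >= c).astype(int) << bit  # set bit where neighbour >= center
    hist = np.bincount(code.ravel(), minlength=256).astype(float)
    return hist / hist.sum()

def nn_predict(train_feats, train_labels, feat):
    """1-nearest-neighbour rule on Euclidean distance between histograms."""
    d = np.linalg.norm(np.asarray(train_feats) - np.asarray(feat), axis=1)
    return train_labels[int(np.argmin(d))]
```

In practice, the LBP histograms are computed per facial region and concatenated before classification; library implementations such as scikit-image's `local_binary_pattern` provide rotation-invariant and uniform variants beyond this basic sketch.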

As shown in Table 3, the best results were achieved with *LBP* and the nearest neighbor classifier when using all three features, because this combination models the static relationships between the AU intensities. All values in the table are detection accuracy percentages. AUs that were not present in certain cases are indicated as NA (Not Applicable), while a zero indicates that the AU was present but was not recognized at all (0% accuracy).


**Table 3.** AU Intensity Measurement Results—HOG, Gabor and LBP (Acc: Accuracy Percentage).

For image observations that are not recognized very accurately, improvements are seen with Gabor wavelet features combined with the random forest classifier. Table 4 shows the performance of the individual features when combined with popular ML algorithms. In addition, a correlation analysis of the AUs was performed; Table 5 lists the resulting correlation matrix for action units 1 and 2. The intensities of the two AUs are mutually dependent: a high intensity of AU1 implies a high probability of a high AU2 intensity, and vice versa. When AU2 is at level 0, the probability for AU1 is 0.982, and when AU2 is at level "3", the probability of AU1 being at level "3" is 0.88. By exploiting such dependency relationships between the action units, both the ICC and the accuracy of the various algorithms improved. Although not shown in the table, the accuracy increased from 68.32% to 71.95% for Random Forest when the AU dependency relationship was used with HOG features; similarly, for Gabor features, the accuracy increased from 79.11% to 82.13% when the nearest neighbor algorithm was used. Since the AU intensity inference phase and the feature extraction phase are independent of each other, higher accuracy is achieved. Table 4 also shows that LBP performs best when used with SVM.
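One simple way to exploit such a dependency is to reweight a classifier's per-level AU1 scores by a conditional table P(AU1 | AU2) before taking the argmax. In the sketch below, only the two entries quoted from Table 5 (0.982 and 0.88) come from our data; the remaining table values are illustrative placeholders, and the function name is ours:

```python
import numpy as np

# Hypothetical conditional table P(AU1 = i | AU2 = j) over intensity levels 0-3.
# Only the entries 0.982 and 0.88 are taken from Table 5; the rest are
# illustrative placeholders (each column sums to 1).
P_AU1_GIVEN_AU2 = np.array([
    [0.982, 0.10, 0.05, 0.02],
    [0.010, 0.70, 0.10, 0.05],
    [0.005, 0.15, 0.70, 0.05],
    [0.003, 0.05, 0.15, 0.88],
])

def refine_au1(au1_scores, au2_level):
    """Reweight the classifier's AU1 intensity scores by the AU1|AU2
    dependency and return the refined intensity level (argmax)."""
    weighted = np.asarray(au1_scores, dtype=float) * P_AU1_GIVEN_AU2[:, au2_level]
    return int(np.argmax(weighted))
```

For an ambiguous score vector such as `[0.4, 0.1, 0.1, 0.4]`, the refinement resolves AU1 to level 3 when AU2 is observed at level 3 and to level 0 when AU2 is at level 0, mirroring the proportional dependency described above.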

**Table 4.** Summary of Accuracies for various Feature Types.



**Table 5.** Correlation Matrix for Action Units.

We also compared our work with related methods. As shown in Table 6, a few recent works [60,81] used similar features (HOG and Gabor in both; LBP in one) and the DISFA database; therefore, for a fair comparison, we applied our approach to the same feature selection methods and database. It is clear from Table 6 that the proposed method performs far better than the other works under these conditions. Table 7 presents the characteristics of the comparative state-of-the-art methods, listing the databases, feature extraction methods, and ML methods used in each work. One of the primary differences between our work and other recent works is the use of a greater number of databases for training the ML algorithm while using similar feature extraction algorithms. Although the average accuracy and ICC percentages improved for all feature extraction methods, the highest improvement was seen with LBP when the SVM model was trained using 5 databases, as seen in Table 6.

**Table 6.** Comparison with state-of-the-art methods.


**Table 7.** Characteristics of the state-of-the-art comparative methods.


Furthermore, we evaluated the performance of the proposed method on several other databases, including JAFFE, CK, B-DFE, and our own dataset of 200 images. These results are presented in Table 8 and clearly show that LBP performs better for the first three databases, while its performance is quite close to the best performance of Gabor for the remaining databases. A few of these results are also presented as bar charts in Figures 6–8 for the respective databases, to allow a better visual comparison. It is evident from these figures that LBP gives better results for almost all AUs and gives the best results when combined with SVM. Therefore, we conclude that the LBP feature extraction method, when used with SVM, works best for facial emotion intensity recognition.

**Figure 6.** Comparison of AU intensity Labels on Database; panel (**c**) uses RF, RF-HOG, RF-Gabor, and RF-LBP.

**Figure 7.** Comparison of AU intensity Labels on Database; panel (**c**) uses RF, RF-HOG, RF-Gabor, and RF-LBP.

**Figure 8.** Comparison of AU intensity Labels on Database; panel (**a**) uses kNN, kNN-HOG, kNN-Gabor, and kNN-LBP, and panel (**c**) uses RF, RF-HOG, RF-Gabor, and RF-LBP.

In conclusion, it is evident from Table 3 that LBP-kNN detects almost all AUs with high accuracy (>94%), while the other techniques show this level of accuracy for only a few AUs. Therefore, we conclude that LBP-SVM will recognize almost all emotions at all intensity levels better than the other studied techniques. It should also be noted that the average accuracy of detecting the intensity of emotion decreases for all emotions as the intensity increases; however, LBP-SVM still performs better than Gabor-SVM and HOG-SVM on average.


**Table 8.** Performance of proposed method for different Databases.

Nevertheless, in real-world applications, accuracy also depends on various other factors, such as image quality, the environment in which the image was captured (controlled or uncontrolled), the angle of the face, the age of the person (fine lines on the face can make a huge difference in accuracy), and the lighting conditions. Accuracy also differs between men and women, since women tend to express emotions more vividly than men. All these factors come into consideration for emotion intensity as well as emotion detection in real-world face recognition systems and might significantly affect the performance of any technique.

#### **5. Conclusions and Future Work**

AUs are popularly used for measuring facial emotion intensities from facial expressions. An adequate amount of data is required for training and testing classifiers to obtain their best performance in terms of accuracy. The majority of the available databases consist of posed facial expressions or provide only emotion labels. Hence, our research focused on the publicly available databases whose AUs are annotated on a 6-point intensity scale. The experimentation was performed for both spontaneous and posed facial emotion intensity recognition, from which we conclude that AU intensity is not always reliable and accurate in the case of spontaneous facial expressions. This is due to the ambiguity and dynamic nature of facial emotions when spontaneous expressions are taken into consideration. Measuring the intensity of emotions requires not only improving the accuracy of the feature extraction algorithms but also exploiting the facial actions themselves: it is the spatiotemporal interactions of synchronized and coherent facial actions that produce a full facial display. In our work, we presented a probabilistic model to calculate the ICC values and accuracies among the dynamic and semantic AU intensity levels, and AU intensity recognition is accomplished by integrating the images systematically with the proposed model. The accuracies for the various algorithms (LBP, HOG, and Gabor) indicate that LBP achieves the highest accuracy in most cases. As future work, several hidden layers could be added to neural networks to specifically handle each challenge in spontaneous emotion intensity recognition, such as head tilt and angle.

**Author Contributions:** Conceptualization, M.F.H.S. and A.Y.J.; Investigation, D.M.; Methodology, D.M.; Project administration, A.Y.J.; Resources, A.Y.J.; Supervision, A.Y.J.; Validation, M.F.H.S.; Writing—original draft, D.M.; Writing—review & editing, M.F.H.S. and A.Y.J.

**Funding:** This work was not supported by any funding agency.

**Acknowledgments:** The authors are thankful to Paul A. Hotmer Family CSTAR (Cyber Security and Teaming Research) lab and the Electrical Engineering and Computer Science Department at the University of Toledo.

**Conflicts of Interest:** The authors declare no conflict of interest.
