**Driver Stress State Evaluation by Means of Thermal Imaging: A Supervised Machine Learning Approach Based on ECG Signal**

**Daniela Cardone 1,\* , David Perpetuini <sup>1</sup> , Chiara Filippini <sup>1</sup> , Edoardo Spadolini <sup>2</sup> , Lorenza Mancini <sup>2</sup> , Antonio Maria Chiarelli <sup>1</sup> and Arcangelo Merla 1,2**


Received: 14 July 2020; Accepted: 13 August 2020; Published: 15 August 2020

**Featured Application: A procedure for monitoring a driver's stress state by means of thermal infrared imaging is presented. It was validated against ECG-derived parameters through the application of supervised machine learning techniques.**

**Abstract:** Traffic accidents cause a large number of injuries, sometimes fatal, every year. Among the factors affecting a driver's performance, an important role is played by stress, which can decrease decision-making capabilities and situational awareness. From this perspective, it would be beneficial to develop a non-invasive driver stress monitoring system able to recognize the driver's altered state. In this study, a contactless procedure for drivers' stress state assessment by means of thermal infrared imaging was investigated. Thermal imaging was acquired during an experiment on a driving simulator, and thermal features of stress were investigated in comparison with a gold-standard metric (i.e., the stress index, SI) extracted from contact electrocardiography (ECG). A data-driven multivariate machine learning approach based on non-linear support vector regression (SVR) was employed to estimate the SI through thermal features extracted from facial regions of interest (i.e., nose tip, nostrils, glabella). The predicted SI showed a good correlation with the measured SI (*r* = 0.61, *p* = ~0). A two-level classification of the stress state (STRESS, SI ≥ 150, versus NO STRESS, SI < 150) was then performed based on the predicted SI. The ROC analysis showed a good classification performance, with an AUC of 0.80, a sensitivity of 77%, and a specificity of 78%.

**Keywords:** driver stress state; IR imaging; machine learning; support vector regression (SVR); advanced driver-assistance systems (ADAS)

#### **1. Introduction**

According to the latest estimates by the World Health Organization, approximately 1.35 million people die each year from road traffic accidents and between 20 and 50 million people suffer non-fatal injuries [1].

Advanced driver-assistance systems (ADAS) are designed to support humans during the driving process, leading to an increase in road safety. Conventional ADAS technologies are mainly based on controlling the vehicle state through proprioceptive (e.g., odometry, inertial sensors) and exteroceptive sensors (e.g., lidar, vision sensors, radar, infrared, and ultrasonic sensors) [2]. These state-of-the-art technologies allow for the recognition of objects [3], alerting the driver about dangerous road conditions [4], providing driver tips to improve driving comfort and safety [5], recognizing traffic activity and behavior [6], and detecting risky driving conditions [7].

In addition to the factors currently evaluated by ADAS technologies, it is also of fundamental importance to monitor the driver's psycho-physiological state, which is strictly related to driving performance, as reported by the National Highway Traffic Safety Administration (NHTSA) [8]. According to the latest estimates by the NHTSA [8], in the USA, more than 2800 people died and approximately 400,000 people were injured in crashes induced by distracted driving in 2018. In particular, driver drowsiness/fatigue and emotion (i.e., visible anger, sadness, crying, and/or emotional agitation) can increase the risk of car accidents by 3.4- and 9.8-fold, respectively [9].

Driver state monitoring is mainly based on two categories of approaches that depend on the nature of the data collected [10]. The first approach, named the behavioral method, is based on monitoring a driver's parameters including gaze direction, blinking frequency, percentage of eye closure (PERCLOS), yawning, and head pose. These parameters are evaluated by means of one or multiple visible cameras. This procedure, based on cameras, is indeed contactless and non-invasive, but it is characterized by relevant technical challenges derived from occlusion, illumination variation, and personal privacy issues. Nonetheless, because of its utility, it is often employed in the automobile industry. The second approach, labeled the physiological method, is instead based on monitoring driver's vital signals, such as those derived from electrocardiography (ECG) [11], photoplethysmography (PPG) [12], electrooculography (EOG) [13], electroencephalography (EEG) [14], galvanic skin response (GSR) [15], and electromyography (EMG) [12].

With focus on driver stress, it is known that a high stress state decreases decision-making capabilities and situational awareness, impairing driving performance [16]. Electrocardiography has been widely employed to monitor drivers' stress. In particular, human stress can be inferred by measuring an increase in heart rate as well as by measuring variations of parameters associated with heart rate variability (HRV) which, in turn, can dynamically reflect the accumulation of mental workload [17]. Among the variety of indices derived from ECG signals, Baevsky [18] proposed the Stress Index (SI), which is indicative of both sympathetic activity and central regulation.

Although the SI is a sensitive and specific metric for stress, it relies on a contact-based technology (i.e., ECG). Indeed, to ensure comfort and non-intrusiveness for drivers, the use of contact sensors for data collection should preferably be avoided. Infrared (IR) or thermal imaging is a passive technology able to evaluate the spontaneous emission of body thermal energy and measure the temperature in a contactless manner. Infrared imaging thus makes it possible to overcome the limitations of contact devices and, importantly, in comparison with visible cameras, it is not affected by illumination and can work in a completely dark environment.

Relevant to the topic of this study, IR imaging makes it possible to infer peripheral autonomic activity through the modulation of cutaneous temperature, which is a known expression of the psycho-physiological state of the subject [19,20]. In fact, experienced emotions, including stress or fatigue, can produce changes in skin temperature [21–23]. In particular, this research field has devoted great attention to stress and mental workload monitoring using thermal IR imaging. Puri et al. [24] studied computer users' stress, reporting an increased blood flow in the frontal vessels of the forehead during stressful situations. The thermal metric was shown to be correlated with stress levels in 12 participants performing a Stroop test (*r* = 0.9, excluding one outlier). Pavlidis and colleagues [25] assessed the stress level by measuring transient perspiratory responses on the perinasal area using thermal IR imaging. These metrics proved to be a good indicator of stress response, because they were sympathetically driven. The authors applied this approach in the context of surgical training, finding a very high correlation between the GSR (galvanic skin response) and the thermal measurement on the finger (*r* = 0.968) and on the perinasal region (*r* = 0.943). Kang et al. [26] used thermal IR imaging to assess effective training times by monitoring the cognitive load through facial temperature changes. Learning proficiency patterns were based on an alphanumeric task. Significant correlations, ranging from *r* = 0.88 to *r* = 0.96, were found between the nose tip temperature and the response time, accuracy, and the Modified Cooper Harper Scale ratings. Stemberger et al. [27] presented a system for the estimation of cognitive workload levels based on the analysis of facial skin temperature. Beyond thermal infrared imaging of the face, the system relied on head pose estimation, measurement of the temperature variation across regions of the face, and an artificial neural network classifier. The system was capable of accurately classifying mental workload into high, medium, and low levels 81% of the time.

Given the advantages of thermography for driver state monitoring, a considerable number of scientific works in this research field are available. Most of these publications concern driver drowsiness/fatigue monitoring and emotional state detection. Ebrahimian-Hadikiashari et al. [28] investigated driver drowsiness by analyzing the breathing function, monitored by thermography. The authors observed a significant decrease of driver respiration rate from extreme drowsiness to wakefulness conditions. Moreover, Knapik et al. [29] presented an approach for the evaluation of a driver's fatigue, based on yawn detection using thermal imaging. Zhang et al. [30] demonstrated the feasibility of discriminating emotions (e.g., fear versus no fear) by means of thermal imaging, assessing the forehead temperature as indicative of the emotional dimension of drivers' fear. Focusing on driver stress monitoring, Yamakoshi et al. [31] combined measures from facial skin temperature and hemodynamic variables. The authors observed an increase of sympathetic activity and peripheral vasoconstriction and, hence, a significant decrease in peripheral skin temperature during monotonous driving simulation. Based on differential skin temperatures between peripheral (i.e., nose tip) and truncal parts of the face (i.e., cheeks, jaw, and forehead), they were able to assess an index of a driver's stress. More recently, Pavlidis et al. [32] studied the effects of cognitive, emotional, sensorimotor, and mixed stressors on driver arousal and performance with respect to a baseline in a simulation experiment involving 59 drivers. Perinasal perspiration, revealed by thermal imaging, together with the measure of steering angle and the range of lane departures on the left and right side of the road, showed a more dangerous driving condition in the case of sensorimotor and mixed stressors with respect to the baseline condition.

In this paper, the driver's stress state was assessed by means of IR imaging and supervised machine learning methods. Supervised machine learning approaches belong to the family of artificial intelligence (AI) algorithms and are able to automatically learn functions that map an input to an output based on known input–output pairs (training dataset). The function is inferred from labeled training data and can then be applied to new data (test dataset), allowing the accuracy of the learned function to be evaluated and the level of generalization of the applied model to be understood [33].

On the basis of key features of thermal signals extracted from specific regions of interest (ROIs) indicative of the psycho-physiological state, an estimation of the ECG-derived SI was performed employing a support vector regression with radial basis function kernel (SVR-RBF) [17]. To test the generalization performance of the model, a leave-one-subject-out cross-validation was utilized. After the cross-validation process, a two-level classification of the stress state (STRESS versus NO STRESS) was performed, relying on the estimated SI.

This work describes a novel contactless methodology dedicated to driver stress state detection and classification, constituting a significant improvement to current ADAS technology and, more generally, to road safety.

#### **2. Materials and Methods**

#### *2.1. Participants*

The experimental session involved 10 adults (6 males, age range 22–35, mean 28.4). Before the start of the experimental trials, the participants were adequately informed about the purpose and protocol of the study, and they signed an informed consent form outlining the methods and the purposes of the experimentation in accordance with the Declaration of Helsinki [34].

#### *2.2. Procedure and Data Acquisition*

Prior to testing, each subject was left in the experimental room for 20 min to allow the baseline skin temperature to stabilize. The recording room was set at a standardized temperature (23 °C) and humidity (50–60%) by a thermostat.

To perform the experiment, a static driving simulator was used (Figure 1a). It was composed of a driver's seat, a steering wheel, clutch, brake, and gas pedals, and a gearshift. To display the scenario, three 27-inch monitors were used. The total video resolution for the stimulation was 5760 × 1080 pixels. The distance between the driver and the road screen was approximately 1.5 m. The participants' horizontal view angle was 150 degrees. The simulator could reproduce starter and engine sounds, left and right signal indicators, flashers, and wiper blades. In this study, the engine and starter sounds and the light switches were enabled.

**Figure 1.** Experimental setting for the proposed study: (**a**) driving simulator, lateral view; (**b**) screenshot of the software used for the driving simulation: City Car Driving [35].

Participants sat comfortably on the seat of the driving simulator during both acclimatization and measurement periods.

The software used for the driving simulation was City Car Driving, Home Edition (version 1.5) [35] (Figure 1b). The experimental protocol consisted of a driving simulation lasting 45 min in an urban context. The experimental conditions were set a priori to ensure adverse driving conditions and to guarantee the uniformity of the experimental protocol across subjects. An overview of the experimental settings of the City Car Driving software is reported in Table 1.


**Table 1.** Experimental settings of the driving software City Car Driving.

 

The conditions reported in Table 1 were selected to induce stress in the participants. In particular, the settings associated with traffic and emergency situations guaranteed uncomfortable driving, since the participants frequently experienced non-monotonous situations.

During the execution of the experimental protocol, ECG signals and visible and thermal IR videos were acquired.

The ECG signals were recorded by means of an AD Instruments PowerLab system using the lead configuration determined by the Standard Limb Leads (i.e., electrodes positioned at the right arm (RA), left arm (LA), and left leg (LL)) [36].

Visible and thermal IR videos were acquired by the depth camera Intel RealSense D415 and FLIR Boson 320LW IR thermal camera, respectively. The technical characteristics of the two acquisition devices are summarized in Table 2.

**Table 2.** Technical characteristics of the depth camera Intel RealSense D415 and FLIR Boson 320LW IR thermal camera.  


<sup>1</sup> Horizontal field of view.

For the purpose of this study, the two imaging devices were held together and aligned horizontally. Figure 2 shows the entire imaging acquisition system.

**Figure 2.** Imaging acquisition system: (**a**) depth visible camera and (**b**) thermal camera.

#### *2.3. Analysis of ECG Signals*

The ECG signals were recorded at a rate of 1 kHz. The elapsed time periods between two successive R-peaks of the ECG (RR signals) were extracted with LabChart 7 (ADInstruments) and analyzed with the software Kubios HRV Standard [37]. Baevsky's Stress Index (SI) [18] was evaluated for each subject in consecutive 30 s windows. The SI computed by Kubios is the square root (to make the index normally distributed) of the Baevsky's Stress Index proposed in Reference [18].

Baevsky's SI is calculated based on the distribution of the RR intervals, as reported in Equation (1):

$$SI = \frac{AMo \times 100\%}{2 \times Mo \times MxDMn} \tag{1}$$

where *Mo* is the mode (the most frequent RR interval), *AMo* is the mode amplitude expressed in percent, and *MxDMn* is the variation scope reflecting the degree of RR interval variability.

Values of Baevsky's stress index between 80 and 150 are considered normal [18].
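As a concrete illustration of Equation (1), a minimal Python sketch of the SI computation from one 30 s window of RR intervals is given below; the 50 ms histogram bin width used to estimate the mode is an assumption (common in HRV practice, but not specified above), and `kubios_si` reproduces the square-root scaling reported for the Kubios output.

```python
import numpy as np

def baevsky_stress_index(rr_s, bin_width=0.05):
    """Baevsky's Stress Index from RR intervals (in seconds).

    Assumes a 50 ms histogram bin width for estimating the mode.
    """
    rr_s = np.asarray(rr_s, dtype=float)
    # Histogram of RR intervals with fixed-width bins
    edges = np.arange(rr_s.min(), rr_s.max() + bin_width, bin_width)
    counts, edges = np.histogram(rr_s, bins=edges)
    mode_bin = np.argmax(counts)
    mo = 0.5 * (edges[mode_bin] + edges[mode_bin + 1])   # mode Mo (s)
    amo = 100.0 * counts[mode_bin] / rr_s.size           # mode amplitude AMo (%)
    mxdmn = rr_s.max() - rr_s.min()                      # variation scope MxDMn (s)
    return amo / (2.0 * mo * mxdmn)

def kubios_si(rr_s):
    """Square root of Baevsky's SI, as reported by Kubios."""
    return np.sqrt(baevsky_stress_index(rr_s))
```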

#### *2.4. Analysis of Visible and Thermal Imaging Data*

Visible and IR videos of the subjects' faces were simultaneously recorded during the driving experiment at an acquisition frame rate of 30 Hz and 10 Hz, respectively.

Given the availability of computer vision algorithms for visible videos, in the present study, visible imaging was used as the reference for tracking facial landmark features. The purpose of the visible tracking was to transfer the facial landmark features tracked in the visible domain to the thermal imagery by estimating the geometrical transformation between the two imaging optical devices.

#### 2.4.1. Visible and Thermal Data Co-Registration

The first step of the developed procedure consisted of an optical co-registration between the visible and thermal optics. The co-registration process was a fundamental step of the whole pipeline, since it allowed a proper mapping from one imaging coordinate system to the other.

The optical co-registration relied on procedures implemented in OpenCV [38], and it is described in depth in Reference [39]. A root mean square error (RMSE) value was provided by the co-registration procedure, thus indicating the accuracy in the coordinate transformation from visible to IR imagery at the specific distance of 1 m. 
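The co-registration procedure itself is detailed in Reference [39]; purely as an illustration, a plausible OpenCV-based realisation is sketched below, assuming a planar homography estimated from matched points on a calibration target imaged at roughly 1 m. The function names and the RANSAC threshold are illustrative assumptions, not the authors' implementation.

```python
import cv2
import numpy as np

def estimate_vis_to_ir_homography(pts_visible, pts_thermal):
    """pts_*: (N, 2) arrays of matching 2-D points in each modality."""
    H, _ = cv2.findHomography(np.float32(pts_visible), np.float32(pts_thermal),
                              method=cv2.RANSAC, ransacReprojThreshold=2.0)
    return H

def map_landmarks_to_ir(landmarks_xy, H):
    """Transfer (68, 2) visible-domain landmarks to thermal image coordinates."""
    pts = np.float32(landmarks_xy).reshape(-1, 1, 2)
    return cv2.perspectiveTransform(pts, H).reshape(-1, 2)

def coregistration_rmse(pts_visible, pts_thermal, H):
    """Root mean square reprojection error (pixels), cf. Section 3.1."""
    proj = map_landmarks_to_ir(pts_visible, H)
    err = proj - np.float32(pts_thermal)
    return float(np.sqrt(np.mean(np.sum(err ** 2, axis=1))))
```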

#### 2.4.2. Facial Landmark Detection in the Visible Domain

Visible videos were analyzed through OpenFace [40,41], an open-source software able to perform facial landmark detection, head pose estimation, facial action unit recognition, and eye-gaze estimation. For each frame, a set of 68 facial landmarks was estimated during the experiment. Figure 3 shows the distribution of the 68 facial landmarks. 

**Figure 3.** Schematic representation of the 68 facial landmarks identified by the algorithm implemented in the OpenFace software.

The landmark detector algorithm within OpenFace relied on the constrained local neural fields (CLNF) procedure [42], whereas the face detector algorithm employed a multi-task convolutional cascaded network (MTCNN) approach [43].
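For illustration, a hedged sketch of loading and filtering the per-frame landmark output of OpenFace's FeatureExtraction tool is shown below; it assumes the usual CSV column layout (frame, timestamp, confidence, success, x_0..x_67, y_0..y_67), which may differ slightly between OpenFace releases, and applies the confidence/success filtering described later in Section 3.1.

```python
import numpy as np
import pandas as pd

def load_openface_landmarks(csv_path, min_confidence=0.8):
    """Read the per-frame CSV written by OpenFace's FeatureExtraction tool.

    Frames with success == 0 or confidence below the threshold are dropped,
    mirroring the filtering described in Section 3.1.
    """
    df = pd.read_csv(csv_path)
    df.columns = [c.strip() for c in df.columns]   # OpenFace pads names with spaces
    df = df[(df["success"] == 1) & (df["confidence"] >= min_confidence)]
    x_cols = [f"x_{i}" for i in range(68)]
    y_cols = [f"y_{i}" for i in range(68)]
    landmarks = np.stack([df[x_cols].to_numpy(), df[y_cols].to_numpy()], axis=-1)
    return df["timestamp"].to_numpy(), landmarks   # shapes: (N,), (N, 68, 2)
```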

#### 2.4.3. Thermal Data Extraction and Analysis

The sets of the 68 facial landmarks detected in the visible images were identified in the corresponding frames of the IR imaging by applying the geometrical transformation obtained from the optical co-registration process. Figure 4a,b show an example of the 68 feature landmarks detected on a visible frame and the set of the 68 points identified on the corresponding thermal image.

**Figure 4.** (**a**) Facial landmark identification in the visible image by OpenFace; (**b**) facial landmark identification in the corresponding thermal image, applying the geometrical transformation obtained from the optical co-registration process; (**c**) Regions Of Interest (ROIs) identification (nose tip, right and left nostrils, glabella); (**d**) average thermal signals extracted from the ROIs in an exemplificative time window of 50 s. Note that the breathing signal is clearly appreciable in the average thermal signal plots of the right and left nostrils.

A fundamental aspect for obtaining an accurate co-registration was the temporal synchronization between the visible and IR videos. Since the acquisition frame rate of the thermal videos was lower than that of the visible camera, the corresponding frames within the visible domain were determined according to the specific timestamps of the IR frames. Specifically, among the visible frames acquired around the IR frame timestamp, the one that minimized the temporal difference with the IR frame was chosen. The timestamps of the frames were considered reliable because the videos were acquired on the same PC.
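A minimal sketch of this nearest-timestamp matching, assuming both videos share the clock of the acquisition PC and that timestamps are sorted, could be as follows.

```python
import numpy as np

def match_visible_frames(ir_timestamps, vis_timestamps):
    """For each IR frame timestamp, return the index of the visible frame whose
    timestamp is closest, as described above."""
    ir_timestamps = np.asarray(ir_timestamps, dtype=float)
    vis_timestamps = np.asarray(vis_timestamps, dtype=float)
    idx = np.searchsorted(vis_timestamps, ir_timestamps)
    idx = np.clip(idx, 1, len(vis_timestamps) - 1)
    left, right = vis_timestamps[idx - 1], vis_timestamps[idx]
    choose_left = (ir_timestamps - left) <= (right - ir_timestamps)
    return np.where(choose_left, idx - 1, idx)
```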

For each thermal video, four ROIs were considered and positioned on facial areas of physiological importance (nose tip, right and left nostrils, and glabella) [44]. The ROIs' coordinates were automatically determined from the location of the 68 landmarks. In this way, the initialization of the position of the ROIs was automatically determined (Figure 4c).

With reference to the topographical distribution of the points as represented in Figure 3, the coordinates of the four ROIs were determined, as described in Table 3:


**Table 3.** Geometrical features of the considered ROIs.

<sup>1</sup> C = circle center; d = circle diameter; <sup>2</sup> Pn = n-th landmark; *n* = 1, . . . , 68.

For each ROI, the average value of the pixels was extracted over time (Figure 4d). Regarding the nostrils' ROIs, the average value between ROI 2 and ROI 3 was considered for further statistical analysis, since both are related to the same physiological process (i.e., the breathing function).
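As an illustration, a sketch of the per-frame ROI averaging is given below; the ROI geometry (circle centers and diameters derived from the landmarks) follows Table 3 only conceptually, and the ROI names and the nostril-averaging helper are hypothetical assumptions.

```python
import numpy as np

def circular_roi_mean(thermal_frame, center_xy, diameter):
    """Average temperature inside a circular ROI defined by its center and
    diameter (both derived from the facial landmarks, cf. Table 3)."""
    h, w = thermal_frame.shape
    yy, xx = np.mgrid[0:h, 0:w]
    cx, cy = center_xy
    mask = (xx - cx) ** 2 + (yy - cy) ** 2 <= (diameter / 2.0) ** 2
    return float(thermal_frame[mask].mean())

def frame_roi_signals(thermal_frame, rois):
    """rois: dict name -> (center_xy, diameter); names are hypothetical.
    Returns one value per ROI, with the two nostril ROIs averaged into a
    single breathing-related signal, as done in the paper."""
    vals = {name: circular_roi_mean(thermal_frame, c, d)
            for name, (c, d) in rois.items()}
    vals["nostrils"] = 0.5 * (vals.pop("nostril_right") + vals.pop("nostril_left"))
    return vals
```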

For each of the extracted signals, six representative features were computed over consecutive temporal windows of 30 s:
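A hedged sketch of this windowed feature extraction is given below; the feature set shown (mean, standard deviation, skewness, kurtosis, 90th percentile, and linear slope) is an assumption that includes the features discussed with the regression weights in Section 4, and may not coincide exactly with the six features used by the authors.

```python
import numpy as np
from scipy.stats import kurtosis, skew

def window_features(signal, fs=10.0, win_s=30.0):
    """Summary features over consecutive, non-overlapping 30 s windows of an
    ROI-average thermal signal sampled at the IR frame rate (10 Hz)."""
    n = int(win_s * fs)
    feats = []
    for start in range(0, len(signal) - n + 1, n):
        w = np.asarray(signal[start:start + n], dtype=float)
        slope = np.polyfit(np.arange(n) / fs, w, 1)[0]   # linear trend (assumed feature)
        feats.append([w.mean(), w.std(), skew(w), kurtosis(w),
                      np.percentile(w, 90), slope])
    return np.array(feats)   # shape: (n_windows, 6)
```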


#### 2.4.4. Application of Supervised Machine Learning

Firstly, a machine learning approach was utilized to predict SI relying on features extracted from thermal signals. Specifically, an SVR with RBF kernel was trained on the SI obtained from Kubios through a supervised learning procedure. The SVR-RBF was trained on z-scored data with a fixed nonlinearity exponential parameter γ = 1.

Because of the multivariate (6 regressors) SVR approach, in-sample performance of the procedure did not reliably estimate the out-of-sample performance. The generalization capabilities of the procedure were thus assessed through cross-validation. Specifically, a leave-one-subject-out cross-validation was performed [45]. This cross-validation procedure consisted in leaving one subject (specifically all the samples from the same subject) out of the regression and in estimating the predicted output value on the given subject using the other participants as the training set of the SVR model. This procedure was iterated for all the subjects, and further statistical analyses were performed on the out-of-training-sample estimation of SI from thermal features. Such a metric was labelled SIcross.
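A minimal scikit-learn sketch of this leave-one-subject-out SVR-RBF estimation is given below; the fixed γ = 1 mirrors the value reported above, while the remaining SVR hyperparameters (C, ε) are left at library defaults as an assumption.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

def predict_si_cross(X, si, subject_ids, gamma=1.0):
    """Leave-one-subject-out estimate of the Stress Index (SIcross).

    X: (n_samples, n_features) thermal features; si: ECG-derived SI per sample;
    subject_ids: subject label per sample (defines the left-out groups).
    """
    si_cross = np.empty_like(si, dtype=float)
    for train, test in LeaveOneGroupOut().split(X, si, groups=subject_ids):
        xs, ys = StandardScaler(), StandardScaler()           # z-scoring, as in the paper
        X_tr = xs.fit_transform(X[train])
        y_tr = ys.fit_transform(si[train].reshape(-1, 1)).ravel()
        model = SVR(kernel="rbf", gamma=gamma).fit(X_tr, y_tr)
        pred = model.predict(xs.transform(X[test]))
        si_cross[test] = ys.inverse_transform(pred.reshape(-1, 1)).ravel()
    return si_cross

# Example: r, p = scipy.stats.pearsonr(si, predict_si_cross(X, si, subject_ids))
```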

Although several machine learning approaches could be suited for such a purpose, given the limited number of independent features available and the exploratory nature of the implemented approach, an SVR-RBF followed by a classification procedure was chosen to limit the procedural complexity. In fact, although SVR-RBF is not a sophisticated approach, it ensures performances which are comparable to more complex machine learning techniques [46].

Secondly, SIcross was used to perform a two-level classification of the driver's stress (i.e., STRESS versus NO STRESS). The two classes were defined on the basis of the threshold associated with a stress condition assessed by the SI (i.e., SI > 150 for the stress condition) [18]. Notably, the experimental recordings confirmed the accordance between the SI and the driving conditions. In particular, stressful situations assessed by the SI were associated with adverse events during the driving simulations (e.g., traffic accidents, collisions with pedestrians, sudden car braking).

Since the two classes did not have an equal number of samples, a bootstrap procedure was implemented to test classification performance on balanced classes [47]. The performances of the classification were evaluated by means of receiver operating characteristic (ROC) analysis [48].
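As an illustration, a hedged sketch of a balanced-class bootstrap evaluation of SIcross as a classification score is given below; the exact resampling scheme used by the authors is not detailed here, so the subsampling strategy is an assumption.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def balanced_bootstrap_auc(si_cross, stress_labels, n_iter=10000, seed=0):
    """Bootstrap estimate of the AUC on balanced classes.

    At each iteration both classes are resampled (with replacement) to the
    size of the minority class and the AUC of SIcross as a score is computed.
    """
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(stress_labels == 1)   # STRESS
    neg = np.flatnonzero(stress_labels == 0)   # NO STRESS
    n = min(len(pos), len(neg))
    aucs = []
    for _ in range(n_iter):
        idx = np.concatenate([rng.choice(pos, n, replace=True),
                              rng.choice(neg, n, replace=True)])
        aucs.append(roc_auc_score(stress_labels[idx], si_cross[idx]))
    return np.mean(aucs), np.std(aucs)
```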

Figure 5 reports the flow chart relating to the described machine learning approach.

**Figure 5.** Flow chart of the applied machine learning approach: the thermal features are used as predictors, whilst the Electrocardiogram (ECG)-derived Stress Index (SI) is considered as the regression output. Support Vector Regression with Radial Basis Function kernel (SVR-RBF) was used as the regressor. A leave-one-subject-out cross-validation was employed to test the generalization of the regression. The result of the regression (SIcross) was then used to perform a two-level classification of the driver's stress (i.e., STRESS versus NO STRESS). Since the two classes were not balanced, a bootstrap procedure was implemented. Receiver Operating Characteristic (ROC) analysis was executed to investigate the performance of the classifier.

#### **3. Results**

#### *3.1. Visible and Thermal Imaging Co-Registration and Processing*

The spatial RMSE of the optical co-registration was 0.66 ± 0.25 pixels, thus indicating that the error of the coordinate transformation from visible to IR imagery at the specific distance of one meter was less than one pixel.

The percentage of correctly identified landmarks over the total number of considered frames and the confidence value in correctly classifying a face are reported in Table 4 for each subject. These parameters were returned by the software OpenFace [40]. The confidence value ranged from 0 (total misclassification of the face) to 1 (correct face classification), and it was the result of a landmark detection validation process. In detail, to avoid tracking drift over time, it was necessary to determine whether landmark detection succeeded during video processing. The landmark detection validation was performed by transforming the area surrounded by the landmarks to a pre-defined reference shape. The vectorized resulting image was then used as a feature vector for a classifier acting as the validator (i.e., as the input of the classifier). To train the classifier on the vectorized reference warp, positive and negative landmark detection examples were considered. The positive samples were ground truth landmark labels, whereas the negative samples were generated from the ground truth labels by applying offset and scale transformations. The classifier employed in OpenFace is an SVM [49].


**Table 4.** Indices of performance of landmark identification and face classification on visible imaging.

On average, 94.66% of the video frames were correctly processed, whereas the average confidence index for face classification was 0.90.

Note that, for subjects 3 and 5, the average success and confidence scores were lower than those of the other subjects, owing to the poor lighting conditions of those acquisitions (Table 4). However, for all the subjects, only the frames with high confidence and success scores were considered for further analysis (i.e., success index > 90%, confidence value > 0.8). This ensured that there was no impact on the identification of the ROIs and, consequently, on the estimation of their features.

Finally, the average execution time of the developed algorithm was 0.09 s/frame with MATLAB 2016b© (64-bit Windows 7 Pro, Service Pack 1; Intel(R) Core(TM) i5 CPU; 8.00 GB RAM).

#### *3.2. Performances of Supervised Machine Learning Approach*

Across subjects, 849 samples were available for the regression analysis. A significant correlation between the SI and the predicted SI (SIcross) was obtained (*r* = 0.61, *p* = ~0) (Figure 6a), demonstrating a good performance of the multivariate analysis [50]. The weights associated with each z-scored regressor for each ROI are shown in Figure 6b. Considering that both the regressors and the SI were normalized, the values of the weights were indicative of the contribution of each model input to the estimation of the SI.

Since the two classes had a different number of samples (125 samples for the STRESS condition versus 696 samples for the NO STRESS condition), a bootstrap procedure was implemented to provide classification estimates on balanced classes [47].

**Figure 6.** (**a**) Correlation plot between SI and SIcross. The equation of the interpolating line is reported in the top left section of the graph. A good performance of the multivariate analysis is revealed by the correlation score (*r* = 0.61, *p* = ~0); (**b**) weights associated with each z-scored regressor for each ROI. The weights are indicative of the contribution of each model input to the estimation of the SI.

Figure 7a reports the average ROC curve across iterations (bootstrap performed for *n* = 10,000 iterations). The average area under the curve (AUC) was 0.80, with a standard deviation of 0.01. The distribution of the AUC obtained after the bootstrap is reported in Figure 7b.

**Figure 7.** Results after the bootstrap procedure (n\_iterations = 10,000). (**a**) Average ROC curve across iterations and (**b**) distribution of the Area Under the Curve (AUC) obtained after the bootstrap procedure. The average AUC was 0.80 with a standard deviation of 0.01.

By choosing a specific threshold for SIcross, a sensitivity of 77% and a specificity of 78% were obtained as reported in the confusion matrix (Table 5).
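For completeness, the sensitivity and specificity at a chosen SIcross threshold (the specific operating value used by the authors is not reported here) can be computed as in the following sketch.

```python
import numpy as np

def sensitivity_specificity(si_cross, stress_labels, threshold):
    """Two-level classification of SIcross at a given (assumed) threshold."""
    pred = si_cross >= threshold
    tp = np.sum(pred & (stress_labels == 1))
    fn = np.sum(~pred & (stress_labels == 1))
    tn = np.sum(~pred & (stress_labels == 0))
    fp = np.sum(pred & (stress_labels == 0))
    return tp / (tp + fn), tn / (tn + fp)   # sensitivity, specificity
```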


**Table 5.** Confusion matrix of the classification procedure.

#### **4. Discussion**

In this study, a novel method for driver stress evaluation based on thermal IR imaging and supervised machine learning approaches was described. Thermal IR imaging and ECG were acquired from ten subjects while they performed an experiment on a driving simulator using the software City Car Driving v.1.5 [35]. The experimental session consisted of 45 min of urban-context driving with pre-established weather and traffic conditions. Electrocardiography (ECG) signals were used to infer the stress condition of the drivers. Among the variety of indices derived from the ECG signals, the Stress Index (SI) was considered [18]. In this study, the SI was evaluated in consecutive 30 s time windows by the software Kubios [37]. In the same temporal windows, six representative features of the average thermal signals from four ROIs (i.e., nose tip, left and right nostrils, glabella) were extracted. The thermal signals were automatically determined by a real-time tracking procedure. The tracking relied on state-of-the-art computer vision algorithms applied to the visible images and on the optical co-registration between the visible and thermal imaging devices, ensuring high performance in signal processing and speed of extraction. Indeed, the high performance was highlighted by the percentage of correctly processed frames, which reached an average of 94.66%, by the confidence index for face classification, which was 0.90, and by the average processing time, which was only 0.09 s/frame.

A multivariate machine learning approach based on Support Vector Regression (SVR) with a Radial Basis Function (RBF) kernel was employed to estimate the ECG-based SI through specific thermal features extracted from facial ROIs. Those ROIs were chosen on the basis of their physiological importance for stress detection [44]. A total of 18 thermal features (six features for each ROI) were computed and used as predictors, while the SI, evaluated through the ECG signals, was considered as the regression output. A leave-one-subject-out cross-validation was employed to test the generalization of the regression. This procedure was iterated for all the subjects, and further statistical analyses were performed on the out-of-training-sample estimation of the SI. Such a metric was labeled SIcross. The correlation between SI and SIcross was *r* = 0.61 (*p* = ~0), thus indicating a good estimation of the SI through the considered thermal features (Figure 6a). A feature-based analysis was performed to investigate the relevance of each feature (Figure 6b).

Concerning the nose tip region, the features contributing most to the SI estimation were the kurtosis, the skewness, and the standard deviation. The weights associated with the kurtosis and skewness had negative values, thus indicating an inverse relation between the SI and these features' trends. The opposite trend was observed for the weight associated with the standard deviation. This pattern seems to be correlated with the sweating or vasoconstriction phenomena occurring with increasing stress [51,52]. In fact, an increase in the standard deviation and a reduction of the kurtosis and skewness parameters (i.e., the flatness and asymmetry of the distribution of the related signal [53]) can be associated with a decrease of the uniformity of the signal, thus indicating the presence, for instance, of "cold spots" typically present during sweating and vasoconstriction processes [19]. Concerning the nostrils region, instead, a strong inverse relation between the weight associated with the standard deviation and the SI was found. Since the thermal signals from the nostrils are highly related to the breathing function, a lower signal variation (revealed by a decrease in the standard deviation) could be associated with a high breathing rate [54]. This result is in accordance with the findings from References [55,56], where it was shown that stress is associated with an increased respiratory rate. Finally, referring to the glabella region, the most relevant feature for the SI estimation was the 90th percentile, i.e., the value below which 90% of the data falls. The weight associated with the 90th percentile was directly related to the SI, thus indicating that an increase in the temperature of the glabella could be indicative of a stress condition. This result is in accordance with the findings reported in Reference [57], in which an increase in forehead temperature was associated with the execution of highly difficult tasks.

It should be noted that, when using non-linear regressors, the contribution of each feature in predicting the output does not depend only on its relative weight but also on the non-linearities of the model. Nonetheless, the SVR-RBF employed a single parameter depicting the extent of non-linearity for all the features considered. Thus, although not directly regressing the input against the output, the weights of the SVR-RBF were still associated with the importance of each regressor.

The SIcross was then used to perform a two-level classification of the driver's stress (i.e., STRESS versus NO STRESS). The two classes were defined on the basis of the threshold associated with a stress condition assessed by Baevsky's SI [18]. Since the two classes were not balanced, a bootstrap procedure was implemented [47]. The ROC analysis showed a good performance of the classifier, with an average AUC of 0.80 (Figure 7b), a sensitivity of 77%, and a specificity of 78% (Table 5).

It is worth noting that the cross-validation and the bootstrap procedures provided the generalization performances of the model, testing its applicability to a wide cohort of drivers. In fact, although stress conditions could elicit different physiological responses among subjects, for ADAS applications, it could be more relevant to detect stress conditions across participants, rather than focusing on a single subject's stress level.

The main benefit of the developed method with respect to the available literature is the use of supervised machine learning approaches based only on thermal features, without accounting for vehicle-related or driver-behavior-related parameters, while reaching performances comparable with those of more complex approaches [27]. Furthermore, the developed method opens the way to an efficient real-time implementation of drivers' stress state monitoring relying only on thermal IR imaging, since the model is already validated and ready to use.

Nonetheless, further studies should be performed to increase the sample size. The machine learning approach used in this study relied on supervised learning, which is inherently a data-driven analysis; data-driven analyses are highly affected by the sample size, and the performance of the model could indeed improve by reducing a possible overfitting effect driven by the limited number of samples. It should also be noted that the present study focused on drivers within a limited age range (i.e., 22–35 years old), involving only young subjects. The most important improvement of the method will be to include people with a wider age range in the study sample. In future studies, beyond increasing the sample size and age range, other factors, such as gender, thermal comfort, and weather conditions during simulated driving sessions, will be considered [58–60]. In fact, taking these factors into account could be of fundamental value in driving stress research, providing a wider overview of all the aspects concerning the matter of the study.

Furthermore, the present results are relative to simulated driving conditions in which variables that affect IR measurements, such as direct ventilation or sunlight, were not considered. Thus, it would be desirable to also apply the developed methodology to real driving situations, to generalize the applicability of the technique.

With respect to the state of the art, this is an original and novel study concerning drivers' stress state evaluation by means of thermal imaging employing supervised machine learning algorithms. This is a preliminary study, addressing limited and specific experimental conditions, which, however, underlines the feasibility of the method, to be verified under wider operating situations.

#### **5. Conclusions**

In the present work, a novel and original method allowing for drivers' stress state evaluation was presented. By using machine learning approaches, it was possible to understand and classify, with a good level of accuracy, the stress state of the subjects while driving in a simulated environment. The presented work constitutes the first step towards the establishment of a reliable detection of the stress state in a non-invasive fashion, ensuring that ecological conditions are maintained during the measurements.

**Author Contributions:** Conceptualization, D.C., D.P., C.F., A.M.C., A.M.; methodology, D.C., D.P.; software, D.C., D.P., E.S.; validation, D.P.; formal analysis, D.C., D.P.; investigation, D.C., D.P., C.F., A.M.C., L.M.; writing—original draft preparation, D.C., C.F., D.P.; writing—review and editing, A.M.C., A.M.; supervision, A.M.; project administration, A.M.; funding acquisition, A.M. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the grants: PON FESR MIUR R&I 2014-2020-ADAS+, grant number ARS01\_00459 and ECSEL Joint Undertaking (JU) European Union's Horizon 2020 Heliaus, grant number 826131.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Driver Facial Expression Analysis Using LFA-CRNN-Based Feature Extraction for Health-Risk Decisions**

#### **Chang-Min Kim <sup>1</sup> , Ellen J. Hong <sup>2</sup> , Kyungyong Chung <sup>3</sup> and Roy C. Park 4,\***


Received: 26 February 2020; Accepted: 22 April 2020; Published: 24 April 2020

**Abstract:** As people communicate with each other, they use gestures and facial expressions as a means to convey and understand emotional state. Non-verbal means of communication are essential to understanding, based on external clues to a person's emotional state. Recently, active studies have been conducted on the lifecare service of analyzing users' facial expressions. Yet, rather than a service necessary for everyday life, the service is currently provided only for health care centers or certain medical institutions. It is necessary to conduct studies to prevent accidents that suddenly occur in everyday life and to cope with emergencies. Thus, we propose facial expression analysis using line-segment feature analysis-convolutional recurrent neural network (LFA-CRNN) feature extraction for health-risk assessments of drivers. The purpose of such an analysis is to manage and monitor patients with chronic diseases who are rapidly increasing in number. To prevent automobile accidents and to respond to emergency situations due to acute diseases, we propose a service that monitors a driver's facial expressions to assess health risks and alert the driver to risk-related matters while driving. To identify health risks, deep learning technology is used to recognize expressions of pain and to determine if a person is in pain while driving. Since the amount of input-image data is large, analyzing facial expressions accurately is difficult for a process with limited resources while providing the service on a real-time basis. Accordingly, a line-segment feature analysis algorithm is proposed to reduce the amount of data, and the LFA-CRNN model was designed for this purpose. Through this model, the severity of a driver's pain is classified into one of nine types. The LFA-CRNN model consists of one convolution layer that is reshaped and delivered into two bidirectional gated recurrent unit layers. Finally, biometric data are classified through softmax. In addition, to evaluate the performance of LFA-CRNN, the performance was compared through the CRNN and AlexNet Models based on the University of Northern British Columbia and McMaster University (UNBC-McMaster) database.

**Keywords:** facial expression analysis; line segment feature analysis; dimensionality reduction; convolutional recurrent neural network; driver health risk

#### **1. Introduction**

In our lives: emotion is an essential means to deliver information among people. Emotional expressions can be classified in one of two ways: verbal (the spoken and written word) and non-verbal (gestures, facial expressions, etc.) [1,2]. People communicate with others every day, and in such a process, facial expressions account for a significantly high proportion of meaning. Expressions can be used to accurately understand another person's emotional state, and since the perception of the emotion in facial expressions is an essential factor of social cognition, it could be seen as playing an essential role in diverse areas of life [3,4]. As described, facial expressions can be used to understand and empathize with other people's emotions. A number of services and prediction models for analyzing such expressions and understanding users' emotional states have been, and are being, studied [5,6]. Recently studied are lifecare services using gesture recognition or expression analysis to detect the risks to which the elderly and patients with chronic diseases are exposed. Currently, the whole world is entering an aging society, and accordingly, the number of patients with chronic diseases (hypertension, cardiovascular diseases, and coronary artery disease, etc.) is increasing. In addition, even in the low age group, the prevalence of chronic diseases increases due to changes in dietary habits (food with high calories and high sugar content, etc.), lack of exercise, and smoking, etc. [7,8]. From one generation to the next, humankind will require services to continuously monitor and manage chronic diseases. Unless medical service technology makes innovative progress, the demand for such services will continue to increase. There are many cases where the elderly or patients with diseases experience emergencies, major accidents, or death from disease-related reasons such as acute shock. In particular, since car accidents frequently occur due to acute diseases while someone is driving, it is necessary to take urgent action to prevent them [9–11]. When such accidents occur in the absence of a fellow passenger, the driver is unable to take prompt action. Moreover, as autonomous vehicles become more popular, they can operate on cruise control regardless of the status of the driver. In such cases, even after a driver's abnormal health status is detected, he or she might not be able to do something within the so-called "golden time" to address the problem. Although the mortality of patients with chronic diseases has decreased due to medical progress, it is still necessary to continuously manage and prepare for emergency situations [12,13]. To that end, services are being studied that alert friends, hospitals, police stations, etc., after detecting a driver at risk. However, since such studies involved prediction models based on external factor analysis, a driver's potential risk factors have not been applied, and it is difficult to predict accidents that occur due to internal factors. For prediction services in which internal factors apply, accurate prediction requires a particular device to be installed in the vehicle, and a given code of conduct is to be followed. As for the possibility of checking the driver's internal risk factors intuitively, it is possible to judge a dangerous situation through the driver's facial expressions. Thus, this study was conducted for the prediction of risk through recognition of the driver's facial expressions. 
Concerning current recognition of facial expressions, various studies are in progress [14]. As for the traditional face recognition techniques, classification models using the extraction of handcrafted features (local binary pattern (LBP), histogram of gradients (HOG), gabor, scale-invariant feature transform (SIFT)) for the sensible extraction of the characteristics of face images have usually been used [15–17]. These methods have a problem, however, in which the performance deteriorates when there are various changes in the face in an actual environment. It is difficult to choose the appropriate feature parameters according to the field of application. Transformation can be made in various shapes, but it is necessary to determine the optimal feature parameters via experiential elements and various experiments, which is a problem. Recently, the deep-learning technique has widely been used. As the face recognition technique based on deep learning itself learns high-level characteristics, using a large amount of data built up in various environments, it shows a high recognition performance even in a wild environment. Accordingly, DeepFace (based on AlexNet) uses the locally connected convolution layer to effectively extract local characteristics from the face region [18]. DeepID proposed a lightweight convolutional neural network (CNN)-based face recognizer, using an input resolution with smaller pixels than DeepFace [19]. VGGFace, which appeared later, learned a deep network structure consisting of 15 convolution layers using a data set for high-capacity face recognition made by itself through an Internet search [20]. In addition, various studies were conducted to improve the performance of face recognition models such as DeepID2, DeepID3, and GoogLeNet [21–23]. As yet,
if real-time video data are processed by a face recognition technique based on deep learning, it is necessary to learn many classes. Thus, they mostly show the structure in which a fully connected layer becomes larger, which accordingly decreases the batch size and acts as a factor disturbing convergence in the learning by the neural network. Accordingly, in this paper, to resolve such problems, facial expression analysis of drivers by using line-segment feature analysis-convolutional recurrent neural network (LFA-CRNN) feature extraction for health-risk assessment is proposed. A service using facial expression information to analyze drivers' health risks and alert them to risk-related matters is proposed. Drivers' real-time streaming images, along with deep learning-based pain expression recognition, were utilized to determine whether or not drivers are suffering from pain. When analyzing real-time streaming images, it may be difficult to extract accurate facial expression features if the image is shaking, and it may be difficult or impossible to run the analysis process on a real-time basis due to limited resources. Accordingly, a line-segment feature analysis (LFA) algorithm reduces learning and assessment time by reducing data dimensionality (the number of pixels). Also proposed is increasing the processing speed to handle large-capacity original data and high resolutions. Drivers' facial expressions are recognized through the CRNN model, which is designed to reduce input data dimensionality and to learn the LFA data. The driver's condition is understood based on the University of Northern British Columbia and McMaster University (UNBC-McMaster) database to understand the driver's abnormal condition. A service is proposed for coping with risks, spreading the dangerous conditions concerning health risks that may occur while driving through the notice by understanding the driver's conditions as suffering and non-suffering conditions.
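As a purely illustrative sketch of the topology summarized in the abstract (one convolution layer, reshaped and fed into two bidirectional GRU layers, followed by a softmax over nine pain levels), a Keras version is given below; the filter count, kernel size, recurrent units, and the input resolution of the LFA data are not specified at this point in the text and are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_lfa_crnn(input_shape=(128, 128, 1), n_classes=9):
    """Minimal sketch of the LFA-CRNN topology described in the abstract."""
    inp = layers.Input(shape=input_shape)
    # Single convolution layer over the LFA input
    x = layers.Conv2D(32, (3, 3), padding="same", activation="relu")(inp)
    # Reshape the feature map into a sequence: rows as time steps,
    # columns x channels as per-step features.
    h, w, c = x.shape[1], x.shape[2], x.shape[3]
    x = layers.Reshape((h, w * c))(x)
    # Two bidirectional GRU layers
    x = layers.Bidirectional(layers.GRU(64, return_sequences=True))(x)
    x = layers.Bidirectional(layers.GRU(64))(x)
    # Softmax over the nine pain-severity classes
    out = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inp, out)
```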

This study is organized as follows. Section 2 presents the trends in face analysis research and also describes the current risk-prediction systems and services using deep learning. Section 3 describes how the dimensionality-reducing LFA technique proposed in this paper is applied to the data generation process, and also presents the CRNN model designed for LFA data learning. Section 4 describes how the UNBC-McMaster database was used to conduct a performance test.

#### **2. Related Research**

#### *2.1. Face Analysis Research Trends*

In early facial expression analysis, various studies were conducted based on a local binary pattern (LBP). LBP is widely used in the field of image recognition thanks to its ability to recognize things, its strength against changes in lighting, and its ease of calculation. As LBP became widely used in face recognition, center-symmetric LBP (CS-LBP) [24] was used in a modified form that can show components in the diagonal direction, reducing the dimension of feature vectors. Also, some studies enhanced the accuracy of facial expression detection by using multi-scale LBP that multiplied the size of the radius and the angle [25,26]. However, the LBP technique is used with techniques for extracting feature vectors in order to increase accuracy. In this case, based on the field of the application, there is difficulty in choosing the appropriate feature vectors. Transformation in various forms is possible, but the optimal feature vector should be decided by experiential elements and from various experiments. If the LFA proposed in this study is used, the minimum necessary data are used when the face is analyzed, so data compression takes place autonomously. Also, since it can be performed through techniques for detecting the face and its outline, it can easily be used in various fields. Studies of face analysis based on point-based features utilizing landmarks are also in progress. Landmark-based face extraction has a very fast process of measuring and restoring the landmark, so it can immediately display changes in the face shape and facial expressions filmed in real time. The weight of the measured landmark can be lightened for uses and purposes, such as character and avatar. Jabon et al. (2010) [27] proposed a prediction model that could prevent traffic accidents by recognizing drivers' facial expressions and gestures. This prediction model generates 22 x and y coordinates on the face (eyes, nose, mouth, etc.) in order to extract facial characteristics and head movements, and it automatically detects movement. It synchronizes the extracted data with simulator data, uses them as input to the classifier, and calculates
a prediction for accidents. Also, Agbolade et al. (2019) [28] and Park (2017) [29] conducted studies to detect the face region based on multiple points, utilizing landmarks to increase the accuracy of face extraction. However, to prevent prediction of the landmark value from falling to the local minimum, it is necessary to pass through a process of correcting the result through plural networks based on the initial prediction value in cascade form. The difficulty in detection differs depending on the set value of the feature point of the face. The more subdivided the overall detected outline, the more difficult it gets. Also, if part of the face is covered, it becomes very hard to measure landmarks. If the LFA proposed in this study is used, it is somewhat possible to escape the impact of light, since only information about the segments is used. Also, there is no increase in the difficulty of detection.

Since the deep learning method shows high performance, studies based on CNNs and deep neural networks (DNNs) are actively conducted. Wang et al. (2019) [30] proposed a method for recognizing facial expressions by combining extracted characteristics with the C4.5 classifier. Since some problems still existed (e.g., overfitting of a single classifier, and a vulnerable generalization ability), ensemble learning was applied to the decision-making tree algorithm to increase classification accuracy. Jeong et al. (2018) [31] detected face landmarks through a facial expression recognition (FER) technique proposed for face analysis, and extracted geometric feature vectors considering the spatial position between landmarks. By implementing the feature vectors on a proposed hierarchical weighted random forest classifier in order to classify facial expressions, the accuracy of facial recognition increased. Ra et al. (2018) [32] proposed a deep learning structure in a block method to enhance the face recognition rate. Unlike the existing method, feature filter coefficients and the weighted values of the neural network (on the softmax layer and the convolution layer) are learned using a backpropagation algorithm. Performing recognition with the deep learning model that learned the selected block region, the result of face recognition is drawn from an efficient block with a high feature value. However, since the face recognition technique based on CNNs and DNNs should generally learn a large amount of classes, there is a structure in which the fully connected layer grows bigger. Accordingly, the structure acts as a factor reducing the batch size and disturbing convergence in the learning by a neural network. If the LFA proposed in this study is used, the input dimension is small. Thus, the disturbance in the convergence from learning (due to the decrease in the batch size that may be generated in the CNN and DNN) can be minimized.

#### *2.2. Facial Expression Analysis and Emotion-Based Services*

FaceReader automatically analyzes 500 features on a face from images, videos, and streaming videos that include facial expressions, and it analyzes seven basic emotions: neutrality, happiness, sadness, anger, amazement, fear, and disgust. It also analyzes the degree of the emotions, such as the arousal (active vs. passive) and the valence (positive vs. negative) online and offline. Research on emotions through analyzing facial expressions has been conducted in various research fields, including consumer behavior, educational methodology, psychology, consulting and counseling, and medicine for more than 10 years. It is widely used in more than 700 colleges, research institutes, and companies around the world [33]. The facial expression-based and bio-signal-based lifecare service provided by Neighbor System Co. Ltd. in Korea is an accident-prevention system dedicated to protecting the elderly who live alone and who have no close friends or family members. The services provided by this system include user location information, health information confirmation, and integrated situation monitoring [34]. Figure 1 shows the facial expression-based and bio-signal-based lifecare service, which consists of four main functions for safety, health, the home, and emergencies.

The safety function provides help/rescue services through tracing/managing the users' location information, tracing their travel routes, and detecting any deviations from them. The health function measures/records body temperature, heart rate, and physical activity level, and monitors health status. In addition, it determines whether or not an unexpected situation is actually an emergency by using facial expression analysis, and provides services applicable to the situation. The home function provides a service dedicated to detecting long-term non-movement and to preventing intrusions
by using closed-circuit television (CCTV) installed within the users' residential space. Lastly, the emergency function constructs a system with connections to various organizations that can respond to any situation promptly, as well as deliver users' health history records to the involved organizations.

**Figure 1.** Facial expression and bio-signal-based lifecare service.

#### **3. Driver Health-Risk Analysis Using Facial Expression Recognition-Based LFA-CRNN**

It is necessary to compensate for senior drivers' weakened physical, perceptual, and decision-making abilities. It is also necessary to prevent secondary accidents, manage their health status, and take prompt action by predicting any potential traffic-accident risk, health risk, and risky behavior that might show up while driving. In cases where a senior driver's health status worsens due to a chronic disease, it becomes possible to recognize accident risks through facial expression changes. Accordingly, we propose resolving such issues with facial expression analysis using LFA-CRNN-based feature extraction for health-risk assessment of drivers. The LFA algorithm was performed to extract the characteristics of the driver's facial image in real time in the transportation support platform. An improved CRNN model is proposed, which can recognize the driver's face through the data calculated by this algorithm. Figure 2 shows the LFA-CRNN-based driving facial expression analysis for assessing driver health risks.

**Figure 2.** Line-segment feature analysis-convolutional recurrent neural network (LFA-CRNN)-based facial expression analysis for driver health risk assessment.

The procedures for recognizing and processing a driver's facial expressions can be divided into detection, dimensionality reduction, and learning. The detection process is the step of extracting the core areas (the eyes, nose, and mouth) needed to analyze the driver's suffering condition. In this step, a pre-processing stage addresses the problem of the core areas not being accurately recognized. To extract features from the main areas of frame-type facial images segmented from real-time streaming images, multiple AdaBoost-based input images are divided into blocks. In the dimensionality reduction process, the LFA algorithm reduces the learning and reasoning time by reducing data dimensionality (the number of pixels) in order to increase processing speeds when handling large-capacity original data. High-resolution data and the dimensionality of the input data are reduced. Lastly, in the learning process, drivers' facial expressions are recognized through the CRNN model designed to learn the LFA data. In addition, to confirm a driver's abnormal status based on the UNBC-McMaster shoulder pain expression database, the proposed service determines if the driver is in pain, identifies the driver's health-related risks, and alerts the driver to such risks through alarms.

#### *3.1. Real-Time Stream Image Data Pre-Processing for Facial Expression Recognition-Based Health Risk Extraction*

Because pre-existing deep-learning models utilize the overall facial image for facial recognition, areas such as the eyes, nose, and lips, which serve as the main factors for analyzing drivers' emotions and pain status, are not accurately recognized. Accordingly, through a detection process module, pre-processing is conducted for dimensionality reduction and learning. To analyze the original data transferred through real-time streaming, input images are segmented at 85 fps, and to increase the recognition rate, the particular facial image sections required for facial expression recognition are extracted using the multi-block method [35]. In particular, in cases where a multi-block is too big or too small during the blocking process, pre-existing models are unable to accurately extract features from the main areas, and this causes significant errors relating to recognition and learning. To resolve such issues, multiple AdaBoost is utilized to set optimized blocking, and then sampling is conducted. Figure 3 shows the process of detecting particular facial areas. A Haar-based cascade classifier is used to detect the face; Haar-like features are selected to accurately extract the user's facial features, and the AdaBoost algorithm is used for training. At this point, since features can be seen as a face/background-dividing characteristic and as a classifier, each feature is defined as a base classifier or a weak classifier candidate. During iterations, the training samples select the one feature demonstrating the best classification performance, and the selected feature is used as the weak classifier in that iteration. The final weak classifiers are combined in a weighted linear combination to acquire the final strong classifier.

**Figure 3.** The multiple AdaBoost-based particular facial area detection process.

In the formula in Figure 3, *E*(*x*) is the strong classifier finally found, *e* is the weak classifier drawn in the learning process, *a* is the weighted value for the weak classifier, and *T* is the number of iterations. In this process, it is very hard to normalize the face if it is extracted without information such as its rotation and position. By extracting the geometrical information of the face, it becomes possible to normalize the face consistently. Faces can be classified according to their rotational positions, and if random images do not provide such information in advance, the rotational information must be detected during image retrieval. The detectors learned through multiple AdaBoost are serialized, using the simple pattern of the face searcher. Using the serialized detectors, information such as the position, size, and rotation of the face can be found. As for the simple pattern used in multiple AdaBoost learning, the pattern in its basic form was used. The number of simple detectors to be found by AdaBoost learning was set to 160, and the processing speed of the learned detectors was improved through serialization. The outline of the detected face region is then extracted with the Canny technique. This was the best option based on our experimental results: several outline detection techniques were tested in the early stages, but only the Canny method produced sufficiently accurate results.
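As an illustration of this detection step, the following is a minimal sketch using OpenCV's pre-trained Haar cascade (an AdaBoost-trained classifier) followed by Canny edge extraction; the cascade file, normalized face size, and Canny thresholds are illustrative assumptions rather than the exact settings used in this work.

```
# Minimal sketch of the detection step with OpenCV: a pre-trained Haar
# cascade (an AdaBoost-trained classifier) finds the face, the crop is
# normalized to 160x160, and the Canny operator extracts the contour lines.
# The cascade file, face size, and Canny thresholds are illustrative choices.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_face_contour(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]                                 # first detected face
    face = cv2.resize(gray[y:y + h, x:x + w], (160, 160))
    return cv2.Canny(face, threshold1=100, threshold2=200)
```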

#### *3.2. Line-Segment Feature Analysis (LFA) Algorithm for Real-Time Stream Image Analysis Load Reduction*

#### 3.2.1. Pain Feature Extraction through LFA

Even after executing facial feature extraction through the procedures specified in Section 3.1, various constraint conditions may arise when extracting a driver's facial features from real-time driving images. In analyzing a real-time streaming image, it may be hard to extract accurate facial characteristics due to motion in the image. Accordingly, since it is necessary to reduce the dimensionality of facial feature images extracted from real-time streaming images, the LFA algorithm is proposed. The proposed LFA algorithm is a dimensionality-reduction process that reduces learning and reasoning time by reducing data dimensionality (the number of pixels) to increase the processing speed when handling the original large-capacity, high-resolution data. To extract information from images, line information is extracted with a filter obtained by modifying the parameters of a 3 × 3 Laplacian mask, a one-dimensional (1D) vector is created, and the created vector is utilized as the learning model's input data. Based on such a process, this algorithm creates new data through the line-segment features. LFA uses the driver's facial contour lines calculated through the detection process to examine and classify line-segment types. To examine the line-segment types, a filter, *f*, with the elements {1, 2, 4, 8} is used. Figure 4 shows the process where a driver's facial-contour line data are segmented, and the line-segment types are examined through the use of *f*.

#### **Algorithm 1** Image Division Algorithm

**Figure 4.** Driver facial contour-line data segmentation and line-segment type examination using *f*: (**a**) Division of image (LFA data 16 pieces); (**b**) Max-pooling and reshape.

Figure 4 shows the first LFA process. The contour line image calculated through pre-processing (detection) had a size of 160 × 160, and this image was segmented into 16 parts, as shown in Figure 4a. This process is calculated as shown in Algorithm 1. The segmented parts have a size of 40 × 40, and the segments are arranged in a way that does not modify the structure of the original image. These segments are max-pooled via the calculation shown in Figure 4b, and the arrangement of the segments is adjusted. This process is defined in Equation (1):

$$\begin{aligned} D_w, D_h &= 4, 4, \quad P_w = \frac{W}{D_w}, \quad P_h = \frac{H}{D_h}, \\ P[n, m] &= x[\,n \cdot P_w : (n+1) \cdot P_w,\; m \cdot P_h : (m+1) \cdot P_h\,], \quad (0 \le n < 4,\; 0 \le m < 4), \\ MP[n, m] &= \text{max-pooling}(P[n, m]), \end{aligned} \tag{1}$$

Equation (1) is a calculation where the contour line image obtained during pre-processing is divided into 16 equal segments, and the divided segments are max-pooled. *D<sub>w</sub>* and *D<sub>h</sub>* denote the number of segments in the width and height, respectively, and *P<sub>w</sub>* and *P<sub>h</sub>* denote the size of the segmented data obtained by dividing the contour line image by *D<sub>w</sub>* and by *D<sub>h</sub>*, respectively. *P* indicates the space for memorizing the segmented data, and the segmentation position is maintained through *P*[n, m], in which n and m refer to a two-dimensional array index ranging from 0 to 3. *MP* memorizes the segmented data's max-pooling results. In every process, the sequence of the segmented images must not be lost, and the sequence of the re-arranged segments must not be lost either. Figure 4b shows the calculation where a convolution between the segment images and the filter is computed: the parameters of the segmented images are converted, the sum of the parameters is calculated, and one-dimensional vector data are generated. The number of segments and the size of the images in this process were selected empirically; after experimenting with various conditions, the optimal values were determined.
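For concreteness, a minimal NumPy sketch of the division and pooling in Equation (1) and Algorithm 1 is given below; it assumes a 160 × 160 contour image and a 2 × 2 pooling window, which yields the 20 × 20 segment size referred to in Section 3.2.

```
# Minimal NumPy sketch of Equation (1) / Algorithm 1: the 160x160 contour
# image is divided into 16 segments of 40x40 and each segment is max-pooled.
# A 2x2 pooling window is assumed here; it yields the 20x20 segment size
# that Section 3.2 refers to.
import numpy as np

def divide_and_pool(contour_img, d_w=4, d_h=4, pool=2):
    H, W = contour_img.shape                  # expected 160 x 160
    p_w, p_h = W // d_w, H // d_h             # 40 x 40 segments
    segments = []
    for n in range(d_w):                      # keep the original segment order
        for m in range(d_h):
            seg = contour_img[n * p_w:(n + 1) * p_w, m * p_h:(m + 1) * p_h]
            # 2x2 max-pooling: split into blocks and take the block maximum.
            pooled = seg.reshape(p_w // pool, pool, p_h // pool, pool).max(axis=(1, 3))
            segments.append(pooled)           # 20 x 20 per segment
    return segments                           # 16 segments, order preserved
```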

#### 3.2.2. Line-Segment Aggregation-Based Reduced Data Generation for Pain Feature-Extracted Data Processing Load Reduction

The information from the line segments (LS) extracted from the real-time streaming images is matched with unique numbers. The unique numbers are 1, 2, 4, and 8; their values do not overlap, and any combination of them sums to a mutually distinct value. The LFA algorithm uses a 2 × 2 filter containing these unique numbers to match normal line-segment data. An LS pixel has a value of 0 or 1, and where the filter is matched with the LS, only the areas having 1 take on the corresponding unique number. A serial number is thus given to express segment information: the visual data are converted into a series of numbers so that the various segment types (curves, horizontal lines, vertical lines, etc.) can be counted easily. In other words, visual data are converted into a series of numeric patterns. Figure 5 shows the process where a segmented image is converted into 1D vector data.
**Figure 5.** Conversion from a segment image to 1D vector data: (**a**) Multiply operation; (**b**) Sum operation.

A segment of an image utilizing contour line data has a parameter of 0 or 1, as shown in Figure 5a. The involved segment is a line segment when this parameter is 1, and is background when the parameter is 0. Such segment data are calculated with the filter, *f*, in sequence. The segment data have a size of 20 × 20, and filter *f* is 2 × 2. The 2 × 2 window is used to calculate a convolution between the segment data and filter *f*. At this point, the window moves one pixel at a time (stride = 1) to scan the entire area of the segmented image. Each scanned area is calculated with filter *f*; the parameter is changed, and the image's 1 parameter is replaced with the *f* parameter. The process in Figure 5b is calculated as shown in Algorithm 2.

**Algorithm 2** 1D Vector Conversion Algorithm

```
Input: [x1 = [p1, p2, . . . , p16], x2 = [p1, p2, . . . , p16], . . . , xn = [p1, p2, . . . , p16]]
def Convert image to a 1D vector
    Label = [1, 2, 8, 4]
    Y = List()
    for xi in [x1, x2, . . . , xn] do
        // sub1 is a list to save the result of a piece of the image.
        sub1 = List()
        for pi in xi do
            // sub2 is a list to save the result of the piece of the image
            // matched with the label data.
            sub2 = List()
            for w from 0 to W - fw + 1 do
                for h from 0 to H - fh + 1 do
                    p = pi[w:w + fw, h:h + fh]
                    p = p.reshape(-1) * Label
                    sub2.append(sum(p))
            sub1.append(sub2)
        Y.append(sub1)
Output: Y = [Y1, Y2, . . . , Yn]
```
Equation (2) shows the calculation between the segment image and filter *f*, in which *f<sub>w</sub>* and *f<sub>h</sub>* represent its width and height. At this point, *f* has a fixed parameter and a fixed size; *x<sub>i</sub>* is a partial area of the segment image, which is divided into pieces of the same size as *f*. Once a convolution between such segmented data and *f* is calculated, the results are added up and recorded in *P<sub>i</sub>*.

$$f = \begin{bmatrix} 1 & 2 \\ 8 & 4 \end{bmatrix}, \quad f_w, f_h = 2, 2, \quad P_i = \sum_{w=0}^{W} \sum_{h=0}^{H} x_i[\,w:w+f_w,\; h:h+f_h\,] \otimes f, \tag{2}$$

For example, when the segment image has parameters set to [[1, 0], [0, 1]], as a convolution between the segment image and *f* is calculated, the segment image's parameters are changed to [[1, 0], [0, 4]]. These changed parameters obtain different values according to the position of each parameter due to *f*. Adding them up shows a different result, according to the data expression of the scan area. Table 1 shows the type and sum of lines according to the scanned areas.


**Table 1.** Type and sum of lines according to the scanned areas.

When scanned areas are expressed as 0, they are considered background (as shown in Table 1), and the summed value is also 0. On the other hand, when all the areas are expressed as 1, they are considered fully active, and the area acquires a summed value of 15. Other areas, according to the position and number of 1s, are expressed as point, vertical, horizontal, or diagonal, and are given a unique number. Even for identical line types, the data are assigned a different number according to the expressed position, and the summed value is therefore unique. For example, vertical is one of the line types detected in areas expressed as 0110 or 1001; their summed values are 6 and 9, respectively, each a distinct unique value. This means that the same line types are treated as different lines depending on where the line is expressed. In addition, each line type's total cannot exceed 15. The data calculated through this process group and save the line types (totals) calculated per segment as a 1D vector, creating a total of 16 1D vectors. Each vector has a size of (20 − 2 + 1) × (20 − 2 + 1) = 361, and each vector's parameter has a value ranging from 0 to 15.
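The following is a minimal NumPy sketch of this line-type coding (Equation (2), Algorithm 2, and Table 1); the filter layout and the value range follow the description above, while variable names are illustrative.

```
# Sketch of the 2x2 unique-number scan in Equation (2) / Algorithm 2: the
# filter f = [[1, 2], [8, 4]] slides over a 20x20 binary segment with
# stride 1; active pixels take the matching filter value and the window sum
# is a line-type code between 0 (background) and 15 (fully active area).
import numpy as np

F = np.array([[1, 2],
              [8, 4]])

def segment_to_vector(segment):
    h, w = segment.shape                      # expected 20 x 20, values 0 or 1
    fh, fw = F.shape
    codes = []
    for y in range(h - fh + 1):
        for x in range(w - fw + 1):
            window = segment[y:y + fh, x:x + fw]
            codes.append(int((window * F).sum()))
    return np.array(codes)                    # (20-2+1) * (20-2+1) codes per segment
```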

#### 3.2.3. Unique Number-Based Data Compression and Feature Map Generation for Image Dimensionality Reduction

The 16 one-dimensional vectors calculated through the process shown in Figure 5 consist of unique values according to the line type, determined by segmenting the facial image into 16 parts and matching each part with the filter. Such vector data consist of parameters ranging from 0 to 15, and each parameter carries a unique feature (line-segment information). This section describes how cumulative aggregate data are generated from the parameter values of each segment. The term "cumulative aggregate data" refers to data generated through a process where a parameter value is utilized as an index into a 1D array of size 16; the corresponding array element increases by 1 every time its index is called. Figure 6 shows the process where cumulative aggregate data are generated.

#### **Algorithm 3** Cumulative Aggregation Algorithm

```
Input: [x1 = [p1 = [v1, v2, . . . , vm], p2, . . . , p16], x2, . . . , xn]
def Cumulative aggregation used to make LFA data
    Y = List()
    for xi in [x1, x2, . . . , xn] do
        sub1 = List()
        for p in xi do
            sub2 = array(16){0, . . . }
            for i from p do
                sub2[i]++
            sub1.append(sub2)
        Y.append(sub1)
Output: Y
```
**Figure 6.** Process of cumulative aggregate data generation: (**a**) Index the parameters of the fragmented image to increase the elements of the array; (**b**) 1D array for 1 piece~1D array 16 piece.

As shown on the right side of Figure 6a, the parameters of the data segmented through the previous process are utilized as indices into an array, and a 1D array of size 16 is generated for each segment. This array is shown in Figure 6b: the element at the index position corresponding to each parameter of the one-dimensional array (with a maximum size of 16) increases by 1. The process in Figure 6b is calculated as shown in Algorithm 3. Since this process is applied to each segment, an array of size 16 is generated for each segment, for a total of 16 arrays. These are known as LFA data and are shown in Figure 7a.
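A compact NumPy sketch of this cumulative aggregation (Algorithm 3) could look as follows, with np.bincount standing in for the per-index counting loop.

```
# Sketch of Algorithm 3 with NumPy: each segment's vector of line-type codes
# (values 0-15) is reduced to a 16-bin count, and stacking the 16 segments
# gives the 16x16 LFA map used as the CRNN input.
import numpy as np

def lfa_map(segment_vectors):
    # segment_vectors: 16 arrays of line-type codes in the range 0..15
    rows = [np.bincount(v, minlength=16) for v in segment_vectors]
    return np.stack(rows)                     # shape (16, 16)
```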

**Figure 7.** Learning process using LFA data: (**a**) LFA data for one image; (**b**) LFA-Learn process.

The LFA process in Figure 7a restructures each array generated from each segment image in the appropriate order (the order prioritized based on the segmentation position). Through this, for one image, the LFA data calculated through the LFA process are expressed as a two-dimensional sequence with a size of 16 × 16. This is used as input for the CRNN.

#### *3.3. LFA-CRNN Model for Driver Pain Status Analysis*

Once a feature map is generated, and the image is deduced through facial and contour line detection, the pre-processing of the given input images restructures them into two-dimensional arrays having a size of 16 × 16 through the LFA process. Specifically, the dimensionality is reduced through the LFA technique. Since LFA always has the same output size and consists of aggregate information on the line segments contained in the image, the reduced data themselves can be considered unique features. In addition, a learning model dedicated to LFA data is designed instead of a general CRNN learning model architecture for drivers' pain status, and the learning process is performed as well. Figure 8 shows the structure of the proposed LFA-CRNN model.

**Figure 8.** The proposed LFA-CRNN architecture.

The LFA-CRNN architecture is a CRNN learning model. It consists of one convolution layer and expresses a feature map as sequence data through the reshape layer. The features converted into sequence data are transferred to the dense layer through two bidirectional gated recurrent units (BI-GRUs), and the sigmoid layer serves as the final layer before the results are output. Through the convolution layer's batch normalization (BN), learning speed is improved and the risk of depending on the initial weighted-value selection, as well as of overfitting, is reduced [36–38]. Since this learning model uses dimensionality-reduced LFA data, the compressed data themselves can be considered one feature. Accordingly, to express one major feature as a number of features in the convolution, the input-related expressions are diversely divided through a total of 64 filters having a size of 16 × 16. The value deduced through this process passes through the BN and generates a series of feature maps through the rectified linear unit (ReLU) layer. Such feature maps are restructured through the reshape layer into 64 sequences of size 256 and are used as the RNN model's input. The RNN model consists of two BI-GRUs, one with 64 nodes and one with 32 nodes. The data deduced through this process are delivered to the sigmoid layer through the dense layer. At this point, the dropout layer is arranged between the dense layer and the sigmoid layer to reduce the calculation volume and prevent overfitting [39–41]. Lastly, through the sigmoid layer, nine types of pain are classified. In this model, the pooling layer generally used in pre-existing CNN and CRNN models is not used: since the input LFA data already have a considerably small size of 16 × 16 and consist of the cumulative numbers of line segments in the image, compressing them further could damage or remove the main features. Instead, BN and the dropout layer are arranged in place of the pooling layer, and the convolution's stride and padding are set to *1* and *same*, respectively. We used the convolution layer to obtain a variety of information about the expression of the individual, highly concentrated LFA data by designing the model as shown in Figure 8. Thus, the filter of the convolution layer was set to 16 × 16 with stride = 1 and padding = "same." Through this, the LFA data size is maintained, and because of the weighted values of the filter, it can express a lot of information. The data are used as input in each cycle of the RNN, and, through the previous characteristics, strong characteristics are gradually detected.

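As a hedged sketch, the architecture described above could be expressed in Keras (the framework used in Section 4) roughly as follows; the dense-layer width, dropout rate, and the exact reshape mapping are assumptions not specified in the text.

```
# A hedged Keras sketch of the LFA-CRNN layer stack described above:
# Conv2D(64, 16x16, stride 1, padding "same") -> BN -> ReLU -> reshape into
# 64 sequences of length 256 -> BI-GRU(64) -> BI-GRU(32) -> dense -> dropout
# -> 9-way sigmoid. The dense width and dropout rate are assumed values.
from tensorflow.keras import layers, models

def build_lfa_crnn(num_classes=9):
    inputs = layers.Input(shape=(16, 16, 1))                  # 16x16 LFA map
    x = layers.Conv2D(64, (16, 16), strides=1, padding="same")(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)                          # 16x16x64 feature maps
    x = layers.Reshape((64, 256))(x)                          # 64 sequences of size 256
    x = layers.Bidirectional(layers.GRU(64, return_sequences=True))(x)
    x = layers.Bidirectional(layers.GRU(32))(x)
    x = layers.Dense(64, activation="relu")(x)                # assumed width
    x = layers.Dropout(0.5)(x)                                # assumed rate
    outputs = layers.Dense(num_classes, activation="sigmoid")(x)
    return models.Model(inputs, outputs)
```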
**Figure 9.** Driver pain status analysis process and its performance evaluation.

#### **4. Simulation and Performance Evaluation**

A simulation was conducted in the following environment: a Microsoft Windows 10 Pro 64-bit OS on an Intel Core(TM) i7-6700 CPU (3.40 GHz) with 16 GB RAM, and an emTek XENON NVIDIA GeForce GTX 1060 graphics card with 6 GB of memory. To implement the algorithm, we utilized OpenCV 4.2, Keras 2.2.4, and the Numerical Python (NumPy) library (version 1.17.4) based on Python 3.6. OpenCV was used to perform the Canny technique during pre-processing by the LFA, the queues generated in the LFA process were calculated using the NumPy library, and the neural network model was implemented through Keras. Figure 9 shows the process by which the driver's pain status is analyzed and how the system's performance was evaluated.

To evaluate the performance of LFA-CRNN model-based face recognition (suffering and non-suffering expressions), the UNBC-McMaster database was used, and a comparison was made with the AlexNet and CRNN models. These two models were chosen because the CRNN is the basic structure of the proposed model and AlexNet is generally well known for image classification. The UNBC-McMaster database classifies pain into nine stages (0–8) using the Prkachin and Solomon Pain Intensity (PSPI) scale, with data consisting of 129 participants (63 males and 66 females). The accuracy and loss measurements were based on these data. The data calculated through pre-processing (face detection and contour line extraction) and the LFA conversion process were used as the LFA-CRNN's input, while the CRNN [42] and AlexNet [43] used the data calculated through the face detection process for the performance comparison. The test was conducted by taking 20% of the data from the UNBC-McMaster database [44] as test data and utilizing 10% of the remaining 80% as verification data. In the process of classifying data, to prevent the data from leaning too much towards a particular class, the classification was undertaken by designating a specific percentage for each class. Specifically, the 42,512 data units consisted of 29,758 learning data units, 3401 verification data units, and 8503 test data units.

Figure 10 shows the results of the accuracy and loss, using the UNBC-McMaster Database. As shown in Figure 10, the LFA-CRNN showed the highest accuracy, with AlexNet second and the CRNN third. AlexNet showed a large gap between the training data and verification data. The CRNN showed a continuous increase in the training data accuracy but showed a temporary decrease in the verification data accuracy due to overfitting. Although the LFA-CRNN proposed in this paper showed a bit of a gap between the learning and validation data, such a gap is not considered significant. Since no temporary decrease was shown in the validation data, it was confirmed that no learning overfitting occurred; loss data showed the same patterns. AlexNet showed the highest gap between learning and validation data, in terms of loss. The CRNN showed a continuous decrease of loss in both learning and validation data, but showed a temporary increase in validation data. Therefore, the LFA-CRNN can be considered more reliable than both AlexNet and traditional CRNN.


**Figure 10.** Accuracy and loss measurement results using the University of Northern British Columbia (UNBC)-McMaster shoulder pain expression database: (**a**) accuracy; (**b**) loss.

Figure 11 shows the accuracy and loss achieved with the test data. As shown in the figure, the LFA-CRNN had the highest accuracy at approximately 98.92% and the lowest loss at approximately 0.036. The CRNN showed temporary overfitting during learning, and this was determined to be the reason why its accuracy was lower than that of the LFA-CRNN. Likewise, it was determined that AlexNet showed a decrease in accuracy due to the wide gap between its training and verification data. The test results shown in Figures 10 and 11 can be summarized as follows. As far as UNBC-McMaster-based learning is concerned, the LFA-CRNN model showed no rapid change in accuracy and loss, and it was confirmed that a stable graph was maintained as the epochs progressed (i.e., no overfitting or large gap). In addition, compared to the basic models, the proposed method showed the highest performance with an accuracy of approximately 98.92%.


**Figure 11.** Accuracy and loss with the test data.

To measure the accuracy and reliability of the proposed algorithm, precision, recall, and the receiver operating characteristic (ROC) curve [45] were measured. Figure 12 shows the results achieved.


**Figure 12.** Results of precision and recall, plus the receiver operating characteristic (ROC) curve evaluation for each algorithm.

In Figure 12, the precision results show the percentage of samples actually determined to be true out of the samples predicted to be true for each pain severity class. The LFA-CRNN showed the following results: 0 = 98%, 1 = 81%, 2 = 63%, 3 = 63%, 4 = 19%, 5 = 74%, 6 = 78%, 7 = 100%, and 8 = 100%. Such results are quite poor compared to the results achieved by AlexNet and the CRNN. It was determined that such results are attributable to the LFA dimensionality reduction technique. Since a dimensionality reduction technique either compresses the original image to generate new data or reduces the data size by keeping only particular strong features, it removes specific features and uses only strong features. However, only the LFA-CRNN was able to detect data having a PSPI of 8. In addition, as a result of confirming the average precision, both the LFA-CRNN and AlexNet showed an average precision of 75%, while the CRNN showed an average precision of 56%. The recall measurements were similar to the precision results: the LFA-CRNN showed an average recall of 75%, AlexNet showed an average recall of 73%, and the CRNN showed an average recall of 56%. Based on this test, it was confirmed that it was difficult for all the models to detect data having a PSPI of 4, and that only the LFA-CRNN was able to detect data having a PSPI of 8. To sum up all the experiments, the proposed LFA-CRNN model showed a stable graph in the learning process and, in the performance evaluation using the test data, it showed the highest performance of 98.92%. In addition, its loss measurement showed the lowest result at approximately 0.036. Although some of the LFA-CRNN's per-class precision and recall values were quite poor, its average precision was 75% (as high as the precision achieved by AlexNet), and it showed the highest average recall at 75%.

The LFA-CRNN proposed in this study showed higher accuracy using fewer input dimensions than comparable models. We attribute this to the maximal removal of unnecessary regions. We examined the metadata necessary for analyzing facial expression test data and judged that the color and area (size) constituting the images were unnecessary elements. Thus, the remaining element was the information about line segments, and we set up our hypothesis for the facial analysis algorithm on this basis. When people analyze facial expressions, they do not usually consider colors; emotions are understood through the shapes of the mouth, eyes, and eyebrows, so the color element was removed. Moreover, the images with colors removed were similar to the images expressed by their outlines. In learning with a neural network model, a large loss of data takes place when images are reduced via max-pooling and stride during processing, and overfitting and wind-up phenomena occur. Thus, we devised a method for reducing the size of the images, and that method is LFA. LFA maintains as much information about line segments as possible to prevent the data loss that might occur during processing, utilizing data with both color and unnecessary areas removed. In other words, when we extracted emotions, the necessary elements were maintained as much as possible, and all other information was minimized. We judge that the LFA-CRNN shows high accuracy for these reasons.

#### **5. Conclusions**

With this paper's proposed method, health risks due to an abnormal health status that may occur while someone is driving are determined through facial expressions, a representative medium capable of confirming a person's emotional state based on external clues. The purpose of this study was to construct a system capable of preventing traffic accidents and secondary accidents resulting from chronic diseases, which are increasing as our society ages. Although automated driving systems are being mounted on vehicles and commercialized based on vehicle technology advancements, such systems do not take driver status into consideration. If a driver's health status becomes abnormal while the vehicle is in motion, the vehicle may continue to operate normally, but the driver might not be able to meet the required "golden time" to address the health problem that arises. Our system checks the driver's health status based on facial expressions in order to resolve, to a certain extent, problems related to chronic diseases. To do so, in this paper, the LFA dimensionality reduction algorithm was used to reduce the size of input images, and the LFA-CRNN model receiving the reduced LFA data as input was designed and used to classify the status of drivers as being in pain or not. LFA is a method where a series of filters is used to assign a unique number to the line-segment information that makes up a facial image, and then the input image is converted into a two-dimensional array having a size of 16 × 16 by adding up the unique numbers. As the converted data are learned through the LFA-CRNN model, facial expressions indicating pain are classified. To evaluate performance, a comparison was made with pre-existing CRNN and AlexNet models. The UNBC-McMaster database was used to learn pain-related expressions. As far as the accuracy and loss calculated through learning are concerned, the LFA-CRNN showed the highest accuracy at 98.92%, the CRNN alone showed an accuracy of 98.21%, and AlexNet showed an accuracy of 97.4%. In addition, the LFA-CRNN showed the lowest loss at approximately 0.036, the CRNN showed a loss of 0.045, and AlexNet showed a loss of 0.117. Although some of the LFA-CRNN's per-class precision and recall measurements were quite poor, its average precision was 75%, which is as high as the 75% precision achieved by AlexNet.

We optimized the facial expressions and the data sources for the LFA-CRNN, and we intend to compare the processing times of several models and improve the accuracy in future work. The proposed LFA-CRNN algorithm shows a high dependency on the outline detection method; this is self-evident, because LFA is based on line-segment analysis. We are therefore devising an outline detection technique that can be optimally applied to LFA. In addition, the LFA process generates a one-dimensional sequence before the production of the two-dimensional LS-Map, and we expect that by converting this, a class can be produced that can be used directly in the neural network model. Through this improvement process, we will combine the LFA-CRNN model with a system for recognition of facial expressions and motions that can be used in services like smart homes and smart health care, and we plan to apply it to mobile edge computing systems and video security.

**Author Contributions:** K.C. and R.C.P. conceived and designed the framework. E.J.H. and C.-M.K. implemented LFA-CRNN model. R.C.P. and C.-M.K. performed experiments and analyzed the results. All authors have contributed in writing and proofreading the paper. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work is supported by the Korea Agency for Infrastructure Technology Advancement (KAIA) grant funded by the Ministry of Land, Infrastructure and Transport (Grant 20CTAP-C157011-01).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Application of Texture Descriptors to Facial Emotion Recognition in Infants**

#### **Ana Martínez † , Francisco A. Pujol \*,† and Higinio Mora †**

Department of Computer Technology, University of Alicante, 03690 San Vicente del Raspeig-Alicante, Spain; amlopez.ana@gmail.com (A.M.); hmora@ua.es (H.M.)

**\*** Correspondence: fpujol@ua.es

† These authors contributed equally to this work.

Received: 3 December 2019; Accepted: 30 January 2020; Published: 7 February 2020

**Featured Application: A system to detect pain in infants using facial expressions has been developed. Our system can be easily adapted to a mobile app or a wearable device. The recognition rate is above 95% when using the Radon Barcodes (RBC) descriptor. It is the first time that RBC is used in facial emotion recognition.**

**Abstract:** The recognition of facial emotions is an important issue in computer vision and artificial intelligence due to its important academic and commercial potential. If we focus on the health sector, the ability to detect and control patients' emotions, mainly pain, is a fundamental objective within any medical service. Nowadays, the evaluation of pain in patients depends mainly on the continuous monitoring of the medical staff when the patient is unable to express verbally his/her experience of pain, as is the case of patients under sedation or babies. Therefore, it is necessary to provide alternative methods for its evaluation and detection. Facial expressions can be considered as a valid indicator of a person's degree of pain. Consequently, this paper presents a monitoring system for babies that uses an automatic pain detection system by means of image analysis. This system could be accessed through wearable or mobile devices. To do this, this paper makes use of three different texture descriptors for pain detection: Local Binary Patterns, Local Ternary Patterns, and Radon Barcodes. These descriptors are used together with Support Vector Machines (SVM) for their classification. The experimental results show that the proposed features give a very promising classification accuracy of around 95% for the Infant COPE database, which proves the validity of the proposed method.

**Keywords:** emotion recognition; pattern recognition; texture descriptors; mobile tool

#### **1. Introduction**

Facial expressions are one of the most important stimuli when interpreting social interaction, as they provide information on the identity of a person and on their emotional state. Facial expressions are among the most important signalling systems that human beings use to express to others what is happening to them [1].

The recognition of facial expressions is especially interesting because it allows for detecting feelings and moods in people, which are applicable in fields such as psychology, teaching, marketing or even health, which is the main objective of this work.

The automatic recognition of facial expressions could be a great advance in the field of health, in applications such as pain detection in people unable to communicate verbally, decreasing the continuous monitoring required from medical staff, or for people with Autism Spectrum Disorder, for instance, who have difficulty understanding other people's emotions.

Babies are one of the biggest groups that cannot express pain verbally, so this impossibility has created the necessity of using other media for its evaluation and detection. In this way, pain scales based on vital signals and facial changes have been created to evaluate the pain of neonates [2]. Thus, the main objective of this paper is to create a tool which reduces the continuous monitoring by parents and medical staff. For that purpose, a set of computer vision methods with supervised learning have been implemented, making it feasible to develop a mobile application to be used in a wearable device. For the implementation, this paper has used the Infant COPE database [3], a database composed of 195 images of neonates, which is one of the few available public databases for infants' pain detection.

Regarding pain detection using computer vision, several previous studies have been carried out. Thus, Roy et al. [4] proposed the extraction of facial features for automatic pain detection in adults, using the UNBC-McMaster Shoulder Pain Expression Archive Database [5,6]. Using the same database, Lucey et al. [7] developed a system that classifies pain in adults after extracting facial action units. More recently, Rodriguez et al. [8] used Convolutional Neural Networks (CNNs) to recognize pain from facial expressions and Ilyas et al. [9] implemented a facial expression recognition system for traumatic brain injured patients. However, when focusing on detecting pain in babies, very few works can be found. Among them, Brahnam et al. used Principal Components Analysis (PCA) reduction for feature extraction and Support Vector Machines (SVM) for classification in [10], obtaining a recognition rate of up to 88% using a grade 3 polynomial kernel. Then, in [11], Mansor and Rejab used Local Binary Patterns (LBP) for the extraction of characteristics, while Gaussian and Nearest Mean Classifiers were used for classification. With these tools, they achieved a success rate of 87.74–88% for the Gaussian Classifier and of 76–80% with the Nearest Mean Classifier. Similarly, Local Binary Patterns were used as well in [12] for feature extraction and SVM for classification, obtaining an accuracy of 82.6%. More recently, and introducing deep learning methods, Ref. [13] fused LBP, Histogram of Oriented Gradients (HOG), and CNNs as feature extractors, with SVM for classification, with an accuracy of 83.78% as the best result. Then, in [14], Zamzmi et al. used pre-trained CNNs and a strain-based expression segmentation algorithm as a feature extractor together with a Naive Bayes (NB) classifier, obtaining a recognition accuracy of 92.71%. In [15], Zamzmi et al. proposed an end-to-end Neonatal Convolutional Neural Network (N-CNN) for automatic recognition of neonatal pain, obtaining an accuracy of 84.5%. These works validated their proposed methods using the Infant COPE database mentioned above.

Other recent works tested their proposed methods with other databases. Thus, an automatic discomfort detection system for infants by analyzing their facial expressions in videos from a dataset collected at the hospital Maxima Medical Center in Veldhoven, The Netherlands, was presented in [16]. The authors used again HOG, LBP and SVM with 83.1% correctly detected discomfort expressions. Finally, Zamzmi et al. [14] used CNNs with transfer learning as a pain expression detector, achieving 90.34% accuracy in a dataset recorded at Tampa General Hospital and in [15] obtained an accuracy of 91% for the NPAD database.

On the other hand, concerning emotion recognition and wearable devices, most of the methods proposed until now have relied on biomedical signals [17–20]. When using images, and more specifically facial images, to recognize emotions, very few wearable devices can be found. Among them, one can find the work by Kwon et al. [21], where they proposed a glasses-type wearable system to detect a user's emotion using facial expression and physiological responses, reaching around 70% in the subject-independent case and 98% in the subject-dependent one. In [22], another system to automate facial expression recognition that runs on wearable glasses is proposed, reaching a 97% classification accuracy for eight different emotions. More recently, Kwon and Kim described in [23] another glasses-type wearable device to detect emotions from a human face via multi-channel facial responses, obtaining an accuracy of 78% at classifying emotions. Wearable devices have also been designed for infants to monitor vital signs [24], body temperature [25], health using an electrocardiogram (ECG) sensor [26], or as a pediatric rehabilitation device [27]. In addition, there is a growing number of mobile applications for infants, such as SmartCED [28], which is an Android application for epilepsy diagnosis, or commercial devices with a smartphone application for parents [29] (smart socks [30] or the popular video monitors [31]).

However, no smartphone application or wearable device related to pain detection through facial expression recognition in infants has been found. Therefore, this work investigates different methods to implement a reliable tool to assist in the automatic detection of pain in infants using computer vision and supervised learning, extending our previous work presented in [2]. As mentioned before, texture descriptors and, specifically, Local Binary Patterns, are among the most popular algorithms to extract features for facial emotions recognition. Thus, this work will compare the results after applying several texture descriptors, including Radon Barcodes, which is the first time that they are used to detect facial emotions, this being the main contribution of this paper. Moreover, our tool can be easily implemented in a wireless and wearable system, so it could have many potential applications, such as alerting parents or medical staff quickly and efficiently when a baby is in pain.

This paper is organized as follows: Section 2 explains the main features about the methods used in our research and outlines the proposed method; Section 3 describes the experimental setup and the set of experiments completed and their interpretation; and, finally, conclusions and some future works are discussed in Section 4.

#### **2. Materials and Methods**

In this section, some theoretical concepts are explained first. Then, at the end of the section, the method followed to determine whether a baby is in pain or not is described.

#### *2.1. Pain Perception in Babies*

Traditionally, babies' pain has been undervalued, receiving limited attention due to the belief that babies suffer less pain than adults because of their supposed 'neurological immaturity' [32,33]. This has been refuted by several studies over the last few years, especially the one conducted by the John Radcliffe Hospital in Oxford in 2015 [34], which concluded that infants' brains react in a very similar way to adult brains when they are exposed to the same pain stimulus. Recent works suggest that infant units in hospitals must adopt reliable pain assessment tools, since untreated pain may result in short- and long-term sequelae [35,36].

As mentioned before, the impossibility of expressing pain in a verbal way has created the need to use other media to assess pain, detect it, and take the appropriate actions. This is why pain assessment scales based on behavioral indicators have been created, such as PIPP (Premature Infant Pain Profile) [37], CRIES (Crying; Requires increased oxygen administration; Increased vital signs; Expression; Sleeplessness) [38], NIPS (Neonatal Infant Pain Scale) [39], or NFCS (Neonatal Facial Coding System) [40,41]. While most assessment scales use vital signals such as heart rate or oxygen saturation, NFCS is based on facial changes through face muscles, mainly forehead protrusion, contraction of the eyelids, the nasolabial groove, horizontal stretching of the mouth, and a tense tongue [42]. Figure 1 shows a graphical example of the NFCS scale. As this paper uses an image database, this last scale is ideal to determine whether the babies are in pain or not, by analyzing the facial changes in different areas according to the NFCS scale.

**Figure 1.** Facial expression of physical distress is the most consistent behavioral indicator of pain in infants.

#### *2.2. Feature Extraction*

Feature extraction methods of facial expressions can be divided depending on their approach. Generally speaking, features are extracted from facial deformation, which is characterized by changes in shape and texture, and from facial motion, which is characterized by either the speed and direction of movement or deformations in the face image [43,44].

As explained in the last section, the NFCS scale has been selected in this paper, since its reliability, validity, and clinical utility have been extensively proven [45,46]. The criteria for classifying pain in the NFCS scale are based on facial deformations and depend on the texture of the face. Texture descriptors have been widely used in machine learning and pattern recognition, being successfully applied to object detection, face recognition, and facial expression analysis, among other applications [47]. Consequently, three texture descriptors are taken into account in this research: the popular Local Binary Patterns descriptor; a variation of this descriptor, the Local Ternary Patterns; and, finally, a recently proposed descriptor, the Radon Barcodes, which are based on the Radon transform.

#### 2.2.1. Local Binary Patterns

Local Binary Patterns (LBP) are a simple but effective texture descriptor which labels every pixel of the image by analyzing its neighborhood. It identifies whether the grey level of every neighboring pixel is above a certain threshold and codifies this comparison with a binary number. This descriptor has become very popular due to its good classification accuracy and its low computational cost, which allows real-time image processing in many applications. In addition, this descriptor is highly robust under varying lighting conditions [48,49].

In its basic version, the LBP operator works with a 3 × 3 matrix that goes across the image pixel by pixel, identifying the grey values of its eight neighbors and taking the grey value of the central pixel as a threshold. Thus, the binary code is obtained as follows: if a neighboring pixel has a lower value than the central one, it is coded as 0; otherwise, its code is 1. Finally, each binary value is weighted by its corresponding power of two and added to obtain the LBP code of the pixel. In Figure 2, a graphic example is shown.

**Figure 2.** Graphic example of the LBP descriptor.

This descriptor has been extended over the years, so that it can be used in circular neighborhoods of different sizes. In this circular version, neighbors are equally spaced, allowing the use of any radius and any number of neighboring pixels. Once the codes of all pixels are obtained, a histogram is created. It is also common to divide the image into cells, so that a histogram per cell is obtained and the histograms are finally concatenated. In addition, the LBP descriptor has a uniformity property, which significantly reduces negligible information and therefore provides low computational cost and invariance to rotations, two important properties when applied to facial expression recognition in mobile and wearable devices [50].
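A minimal NumPy sketch of the basic 3 × 3 LBP operator and its histogram is shown below; it follows the thresholding rule described above and is an illustration, not the exact implementation used in this work.

```
# Minimal NumPy sketch of the basic 3x3 LBP operator and its histogram,
# following the thresholding rule described above (neighbor >= center -> 1).
import numpy as np

def lbp_basic(gray):
    g = gray.astype(np.int32)
    c = g[1:-1, 1:-1]                                     # central pixels
    # 8 neighbors, clockwise from the top-left, weighted by powers of two.
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c)
    for k, (dy, dx) in enumerate(shifts):
        neigh = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        code += (neigh >= c) * (1 << k)
    hist, _ = np.histogram(code, bins=256, range=(0, 256))
    return hist / hist.sum()                              # normalized histogram
```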

#### 2.2.2. Local Ternary Patterns

Tan and Triggs [51] presented a new texture operator which is more robust to noise than LBP in uniform regions. It consists of an LBP extended into 3-valued codes (0, 1, −1). Figure 3 shows a practical example of how Local Ternary Patterns (LTP) work: first, threshold *t* is established. Then, if any neighbor pixel has a value below the value of the central pixel minus the threshold, it is assigned −1 and, if the value is over the value of the central pixel plus the threshold, it is assigned 1. Otherwise, it is assigned 0. After the thresholding step, the upper pattern and lower pattern are constructed as follows: for the upper pattern, all 1's are assigned 1, and the rest of the values (0s and −1's) are assigned 0; for the lower pattern, all −1's are assigned 1, and the rest of the values (0s and 1's) are assigned 0. Finally, both patterns are encoded in two different binary codes, so this descriptor provides two binary codes for one pixel instead of one as LBP does, that is, more information about the texture of the image. All of this process is shown in Figure 3.

**Figure 3.** Graphic example of the LTP descriptor.

The LTP operator has been applied successfully to similar applications as LBP, including medical images, human action classification and facial expression recognition, among others.
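For comparison with the LBP sketch above, a minimal LTP sketch (again an illustration under the same assumptions, not the implementation used here) is given below; it produces the upper and lower binary codes using the threshold *t*.

```
# Sketch of the LTP operator with threshold t: comparisons fall into
# {-1, 0, +1}; the ternary code is split into an "upper" and a "lower"
# binary pattern, giving two codes per pixel instead of one.
import numpy as np

def ltp_codes(gray, t=6):
    g = gray.astype(np.int32)
    c = g[1:-1, 1:-1]
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]
    upper = np.zeros_like(c)
    lower = np.zeros_like(c)
    for k, (dy, dx) in enumerate(shifts):
        neigh = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        upper += (neigh >= c + t) * (1 << k)              # ternary value +1
        lower += (neigh <= c - t) * (1 << k)              # ternary value -1
    return upper, lower
```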

#### 2.2.3. Radon Barcodes

The Radon Barcodes (RBC) operator is based on the Radon transform, which has attracted increasing interest in image processing, since it is extremely robust to noise and presents scale and rotation invariance [52,53]. Moreover, it has been used for years to process medical images and is the basis of current computerized tomography. As mentioned before, facial expression features are based on facial deformations and involve changes in shape, texture, and motion. As the Radon transform presents valuable properties regarding image translation, scaling, and rotation, its application to facial recognition of emotions has been considered in this work.

Essentially, Radon transform consists of an integral transform which projects all pixels from different orientations to a single vector. Consequently, RBCs are basically the sum (integral) of the values along lines constituted by different angles. Thus, Radon transform is first applied to any input image, and then projections are performed. Finally, all the projections are thresholded individually to generate code sections, which are concatenated to build the Radon Barcode. A simple way for thresholding the projection is to calculate a typical value using the median operator applied on all non-zero values of each projection [53]. Algorithm 1 shows how RBC works [53] and in Figure 4 a graphic example is shown.
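A minimal sketch of this barcode construction, using the Radon transform from scikit-image and median thresholding of the non-zero projection values, is shown below; the number of projection angles is the parameter explored later in Section 3.3.

```
# Sketch of the Radon Barcode: project the image at a few angles with
# scikit-image's Radon transform, threshold each projection by the median of
# its non-zero values, and concatenate the binary codes.
import numpy as np
from skimage.transform import radon

def radon_barcode(gray, num_projections=4):
    angles = np.linspace(0.0, 180.0, num_projections, endpoint=False)
    sinogram = radon(gray.astype(float), theta=angles, circle=False)
    barcode = []
    for i in range(sinogram.shape[1]):                    # one column per angle
        proj = sinogram[:, i]
        nonzero = proj[proj > 0]
        thresh = np.median(nonzero) if nonzero.size else 0.0
        barcode.append((proj >= thresh).astype(np.uint8))
    return np.concatenate(barcode)                        # binary RBC vector
```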


**Figure 4.** Graphic example of an RBC descriptor.

Until now, the main application of Radon Barcodes has been medical image retrieval, where they have given high accuracy. As robustness to orientation, illumination, and scale changes is needed in the recognition of facial expressions, we consider that the RBC descriptor can be a good technique to provide a reliable classification of pain/non-pain in infants using facial images; this is the first time that RBCs are used in this kind of application.

#### *2.3. Classification: Support Vector Machines*

In order to classify properly the features extracted using any of the descriptors defined above, Support Vector Machines (SVM) are chosen.

The main idea of SVM is to select a hyperplane that is equidistant to the training examples of every class to be classified so that the so-called maximum margin hyperplane between classes is obtained [54,55]. To define this hyperplane, only the training data of each class that fall right next to those margins are taken into account, which are called support vectors. In this work, this hyperplane would be the one which separates the characteristics obtained from pain and non-pain facial images. In cases where a linear function does not allow for separating the examples properly, a nonlinear SVM is used. To define the hyperplane in this case, the input space of the examples X is transformed into a new one, Φ(X), where a linear separation hyperplane is constructed using kernel functions as they are represented in Figure 5. A kernel function *K*(*x*, *x* ′ ) is a function that assigns to each pair of elements *x*, *x* ′ ∈ X a real value corresponding to the scalar product of the transformed version of that element in a new space. There are several types of kernel, such as:

• Linear kernel:

$$\mathbf{K}\left(\mathbf{x},\mathbf{x}'\right) = <\mathbf{x},\mathbf{x}'>,\tag{1}$$

• P-Grade polynomial kernel:

$$K\left(\mathbf{x}, \mathbf{x}'\right) = \left[\gamma < \mathbf{x}, \mathbf{x}' > +\tau\right]^p,\tag{2}$$

• Gaussian kernel:

$$K\left(\mathbf{x},\mathbf{x}'\right) = \exp\left(-\gamma \left\|\mathbf{x}-\mathbf{x}'\right\|^2\right), \qquad \gamma > 0,\tag{3}$$

where *γ* > 0 is a scaling parameter and *τ* is a constant.

**Figure 5.** Representation of the transformed space for nonlinear SVM.

The selection of the kernel depends on the application and situation, and a linear kernel is recommended when the linear separation of data is simple. In the rest of the cases, it will be necessary to experiment with the different functions to obtain the best model for each case, since kernels use different algorithms and parameters.

Once the hyperplane is obtained, it will be transformed back into the original space, thus obtaining a nonlinear decision boundary [2].
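As an illustration of this classification stage, the following sketch trains an SVM with a Gaussian (RBF) kernel using scikit-learn; the feature matrix and the value of γ are placeholders, since the actual experiments in this work were run in MATLAB.

```
# Illustrative sketch of the classification stage with scikit-learn: an SVM
# with a Gaussian (RBF) kernel, as in Equation (3). The feature matrix and
# gamma are placeholders; the experiments in this paper were run in MATLAB.
import numpy as np
from sklearn.svm import SVC

def train_pain_classifier(features, labels, gamma=0.01):
    # features: (n_samples, n_features) texture descriptors; labels: 1 = pain, 0 = no pain
    clf = SVC(kernel="rbf", gamma=gamma)
    clf.fit(features, labels)
    return clf

# Example with random placeholder data:
# X = np.random.rand(26, 512); y = np.random.randint(0, 2, 26)
# model = train_pain_classifier(X, y); model.predict(X[:5])
```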

#### *2.4. The Proposed Method*

Our application has been implemented in MATLAB© R2017. The toolboxes used are *Statistics and Machine Learning* and *Computer Vision System*. As mentioned in Section 1, the Infant COPE database [3] has been used for the development of the tool. This database is composed of 195 color images of 26 neonates (13 boys and 13 girls), aged between 18 hours and 3 days. For the images, the neonates were exposed to the pain of the heel test and to three non-painful stimuli: a corporal disturbance (movement from one cradle to another), air stimulation applied to the nose, and the friction of wet cotton on the heel. In addition, images of resting infants were taken.

As mentioned before, this implementation could be applied to a mobile device and/or a wearable system, so that, on the one hand, a baby monitor would continuously analyze the images it captures. On the other hand, the parents or medical staff would wear a bracelet or have a mobile application to warn them when the baby is suffering pain. The diagram in Figure 6 shows a possible example of the implementation stages.

**Figure 6.** Flowchart of the different stages.

The first step is pre-processing the input image by detecting infants' faces and then resizing the resulting images and converting them into grey scale. All images are normalized to a size of 100 × 120 pixels. Afterwards, features have been extracted using the texture descriptors mentioned before. The NFCS scale will be followed, so descriptors have been applied only to relevant facial areas to the NFCS scale: right eye, left eye, mouth, and brow. These areas are manually selected with sizes 30 × 50 pixels for eyes, 40 × 90 pixels for mouth, and 15 × 40 pixels for brow. It was possible to make an analysis to find the ideal sizes for each part due to the small size of the used database. Feature vectors from each area have been concatenated to obtain the global descriptor.
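A minimal sketch of this pre-processing stage is given below; the region positions are placeholders, since in this work the areas were selected manually, and any of the descriptors above can serve as the extraction function.

```
# Sketch of the pre-processing stage: normalize the face to a 100x120
# grey-scale image and crop the NFCS-relevant regions before extracting the
# descriptors. Region positions are placeholders (they were chosen manually
# in this work); only the region sizes follow the text.
import cv2
import numpy as np

def crop_regions(face_bgr):
    gray = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.resize(gray, (100, 120))            # width x height
    return {
        "right_eye": gray[30:60, 10:60],           # 30 x 50 pixels
        "left_eye":  gray[30:60, 50:100],          # 30 x 50 pixels
        "mouth":     gray[75:115, 5:95],           # 40 x 90 pixels
        "brow":      gray[10:25, 30:70],           # 15 x 40 pixels
    }

def global_descriptor(regions, extract):
    # extract: any descriptor above (LBP, LTP, or RBC) returning a 1D vector
    return np.concatenate([extract(r) for r in regions.values()])
```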

Finally, a previously trained SVM classifier decides if the input frame corresponds with a baby in pain or not. The system will be continuously monitoring the video frames obtained and sending an alarm to the mobile device if a pain expression is detected.

#### **3. Results**

In this section, a comparison of three different methods for feature extraction is presented: Local Binary Patterns, Local Ternary Patterns, and Radon Barcodes. According to the results obtained in [2], a Gaussian kernel has been chosen for SVM classification, since it provides optimal behavior for the Infant COPE database. The SVM has been trained with 13 pain images and 13 non-pain images, and the tests have been performed with 30 pain images and 93 non-pain images not used in the training stage. The unbalanced number of images is due to the number of pictures of each class available in the database.

To evaluate the tests, confusion matrices, cross-validation and error rate have been used. In this case, error rate has been calculated as the number of incorrect predictions divided by the total number of evaluated predictions.
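For reference, this error-rate definition can be computed directly from a confusion matrix; the short check below (a minimal Python illustration, not part of the original tool) reproduces the LBP figures reported in Section 3.1.

```python
import numpy as np

def error_rate(cm):
    """Incorrect predictions divided by the total number of evaluated predictions."""
    cm = np.asarray(cm)
    return (cm.sum() - np.trace(cm)) / cm.sum()

cm_lbp = [[27, 3], [10, 83]]                 # confusion matrix reported for LBP below
print(round(100 * error_rate(cm_lbp), 2))    # -> 10.57 (% error), i.e., 89.43% recognition rate
```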

#### *3.1. Results on LBP*

The parameters to be considered for the LBP descriptor are the radius, the number of neighbors, and the cell size. As mentioned before, images have been previously cropped into four different areas. According to the previous results in [2], the best recognition rate is obtained when each of these areas is not divided into cells. Therefore, as shown in Figure 7, the recognition rates for all possible combinations of radius 1, 2, and 3 with 8, 10, 12, 16, 18, 20, and 24 neighbors have been calculated to select the optimum values.

**Figure 7.** Recognition rate according to radius and neighbors.

As shown in Figure 7, the parameters with the best recognition rate are radius 2 and 18 neighbors. This combination presents the following confusion matrix *CMLBP*:

$$\text{CM}\_{LBP} = \begin{pmatrix} 27 & 3 \\ 10 & 83 \end{pmatrix} \tag{4}$$

It implies that there are three false positives and 10 false negatives, thus having an error rate of 10.57% and, therefore, a successful recognition rate of 89.43%.

#### *3.2. Results on LTP*

In this case, the parameters to be tuned for the LTP descriptor are the same as for LBP, with the addition of the threshold *t*. The values that gave the best result for LBP (radius 2 and 18 neighbors) have been kept, and threshold values from *t* = 1 to 10 have been tested.

As is shown in Figure 8, the best result is obtained for threshold *t* = 6, which presents the next confusion matrix *CMLTP*:

$$\text{CM}\_{LTP} = \begin{pmatrix} 20 & 10 \\ 3 & 90 \end{pmatrix} \tag{5}$$

It implies that there are 10 false positives and three false negatives, thus having an error rate of 10.57% and, therefore, a recognition rate of 89.43%.

**Figure 8.** Recognition rate according to LTP threshold.

#### *3.3. Results on RBC*

The parameter to be tuned in the RBC method is the number of projection angles. To do this, the typical values of 4, 8, 16, and 32 projections, as considered in [53], have been chosen. The results of the tests carried out are shown in Figure 9.
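For illustration, a simplified Radon barcode can be generated as in the sketch below (Python with scikit-image), following the general RBC idea of [53]: project the image at a small number of equidistant angles and binarize each projection against its own median (a common choice). The exact resizing, normalization, and thresholding used in this work may differ.

```python
import numpy as np
from skimage.transform import radon, resize

def radon_barcode(patch, n_angles=4, proj_len=32):
    """Simplified Radon barcode: binarized Radon projections at n_angles equidistant angles."""
    patch = resize(patch, (proj_len, proj_len))
    angles = np.linspace(0.0, 180.0, n_angles, endpoint=False)
    sinogram = radon(patch, theta=angles, circle=False)   # one column per projection angle
    # threshold each projection at its own median and concatenate the binary codes
    code = [(col >= np.median(col)).astype(np.uint8) for col in sinogram.T]
    return np.concatenate(code)
```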

**Figure 9.** Recognition rate according to RBC projections.

As we can see in Figure 9, the best result is obtained with four projections, which presents the next confusion matrix *CMRBC*:

$$\text{CM}\_{\text{RBC}} = \begin{pmatrix} 27 & 3 \\ 3 & 90 \end{pmatrix} \tag{6}$$

It implies that there are three false positives and three false negatives, thus having an error rate of 4.88% and, therefore, a recognition rate of 95.12%.

#### *3.4. Final Results and Discussion*

As shown throughout this section, the best results are obtained by RBC, with a recognition rate of 95.12%, followed by LBP and LTP, both with a recognition rate of 89.43%. These results show the validity of applying Radon Barcodes to facial emotion recognition, as seen in Section 2, and it can be concluded that the RBC descriptor is a reliable texture descriptor, robust against noise and invariant to scale and rotation.

Taking into account the cross-validation values of each method, LBP obtains 7.69%, LTP 19.23%, and RBC 11.54%. With these results, it can be said that, in terms of independence from the training images, LBP is better than LTP and RBC. Considering the runtime to identify pain in an input image, LBP takes around 20 ms to process a frame, LTP around 300 ms, and RBC around 30 ms. Therefore, in terms of cross-validation score and execution time, LBP obtains better results; however, RBC behaves much better in terms of recognition rate. Table 1 summarizes the obtained results.


**Table 1.** A summary of texture descriptors' results.

| Descriptor | Recognition Rate | Cross-Validation | Runtime per Frame |
|---|---|---|---|
| LBP | 89.43% | 7.69% | ~20 ms |
| LTP | 89.43% | 19.23% | ~300 ms |
| RBC | 95.12% | 11.54% | ~30 ms |

Considering that videos typically run at 25–30 frames per second, both LBP and RBC would be able to analyze every frame captured in a second, allowing the system to be integrated into a mobile app or a wearable device. However, since facial expressions do not change drastically in less than a second, the recognition process would not lose accuracy by analyzing only a few frames per second instead of 25–30. This would also reduce the workload, resulting in a more efficient tool in terms of speed.

Finally, in Table 2, there is a comparison between our research and some previous works. All of these works have made use of the Infant COPE database and different feature extraction methods and classifiers such as texture descriptors, deep learning methods, or supervised learning methods.


**Table 2.** Comparison with other works.

From the comparison in Table 2, it can be observed that the proposed method with Radon Barcodes achieves the best recognition rate, improving on previous works using the same database by more than 10%. Therefore, it can be said that the proposed method can be used as a reliable tool to classify infant facial expressions as pain or non-pain. Moreover, the processing time of the algorithm makes it feasible to implement in a mobile app or on a wearable device.

Finally, from the results in Table 2, it must be pointed out that different studies using the same algorithms may report different recognition rates. This may be due to the pre-processing stage in each work, or to the input parameters of the different feature extraction methods and/or the classifier used.

#### **4. Conclusions**

In this paper, a tool to identify infants' pain using machine learning has been implemented. The system achieves a high recognition rate of 95.12% when using Radon Barcodes. This is the first time that RBC has been used to recognize facial expressions, which shows the validity of the Radon Barcode algorithm for the identification of emotions. In addition, as shown in Table 2, Radon Barcodes improved the recognition results compared to other recently proposed methods. Furthermore, the frame processing time for pain recognition with RBC makes it possible to use our system in a real mobile application.

In relation to this, we are currently working on implementing the tool in real time and on designing a real wearable device to detect pain from facial images. We are beginning a collaboration with several hospitals to perform different tests and develop a prototype of the final system. Finally, we are also working with other infant databases and with datasets covering other ages to check the functionality and validity of the implemented tool, and the definition of a parameter to estimate the degree of pain is also under research.

**Author Contributions:** Conceptualization, F.A.P.; Formal analysis, H.M.; Investigation, A.M.; Methodology, F.A.P.; Resources, H.M.; Software, A.M.; Supervision, F.A.P.; Writing—original draft, A.M.; Writing—review & editing, H.M. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work has been partially supported by the Spanish Research Agency (AEI) and the European Regional Development Fund (FEDER) under project CloudDriver4Industry TIN2017-89266-R, and by the Conselleria de Educación, Investigación, Cultura y Deporte, of the Community of Valencia, Spain, within the program of support for research under project AICO/2017/134.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Detection of Emotion Using Multi-Block Deep Learning in a Self-Management Interview App**

**Dong Hoon Shin <sup>1</sup> , Kyungyong Chung <sup>2</sup> and Roy C. Park 3,\***


Received: 6 October 2019; Accepted: 7 November 2019; Published: 11 November 2019

**Abstract:** Recently, universities in Korea have constructed and operated online mock interview systems for students' preparation for employment. Students can have a mock interview anywhere and at any time through the online mock interview system, and can correct any problems arising during the interviews via images stored in real time. For such practice, it is necessary to analyze the emotional state of the student based on the situation, and to provide coaching through accurate analysis of the interview. In this paper, we propose the detection of user emotions using multi-block deep learning in a self-management interview application. Unlike the basic structure that learns from whole-face images, the multi-block deep learning method learns from sampled images of the core facial areas (eyes, nose, mouth, etc.), which are important factors for emotion analysis in face detection. Through the multi-block process, sampling is carried out using multiple AdaBoost learning. For optimal block image screening and verification, similarity measurement is also performed during this process. A performance evaluation compares the proposed system with AlexNet, which has mainly been used for facial recognition in the past. As comparison items, the recognition rate and the extraction time of the specific areas are compared. The extraction time of the specific areas decreased by 2.61%, and the recognition rate increased by 3.75%, indicating that the proposed facial recognition method is excellent. It is expected to provide good-quality, customized interview education for job seekers by establishing a systematic interview system using the proposed deep learning method.

**Keywords:** self-management interview application; emotion analysis; facial recognition; image-mining; deep convolutional neural network

#### **1. Introduction**

Recently, Korea's youth unemployment rate has been high, to the extent that people aged 30 to 40 constitute more than half (56.7%) of the highly educated but economically inactive population (i.e., they cannot find good jobs). Accordingly, the severity of social waste through unemployment is increasing [1]. According to one analysis, many highly educated people who could have high-level careers have been produced through university education, but they are unable to find a good position right after graduation because they graduate without an appropriate interview clinic and without information and coaching on practical employment skills [2]. Not being able to work full time is problematic in job-seeking, and a common problem is that these graduates have a lot of idle time as they attend graduate school, work as freelancers, or work part-time due to parenting duties, despite their ability to hold down a good job. In addition, the hardest part when a job applicant seeks employment is preparing for the interviews [3]. In particular, since there are no set answers for

an interview, interviewees must exhibit their capabilities differently, depending on their individual living environment and values. Also, since there are big differences among individuals (e.g., posture, eye contact, unhelpful language habits), one-on-one consulting is required, and content is needed by which students can practice their interview techniques anywhere and at any time (e.g., the night before the interview, in the train when going to an interview, and in the waiting room before the interview). The demand for online interview content is increasing, which allows a last check briefly prior to the interview [4]. Facial recognition systems can be broadly divided into face area detection and facial recognition. Face area detection determines the position of the face, size, posture, etc., in the video, and helps create a certain image for facial recognition [5,6]. Types of face detection include (1) the knowledge-based method that uses information about the typical face, (2) the feature-based method that looks for easily detected characteristics, despite changes in posture or lighting, (3) the template-matching method, which stores the basic shape of a few faces and performs a comparison with the input images, and (4) the appearance-based method, learning the face model from training images representative of the diversity in faces [7–9]. As a study for facial recognition, algorithms such as Haar, scale invariant feature transform (SIFT), ferns, modified census transform (MCT), histogram of oriented gradients (HOG), etc., are used to extract the feature factors of an image, and face analysis is actively performed based on them [10,11]. Recently, deep learning-based facial recognition has also been widely used, and a method of automatically extracting feature factors using a convolutional filter based on a convolutional neural network (CNN) has been used [12–14]. When a face is recognized using a specific factor, it is difficult to extract and select an optimal specific factor, depending on the original image state and application, and it is also difficult to determine a feature factor through various experimental and empirical factors.

We developed a self-management interview system and conducted a study on deep learning-based face analysis for emotion extraction to provide accurate interview services. Unlike the basic structure for learning the whole-face image, in this paper, a deep convolutional neural network (DCNN) method [15] for image analysis through a multi-block process learns from sampled images of the core facial areas, which are important for emotion analysis during face detection. The system proposed in this paper is expected to contribute to the creation of job opportunities by providing customized interview education that enables efficient interview management that is not constrained by space and time, and that provides an appropriate level of interview coaching.

The rest of the study is organized as follows. Section 2 describes research related to facial recognition-based application services and technology using facial recognition. Section 3 describes the detection of user emotion using multi-block deep learning in the self-management interview application; for that purpose, it also describes the image multi-block process used to extract the face's feature points, multi-block selection and extraction of main features, and an experiment with the deep learning process in face detection. Section 4 describes the proposed mobile service for real-time interview management, and Section 5 provides the conclusion.

#### **2. Related Research**

#### *2.1. Facial Recognition-Based Application Services*

Recently, various services based on facial recognition have been provided. The facial recognition process is as follows. A camera captures a face image. Then, the eyes, eyebrows, nose, and mouth, which are the main factors for emotion extraction, are analyzed to extract characteristic data, which are compared with feature data in a database provided for face analysis. Facial recognition technology analyzes facial expressions to determine emotional states, such as happiness, surprise, sadness, disgust, fear, and confusion, and is used for advertising-effect measurement, marketing, and education [16]. Affectiva, a spin-off of the Massachusetts Institute of Technology Media Lab, released Affdex, a solution for recognizing facial expressions and identifying emotional

states [17]. Figure 1 shows the Affectiva facial recognition platform. Affectiva modularizes emotion recognition-related artificial intelligence (AI) technology, distributes it through its website in the form of a software development kit (SDK), and opens it for use by various engineers and in business fields. It applies an emotion recognition solution to Tega, a robot that teaches foreign languages, and presents functions that provide appropriate content and gives rewards by understanding children's facial expressions. In addition, the facial recognition technology has been widely applied to various fields, such as locating crime suspects and lost children, and enabling mobile payments, in particular. It is used to arrest criminal suspects and locate lost children through artificial intelligence cameras attached on the chest, based on an agreement with US police [18].

**Figure 1.** Facial recognition platform with respect to Affectiva [17]. \* The figure included in this image is the author, who agreed to provide the figure.

#### *2.2. Technology Using Facial Recognition*

The AdaBoost algorithm is often used for face detection; it creates strong selection criteria by combining weak criteria [19,20]. This reduces the probability of drawing wrong conclusions, and increases the probability of accurately assessing problems that are difficult to judge. For facial recognition, a face area image is required. In order to increase the success rate of face detection and facial recognition, the impact of lighting and inclination should be minimized, and the images should be normalized; as images are normalized, the probability of error decreases [21]. Video-based emotion recognition analyzes the characteristics of the face in a video. Early work relied on classic machine learning and computer vision: for example, the characteristics of the face were extracted based on gradients computed from the video, and these characteristics were analyzed using algorithms such as a support vector machine (SVM) or random forest to determine the facial expression. However, performance is affected by the surrounding background and the illumination intensity of the video, and accuracy is also greatly affected by the angle of the face. Figure 2 shows a facial recognition algorithm using a face database. The datasets used in the early stages were captured in limited environments; however, more recent datasets also contain videos of everyday situations [22].
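As a concrete example of AdaBoost-based face detection, the sketch below uses one of OpenCV's pre-trained Haar cascade classifiers (which are trained with AdaBoost); it is illustrative only and is not the detector trained in this study. The input file name is a placeholder.

```python
# Minimal face-detection sketch using OpenCV's AdaBoost-trained Haar cascade.
import cv2

cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

frame = cv2.imread("frame.jpg")                       # placeholder input image
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
gray = cv2.equalizeHist(gray)                         # reduce lighting effects before detection

faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5, minSize=(60, 60))
for (x, y, w, h) in faces:
    face = gray[y:y + h, x:x + w]                     # normalized face crop passed on to recognition
```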

Figure 3 shows how to recognize a face in a three-dimensional (3D) image [23,24]. The method can be separated into a training phase and a test phase. In the training phase, face data are collected from 3D images, and pretreatment is performed to obtain a clean 3D face without bending. Features are then extracted from the preprocessed data by a feature extraction system [25], and the extracted face features are stored in a feature database [26,27]. Next, in the test phase, the entered target faces go through the same 3D face data collection, pretreatment, and feature-extraction steps used in the training phase. In the feature-matching phase, match scores are calculated by comparing the target face with the database saved in the training phase. When the match score is judged to be high enough, the algorithm determines that the target face has been recognized. Also,

facial recognition technology is used in the augmented reality field. There, feature points are extracted and the coordinate plane of the face is drawn to calculate its position, so that a 3D rendering of a product can be overlaid onto the face [28,29].

**Figure 2.** Facial recognition algorithm using a face database. \* The figure included in this image is the author, who agreed to provide the figure.

**Figure 3.** A 3D facial recognition augmented reality system.

#### **3. Detection of Emotions Using Multi-Block Deep Learning in Self-Management Interviews**

Figure 4 shows the whole process of the system described in this paper. First, we proceed with facial recognition, where features are extracted. Multi-block sampling is performed by extracting feature points from the recognized faces. Sampled data are extracted through deep learning based on a DCNN. Analysis is conducted based on the extracted emotions, and the analyzed data are managed by the interview system proposed in this paper. Interview management is done through the application itself. The CAS-PEAL face database is used for facial recognition, and the Cohn-Kanade database is used for emotion extraction.

#### *3.1. Image Multi-Block Process for Face Main Point Extraction*

In this paper, we developed a self-management interview system and conducted a study on deep learning-based face analysis for emotion extraction to provide accurate interview services. Unlike the basic structure for learning the whole-face image, the deep learning method proposed in this paper is a model that learns from multi-block images of core areas, such as the eyes, nose, and mouth, which are important factors for emotion analysis during face detection. The proposed learning structure of the DCNN consists of a multi-block process of entered face images and multi-block deep learning. In the multi-block process, the input image is blocked based on multiple AdaBoost. The multi-block deep learning model is executed by considering the sizes of the original image and of the sampled image that is blocked for area extraction. When both processes are completed, the whole face image and the sampled multi-block image have been learned, making it possible to use them during the emotion detection stage afterwards. The recognition process of the multi-block deep learning algorithm consists of a multi-block process, multi-block selection, and a multi-block deep

learning performance process. In the existing deep learning model, facial recognition utilizes the whole face, which causes a problem in that areas such as the eyes, nose, and mouth (the key factors for analyzing emotions) are not recognized correctly. In this paper, therefore, the recognition rate was improved by extracting, with the multi-block method, the specific parts of a face image required for emotion extraction. In particular, if the blocks are too large or too small in the blocking process, the features of the main areas cannot be extracted accurately, which causes large errors in recognition and learning.

In this paper, multiple AdaBoost was used to carry out sampling by setting the optimal blocking. Figure 5 shows the process of detection and classification with multiple AdaBoost. Multiple AdaBoost creates a stronger classifier by combining weak classifiers, each of which determines whether an image region is a face or not for a given purpose. It is designed to select the rectangular feature with the fewest errors, so that each weak classifier misclassifies as few training images as possible and, in turn, obtains the optimum threshold classification function.

**Figure 5.** The process of detection and classification with multiple AdaBoost. \* The figure included in this image is the author, who agreed to provide the figure.

For this process, training images and sample images were required, so by using the CAS-PEAL face database, our database included 99,594 images with a variety of poses, expressions, and lighting levels from 1040 individuals (595 male and 445 female). Domains of faces to be extracted were defined as positive (object) samples, while images other than a face were defined as negative (non-object, background) samples. Also, we use the Cohn-Kanade database to analyze perceived facial emotions

from data in this database, which includes 486 sequences from 97 subjects [30,31]. All positive images must have the same size in pixels, and detection should be performed by aligning the positions of the eyes, noses, and mouths as closely as possible. The learning data should include information on whether each image belongs to the positive or negative category. In addition, features for distinguishing a face from the background are also required; such features can be presented as a classifier to distinguish/classify an object. Since these features are base classifiers and candidates for weak classifiers, it was necessary to decide how many times the process of weak classifier selection should be repeated [32,33]. In other words, it was necessary to determine how many weak classifiers should be combined into one stronger classifier; at each iteration, the single feature with the best performance in classifying the training samples by class is selected and the corresponding weak classifier is calculated [34]. Therefore, we used a weighted linear combination of *T* weak classifiers, as shown in Equation (1).

$$\mathbf{E}(\mathbf{x}) = a\_1 e\_1(\mathbf{x}) + \dots + a\_T e\_T(\mathbf{x}) = \sum\_{t=1}^T a\_t e\_t(\mathbf{x}) \tag{1}$$

E: final strong classifier,

*e*: weak classifier,

*a*: weight assigned to each weak classifier.
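A small numeric sketch of Equation (1) follows: the strong classifier sums the weighted votes of the weak classifiers and thresholds the result. The weak classifiers and weights below are placeholder values, not those learned in this study.

```python
import numpy as np

# Placeholder weak classifiers e_t(x) in {-1, +1}: each thresholds one feature of x.
weak_classifiers = [
    lambda x: 1 if x[0] > 0.5 else -1,
    lambda x: 1 if x[3] < 0.2 else -1,
    lambda x: 1 if x[7] > 0.8 else -1,
]
alphas = np.array([0.9, 0.5, 0.3])     # a_t: weights learned by AdaBoost (illustrative values)

def strong_classifier(x):
    """E(x) = sum_t a_t * e_t(x); the sign gives the face / non-face decision."""
    score = sum(a * e(x) for a, e in zip(alphas, weak_classifiers))
    return 1 if score >= 0 else -1     # +1: face (positive sample), -1: background

x = np.random.default_rng(1).random(10)    # a placeholder feature vector
print(strong_classifier(x))
```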
#### *3.2. Multi-Block Selection and Extraction of Main Area Features*


The multi-block selection process selects blocks to be used for actual recognition among the multi-blocks previously delivered through the feature numerical analysis. For accurate emotion analysis, the user's eyes, nose, and mouth, which are the main feature points, should be clearly identified, and they can be classified according to the degree of rotation of the face. If there is no information on specific points in the whole image, the rotation information should be detected during the multi-block process. Figure 6 shows the whole facial recognition and emotion analysis process.

**Figure 6.** The whole facial recognition and emotion analysis process. \* The figure included in this image is the author, who agreed to provide the figure.

Face detection was made by moving a 24 × 24 pixel block; for simple patterns in multiple AdaBoost learning, basic patterns were used. In addition, the number of simple detectors to be searched by the learning process was selected as 160, and the learned detectors became serialized, in turn enhancing the processing speed. The learned detectors were serialized into 10 stages in which 16 learned detectors belong in an arbitrary manner. Parameters for each stage were adjusted, and as for images in multiple AdaBoost learning, 24 × 24 resolution was used. For detection by size, the input images were classified


(based on the degree of down-sampling) into three types, and the face was detected from among the down-sampled images. In detection by rotation, facial images rotated from −5° to +5°, from +15° to +25°, and from −15° to −25° were learned by AdaBoost. Then, by using the serialized detector, they were each analyzed and, in order of detection, the rotation of the face was classified. The detected faces were classified into nine types, and the information about the locations of the detected faces was provided as well. Figure 7 shows the face image detection process. For face detection, 80 × 60 down-sampled images were used for detecting a large face, 108 × 81 down-sampled images for a medium-sized face, and 144 × 108 down-sampled images for a small face. The sequence of detection by size was selected to enhance the detection speed and was as follows: detection on the 80 × 60 down-sampled images was performed first, followed by the 108 × 81 down-sampled images and then the 144 × 108 images. If a detected face overlapped a block already detected in the down-sampled image from the preceding step, that detection was not considered valid. The input image was searched among the down-sampled images, and when a face was detected, its block was cut from the detected image, normalized to the predesigned size, and passed to the next process. At this point, principal component analysis (PCA) was used to measure similarity with the input image, verifying the face. The image is then rotated, using the verified information on the rotation of the face, until the rotation of the face in the image becomes almost zero. When the rotation of the face was verified to be between +15° and +25°, the face was rotated by −20°, and when the rotation was between −15° and −25°, the face was rotated by +20°.

**Figure 7.** The face image detection process. \* The figure included in this image is the author, who agreed to provide the figure.
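The coarse-to-fine search by image size described above can be sketched as follows; the helper names are hypothetical, the detector is assumed to be the serialized AdaBoost cascade, and the overlap threshold is an illustrative choice. The frame is down-sampled to the three stated resolutions, detections are mapped back to the original coordinates, and a detection overlapping a face already found at a coarser scale is discarded.

```python
import cv2

SIZES = [(80, 60), (108, 81), (144, 108)]   # large, medium, and small faces (width, height)

def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    ix = max(0, min(ax2, bx2) - max(a[0], b[0]))
    iy = max(0, min(ay2, by2) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def detect_by_size(gray, detector):
    """Coarse-to-fine search: a face already found at a coarser scale is not re-detected."""
    h0, w0 = gray.shape
    kept = []
    for w, h in SIZES:
        small = cv2.resize(gray, (w, h))
        for (x, y, bw, bh) in detector(small):          # detector: assumed cascade detector
            # map the detection back to the original image coordinates
            box = (int(x * w0 / w), int(y * h0 / h), int(bw * w0 / w), int(bh * h0 / h))
            if all(iou(box, k) < 0.3 for k in kept):    # skip overlaps with earlier detections
                kept.append(box)
    return kept
```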

Once the user's face has been extracted in the aforementioned process, the positions of the eyes and nose should be extracted. The patterns for the person's eyes can be extracted by using the facial image obtained through the face detection process. Eye and nose extraction can be divided into three stages. The first stage was to designate a region in which to search for the eyes in the facial image obtained by the face detector. From this stage, it was possible to roughly estimate the position of the eyes, even if not precisely, and such an estimated position could be defined as a certain domain. In the second stage, the region for the eyes must be clearly defined, as shown in Figure 8. After defining the eye region, we used multiple AdaBoost to learn 12 × 12 pixel eye images and 12 × 12 pixel non-eye images to prepare the serialized eye detector, so as to distinguish eye regions from non-eye regions. Then, AdaBoost went through a process of detecting block images containing the eyes in the designated region. The last stage was to use PCA, trained with eye images, to measure the similarity of each candidate eye image and to select the image with the highest similarity. As shown in Figure 8, the position of the eyes could be defined as the center point of a verified eye image.

**Figure 8.** The eye detection process. \* The figure included in this image is the author, who agreed to provide the figure.

In order to detect a nose's location, it was necessary to designate a nose search region on the face image acquired during the face search, which is the same process as required for the eye search. Although the exact location of the nose cannot be specified, a rough location can be estimated, and the predictive value of the location of the nose can be defined for certain regions. Figure 9 shows the nose detection process.

**Figure 9.** The nose detection process. \* The figure included in this image is the author, who agreed to provide the figure.

In order to determine whether the image was actually a nose or not, multiple AdaBoost was used to learn 12 × 12 nose images and non-nose images, creating a serialized nose detector that finds the block images detected as noses within the region defined in the first step. The last step was to calculate the similarity of the nose images acquired during the second step, and to compare them to find the one with the best similarity. As with the eye image search process, the nose location was the center point of a verified nose image. After the detection of the eye and nose locations, the face normalization process followed. Face normalization is a process that calculates the accurate location, size, and rotation of the face using the locations of both the eyes and the nose. The face image was warped to ensure consistency among the different forms of a face. The actual emotion images can be created by finding the eyes and the nose in the image and applying image warping based on that information. The size of the normalized images and the location of each part may vary depending on the design of the recognizer.
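The normalization step can be sketched as an affine warp that maps the detected eye and nose locations to fixed canonical positions; the target coordinates and output size below are illustrative design values, not those used in this study.

```python
import cv2
import numpy as np

def normalize_face(gray, left_eye, right_eye, nose, out_size=(24, 24)):
    """Warp the face so the eyes and nose land on fixed canonical positions."""
    # canonical landmark positions in the normalized image -- hypothetical design choice
    dst = np.float32([[6, 8], [18, 8], [12, 16]])
    src = np.float32([left_eye, right_eye, nose])       # detected (x, y) landmark locations
    M = cv2.getAffineTransform(src, dst)                 # rotation + scale + translation
    return cv2.warpAffine(gray, M, out_size)
```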

#### *3.3. Face Detection Using the Multi-Block Deep Learning Process*

For the proposed user emotion extraction, this experiment extracted emotions using a DCNN based on the multi-blocked sample images of the major face areas and the images with completed feature extraction, which was intended to minimize the time from input to classification of the images. Figure 10 shows the multi-block deep learning structure proposed in this study. The emotion model was extracted by delivering a block target that included information about the features of the images learned by the DCNN in the multi-block and block selection stages. A convolution operation was conducted between the original images and the multi-block images extracted by sampling. This brought into relief the features of the major face areas for the extraction of emotions through the feature extraction filter. The filter coefficients of the feature extraction filter were set to random values in the early stages, and were then set, through learning, to the optimal coefficients with the lowest error rate. Next, the process of reducing the images was executed, analyzing the features of the extracted images and filtering the optimum features. At this point, the general DCNN minimizes a cross-entropy loss function so that its output is similar to the softmax result obtained from the image data entered from the multi-block and feature extraction stages.

**Figure 10.** The proposed multi-block deep learning structure. \* The figure included in this image is the author, who agreed to provide the figure.

This study defines the two cross-entropy loss functions in Equations (2) and (3) to deliver knowledge. In Equation (2), the loss function *L*<sub>1</sub> is a cross-entropy function based on the recognition error with respect to the label. In Equation (3), the loss function *L*<sub>2</sub> is the cross-entropy function representing the error with respect to the block target, which represents the predicted probability value of the DCNN.


$$L\_1 = -\sum\_{n=1}^{|V|} H(y=n) \times \log P(y=n|\mathbf{x};\ \theta) \tag{2}$$

$$L\_2 = -\sum\_{n=1}^{|V|} q(y = n | \mathbf{x}; \,\theta\_E) \times \log \mathbb{P}(y = n | \mathbf{x}; \,\theta) \tag{3}$$

In the formula, *q* is the softmax probability value formed by learning the features of the multi-blocked images, while *P*(*y* = *n* | **x**; θ) is the probability learned by the DCNN utilizing the features of the whole images, *n* is the index of the feature category, and |*V*| is the total number of classes. This study used both the knowledge block target, containing the feature information of the multi-block images delivered by the DCNN during learning, and the existing true class target value, so as to learn from both. It extracted accurate emotions by utilizing feature extraction over the whole image area together with the features of the blocked images of the key areas, giving a different weight to each of the two loss functions, *L*<sub>1</sub> and *L*<sub>2</sub>, as seen in Equation (4):

$$L = a \times L\_1 + (1 - a) \times L\_2, \qquad 0 < a < 1\tag{4}$$
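A minimal numeric sketch of Equations (2)–(4) follows (NumPy; the probability vectors are made-up values): the whole-image softmax *P* is penalized both against the one-hot label and against the block-target distribution *q*, and the two losses are mixed with the weight *a*.

```python
import numpy as np

def cross_entropy(target_dist, pred_dist, eps=1e-12):
    """-sum_n target(n) * log(pred(n)), as in Equations (2) and (3)."""
    return -np.sum(target_dist * np.log(pred_dist + eps))

p = np.array([0.70, 0.10, 0.20])        # P(y=n|x; theta): softmax of the whole-face DCNN
hard = np.array([1.0, 0.0, 0.0])        # one-hot true label H(y=n)
q = np.array([0.60, 0.25, 0.15])        # q(y=n|x; theta_E): block-target distribution

a = 0.7                                  # 0 < a < 1, weighting between the two targets
L1 = cross_entropy(hard, p)              # error with respect to the true label
L2 = cross_entropy(q, p)                 # error with respect to the block target
L = a * L1 + (1 - a) * L2                # Equation (4)
print(L1, L2, L)
```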

In addition, for face area detection and estimation analysis, the CAS-PEAL face database was employed. The learning data in the database consisted of classes of facial expression information for a total of 1040 persons, with a total of 1240 images per class. The data for deep learning consisted of 10 images per emotion class, with noise added to the learning data. The AlexNet [35] structure, which is widely used for facial recognition, was selected for comparison with the deep learning method proposed in this paper. Figure 11 shows a comparison of the emotion recognition accuracy of the proposed method against that of AlexNet. The facial expressions were recognized through extraction of the entire face area and the main areas, and the accuracy of extracting emotions according to the expressions was compared, with the proposed method showing accuracy about 3.75% higher.

**Figure 11.** Comparison of emotion recognition accuracy of the proposed model and AlexNet.

Figure 12 shows the distribution of the times required to extract the face area, and the results of face area detection. In the experiment, the proposed method had a faster processing time and a lower error rate than the basic method that did not go through smoothing. In addition, as the dispersion of the processing time was small, the normalization method turned out to be suitable for real-time processing.


**Figure 12.** Face area detection time (in milliseconds).

Figure 13 shows the distribution of the processing time to detect the eye area, and the results of eye area detection. In eye area detection, the distribution of the processing time was not affected by normalization; however, there was a difference in the error rate. As a result of the experiment, the proposed method was deemed excellent.

Figure 14 shows the distribution of the processing time to detect the nose area, and the results of nose area detection. As with eye area detection, the proposed method showed excellent performance in terms of average processing time and error rate.

**Figure 14.** Nose area detection time (in milliseconds).

#### **4. Mobile Service for Real-Time Interview Management**

The self-management interview system was developed as a mobile application for smooth interview coaching. When the interview app is used, a real-time video is taken and transmitted to the server. At this time, the person's emotional state is presented through voice and facial recognition in the video, and real-time coaching is provided accordingly. In addition, including various types of interview coaching content and self-diagnosis programs, it is an effective system for speech practice as well as interviews. Figure 15 shows the image-analysis algorithm-based emotion matching. As for the image-analysis algorithm, the faces and eyes were detected, using an Extensible Markup Language (XML) classifier, and based on the detected images, emotions were extracted from a comparative analysis by the CAS-PEAL face database and Sort image.

**Figure 15.** The structure of the interview management system. \* The figure included in this image is the author, who agreed to provide the figure.

Figure 16 shows a system that analyzes images by capturing one frame after dividing a video into frame units. System functions include video playback, analysis visualization, recognition options, rotation options, binary processing, curve graph representation of emotions, object feature analysis, etc. After capturing the video, the user selects the part to be recognized with the recognition option and then recognizes that part through a binarization process. The binarization function finds the feature points of the image. There may be a rotated face in the captured image, so there is also an option that rotates the face to the correct position. This function offers a selection range of −25 to +25 degrees. There are eight emotions for analyzing a person's feelings through the recognition function: Neutral (usual expression), contempt, disgust, anger, happiness, surprise, sadness, and fear. There is also

a function that graphs the emotions in each image captured from the video. Analysis of the image in Figure 16 confirms the person is happy. Object characterization analysis shows the gender and age, and features like a mustache, beard, and eyeglasses. According to the analysis, the image in the current frame is male, 20–30 years old, without a mustache or beard, and no glasses. A screen shot from the facial recognition and emotion analysis results of the interview management system is shown in Figure 17.


**Figure 16.** The real-time interview management system. \* The figure included in this image is the author, who agreed to provide the figure.


**Figure 17.** Facial recognition and emotion analysis results from the interview management system. \* The figure included in this image is the author, who agreed to provide the figure.

For the mobile service configured in this study, an application was developed utilizing Android Studio 9 (Pie) on an Intel Core i7-4770 CPU at 3.40 GHz, with 16 GB of RAM running the Windows 10 Enterprise 64-bit environment. For the real-time interview and automatic coaching service, an app was configured that has a server for interview management, a module for automatic coaching based on the interview when a user uses the service, and a user interface for the relevant services. When the user touches each button in the real-time interview management mobile application, including voice evaluation, interview evaluation, and comprehensive interview from the main screen, that input is passed to the service use information page, providing values for pronunciation, interview, and coaching, for the function interviewCode. On the service use information page, the value of interviewCode is forwarded as an intent that is distinguished as a value for each variable and is displayed, applying a message image for the corresponding voice evaluation, interview evaluation, comprehensive interview, and start button.

Splash screens for the facial recognition and emotion analysis results of self-managed interviews are shown in Figure 18. Once the interview evaluation begins, for evaluation questions, the application calls up the interview question API(Application Programming Interface) in the server, brings up the index of the relevant questions, the content of the questions, and information about the company that set the questions, and displays them in the application view. This was designed so that, once recording begins, the application calls the Android internal camera and conducts image recording and voice recording for encoding, so that both image and voice are included in the video. When the recording ends, the file-upload API in the server is immediately run to upload the user information, question index, and video file to the server, and once uploading is completed, the analysis procedure is launched through a module. On the module analysis information page, at regular intervals, the application continuously calls up the module analysis results API in the server. When the module analysis is completed, the user moves to the interview evaluation results page. Then, with the values coming from the module, the result is displayed in percentages of the emotions (including neutral, contempt, disgusted, angry, happy, surprised, scared, and fear) in the criteria for analysis.

**Figure 18.** The real-time interview management mobile application. \* The figure included in this image is the author, who agreed to provide the figure.

#### **5. Conclusions**

In this paper, we developed a self-management interview system and conducted a study on deep learning-based face analysis for emotion extraction to provide an accurate interview evaluation service. A self-management interview system was developed as a mobile application for smooth interview coaching. When the interview service is used, a real-time video is recorded and transmitted to the server. At this time, the person's emotional state is presented through voice and facial recognition from the video, and real-time coaching is provided accordingly. In addition, including a variety of interview coaching content and self-diagnosis programs, the proposed system is effective for speech practice as well as interview practice. Unlike the basic structure for recognizing a whole-face image, the deep learning method for image analysis in this system helps the user learn after sampling the core areas that are important for sentiment analysis during face detection through a multi-block process. In the multi-block process, multiple AdaBoost is used to perform sampling. After sampling, an XML classifier is used to detect the main features, which are set at threshold values to remove elements

that interfere with facial recognition. In addition, the extracted images are classified, using the CAS-PEAL face database, into eight emotions (i.e., neutral, contempt, disgusted, angry, happy, surprised, scared, and fear), and services are provided through the application. In the experimental results, facial expressions were recognized through extraction of the entire face area and the main areas. The accuracy of extracting emotions based on the recorded expressions was compared: the extraction time of the specific areas decreased by 2.61%, and the recognition rate increased by 3.75%, indicating that the proposed facial recognition method is excellent. The extracted emotions are provided through an interview management app, and users can efficiently access the interview management system based on them. We believe the interview coaching application will be utilized to provide interview education that matches students with employment coaches, and it will provide quality job interview–education content for students in the future. The system proposed in this paper is expected to contribute to the creation of job opportunities by providing customized interview education that enables efficient interview management, is not constrained by space and time, and provides an appropriate level of interview coaching.

**Author Contributions:** K.C. and R.C.P. conceived and designed the framework. D.H.S. implemented Multi-Block Deep Learning for a Self-Management Interview App. R.C.P. and D.H.S. performed experiments and analyzed the results. All authors have contributed in writing and proofreading the paper.

**Funding:** This research was funded by a National Research Foundation of Korea (NRF) grant funded by the Korea government (2019R1F1A1060328).

**Acknowledgments:** We appreciate very much the author and researchers who agreed to provide the images used in this paper.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Cost-Effective CNNs for Real-Time Micro-Expression Recognition**

#### **Reda Belaiche \*, Yu Liu, Cyrille Migniot, Dominique Ginhac and Fan Yang**

ImViA EA 7535, University Bourgogne Franche-Comté, 21000 Dijon, France; Yu\_liu@etu.u-Bourgogne.fr (Y.L.); Cyrille.Migniot@u-bourgogne.fr (C.M.); Dominique.Ginhac@u-bourgogne.fr (D.G.); fanyang@u-bourgogne.fr (F.Y.)

**\*** Correspondence: Reda.Belaiche@u-bourgogne.fr

Received: 5 June 2020; Accepted: 16 July 2020; Published: 19 July 2020

**Abstract:** Micro-Expression (ME) recognition is a hot topic in computer vision as it presents a gateway to capture and understand daily human emotions. It is nonetheless a challenging problem due to ME typically being transient (lasting less than 200 ms) and subtle. Recent advances in machine learning enable new and effective methods to be adopted for solving diverse computer vision tasks. In particular, the use of deep learning techniques on large datasets outperforms classical machine learning approaches which rely on hand-crafted features. Even though available datasets for spontaneous ME are scarce and much smaller, using off-the-shelf Convolutional Neural Networks (CNNs) still demonstrates satisfactory classification results. However, these networks are intense in terms of memory consumption and computational resources. This poses great challenges when deploying CNN-based solutions in many applications, such as driver monitoring and comprehension recognition in virtual classrooms, which demand fast and accurate recognition. As these networks were initially designed for tasks of different domains, they are over-parameterized and need to be optimized for ME recognition. In this paper, we propose a new network based on the well-known ResNet18 which we optimized for ME classification in two ways. Firstly, we reduced the depth of the network by removing residual layers. Secondly, we introduced a more compact representation of optical flow used as input to the network. We present extensive experiments and demonstrate that the proposed network obtains accuracies comparable to the state-of-the-art methods while significantly reducing the necessary memory space. Our best classification accuracy was 60.17% on the challenging composite dataset containing five objective classes. Our method takes only 24.6 ms to classify a ME video clip (less than the occurrence time of the shortest ME, which lasts 40 ms). Our CNN design is suitable for real-time embedded applications with limited memory and computing resources.

**Keywords:** computer vision; deep learning; optical flow; micro facial expressions; real-time processing

#### **1. Introduction**

Emotion recognition has received much attention in the research community in recent years. Among the several sub-fields of emotion analysis, studies of facial expression recognition are particularly active [1–4]. Most of the affective computing methods in the literature apply the emotion model presented by Ekman [5] that reported seven basic expressions: anger, fear, surprise, sadness, disgust, contempt and happiness. Ekman developed the Facial Action Coding System (FACS) to describe the facial muscle movements according to the action units, i.e., the fundamental actions of individual muscles or groups of muscles that can be combined to represent each of the facial expressions. These facial expressions can thus be labeled by codes based on the observed facial movements rather than from subjective classifications of emotion.

In contrast to the traditional macro-expression, people are less familiar with micro facial expressions [5,6], and even fewer know how to capture and recognize them. A Micro-Expression (ME) is a rapid and involuntary facial expression that exposes a person's true emotion [7]. These subtle expressions usually take place when a person conceals his or her emotions in one of the two scenarios: conscious suppression or unconscious repression. Conscious suppression happens when one deliberately prevents oneself from expressing genuine emotions. On the contrary, unconscious repression occurs when the subject is not aware of his or her true emotions. In both cases, MEs reveal the subject's true emotions regardless of the subject's awareness. Intuitively, ME recognition has a vast number of potential applications across different sectors, such as the security field, neuromarketing [8], automobile drivers' monitoring [9] and lies and deceit detection [6].

Psychological research shows that facial MEs are generally transient (e.g., lasting less than 200 ms) and very subtle [10]. The short duration and subtlety levy great challenges on a human trying to perceive and recognize them. To enable better ME recognition by humans, Ekman and his team developed the ME Training Tool (METT). Even with the help of this training tool, humans can barely achieve around 40% accuracy [11]. Moreover, humans' decisions are prone to being influenced by individual perceptions that vary among subjects and across time, resulting in less objective results. Therefore, a bias-free and high-quality automatic system for facial ME recognition is highly sought after.

A number of earlier solutions to automate facial ME recognition have been based on geometry or appearance feature extraction methods. Specifically, geometric-based features encode geometric information of the face, such as shapes and locations of facial landmarks. On the other hand, appearance-based features describe the skin textures of faces. Most existing methods [12,13] attempt to extract low-level features, such as the widely used Local Binary Pattern from Three Orthogonal Planes (LBP-TOP) [14–16], from different facial regions, and simply concatenate them for ME recognition. Nevertheless, the transient and subtle nature of MEs inherently makes it challenging for low-level features to effectively capture the essential movements of a ME. At the same time, these features can also be affected by irrelevant information or noise in video clips, which further weakens their discrimination capabilities, especially for inactive facial regions that are less dynamic [17].

Recently, more approaches based on mid-level and high-level features have been proposed. Among these methods, the pipeline composed of optical flow and deep learning has demonstrated its high effectiveness for MEs recognition in comparison with traditional ones. The studies applying deep learning to tackle the ME classification problem usually considered well-known Convolutional Neural Networks (CNNs) such as ResNet [18] and VGG [19]. These studies re-purposed the use of off-the-shelf CNNs by giving them input data taken from the optical flow extracted from the MEs. While achieving good performance, these neural networks are quite demanding in terms of memory usage and computation.

In specific applications, for example, during automobile driver monitoring or student comprehension recognition in virtual education systems, fast and effective processing methods are necessary to capture emotional responses as quickly as possible. Meanwhile, thanks to great progress in parallel computing, parallelized image processing devices such as embedded systems are easily accessible and affordable. Already well-adopted in diverse domains, these devices possess multiple strengths in terms of speed, embeddability, power consumption and flexibility. These advantages, however, often come at the cost of limited memory and computing power.

The objective of this work was to design an efficient and accurate ME recognition pipeline for embedded vision purposes. First of all, our design took into account thorough investigations on different CNN architectures. Next, different optical flow representations for CNN inputs were studied. Finally, our proposed pipeline achieved accuracy for ME recognition that is competitive with state-of-the-art approaches while being real-time capable and using less memory. The paper is organized as follows. In Section 2, several recent related studies are reviewed. Section 3 explains the proposed methodology in order to establish cost-effective CNNs for fast ME recognition. Section 4 provides experimental results and performance evaluations. Lastly, Section 5 concludes the paper.

#### **2. Related Works**

MEs begin at the onset (the first frame, where the muscles of the facial expression start to contract), finish at the offset (the last frame, where the face returns to its neutral state) and reach their pinnacle at the apex frame (see Figure 1). Because of their very short duration and low intensity, ME recognition and analysis are considered difficult tasks. Earlier studies proposed using low-level features such as LBP-TOP to address these problems. LBP-TOP is a 3D descriptor extended from the traditional 2D LBP. It encodes the binary patterns between image pixels, and the temporal relationship between pixels and their neighboring frames. The resulting histograms are then concatenated to represent the temporal changes over entire videos. LBP-TOP has been widely adopted in several studies. Pfister et al. [14] applied LBP-TOP for spontaneous ME recognition. Yan et al. [15] achieved 63% ME recognition accuracy on their CASME II database using LBP-TOP. In addition, LBP-TOP has also been used to investigate differences between micro-facial movement sequences and neutral face sequences.
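As a rough illustration of the idea (not a full implementation), the sketch below computes LBP histograms on the three orthogonal central slices of a grayscale clip and concatenates them; the actual LBP-TOP descriptor aggregates the patterns over every pixel of the volume, typically per block.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_top(video, P=8, R=1):
    """Simplified LBP-TOP: LBP histograms of the three orthogonal central slices
    (XY, XT, YT) of a T x H x W grayscale clip, concatenated."""
    T, H, W = video.shape
    planes = [video[T // 2], video[:, H // 2, :], video[:, :, W // 2]]   # XY, XT, YT slices
    hists = []
    for plane in planes:
        lbp = local_binary_pattern(plane, P=P, R=R, method="uniform")
        h, _ = np.histogram(lbp, bins=P + 2, range=(0, P + 2), density=True)
        hists.append(h)
    return np.concatenate(hists)   # descriptor summarizing spatial and temporal patterns
```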

**Figure 1.** Example of a Micro-Expression (ME): the maximum movement intensity occurs at the apex frame.

Several studies aimed to extend the low-level features extracted by LBP-TOP, as they still could not reach satisfactory accuracy. For example, Liong et al. [20] proposed assigning different weights to local features, thereby putting more attention on active facial regions. Wang et al. [12] studied the correlation between color and emotions by extracting LBP-TOP from the Tensor-Independent Color Space (TICS). Ruiz-Hernandez and Pietikäinen [21] used the re-parameterization of a second order Gaussian jet on the LBP-TOP, achieving a promising ME recognition result on the SMIC database [22]. Considering that LBP-TOP contains redundant information, Wang et al. [23] proposed the LBP-Six Intersection Points (LBP-SIP) method, which is computationally more efficient and achieves higher accuracy on the CASME II database. We also note that the STCLQP (Spatio-Temporal Completed Local Quantization Patterns) proposed by Huang et al. [24] achieved a substantial improvement for analyzing facial MEs.

Over the years, as research showed that it is non-trivial for low-level features to effectively capture and encode a ME's subtle dynamic patterns (especially from inactive regions), other methods shifted to exploit mid- or high-level features. He et al. [17] developed a novel multi-task mid-level feature learning method to enhance the discrimination ability of the extracted low-level features. The mid-level feature representation is generated by learning a set of class-specific feature mappings. Better recognition performance has been obtained with more available information and features better suited to discrimination and generalization. A simple and efficient method known as Main Directional Mean Optical-flow (MDMO) was employed by Liu et al. [25]. They used optical flow to measure the subtle movement of facial Regions of Interest (ROIs) selected based on the Facial Action Coding System (FACS). Oh et al. [26] also applied the monogenic Riesz wavelet representation in order to amplify subtle movements of MEs.

The aforementioned methods indicate that the majority of existing approaches heavily rely on hand-crafted features. Inherently, they are not easily transferable as the process of feature crafting and selection depends heavily on domain knowledge and researchers' experience. In addition, methods based on hand-crafted features are not accurate enough to be applied in practice. Therefore, high-level feature descriptors which better describe different MEs and can be automatically learned are desired. Recently, more and more vision-based tasks have shifted to deep CNN-based solutions due to their superior performance. Recent developments in ME recognition have also been inspired by these advancements by incorporating CNN models within the ME recognition framework.

Peng et al. [27] proposed a two-stream convolutional network, DTSCNN (Dual Temporal Scale Convolutional Neural Network), to address two aspects: the overfitting problem caused by the small sizes of existing ME databases and the use of high-level features. We can observe four characteristics of the DTSCNN: (i) separate features were first extracted from ME clips by two shallow networks and then fused; (ii) data augmentation and a higher dropout ratio were applied in each network; (iii) two databases (CASME I and CASME II) were combined to train the network; (iv) the data fed to the networks were optical-flow images instead of raw RGB frames.

Khor et al. [28] studied two variants of an Enriched LRCN (Long-Term Recurrent Convolutional Network) model for ME recognition. Spatial Enrichment (SE) refers to channel-wise stacking of gray-scale and optical flow images as new inputs to the CNN, whereas Temporal Enrichment (TE) stacks the obtained features. Their TE model achieves better accuracy on a single database, while the SE model is more robust under the cross-domain protocol involving several databases.

Liong et al. [29] designed a Shallow Triple Stream Three-dimensional CNN (STSTNet). The model takes as input stacked optical flow images computed between the onset and apex frames (optical strain, horizontal and vertical flow fields), followed by three shallow Convolutional Layers in parallel and a fusion layer. The proposed method is able to extract rich features from MEs while being computationally light, as the fused features are compact yet discriminative.

Our objective was to realize a fast and high-performance ME recognition pipeline for embedded vision applications under several constraints, such as embeddability, limited memory and restricted computing resources. Inspired by existing works [27,29], we explored different CNN architectures and several optical flow representations for CNN inputs to find cost-effective neural network architectures that were capable of recognizing MEs in real-time.

#### **3. Methodology**

The studies applying deep learning to tackle the ME classification problem [30–33] usually used pretrained CNNs such as ResNet [18] and VGG [19] and applied transfer learning to obtain ME features. In our work, we first selected the off-the-shelf ResNet18 because it provided the best trade-off between accuracy and speed on the challenging ImageNet classification task and is recognized for its performance in transfer learning. ResNet [18] explicitly lets the stacked layers fit a residual mapping. Namely, the stacked non-linear layers are made to fit the mapping *F*(*x*) := *H*(*x*) − *x*, where *H*(*x*) is the desired underlying mapping and *x* the initial activations. The original mapping is recast into *F*(*x*) + *x* by feedforward neural networks with shortcut connections. ResNet18 has 20 Convolutional Layers (CLs) (17 successive CLs and 3 branching ones). Residual links are used after each pair of successive convolutional units, and the number of filters is doubled from one stage to the next. As ResNet18 is designed to extract features from RGB color images, it requires inputs to have 3 channels.
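As an illustration of the residual mapping *F*(*x*) + *x* described above, a basic residual block can be sketched in PyTorch as follows (the channel width and kernel size are illustrative, not the exact ResNet18 values at every stage):

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Illustrative residual block: the stacked layers learn F(x) = H(x) - x,
    and the identity shortcut restores F(x) + x at the output."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                        # shortcut connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))     # F(x)
        return self.relu(out + identity)    # F(x) + x

# Example forward pass on a dummy feature map
y = BasicResidualBlock(64)(torch.randn(1, 64, 56, 56))
```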

In order to accelerate processing speed in the deep learning domain, the main current trend for decreasing the complexity of a CNN is to reduce its number of parameters. For example, Hui et al. [34] proposed the very compact LiteFlowNet, whose model is 30 times smaller and runs 1.36 times faster than state-of-the-art CNNs for optical flow estimation. In [35], Rieger et al. explored parameter-reduced residual networks on in-the-wild datasets, targeting real-time head pose estimation. They experimented with various ResNet architectures with a varying number of layers to handle different image sizes (including low-resolution images). The optimized ResNet achieved state-of-the-art accuracy with real-time speed.

It is well known that CNNs are designed for specific problems and are therefore often over-dimensioned when reused in other contexts. ResNet18 was made for end-to-end object recognition: its training dataset contains over a million images spread across more than a thousand classes. Given that (i) an ME recognition study considers at most five classes, and the datasets of spontaneous MEs are scarce and contain far fewer samples, and (ii) optical flows are high-level features, contrary to low-level color features, and thus call for shallower networks, we empirically reduced the ResNet18 architecture by iteratively removing residual layers. This allowed us to assess the influence of network depth on classification performance in our context and therefore to estimate an appropriate calibration of the network.

Figure 2 illustrates the reduction protocol: at each step, the last residual layer with two CLs is removed and the previous one is connected to the fully connected layer. Only networks with an odd number of CLs are therefore obtained. The weights of all CNNs are pretrained using ImageNet. As highlighted in Table 1, decreasing the number of CLs has a significant impact on the number of learnable parameters of the network, which directly affects the forward propagation time.
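A possible way to reproduce this reduction protocol is sketched below using torchvision's ResNet18; the helper name, the way blocks are dropped from the end and the reconnection to a new fully connected layer are illustrative choices (written in Python/PyTorch rather than the MATLAB toolchain used in our experiments):

```python
import torch.nn as nn
from torchvision import models

def reduced_resnet18(n_blocks_removed, n_classes=5):
    """Drop residual blocks (two CLs each) from the end of an ImageNet-pretrained
    ResNet18 and attach a new fully connected layer to the remaining features."""
    net = models.resnet18(pretrained=True)  # weights flag may differ across torchvision versions
    blocks = [b for stage in (net.layer1, net.layer2, net.layer3, net.layer4) for b in stage]
    kept = blocks[:len(blocks) - n_blocks_removed]
    out_channels = kept[-1].conv2.out_channels if kept else 64
    return nn.Sequential(
        net.conv1, net.bn1, net.relu, net.maxpool,
        *kept,
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(out_channels, n_classes),
    )

# Example: remove the last two residual blocks (four CLs)
model = reduced_resnet18(n_blocks_removed=2)
```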

**Figure 2.** Depth reduction of a deep neural network: in the initial network, each residual layer contains two Convolutional Layers (CLs) (**left**); the last residual layer is removed (**middle**) to obtain a shallower network (**right**).

**Table 1.** Number of CLs and the number of learnable parameters in the proposed architectures.


Once the network depth has been correctly estimated, the dimensions of the input have to be optimized. In our case, the CNNs take optical flows extracted between the onset and apex frames of ME video clips, since it is between these two moments that the motion is most likely to be strongest. The dimensionality of the input also affects the complexity of the network, because reducing the number of input channels reduces the number of filters required throughout the following layers of the CNN. The optical flow between the onset (Figure 3a) and the apex (Figure 3b) typically has a 3-channel representation so that it can be used in a pretrained architecture designed for 3-channel color images. This representation, however, may not be optimal for ME recognition.

Under the assumption of brightness constancy, the movement of each pixel between the two frames is estimated and represented as a vector (Figure 3c) indicating the direction and intensity of the motion. The projection of the vector on the horizontal axis corresponds to the Vx field (Figure 3d), while its projection on the vertical axis is the Vy field (Figure 3e). The Magnitude (M) is the norm of the vector (Figure 3f). Figure 4 illustrates this representation for one optical flow vector.
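In code, the decomposition of a dense flow field into Vx, Vy and M is straightforward; the following minimal sketch assumes the flow is stored as an (H, W, 2) NumPy array:

```python
import numpy as np

def flow_representations(flow):
    """Split a dense optical flow field of shape (H, W, 2) into its horizontal
    component Vx, vertical component Vy and Magnitude M = sqrt(Vx^2 + Vy^2)."""
    vx = flow[..., 0]                  # projection on the horizontal axis
    vy = flow[..., 1]                  # projection on the vertical axis
    m = np.sqrt(vx ** 2 + vy ** 2)     # norm of each motion vector
    return vx, vy, m
```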

**Figure 3.** Optical flow is computed between the onset (**a**) and the apex (**b**): vectors obtained for a random sample of pixels (**c**), Vx field (**d**), Vy field (**e**) and Magnitude field (**f**).

**Figure 4.** Visualization of M, Vx and Vy for one optical flow vector.

When classifying MEs, the resulting matrices Vx, Vy and M are traditionally given as inputs to the CNN. Nonetheless, the third channel is inherently redundant since M is computed from Vx and Vy. An optical flow composed of only the 2-channel Vx and Vy fields could already provide all the relevant information. Furthermore, we hypothesize that even a single-channel motion field could be descriptive enough. Hence, we created and evaluated networks taking as input the optical flow in a two-channel representation (Vx-Vy) or in a one-channel representation (M, Vx or Vy). For this purpose, the proposed networks begin with a number of CLs determined by the depth optimization, each followed by batch normalization and ReLU. The networks then end with a max-pooling layer and a fully connected layer. Figure 5 presents the architectures used, with one to four CLs, according to the results of the experiments in Section 4. As illustrated in Table 2, a low-dimensional input leads to a significant reduction in the number of learnable parameters and therefore in the complexity of the system.
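For illustration, a minimal PyTorch sketch of such a shallow network is given below; the number of filters per layer and the input resolution are placeholders rather than the exact values used in our experiments:

```python
import torch
import torch.nn as nn

def shallow_me_cnn(in_channels=1, n_conv_layers=3, n_classes=5, width=32):
    """Shallow ME classifier: a few Conv-BatchNorm-ReLU stages on a 1- or 2-channel
    optical flow input, followed by max pooling and a fully connected layer."""
    layers, c_in = [], in_channels
    for _ in range(n_conv_layers):
        layers += [nn.Conv2d(c_in, width, kernel_size=3, padding=1),
                   nn.BatchNorm2d(width),
                   nn.ReLU(inplace=True)]
        c_in = width
    layers += [nn.AdaptiveMaxPool2d(1), nn.Flatten(), nn.Linear(width, n_classes)]
    return nn.Sequential(*layers)

# Example: 3-CL network classifying a batch of single-channel Vy maps
logits = shallow_me_cnn(in_channels=1, n_conv_layers=3)(torch.randn(8, 1, 112, 112))
```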

**Figure 5.** Proposed networks composed of one to four (from left to right) CLs for various representations of the optical flow as input.

**Table 2.** Number of learnable parameters according to the dimensionality of the input of the network.


#### **4. Experiments**

#### *4.1. Dataset and Validation Protocol Presentation*

Two ME databases were used in our experiments. CASME II (Chinese Academy of Sciences Micro-Expression) [15] is a comprehensive spontaneous ME database containing 247 video samples collected from 26 Asian participants with an average age of 22.03 years. The Spontaneous Actions and Micro-Movements (SAMM) database [36] is more recent and consists of 159 micro-movements (one video each), collected spontaneously from a demographically diverse group of 32 participants with a mean age of 33.24 years and a balanced gender split. SAMM was originally intended for investigating micro-facial movements and initially collected samples of the seven basic emotions.

Both the CASME II and SAMM databases were recorded at a high frame rate of 200 fps. They also both contain "objective classes," as provided in [37]. For this reason, the Facial MEs Grand Challenge 2018 [38] proposed combining all samples from both databases into a single composite dataset of 253 videos with five emotion classes. It should be noted that the class distribution is not well balanced: the composite database is composed of 19.92% "happiness", 11.62% "surprise", 47.30% "anger", 11.20% "disgust" and 9.96% "sadness".

Similarly to [38], we applied the Leave One Subject Out (LOSO) cross-validation protocol for ME classification, wherein one subject's data is used as a test set in each fold of the cross-validation. This is done to better reproduce realistic scenarios where the encountered subjects are not present during training of the model. In all experiments, recognition performance is measured by accuracy, which is the percentage of correctly classified video samples out of the total number of samples in the database.
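As an illustration of the LOSO protocol, the sketch below uses scikit-learn's LeaveOneGroupOut on synthetic data, with a simple k-nearest-neighbour classifier standing in for the actual CNN:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.random((40, 8))                    # per-clip features (placeholder)
y = rng.integers(0, 5, size=40)            # five ME classes (placeholder labels)
subjects = np.repeat(np.arange(10), 4)     # subject ID of each sample

accuracies = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=subjects):
    # All samples of one subject are held out for testing in each fold
    clf = KNeighborsClassifier(n_neighbors=3).fit(X[train_idx], y[train_idx])
    accuracies.append((clf.predict(X[test_idx]) == y[test_idx]).mean())

print(f"LOSO accuracy: {100 * np.mean(accuracies):.2f}%")
```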

The Horn–Schunck method [39] was selected to compute the optical flow. This algorithm has been widely used for optical flow estimation in many recent studies by virtue of its robustness and efficiency. Throughout all experiments, we trained the CNN models with a mini-batch size of 64 for 150 epochs using the RMSprop optimizer. Feature extraction and classification were both handled by the CNN. Simple data augmentation was applied to double the training set size: for each ME video clip used for training, in addition to the optical flow between the onset and apex frames, we also included a second flow computed between the onset and apex+1 frames.
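For reference, a minimal Horn–Schunck implementation is sketched below; it follows the textbook iterative scheme, and the smoothness weight `alpha` and the number of iterations are illustrative values rather than those used in our experiments. The same routine applied to the onset/apex+1 pair yields the augmented sample described above.

```python
import numpy as np
from scipy.ndimage import convolve

def horn_schunck(img1, img2, alpha=1.0, n_iter=100):
    """Minimal Horn-Schunck optical flow between two grayscale frames in [0, 1];
    returns the horizontal (u ~ Vx) and vertical (v ~ Vy) flow fields."""
    img1, img2 = img1.astype(np.float32), img2.astype(np.float32)
    # Spatio-temporal derivatives from simple finite-difference stencils
    kx = np.array([[-1.0, 1.0], [-1.0, 1.0]]) * 0.25
    ky = np.array([[-1.0, -1.0], [1.0, 1.0]]) * 0.25
    kt = np.full((2, 2), 0.25)
    ix = convolve(img1, kx) + convolve(img2, kx)
    iy = convolve(img1, ky) + convolve(img2, ky)
    it = convolve(img2, kt) - convolve(img1, kt)
    # Kernel averaging the flow over the 8-neighbourhood (smoothness term)
    avg = np.array([[1/12, 1/6, 1/12], [1/6, 0.0, 1/6], [1/12, 1/6, 1/12]])
    u = np.zeros_like(img1)
    v = np.zeros_like(img1)
    for _ in range(n_iter):
        u_avg, v_avg = convolve(u, avg), convolve(v, avg)
        update = (ix * u_avg + iy * v_avg + it) / (alpha ** 2 + ix ** 2 + iy ** 2)
        u = u_avg - ix * update
        v = v_avg - iy * update
    return u, v
```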

#### *4.2. ResNet Depth Study*

In order to find the ResNet depth which permits an optimal compromise between the ME recognition performance and the number of learnable parameters, we tested different CNN depths using the method described in Section 3. The obtained accuracies are given in Table 3:

**Table 3.** Accuracies varied by the number of Convolutional Layers (CLs) and associated number of learnable parameters.


We observed that the best score was achieved by ResNet8, which has seven CLs. However, the scores achieved with different numbers of CLs did not vary much. Furthermore, beyond seven CLs, adding more layers did not improve the accuracy of the model. The fact that accuracy does not increase with depth confirms that multiple successive CLs are not necessary to achieve a respectable accuracy. The most interesting observation was that, with a single CL, we achieved a score not far from the optimal one while the model was much smaller. This suggests that, instead of deep learning, a more "classical" approach exploiting shallow neural networks is an interesting direction to explore when considering portability and computational efficiency for embedded systems. This is the principal reason we restricted our study to shallow CNNs.

#### *4.3. CNN Input Study*

In this subsection, we study the impact of the optical flow representation on ME recognition performance. Two types of CNN were investigated, one with a 1-channel input (Vx, Vy, or M) and the other using the 2-channel Vx-Vy pair. Because off-the-shelf CNNs typically take 3-channel inputs and are pretrained accordingly, applying transfer learning to our models would have been a nontrivial task. Instead, we created custom CNNs and trained them from scratch. Table 4 shows the recognition accuracies of different configurations using a small number of CNN layers.

**Table 4.** Accuracies under various CNN architectures and optical flow representations.


We can observe that the Vx-Vy pair and Vy alone gave the best results, both representations achieving 60.17% accuracy. Using the Magnitude alone led to a similar accuracy, with a score of 59.34%. Vx gave the worst results overall, with a maximum score of 54.34%. This observation indicates that the most prominent features for ME classification might indeed lie in vertical rather than horizontal movement. This is consistent with the muscle movements involved in each known facial expression.

To better visualize the difference in the high-level features present in Vx, Vy and the Magnitude, we averaged the flow fields of all samples within each class. The result can be seen in Figure 6. We observed that Vx exhibits a non-negligible amount of noise. Magnitude and Vy, on the other hand, show clear regions of activity for each class, and these regions are aligned with the muscles responsible for each facial expression.
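The class-wise averaging used for this visualization can be expressed as a simple NumPy operation; the sketch below assumes that all flow maps are already aligned to the same face crop and that each class has at least one sample:

```python
import numpy as np

def average_flow_per_class(flows, labels, n_classes=5):
    """Average per-sample flow maps (e.g., H x W Vy fields) within each ME class
    to highlight where the motion activity concentrates."""
    flows = np.asarray(flows, dtype=np.float32)
    labels = np.asarray(labels)
    return np.stack([flows[labels == c].mean(axis=0) for c in range(n_classes)])
```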

**Figure 6.** Average optical flow obtained in the dataset per ME class. Studied classes are in order from left to right: happiness, surprise, anger, disgust and sadness.

#### *4.4. Classification Analysis*

In order to better understand the obtained results, we measured the cosine similarity of the features extracted by three CNNs: ResNet8 (Section 4.2), Vx-Vy-3-CL and Vy-3-CL (Section 4.3). The convolutional layers of a CNN are usually considered as feature extractors; only the last fully connected layer directly performs the classification task. The features just before classification can be represented as a vector. Cosine similarity measures the similarity between two vectors *a* and *b* using Equation (1):

$$\text{cosine}(a, b) = \frac{a^T b}{||a|| \, ||b||} \tag{1}$$

Cosine similarity values fall within the range [−1, 1]; values closer to 1 indicate higher similarity between two vectors. Tables 5–7 display the cosine similarity values: using two samples from each of the five ME classes, we calculated the intra-class similarity and the average inter-class similarity of each class under the same configuration for the three CNNs.
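Equation (1) translates directly into code; the snippet below is a minimal NumPy version applied to two feature vectors taken just before the fully connected layer:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity of Equation (1): a^T b / (||a|| ||b||), in [-1, 1]."""
    a = np.asarray(a, dtype=np.float32).ravel()
    b = np.asarray(b, dtype=np.float32).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Example on two random 128-dimensional feature vectors
print(cosine_similarity(np.random.rand(128), np.random.rand(128)))
```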

**Table 5.** Cosine similarity for the 3-CL CNN with single-channel input Vy.



**Table 6.** Cosine similarity for the 3-CL CNN with double-channel inputs (Vx-Vy).

**Table 7.** Cosine similarity for ResNet8.


Firstly, we observed that diagonal values (intra-class) across all three CNNs were significantly higher in comparison with other values (inter-class). This illustrates that all three CNNs are capable of separating different ME classes. Secondly, the intra-class cosine similarity of ResNet is closer to 1, suggesting that ResNet features are more discriminative. We hypothesize that our simplified CNNs with reduced layers extract less refined features, resulting in the minor decrease in performance (61.00% vs. 60.17%).

#### *4.5. Performance Evaluations*

In this subsection, we evaluate the proposed method in three aspects: recognition accuracy, required memory space and processing speed. Since the optimal results were obtained using the Vy field and the 3-CL CNN, further evaluations concentrated on this particular configuration.

**Evaluation on recognition accuracy:** We compared recognition accuracy on the five objective ME classes (see Table 8). Our best CNN reached a performance similar to those of other studies using the same validation protocol. It is worth mentioning that Peng et al. [40] employed a macro-to-micro transferred ResNet10 model and obtained a better result. However, their work used four Macro-Expression datasets (>10 K images) and additional preprocessing, such as color shift, rotation and smoothing. These additional operations make their method difficult to deploy on embedded systems. Examining the confusion matrix of our model (Figure 7), we also noticed that the distribution of correct predictions for Vy was more balanced than that obtained in [28] (Figure 8).

**Table 8.** Comparison between our method and those of other top-performers from literature.


The DTSCNN proposed by Peng et al. in [27] uses two optical flows computed differently from a ME sample, which makes the whole network robust to different frame rates of ME videos. In detail, the first optical flow is calculated using 64 frames around the apex to match the frame rate of CASME I. Similarly, the second optical flow is given by the 128 frames around the apex, matching the frame rate of CASME II. When the number of frames composing the ME is not sufficient, a linear interpolation method is used to normalize the video clips. Their study used two CNNs in parallel to extract two separate features before concatenating them. The resulting feature vector was then fed as input to an SVM for classification. The DTSCNN was tested on four classes (positive, negative, surprise and other) from a composite dataset consisting of the CASME I and CASME II databases, and it achieved an average recognition rate of 66.67%. The STSTNet proposed by Liong et al. in [29] makes use of three-dimensional CNNs, which carry out three-dimensional convolutions instead of two-dimensional ones (as used by ResNet, VGG, the networks presented in [27,28,40] and our study). It was tested on three classes (positive, negative and surprise) from a composite database consisting of samples from the SMIC, CASME II and SAMM databases, achieving an unweighted average recall of 76.05% and an unweighted F1-score of 73.53%. Neither of these two frameworks is well suited to real-time embedded applications constrained by limited memory and computing resources.


**Figure 7.** Confusion matrix corresponding to our network with 3 CLs and Vy as input.


**Figure 8.** Confusion matrix obtained by the work of [28].

**Evaluation on memory space:** Table 9 summarizes the number of learnable parameters and filters used according to the dimensionality of the network inputs. The minimum required memory space corresponds to storing 333,121 parameters, which is less than 3.12% of that of the off-the-shelf ResNet18.


**Table 9.** Number of learnable parameters and filters (in brackets) of various network architectures under different input dimensions.

**Evaluation on processing speed**: We used a mid-range computer with an Intel Xeon processor and an Nvidia GTX 1060 graphics card to carry out all the experiments. The complete pipeline was implemented in MATLAB 2018a with its Deep Learning Toolbox. The model that achieved the best score was the CNN with a single-channel input and three successive CLs. It needs 12.8 ms to classify the vertical component Vy. The optical flow between two frames requires 11.8 ms to compute on our machine, leading to a total runtime of 24.6 ms to classify an ME video clip. To our knowledge, the proposed method outperforms most ME recognition systems in terms of processing speed.

#### **5. Conclusions and Future Works**

In this paper, we proposed cost-efficient CNN architectures to recognize spontaneous MEs. We first investigated the depth of the well-known ResNet18 network to demonstrate that only a small number of layers is sufficient for our task. Based on this observation, we then experimented with several representations of the network input.

Following several previous studies, we fed the CNNs with optical flow estimated between the onsets and apexes of MEs. Different flow representations (horizontal Vx, vertical Vy, Magnitude M and the Vx-Vy pair) were tested and evaluated on a composite dataset (CASME II and SAMM) for the recognition of five objective classes. The results obtained with the Vy input alone were the most convincing, likely because the vertical orientation is better suited to describing an ME's motion and its variation between the different expression classes. Experimental results demonstrated that the proposed method achieves a recognition rate similar to state-of-the-art approaches.

Finally, we obtained an accuracy of 60.17% with a light CNN design consisting of three CLs with the single-channel input Vy. This configuration reduces the number of learnable parameters by a factor of 32 compared with ResNet18. Moreover, we achieved a processing time of 24.6 ms, which is shorter than the minimum duration of an ME (40 ms). Our study opens up an interesting way to find the trade-off between speed and accuracy in ME recognition. While the results are encouraging, it should be noted that our method does not provide better accuracy than the ones described in the literature; rather, a compromise has to be made between accuracy and processing time. By minimizing the computation, our proposed method obtains accuracy comparable to state-of-the-art systems while remaining compatible with the real-time constraints of embedded vision.

Several future works could further enhance both the speed and accuracy of the proposed ME recognition pipeline. These include more advanced data augmentation techniques to improve recognition performance. Moreover, new ways to automatically optimize the structure of a network to make it lighter have been presented recently, and other networks optimized for efficiency will also be explored. For example, MobileNet [41] uses depth-wise separable convolutions to build lightweight CNNs, while ShuffleNet [42] uses pointwise group convolution to reduce the computational cost of 1 × 1 convolutions and channel shuffle to help information flow across feature channels. Our next step is to analyze and integrate these methodologies into our framework. Furthermore, we also hope to investigate new emotion-aware systems while avoiding AI modeling errors and biases [43].
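As an indication of the kind of building block involved, a MobileNet-style depthwise separable convolution can be sketched in PyTorch as follows (layer sizes are illustrative; this is not an architecture evaluated in this paper):

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable convolution: a per-channel 3x3 (depthwise) convolution
    followed by a 1x1 pointwise convolution, using far fewer parameters than a
    standard 3x3 convolution over all channels."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   padding=1, groups=in_channels, bias=False)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.pointwise(self.depthwise(x))))
```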

**Author Contributions:** Conceptualization, C.M. and F.Y.; formal analysis, R.B. and Y.L.; methodology, R.B., Y.L. and C.M.; software, R.B. and Y.L.; supervision, C.M., D.G. and F.Y.; writing—review and editing, R.B., Y.L., C.M., D.G. and F.Y.; writing—original draft, R.B. and Y.L.; funding acquisition, D.G. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the H2020 Innovative Training Network (ITN) project ACHIEVE (H2020-MSCA-ITN-2017: agreement no. 765866).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

