**1. Introduction**

Child-Robot Interaction (CRI) is a subfield of Human-Robot Interaction (HRI) [1], which is defined as the interaction between humans and robotic systems. Among the many applications of HRI and CRI, socially assistive robots are being used as therapy-aid tools for children with autism [2,3]. One feature that could improve this interaction is the ability to recognize emotions. For instance, children with autism spectrum disorder (ASD) tend to lack the ability to display emotions; thus, robots should rely on measurements of involuntary biological signals, such as skin thermography [4–6].

The face is a region of the body with a high response to emotions, and changes in the facial thermal print may be linked to the child's emotion. Thus, this feature can be a useful parameter in CRI, since this biological signal is involuntary and not easily altered [7]. For this reason, recent studies in affective computing have focused on facial detection and thermography to evaluate emotion expressions [8–10]. Moreover, thermography is a more comfortable and unobtrusive technique to evaluate emotions, since no sensor needs to touch the child, unlike the electrodes used in electroencephalography and electrocardiography [8,11,12].

A conventional system for facial emotion recognition is composed of three main stages: face and facial component detection, computation of spatial and temporal features, and emotion classification [10]. The first stage, detecting the face in an input image and then locating facial components (such as eyes, nose, and mouth) or landmarks of interest, is a crucial task and still a challenge. To accurately discriminate emotions, methods based on geometric or appearance features must be applied [10,13–15], the latter being the most popular due to their superior performance [16]. In turn, Facial Landmarks (FL) are used to locate salient points of facial regions, such as the tip of the nose, the ends of the eyebrows, and the mouth [10,16].

Many studies have demonstrated that dividing the face into specific regions for facial feature extraction can improve emotion recognition performance [17–29]. However, this strategy may be affected by improper face alignment. In addition, learning-based approaches [30,31] have been proposed to locate the facial regions that contribute most to emotion recognition. Nevertheless, these approaches are difficult to extend into a generic system, since the positions and sizes of the facial patches vary with the training data.

It is worth noting that studies using thermal cameras for emotion recognition have shown promising results, but low-cost thermal cameras typically have poor resolution, making it difficult to accurately detect facial regions with conventional methods such as the Viola-Jones algorithm, which is widely used on visual images [15,32].

We therefore hypothesized that a low-cost system capturing visual and thermal images simultaneously may increase the accuracy of locating specific facial regions of interest (ROIs), and consequently improve feature extraction and emotion discrimination. To this end, we consider a little-explored alternative: first applying the Viola-Jones algorithm on the visual image to locate the desired ROIs, then transferring them to the corresponding thermal image, and finally correcting the ROI locations with a method based on error probability that takes into account the manual annotations of a trained expert over a reference frame.

Thus, the goal of this work is to propose a system able to detect facial ROIs for five emotions (disgust, fear, happiness, sadness, and surprise) in typically developing children during an interaction with a social robot (used as an affective stimulus). In this study, our low-cost camera system provides pairs of synchronized images: ROIs are detected in the visual image using the Viola-Jones algorithm, and then transferred to the corresponding thermal frame through a homography matrix. As the main novelty, we introduce a new way to improve the ROI locations after applying both Viola-Jones and the homography transform. This approach computes the error probability to automatically find the ROI over the thermal image that is best placed according to the manual annotations of a trained expert.

This ROI of highest probability (i.e., lowest location error) is later used to relocate the other ROIs, improving the overall accuracy. Better appearance features can then be extracted, increasing the emotion discrimination of our proposed recognition system. This method may also be extended to other studies aiming to accurately locate physiologically relevant ROIs over facial thermal images, as described in [17,21], helping to understand phenomena linked to behaviours, emotions, stress, and human interactions, among others. The relevance of this work lies in a system capable of detecting ROIs on the child's face, which have neurophysiological importance for emotion recognition, through thermal images recorded in an unobtrusive way. Additionally, methods for feature extraction and dimensionality reduction are applied on specific ROIs for emotion recognition using Linear Discriminant Analysis (LDA). As another highlight, a set of visual and thermal images was acquired in an atypical context, in which a social robot is used as an emotional stimulus during an interaction with children, to test the proposed system for ROI detection and emotion recognition. To our knowledge, this type of approach has not been explored in other studies.

This work is structured as follows. Section 2 presents a description of several works of the state-of-the-art. Section 3 presents the system for image acquisition, in addition to the proposal based on the Viola-Jones algorithm and error probability for facial ROI location. Moreover, the experimental protocol and the methods for feature extraction, dimensionality reduction, and classification are described. Section 4 presents the experimental findings about the automatic method for ROI placement and children's emotion recognition during the interaction with the robot. Afterwards, Section 5 discusses the findings of this work and compares them to previous studies, summarizing its main contributions and limitations. Finally, Section 6 presents the conclusion and future works.

#### **2. Related Works**

Many studies on contact-free facial emotion recognition have proposed automatic methods for face and facial ROI detection over visual and thermal images, since constructing an effective face representation from images is a crucial step for successful automatic facial action analysis. In this field, the Facial Action Coding System (FACS) is a taxonomy of human facial expressions designed to facilitate human annotation of facial behaviour [9,14,33]. It specifies a total of 32 atomic facial muscle actions, termed Action Units (AUs), and 14 additional descriptors related to miscellaneous actions, which are widely used by automatic methods to locate facial landmarks and ROIs. These regions are used by methods based on geometric [9,10,15] and appearance features [9,14] to discriminate emotions. Appearance representations use textural information by considering pixel intensity values, whereas geometric representations ignore texture and describe shape explicitly [9,14,15]. Here, we focus our review of the state-of-the-art on approaches using only appearance features on the target face, which are generally computed by dividing the face region into a regular grid (holistic representation). Appearance features can encode low- or high-level information. For example, low-level information can be encoded through low-level histograms, which are computationally simple and ideal for real-time applications, Gabor representations, or data-driven representations such as bag-of-words. Higher-level information can be encoded through Non-Negative Matrix Factorization (NMF) [9]. However, the effectiveness of feature extraction for emotion discrimination may be affected by several factors, such as head-pose variations, illumination variations, face registration, and occlusions [9].

In [16], the authors used the Haar classifier for face detection, which is widely applied due to its high detection accuracy and real-time performance [32]. They extracted appearance features from the global face region by applying Local Binary Pattern (LBP) histograms, which capture minor changes of facial expression across emotions [9,34], followed by Principal Component Analysis (PCA) for dimensionality reduction to improve real-time computation speed for six emotions (anger, disgust, fear, happiness, sadness, and surprise). This approach is customizable from person to person, and achieved an accuracy (ACC) of 97%. It is worth mentioning that, unlike a global-feature-based approach, different face regions have different levels of importance for emotion recognition [17]. For example, the eyes and mouth contain more information than the forehead and cheeks. Note that LBP has been widely used in emotion recognition research; refer to Ref. [34] for a comprehensive study of LBP-based methods.

Another study [14] extracted appearance features from specific regions by dividing the entire face into domain-specific local regions, using the landmark detection method presented in Ref. [35], which employs an ensemble of regression trees. The authors used facial point locations to define a set of 29 face regions covering the whole face, based on expert knowledge of face geometry and AU-specific facial muscle contractions, as shown in Ref. [33]. The ensemble of regression trees estimates the face landmark locations directly from a sparse subset of pixel intensities, achieving super-real-time performance with high-quality predictions. They likewise used the LBP descriptor for appearance feature extraction, achieving an ACC of 93.60% after applying a Support Vector Machine (SVM) with a Radial Basis Function (RBF) kernel.

In Ref. [36], a comparative study of methods for feature extraction was conducted, including Kernel Discriminant Isometric Mapping (KDIsomap), PCA, Linear Discriminant Analysis (LDA), Kernel Principal Component Analysis (KPCA), Kernel Linear Discriminant Analysis (KLDA), and Kernel Isometric Mapping (KIsomap). KDIsomap achieved the best performance (ACC of 81.59% on the JAFFE database, and 94.88% on the Cohn-Kanade database) for seven emotions (anger, joy, sadness, neutral, surprise, disgust, and fear), but without significant difference compared with the other approaches. The authors used the well-known Viola-Jones algorithm to detect the face [32], which is suitable for real-time applications. This method uses a cascade of classifiers employing Haar-wavelet features, and the eye position detected in the face region is usually used to align the other detected face regions.

In Ref. [37], the authors propose the Central Symmetric Local Gradient Coding (CS-LGC) algorithm, which defines the neighborhood as a 5 × 5 grid and uses center symmetry to extract gradient information in four directions (horizontal, vertical, and two diagonals) over the most representative target pixels. They then applied PCA for dimensionality reduction, followed by the Extreme Learning Machine (ELM) algorithm. This approach was evaluated on the JAFFE and Cohn-Kanade databases, which contain grayscale visual images for the following emotions: anger, disgust, fear, happiness, neutral, sadness, and surprise. Accuracies of 98.33% and 95.24% were obtained for Cohn-Kanade and JAFFE, respectively, comparatively better than other feature extraction operators such as LBP.

Several studies for emotion recognition have used two kinds of cameras (one visual and one infrared), such as Ref. [20]. Those authors proposed a fusion scheme applying PCA over both thermal and visual faces for feature extraction, and k-nearest neighbors to recognize two classes (surprised and laughing) with a mean ACC of 75%. Additionally, in Ref. [23], a comparison of emotion recognition using visual and infrared cameras was carried out: four typical methods, including PCA, PCA plus LDA, Active Appearance Model (AAM), and AAM plus LDA, were used on visual images for feature extraction, whereas PCA and PCA plus LDA were applied on infrared thermal images using four ROIs (forehead, nose, mouth, and cheeks). These authors used k-nearest neighbors to recognize six emotions (sadness, anger, surprise, fear, happiness, and disgust). It is worth mentioning that those authors located the eyes over the thermal images manually, and these locations were later used to place the aforementioned four ROIs. In Ref. [22], an interesting approach using both kinds of cameras was addressed, including the use of eyeglasses, which are opaque to the thermal camera but visible to the visual camera.

Another interesting work shows that infrared and visual cameras can be combined into a multi-modal sensor system to recognize fear [24], through electroencephalogram (EEG) signals, eye blinking rate, and facial temperature while the user watched a horror movie. An Adaptive Boosting (AdaBoost) algorithm was used to detect the face region, and a geometric transform was used to make the coordinates of the two images (visible-light and thermal) coincide. Similarly, another study was conducted on Post-Traumatic Stress Disorder (PTSD) patients to infer fear through visual and thermal images [38]. In Ref. [39], an algorithm for automatic determination of the head center in thermograms was proposed, which proved to be sensitive to head rotation and position. In Ref. [40], the authors proposed unsupervised local and global feature extraction for facial emotion recognition through thermal images. For this purpose, they used a bimodal threshold to locate the face for feature extraction by PCA, after applying a clustering-based method to detect points of interest; for facial expression classification, a Support Vector Machine committee was used. In Ref. [17], the face was extracted from thermal images by applying both median and Gaussian filters, with further binarization to convert the grayscale image into pure black and white, and removal of small sets of non-connected pixels to enhance image quality. Afterwards, appearance features were extracted from defined ROIs over the thermal images, followed by Fast Neighbourhood Component Analysis (FNCA) for feature selection and LDA for recognition of five emotions.

More details about methods for feature extraction, dimensionality reduction, feature selection, and classification can be found in individual studies [37] and in extensive reviews, such as Refs. [9,10].

The next section presents our proposed system for the recognition of five emotions, which accurately locates facial ROIs over thermal images, improving appearance feature extraction.

#### **3. Materials and Methods**

#### *3.1. Experimental Procedure*

Seventeen typically developing children, 9 boys and 8 girls (aged between 8 and 12 years), participated in this study. They were recruited from elementary schools in Vitoria, Brazil. All had their parents' permission through signed Terms of Free and Informed Consent. In addition, the children signed a Term of Assent, confirming their wish to participate. This study was approved by the Ethics Committee of the Federal University of Espirito Santo (UFES)/Brazil, under number 1,121,638. The experiments were conducted in a room within the children's school environment, where the room temperature was kept between 20 °C and 24 °C under constant luminous intensity, as done in Ref. [39].

A mobile social robot (see Figure 1b), called N-MARIA (New-Mobile Autonomous Robot for Interaction with Autistics), built at UFES/Brazil to assist children during social relationship rehabilitation, was used in our research. This robot has attached a camera system to record facial images during interaction with children. More details about N-MARIA are given in Section 3.2.1.

The experiment was conducted in three phases, as follows. First, N-MARIA was covered with a black sheet in the room, except for its attached camera system, which was turned on to record visual and thermal images of the frontal view at a sampling rate of 2 fps for further processing. Afterwards, the child was invited to enter the room and sit comfortably for explanations about the general activities related to the experiment, remaining in a relaxed state for a minimum period of 10 min in order to adapt her/his body to the room temperature, allowing the skin temperature to stabilize for baseline recordings, following similar studies carried out in Refs. [21,41]. Once the relaxation period was completed, the child was placed standing in front of the covered robot, about 70 cm away from it. Immediately, recordings of the child's face by the camera system were carried out for one minute with the robot covered, one minute with the robot uncovered, and three minutes of interaction with the robot. Afterwards, the child spent two minutes answering a questionnaire about the experiment.

**Figure 1.** Experimental setup showing the child-robot interaction. (**a**) Before showing the robot; (**b**) After presenting it.

The first part of the recording (robot covered) corresponds to the experimental stage called Baseline, whereas the next stage, with the uncovered robot, is called Test. Before the robot was uncovered, the child was asked to look forward continuously, without sudden facial movements or touching the face, avoiding any facial obstruction during video recordings.

After the removal of the black sheet that covered the robot, the robot's first dialogue (self-presentation) was started. In addition to the self-introduction, prompted dialogues during the experiment included questions, positive reinforcement, and invitations. During the interaction with the child, which lasted two minutes, the child was encouraged to communicate and interact tactilely with the robot. At the end of the experiment, the child was again invited to sit and answer a structured interview about her/his feelings before and after seeing the robot, and also about the robot's structure (whether the child liked it, what she/he liked most, and what the child would change about it).

#### *3.2. Contact-Free Emotion Recognition*

Figure 2 shows the proposed contact-free system for emotion recognition, which is composed of the following four steps: (a) camera calibration; (b) image acquisition and automatic ROI detection; (c) ROI replacement; (d) feature extraction followed by the dimensionality reduction and emotion classification.

Figure 2a shows the first stage, which calibrates the camera system by obtaining a homography matrix mapping pixels of the visual camera image into the thermal camera image, considering the fixed relative position between the two cameras. Another procedure is also performed to obtain a frame containing the intrinsic noise of the infrared sensor, which is later used in the second stage (Figure 2b) to remove the sensor noise (inherent to the camera) from the current thermal image. In this second stage, synchronous images from both visual and infrared cameras are acquired and pre-processed to enhance the automatic facial ROI detection by applying the Viola-Jones algorithm on the visual image. Then, the ROIs placed on the visual image are projected into the thermal image using the homography matrix. In the third stage, manual annotations by a trained expert over a reference frame are used to accurately relocate the ROIs through our error-probability-based approach, as shown in Figure 2c. Afterwards, feature vectors related to thermal variations are computed on the detected ROIs and then reduced by applying PCA, for recognition of five emotions in the last stage by LDA. More details about the proposed recognition system are given in the next subsections.

**Figure 2.** Overview of the proposed system for emotion recognition during a child-robot interaction. (**a**) Camera calibration; (**b**) image acquisition and automatic region of interest (ROI) detection; (**c**) ROI replacement; (**d**) feature extraction followed by the dimensionality reduction and emotion classification.

#### 3.2.1. Camera System and N-MARIA

The camera system, composed of both visual and thermal cameras, was attached to the head of the social robot, as shown in Figure 3. These two cameras were fixed so that both had approximately the same visual field. To capture thermal variations, a low-cost camera (Therm-App®) was used, with a spatial resolution of 384 × 288 pixels, a frame rate of 8.7 Hz, and temperature sensitivity <0.07 °C. The acquired thermal images were normalized to an 8-bit gray scale (brightness from 0 to 255), where darker pixels correspond to lower temperatures and lighter pixels to higher temperatures. Moreover, a C270 HD Webcam (Logitech) was used to obtain visual images in RGB format, with a resolution of 1.2 MP.
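The gray-scale normalization described above can be sketched as a simple min-max rescaling of raw sensor values. This is only an illustration of the mapping; the function name and the raw input values are hypothetical, not from the Therm-App pipeline itself.

```python
import numpy as np

def normalize_thermal(raw: np.ndarray) -> np.ndarray:
    """Map raw thermal counts to 8-bit gray scale (0 = coldest, 255 = hottest)."""
    lo, hi = raw.min(), raw.max()
    if hi == lo:                      # uniform frame: avoid division by zero
        return np.zeros_like(raw, dtype=np.uint8)
    scaled = (raw - lo) / (hi - lo)   # in [0, 1]; darker values = cooler pixels
    return (scaled * 255).astype(np.uint8)

# Toy 2x2 frame of raw counts (illustrative values only).
frame = np.array([[300, 310], [305, 320]], dtype=float)
print(normalize_thermal(frame))
```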

**Figure 3.** N-MARIA (New-Mobile Robot for Interaction with Autistics) developed at Federal University of Espirito Santo (UFES)/Brazil.

The robot was built 1.41 m tall, considering the standard height of 9–10-year-old children. Additionally, soft malleable materials were used on the robot's structure to protect both the children and the robot's internal devices. A Pioneer 3-DX mobile platform was responsible for locomotion, a 360° laser sensor was used to locate the child in the environment, and a tablet was used as the robot's face to display seven dynamic facial expressions during the robot-child interaction. These expressions could also be remotely controlled through another tablet by the therapist, who could likewise control the robot's behavior and the dialogues emitted by the speakers.

#### 3.2.2. Camera Calibration

The camera calibration is done through synchronous acquisition of visual and thermal images of a chessboard built with aluminum and electrical tape, positioned at several different angles. The images obtained are processed with the OpenCV calibration software [42], which uses the Direct Linear Transform (DLT) to return a homography matrix [43], allowing robust transformation of points from the visual image to the thermal image [32]. It is worth mentioning that no single homography matrix matches points exactly in all regions of the face (as they do not lie in the same plane), but the matrix obtained by DLT is an efficient approximation.
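The DLT step that OpenCV performs internally can be illustrated in a few lines of NumPy: stack two linear constraints per point correspondence and take the null vector of the system via SVD. The corner coordinates below are illustrative placeholders, not measured calibration data.

```python
import numpy as np

def dlt_homography(src, dst):
    """Estimate H (dst ~ H @ src, homogeneous) from >= 4 point pairs via the
    Direct Linear Transform, as OpenCV's calibration does internally."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.asarray(rows, dtype=float)
    _, _, Vt = np.linalg.svd(A)
    H = Vt[-1].reshape(3, 3)          # null vector = homography up to scale
    return H / H[2, 2]                # fix the free scale

def to_thermal(pt, H):
    """Map a visual-image pixel into thermal-image coordinates."""
    q = H @ np.array([pt[0], pt[1], 1.0])
    return q[:2] / q[2]               # back from homogeneous coordinates

# Illustrative chessboard-corner correspondences (visual -> thermal).
visual = [(100, 100), (500, 100), (500, 400), (100, 400)]
thermal = [(40, 35), (330, 35), (330, 255), (40, 255)]
H = dlt_homography(visual, thermal)
print(np.round(to_thermal((300, 250), H), 1))   # ≈ [185., 145.] for this toy setup
```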

Another procedure is also carried out to remove the intrinsic thermal noise of infrared sensors [44], which increases the quality of thermal images by correcting undesirable offsets. For this, a reference recording is made of an object with uniform body temperature covering the visual field of the thermal camera; ideally, this frame would have the same brightness for all pixels, but in practice it contains the intrinsic noise of the infrared sensor. This noise frame is therefore used in the pre-processing stage to eliminate the thermal noise, as described in the next section.

#### 3.2.3. Image Acquisition and Pre-Processing

The thermal camera has a maximum acquisition rate of 8.7 fps, whereas the visual camera reaches 30 fps. Thus, to obtain temporal consistency, both visual and thermal images were simultaneously recorded at a sampling rate of 2 fps, which was suitable for our purposes.

During acquisition, the frame with the intrinsic sensor noise obtained in the calibration stage was subtracted pixel by pixel from the current image to remove the intrinsic thermal noise. Finally, a median filter was used to reduce salt-and-pepper noise in the thermal image.
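The two pre-processing operations above can be sketched as follows. The text does not state the median-filter window, so the 3 × 3 kernel here is an assumption, and `preprocess_thermal` is a hypothetical name for illustration.

```python
import numpy as np
from scipy.ndimage import median_filter

def preprocess_thermal(frame: np.ndarray, noise_frame: np.ndarray) -> np.ndarray:
    """Subtract the fixed-pattern sensor noise captured at calibration, then
    suppress salt-and-pepper noise with a median filter (3x3 window assumed)."""
    cleaned = frame.astype(np.int16) - noise_frame.astype(np.int16)
    cleaned = np.clip(cleaned, 0, 255).astype(np.uint8)   # keep 8-bit range
    return median_filter(cleaned, size=3)
```

A uniform 5 × 5 test frame with one hot "salt" pixel comes out flat after the two steps, which is exactly the intended effect.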

#### 3.2.4. Face Landmark Detection

An automatic method is proposed here for face landmark detection over a given set of frames $\mathbf{I} = \{\mathbf{i}_1, \mathbf{i}_2, \ldots, \mathbf{i}_b, \ldots, \mathbf{i}_B\}$, taking as reference the ROIs annotated by a trained expert on the frame $\mathbf{i}_A$ ($b = A$), as shown in Figure 4. These manual annotations were located on eleven ROIs $\mathbf{R}_A = \{\mathbf{R}_A^1, \mathbf{R}_A^2, \ldots, \mathbf{R}_A^k, \ldots, \mathbf{R}_A^{11}\}$ of the thermal images, taking into account the relevance of these ROIs in other studies of facial emotion recognition [17,21]. Here, the facial ROI sizes were computed in the same way as in Refs. [17,18,21], using the width of the head ROI and the following defined proportions [18]: 6.49% for the nose, 14.28% for the forehead, 3.24% for the periorbital region, 9.74% for the cheek, 3.24% for the perinasal region, and 5.19% for the chin [17].

**Figure 4.** Facial ROIs. **R**1, right forehead side; **R**2, left forehead side; **R**3, right periorbital side; **R**4, left periorbital side; **R**5, tip of nose; **R**6, right cheek; **R**7, left cheek; **R**8, right perinasal side; **R**9, left perinasal side; **R**10, right chin side; **R**11, left chin side.

#### 3.2.5. Automatic ROI Detection

Infrared images are more blurred than color images [23]; therefore, ROI detection over thermal images from low-cost cameras is a challenge. For this reason, the well-known Viola-Jones algorithm [32] was applied on the color images to detect the head and other facial regions, such as the nose and eyes [17,21,32]. These initially detected regions were then used as references to automatically locate eleven ROIs within the face (see Table 1), namely the nose, both sides of the forehead, the cheeks, the chin, the periorbital area (close to the eyes), and the perinasal area (below the nose).

**Table 1.** Reference ROIs used to locate face landmarks over a frame $\mathbf{i}_b$.


$\mathbf{R}_b^1$, right forehead side; $\mathbf{R}_b^2$, left forehead side; $\mathbf{R}_b^3$, right periorbital side; $\mathbf{R}_b^4$, left periorbital side; $\mathbf{R}_b^5$, tip of nose; $\mathbf{R}_b^6$, right cheek; $\mathbf{R}_b^7$, left cheek; $\mathbf{R}_b^8$, right perinasal side; $\mathbf{R}_b^9$, left perinasal side; $\mathbf{R}_b^{10}$, right chin side; $\mathbf{R}_b^{11}$, left chin side.

In our study, the facial ROI sizes were also computed using the width of the head and the aforementioned proportions [17,18,21]. Additionally, the facial ROIs were spatially placed taking the expert annotation as reference. Afterwards, the corresponding facial ROIs were projected onto the thermal image through the homography transformation (see Section 3.2.2), as shown in Figure 4. Here, the ROI set of a thermal frame $b$ is defined as $\mathbf{R}_b = \{\mathbf{R}_b^0, \mathbf{R}_b^1, \ldots, \mathbf{R}_b^k, \ldots, \mathbf{R}_b^{11}\}$, where $\mathbf{R}_b^0$ is the head ROI and $\mathbf{R}_b^k$, for $k = 1$ to $11$, are the facial ROIs. Notice that $\mathbf{R}_b^k$ is described by pixels $R_{ij}$ in the range 0 to 255 (8-bit gray scale).
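Sizing the ROIs from the detected head width is a direct application of the proportions listed in Section 3.2.4. The helper below is a minimal sketch (the function name and the assumption of square ROIs are ours; only the proportion values come from the text).

```python
# ROI side lengths as fixed fractions of the detected head-ROI width,
# following the proportions given in the text (Section 3.2.4).
ROI_PROPORTIONS = {
    "forehead": 0.1428, "cheek": 0.0974, "nose": 0.0649,
    "chin": 0.0519, "periorbital": 0.0324, "perinasal": 0.0324,
}

def roi_size(head_width_px: int, region: str) -> int:
    """Square ROI side (pixels) for a facial region, given the head width."""
    return round(head_width_px * ROI_PROPORTIONS[region])

# For a 200-px-wide head, the nose ROI is 0.0649 * 200 ≈ 13 px per side.
print(roi_size(200, "nose"))
```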

Figure 2b,c show our proposal to accurately locate facial ROIs, formed by the following two stages: (1) automatic ROI detection and (2) ROI placement correction.

#### 3.2.6. ROI Location Correction

A new method is proposed here to accurately correct the ROIs detected by the Viola-Jones algorithm, taking into account all pre-defined ROI positions manually annotated on a first frame by a trained expert.

Let $\mathbf{R}_b^k$ be an automatically detected ROI over the thermal frame $\mathbf{i}_b$, with coordinate $\mathbf{C}_b^k = (C_{bx}^k, C_{by}^k)$ at its upper-left corner, corresponding to the annotated ROI $\mathbf{R}_A^k$ over $\mathbf{i}_A$, with coordinate $\mathbf{C}_A^k = (C_{Ax}^k, C_{Ay}^k)$ at its upper-left corner. Then, two probabilities $p_{bx}^k$ and $p_{by}^k$ can be calculated for $\mathbf{R}_b^k$ with respect to the expert annotation, along the $x$ and $y$ axes respectively, as described in Equations (1) and (2). These values approach 1 when the location of $\mathbf{R}_b^k$ highly agrees with the manual annotation of the trained expert. Notice that $\mathbf{R}_b^k$ is obtained automatically by applying the Viola-Jones algorithm with the fixed proportions defined in Section 3.2.4.

$$p_{bx}^{k} = \frac{\exp\left(-\left|\frac{C_{bx}^{k}}{W_b} - \frac{C_{Ax}^{k}}{W_A}\right|\right)}{\sum_{i=1}^{11} \exp\left(-\left|\frac{C_{bx}^{i}}{W_b} - \frac{C_{Ax}^{i}}{W_A}\right|\right)},\tag{1}$$

$$p_{by}^{k} = \frac{\exp\left(-\left|\frac{C_{by}^{k}}{H_b} - \frac{C_{Ay}^{k}}{H_A}\right|\right)}{\sum_{i=1}^{11} \exp\left(-\left|\frac{C_{by}^{i}}{H_b} - \frac{C_{Ay}^{i}}{H_A}\right|\right)},\tag{2}$$

$$p_{b}^{k} = \min\left\{ p_{bx}^{k},\ p_{by}^{k} \right\},\tag{3}$$

where $k$ refers to the facial ROI under analysis, taking values from 1 to 11; $W_b$ and $H_b$ are the width and height of $\mathbf{R}_b^0$ (head ROI for $\mathbf{i}_b$), respectively; $W_A$ and $H_A$ are the width and height of $\mathbf{R}_A^0$ (head ROI for $\mathbf{i}_A$); and $p_{bx}^k$ and $p_{by}^k$ are the probabilities that $\mathbf{R}_b^k$ is correctly located on $\mathbf{i}_b$ with respect to the trained expert annotation, along the $x$ and $y$ axes, respectively.
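Equations (1)–(3) amount to a softmax over the (negated) normalized corner offsets, taken independently per axis. A minimal vectorized sketch (function name ours):

```python
import numpy as np

def location_probabilities(C_b, C_A, Wb, Hb, WA, HA):
    """Eqs. (1)-(3): per-ROI agreement with the expert annotation.
    C_b, C_A: (11, 2) arrays of upper-left ROI corners for frame b and the
    annotated frame A; W*/H* are the head-ROI widths and heights."""
    dx = np.abs(C_b[:, 0] / Wb - C_A[:, 0] / WA)   # normalized x offsets
    dy = np.abs(C_b[:, 1] / Hb - C_A[:, 1] / HA)   # normalized y offsets
    px = np.exp(-dx) / np.exp(-dx).sum()           # Eq. (1)
    py = np.exp(-dy) / np.exp(-dy).sum()           # Eq. (2)
    return np.minimum(px, py)                      # Eq. (3)
```

When every ROI corner agrees perfectly with the annotation (in head-normalized coordinates), all eleven probabilities collapse to 1/11, as the softmax form implies.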

Finally, the ROI $\mathbf{R}_b^k$ of highest probability $p_b^k$ (whose corner $\mathbf{C}_b^k$ is denoted $\mathbf{C}_b^{ref}$) is selected as the reference to correct the locations of the other ROIs, using Equations (4) and (5). The notation $\mathbf{C}_A^{ref}$ is used analogously for the annotated frame $\mathbf{i}_A$.

$$C_{bx}^{k\prime} = C_{bx}^{ref} + \left(C_{Ax}^{k} - C_{Ax}^{ref}\right)\frac{W_b}{W_A},\tag{4}$$

$$C_{by}^{k\prime} = C_{by}^{ref} + \left(C_{Ay}^{k} - C_{Ay}^{ref}\right)\frac{H_b}{H_A},\tag{5}$$

where $\mathbf{C}_b^{k\prime} = (C_{bx}^{k\prime}, C_{by}^{k\prime})$ is the upper-left corner coordinate of the relocated ROI $\mathbf{R}_b^k$.
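The published form of Equations (4) and (5) is ambiguous about how the head dimensions enter; the sketch below takes one plausible reading, in which each ROI is placed at the reference corner plus the expert-annotated offset rescaled by the head-size ratio between the two frames. Treat it as an interpretation, not the authors' exact implementation.

```python
import numpy as np

def relocate_rois(C_b_ref, C_A, ref, Wb, Hb, WA, HA):
    """Relocation in the spirit of Eqs. (4)-(5): reference corner in frame b
    plus annotated offsets from frame A, rescaled by the head-size ratio.
    C_A: (K, 2) annotated corners; C_b_ref: corner of the best-placed ROI in b;
    ref: index of that reference ROI."""
    scale = np.array([Wb / WA, Hb / HA])
    return np.asarray(C_b_ref, float) + (np.asarray(C_A, float) - C_A[ref]) * scale
```

With identical head sizes in both frames the relocation reduces to a pure translation, which is an easy sanity check.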

#### 3.2.7. Feature Extraction

Given a thermal frame $\mathbf{i}_b$ formed by a set of ROIs $\mathbf{R}_b = \{\mathbf{R}_b^1, \mathbf{R}_b^2, \ldots, \mathbf{R}_b^k, \ldots, \mathbf{R}_b^K\}$, with $K = 11$ the total number of ROIs, it is possible to extract from $\mathbf{R}_b$ a feature vector $\mathbf{F}_b = \{\mathbf{f}_b^1, \mathbf{f}_b^2, \ldots, \mathbf{f}_b^k, \ldots, \mathbf{f}_b^K\}$ that describes a pattern related to an emotion, where $\mathbf{f}_b^k = \{f_{b1}^k, f_{b2}^k, \ldots, f_{b14}^k\}$ are the features of $\mathbf{R}_b^k$. Table 2 presents the features adopted in our study, which agree with Ref. [17].

In Table 2, $\mathbf{R}_b^k$ is the current ROI for feature extraction, $\overline{\mathbf{R}_b^k}$ is its average value, $\sigma_b^{2,k}$ is its variance, and $f_{b(c+7)}^k$, for $c = 1$ to $7$, are seven further features corresponding to the differences of the computed features across consecutive frames. With eleven ROIs (see Figure 4) and 14 features for each of them, this gives a set of 154 features per frame.


**Table 2.** Features computed in each ROI.
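The 11 ROIs × (7 per-frame features + 7 consecutive-frame differences) = 154 layout can be sketched as below. The text names only the mean and variance among the per-frame features, so the remaining five descriptors here (min, max, median, quartiles) are placeholders, not the actual Table 2 entries.

```python
import numpy as np

def roi_features(roi: np.ndarray) -> np.ndarray:
    """Seven per-frame ROI features. Only the mean and variance are named in
    the text; the other five descriptors are illustrative stand-ins."""
    return np.array([roi.mean(), roi.var(), roi.min(), roi.max(),
                     np.median(roi), np.percentile(roi, 25),
                     np.percentile(roi, 75)])

def frame_features(rois_t, rois_prev):
    """154-D vector: per ROI, 7 features plus their differences with respect
    to the previous frame (11 ROIs x 14 features)."""
    feats = []
    for roi, prev in zip(rois_t, rois_prev):
        f = roi_features(roi)
        feats.append(np.concatenate([f, f - roi_features(prev)]))
    return np.concatenate(feats)
```

Feeding the same eleven ROI patches as "current" and "previous" frame yields a 154-element vector whose difference features are all zero, confirming the layout.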

3.2.8. Dimensionality Reduction and Emotion Classification

Let $\mathbf{T} = \{(\mathbf{F}_1, y_1), (\mathbf{F}_2, y_2), \ldots, (\mathbf{F}_b, y_b), \ldots, (\mathbf{F}_n, y_n)\}$ be the training set, where $n$ is the number of samples and $\mathbf{F}_b$ is a $d$-dimensional feature vector with class label $y_b \in \{1, 2, \ldots, 5\}$. PCA based on Singular Value Decomposition [9,16,20,23,36] is applied on $\mathbf{F}_b$ to obtain the principal component coefficients, which are used in both training and validation sets to reduce the set of 154 features, allowing robust and fast emotion recognition. As advantages, PCA is not very sensitive to different training sets, and it can outperform other methods such as LDA when the training set is small [45]. PCA has been successfully used in many studies to represent, in a lower-dimensional subspace, the high-dimensional feature vectors obtained by appearance-based methods [16,23]. It is worth mentioning that, before applying PCA in our study, the feature vectors of the training set were normalized using their mean and standard deviation values as reference; the validation set was then normalized using the same reference values (mean and standard deviation) obtained from the training set.
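The key point above — fit the normalization and the PCA on the training set only, then apply both to the validation set — can be sketched with scikit-learn. The random data and the 95% retained-variance threshold are illustrative assumptions (the paper does not state how many components were kept).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 154))   # placeholder 154-feature vectors
X_val = rng.normal(size=(20, 154))

# Normalize with the training mean/std only, then project both sets.
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=0.95)           # retained-variance level is an assumption
Z_train = pca.fit_transform(scaler.transform(X_train))
Z_val = pca.transform(scaler.transform(X_val))
print(Z_train.shape, Z_val.shape)      # same reduced dimensionality for both
```

Fitting the scaler and PCA on the validation set as well would leak information into the evaluation, which is precisely what the train-only normalization avoids.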

Several classifiers are used in our study to assign samples to one of the emotion classes based on the feature set: LDA [17,23,46] and Quadratic Discriminant Analysis (QDA) [12,47], both applying full and diagonal covariance matrices, as well as three other classifiers, namely Mahalanobis discrimination [12,48], Naive Bayes [49], and Linear Support Vector Machine (LSVM) [12,14,50].
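A minimal scikit-learn sketch of this classifier comparison is shown below, on synthetic stand-in data. Note that Gaussian Naive Bayes coincides with QDA under a diagonal per-class covariance assumption; Mahalanobis discrimination has no off-the-shelf scikit-learn estimator and is omitted here.

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 20))          # e.g., PCA-reduced feature vectors
y = rng.integers(1, 6, size=150)        # five emotion class labels

classifiers = {
    "LDA": LinearDiscriminantAnalysis(),
    "QDA (full covariance)": QuadraticDiscriminantAnalysis(),
    "Naive Bayes (diagonal covariance)": GaussianNB(),
    "Linear SVM": LinearSVC(max_iter=5000),
}
for name, clf in classifiers.items():
    clf.fit(X, y)
    print(f"{name}: training accuracy = {clf.score(X, y):.2f}")
```

All four estimators share the same `fit`/`predict` interface, so they can be swapped inside one cross-validation loop.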

#### *3.3. Statistical Evaluation*

From images recorded during both moments of the experiment (Baseline and Test), a set of 220 thermography frames, randomly selected from 11 children, was annotated by a trained expert, who selected the ROIs defined in Figure 4. These annotated images were used as reference to evaluate, through Euclidean distances (see Equation (6)), the accuracy and precision of Viola-Jones both without ROI relocation and with our ROI relocation algorithm applied.

$$D = \sqrt{(A_x - M_x)^2 + (A_y - M_y)^2},\tag{6}$$

where $(A_x, A_y)$ is the coordinate obtained by the automatic method, and $(M_x, M_y)$ is the coordinate obtained by the manual annotation. A value of $D$ close to zero indicates high accuracy.
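Equation (6) is a plain Euclidean distance between the automatic and manual coordinates, computed per landmark; the coordinate values below are hypothetical.

```python
import numpy as np

def landmark_error(auto_xy, manual_xy):
    """Euclidean distance D (Equation (6)) between the coordinate from
    the automatic method and the manually annotated coordinate."""
    ax, ay = auto_xy
    mx, my = manual_xy
    return np.sqrt((ax - mx) ** 2 + (ay - my) ** 2)

# Hypothetical example: automatic detection 3 px right and 4 px above
# the manual annotation gives D = 5 px.
print(landmark_error((120, 85), (123, 81)))  # → 5.0
```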

The statistical analysis used for comparison between both approaches for each ROI was the Wilcoxon Signed Rank Test for zero median.
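Since both approaches are evaluated on the same annotated frames, the distances form paired samples, and the Wilcoxon signed-rank test checks whether the median of their differences is zero. A sketch with hypothetical distance values:

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-frame distances D (in pixels) for one ROI under both
# approaches, measured on the same annotated frames (paired samples).
rng = np.random.default_rng(2)
d_without = rng.normal(6.0, 1.0, size=30)   # Viola-Jones alone
d_with = rng.normal(3.0, 1.0, size=30)      # with ROI relocation

# One-sample signed-rank test on the paired differences,
# H0: the differences have zero median.
stat, p = wilcoxon(d_without - d_with)
print(f"W = {stat:.1f}, p = {p:.2e}")
```

Equivalently, `wilcoxon(d_without, d_with)` performs the same paired test; a small *p*-value indicates a systematic difference between the two approaches for that ROI.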

In order to evaluate our proposed system for emotion recognition, a published database (available in the supporting information of [17] at https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0212928) was used, which is formed by feature vectors labeled with the following five emotions: disgust, fear, happiness, sadness, and surprise. This database was collected from 28 typically developing children (age: 7–11 years) using an infrared thermal camera [17]. It is worth noting that this database was also created with children of a similar age range, using the same thermal camera and feature set (a total of 154 features) described in our study (see Sections 3.1, 3.2.2 and 3.2.7), computing this feature set over the ROIs defined in Figure 4. Notice that the correct locations of these ROIs were visually inspected by a trained expert. For this reason, it was possible to compare the recognition system using one of the following methods: PCA for dimensionality reduction or Fast Neighbor Component Analysis (FNCA) [17] for feature selection. Here, the training and validation sets were chosen over several runs of cross-validation ($k_{fold} = 3$), and metrics such as accuracy (ACC), Kappa, true positive rate (TPR), and false positive rate (FPR) were used [51].
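The evaluation loop above can be sketched as follows, with synthetic stand-in data and LDA as an example classifier. The per-class TPR and FPR are derived from the confusion matrix; scikit-learn provides ACC and Kappa directly.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, cohen_kappa_score, confusion_matrix

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 20))          # stand-in for reduced feature vectors
y = rng.integers(1, 6, size=150)        # five emotion class labels

accs, kappas = [], []
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
for tr, va in skf.split(X, y):
    clf = LinearDiscriminantAnalysis().fit(X[tr], y[tr])
    pred = clf.predict(X[va])
    accs.append(accuracy_score(y[va], pred))
    kappas.append(cohen_kappa_score(y[va], pred))
    cm = confusion_matrix(y[va], pred)
    tpr = np.diag(cm) / cm.sum(axis=1)                              # per-class TPR
    fpr = (cm.sum(axis=0) - np.diag(cm)) / (cm.sum() - cm.sum(axis=1))  # per-class FPR

print(f"mean ACC = {np.mean(accs):.2f}, mean Kappa = {np.mean(kappas):.2f}")
```

With random labels, ACC hovers around chance level (0.2 for five classes) and Kappa around zero, which is a useful sanity check before running the real feature vectors.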

On the other hand, this published database was used to train our proposed system based on PCA, but only using data collected from those children who presented an ACC higher than 85% during emotion recognition [17]. Then, our trained system was used to infer the children's emotions during the experimental protocol described in Section 3.1.
