**5. Discussion**

Several studies have addressed the recognition of some of the six emotions accepted by psychological theory: surprise, fear, disgust, anger, happiness and sadness [10,53–55]. For instance, facial motion and speech tone have played a relevant role in recognition systems for inferring these emotions, achieving accuracies (ACC) from 80% to 98% [56–58] and from 70% to 75% [53,59,60], respectively. Although facial expressions provide important clues about emotions, it is necessary to measure, by optical flow, the movements of specific facial muscles through markers located on the face [53,57,58]. However, this technique, which is not contact-free, may not be comfortable for studying or inferring the emotions of children with ASD, as these children present high skin sensitivity.
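As an illustration of this marker-based optical flow idea, the sketch below tracks a few facial marker positions between two frames with the Lucas-Kanade method in OpenCV; the frame files and marker coordinates are hypothetical placeholders, not the setup of Refs. [53,57,58].

```python
# Minimal sketch: tracking facial markers between two video frames with
# Lucas-Kanade optical flow (OpenCV). Paths and coordinates are placeholders.
import cv2
import numpy as np

prev_gray = cv2.imread("frame_t0.png", cv2.IMREAD_GRAYSCALE)
next_gray = cv2.imread("frame_t1.png", cv2.IMREAD_GRAYSCALE)

# Initial marker positions (x, y), e.g. brows, cheek and mouth corners.
markers = np.array([[120, 85], [180, 85], [150, 140], [130, 190], [170, 190]],
                   dtype=np.float32).reshape(-1, 1, 2)

# Estimate where each marker moved in the next frame.
new_pos, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, markers, None,
                                              winSize=(21, 21), maxLevel=3)

# Displacements of successfully tracked markers; their magnitude and direction
# encode the facial-muscle motion used as an emotion feature.
flow = (new_pos - markers)[status.ravel() == 1].reshape(-1, 2)
print("per-marker displacement (px):", np.linalg.norm(flow, axis=1))
```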

Following this non-contact approach, there are some APIs (Application Programming Interfaces) that allow facial detection, such as the "Emotion API" from Microsoft or the Oxford API [61,62]. Alternatively, some works use other image processing techniques, such as finding the region of the thermogram within a temperature range to detect the face [18]. Another approach is to keep the face in a fixed position using a chin support or headrest device [41,63]. Furthermore, in Ref. [64] the authors present a solution using neural networks and supervised shape classification methods applied to facial thermograms.

Emotion recognition systems based on IRTI have, in fact, shown promising results. Table 5 lists some studies that use ROI placement and subsequent emotion analysis, although with techniques different from our proposal. Unfortunately, the reported results are somewhat dispersed, and a fair comparison among them is not possible because of the different pictures used in each study. In Ref. [18], the authors proposed techniques for selecting facial ROIs and classifying emotions using a FLIR A310 camera. To detect the face, thermogram temperatures between 32 °C and 36 °C were used to define the face position; all other temperature points were considered background. The ROI positions were then calculated by proportions based on the head's width. Additionally, for emotion classification, the system was calibrated using a baseline (neutral state) that compensates the induced emotion by applying a fuzzy algorithm, thus calibrating the induced-emotion image. Using this baseline, the temperature is inferred by IF-THEN rules to calibrate the thermal images for the following induced emotions: joy, disgust, anger, fear and sadness. Next, a top-down hierarchical classifier was used for emotion classification, reaching a success rate of 89.9%.
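The following sketch illustrates the temperature-range segmentation and proportion-based ROI placement described above; the thermogram file and the specific proportions are illustrative assumptions, not the exact values of Ref. [18].

```python
# Sketch: segment the face as the 32-36 °C skin range, then place an ROI
# by proportions of the detected head width. File and proportions assumed.
import numpy as np

thermogram = np.load("thermogram.npy")        # 2-D array of temperatures (°C)

# Pixels in the skin range are taken as face; everything else is background.
face_mask = (thermogram >= 32.0) & (thermogram <= 36.0)
rows, cols = np.where(face_mask)
top, left, right = rows.min(), cols.min(), cols.max()
head_width = right - left

# Example proportional ROI: a forehead patch sized relative to the head width.
y0 = top + int(0.10 * head_width)
x0 = left + int(0.30 * head_width)
forehead = thermogram[y0:y0 + int(0.20 * head_width),
                      x0:x0 + int(0.40 * head_width)]
print("mean forehead temperature (°C):", forehead.mean())
```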


**Table 5.** Comparison among some strategies used to infer emotions using Infrared Thermal Imaging (IRTI).

ACC, accuracy; N/A means that the age or ACC was not reported.

Functional Infrared Thermal Imaging (fITI), considered a promising technique for inferring emotions through autonomic responses, was used in Refs. [21,63]. In one of these studies [63], fITI was used to compare volunteers' subjective ratings of displayed pictures, which were categorized as unpleasant, neutral and pleasant. While the volunteers watched these pictures, the authors collected the nose tip temperature (a chin support kept the face correctly located in the camera image), the nose tip being one of the most likely places to change temperature when a person is under some kind of emotion [17]. As a result, they found that pictures evoking emotions (whether positive or negative) were more likely to produce thermal variation, while the difference for the neutral images was not as great as for the others. Thus, their findings demonstrate that fITI can be a useful tool to infer emotions in humans.
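A minimal sketch of this kind of analysis is given below, assuming that per-frame nose-tip ROI temperatures have already been extracted for each picture category; the file names and the baseline window are placeholders.

```python
# Illustrative comparison of nose-tip thermal variation across picture
# categories, in the spirit of Ref. [63]. Input arrays are assumed to hold
# the mean nose-tip ROI temperature per frame (°C).
import numpy as np

categories = {"unpleasant": "nose_unpleasant.npy",
              "neutral":    "nose_neutral.npy",
              "pleasant":   "nose_pleasant.npy"}

for name, path in categories.items():
    temps = np.load(path)
    # Thermal variation relative to a baseline taken from the first frames.
    delta = temps - temps[:10].mean()
    print(f"{name}: max |dT| = {np.abs(delta).max():.2f} °C")
```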

Another interesting study [68] located facial points in grayscale visual images using Gabor-feature-based boosted classifiers, in which the authors used an adapted version of the Viola-Jones algorithm, with GentleBoost instead of AdaBoost, to detect the face. Gabor wavelets were used for feature extraction, detecting 20 ROIs that represent the facial feature points. All this detection was performed automatically and contact-free, starting from iris and mouth detection: the face was divided into two regions, and proportions were calculated to find the iris and mouth; from these, all other ROIs were calculated using proportions. The algorithm achieved a success rate of 93% on the Cohn-Kanade database, which contains expressionless pictures of 200 people. Although the Gabor wavelet transform is a representative method for extracting local features, it takes a long time and has a large feature dimension.
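A minimal sketch of Gabor-wavelet feature extraction around facial feature points is shown below; the filter-bank parameters and sample points are illustrative assumptions rather than the configuration of Ref. [68].

```python
# Sketch: Gabor filter-bank responses sampled at a few facial feature points.
# Orientations, wavelengths and points are assumptions for illustration.
import cv2
import numpy as np

gray = cv2.imread("face.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)

features = []
points = [(60, 80), (100, 80), (80, 120)]       # e.g. eyes and mouth corner
for theta in np.arange(0, np.pi, np.pi / 8):    # 8 orientations
    for lam in (4, 8, 16):                      # 3 wavelengths (pixels)
        kernel = cv2.getGaborKernel((21, 21), sigma=4.0, theta=theta,
                                    lambd=lam, gamma=0.5)
        response = cv2.filter2D(gray, cv2.CV_32F, kernel)
        features.extend(response[y, x] for (x, y) in points)

print("feature vector length:", len(features))  # 8 * 3 * 3 = 72
```

The large feature dimension mentioned above is visible here: even three sample points and a small bank already yield 72 values, and a full face with dense sampling grows quickly.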

Another method was proposed in Ref. [65], where a deep Boltzmann machine (DBM) was applied to recognize emotions from thermal facial images, using an existing database together with the participation of 38 adult volunteers. Their evaluation consisted of finding the emotion valence (positive or negative), and their accuracy reached 62.9%. In their study, since the face and the background have different temperatures, they were separated by applying the Otsu threshold algorithm to binarize the images. Then, the projection curves (both vertical and horizontal) were calculated to find the largest gradient and detect the face boundary.
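The sketch below illustrates this Otsu-plus-projection-curves idea: the binary image's column and row sums form the projection curves, whose largest gradients approximate the face boundary. The 8-bit thermal image path is a placeholder.

```python
# Sketch: Otsu binarization separates the warm face from the background;
# the largest gradients of the projection curves give the face boundary.
import cv2
import numpy as np

img = cv2.imread("thermal_face.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Projection curves: count of face pixels per column / per row.
vertical = binary.sum(axis=0).astype(np.float32)
horizontal = binary.sum(axis=1).astype(np.float32)

# Largest positive/negative gradients approximate left/right and top/bottom.
left, right = np.argmax(np.diff(vertical)), np.argmin(np.diff(vertical))
top, bottom = np.argmax(np.diff(horizontal)), np.argmin(np.diff(horizontal))
face = img[top:bottom, left:right]
print("face bounding box:", (left, top, right, bottom))
```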

Additionally, a model for expression recognition using thermal images of an adult volunteer was applied in Ref. [66]. These authors used eigenfaces for feature extraction from the volunteer's facial images through PCA to recognize five emotions (happiness, anger, disgust, sadness and neutral). As a highlight, that proposal reached an accuracy close to 97%; in that work, the authors computed eigenvalues and eigenfaces, trained the system with a set of images, used PCA to reduce the dimensionality, and applied a distance classifier to recognize the emotion.
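The following is a minimal eigenface pipeline in this spirit, using scikit-learn for the PCA and a nearest-neighbour distance classifier; the data files and the number of components are placeholders, not the configuration of Ref. [66].

```python
# Sketch: eigenfaces via PCA plus a distance (1-nearest-neighbour) classifier.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

X_train = np.load("train_faces.npy")    # (n_samples, h*w) flattened images
y_train = np.load("train_labels.npy")   # happiness, anger, disgust, sadness, neutral

pca = PCA(n_components=20)              # eigenfaces live in pca.components_
Z_train = pca.fit_transform(X_train)

clf = KNeighborsClassifier(n_neighbors=1, metric="euclidean")
clf.fit(Z_train, y_train)

X_test = np.load("test_faces.npy")
print("predicted emotions:", clf.predict(pca.transform(X_test)))
```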

In Ref. [13], the authors achieved an accuracy of 81.95% using histogram feature extraction combined with a multiclass SVM over thermal images of 22 volunteers from the Kotani Thermal Facial Expression (KTFE) database; four classes were studied: happiness, sadness, fear and anger. They used preprocessing techniques to prepare the images for the Viola-Jones algorithm, followed by de-noising and Contrast Limited Adaptive Histogram Equalization (CLAHE) for further image enhancement. For ROI detection, ratio-based segmentation was used.
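A simplified version of this preprocessing and classification chain could look as follows; the histogram design, file paths and labels are placeholders, not the exact features of Ref. [13].

```python
# Sketch: de-noising + CLAHE enhancement, histogram features, multiclass SVM.
import cv2
import numpy as np
from sklearn.svm import SVC

clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))

def histogram_features(path):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    img = cv2.fastNlMeansDenoising(img)          # de-noise
    img = clahe.apply(img)                       # contrast enhancement
    hist = cv2.calcHist([img], [0], None, [64], [0, 256]).ravel()
    return hist / hist.sum()                     # normalized intensity histogram

X = np.array([histogram_features(p) for p in ["face1.png", "face2.png"]])
y = np.array(["happiness", "sadness"])           # placeholder labels

svm = SVC(kernel="rbf", decision_function_shape="ovr")  # one-vs-rest multiclass
svm.fit(X, y)
```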

Moreover, the recognition of both baseline and affective states was carried out in Ref. [41]. To detect the face, the authors used a headrest to keep it in the correct position, together with a reference point (located on the top of the head) that was about 10 °C cooler than the skin temperature. To find the ROIs, the reference point was used and a radiometric threshold was applied; in case of loss of the reference point, it was manually corrected by the researchers.
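Purely as an illustration of this reference-point strategy, the sketch below isolates a marker roughly 10 °C cooler than the skin with a radiometric threshold and places an ROI at a fixed offset from it; the assumed skin temperature, tolerance and offsets are hypothetical, not the values of Ref. [41].

```python
# Sketch: locate a cool reference marker by radiometric threshold, then
# place an ROI relative to it. Temperatures and offsets are assumptions.
import numpy as np

thermogram = np.load("thermogram.npy")            # temperatures in °C
skin_temp = 34.0                                   # assumed skin temperature

# The reference marker sits close to skin_temp - 10 °C.
marker_mask = np.abs(thermogram - (skin_temp - 10.0)) < 1.0
ys, xs = np.where(marker_mask)
ref_y, ref_x = int(ys.mean()), int(xs.mean())      # marker centroid

# Example ROI placed at a fixed offset below the reference point.
forehead = thermogram[ref_y + 30:ref_y + 60, ref_x - 25:ref_x + 25]
print("reference point:", (ref_x, ref_y), "forehead mean:", forehead.mean())
```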

Another study [67] applied IRTI to a female adult volunteer, using neural networks and the backpropagation algorithm to recognize emotions such as happiness, surprise and the neutral state, reaching an ACC of 90%. To find the face, they used Otsu segmentation and computed the Feret diameter and the center of gravity of the binary image. Then, after segmenting the image, positions of the face based on FACS-AU were used to determine the heat variation and, thus, the emotion.
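The sketch below shows one way the Feret diameter and the center of gravity can be computed from the Otsu-binarized image, here via image moments and the convex hull; this is an interpretation of the description in Ref. [67], not their implementation.

```python
# Sketch: center of gravity from image moments; maximum Feret diameter as
# the largest pairwise distance between convex-hull points.
import cv2
import numpy as np
from scipy.spatial.distance import pdist

img = cv2.imread("thermal_face.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Center of gravity of the binary face region.
m = cv2.moments(binary, binaryImage=True)
cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]

# Maximum Feret diameter of the largest connected region.
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
hull = cv2.convexHull(max(contours, key=cv2.contourArea)).reshape(-1, 2)
feret = pdist(hull).max()
print(f"center of gravity: ({cx:.1f}, {cy:.1f}), Feret diameter: {feret:.1f} px")
```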

Another work [61] surveys several approaches to emotion recognition and facial detection, such as machine learning and geometric feature-based processing, in addition to SVM and a diversity of other classifiers. The authors also present the use of the Microsoft HoloLens (MHL) for detecting human emotions, through an application built to make the MHL detect faces and recognize the emotions of people facing it. The set of emotions they worked with comprised happiness, sadness, anger, surprise and neutral. Additionally, they used a webcam to detect emotions and compared the results with those of the MHL. The MHL system achieved much better results than previous works and showed remarkable accuracy, probably due to the sensors attached to the HoloLens, reaching an accuracy of 93.8% using the "Emotion API" from Microsoft.

In Ref. [62], the authors used a parrot-inspired robot (KiliRo) to interact with children with ASD by simulating a set of autonomous behaviors. They tested the robot for five consecutive days in a clinical facility. The children's expressions while interacting with the robot were analyzed by the Oxford emotions API, enabling an automated facial detection, emotion recognition and classification system.

Some works, such as Ref. [69], show the use of deep learning to detect the child's face and infer visual attention toward a robot during CRI therapy. The authors used the NAO robot from SoftBank Robotics, which has two low-resolution cameras that were used to take pictures and record videos, together with NAO's built-in face detection and tracking system for the clinical experiments. A total of 6 children participated in the experiment, imitating some robot movements. The children had 14 encounters over a month, and the actual experiments started 7 days after the preliminary encounter in order to avoid the novelty effect in the results. Different deep learning techniques and classifiers were used, reaching an average child attention rate of 59.2%.

Deep-learning-based approaches have been shown to be promising for emotion recognition, determining features and classifiers without expert supervision [10]. However, conventional approaches are still being studied for use in real-time embedded systems because of their low computational complexity and high accuracy [70], although for these systems the methods for feature extraction and classification must be designed by the programmer and cannot be optimized to increase performance [71]. Moreover, it is worth mentioning that conventional approaches require considerably less computing power and memory than deep-learning-based approaches [10]. Similarly, Gabor features are very popular for facial expression classification and face recognition due to their high discriminative power [72,73], but their computational complexity and memory requirements make them less suitable for real-time implementation.

Our system is composed of low-cost hardware and low-computational-cost methods for visual and thermal image processing, and it recognizes five emotions with an accuracy of 85.75%. For this system, we proposed a probability-error-based method to accurately locate subject-specific landmarks, taking into account trained expert criteria. As a highlight, our proposal can find, frame by frame, the best-located facial ROI using the Viola-Jones algorithm and adjust the locations of the surrounding facial ROIs. As another novelty, our probability-error-based proposal showed robustness and good accuracy in locating facial ROIs on thermal images, which were collected while typically developing children interacted with a social robot. Furthermore, we extended an existing database of five facial emotions from thermal images to infer the unknown emotions generated while the children interacted with the social robot, using our recognition system based on PCA and LDA, thus achieving results that agreed with the children's written reports.
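A minimal sketch of such a PCA + LDA recognition stage is shown below; the feature files and the number of components are placeholders, and this is not our exact implementation.

```python
# Sketch: PCA compresses the thermal-ROI features, then LDA separates the
# five emotion classes. Data files and component count are placeholders.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X_train = np.load("roi_features_train.npy")   # features from facial ROIs
y_train = np.load("emotions_train.npy")       # five emotion labels

model = make_pipeline(PCA(n_components=30),
                      LinearDiscriminantAnalysis())
model.fit(X_train, y_train)

X_robot = np.load("roi_features_robot_session.npy")   # child-robot interaction
print("inferred emotions:", model.predict(X_robot))
```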

As a limitation, our system is not able to track head movements; adding a facial tracking method, such as that of Ref. [74], could make our proposal robust for locating facial landmarks in uncontrolled scenarios, such as mobile applications for child and social robot interaction. Generally, facial emotion datasets with the six basic emotions contain only adult participants, and there are very few databases of typically developing children (aged between 7 and 11 years) collected through infrared cameras that contain the basic emotions. It is therefore challenging to obtain a large quantity of examples for the training stage of a recognition system that infers the emotions of children aged 7 to 11 years while they interact with a robot, for example. In addition, more tests with a higher number of volunteers must be performed, including children with ASD.
