1. Introduction
Autism spectrum disorder (ASD) is a neurodevelopmental condition that affects social and emotional skills, impairing the ability of children with ASD to interact and communicate. The Centers for Disease Control and Prevention (CDC) reports that 1 in every 36 children in the US is diagnosed with ASD. The emotions of individuals with ASD can be difficult to recognize, making it hard to infer their affective state during an interaction. However, recent technological advances have proven effective in understanding the emotional state of children with ASD.
In recent years, wearable devices have been used to recognize emotions, detect stress levels, and prevent accidents using behavioral parameters or physiological signals [1]. The low cost and wide availability of wearable devices such as smartwatches have opened tremendous possibilities for research in affect analysis using physiological signals. An advantage of a wearable device, such as a smartwatch, is its ease of use in real-time emotion recognition systems. Many physiological signals can be used for emotion recognition, but heart rate is relatively easy to collect with wearable devices such as a smartwatch, bracelet, chest belt, or headset. Many manufacturers now market smartwatches that monitor heart rate using photoplethysmography (PPG) sensors or electrocardiograph (ECG) electrodes. Heart rate sensors in devices such as the Samsung Galaxy Watch, Apple Watch, Polar, Fitbit, and Xiaomi provide a reliable instrument for heart-rate-based emotion recognition. Another significant advantage of using heart rate signals for affect recognition is their direct link with the human endocrine system and the autonomic nervous system; thus, a more objective and accurate picture of an individual's affective state can be acquired from heart rate information. In this work, the Samsung Galaxy Watch 3 is used to acquire the heart rate signal, as it is more comfortable for participants with ASD to wear than a chest belt or a headset. This is especially important for those who are hypersensitive, a common experience among individuals with ASD.
Previous research shows that heart rate changes with emotion. Ekman et al. [2] showed that heart rate responds uniquely to different affective states: it increased during the affective states of anger and fear, and decreased in a state of disgust. Britton et al. [3] revealed that heart rate during a happy state is lower than during a neutral state. Similarly, Valderas et al. [4] found distinct heart rate responses when subjects experienced relaxed and fearful emotions; their experiments also showed that the average heart rate is lower in a happy mood than in a sad mood.
Similarly, the field of robotics is opening many doors to innovate the treatment of individuals with ASD. Motivated by their deficits in social and emotional skills, some methods have employed social robots in interaction with children with ASD [5,6,7]. Promising results have been reported in the development of the social and emotional traits of children with ASD while supported by social robots [8]. Similarly, Taylor et al. [9,10] taught children with intellectual disabilities coding skills using the Dash robot developed by Wonder Workshop. In this paper, an avatar is used in a virtual learning environment to assist children with ASD in improving communication skills while learning science, technology, engineering, and mathematics (STEM) skills. In particular, the child is given a challenge to program a robot, Dash™. Based on the progress and behavior of the child, the avatar provides varying levels of support so that the student succeeds in programming the robot.
Most emotion analysis studies of children with ASD use various stimuli to evoke emotions in a lab-controlled environment. The majority of these studies have employed pictures and videos. However, Fadhil et al. [11] report that pictures are not suitable stimuli for evoking emotions in children with ASD. Although most studies have used video stimuli, other works have employed serious games [12] and computer-based intervention tools [13]. In this work, a human–avatar interaction is used in a natural environment where children with ASD learn to code a robot with the assistance of an avatar displayed on an iPad.
The compilation of ground truth labels from the captured data is challenging, laborious, and prone to human error. Many techniques have been used in the literature to tag HR data with emotions. For instance, in [14], the study participants use an Android application to self-report their emotions in their free time. Similarly, in [15,16], the HR signals are labeled by synchronizing the HR data with the stimuli videos: since the emotion label of each stimulus is known, the HR segments aligned with that stimulus are tagged accordingly. The problem with this tagging process is that it assumes every participant experiences the emotion intended by the stimulus and that this emotion is constant across participants. However, Lei et al. [17] reveal that individuals respond to the same stimuli with varied emotions.
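The stimulus-synchronization labeling described above can be sketched as follows; the timestamps, interval boundaries, emotion labels, and the `label_hr_by_stimulus` helper are illustrative assumptions, not the cited studies' data or code:

```python
# Sketch of conventional stimulus-synchronization labeling: each HR
# sample inherits the known emotion label of the stimulus interval it
# falls in. All values below are made up for illustration.

def label_hr_by_stimulus(hr_samples, stimulus_intervals):
    """Assign each (timestamp, bpm) sample the label of the stimulus
    interval containing it; samples outside every interval get None."""
    labeled = []
    for t, bpm in hr_samples:
        label = None
        for start, end, emotion in stimulus_intervals:
            if start <= t < end:
                label = emotion
                break
        labeled.append((t, bpm, label))
    return labeled

hr = [(0.5, 88), (1.5, 92), (2.5, 95)]
stimuli = [(0.0, 1.0, "neutral"), (1.0, 3.0, "fear")]
print(label_hr_by_stimulus(hr, stimuli))
# -> [(0.5, 88, 'neutral'), (1.5, 92, 'fear'), (2.5, 95, 'fear')]
```

Note that this mechanical assignment is exactly where the assumption criticized by Lei et al. [17] enters: every sample inside an interval receives the stimulus's intended label, regardless of what the participant actually felt.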
The ground truth labeling process becomes even more challenging when the data are collected from participants with ASD [18]. For these participants, it is very difficult to accurately determine the internal affective state, and due to the communication deficits of children with autism, conventional emotion labeling methods are difficult to apply [18,19]. In this paper, a semi-automatic emotion labeling technique is presented that leverages the full context of the environment in the form of videos captured during the participant's interaction with the avatar. An off-the-shelf facial expression recognition (FER) algorithm, TER-GAN [20], is employed to produce an initial label recommendation by applying FER to the video frames. Based on the emotion prediction confidence of the FER algorithm, a human with knowledge of the full context of the situation decides the final ground truth label. The FER algorithm classifies a video frame into seven classes, i.e., the six basic expressions of fear, anger, sadness, disgust, surprise, and happiness, plus a neutral state. Similar to [14], these emotions are clustered into three classes: neutral (neutral), negative (fear, anger, sadness, and disgust), and positive (happiness). After tagging the child–avatar interaction videos, the classical HR–video synchronization technique is used to produce the ground truth emotion annotation of the HR signal.
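The confidence-gated, semi-automatic step can be sketched as below. The threshold value, the `suggest_label` helper, and the routing of emotions without a listed cluster (e.g., surprise) to human review are all assumptions for illustration, not the paper's implementation:

```python
# Sketch of semi-automatic labeling: a FER prediction becomes an
# automatic class suggestion only when its confidence is high enough;
# otherwise the frame is flagged for the human annotator, who has the
# full context of the interaction. Threshold is an assumed value.
CONFIDENCE_THRESHOLD = 0.8

# Three-way grouping quoted in the text; surprise has no listed cluster,
# so it always falls through to human review in this sketch.
EMOTION_TO_CLASS = {
    "neutral": "neutral",
    "fear": "negative", "anger": "negative",
    "sadness": "negative", "disgust": "negative",
    "happiness": "positive",
}

def suggest_label(fer_emotion, fer_confidence):
    """Return ('auto', class) for a confident, clustered prediction,
    or ('review', raw_emotion) when a human must decide."""
    cls = EMOTION_TO_CLASS.get(fer_emotion)
    if cls is None or fer_confidence < CONFIDENCE_THRESHOLD:
        return ("review", fer_emotion)
    return ("auto", cls)

print(suggest_label("happiness", 0.93))  # -> ('auto', 'positive')
print(suggest_label("surprise", 0.95))   # -> ('review', 'surprise')
print(suggest_label("fear", 0.40))       # -> ('review', 'fear')
```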
After compiling the training and testing dataset, optimal features from the heart rate signal are then extracted and fed to the classifier for emotion recognition. A comparison between two different feature extraction techniques is also presented and experiments for intra-subject emotion categorization and inter-subject emotion recognition are performed. The main contributions of this paper are given below:
To the best of our knowledge, this is the first paper that presents a wearable emotion recognition technique using heart rate information as the primary signal from a smartwatch to classify the emotions of participants with ASD in real time.
The dataset compiled for this study contains face videos and heart rate data collected in an in-the-wild set-up where the participants interact with an avatar to code a robot.
A semi-automated heart rate data annotation technique based on facial expression recognition is presented.
The performance of a raw HR-signal-based emotion classification algorithm is compared with a classification approach based on features extracted from HR signals using discrete wavelet transform.
The experimental results demonstrate that the proposed method achieves a comparable performance to state-of-the-art HR-based emotion recognition techniques, despite being conducted in an uncontrolled setting rather than a controlled lab environment.
The rest of the paper is organized as follows. Section 2 reviews related work on emotion recognition for children with ASD. Section 3 describes the methods and experimental details, including the participant demographics, the in-the-wild real-time learning environment, the child–avatar interaction, the semi-automatic emotion labeling process, the feature extraction techniques, and the emotion recognition step. Section 4 presents the experimental results and compares them with state-of-the-art emotion recognition techniques based on heart rate information. Section 5 presents the conclusions, and Section 6 discusses the limitations and future research directions.
4. Results and Discussion
Summary statistics of the heart rate data acquired during the children's interactions with the avatar are shown in Figure 7, which plots the average, minimum, and maximum heart rate of all participants. As can be seen, the participants go through a wide range of heart rate levels while completing the tasks and interacting with the avatar.
Figure 8 shows the maximum, minimum, and average beats per minute of all nine participants. The average heart rate of all nine participants is 96.8 BPM, while the maximum heart rate is 124 BPM (Participant 6) and the minimum heart rate is 62 BPM (Participant 1).
The emotion recognition results for both the intra-subject and the inter-subject data, using DWT features and raw heart rate features with SVM, KNN, and RF classifiers, are discussed in the following paragraphs. Experiments are performed with three different window sizes, and a window size of two seconds is found to offer the best trade-off between accuracy and speed, which facilitates the real-time application of the emotion recognition technique.
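The fixed-size windowing step can be sketched as follows, assuming a 1 Hz HR stream and non-overlapping windows (both are assumptions; the paper's exact segmentation parameters are not restated here):

```python
# Minimal windowing sketch: split an HR stream into fixed-length,
# non-overlapping windows. At an assumed 1 Hz sampling rate, a window
# of 2 samples corresponds to the two-second window discussed above.

def segment(signal, window_size):
    """Return consecutive non-overlapping windows, dropping any
    trailing partial window (a common choice for fixed-size inputs)."""
    return [signal[i:i + window_size]
            for i in range(0, len(signal) - window_size + 1, window_size)]

hr_stream = [90, 92, 95, 97, 96, 94, 91]
print(segment(hr_stream, 2))
# -> [[90, 92], [95, 97], [96, 94]]
```

Smaller windows produce more training examples and lower latency for real-time use, at the cost of each window carrying less temporal context.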
The intra-subject classification accuracies of the three classifiers using DWT features are shown in Figure 9. In the case of SVM, the highest accuracy of 100% is obtained from Participant 6, and the lowest recognition accuracy of 40.1% is obtained from Participant 4. For KNN, Participant 6 obtains the highest accuracy of 100%, and Participant 3's emotion recognition accuracy of 51.4% is the lowest. In the case of RF, the highest accuracy of 99.5% is obtained from Participant 6, and the lowest accuracy of 39.2% is obtained from Participant 9.
Similarly, the intra-subject classification accuracies of the three classifiers using the raw HR data are shown in Figure 10. In the case of SVM, the highest accuracy of 100% is obtained from Participant 6, and the lowest recognition accuracy of 29.7% is obtained from Participant 1. In the case of KNN, Participant 5 obtains the highest accuracy of 100%, and Participant 2's emotion recognition accuracy of 32.1% is the lowest. For RF, the highest accuracy of 99.2% is obtained from Participant 6, and the lowest accuracy of 35.6% is obtained from Participant 1.
The emotion recognition accuracy using the DWT features for inter-subject classification with SVM, KNN, and RF is shown in Table 3. Among the three classifiers, the highest accuracy with a window size of 2 s is obtained with SVM, with an average ten-fold cross-validation accuracy of 39.8%. The recognition accuracies for window sizes of three and five seconds are 39.8% and 39.9%, respectively; since the larger windows yield only a negligible accuracy gain while adding latency, the window size is set to two seconds to facilitate the real-time application of the proposed method. With a two-second window, KNN and RF produce emotion classification accuracies of 33.4% and 35.7%, respectively. Similarly, with the raw heart rate signal as the input feature, the highest classification accuracy of 38.1% is obtained with SVM, while RF produces the lowest recognition accuracy of 31.9%, as shown in Table 4. Hence, this comparison indicates that slightly better emotion recognition performance can be achieved with DWT-based features.
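For readers unfamiliar with DWT-based features, a minimal single-level Haar decomposition is sketched below. The wavelet family, decomposition depth, and feature selection used in the experiments are not restated here, so this is only an illustration of the kind of approximation/detail coefficients such features are built from:

```python
import math

# Single-level Haar DWT of an even-length signal: the approximation
# coefficients track the slow trend of the heart rate, while the detail
# coefficients capture sample-to-sample change. Input values are made up.

def haar_dwt(signal):
    """Return (approximation, detail) coefficient lists."""
    approx = [(signal[i] + signal[i + 1]) / math.sqrt(2)
              for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / math.sqrt(2)
              for i in range(0, len(signal), 2)]
    return approx, detail

a, d = haar_dwt([90, 92, 95, 97])
# a: smoothed, half-length version of the window
# d: local fluctuations, small here because the HR rises steadily
```

In a feature-based pipeline, coefficients like these (or statistics computed from them) would then be concatenated into the feature vector fed to the SVM, KNN, or RF classifier.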
Figure 11 shows the confusion matrices of the experiments performed with the DWT features and with the raw heart rate signal, while Table 4 shows the average precision, average recall, and F1 score of these experiments. Comparing the intra-subject and inter-subject recognition accuracies, it can be seen that the inter-subject emotion detection task is much more difficult than intra-subject emotion classification due to the variation in heart rate data across individuals.
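As an illustration of how such averaged metrics can be derived from a confusion matrix, the following sketch computes macro-averaged precision and recall, and an F1 score, from a made-up 3 × 3 matrix (rows = true class, columns = predicted class). The matrix values and the convention of computing F1 from the macro averages are assumptions, not the paper's reported numbers:

```python
# Macro-averaged precision/recall from a confusion matrix where
# cm[true][pred] counts samples; classes here stand for the paper's
# neutral / negative / positive grouping. Values are illustrative only.

def macro_metrics(cm):
    n = len(cm)
    precisions, recalls = [], []
    for k in range(n):
        tp = cm[k][k]
        pred_k = sum(cm[r][k] for r in range(n))  # column sum: predicted as k
        true_k = sum(cm[k])                       # row sum: actually k
        precisions.append(tp / pred_k if pred_k else 0.0)
        recalls.append(tp / true_k if true_k else 0.0)
    p = sum(precisions) / n
    r = sum(recalls) / n
    # F1 from the macro averages; one common convention among several.
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

cm = [[50, 10, 5],   # true neutral
      [12, 40, 8],   # true negative
      [6, 9, 30]]    # true positive
p, r, f1 = macro_metrics(cm)
```

The same per-class counts also make the inter-subject difficulty visible: classes that are frequently confused across participants depress both the column-wise precision and the row-wise recall.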