## *3.2. Voices*

A scientific metaphor that has become somewhat popular describes the voice as an "auditory face" [55], emphasizing that voices, just like faces, provide rich information not only about a person's emotions but also about identity, gender, socioeconomic or regional background, and age (for review, see [56]). In the context of autism, and as with faces, researchers have focused on deficits in vocal emotional communication and pointed out that these deficits tend to affect multiple modalities including voices, faces, and body movement [57]. At the same time, deficits in vocal communication may extend to other aspects such as vocal identity perception [58,59], and the vocal expression of autistic people could be affected beyond emotional expression.

One path towards auditory markers for autism in voices is linked to the symptom of repetitive behaviors, or 'vocal stereotypies'. One study conducted a subspace analysis of acoustic data from autistic children and reported good detection of vocalized non-word sounds [42]. Subsequently, Min and Fetzner [43] also used subspace learning for vocal stereotypies and trained dictionaries to differentiate between vocal stimming (a nonverbal vocalization often observed in autism) and other noises. Using a small sample of four children with ASD who lacked verbal communication (age not reported), the authors were able, for example, to detect vocal stimming and predict perceived frustration reasonably well, although the study must be regarded as preliminary. For verbal children, other potential vocal markers could include prosody. Marchi et al. [44] created an evaluation database of emotionally toned voices of ASD and TD children in three languages, which they analyzed using the COMPARE feature set. Groups comprised between 7 and 11 individuals per language, and children were between 5 and 11 years old. Comparisons between groups and emotions showed relatively poor classification performance for ASD children's voices, particularly for 'Anger' in both the Swedish and English datasets, and for 'Afraid' in Hebrew-speaking children with ASD. Using a pre-existing dataset, a study on automatic voice classification showed that an algorithm correctly classified up to 61.1% of voice samples from children diagnosed with and without autism [60]. Ringeval et al. [45] assessed verbal prosody in ASD children, children with pervasive developmental disorder (PDD), children with specific language impairment (SLI), and TD children (aged around 9–10 years, with 10–13 children per clinical group). Specifically, these authors recorded performance during an imitation task for sentences with different intonations (e.g., rising, falling). The rising intonation condition was reported to best discriminate between groups, and the authors interpreted their findings as indicating a pronounced pragmatic impairment in prosodic intonation in ASD.
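
To make the general recipe behind such studies concrete, the following sketch illustrates one plausible pipeline for the stimming-versus-noise problem: summarize each labeled audio segment with generic MFCC statistics and train an off-the-shelf classifier. This is an illustrative assumption on our part, not the subspace/dictionary-learning method of [42,43]; the feature set, classifier choice, and file labels are placeholders.

```python
# A minimal sketch (not the method of [42,43], which used subspace/dictionary
# learning): MFCC summary features plus an SVM for stimming-vs-noise
# classification. File paths and labels are placeholders.
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def segment_features(wav_path, sr=16000):
    """Summarize a short audio segment as the mean and std of its MFCCs."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def evaluate_stim_classifier(segments):
    """segments: list of (wav_path, label) pairs, label 1 = vocal stimming,
    0 = other noise. Returns mean cross-validated accuracy."""
    X = np.array([segment_features(p) for p, _ in segments])
    y = np.array([label for _, label in segments])
    clf = SVC(kernel="rbf", class_weight="balanced")
    return cross_val_score(clf, X, y, cv=5).mean()
```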

When compared with facial data, the use of sensors for automatic analysis of markers for autism in voices is clearly still at an early and preliminary stage. The studies reviewed above that use original data typically rely on small datasets from few participants. Although initial findings suggest that the systematic search for vocal markers of autistic traits could be highly promising, more research is clearly warranted at this stage. Considering that similar impairments in facial and vocal emotional processing could be present in ASD [57], multisensory assessment of emotional behavior in future studies could be particularly promising.

## *3.3. Body Movement*

The identification of ASD-associated movement patterns has been the subject of intense research, primarily focusing on data from accelerometer sensors [42,61]. As the display of stereotypical body movement patterns is a core symptom of ASD, it has been a target feature in computer vision. Gonçalves et al. [46] used a simple gesture recognition algorithm on 3D visual data to automatically detect hand-flapping movements. Validation of these data against 2D video data suggested that automatic 'hand flapping' detection delivers valuable information for monitoring autistic children, for instance in special needs schools. Jazouli et al. [47] also used the same sensor but based their analysis on a \$P Point-Cloud Recognizer to automatically detect body rocking, hand flapping, finger flapping, hands on the face, and hands behind the back, with an overall mean accuracy of 94%.
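
As an intuition for how such detectors can work, the sketch below flags hand-flapping-like episodes in a 3D wrist-joint trajectory (e.g., the skeleton output of a depth sensor) via a simple spectral heuristic. This is not the algorithm of [46] nor the \$P recognizer of [47]; the 2–5 Hz band and the 30 Hz frame rate are illustrative assumptions.

```python
# Hedged heuristic sketch: repetitive flapping appears as oscillatory energy
# in a low-frequency band of the wrist-joint velocity. Band limits and frame
# rate are assumptions, not parameters from the cited studies.
import numpy as np
from scipy.signal import welch

def flapping_score(wrist_xyz, fs=30.0, band=(2.0, 5.0)):
    """wrist_xyz: (n_frames, 3) joint positions at fs Hz. Returns the
    fraction of movement energy inside the assumed 'flapping' band."""
    vel = np.diff(wrist_xyz, axis=0)                  # signed per-axis velocity
    freqs, psd = welch(vel, fs=fs, nperseg=64, axis=0)
    psd = psd.sum(axis=1)                             # pool energy over x, y, z
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    return psd[in_band].sum() / (psd.sum() + 1e-12)

# Synthetic check: a pure 4 Hz oscillation of the wrist scores close to 1.
t = np.arange(0, 10, 1 / 30.0)
wrist = np.stack([0.05 * np.sin(2 * np.pi * 4 * t),
                  np.zeros_like(t), np.zeros_like(t)], axis=1)
print(flapping_score(wrist))
```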

Rynkiewicz et al. [47] studied the role of non-verbal communication in the setting of an ADOS-2 assessment for children aged 5–10. They used a 3D sensor for automatic gesture analysis of the upper body while boys (*N* = 17) and girls (*N* = 16) with high-functioning ASD performed two assessment-related tasks. Girls had a higher gesture index than boys, although they had poorer verbal communication skills and a more impaired ability to read mental states from faces. The authors suggested that the vivid use of gestures in girls (generally less common in our current understanding of the autistic phenotype) may contribute to a possible under-diagnosis of autism in females. Such findings might further encourage a more general use of automatic gesture analysis, which is currently performed by professional human raters in the course of these assessments. A study by Anzulewicz et al. [48] used the touch and inertial sensors of a tablet to assess the specific movement patterns of children with autism (*N* = 37) and an age- and gender-matched TD group (*N* = 45) while playing simple games. Apart from interesting findings, including greater touch force, larger and more distal gestures, and faster screen taps in the ASD group, the authors also tested different machine learning algorithms to classify between the groups, with promising results.
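
A minimal sketch of the kind of group classification reported in [48] could look as follows; the per-child feature names are our assumptions inspired by the reported group differences (force, gesture size, tap speed), not the published feature set, and the classifier choice is arbitrary.

```python
# Illustrative sketch, assuming per-child summary features from tablet play
# (the names below are not the actual feature set of [48]).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

FEATURE_NAMES = ["mean_tap_force", "mean_gesture_length_mm",
                 "mean_tap_interval_s", "mean_gesture_speed_mm_s"]

def group_classification_accuracy(X, y):
    """X: (n_children, len(FEATURE_NAMES)) summaries; y: 1 = ASD, 0 = TD.
    Returns mean cross-validated accuracy of a random-forest classifier."""
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    return cross_val_score(clf, X, y, cv=5).mean()
```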

In summary, the use of 3D data has been preferred by researchers investigating body movements. While it seems to be possible to detect certain stereotypical movements, there is still a lack of large-scale studies. However, even coarser movement indices (including, for instance, gesture indices or general movement patterns) may also provide meaningful information for the identification and differentiation of autistic behavioral markers.

## *3.4. Multimodal Information*

Multimodal approaches to the automatic analysis of behavior in ASD are still infrequent. Samad et al. [49] used facial motion, eye-gaze, and hand movement analysis of adolescents with (*N* = 8) and without ASD (*N* = 8) on tasks involving 3D expressive faces. They found reduced synchronization of facial expression and visual engagement with the stimuli in the ASD group, as well as poorer correlations between eye-gaze and hand movements. Jaiswal et al. [50] designed an automatic approach based on 3D data of facial features and head movements that classified adults with ASD (vs. TD and vs. comorbid ASD and ADHD, with *N* between 11 and 22 per group) with high accuracy. Their recorded data consisted of participants reading, listening to, and answering questions taken from the 'Strange Stories' task [62] often used in ASD assessments.
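
To illustrate the kind of synchrony measure such multimodal analyses rely on, the sketch below computes a windowed correlation between two simultaneously recorded channels, e.g., a facial-expression intensity trace and a gaze-on-stimulus trace. This is a generic measure, not the specific synchronization analysis of [49]; the window length is an assumption.

```python
# Generic windowed-correlation sketch for two time-aligned behavioral
# channels; not the specific analysis of [49]. Window length is assumed.
import numpy as np

def windowed_sync(face_signal, gaze_signal, fs=30.0, win_s=2.0):
    """Mean Pearson correlation over half-overlapping sliding windows.
    Both inputs are 1-D arrays sampled at fs Hz."""
    win = int(win_s * fs)
    rs = []
    for start in range(0, len(face_signal) - win + 1, win // 2):
        f = face_signal[start:start + win]
        g = gaze_signal[start:start + win]
        if f.std() > 0 and g.std() > 0:          # skip constant windows
            rs.append(np.corrcoef(f, g)[0, 1])
    return float(np.mean(rs)) if rs else float("nan")
```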

Overall, the integration of multimodal information is unfortunately not yet common in this field, even though there is reason to expect that multimodal assessments will provide ample additional information (for instance, on the synchronization or complementarity of signals from different channels). Thus, it can be expected that the future use of multimodal assessments has the potential to substantially improve both the identification of markers for autistic traits and classification results.

## *3.5. Cognition and Social Behavior*

A substantial body of research has suggested that autism is characterized by changes in cognitive information processing. For instance, autistic people may have difficulties in cognitively simulating the observed actions of communication partners – a process that has been related to the so-called human mirror neuron system, and that has been inferred via neurophysiological recordings [63]. Via experiments with social behavioral assessments, autism has also been related to a deficit in forming a theory of mind about other people [64]. Researchers in this field have long focused on inferences about "core" cognitive deficits from observational data, even though results can be very inconsistent across situations or studies [65,66]. We also welcome an increasing awareness of the danger that researchers wrongly infer putative core deficits by misinterpreting observational data [36]. Of course, inferences about core cognitive processes most typically need to be made from observational information. It is in this context that we foresee that sensor data, beyond their immediate value for an individual study, might eventually contribute to resolving theoretical controversies. One of these is the degree to which we should frame cognitive changes in autism in terms of general "core" deficits, or rather in terms of domain-specific deficits that emerge in a concrete perceptual and interactional context.

The automatic analysis of playing behavior could be a promising screening tool to identify stages of development, developmental delays, and specific forms of deficits. In a preliminary study with only a single completely analyzed data set of a TD child [51], play data from smart toys with embedded acceleration sensors, together with simultaneous video and audio recordings, were automatically classified into certain forms of play behavior (exploratory, relational, functional), each roughly indicating a stage of development. While this approach seems interesting, a current limitation is the verbal prompting used in the setting, which means that only children with good verbal skills can be assessed.
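
As an illustration of the first processing step in such a pipeline, the sketch below turns a smart-toy accelerometer stream into per-window summary features that a play-behavior classifier could consume; the window length and the three features are our assumptions, not details of [51].

```python
# Minimal feature-extraction sketch for a smart-toy accelerometer stream.
# Window length and feature choices are illustrative assumptions.
import numpy as np

def accel_windows(accel, fs=50.0, win_s=3.0):
    """accel: (n_samples, 3) accelerometer readings at fs Hz. Yields one
    small feature vector per non-overlapping window."""
    win = int(win_s * fs)
    for start in range(0, len(accel) - win + 1, win):
        w = accel[start:start + win]
        mag = np.linalg.norm(w, axis=1)                # movement magnitude
        yield np.array([mag.mean(),                    # activity level
                        mag.std(),                     # variability
                        np.abs(np.diff(mag)).mean()])  # 'jerkiness'
```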

Joint attention, the basic communicative skill of sharing attention with another agent towards an object, has long been considered a precursor of a theory of mind [67] and may be reduced in people with ASD. In a study investigating a robot-assisted joint attention task [52] in children with and without ASD (*N* = 16 each), the authors captured participants' orientation over time using 3D sensors. In addition to significantly less joint attention in the ASD group's interaction with the robot, the ASD group also showed decreased micro-stability in the trunk area, which was tentatively interpreted as a consequence of increased cognitive cost. In another study [53], attentional and orienting responses to name calls were studied in toddlers, aged 16–31 months, with ASD (*N* = 22) and TD (*N* = 82). Achieving a high intra-class correlation with human raters, automated coding offered a reliable method to detect the differential social behavior of toddlers with ASD, who responded to name calls less often and with longer latency. Petric et al. [54] designed and further developed a robot-assisted subset of the ADOS assessment. They included certain social tasks (name-calling, joint attention) to assess information on eye gaze, gestures, and vocal utterances. The agreement between the robot's classifier and clinicians' judgments was evaluated as promising, but the results should be regarded with caution given the very small sample of children (ASD *N* = 3, TD *N* = 1).
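
The automated response-to-name coding of [53] can be illustrated with a simple latency measure on head-pose data: given a per-frame head-yaw trace and the time of a name call, check whether and how fast the child orients towards the caller. The angular tolerance and timeout below are illustrative assumptions, not the published coding parameters.

```python
# Hedged sketch of automated response-to-name coding; the angular tolerance
# and timeout are assumptions, not the parameters used in [53].
def name_call_response_latency(yaw_deg, fs, call_frame, caller_yaw_deg,
                               tol_deg=20.0, timeout_s=5.0):
    """yaw_deg: per-frame head yaw in degrees; fs: frame rate in Hz.
    Returns the response latency in seconds, or None if the head never
    turns within tol_deg of the caller's direction before the timeout."""
    end = min(len(yaw_deg), call_frame + int(timeout_s * fs))
    for frame in range(call_frame, end):
        if abs(yaw_deg[frame] - caller_yaw_deg) <= tol_deg:
            return (frame - call_frame) / fs
    return None
```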

Overall, while researchers have begun to assess joint attention in ASD, sensor-based assessment of other functional domains of cognition and social behavior still awaits systematic research. In particular, research on a few putative "core" areas of social cognition, including observational tasks that probe theory of mind in ASD, is currently lacking.

#### **4. Supporting Interventions**

While accurate early diagnosis and an assessment of specific impairments are crucial, they are also prerequisites that inform environmental adjustments, interventions, and training approaches, which ultimately can be valuable for the individual with ASD following diagnosis. Technically assisted training often has the benefit of being readily available, problem-specific, cost-effective, and widely accepted by affected children. Additionally, smart responses of training systems that give reliable, immediate feedback and appraisal can be highly beneficial for fast learning. Note that although we identified many publications on interventions aiming at autism as a target condition, many of these reported conceptual or technological contributions, and few presented original data from people with ASD that qualified them for inclusion in this review (cf. Section 2). The original articles presented in this review regarding sensor-based supporting interventions for ASD are listed in Table 2.

## *4.1. Emotion Expression and Recognition*

Emotion expression, especially from the face, is considered highly relevant in autism research. A game called FaceMaze [68], combined with automatic online recognition of the user's facial expressions, was specifically developed to improve facial expression production. In this game, children navigated a maze while posing 'happy' or 'angry' facial expressions to overcome obstacles. In pre-post ratings by naive human raters, quality ratings for both trained expressions (happy and angry) in ASD children (*N* = 17, aged 6–18) increased in the post-test, while ratings for an untrained emotion (surprise) did not change. Another, smaller study created a robot-child interaction and tested it with three children with ASD, who were asked to imitate a robot's facial expressions [69]. The robot correctly recognized the children's imitated expressions through an embedded camera in half of the cases, in which it was then able to give immediate positive feedback. It also correctly withheld a response in about one-third of the trials in which participants did not imitate. Piana et al. [70] designed a serious game with online 3D data acquisition that trained children with ASD (*N* = 10) over several sessions to recognize and express emotional body movements. Both emotion expression (mean accuracy gain = 21%) and recognition (mean accuracy gain = 28%) increased throughout the sessions. Interestingly, performance on the T.E.C. (Test of Emotion Comprehension, assessing emotional understanding more generally) increased as well (mean gain = 14%).
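
The core game mechanic of expression-contingent feedback can be sketched as follows: an obstacle clears only once a per-frame expression recognizer reports the target emotion with sufficient confidence over a sustained run of frames. The recognizer output format, threshold, and hold duration are our assumptions, not the published FaceMaze parameters.

```python
# Hedged sketch of expression-contingent game logic (not the actual FaceMaze
# implementation of [68]); threshold and hold duration are assumptions.
def obstacle_cleared(frame_probs, target="happy", thresh=0.7, hold_frames=15):
    """frame_probs: iterable of dicts mapping emotion label -> probability,
    one per video frame (the output of any per-frame expression recognizer).
    Returns True once the target expression is held for hold_frames frames."""
    run = 0
    for probs in frame_probs:
        run = run + 1 if probs.get(target, 0.0) >= thresh else 0
        if run >= hold_frames:
            return True
    return False
```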

## *4.2. Social Skills*

Robins et al. [71] created an interactive robot (KASPAR) with force-sensitive resistor sensors. They later planned to use KASPAR for robot-assisted play to teach touch, joint attention, and body awareness [72], although conclusive experimental data from interactions between individuals with autism and KASPAR may still be in the pipeline. Learning social skills also presupposes attention to potential social cues and social engagement. Costa et al. [73] reported preliminary research on using LEGO Mindstorms robots with adolescents with ASD (*N* = 2) in an attempt to increase openness and encourage communication, since the participants actively had to provide verbal commands or instructional acts. They reported that the two participants behaved differently, one being indifferent and one becoming increasingly interested in the interaction. Wong and Zhong [74] used a robotic platform (a polar bear) to teach social skills to children with ASD (*N* = 8). They found that, within five sessions, an increase in turn-taking, joint attention, and eye contact was observable, resulting in an overall 90% achievement of individually defined goals.

Greeting is a basic element of communication. In a greeting game with 3D body movement and voice acquisition [75], a participant played an avatar with his or her own face, learning to greet (vocalization, eye contact, and waving) and receiving immediate appraisal upon success. A single case study suggested that this intervention can be effective at teaching greeting behavior. As a more complex pilot intervention, Mower et al. [76] created the embodied conversational agent 'Rachel', which acted as an emotional coach guiding children through emotional problem-solving tasks. Audio and video data from their two participants with ASD were acquired for post hoc analysis and tentatively suggested that the interface could elicit interactive behavior.

Overall, there is some evidence that sensor technology can improve social skills in people with autism, and the use of sophisticated robotic platforms can be regarded as particularly promising. As limitations, it should be noted that all studies that met the criteria for inclusion in this review tested only very few participants, and that real-world follow-up tests were typically not reported. As a result, a systematic quantitative assessment of treatment effects and effect sizes, as well as a comparison with more conventional interventions (e.g., social competence training), will require substantial cross-disciplinary research. Moreover, most studies were driven by a combination of theoretically interesting and technically advanced approaches, and designed from the perspective of typical development. Designing more user-centered and irritation-free approaches could promote both usability and motivation for people with autism to engage in technology-driven interventions.





#### **5. Monitoring**

Monitoring a child's emotional state or behavioral changes can be crucial for the outcomes of a learning environment. As discussed above, emotional expressions from people with ASD may differ in several respects from those of TD people. As a result, there is a higher risk that caregivers or interaction partners overlook or misinterpret the emotional state of people with autism.

Del Coco et al. [77] created a therapy setup assisted by a humanoid robot and a tablet, in which a video processing module was trained to monitor behavioral change in children with ASD. Besides creating a visual output of behavioral cues, they computed a score of affective engagement (based on happiness-related features) from visual cues such as facial action units (AUs), head pose, and gaze, which provides the practitioner with a behavioral trend over the course of treatment. Dawood et al. [78] used facial expressions, eye gaze, and head movements to identify five discrete emotional states of young adults with ASD in learning situations (e.g., anxiety, engagement, uncertainty). Their resulting model yielded high validity in identifying the emotional states of participants with high-functioning ASD. At the same time, lower validity was found for TD participants, suggesting differential facial expressions of certain emotional states in ASD. For monitoring social interactions, Winoto et al. [79] created a machine-learning-based social interaction coding of 3D data around a target user. Kolakowska et al. [80] approached automatic progress recognition with different tablet games. Over a 6-month time window, they were able to identify movement patterns in their study group of children with ASD (*N* = 40) that related not only to the development of fine motor skills but also to other domains such as communication and socio-emotional skills. Overall, these initial studies suggest that sensor-based monitoring of emotional and behavioral changes may support caregivers in optimizing learning outcomes.
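
A minimal sketch of such an engagement trend, assuming that happiness-related AUs (e.g., AU6 'cheek raiser' and AU12 'lip corner puller') and a gaze-on-task signal are available per frame, could combine and smooth them as follows; the weights, AU choice, and smoothing window are our assumptions, not the model of [77].

```python
# Hedged sketch of an affective-engagement trend from per-frame visual cues.
# Weights, AU choice, and smoothing window are assumptions, not the published
# model of [77].
import numpy as np

def engagement_trend(au6, au12, gaze_on_task, fs=30.0, smooth_s=5.0):
    """All inputs are per-frame 1-D arrays scaled to [0, 1]; returns a
    moving-average-smoothed engagement score per frame."""
    raw = 0.4 * au6 + 0.4 * au12 + 0.2 * gaze_on_task
    k = max(1, int(smooth_s * fs))
    return np.convolve(raw, np.ones(k) / k, mode="same")
```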

#### **6. Discussion**

The studies discussed above demonstrate substantial research activity towards using sensor-based technology in the context of autism, with attention to multiple aspects including diagnosis/classification and intervention. At the same time, it appears that much current research is largely driven by fast technological progress in terms of innovative engineering and data analysis methods. It remains a significant challenge to reconcile these developments with the specific testing of psychological or neuroscientific theories regarding functional changes and potentials in autism. Similarly, systematic studies with theory-driven protocols and larger samples are required to evaluate in more detail both the diagnostic and interventional potential of sensor-based technology. For the ultimate goal of evaluating its practical relevance, quantitative assessments of diagnostic sensitivities and specificities, or of treatment effect sizes, will be as important as comparative studies with more traditional approaches to diagnosis and intervention.

One of many examples of how sensor technology has the potential to go beyond application, and to contribute to current neurocognitive theories of communication, relates to the theory of a tight link between perception and motor action in communication. This link has now been firmly established in speech communication [81], but there are reasons to believe that perception and action are also closely linked in nonverbal emotional and social communication. For instance, listening to laughter normally activates premotor and primary motor cortex [82], and may in parallel involuntarily elicit orofacial responses in a perceiver. In turn, there is also initial evidence that voluntary motor imitation can actually facilitate facial emotion recognition, particularly in people with high levels of autistic traits [83], who are thought to engage less in spontaneous imitation. A consistent theoretical account for such findings is that imitation, and covert sensorimotor simulation of others' actions, may be based in part on the so-called mirror neuron system. This system consists of neurons that fire not only when a person performs an action, but also when he or she observes the same action in another individual. The human mirror neuron system is thought to be specifically impaired in autism [63], and a subset of promising intervention approaches for autism using neurofeedback [84] are based on this theory. However, it should be noted that the underlying theory remains disputed [85].

Findings such as those by Lewis and Dunn [83] may be taken to suggest that interventions promoting facial imitation of emotions in autistic people should also support their abilities for emotion recognition and bidirectional communication. However, it is technically challenging to objectively quantify the degree of facial imitation, and in fact, a limitation of the study by Lewis and Dunn was that these authors did not quantify imitation beyond simply asking participants to rate their own degree of imitation. Other studies measured facial imitation more objectively, but typically did so by measuring the facial muscle response for selected target action units with electromyography (EMG, e.g., [86,87]). Although this can provide an objective measure of facial imitation, the method's reliance on recording electrodes attached to the face has several drawbacks. For instance, one concern is that this technology could draw participants' attention to their own facial behavior, which in turn could influence facial action. We believe that contact- and irritation-free assessment of imitation, as provided by modern sensor and real-time facial emotion recognition technologies, is the method of choice to promote a better understanding not only of the role of spontaneous facial imitation in emotion recognition in typical communication, but also of the potential role of impaired links between perception and action for communication difficulties in people with autism.
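
A contact-free imitation measure of the kind advocated here could, for example, correlate a video-based AU intensity trace of the participant with the corresponding trace of the stimulus face, allowing for a short imitation lag. This is our illustrative proposal rather than an established method, and the lag range is an assumption.

```python
# Illustrative sketch of a contact-free imitation measure: maximum lagged
# correlation between stimulus and participant AU traces. The lag range is
# an assumption.
import numpy as np

def imitation_score(stim_au, resp_au, fs=30.0, max_lag_s=1.0):
    """Maximum Pearson correlation over lags of 0..max_lag_s seconds, with
    the participant trailing the stimulus. Both traces are sampled at fs Hz."""
    best = -1.0
    for lag in range(int(max_lag_s * fs) + 1):
        s = stim_au[:len(stim_au) - lag] if lag else stim_au
        r = resp_au[lag:]
        n = min(len(s), len(r))
        if n > 1 and np.std(s[:n]) > 0 and np.std(r[:n]) > 0:
            best = max(best, float(np.corrcoef(s[:n], r[:n])[0, 1]))
    return best
```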

While the research discussed in this review appears to underline a great potential for the use of sensor technology, in particular in the context of autism, it is equally clear that many current tests of assessments or interventions would gain validity from a clear conceptual framework of autism spectrum disorders from a developmental perspective. At present, and acknowledging findings of large individual variability among both people with ASD and TD people, results that were obtained with only a few participants (not always well described, and sometimes obtained in the absence of a TD group), or with experimental groups that are not comparable with respect to basic characteristics (e.g., age, gender, IQ), need to be interpreted with caution in order to avoid biased or overgeneralized interpretation of individual study findings.

Other potential obstacles relate to the sophisticated development (and cost) of some of the systems used, which makes them unlikely to become available in greater quantities. Moreover, even readily available systems may be discontinued or run out of support, as in the case of Microsoft's Kinect in 2017, and this poses great challenges for large-sample research in autism, which often takes years to complete. Research aiming at training and modeling the behavior of people with ASD will also increasingly need to consider usability, given that the relevant systems are to be used by individuals with ASD, their parents, caregivers, and therapists.

Finally, compared to the typical approach of developing sensor-based technology with neurotypical individuals before applying it to people with autism, a more promising strategy may be one in which technology design originates from a user-centered perspective, with autistic people as users actively involved in the process. Such an approach has been forcefully advocated by Rajendran [88], who argues that this may both enhance our understanding of autism and promote better inclusivity of people with autism in an increasingly digital world. At the same time, such technologies ultimately can be useful for people without autism as well. This is because autism is seen as a unique window into social communication and social learning more generally.
