**Thermal Infrared Imaging-Based Affective Computing and Its Application to Facilitate Human Robot Interaction: A Review**

**Chiara Filippini 1,2,*, David Perpetuini 1, Daniela Cardone 1, Antonio Maria Chiarelli 1 and Arcangelo Merla 1,2**


Received: 6 March 2020; Accepted: 21 April 2020; Published: 23 April 2020

**Abstract:** In recent years, robots have been increasingly employed in several aspects of modern society. Among others, social robots have the potential to benefit education, healthcare, and tourism. To achieve this purpose, robots should be able to engage humans, recognize users' emotions, and to some extent properly react and "behave" in a natural interaction. Most robotics applications primarily use visual information for emotion recognition, which is often based on facial expressions. However, the display of emotional states through facial expression is inherently a voluntarily controlled process that is typical of human–human interaction. In fact, humans have not yet learned to use this channel when communicating with robotic technology. Hence, there is an urgent need to exploit emotion information channels not directly controlled by humans, such as those that can be ascribed to physiological modulations. Thermal infrared imaging-based affective computing has the potential to be the solution to this issue. It is a validated technology that allows the non-obtrusive monitoring of physiological parameters, from which it might be possible to infer affective states. This review aims to outline the advantages and the current research challenges of thermal imaging-based affective computing for human–robot interaction.

**Keywords:** human–robot interaction; thermal IR imaging; affective computing; social robots; emotion recognition

#### **1. Introduction**

Human–robot interaction (HRI) can vary depending on the specific robotic function. In fact, robots have been effectively used in the last decade in therapy and educational interventions, but they have also been used where human interaction is not required, for example in robotic vacuum cleaner applications. Even though the latter were not designed to interact socially and form bonds with humans, it was found that people often attribute human-like characteristics to their robotic technology and express attachment toward it [1]. This is in line with the assertion that humans have a natural tendency to anthropomorphize everything around them, including technology [2]. This is especially true of children, who "are unlikely to only use a robot as a tool, and they will undoubtedly have some sort of interaction that can be considered social" [3]. Therefore, robots designed to interact with adults and children should be able to interact socially. To facilitate the interaction, robots should be easy to use, and they should be built to understand and correctly respond to different needs. As robots gain utility, and thereby influence in society, research in HRI is becoming increasingly important. HRI primarily deals with social robotics and includes relevant aspects of robot action, perception, and cognition. The development of social robots focuses on the design of living machines that humans would perceive as realistic, effective, communicative, and cooperative [4]. For this purpose, social robots should be able to express, through their shapes and behaviors, a certain degree of "intelligence" [5]. This skill entails the whole set of social and perceptual abilities of the robot, delivering a human-like interaction.

On the other hand, however, the anthropomorphism of social robots is still strongly debated. Indeed, the inclination to anthropomorphize existing technology could have negative consequences [6] and even lead to an unrecognized change in human relationships [7] or an invasion of privacy [8]. Therefore, it would be prudent to distinguish cases in which the use of anthropomorphism can be encouraged from cases in which it would be better avoided [9]. Darling [10] argues that anthropomorphic framing is desirable where it enhances the function of the technology. In particular, it should be encouraged for social robots and discouraged for robots that are not designed in an intrinsically social way [9,10]. The present review focuses on social robots, and in this context, their anthropomorphism is aimed at facilitating the interaction with humans.

An important key to reproducing human-like behavior and facilitating HRI is the understanding of human emotional reactions. In recent years, great effort has been devoted to endowing robots with the capability of interpreting and adapting to humans' emotional states. Consequently, a core aspect of HRI is Affective Computing. The term Affective Computing was coined more than two decades ago; it is defined as "computing that relates to, arises from, or deliberately influences emotion or other affective phenomena" [11]. Since then, the community has developed different concepts, models, frameworks, and demo applications for affective systems that are able to recognize, interpret, process, and simulate human feelings and emotions. However, these activities have so far been limited to feasibility studies, proof-of-concept prototypes, or demonstrators.

Human emotions are manifested as visible changes in facial expressions, gestures, body movements, or voice tone [12]. Beyond these observable expressive channels, physiological modulations also occur and can be observed as modifications in blood pressure, heart rate, or electrodermal activity [13,14]. These modulations are controlled by the autonomic nervous system and are not directly visible to humans. The main approaches to objectively measure these emotional signs rely on the observation of the face, gestures, or body posture, and on measurements of physiological parameters through contact sensors [15,16]. Visual observation has the advantage of being completely non-invasive and only moderately intrusive. In fact, although a camera may be considered an intrusion into privacy, people are inclined to accept it and forget about it once they get used to it and see value in it [17]. Furthermore, the greatest advantage of using the visual domain for emotion recognition is that it relies on years of study and cutting-edge algorithms developed for face identification and facial expression analysis. The disadvantage of the visual approach is that people are prone to avoid facial expressions when interacting with technical systems [18]. In fact, the facial display of emotional states characterizes human-to-human interaction, and it was acquired through evolution to facilitate communication and to influence others' actions [19,20]. Since machines still do not respond to emotional expressions, humans have not yet learned to use this channel when communicating with a robot. On the contrary, measuring physiological parameters has the advantage of evaluating features that people cannot control or mask [21]. Therefore, physiological channels may deliver much more reliable data about emotional states than visual channels. Physiological parameters such as heart rate, skin temperature, and electrodermal activity are very simple to acquire and analyze. The drawback of working with physiological readings is that they are mostly obtained through contact sensors [22]. Direct contact with the person's skin requires the willingness to correctly wear the device. Novel wearable sensors have to meet various technological requirements (e.g., reliability, robustness, availability, and quality of data), which are often very difficult to fulfill. Finally, the time required for sensor placement is not negligible.

In recent years, thermal InfraRed (IR) imaging has been used for Affective Computing [23–25], as it combines the advantages of both the visual and the physiological measurement approaches while overcoming their drawbacks. Thermal IR imaging-based Affective Computing is a breakthrough technology that enables monitoring human physiological parameters and Autonomic Nervous System (ANS) activity in a non-contact manner and without constraining the subject [26]. Although thermal IR imaging has been widely used for human emotion recognition [27–35], its application in HRI implies overcoming ambitious and entirely new challenges. The most important ones are real-time monitoring and real-life applications. To this end, the use of a mobile and ideally miniaturized technology is essential to bridge the gap between laboratory and real-world applications. In addition, since robots should be commercially available, they need to include low-cost technology, which implies facing further issues such as low resolution and low signal-to-noise ratio.

In this survey, the importance of facilitating HRI in different aspects of everyday life is first highlighted. The state of the art of affective state recognition through thermal IR imaging is then briefly reviewed. Afterwards, the above-mentioned challenges are examined, indicating what has been done so far and suggesting further developments to ensure the future efficient use of the technique in the field. Finally, the possible impact of this technology for HRI applications in infants, children, and adults is highlighted.

#### **2. Study Organization and Search Processing Method**

This study was carried out as a systematic literature review based on the guidelines originally proposed by Kitchenham [36]. The work aims to highlight the positive impact that thermal IR imaging technology can have on HRI applications, facilitating and enhancing this interaction by recognizing the human affective state. For this purpose, the research questions (RQ) addressed by this study were:

RQ1. What is the broad social impact of facilitated HRI? In other words, why is facilitated HRI important?

RQ2. What are the scientific bases for thermal IR imaging as an affective state recognition technique?

RQ3. What are the limitations of current research?

RQ4. Has thermal IR imaging already been addressed in the HRI field?

One of the goals of this study was to make this review as inclusive and practical as possible. Therefore, the databases searched were both Scopus and Google Scholar. All the papers published in conferences and journals between 2000 and 2020 were considered. The year 2000 was chosen as the starting point since the term "Affective Computing", coined by Picard [6], started to be applied in the HRI field around that time. For each research question, a search process was applied.

Concerning RQ1, the search was based on the words "facilitate" and "human–robot interaction". In the Scopus database, the survey was set up by searching for those words within the following fields: article title, abstract, and keywords. The basic search generated 441 results. In the Scholar database, on the other hand, the advanced search can be performed either by searching (i) the entire text or (ii) the title only. Therefore, the search was based on "facilitate human–robot interaction" within the entire text. A total of 360 results were obtained from the Scholar survey. Review papers and all the papers that did not refer to a user study or did not relate to HRI were excluded, which reduced the considered pool to 155 papers. Of those papers, 130 resulted from the Scopus search and 105 from Scholar, with an overlap of 80 papers. A manual review process was adopted for the final exclusion: the papers' abstracts were scanned, and all the papers that did not report experimental applications of human interaction were discarded. A total of 18 papers were discussed in the review concerning these keywords. Among those, 13 resulted from both the Scopus and Scholar searches, while 5 were from Scholar only.

With respect to RQ2 and RQ3, the searched keywords were "thermal imaging" OR "IR imaging" OR "thermography" AND "emotion recognition" OR "Affective Computing" OR "emotion". In the Scopus database, those keywords were surveyed in the article title, abstract, and keywords fields. In Scholar, on the other hand, the advanced search was carried out by searching the entire text for "thermal imaging" OR "IR imaging" OR "thermography" together with at least one of the words "Affective Computing" OR "emotion". The search generated 115 results in Scopus and 163 in Scholar, with an overlap of 95 papers. The results were scanned through a manual review procedure focused on the papers' abstracts, which aimed to identify whether the considered works reported on thermal IR imaging for human emotion recognition. Papers not related to it were excluded. The resulting papers were analyzed and grouped based on their experimental applications, their strengths and limitations were indicated, and 46 were reported in the present work. Among those papers, 40 resulted from both the Scopus and Scholar searches, while 6 were from Scholar only.

Finally, as for RQ4, the keywords "thermal imaging" OR "IR imaging" OR "thermography" AND "human–robot interaction" were searched. The fields checked and the procedure performed were the same as reported for the previous RQs. The basic search generated 13 results in Scopus and 388 in Scholar. Ten papers resulting from the Scopus search were also found in the Scholar search outcome. All results that were not conference or journal papers, or that were not actually related to thermal IR imaging applied in the HRI field, were excluded. From the remaining pool, 20 papers on thermal IR imaging-based Affective Computing were selected and included in this work. Among those papers, a subset of 2 papers was linked to both RQ2/RQ3 and RQ4 and was therefore reported in both sections.

#### **3. The Importance of Facilitating Human–Robot Interaction**

Robotic technologies and HRI are being increasingly integrated into real-life contexts. Modern society creates ever more spaces where robot technology is intended to interact with people. Robot applications range from education to communication, assistance, entertainment, healthcare, and tourism. Hence, there is a need to better understand how robotic technologies shape the social contexts in which they are used [37]. In this section, the importance of natural and efficient HRI in three major fields, namely education, healthcare, and tourism, is examined in depth.

Economic and demographic factors drive the need for technological support in education. The growing number of students per class, the reduction of school budgets, and the demand for greater personalization of curricula for children with diverse needs are encouraging research on technology-based support for parents and teachers [38]. Hence, the efficacy of robots in education is of primary interest. Movellan et al. deployed a fully autonomous social robot in a nursery school's classrooms for a period of 2 weeks in order to study whether the robot could improve target vocabulary skills in toddlers (18–24 months of age) [39]. The results showed that vocabulary skills improved significantly; in fact, the number of learned words increased by 27% when compared to a matched control set. Considerable educational benefits can also be obtained from a robot that takes on the role of a novice (i.e., a care-receiving robot), thus allowing the student to take on the role of the instructor, which generally improves self-confidence and learning outcomes. This phenomenon is known as learning by teaching. An example is the educational use of Pepper (Figure 1), which was designed to learn together with children in their home environment from a remote teacher [40].

**Figure 1.** Pepper robot interaction with a child.

Social robots are also widely used in healthcare. Given the exponential growth of vulnerable populations (e.g., the elderly, children with developmental disabilities, and sick people), there is an increasing demand for social robots that are able to provide aid and entertainment for patients in the hospital or at home. For instance, companion robots can make an important contribution, especially among sick people, to the mitigation of boredom, depression, isolation, and loneliness. In this perspective, Banks et al. explored the ability of a robotic dog (AIBO) to treat loneliness in elderly patients living in long-term care facilities (LTCF). The results demonstrated that LTCF residents showed a high level of attachment to AIBO, highlighting the capability of interactive robotic dogs to reduce loneliness [41].

Together with companion robots, therapy robots are also considered of high social impact. Therapy robots are generally employed to deliver treatment to people with physical and mental conditions, such as Autism Spectrum Disorder (ASD). A recent review reported statistics showing a steady increase in the prevalence of ASD, from 1 out of 1000 children in 1970 up to 1 out of 59 children in 2018 [42]. Thus, the need for innovative care and proper attention for children with ASD is compelling. In particular, research has shown that people with ASD responded better to treatments involving robotic technology than to treatments from human therapists [43]. Moreover, it was demonstrated that the use of robots in education helps children with ASD improve their abilities to handle social and sensory challenges in the school environment and to better control anxiety and stress [44,45]. Furthermore, Wilson et al. noted that one of the major barriers to the education of children with autism pertains to the lack of knowledge, training, and specialized support staff, as well as the lack of adequate educational resources and appropriate classroom sizes [46]. Therefore, the development of robots with advanced emotion recognition and HRI capabilities that can be used at school or at home is becoming essential in supporting ASD patients.

Finally, several applications in the literature confirm the importance of HRI in tourism settings. In fact, with recent technological advancements in artificial intelligence (AI) and robotics, an increasing number of service robots are entering tourism and hospitality contexts, including consumer-facing ones [47]. For instance, Niculescu et al. developed SARA (Singapore's Automated Responsive Assistant), a robotic virtual agent, to offer information and assistance to tourists, with the ability to detect the user's location on a map [48]. CLARA is a virtual restaurant recommendation system and conversational agent that provides tourists with information about sightseeing, restaurants, transportation, and general information about Singapore [49]. However, the adoption of service robots inevitably changes the nature of the service experience. Unlike industrial robots, whose performance metrics depend entirely on efficiency, the success of service robots depends on user satisfaction and, consequently, on the degree of empathy and natural interaction. Tussyadiah et al. focused on consumers' evaluation of hotel service robots. In their study, they surveyed consumer responses to two types of robots, NAO and Relay. The results revealed that consumers' intention to accept hotel service robots is influenced by their interaction with the robot and by dimensions of anthropomorphism, perceived intelligence, and security [47]. The same conclusion was drawn by Kervenoael et al. [50], who stated that empathy from service robots has a positive and significant effect on the intention to use a robot. Empathy here means the robot's ability to understand or feel what another person is experiencing from within their frame of reference, i.e., the robot's ability to recognize emotion and respond to it in an appropriate way. In conclusion, beyond being well programmed, social robots must seamlessly integrate safe and reliable service that includes courtesy and inspires trust in order to cater to the needs of specific consumers or interlocutors.

With sophisticated robots and artificial agents becoming ever more ubiquitous in daily life, their appropriate use is crucial. In this section, we provided examples of their positive impact on fields such as education, healthcare, and tourism. Research on HRI requires contributions and expertise from heterogeneous disciplines, including engineering and artificial intelligence. In fact, to ensure natural HRI, social robots need to recognize human emotions, respond adequately to them, understand and generate natural language, have reasoning skills, plan actions, and execute movements in line with what is required by the specific context or situation [51]. A substantial part of this effort can be ascribed to emotion recognition, which currently requires exploring the fields of facial expression analysis and speech processing, anything but trivial tasks. By relying on emotion recognition through non-obtrusive physiological sensing, thermal IR imaging-based affective computing can be a way to avoid some of those challenges. Although this technique is already used in the field of emotion recognition, methodological considerations are required to make it suitable for HRI applications.

#### **4. Affective States Recognition through Thermal IR Imaging**

A considerable number of studies have explored the use of thermal IR imaging for classifying affective states and human emotions. Those studies were based on measuring a person's physiological cues that are considered related to affective states. Indeed, the observations of affective nature derive primarily from muscular activity, subcutaneous blood flow, and perspiration patterns in specific body parts, as well as from changes in breathing or heart rate, which are all phenomena controlled by the ANS. Measuring facial cutaneous temperature and assessing both its topographic and temporal distribution can provide insights into a person's autonomic activity. This depends on the ANS's role in the human body's thermal homeostasis and in the regulation of physiological responses to emotional stimuli [29]. Alertness, anxiety, frustration, and other affective states determine a redistribution of blood in the vessels through vasodilation or vasoconstriction phenomena, which are regulated by the ANS. These phenomena can be captured by an IR thermal camera through changes in the IR radiation emitted by the skin. Vasoconstriction implies a decrease in the temperature of the Region Of Interest (ROI), whereas vasodilation causes warming. Vasomotor processes can be identified and monitored over time because they produce a thermal variation of the skin, and they can be characterized by simple metrics such as the temperature difference between data at two temporal points. For this reason, most researchers have focused on studying the relationship between thermal directional changes (i.e., temperature drops and rises) of specific skin areas and psychophysiological states [28,31,52,53]. As a body area of interest, the human face is considered of particular importance, since it can be easily recorded and it is naturally exposed to social stimuli. Regions of the face that have been extensively characterized are the nose or nose tip, the glabella (area associated with the corrugator muscle), the periorbital area, the forehead, and the orbicularis oculi (surrounding the eyes), as well as the maxillary area or the upper lip (perinasal) [24,28,54]. Partially evaluated regions were the cheeks, carotid region, eyes, fingers, and lips. An exhaustive review on this topic by Ioannou et al. summarized the emotions, the observed regions, and the direction of the average temperature changes in those regions (Table 1) [55].
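Since the directional-change metric described above is simply the difference between ROI temperatures at two time points, it can be made concrete in a few lines. The following Python sketch assumes calibrated thermal frames (2D arrays of skin temperatures in °C) and a manually specified rectangular ROI; the function and parameter names are illustrative, not taken from the reviewed studies.

```python
import numpy as np

def roi_mean_series(frames, roi):
    """Mean temperature of a rectangular ROI for each calibrated thermal frame.

    frames: iterable of 2D arrays of skin temperatures (degrees C)
    roi: (row, col, height, width) of the region of interest
    """
    r, c, h, w = roi
    return np.array([f[r:r + h, c:c + w].mean() for f in frames])

def directional_change(series, t_baseline, t_event, fs):
    """Temperature difference between two temporal points (in seconds).

    A positive value suggests vasodilation-like warming of the ROI,
    a negative value vasoconstriction-like cooling.
    """
    i0, i1 = int(t_baseline * fs), int(t_event * fs)
    return series[i1] - series[i0]
```

Under this reading, a negative value would be interpreted as vasoconstriction-like cooling of the region and a positive one as vasodilation-like warming.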

**Table 1.** Overview of the temperature variations direction, as estimated through thermal IR imaging, for each emotion in the different regions considered. The table is adapted from [55].

Similar results were found in a recent study by Cruz-Albarran et al., where thermal IR imaging was used during an emotion induction process to quantify temperature changes occurring in different ROIs of the face [56]. The authors were able to classify the emotions based on regional temperatures with an accuracy of 89.9%. The induced emotions were joy, disgust, fear, anger, and sadness. The examined ROIs were the nose, cheeks, forehead, and maxillary area, which are depicted in Figure 2. Among all the ROIs, the nose and maxillary area were the most responsive to emotional stimuli, as they showed a significant change in temperature for all the induced emotions. The forehead temperature changed during sadness, anger, and fear, while the temperature of the cheeks changed during disgust and sadness.

**Figure 2.** Thermal IR images with marked ROIs (black rectangles).

Emotion identification through thermal IR imaging has also been employed in studies on children. Such applications may benefit the most from this technology, given the ecological nature of the technique and the difficulties associated with applying skin-contact sensors to children. Goulart et al. proposed an experimental design to identify five emotional states (disgust, fear, happiness, sadness, and surprise) evoked in children between 7 and 11 years old [57]. The forehead, the tip of the nose, the cheeks, the chin, and the periorbital and perinasal regions were chosen to extract the affective information. Each thermal frame was processed by segmenting the ROIs and evaluating the mean, variance, and median of each ROI's temperature. Then, a linear discriminant analysis was used as a classifier. High accuracy (higher than 85%) was obtained for the classification of the five emotions, thus resulting in a robust method to identify quantitative patterns for emotion recognition in children. A temperature decrease was detected in the forehead region during disgust and surprise; in the periorbital region during happiness, sadness, and surprise; in the perinasal region during disgust and happiness; in the chin during surprise, happiness, and sadness; and finally, in the nose during disgust, fear, and happiness. A temperature increase was detected in the left cheek region for all emotions and in the nose tip during surprise.
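The feature-plus-classifier recipe reported by Goulart et al. (per-ROI mean, variance, and median fed to a linear discriminant analysis) maps directly onto standard tooling. Below is a minimal Python sketch of that pipeline using scikit-learn; the data are random placeholders standing in for real per-trial features and emotion labels, so the printed score is meaningless and only the structure is illustrative.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def roi_statistics(roi_temperatures):
    """Mean, variance, and median of each ROI's temperature values,
    mirroring the per-ROI statistics reported by Goulart et al."""
    feats = []
    for temps in roi_temperatures:  # one 1D array of pixel temperatures per ROI
        feats.extend([temps.mean(), temps.var(), np.median(temps)])
    return np.array(feats)

# Placeholder data: 100 trials, 6 ROIs x 3 statistics = 18 features,
# labels 0..4 for the five emotions. Real features would come from
# segmented thermal frames.
rng = np.random.default_rng(0)
X = rng.random((100, 18))
y = rng.integers(0, 5, size=100)

lda = LinearDiscriminantAnalysis()
print(cross_val_score(lda, X, y, cv=5).mean())
```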

Beyond the basic emotions, thermal IR imaging has been used to characterize the two dimensions of emotions, namely valence (pleasant versus unpleasant) and arousal (low versus high). Emotion dimensions are a crucial aspect of the affective research field, on which most studies on emotion recognition are based. The most commonly used models representing emotion dimensions in HRI are the Pleasure, Arousal, Dominance (PAD) emotional state model [58] and the circumplex model of affect [59]. The PA dimensions of PAD were developed into the circumplex model, which assumes that any emotion can be described by the two continuous dimensions of valence and arousal [60]. The valence dimension indicates whether the subject's current emotional state is positive or negative. Arousal, on the other hand, indicates whether the subject is responsive or not, at that given moment and for that given stimulus, and how active he/she is. In particular, the dimensional theory of emotion proposes that emotional states are not discrete categories but rather the result of varying degrees of their dimensions. A graphical representation of the circumplex model of affect developed by Russell is reported in Figure 3. For example, joy is characterized as the product of strong activation in the neural systems associated with positive valence or pleasure together with moderate activation in the neural systems associated with arousal. Emotions other than joy likewise arise from the same two-dimensional systems but differ in the degree or extent of activation. This also allows characterizing complex emotions, such as love or happiness, beyond the basic ones. The analysis of emotion recognition solutions reveals that there is no single commonly accepted standard model for emotion representation. The dimensional adaptation of Ekman's six basic emotions and the circumplex or PAD model are the ones most widely adopted in emotion recognition solutions [61].
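To make the dimensional representation concrete, the toy function below discretizes a continuous (valence, arousal) estimate into the four quadrants of the circumplex plane. The quadrant labels are illustrative examples only; the model itself treats emotions as continuous regions of the plane rather than discrete bins.

```python
def circumplex_quadrant(valence: float, arousal: float) -> str:
    """Map a (valence, arousal) pair, each in [-1, 1], to a quadrant
    of the circumplex plane. Labels are illustrative examples."""
    if valence >= 0:
        return ("pleasant-activated (e.g., joy)" if arousal >= 0
                else "pleasant-deactivated (e.g., calm)")
    return ("unpleasant-activated (e.g., stress)" if arousal >= 0
            else "unpleasant-deactivated (e.g., sadness)")
```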

**Figure 3.** Graphical representation of the circumplex model of affect with the horizontal axis representing the valence dimension and the vertical axis representing the arousal dimension. Adapted from [60].

Recent studies have explored temperature changes associated with different degrees of valence and arousal. For instance, Salazar-López et al. studied the relation between changes in the temperature of the subject's face and the valence and arousal dimensions [53]. They used pictures from the International Affective Picture System (IAPS), which is widely used in studies of emotion recognition and is characterized along the two dimensions [62]. The analyzed ROIs were the forehead (left and right sides), the tip of the nose, the cheeks, the eyes, and the mouth region. Significant differences in temperature were found only on the tip of the nose. The results showed that high arousal images elicited temperature increases on the tip of the nose, while for low arousal images the temperature increased for pleasant images (i.e., positive valence) and decreased for unpleasant ones (i.e., negative valence). Contrasting results were found by Kosonogov et al. [63], however. These authors found no significant temperature differences along the valence dimension of the emotions (i.e., pleasant and unpleasant emotions). Nevertheless, an arousal effect of emotional pictures was found on the amplitude and latency of nasal thermal responses: the more arousing the pictures, the faster and larger the thermal responses. The only evaluated region was the tip of the nose. The relevance of emotional arousal in causing changes in nose tip temperature is supported by different studies, such as Diaz-Piedra et al., where a direct relationship was found between reduced levels of arousal and nasal skin temperature [64], a finding confirmed by Bando et al. [65].

Moreover, Pavlidis et al. proposed quantifying stress through thermal IR imaging [54]. Stress in the valence–arousal space is identified as negative valence and high arousal [66]. Pavlidis et al. found that highly stressful situations result in a cooling of the area around the nose tip [54]. Parallel results were obtained in children: in their study of guilt in children (aged 39–42 months), Ioannou et al. observed that the stronger the signs of distress, the larger the decrease in nose temperature [67]. The sense of guilt was induced through the "mishap paradigm", in which children were led to believe they had broken the experimenter's favorite toy (Figure 4). The temperature of the nasal area decreased following the "mishap" condition, suggesting sympathetic activation and peripheral nasal vasoconstriction, and it increased after soothing due to parasympathetic activation.

**Figure 4.** Child facial temperature during the mishap condition. Adapted from [68].

Vasomotor processes, subcutaneous blood flow, and perspiration patterns are not the only physiological measures that can be detected through thermal IR imaging. In fact, physiological signatures of autonomic system activation, such as heart and breathing rate variations, can also be monitored [69–72]. These two parameters are measurable with thermal IR imaging by positioning a ROI over a superficial vessel or over the nostrils, respectively, and by monitoring the ROI's average temperature over time. Whereas the latter is easily detectable, because the thermal difference between expired and inspired air is easily appreciable (approximately 0.5 °C), the modulation of temperature caused by the pulsation of blood in the vessel is not. Therefore, a series of algorithms have been developed for the estimation of heart rate through thermal imaging [69]. Cross et al. used thermal IR imaging to detect physiological indicators of stress in the adult population by analyzing respiration and heart rate variations during the performance of mental and physical tasks [26]. Temperature variation over time was recorded on the nose tip and in regions near superficial arteries to detect respiration and heart rates, respectively. The results showed that the accuracy of the physical versus psychological stressor classification was greater than 90%, and heart and respiration rates were accurately detected by thermal IR imaging. Whereas the evaluation of heart rate through thermal IR imaging is not common in the literature, respiration rate assessment through thermal IR imaging has been tested in different settings, including sleep monitoring [73], neonatal care [74], and driver drowsiness monitoring [75].
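As a concrete illustration of the nostril-ROI approach, the sketch below estimates a breathing rate from the average ROI temperature over time by detrending the signal and picking the dominant spectral peak in a plausible respiration band. This is a minimal, assumption-laden example (the function name, the 0.1–0.85 Hz band, and the input format are ours), not the specific algorithm of any study cited above.

```python
import numpy as np
from scipy.signal import detrend, periodogram

def breathing_rate_bpm(nostril_temps, fs):
    """Estimate respiration rate from the mean temperature of a nostril
    ROI over time: exhaled air warms the ROI, inhaled air cools it.

    nostril_temps: 1D array of per-frame mean ROI temperatures
    fs: thermal camera frame rate (Hz)
    """
    signal = detrend(np.asarray(nostril_temps))   # remove slow drift
    freqs, psd = periodogram(signal, fs=fs)
    # restrict to a plausible respiration band (0.1-0.85 Hz, i.e., 6-51 bpm)
    band = (freqs >= 0.1) & (freqs <= 0.85)
    dominant = freqs[band][np.argmax(psd[band])]
    return dominant * 60.0
```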

#### **5. Limits of Current Thermal IR Imaging for HRI Applications**

Studies reviewed in the previous section revealed the ability of thermal IR imaging to monitor physiological signs and affective states. Although they have sometimes shown incongruent results (e.g., nose tip temperature did not significantly change along the valence dimension of the emotions in Kosonogov et al. [63], whilst it did change in Salazar-López et al. [53]), these findings open exciting prospects for affective computing. One cause of the inconsistency could be that a single discrete metric is not always sufficient, since it can be sensitive to complex, overlapping physiological mechanisms [76]. In addition, there could be interaction effects between affective states. Nonetheless, all the analyzed affective states proved capable of inducing ROI temperature variations. However, an important consideration is that all these studies were conducted using high-end thermal cameras in controlled laboratory settings. The thermal systems most used in the literature are the FLIR (Wilsonville, OR, USA) A655sc (640 × 480 spatial resolution, 50 Hz sampling rate, <0.03 °C thermal sensitivity) and the FLIR A325sc (320 × 240 spatial resolution, 60 Hz sampling rate, <0.05 °C thermal sensitivity). In addition, all the reported observations were made in a climate-controlled room according to the International Academy of Thermology (IACT) guidelines [77]. In fact, the IACT guidelines indicate that, when performing a thermal IR imaging measurement, it is mandatory to control the temperature and humidity of the experimental room. They suggest a temperature range of 18–23 °C and a controlled humidity range between 40% and 70%. Direct ventilation on the subject and direct sunlight (no windows, or windows with curtains or blinds) should also be avoided during the experimental measurements. In conclusion, the analysis of the studies reported in this section identified two main constraints that are not suitable for HRI applications: (1) the use of high-end, full-sized thermal imaging systems and (2) the circumstances in which the studies were conducted, i.e., restricted laboratory settings. On the other hand, HRI applications require daily life scenarios, possibly suitable for outdoor use, and technology that can be embedded in commercial social robots, i.e., low-cost miniaturized sensors. In Sections 6 and 7, these limits are addressed in order to highlight the improvements developed in recent studies. Special emphasis was placed on these two sections, as they deal with crucial aspects of the use of thermal IR imaging in the HRI field. Finally, the last section focuses on the current state of the art of thermal IR imaging-based affective computing applications.

#### **6. Mobile Thermal IR Imaging**

The significant spread of thermal IR technology, together with the miniaturization of IR detectors, has led manufacturers to produce portable thermographic systems, i.e., mobile and low-cost thermal IR imaging devices. One of the first companies to commercialize mobile thermal devices was FLIR, with systems such as the FLIR ONE Pro (160 × 120 spatial resolution, approximately 8.7 Hz sampling rate, 0.15 °C thermal sensitivity, and dimensions of 68 × 34 × 14 mm³) and the FLIR Lepton (80 × 160 spatial resolution, approximately 8.7 Hz sampling rate, <0.05 °C thermal sensitivity, and dimensions of 11.8 × 12.7 × 7.2 mm³). The FLIR ONE was designed to be integrated on mobile phones. Another mobile thermal system designed to be integrated on a mobile phone is the Therm-App system developed by Opgal (384 × 288 spatial resolution, approximately 9 Hz sampling rate, approximately 0.07 °C thermal sensitivity, and dimensions of 55 × 65 × 40 mm³). Market research has shown that the SmartIR640 mobile thermal system manufactured by Device aLab (640 × 480 spatial resolution, 30 Hz sampling rate, <0.05 °C thermal sensitivity, and dimensions of 27 × 27 × 18 mm³) is also a valid solution, but it has not yet been used in research projects.

Despite the relatively low-quality thermal imaging output of mobile thermal systems, this technology could help bridge the gap between findings from highly constrained laboratory environments and real-world applications in the wild. Indeed, its portability (e.g., small size and low computational resource requirements) allows the camera not only to be easily attached to a mobile phone but also to be integrated into a social robot's head. Recent studies have started to explore mobile thermal IR imaging for affect recognition tasks, especially focused on stress monitoring [78–81]. Cho et al. proposed a system consisting of a smartphone camera-based PhotoPlethysmoGraphy (PPG) and a low-cost thermal camera added to the smartphone, which was designed to continuously monitor the subject's mental stress [78]. By analyzing the nose tip temperature and the blood volume pulse through PPG [82,83], they were able to classify stress with an accuracy of 78.33%, which is comparable to state-of-the-art stress recognition methods. The mobile thermal camera employed was the FLIR ONE. The study was conducted in a quiet laboratory room with no distractions. Another study by the same authors used the mobile thermal camera as a standalone system to monitor mental stress [80]. The authors proposed a novel low-cost, non-contact thermal IR imaging-based stress recognition system that relied on the analysis of breathing dynamics patterns. In fact, since breathing is an important vital process controlled by the ANS, monitoring its pattern can be informative of a person's mental and physical condition. The results showed a classification accuracy of 84.59% for binary classification (i.e., no-stress, stress) and an accuracy of 56.52% for multi-class classification (i.e., none, low, and high-level stress). The breathing signal was recovered by tracking the person's nostril area. Then, the extracted breathing signal was converted into a two-dimensional spectrogram by stacking the Power Spectral Density (PSD) vectors of short-time-window respiration signals over time. Since the PSD captures the short-time autocorrelation that identifies similarities between neighboring signal patterns, it can be used to examine respiration variations over a short period [65]. The breathing signal was also investigated through the use of a mobile thermal camera by Ruminski et al., who embedded such a camera in smart glasses [84]. Basu et al., instead, used a mobile thermal system for the challenging purpose of classifying personality (psychoticism, extraversion, and neuroticism) [81]. The proposed system classified the emotional state using an information fusion of thermal and visible images. A blood flow perfusion model was used to obtain discriminating eigenfeatures from the thermal system. These eigenfeatures were then fused with those of the visible images and classified. The blood perfusion model was obtained by analyzing the thermogram of the entire face and using Pennes' bioheat equation. The classification performance reached 87.87%.
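The stacking of short-time PSDs into a two-dimensional representation can be sketched with standard signal processing tools. The snippet below builds such a breathing spectrogram with SciPy; the window length, overlap, and 1 Hz cutoff are illustrative choices, not the exact parameters used by Cho et al.

```python
import numpy as np
from scipy.signal import spectrogram

def breathing_spectrogram(breath_signal, fs, win_s=20.0, overlap=0.9):
    """Stack short-time PSDs of the respiration signal into a 2D map
    that can be fed to an image-based classifier.

    breath_signal: 1D respiration signal recovered from the nostril ROI
    fs: sampling rate of the signal (Hz)
    """
    nperseg = int(win_s * fs)
    freqs, times, Sxx = spectrogram(
        np.asarray(breath_signal), fs=fs,
        nperseg=nperseg, noverlap=int(nperseg * overlap))
    keep = freqs <= 1.0  # respiratory content lies below ~1 Hz
    return freqs[keep], times, Sxx[keep]
```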

Summarizing, mobile thermal IR imaging can provide a high level of flexibility and suitability for recovering physiological signatures such as the breathing signal and for recognizing a person's affective states. However, a key challenge in the use of a mobile thermal system is the low quality of the output signal due to the low thermal and spatial resolution of the imaging system. The low spatial resolution can be addressed by bringing the thermal camera closer to the region of interest, but this is not always possible. An interesting method to overcome this issue is presented in Cho et al. [85]. The authors proposed the *Thermal Gradient Flow* and *Thermal Voxel Integration* algorithms. *Thermal Gradient Flow* is mainly based on building thermal-gradient magnitude maps to enhance the boundary around the region of interest, which in turn makes the system robust to motion artifacts in the presence of low-resolution images. The *Thermal Voxel Integration*, instead, consists of projecting the 2D thermal matrix onto a 3D space by taking a unit thermal element as a thermal voxel. This method was applied to breathing signal analysis and produced higher-quality breathing patterns.
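The building block of such gradient-based tracking, the thermal-gradient magnitude map, is straightforward to compute. The sketch below shows one plausible formulation with NumPy; it illustrates the general idea of emphasizing ROI boundaries rather than reproducing the actual Thermal Gradient Flow algorithm of Cho et al.

```python
import numpy as np

def thermal_gradient_magnitude(frame):
    """Gradient-magnitude map of a thermal frame: boundaries such as the
    nostril edge produce high gradients, making them easier to localize
    and track in low-resolution imagery."""
    gy, gx = np.gradient(frame.astype(float))
    return np.hypot(gx, gy)
```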

#### **7. Thermal IR Imaging-Based Affective Computing Outside Laboratory Settings**

The market entry of smaller and low-cost thermal cameras is paving the way for thermal IR imaging applications outside laboratory environments. However, only a few studies have been conducted so far. One example is reported by Goulart et al., who proposed a camera system composed of a thermal and a visible camera for emotion recognition in children [86]. The camera system was attached to the head of a social robot, and the experiment was conducted in a room within the children's school environment. The room temperature was kept between 20 and 24 °C, with constant luminous intensity. To capture thermal variations in the children's faces, they used the mobile thermal camera Therm-App. A similar set-up was reported by Filippini et al. [87]. The authors installed a mobile thermal IR imaging system, the FLIR ONE, on the head of a social robot. The study was conducted in a primary school. Filippini's and Goulart's studies are described in detail in the next section. Although both experiments were performed outside a laboratory setting, they still had constraints that cannot be adapted to all real-life applications, such as the need to maintain a stable temperature, which is not possible in open-air contexts or applications. Cho et al., instead, conducted one of their experiments in unconstrained settings with varying thermal dynamic range scenes (i.e., indoor and outdoor physical activity). The experiment aimed at monitoring a person's thermal signatures while walking [85].

Concerning the employment of mobile thermal systems for affective computing purposes, one of the most compelling challenges in real-world environments is ensuring reliable thermal tracking of the chosen ROIs. This is due to dynamic changes in ambient temperature, which affect the skin and can cause an inconsistent thermal signal, coupled with the low resolution of low-cost thermal systems. These aspects make it difficult to track ROIs automatically. Moreover, applications in real-world scenarios require real-time responses from the sensors of interest. Hence, automatic recording of thermal IR imaging data and real-time processing are required. In this regard, signal processing techniques need to be chosen based on their efficiency in terms of computational load to allow acceptable performance in real-time processing. In Goulart et al. and Filippini et al., the tracking algorithm for thermal images relied on the visible images [86,87]. The first phase consisted of camera calibration, performed through synchronous acquisition of visible and thermal images of a checkerboard whose details were clearly detectable by both the visible-spectrum and the thermal camera. After the calibration, face detection and ROI localization were performed on the visible image, since state-of-the-art computer vision algorithms can be used in the visible domain, and the ROI coordinates in the visible images were then converted to those of the thermal image. In Goulart et al., the Viola–Jones algorithm was used for ROI detection in the visible images, and these ROIs were then transferred to the corresponding thermal camera frame through a homography matrix [86]. In Filippini et al., the authors used an object detector based on the histogram of oriented gradients (HOG) for face localization in the visible images [87]. The extraction of landmarks was based, instead, on a regression tree ensemble algorithm. On average, 82.75% of the frames were correctly tracked. However, it is important to mention that a further improvement would be to develop a real-time tracker based only on the IR videos acquired by low-resolution thermal cameras, in order to avoid problems due to low-light environments and to infer the psychophysiological state of the human interlocutor. An attempt in this direction was proposed by Cho et al., where the authors were interested in tracking the physiological signal related only to the breathing process, focusing on the nostril region [85]. The approach they proposed was intended to compensate for the effects of variations in ambient temperature and movement artifacts, and it was named the "Optimal Quantization" method. The quantization process itself is the process of translating a continuous temperature value into its digital color-mapped equivalent. The method consists of adaptively quantizing the thermal distribution sequences by finding a thermal range of interest that contains the whole facial temperature distribution for every single frame. Despite this enhancement, the approach cannot cover all possible scenarios, such as contexts with high humidity or severe temperature conditions. Nonetheless, further development of automatic ROI tracking on thermal images in entirely mobile and ubiquitous situations is required.
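The visible-to-thermal coordinate transfer described above can be illustrated with OpenCV: a homography is estimated once from matched checkerboard corners and then reused to map any ROI or landmark coordinates from the visible frame into the thermal frame. In this minimal sketch, the corner files and the example landmark are hypothetical placeholders, not data from the cited studies.

```python
import cv2
import numpy as np

# Hypothetical calibration data: matched checkerboard corner coordinates
# found in a pair of synchronously acquired visible and thermal images
# (both N x 2 arrays; file names are placeholders).
pts_visible = np.load("corners_visible.npy").astype(np.float32)
pts_thermal = np.load("corners_thermal.npy").astype(np.float32)
H, _ = cv2.findHomography(pts_visible, pts_thermal, cv2.RANSAC)

def to_thermal(points_visible, H):
    """Map ROI/landmark coordinates detected in the visible image into
    thermal-image coordinates using the calibration homography."""
    pts = np.asarray(points_visible, dtype=np.float32).reshape(-1, 1, 2)
    return cv2.perspectiveTransform(pts, H).reshape(-1, 2)

# e.g., a nose-tip landmark found by a visible-spectrum face detector
nose_tip_thermal = to_thermal([(212.0, 305.0)], H)
```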

#### **8. Thermal IR Imaging-Based Affective Computing in HRI**

For decades, a growing interest in the possibility of developing intelligent machines that engage in social interaction has been observed. Many researchers, including in the field of thermal IR imaging, have been enthralled by the appeal of HRI and the possibility of developing robots capable of social interaction with humans. The application of thermal IR imaging in this field was first investigated by Merla in 2014 [88]. Since then, many studies have been conducted to validate the thermal IR imaging technique for emotion recognition analysis with the aim of applying it in the HRI field. However, only a few applications have actually been implemented in this area. One of the first attempts was carried out by Sorostinean et al. [89]. In this study, the authors presented the design of a thermal IR imaging-based system mounted on a humanoid robot, performing a contact-free measurement of temperature variations across people's faces in a stressful interaction. The results showed a statistically significant interaction between the distance and gaze direction of the robot and the temperature variation of the nasal and peri-nasal region. This supports the idea that thermal imaging sensors can successfully endow robots with physiological sensing capabilities, allowing them to become aware of their effect on people, learn about their preferences, and build a reactive behavior. Agrigoroaie et al. reported promising preliminary results in their attempt to determine whether a person was trying to deceive a robot through the use of a mobile thermal camera [90]. Boccanfuso et al., instead, evaluated the efficacy of thermal IR imaging for detecting robot-elicited affective responses compared to video-elicited affective responses by tracking thermal changes in five ROIs on the subject's face [91]. They studied the interaction effects of condition (robot/video) and emotion (happy/angry) on individual facial ROIs. Although no interaction effects were found for most ROI temperature slopes, a strong, statistically significant effect of the interaction between condition and emotion was observed when evaluating the temperature slope on the nose tip. This result again confirms the assumption that the nasal area is a salient region for emotion detection [32,67,92–95]. Recent studies included applications with more challenging populations, such as infants and children. Scassellati et al. proposed the design of a unique dual-agent system that uses a virtual human and a physical robot to engage 6–12-month-old deaf infants in linguistic interactions. The system was endowed with a perception module capable of estimating infant attention and engagement through thermal imaging and eye tracking [96]. This study was part of a larger project designed to develop a system called RAVE (Robot AVatar thermal Enhanced language learning tool). RAVE is intended to be an augmentative learning tool that can provide linguistic input, in particular visual language input, to facilitate language learning during one widely recognized critical developmental period for language (ages 6–12 months [97]) [96,98–101]. To this end, thermal IR imaging was used to determine emotional arousal and attentional valence, providing new knowledge about when infants are most optimally "Ready to Learn", even before the onset of language production. This is particularly relevant for infants who might not otherwise receive sufficient language exposure. Of particular concern are deaf babies, many of whom are born to parents who do not know a signed language [96].

Although these studies were very ambitious and fascinating, they were still carried out in constrained laboratory settings, probably implying a not entirely free interaction. To date, the only two studies concerning HRI applications in an out-of-laboratory context are those reported in Section 7 [86,87].

Goulart et al. used a mobile social robot called N-MARIA (New-Mobile Autonomous Robot for Interaction with Autistics), which was built at UFES (Brazil) to assist children during social relationship rehabilitation [86]. During the interaction with the robot, which lasted two minutes, the child was encouraged to communicate with and touch the robot. The robot was equipped with low-cost hardware (a thermal and a visible camera) used to provide information about the emotional state of the child with respect to five emotions (i.e., surprise, fear, disgust, happiness, and sadness). The results showed that the system was able to recognize those emotions, achieving 85.75% accuracy. Such accuracy is comparable with that of gold standard techniques in emotion recognition, such as facial expression analysis and speech tone analysis [102–106]. Filippini et al. used the social robot "Mio Amico", produced by Liscianigiochi, on whose head the mobile thermal system FLIR ONE (which includes a thermal and a visible camera) was installed [87]. The study aimed to endow the robot with the capability of real-time assessment of the interlocutor's state of engagement (positive, neutral, or negative emotional engagement). During the interaction between the robot and the child, the robot could either tell a fairy tale or sing a song. At the end of the fairy tale or song, the robot asked the child whether he/she liked it and wanted to listen to another one. Based on the child's answer, the robot chose its next action. The engagement state of the child was classified by analyzing the child's thermal modulation using a low-computational-cost processing pipeline. Figure 5 summarizes the processing pipeline for identifying the interlocutor's state of engagement (A–C) and the child–robot interaction (D). The accuracy reached was 70%. Although this study presented a lower accuracy compared to Goulart et al. [86], it is worth mentioning that the estimation of the level of engagement can be considered a hard task, since engagement represents a complex emotion (a combination of the basic ones) and has been poorly investigated in the literature. Besides, the study reported in Filippini et al. [87] represents the first and, so far, only study in which the robot could actually change its activities based on the child's affective state, opening the way to a bidirectional interaction.

**Figure 5.** (**A**–**C**) Processing pipeline. (**A**) The first phase relied on the visible image to detect the child's face and locate specific facial landmarks. (**B**) The corresponding landmarks on the thermal image were then obtained thanks to a previous optical calibration procedure. (**C**) The thermal signal is extracted from the nose tip region and processed in order to obtain the subject's level of engagement. (**D**) The child–robot interactions are guided by the robot's understanding of the child's level of engagement.

#### **9. Thermal IR Imaging-Based Affective Computing in Intelligent Systems Such as Driver-Assistance Systems or Autonomous Vehicles**

Besides humanoid robots, systems such as driver-assistance systems or autonomous vehicles can also be classified as "robots" due to their intelligent behaviors. Indeed, driving functions are becoming increasingly automated; consequently, motorists run the risk of being cognitively removed from the driving process. Thermal IR imaging has been demonstrated to be a valid technique for assessing variations in cognitive load [52,107,108]. Most of all, it could enable sensing the real-time state of motorists non-invasively, i.e., without disrupting driving-related tasks and, unlike RGB cameras, independently of external light conditions [109]. Studies of thermal IR imaging in driving contexts found that a rise in mental workload leads to an increase in the difference between nose and forehead temperature [108,110]. Moreover, thermal IR imaging was demonstrated to be a valuable indicator of the driver's arousal level, from alertness to drowsiness [64,111]. A recent study employed it to detect thermal discomfort in order to develop a fully automated climate control system for vehicles [112]. Even automated vehicles will need human assistance in complex situations for a long time to come. Therefore, for road safety, driver monitoring is more relevant than ever in order to keep the driver alert and awake.
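As a minimal illustration of the workload index implied by those driving studies, the sketch below computes the nose-forehead temperature difference over time from two ROI series; the function name and the sign convention are our assumptions, not a published algorithm.

```python
import numpy as np

def workload_index(nose_temps, forehead_temps):
    """Forehead-minus-nose temperature difference over time. Per the
    driving studies cited above, this difference tends to increase as
    mental workload rises (the nose cools relative to the forehead)."""
    return np.asarray(forehead_temps) - np.asarray(nose_temps)
```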

Autonomous driving relies on the understanding of objects and scenes through images. A recent study assessed a fusion system, consisting of a visible and a thermal sensor, for object recognition on a driving dataset and demonstrated that thermal images significantly improve detection accuracy [113]. Miethig et al. argued that thermal IR imaging can improve the overall performance of existing autonomous detection systems and reduce pedestrian detection time [114].

In conclusion, thermal IR imaging can also be greatly useful in vehicle technology; however, further testing is needed to better understand how it can improve automated vehicles and knowledge about cognitive states in traffic safety.

#### **10. Discussion**

The advancement of robotics, especially social robotics, is breathtaking. Social robots could potentially revolutionize the way we take care of the sick or the elderly, the way we teach and learn, and even the definition of the concept of companionship. Nowadays, social robots hold the promise of extending life expectancy and improving health and quality of life. In this review, the impact that social robots have in fields such as education, healthcare, and tourism was briefly surveyed. In all these areas, they are mainly intended to improve or protect the lifestyle and health, both physical and mental, of the human user. In fact, robots can also help socially impaired people relate to others, practice empathetic behaviors, and act as a steppingstone toward human contact. However, the most important and desirable requirement is that the robot meets the person's needs. Robots need to be able to recognize the interlocutor's affective state to communicate naturally with him/her and to engage him/her not only on a cognitive level but on an emotional level as well.

To obtain information about the partner's affective state, there are at least two possibilities: either the user explicitly and voluntarily expresses information about his/her emotions to the robot (for example, using natural language, facial expressions, or gestures), or the robot recognizes involuntary affective information from physiological measures, such as respiration, heart rate, skin conductance, skin temperature, and blood pressure. In the present review, the use of an ecological technology, the thermal IR imaging-based affective computing technique, is promoted. This technique aims to facilitate HRI by endowing robots with the capability of autonomously identifying a person's emotional state. Thermal IR imaging, by recording the facial cutaneous temperature and its topographic distribution, is able to identify specific features clearly correlated with the emotional state, as well as measures associated with standard physiological signals of sympathetic activity. The emotional state, in fact, determines a redistribution of blood in the vessels through vasodilation or vasoconstriction phenomena, which are regulated by the ANS. These phenomena can be identified and monitored over time because they produce a thermal variation on the skin. The thermal IR imaging technique is already validated in the literature for emotion recognition tasks.

Regarding emotion recognition, gold standard techniques such as speech and facial expression analysis were mentioned. It is worth noting, however, that different models and theories for distinguishing between emotions have been developed and used by psychologists and cognitive neuroscientists. Thus, it is difficult to take a theory from one research field, such as psychology, and apply it to another, such as HRI. This remains an open issue in the HRI research field. For instance, in speech analysis, emotion recognition models developed using the utterances of a particular language usually do not yield appreciably good recognition performance for utterances from other languages [115]. On the other hand, in facial expression and gesture analysis, two main theories are currently established in emotion research: a discrete approach, claiming the existence of universal "basic emotions" [116], and a dimensional approach, assuming the existence of two or more major dimensions that can describe and distinguish different emotions [117]. The thermal IR imaging technique, i.e., the analysis of ROI temperature modulation, has been demonstrated to be suitable for emotion recognition based on both the basic emotion approach and the dimensional theory of emotion (as reported in Section 4). This makes thermal IR imaging a cross-cutting and ubiquitous technique in the area of emotion recognition and consequently a valuable contribution to HRI studies. Of course, thermal IR imaging is not the first or only technology explored for emotion recognition through physiological measures in HRI, but it seems to be one of the most ecological. In this review, the major challenges toward the application of this technique in HRI have been highlighted, bridging the gap between the constrained laboratory setting and real-world scenarios. To this end, the few studies reported in the literature were analyzed here. The application of thermal IR imaging-based affective computing in the field of HRI has been reviewed as well. Interesting results were reported by Goulart et al., whose system was able to recognize emotions with 85.75% accuracy [86]. To some extent, such accuracy is comparable with gold standard techniques in emotion recognition for HRI, such as facial expression analysis and speech tone analysis.

An important aspect that further draws attention to this technique is its adaptability to applications involving interaction between social robots and challenging populations such as neonates. This was the case of RAVE, a learning tool composed of an avatar and a social robot, which was designed to facilitate language learning during the critical developmental period (age 6–12 months) and delivered to infants who might not otherwise receive proper language exposure. In conclusion, we believe that this review paves the way for the use of thermal IR imaging in HRI, which could endow social robots with the capability of recognizing the interlocutor's emotions by relying on involuntary physiological signal measurements. These measurements may be fed to multivariate linear [118,119] or non-linear regressors or classification algorithms [120], also relying on data-driven machine learning and deep learning approaches [121,122]. In this way, it is possible to avoid the artifact of social masking and make HRI suitable also for people who lack the ability to express emotions. Beyond robots, intelligent systems such as autonomous vehicles or even smart buildings could also benefit from this technique.
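As a purely illustrative sketch of this last step, the snippet below feeds placeholder thermal ROI features to a non-linear classifier; the feature layout, labels, and model choice are assumptions for illustration, not the pipeline of any study reviewed here:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data: rows are observations, columns are hypothetical
# thermal features (e.g., mean nose-tip, forehead, and periorbital
# temperatures and their short-term variations); binary labels stand
# in for two affective states.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = rng.integers(0, 2, size=200)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))  # non-linear classifier
clf.fit(X, y)
```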

#### **11. Conclusions**

HRI is a relatively young discipline that has attracted a lot of attention in recent years due to the increasing availability of complex robots and people's exposure to such robots in their daily lives. Moreover, robots are increasingly being developed for real-world application areas, such as education, healthcare, eldercare, and other assistive applications. A natural HRI is crucial for the beneficial influence that robots can have on human life. Understanding the interlocutor's needs and affective state during the interplay is the foundation of a human-like interaction. To this end, an ecological technology, thermal IR imaging, which can provide information about physiological parameters associated with the subject's affective state, was presented and surveyed here. This technology can provide the ground for the further development of robust social robots and facilitate HRI. Thermal IR imaging has already been validated in the literature in the field of emotion recognition. This review can act as a guideline to, and foster, the use of thermal IR imaging-based affective computing in HRI applications, which is intended to support a natural HRI, with special regard to those who find it difficult to express emotions.

**Author Contributions:** Conceptualization, C.F., A.M.; investigation, C.F., D.C., A.M.C., D.P.; writing—original draft preparation, C.F.; writing—review and editing D.C., A.M.C., D.P.; supervision, A.M.; funding acquisition, A.M. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by PON FESR MIUR R&I 2014-2020 - Asse II - ADAS+, ARS01\_00459 and PON MIUR SI-ROBOTICS, ARS01\_01120.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

#### *Article*

**Call Redistribution for a Call Center Based on Speech Emotion Recognition**

#### **Milana Bojanić 1,\*, Vlado Delić <sup>1</sup> and Alexey Karpov <sup>2</sup>**


Received: 26 April 2020; Accepted: 25 June 2020; Published: 6 July 2020

#### **Featured Application: A call redistribution method for a call center based on speech emotion recognition is proposed. The research goal is efficiency improvement in emergency call centers based on automatic recognition of more urgent callers.**

**Abstract:** Call center operators communicate with callers in different emotional states (anger, anxiety, fear, stress, joy, etc.). Sometimes a number of calls arriving within a short period of time have to be answered and processed. When all call center operators are busy, the system puts an incoming call on hold, regardless of its urgency. This research aims to improve the functionality of call centers through the recognition of call urgency and the redistribution of calls in a queue. It could be beneficial for call centers providing health care support to elderly people and for emergency call centers. The proposed recognition of call urgency and the consequent call ranking and redistribution are based on emotion recognition in speech, giving greater priority to calls featuring emotions such as fear, anger, and sadness, and lower priority to calls featuring neutral speech and happiness. Experimental results, obtained in a simulated call center, show a significant reduction in waiting time for calls estimated as more urgent, especially calls featuring the emotions of fear and anger.

**Keywords:** emotion recognition; intelligent speech signal processing; affective computing; human computer interaction; supervised learning

#### **1. Introduction**

Spoken language processing combines knowledge from the interdisciplinary area of natural language processing, cognitive sciences, dialogue systems, and information access. Speech Emotion Recognition (SER) and text-to-speech synthesis (TTS), including voice and style conversion, as part of human–machine spoken dialogue systems correspond to certain cognitive aspects underlying the human language processing system [1]. In the last few decades, there has been growing interest in developing human–machine interfaces that are more adaptive and responsive to a user's behavior [2]. In that sense, the use of emotion in speech synthesis and recognition of emotion in speech takes an important place in attempts to improve naturalness of human–machine interaction (HMI) [3]. As to TTS, different applications such as smart environments, virtual assistants, intelligent robots, and call centers have set requirements for different speech styles identified with corresponding emotional expressions [4]. Recognition of emotions in HMI is not restricted to speech analysis only, but also image analysis (facial expression recognition, eye-tracking data) and physiological signals (pulse rate, skin conductance, facial electromyography, electroencephalography (EEG) signal) [5]. Emotion recognition in spoken dialogue systems such as call centers provides a possibility to respond to callers according to the detected emotional state or to pass control over to human operators [2,6–8].

In SER research, two main approaches are utilized for describing the emotional space. The first approach describes the emotional space with a finite number of prototypical emotions, in accordance with a categorical emotion model. The second approach uses dimensions (typically arousal and valence) to determine the possible emotional states in the space defined by the chosen dimensions; this approach corresponds to dimensional emotion models. Dimensional emotion models mostly use two or three dimensions (e.g., valence, arousal, and sometimes dominance) to describe the emotional space, in which the emotional variability is largely determined by the first two dimensions, and they are thus used as a basis for research in the field of SER [9]. The valence dimension describes the pleasantness of an emotion and ranges from positive (e.g., joy) to negative (e.g., anger). The arousal dimension indicates the level of activation during an emotional experience and ranges from passive (e.g., sleepiness) to active (e.g., high excitement). The position of some basic, categorical emotions in the valence–arousal space is shown in Figure 1.

**Figure 1.** The circumplex model of affect in the valence–arousal space. Adapted from [10].

The dimensional models allow using emotional categories (appropriately positioned in a two-dimensional emotional space) among which it is possible to determine a distance metric [10]. Goncalves et al. utilized four dimensions (namely, valence, arousal, sense of control, and ease in achieving a goal) to describe a user's emotional state while interacting with an electronic game [11]. Landowska proposed a procedure to obtain new mappings, with mapping matrices, for estimating the dimensions of a valence-arousal-dominance model from Ekman's six basic emotions [12]; the procedure, as well as the proposed metrics, might be used not only in the evaluation of the mappings between representation models but also in a comparison of emotion recognition and annotation results. Emotion valence classification in the self-assessed affect challenge is reported in [13]. Detecting the degree of a speaker's sleepiness can also help in recognizing his/her emotional arousal [14]. Sometimes both approaches, emotion category and valence-arousal classification, are utilized for comparison, as in the INTERSPEECH Emotion Sub-Challenge on an acted speech corpus [15].

In a situation when all call center operators are busy and unable to answer a new call, the system puts that call on hold regardless of its urgency. By way of illustration, if a call is the fifth in the queue at a given moment, a caller who is terrified, angry, or upset is left to wait for a certain period of time before his/her call is considered; this period equals the time in which all the preceding calls are answered. This classical approach does not take the urgency of a call into consideration: calls are processed in the order in which they are received. Petrushin utilized emotion recognition as part of a decision support system for prioritizing telephone voice messages in a call center and assigning a proper agent to respond to the message [6]. His goal was to recognize two possible states: "agitation", which includes anger, happiness, and fear, and "calm", which includes the normal state and sadness. The average recognition accuracy was in the range of 73–77%.

In this research, the first presumption was that some calls are more urgent and should be processed faster. The second presumption was that the urgency of a call correlates with the caller's emotional state as reflected through speech. The motivation behind this research was to improve the effectiveness of call center services by giving first-level priority to callers experiencing a negative valence emotional state (fear and anger), second-level priority to a sad or neutral emotional state, and third-level priority to a joyful emotional state. The proposed approach consists of recognizing the caller's emotional speech and redistributing the calls according to the proposed emotion ranking. Thus, faster processing and a decrease in waiting time for the callers estimated as more urgent are achieved.

The paper is organized as follows: Section 2 covers related work, including acoustic modeling of emotional speech and the underlying emotional speech corpus, as well as methods for emotion classification. The proposed algorithm for the redistribution of calls is described in detail in Section 3. The simulation and experimental results are reported and discussed in Section 4. Finally, concluding remarks and future research directions are summarized in Section 5.

#### **2. Materials and Methods**

#### *2.1. Emotional Speech Corpus*

The GEES (Corpus of Verbal Expressions of Emotions and Attitudes—in Serbian: *Korpus Govorne Ekspresije Emocija i Stavova*) is the first corpus of acted emotional speech recorded in Serbian [16]. Six actors (3 female, 3 male) were recorded while verbally expressing semantically neutral textual material in five basic emotional states: anger, joy, fear, sadness, and neutral. The underlying textual material included 32 isolated words, 30 short sentences, 30 long sentences, and one passage of 79 words. The corpus was evaluated by human listeners, and the reported recognition accuracy was 94.7% [16]. In this study, the part of the corpus containing short and long sentences was taken into consideration because it better reflects a real conversation scenario; the isolated words and the passage were omitted. Aiming to have each speaker equally represented, 58 recorded utterances from every speaker in each emotion class were used for feature extraction, giving a total of 1740 utterances used in the experiments. It has been pointed out that acted emotions are more exaggerated than real ones [17], and it has been argued that acted emotions have limited applicability to real-life situations. Still, by studying the acoustic features of emotional speech on an acted emotion corpus, one can analyze acoustic variations and gain insight into the acoustic correlates of emotional speech. Those acoustic correlates are, to a great extent, present in emotions occurring in real-life situations and in elicited emotional speech; in that sense, the relationships between the acoustic features and the acted emotions, and between the acoustic features and real-life emotions, do not contradict each other [18]. Using acted emotions in emotional speech recognition is a way to obtain and study generic (perhaps universal) expressions of emotions [19]. Additionally, our research setting was to recognize more intense emotional states, which reflect more urgent callers; such intense vocal emotional expressions are more frequent in acted emotional speech corpora than in natural speech corpora.

#### *2.2. Acoustic Modeling*

The most commonly used acoustic features for SER are prosodic features (pitch, intensity, duration), cepstral features (MFCC), spectral features (formant position and bandwidth), and occasionally voice quality features (harmonic-to-noise ratio, jitter, shimmer), in line with the studies [19–23]. The task of finding a robust feature set has led to the idea of applying statistical functionals to low-level descriptors (LLD), resulting in very large feature vectors containing up to a few thousand prosodic and spectral features [19]. Recently, new trends in machine learning have been directing research on automatic affect recognition towards end-to-end techniques that combine deep, convolutional, and recurrent neural networks trained directly on the underlying raw audio signal [24,25]. A multilevel model based on a combination of LLDs and a convolutional recurrent neural network is proposed in [26]. Still, a lot of research in the area is based on hand-crafted features that have been shown to be robust in many computational paralinguistics tasks such as emotion, autism, accent, addressee, deception, and cognitive and physical load detection [20–22] (the list of INTERSPEECH Paralinguistics Challenge tasks up to 2019 is available at http://www.compare.openaudio.eu/tasks/). Schuller et al. introduced the INTERSPEECH 2013 ComParE feature set [20]. It contains 6373 features including energy, spectral, cepstral (MFCC) and voicing-related LLDs (pitch, voicing probability, jitter, shimmer), as well as a few further LLDs such as logarithmic harmonic-to-noise ratio (HNR), spectral harmonicity, and psychoacoustic spectral sharpness. This set of hand-crafted acoustic features is still considered state-of-the-art [27]. A more minimalistic feature set proposed in [23] includes prosodic, excitation, vocal tract, and spectral descriptors, obtained by applying functionals to 18 LLDs, giving a total of only 52 utterance-level features. Kaya et al. used the ComParE feature set for their proposed cascaded normalization; this approach, combining speaker-level, value-level, and feature-vector-level normalization, has shown superior performance in cross-corpus acoustic emotion recognition on five corpora recorded in five languages [28]. Utterance-level features, obtained through the statistical analysis of prosodic features (pitch, energy), spectral information (formants, spectrum centroid, and spectrum cut-off frequency), and cepstral information (mel-frequency band energies), are extracted to recognize seven basic emotions in Mandarin [29]. While in some studies SER relies on prosodic and voice quality features only [30], and in others on cepstral features only [31], our previous study showed that a combination of spectral and prosodic features has a higher discrimination capability for speech emotion recognition than prosodic or spectral features used separately [32]. Wagner et al. compared and discussed the advantages and usability of hand-crafted and learned representations (an end-to-end system that learns the data representation directly from the raw waveforms) [33]. Their research suggests that hand-crafted features generalize better to unseen data and are also more robust to varying acoustic conditions than purely end-to-end systems.

The proposed approach to acoustic modeling is based on the statistical analysis of acoustic feature contours and is performed in three steps. The openSMILE toolkit [34], used as the official baseline for the series of INTERSPEECH Computational Paralinguistics challenges, is used to extract the acoustic feature set. The first step includes the extraction of short-term pitch, energy, and 12 MFCC values on a frame basis. Additionally, the voicing probability and the zero crossing rate are calculated for every frame. Sequences of these short-term pitch, energy, and MFCC values form feature contours. In the second step, the first derivative of the acoustic features is calculated in order to model the dynamics of the speech parameters. The third step of the feature extraction process involves a statistical analysis of the feature contours. The proposed set of 12 statistical functionals has been chosen from the three groups of functionals that are most frequently used [19]:


Finally, the extracted feature set results in 384 features for each processed utterance: 16 LLD contours (pitch, energy, 12 MFCCs, voicing probability, and zero crossing rate), doubled by their first derivatives and summarized by 12 functionals each (16 × 2 × 12 = 384).
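A minimal sketch of this three-step extraction is given below, using librosa as a stand-in for the openSMILE toolkit actually employed; the particular 12 functionals are an assumption, since their list is not reproduced here, but the dimensional bookkeeping (16 LLDs × 2 × 12 = 384) matches the text:

```python
import numpy as np
import librosa  # stand-in for the openSMILE toolkit used in the paper

def utterance_features(wav_path, sr=16000, frame=400, hop=160):
    """384-dim utterance vector: 16 LLD contours + first derivatives,
    summarized by 12 functionals each (16 * 2 * 12 = 384). The concrete
    set of 12 functionals below is an assumption."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12,
                                n_fft=frame, hop_length=hop)        # 12 LLDs
    energy = librosa.feature.rms(y=y, frame_length=frame, hop_length=hop)
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame,
                                             hop_length=hop)
    f0, _, voiced_prob = librosa.pyin(y, fmin=60, fmax=400, sr=sr,
                                      frame_length=frame, hop_length=hop)
    f0 = np.nan_to_num(f0)                                          # unvoiced -> 0
    n = min(mfcc.shape[1], energy.shape[1], zcr.shape[1], len(f0))
    llds = np.vstack([mfcc[:, :n], energy[:, :n], zcr[:, :n],
                      f0[None, :n], voiced_prob[None, :n]])         # 16 x n
    contours = np.vstack([llds, np.gradient(llds, axis=1)])         # 32 x n
    q = np.percentile
    functionals = [np.mean, np.std, np.min, np.max, np.ptp, np.median,
                   lambda c: q(c, 25), lambda c: q(c, 75),
                   lambda c: q(c, 75) - q(c, 25),
                   lambda c: np.argmax(c) / len(c),
                   lambda c: np.argmin(c) / len(c),
                   lambda c: np.mean(np.abs(np.diff(c)))]           # 12 functionals
    return np.array([f(c) for c in contours for f in functionals])  # 384 values
```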

#### *2.3. Classification Methods*

A recent survey in the field of SER provided an overview of traditional classifiers and deep learning algorithms applied to SER [35]. Among traditional classifiers, it listed Hidden Markov Models (HMM), Gaussian Mixture Models (GMM), Support Vector Machines (SVM), Artificial Neural Networks (ANN), Decision Trees (DT), k-Nearest Neighbor (kNN), k-means, and Naive Bayes classifiers, concluding that there is no generally accepted machine learning algorithm in this field. Recently, the focus of research has shifted towards Deep Neural Networks (DNN), with Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) being the most widely used. For the purpose of speech emotion classification in this study, Linear Discriminant Classifiers (LDC) and kNN are taken into account due to their simplicity, efficiency, and low computational requirements. LDCs and kNN classifiers have been used since the very first studies and have turned out to be quite successful for both acted and spontaneous emotional speech [19]. Zbancioc et al. used a weighted kNN classifier for the classification of four emotions (anger, sadness, joy, and neutral) contained in the SROL emotion corpus utilized in their research [36].

In all our experiments, 10-fold partitioning of the data set was used to estimate the recognition accuracy of a particular classifier. The training and test sets included utterances from all six speakers, so these results belong to speaker-dependent experiments. Although "speaker-independent" experiments (e.g., leave-one-speaker-out) are possible on the GEES with 6 speakers (see, for example, the results reported in [37]), we decided to perform speaker-dependent tests in order to train the classifier with more samples belonging to different speakers. In this way, the acoustic variability present in the feature space is better modelled, providing better prediction ability even when testing with an unknown speaker. Indeed, the accuracies obtained in speaker-independent cross-validation tend to be lower than those obtained in speaker-dependent cross-validation [37], but not significantly so [38].

The first considered classifier is the linear Bayes classifier, under the assumption that classes are modeled by Gaussian densities with equal covariance matrices; maximum likelihood estimates of the Gaussian density parameters are used. With the linear Bayes classifier, the average recognition accuracy achieved in our emotion classification experiments on the GEES corpus was 91.5% [32]. Joy was recognized with an 84.2% and anger with an 88.8% recognition rate. Class recognition rates for fear, neutral, and sadness were 92.5%, 97.1%, and 94.8%, respectively. Table 1 shows the normalized confusion matrix for the linear Bayes classifier applied to the GEES corpus. From Table 1, it can be noted that sadness is misrecognized as the neutral state in 4% of test samples and fear is confused with the neutral state in 3.2%. The neutral state has the highest recognition rate, with misclassification as fear or joy of about 1%. Anger and joy have lower recognition rates due to mutual misclassification: about 11% of anger test samples are recognized as joy, and almost 15% of joy samples are misrecognized as anger.


**Table 1.** Normalized confusion matrix for linear Bayes classifier.
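Since a linear Bayes classifier with Gaussian class densities and a shared covariance matrix coincides with classical linear discriminant analysis, the setup can be sketched as follows (placeholder data; scikit-learn is an assumption, not the authors' toolchain):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Placeholder data standing in for the 1740 x 384 GEES feature matrix;
# labels 0..4 correspond to the five emotion classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(1740, 384))
y = rng.integers(0, 5, size=1740)

# Gaussian class densities with a shared covariance matrix: this is the
# linear Bayes classifier of the text (sklearn's LDA classifier).
clf = LinearDiscriminantAnalysis()
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
print(cross_val_score(clf, X, y, cv=cv).mean())  # ~chance on random data
```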

As the second classifier, the kNN classifier is used: a very intuitive method that classifies unlabeled examples based on their similarity to examples in the training set. It implicitly involves non-parametric density estimation, which leads to a very simple approximation of the Bayes classifier. When employing high-dimensional feature vectors, dimensionality reduction is sometimes applied in order to improve classification results, as in [39], where speaker-penalty graph learning is proposed to penalize the impact of different speakers. Because the recognition accuracy of the kNN classifier is affected by the high dimensionality of the feature set, linear discriminant analysis (LDA) feature reduction was applied to the feature set [40]. In the five-class emotion classification task on the GEES corpus, the kNN classifier achieved an average recognition accuracy of 91.3% after LDA feature reduction and with k = 9. The lowest class recognition rates were obtained for joy (83.6%) and anger (86.8%). For fear, neutral, and sadness, higher class recognition rates were achieved: 93.7%, 95.9%, and 96.3%, respectively. Employing LDA, kNN achieved an average accuracy almost equal to the best result in our experiments (91.5%). In the case of the linear Bayes classifier, there was no improvement after LDA feature reduction, probably due to the good linear separability between classes in the original feature space. With both classification methods in our SER experiments, the lower recognition results obtained for joy and anger may be explained by the tendency, also observed in human perception tests, to confuse anger and joy in the GEES corpus [16].
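Under the same caveat that scikit-learn is only a stand-in, the LDA-reduced kNN classifier described above can be sketched as a two-stage pipeline:

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# LDA projects the 384-dim vectors onto at most n_classes - 1 = 4
# discriminant axes; kNN with k = 9 then votes in that reduced space.
knn_lda = make_pipeline(LinearDiscriminantAnalysis(n_components=4),
                        KNeighborsClassifier(n_neighbors=9))
# knn_lda.fit(X_train, y_train); knn_lda.predict(X_test)
```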

In our earlier study, SER experiments on the same GEES corpus were performed using a multilayer perceptron (MLP) with one hidden layer [41]. The number of neurons in the input layer was equal to the number of extracted features (the same feature vector as described in Section 2.2), and the number of neurons in the output layer was equal to the number of emotion classes (5). The MLP was trained using the standard backpropagation (BP) algorithm with a varying number of neurons in the hidden layer. The highest recognition rate was achieved with 15 neurons in the hidden layer; further increasing the hidden layer size yielded insignificant improvement of the recognition rate at the cost of increased computational complexity and thus longer processing time.
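A minimal configuration in this spirit might look as follows; the solver and learning-rate settings are assumptions, as the text only specifies the topology and the backpropagation training:

```python
from sklearn.neural_network import MLPClassifier

# One hidden layer of 15 neurons, as in the cited experiments; input
# width (384 features) and output width (5 classes) follow from the
# data at fit time. Solver and learning-rate choices are assumptions.
mlp = MLPClassifier(hidden_layer_sizes=(15,), activation="logistic",
                    solver="sgd", learning_rate_init=0.01,
                    max_iter=2000, random_state=0)
# mlp.fit(X_train, y_train)
```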

The average recognition accuracy achieved in the emotion classification experiments with the MLP was 90.4%. Joy was recognized with an 82.5% and anger with an 86.5% recognition rate; for these two emotions, the MLP underperforms the linear Bayes classifier by approximately 2%. Sadness and neutral are the emotions with the highest recognition rates, 97.7% and 93.7%, respectively, while fear is recognized with a 91.7% rate.

It can be noted that the two emotions with the lowest recognition rates, namely joy and anger, have a lower recognition accuracy compared to the experimental results with the linear Bayes classifier. This is an additional reason why we decided to use the linear Bayes classifier in the proposed system, besides it being a fast classification method.

#### *2.4. Comparison of the SER Results with Other Studies*

In general, it is a difficult task to objectively compare the results of one SER study with other results reported in the literature. This is due to the high diversity of research approaches to SER regarding speech emotion corpora, extracted feature sets, classification methods, and additional experimental settings (e.g., speaker-dependent or speaker-independent tests, the cross-validation method applied). Regarding acted speech, two corpora have been used in plenty of research: the Berlin Emotional Speech Database (Emo-DB), containing a total of 535 sentences uttered by 10 actors (5 male, 5 female) in seven emotional states, and the Danish Emotional Speech Database (DES), containing a total of 419 utterances portrayed in five emotional states by 4 actors (2 male, 2 female) [28,37]. In the research by Hasan et al., the proposed 3DEC classification was tested on all three corpora (Emo-DB, DES, GEES), and the best results were achieved on the GEES corpus [37]. This can be explained by the fact that the GEES contains more samples available for training the classifier than the other two corpora, and by the fact that the overall human recognition accuracy reported for the GEES is 94.7%, against 86.1% for Emo-DB and 67.3% for DES. The overall human accuracy reflects the degree of distinction of the acoustic representation of basic emotions in a corpus, which is very high for the GEES corpus.

In our earlier study [42], the comparison of basic emotion classification in valence-arousal space was made on the Emo-DB and the GEES corpora. The mapping of basic emotions into three classes along the valence axis (positive, neutral, and negative), and three classes along the arousal axis (high, neutral, and low) was performed. The recognition results along the arousal axis were above 90% for both corpora. The average recognition results along the valence axis were 83.2% for the GEES and 76.9% for Emo-DB. It is in line with the findings showing that arousal discrimination tasks, based on acoustic features, achieve higher recognition rates than valence discrimination tasks [28].

We consider the GEES, with 1740 utterances portrayed in 5 emotional states by 6 actors, a suitable and adequate basis for SER research. Also, taking into account that Serbian is still an under-resourced language, there are far fewer available emotional speech data and corresponding research for Serbian (GEES is the only emotional speech corpus accessible for research purposes), even in comparison with other Slavic languages such as Russian [43] and Czech [44].

A comparative analysis of our results and the results of other SER studies conducted on the GEES corpus was performed. Because this is a rather small corpus in Serbian, it has not been the subject of much research. Two SER studies on the GEES corpus were found for comparison with our results.

The first study for comparison is by Hasan et al. [37], who proposed a hierarchical classification technique using SVMs for binary emotion classification at every level. For feature extraction, they applied a "brute force" approach and extracted 6552 acoustic features for each utterance. The extracted feature vector included 56 low-level descriptors (among which are pitch, energy, spectral energy, and MFCCs) and 39 statistical functionals applied to these LLDs and their first and second derivatives. In the experiments, three acted databases were used: the Danish Emotional Speech (DES), the Berlin database (Emo-DB), and the Serbian GEES database, along with one spontaneous database (the Aibo corpus). The proposed hierarchical classification, called 3DEC, is based on the input data in such a way that the input data and their confusion plots determine the hierarchy of the classification scheme. They used both speaker-dependent and speaker-independent approaches for SVM-based model training and testing. We present only the results of the speaker-dependent tests so as to be able to make a comparison with our results; for the speaker-dependent test, 10-fold cross-validation over the whole corpus was applied, as in our case.

The average recognition accuracy reported in [37] on the GEES corpus is 94.1%, achieved with the proposed 3DEC combination of SVM classifiers in the speaker-dependent test. Recognition accuracy in their research is obtained as unweighted average accuracy (UA), i.e., accuracy per class averaged over the total number of classes. It should be noted that, in the case of the GEES corpus, UA is identical to weighted average accuracy (WA) due to the equally balanced emotion classes. Comparing our result with that of Hasan et al. [37], our average recognition accuracy is lower by 3%. However, our result is obtained with a significantly smaller feature vector (384 features against 6552 in [37]). Additionally, the classification methods applied are different: in our experiments, the linear Bayes classifier is used as a simple and fast method for the training and test stages, whereas their 3DEC scheme requires training four SVMs. We consider that our proposed SER achieves a slightly lower result compared to the best recognition accuracy reported for the GEES (94.1% in [37]), but with significantly smaller feature vectors and a computationally less demanding classification method.

One more study on the GEES corpus, by Shaukat et al. [45], applied multistage (hierarchical) emotion categorization with SVMs. In their research, they extracted utterance-level vectors of 318 features, among which are pitch, energy, MFCC, formants, and their statistical functionals (e.g., mean, variance, maximum, minimum, etc.). In the experiment on the GEES corpus, they reported an average emotion recognition rate of 90.63%.

Comparing our result with that of Shaukat et al. [45], our average recognition accuracy is higher by 1%. They applied a hierarchical classification technique with 4 SVMs, so training of all 4 SVMs is necessary. It should be noted that the feature vector used in [45] is smaller, but an important difference is that their experiments were performed on individual speaker sub-corpora, and the overall recognition accuracy was calculated as the average of the recognition accuracies obtained for each individual speaker. Our recognition accuracy is evaluated after 10-fold cross-validation on the whole corpus, as in the study of Hasan et al. [37], which we consider a more objective measure of recognition performance.

#### **3. Algorithm for Call Redistribution Based on Speech Emotion Recognition**

As mentioned earlier, one presumption in this research was that the urgency of a call correlates with the caller's emotional state reflected through speech. Our focus was on emergency call centers and health care centers for elderly people. Aiming to recognize the more urgent callers among them, we have proposed a ranking of the five basic emotions.

Basic emotions with negative valence (fear, anger, and sadness) reflect the speaker's distress, and our presumption was that those speakers have a health-related, or any other, more urgent problem. On the other hand, positive valence emotions (e.g., joy) and neutral valence (the neutral state) are supposed to reflect a less urgent speaker state, and those calls can be processed later.

The proposed ranking of the five basic emotions is:

1. fear,
2. anger,
3. sadness,
4. neutral state,
5. joy.
In the proposed ranking, fear is put first because it is the emotion people experience when facing a serious problem (serious injuries, heart attack, accidents, etc.). In research conducted on the CEMO corpus, which contains dialogues recorded in a real-world medical call center, it was pointed out that patients often expressed stress, pain, fear of being sick, or even real panic [8]; fear is the most common emotion in the CEMO corpus, with different levels of intensity and many variations [7]. Anger is the second negative, high-arousal emotion, expressed in various stressful and disturbing situations. Sadness is in third place: it is an emotion with negative valence that is typical of elderly and lonely people. Holmen et al. reported that experiencing loneliness had a negative influence on mood, and that loneliness and a sad mood prevailed especially among elderly subjects with cognitive difficulties [46]. Joy is in last place because it is considered to reflect full satisfaction and a good mood, which are not indicators of urgent states.

The research setting is explained using an example of five calls received at the same moment, while all operators are busy. For each call, the initial part of the caller's speech is recorded. This recording is further processed and the feature vector $x_i$ is extracted. The feature vector is forwarded to a classifier which assigns one of the five emotion labels (anger, joy, fear, sadness, or neutral) to the input speech. Finally, after SER, those five calls are redistributed according to the recognized emotions and the proposed emotion ranking. The proposed framework of call processing is shown in Figure 2. For example, in the scenario shown in Figure 2, the original call order was neutral, joyful, sad, afraid, and angry; after SER and the proposed call redistribution, the system will first process the call featuring fear, then the call featuring anger, then the sad caller, then the neutral one, and last the call featuring joy.

The proposed algorithm, whose block diagram is shown in Figure 3, has the following steps:

*1.* The initial part of the caller's speech is recorded and the feature vector $x_i$ is extracted.

*2.* The feature vector is forwarded to the SER classifier, which assigns one of the five emotion labels (anger, joy, fear, sadness, or neutral) to the input speech.

*3.* The recognized emotions are ranked according to the priority vector:
$$p = \begin{bmatrix} p_1 = f & p_2 = a & p_3 = s & p_4 = n & p_5 = j \end{bmatrix}^T \tag{1}$$

where *f* represents fear, i.e., it denotes the speaker recognized as being in a state of fear, *a* denotes the speaker recognized as angry, *s* marks the speaker recognized as sad, *n* refers to neutral, and *j* to a joyful state of the speaker. The introduced priority vector, i.e., emotion ranking, represented in Equation (1), is proposed considering application in emergency call centers and health care centers for elderly people. It should be noted that the proposed algorithm is not restricted to the aforementioned priority vector only. Regarding a specific domain of application, a new emotion ranking can be adopted.

*4.* Calls are processed in the new order which is obtained after their emotion labeling based on SER (and ASR) and redistribution according to the proposed emotion ranking, i.e., the priority vector.
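A minimal sketch of the redistribution step (with hypothetical `Call` and `redistribute` names) is a stable sort by the priority vector of Equation (1), which also preserves arrival order within each emotion class:

```python
from dataclasses import dataclass

# Priority vector p from Equation (1): lower rank means answered sooner.
PRIORITY = {"fear": 0, "anger": 1, "sadness": 2, "neutral": 3, "joy": 4}

@dataclass
class Call:
    call_id: int
    emotion: str        # label assigned by the SER classifier
    arrival_order: int  # position in the original queue

def redistribute(calls):
    """Reorder a group of calls by emotion priority; the sort is stable,
    so calls with the same emotion keep their original arrival order."""
    return sorted(calls, key=lambda c: (PRIORITY[c.emotion], c.arrival_order))

# Example from Figure 2: neutral, joy, sadness, fear, anger arrive in
# that order; after redistribution: fear, anger, sadness, neutral, joy.
queue = [Call(i, e, i) for i, e in
         enumerate(["neutral", "joy", "sadness", "fear", "anger"])]
print([c.emotion for c in redistribute(queue)])
```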


Firstly, all callers recognized as afraid are processed, then angry callers, and so on; finally, the callers recognized as joyful are processed. The goal of the redistribution is a reduction in waiting time for the callers recognized as priorities. Let $t_{1i}$ denote the waiting time of a caller *i* without SER and call redistribution, where *i* = 1, . . . , *C* and *C* is the number of calls received at the same moment. Then, $t_{2i}$ denotes the waiting time of a caller *i* after SER and call redistribution (after application of the proposed algorithm). The objective function is:

$$\max \sum_{i=1}^{C} \left( t_{1i} - t_{2i} \right) \tag{2}$$

subject to the priority vector *p*. The objective function is formulated so as to maximize the waiting time reduction for the callers recognized as priorities according to the priority vector *p*. Thus, the goal of the call redistribution is to maximize the waiting time reduction for caller *i* if caller *i* is given priority by the vector *p*. In our experiments, this is the case for callers recognized as being afraid, since fear is in first place in the priority vector *p*. Afterwards, the objective function maximizes the waiting time reduction for callers recognized as being angry, since anger is in second position in the priority vector *p*.

**Figure 2.** Proposed call redistribution based on SER.


**Figure 3.** Block diagram of the proposed algorithm.

The call processing and the proposed algorithm are intended to be part of a client-server application based on computer-telephone integration (CTI). The main part of the application runs on the server side, located on a remote computer, while a client side is located at the call center. When a call is received and there is at least one free operator, the call is answered immediately. If all operators are busy at that moment, the client side initiates a connection with the server, which is waiting for clients. After the connection is established, a new session is started and the client sends a recorded speech sample of the call. On the server side, the feature vector is extracted from the received speech record and forwarded to the SER module, which classifies it into one of the predefined emotion categories. If a new call is received at the call center within a period of 30 s, the client sends the recorded speech sample of that call and steps 1 and 2 of the algorithm are performed again. The established session lasts as long as new calls arrive within the 30 s time frame, which is chosen as the overlapping time between two consecutive calls. When there are no more calls within the specified period, all calls processed during the current session of client-server communication are redistributed according to the proposed emotion ranking. The call redistribution is intended to be applied to a finite number of calls received within a short period of time while all operators are busy; in the experiments, situations with three, five, and seven simultaneously received calls were considered. Let us denote them as a group of calls. For example, when a call center simultaneously receives seven calls, those seven calls are redistributed according to the SER system output and processed in the new order; if a new call arrives while operators are answering those seven calls, it is put into a new group of calls to be redistributed. The proposed emotion ranking can be specified after the connection is established, so that the server adapts the system response to the specific type of call center (client). At the end of the session, the server sends the client the list of redistributed calls, which are then processed in the redistributed order.
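The 30 s session grouping described above can be sketched as follows (hypothetical names; a toy outline of the protocol rather than the actual CTI implementation):

```python
from dataclasses import dataclass

OVERLAP_WINDOW_S = 30.0  # max gap between consecutive calls in a session

@dataclass
class IncomingCall:
    call_id: int
    arrival_time: float  # seconds
    emotion: str         # label returned by the SER module

def group_into_sessions(calls):
    """Split chronologically ordered calls into sessions: a new session
    starts whenever the gap to the previous call exceeds 30 s."""
    sessions, current = [], []
    for call in calls:
        if current and call.arrival_time - current[-1].arrival_time > OVERLAP_WINDOW_S:
            sessions.append(current)
            current = []
        current.append(call)
    if current:
        sessions.append(current)
    return sessions
```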

#### **4. Simulation and Experimental Results**

The research was designed as a set of experiments in a simulated call center receiving different numbers of calls simultaneously, i.e., within a short period of time while all operators are busy. The experiments focused on: (i) the redistribution of calls based on the emotion label assigned by the speech emotion recognition task, and (ii) the evaluation of the time period during which a call was put on hold, before and after speech emotion recognition was applied for call redistribution. Across the experiments, the number of simultaneously received calls varied from 3 to 7. In all experiments, the prosodic and spectral feature set was used, and the linear Bayes classifier and kNN were considered as classification techniques, as described in Section 2.

The average waiting time, without and with redistribution, for each emotional state is evaluated as the average over 50 experimental iterations in the simulated call center with one ideal active human operator (human factors are not considered). This procedure is repeated for each experimental setting (3, 5, and 7 simultaneously received calls). The waiting time reduction estimate is made under assumptions about the underlying distribution of emotions in input calls and the distribution of call durations: we assumed that all emotions were uniformly distributed and that call durations were uniformly distributed across the chosen range (from 30 s to 3 min 50 s). The specified range was chosen under the assumption that it is wide enough to cover shorter, medium, and longer phone calls. The evaluated waiting time after call redistribution would be proportionally shorter for every caller as the number of active operators in the call center increases.

A pseudo-random number generator is used to generate the emotion labels (a random choice of emotion for each input call) as well as the input call durations. The order of the calls in the queue (the order of call arrival) has, both in the simulation and in a real-world call center, the biggest influence on the estimated waiting time a caller could spend in the queue. In our simulations, the order of arrival of calls featuring specific emotions is also unknown and thus determined by a generated pseudo-random number. Hence, in every iteration of the simulation, the random number of occurrences of each emotion class, the associated random call durations, and the random order of calls (emotions) in the queue jointly influence the variation of the estimated average waiting time before and after call redistribution. Additionally, the recognition rate of each emotional state influences the average waiting time after call redistribution.

The simulation of call redistribution in a call center is explained using an experimental example with three simultaneously received calls; each call is represented by one utterance from the GEES corpus. Firstly, a vector of randomized emotion labels for the three input calls is generated. According to this input emotion label vector, three utterances belonging to the chosen emotion classes are randomly selected from the corpus (with respect to the speaker) and provided as input to the SER. As an initial part of the simulation, a call duration, generated as a random value between 30 s and 3 min 50 s, is appended to each of these utterances. Knowing the initial order of the simulated calls (determined by the input emotion label vector), the initial waiting time in the queue is calculated for each caller as the sum of the call durations of all preceding callers in the queue; thus, the initial waiting time for every emotion class is evaluated. Secondly, based on the classifier output, each input utterance receives one of the five emotion labels, yielding the output emotion label vector. Given the output emotion labels, the calls are redistributed according to the priority vector. A new waiting time is then calculated for each caller based on the new position in the redistributed queue and, accordingly, the new waiting time for every emotion class is evaluated.
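As a minimal, assumption-laden sketch of one such iteration (in Python, with hypothetical names and SER treated as perfect), the waiting-time bookkeeping can be condensed as follows:

```python
import random

PRIORITY = {"fear": 0, "anger": 1, "sadness": 2, "neutral": 3, "joy": 4}

def simulate_iteration(n_calls, rng):
    """One iteration: uniformly random emotions and call durations in
    [30 s, 230 s]. SER is treated as perfect here (output label equals
    input label), which slightly flatters confusable classes such as
    anger and joy."""
    calls = [(rng.choice(list(PRIORITY)), rng.uniform(30.0, 230.0))
             for _ in range(n_calls)]

    def waiting_times(queue):
        waits, elapsed = {}, 0.0
        for emotion, duration in queue:
            waits.setdefault(emotion, []).append(elapsed)
            elapsed += duration
        return waits

    before = waiting_times(calls)  # original arrival order
    after = waiting_times(sorted(calls, key=lambda c: PRIORITY[c[0]]))
    return before, after

before, after = simulate_iteration(5, random.Random(0))
```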

Table 2 shows the average waiting time a caller will spend if his/her call is among three calls received at the same moment while all operators are busy, before and after the application of SER and call redistribution. There is a significant waiting time reduction for callers recognized as being in a state of fear: initially, they waited about 2 min 40 s, and after SER and the proposed call redistribution they had to wait only 8 s. For an angry caller, there is also a noticeable waiting time reduction: from an initial 2 min 19 s to only about 1 min after redistribution. For a sad caller, the saving is modest, about 15 s: the initial waiting time of 2 min is reduced to 1 min 45 s. Regarding the neutral and joyful emotional states, waiting time increases after SER and call redistribution: by about 1 min for a neutral caller and about 2 min for a joyful caller. This increase was expected, as the redistribution always places callers with these emotions at the end of the queue.

**Table 2.** Average waiting time when 3 calls are received simultaneously while all operators are busy.


The average waiting time a caller will spend if his/her call is among five calls received within a short period of time while all operators are busy, before and after the application of the proposed algorithm, is shown in Table 3. For fear, first in the emotion ranking, there is the biggest decrease in waiting time: from 4 min 17 s to 25 s after SER and redistribution. There is also a significant decrease for angry callers: from 4 min 36 s to 1 min 57 s.


**Table 3.** Average waiting time when 5 calls are received simultaneously while all operators are busy.

Unlike the experiment with three simultaneous calls, in the experiment with five calls, callers recognized as sad achieved a waiting time nearly 1 min shorter after SER and redistribution. In the case of a neutral state, the waiting time increased by about 1 min; for callers recognized as joyful, the increase is larger, amounting to about 4 min.

Table 4 shows the average waiting time a caller will spend if his/her call is among 7 calls received simultaneously, i.e., within a short period of time while all operators are busy. As in the two previous experimental settings, the three emotions ranked as priorities (fear, anger, and sadness) show a significant decrease in waiting time. Calls featuring fear have the biggest waiting time reduction, amounting to about 5 min 40 s; calls featuring anger achieve a 2 min 20 s reduction; and for a sad caller, the decrease is about 1 min. Neutral and joyful callers, by contrast, see their waiting time increase by 2 min 37 s and 5 min 30 s, respectively.



**Table 4.** Average waiting time when 7 calls are received simultaneously while all operators are busy.

The comparative results of the average waiting time in all three experimental settings (3, 5, and 7 simultaneously received calls), for callers in all five emotional states, are shown in Figure 4. As callers in a state of fear have the highest priority, their average waiting time is significantly reduced in all settings: up to twenty times in the case of three simultaneously received calls, ten times in the case of five, and six times in the case of seven. Angry callers are given the second priority in redistribution, so a decrease in their average waiting time is achieved in all experiments: with three and five simultaneous calls, the waiting time after redistribution drops to less than half of the initial value, and with seven simultaneous calls it is reduced by one third.

**Figure 4.** Average waiting time for five emotional states, without and after application of the proposed call redistribution.

From the experimental results shown in Figure 4, it can be noticed that sad callers obtain a moderate decrease in waiting time after the proposed call redistribution. The absolute waiting time reduction is biggest in the case of seven simultaneously received calls, but the relative reduction is biggest in the case of five calls, where it amounts to about 18% of the initial waiting time.

As can be observed from Figure 4, callers in a neutral state have an increased waiting time after call redistribution: about 1 min in the case of three and five simultaneously received calls, and about 2 min in the case of seven. Joy is marked as the emotion with the lowest priority, which is why callers featuring joy are placed at the end of the queue. This causes a significant increase in waiting time for callers in a state of joy, roughly doubling their waiting time after the proposed call redistribution in all experimental settings.

Table 5 shows the waiting time reduction for the five emotional states after SER and the proposed call redistribution are applied, in all experimental settings (3, 5, and 7 simultaneously received calls) in the simulated call center. The time reduction is calculated as the difference between the average waiting time without call redistribution and the average waiting time after application of SER and the proposed call redistribution:

$$\Delta t_e = \bar{t}_{1e} - \bar{t}_{2e} \tag{3}$$

where $\bar{t}_{1e}$ denotes the average waiting time for a caller in emotional state *e* without SER and call redistribution, *e* denotes one of the five emotional states (fear, anger, sadness, neutral, and joy), and $\bar{t}_{2e}$ denotes the average waiting time for a caller in emotional state *e* after application of SER and call redistribution.

**Table 5.** Waiting time reduction after the proposed call redistribution is applied. Time is expressed in [min]:[s].


The positive values of waiting time reduction in Table 5 indicate a real reduction in waiting time after call redistribution, which is the case for callers recognized as being in a state of fear, anger, or sadness. Negative values indicate that the waiting time after call redistribution is actually increased, which is the case for callers recognized as being in a neutral or joyful state. From the results presented in Table 5, it can be observed that, as the number of simultaneously received calls grows, the calls featuring the three recognized emotions considered indicators of a more urgent caller state, namely fear, anger, and sadness, tend to have a decreased waiting time after the proposed call redistribution. Conversely, calls featuring recognized neutral speech and joy tend to have an increased waiting time as the number of simultaneously received calls grows, but this is considered justified as long as more urgent calls are processed instead of less urgent ones.

To examine the results with a larger number of iterations, the simulations were repeated using 200, 500, and 1000 iterations in all three experimental settings (3, 5, and 7 simultaneously received calls); the results for each setting are presented in Tables 6–8, respectively. Regarding the initial average waiting time, even with 1000 iterations there are differences across the five emotional states due to the combination of the random order of emotions in the queue and the random duration of each call. Similar to the experiments with 50 iterations, after application of SER and call redistribution, calls featuring fear and anger achieve a significant reduction in waiting time. Unlike the simulation with 50 iterations, calls featuring sadness achieve in some cases a slight increase and in other cases a slight decrease in waiting time after call redistribution. This can be explained by the fact that sad callers are placed in the middle of the priority ranking, so with an increased number of iterations their waiting time is expected to remain close to the initial average value. As can be observed from Tables 6–8, callers recognized as being in neutral and joyful states have an increased waiting time, similar to the results obtained in the simulation with 50 iterations.

Experimental results show a decrease in the waiting time of the prioritized emotions. There is a minor probability of misrecognizing anger as joy (both are characterized by high arousal but opposite valence poles) and thus placing such a caller at the end of the queue, but the possible negative effect depends on the position of such a call in the original queue and on the emotional states of the other callers in it. Overall, the experimental results show a substantial decrease in the waiting time of the prioritized negative valence emotions.

In real-world emergency call centers, it is unlikely that all emotions are equally distributed, as was the case in our simulation experiments. It is more likely to receive more calls featuring fear and fewer calls featuring joy, as reported for the CEMO corpus recorded in a real-world medical call center [7]. Although the results of the proposed SER might be somewhat lower in a real-world emergency call center, we consider, based on the high recognition accuracy for fear, sadness, and neutral, that the proposed approach to SER and the call redistribution based on it would improve the effectiveness of such call center services.


**Table 6.** Average waiting time when 3 calls are received simultaneously while all operators are busy.

**Table 7.** Average waiting time when 5 calls are received simultaneously while all operators are busy.

**Table 8.** Average waiting time when 7 calls are received simultaneously while all operators are busy.

#### **5. Conclusions**

The presented research addresses a problem occurring in emergency call centers when several calls arrive within a short period of time while all operators are busy. The proposed solution takes the caller's emotional state into account by recognizing emotion in speech and giving priority to callers with negative valence emotions (fear, anger, and sadness). The research aims to improve the efficiency of emergency call centers through the recognition of more urgent callers. Utilizing the proposed emotion ranking and call redistribution, a significant reduction in waiting time is achieved for callers recognized as being in a state of fear. A noticeable waiting time reduction is also achieved for callers recognized as angry, and a slight reduction for callers recognized as sad. On the other hand, the algorithm places neutral and joyful callers at the end of the call queue, so those callers will have an increased waiting time. This is the price to be paid, and it has been considered that less urgent callers are more capable of bearing a longer wait.

Additionally, the waiting time for the most urgent calls can be shortened by giving the signal to operators who process lower priority calls that there is an emergency call on hold. Depending on the dialogue strategy in a call center, the current call will be ended faster or put on hold, so that an emergency call would be received immediately.

Although there are evident differences between an emotional speech corpus recorded in a real call center and an acted emotional speech corpus recorded under controlled conditions, the experimental results in the simulated call center give a promising sign that the proposed approach to SER and the call redistribution based on it would improve the effectiveness of a real call center service. The proposed algorithm is a basis for detecting critical users in the specific types of call centers considered in this research.

Other SER techniques can be used instead of the proposed one, with similar results regarding the improvement of call center effectiveness. The proposed SER based on hand-crafted features (such as those extracted with the openSMILE toolkit) could be faster and more robust in real conditions than a DNN- or end-to-end-based SER system, particularly given the rather small GEES corpus, i.e., the only one available in Serbian that was suitable for the presented research. Due to the lack of available data, a DNN- or end-to-end-based SER system for Serbian could not be trained well, and there would be a high risk of model over-fitting: the only emotional speech corpus for under-resourced Serbian (GEES) contains just about 1800 utterances, which is definitely not enough for state-of-the-art NN-based approaches.

Further research should consider "in the wild" recordings from real-world call centers (emergency call centers or health care centers for elderly people), so that the proposed approach can be tested on realistic data and its efficiency verified. Further research may also be directed toward combining paralinguistic and linguistic information. Recordings of the initial part of a call (1–2 sentences with a duration of 5–8 s) in a human–machine dialogue can be used as input not only to SER but also to ASR; after ASR, recognized keywords can serve as an additional indicator of certain emotional states and thus priorities. This could increase the reliability of the emotion estimation and the utility of the proposed algorithm, even in the case of lower arousal, i.e., more passive levels of emotion activation. Of course, a possible fusion of SER and ASR depends on the dialogue strategy, and on the language and vocabulary expected in particular human–machine interactions.

**Author Contributions:** Conceptualization, M.B. and V.D.; methodology, M.B., V.D. and A.K.; formal analysis, M.B.; investigation, M.B.; writing—original draft preparation, M.B.; writing—review and editing, V.D. and A.K.; visualization, M.B.; supervision, V.D. and A.K. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work has resulted from cooperation between researchers from two institutions at the project HARMONIC (ERA.Net RUS Plus, 2017-2021) related in part to human–machine interaction, as well as supported by the Russian Science Foundation project #18-11-00145 (Section 2.2).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

