# **Human Computer Interaction for Intelligent Systems**

Edited by Matúš Pleva, Yuan-Fu Liao and Patrick Bours. Printed Edition of the Special Issue Published in *Electronics*

www.mdpi.com/journal/electronics

## **Human Computer Interaction for Intelligent Systems**


Editors

**Matúš Pleva, Yuan-Fu Liao, Patrick Bours**

MDPI Basel Beijing Wuhan Barcelona Belgrade Manchester Tokyo Cluj Tianjin

*Editors* Matúš Pleva, Department of Electronics and Multimedia Communications, Faculty of Electrical Engineering and Informatics, Technical University of Košice, Košice, Slovakia

Yuan-Fu Liao, Artificial Intelligence Innovation, Industry Academia Innovation School, National Yang Ming Chiao Tung University, Hsinchu, Taiwan

Patrick Bours, Department of Information Security and Communication Technology, Faculty of Information Technology and Electrical Engineering, Norwegian University of Science and Technology, Gjøvik, Norway

*Editorial Office* MDPI St. Alban-Anlage 66 4052 Basel, Switzerland

This is a reprint of articles from the Special Issue published online in the open access journal *Electronics* (ISSN 2079-9292) (available at: www.mdpi.com/journal/electronics/special_issues/hci_systems).

For citation purposes, cite each article independently as indicated on the article page online and as indicated below:

LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. *Journal Name* **Year**, *Volume Number*, Page Range.

**ISBN 978-3-0365-6577-4 (Hbk) ISBN 978-3-0365-6576-7 (PDF)**

© 2023 by the authors. Articles in this book are Open Access and distributed under the Creative Commons Attribution (CC BY) license, which allows users to download, copy and build upon published articles, as long as the author and publisher are properly credited, which ensures maximum dissemination and a wider impact of our publications.

The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons license CC BY-NC-ND.



## **About the Editors**

#### **Matúš Pleva**

Matúš Pleva received his Ph.D. in Telecommunications from the Department of Electronics and Multimedia Communications of the Faculty of Electrical Engineering and Informatics at the Technical University of Košice (2010). He works as an associate professor in the field of informatics and is Head of the Department. His research interests are acoustic modeling, acoustic event detection, speaker recognition, speech processing, human–machine interaction, embedded systems and parallel computing, security and biometrics, computer networking, IoT, etc. He is currently leading the Slovak part of the NITRO Clubs EU project. He recently completed the bilateral project "Deep Learning for Advanced Speech Enabled Applications" with the National Taipei University of Technology and the project "Content innovation and lecture textbooks for Biometric Safety Systems" as principal investigator. He has participated in more than 50 national and international projects and COST actions. He is a member of ACM (Association for Computing Machinery) and HiPEAC (High Performance Embedded Architecture and Compilation). He is a member of the COST actions on Multi-modal Imaging of Forensic Science Evidence tools for Forensic Science and on Wearable Robots for Augmentation, Assistance, or Substitution of Human Motor Functions. He was an MC member of the IC1106 COST action "Integrating Biometrics and Forensics for the Digital Age". He recently started a bilateral collaboration between TUKE and CAVS, MSU, USA, whose first demo output concerned robotics and HCI; this collaboration resulted in bilateral Erasmus, MoU, and MoA agreements. He has published over 150 technical papers in journals and conference proceedings, with over 1000 citations to date.

#### **Yuan-Fu Liao**

Yuan-Fu Liao received his B.S., M.S., and Ph.D. degrees from the Department of Communication Engineering, National Chiao Tung University (NCTU), Hsinchu, Taiwan, in 1991, 1993, and 1998, respectively. From January 1999 to June 1999, he was a Postdoctoral Researcher at the Department of Communication Engineering, National Chiao Tung University. From September 1999 to February 2002, he was a Research Engineer at Philips Research East Asia, Taiwan. From February 2002 to July 2022, he was with the Department of Electronic Engineering, National Taipei University of Technology, Taipei, Taiwan. In August 2022, he joined the Institute of Artificial Intelligence Innovation, Industry Academia Innovation School, National Yang Ming Chiao Tung University, Hsinchu, Taiwan, where he is currently a full Professor. His major research interests are Speech Signal Processing (Speech, Speaker, Language and Emotion Recognition, Speech Synthesis), Audio Signal Processing (Speech Enhancement, Microphone Array), Natural Language Processing (Machine Translation), and Machine Learning (Deep Learning, Deep Neural Networks).

#### **Patrick Bours**

Patrick Bours studied Discrete Mathematics at Eindhoven University of Technology in the Netherlands (M.Sc. 1990, Ph.D. 1994). He worked for 10 years at the Netherlands National Communication Security Agency (NLNCSA) as a senior policy member in the area of crypto, with a focus on public key crypto and random number generation. In July 2005 he started as a PostDoc at the Norwegian Information Security Laboratory (NISlab) at Gjøvik University College in the project "Authentication in a Health Service Environment". As of July 2008 he was appointed Associate Professor at NISlab and specialized in authentication, and more specifically biometrics. His main research interest is in behavioural biometrics, with many papers on Gait Recognition (recognizing a person by his/her walking style), Keystroke Dynamics (recognizing a person by his/her typing style), and Continuous Authentication (making sure that the person using a device is the same as the person who logged on to that device, i.e., detecting a change of user). Since September 2012 he has held a full professor position at NISlab. He was also head of NISlab from July 2009 to June 2012. He is currently working on the "Chatroom Security" project, where the goal is to detect people with fake profiles and child predators based on their typing and stylometric behaviour. Additionally, he is working on the detection of contract cheating, a form of academic dishonesty among students.

## *Editorial* **Human–Computer Interaction for Intelligent Systems**

**Matúš Pleva <sup>1,\*</sup>, Yuan-Fu Liao <sup>2</sup> and Patrick Bours <sup>3</sup>**


#### **1. Introduction**

The further development of human–computer interaction applications is still in great demand, as users expect more natural interactions. For example, speech communication in many languages is expected as a basic feature of intelligent systems, such as robotic systems, autonomous vehicles, or virtual assistants. For this Special Issue, we invited submissions from researchers addressing the unique opportunities and challenges associated with human–computer interaction with intelligent systems. We encouraged authors to submit reports describing systems built for different languages and multilingual systems. We also invited submissions from researchers studying the linguistic, emotional, prosodic, and dialogue aspects of speech communication. We also welcomed work on other input and output modalities, including multimodal systems, fusion/fission algorithms, and deep learning methods. We encouraged the authors to report state-of-the-art results in detail and to provide useful reviews and the data used to build such systems, to support development in those areas. The rapidly growing domain of virtual reality applications is of interest both as an application domain in which new interfaces and interaction methods are needed and as a potential testbed for evaluating speech and other interface modalities.

#### **2. Short Presentation of the Papers**

Every high-quality research effort starts with a thorough state-of-the-art review. We are proud to present excellent reviews in our collection on speech emotion recognition [1], automatic spelling correction [2], and the use of art in virtual reality [3].

#### *2.1. Review Papers*

Lieskovská et al. [1] presented a review of recent developments in speech emotion recognition and also examined the impact of various attention mechanisms on speech emotion recognition performance. An overall comparison of the systems was performed on the widely used IEMOCAP [4] benchmark database.

Hládek et al. [2] created a survey of automatic spelling correction algorithms. It builds on the previous work by Kukich [5] from 1992 and covers almost 20 years of research conducted since that paper. The article proposes a theoretical framework and gives an overview of the approaches, benchmarks, and evaluation methods. As the first comprehensive survey on this topic after a long period, it offers valuable insight for researchers.

Aldridge and Bethel [3] conducted an assessment of how art is being used in virtual reality (VR) and investigated the feasibility of brain injury patients participating in virtual art therapy. Studies included in this review highlight the importance of the artistic subject matter, sensory stimulation, and measurable performance outcomes for assessing the effect art therapy has on motor impairment in VR.

**Citation:** Pleva, M.; Liao, Y.-F.; Bours, P. Human–Computer Interaction for Intelligent Systems. *Electronics* **2023**, *12*, 161. https://doi.org/10.3390/electronics12010161

Received: 26 December 2022 Accepted: 27 December 2022 Published: 29 December 2022

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### *2.2. Research Papers*

Stark et al. [6] described the design and implementation of a new method for the control and monitoring of mechatronic systems connected to the IoT network, using a selected segment of extended reality to create an innovative form of human–machine interaction. In the proposed solution, modern detection and recognition methods for 3D objects in augmented reality are used instead of conventional methods of control and monitoring of mechatronic IoT systems based on scanning QR codes.

Machová et al. [7] focus on increasing the effectiveness of lexicon-based sentiment analysis. Within the research, two lexicons were built: the first was a big, domain-dependent lexicon created by translating and merging several existing dictionaries, and the second was a small, domain-independent lexicon, since it contained only words with the same meaning in different domains. These lexicons were labeled by assigning a degree of polarity to each word using Particle Swarm Optimization methods. The article also contains the results of experiments with the distribution of polarity values for different labeling techniques. The created lexicons were used and evaluated in a new approach to sentiment analysis. When the lexicon does not contain the words used in an analyzed text, lexicon-based sentiment analysis by itself fails; for such cases, it was supplemented with a machine learning model for sentiment analysis. This hybrid approach achieved very good results.

The paper by Szabóová et al. [8] comes from the field of the analysis of emotions in text obtained from a dialogue between a human and a robot, and thus combines sentiment analysis with HRI (human–robot interaction). Information about the emotional state of the person the robot is interacting with can help the robot choose the most appropriate response. Both a lexicon-based approach and machine learning were used for emotion recognition (Naïve Bayes (multinomial, Bernoulli, and Gaussian), Support Vector Machine, and a feed-forward neural network, using various data representations such as Bag-of-Words, TF-IDF, and sentence embeddings (ConceptNet Numberbatch)). The result of the experiments was an ensemble classifier consisting of the nine best models for each emotion. The model was demonstrated in four different scenarios with the humanoid robot NAO. The results indicated that the best scenario for human acceptance is the one with emotional classification accompanied by emotional movements of the robot.

Shao et al. [9] presented the classification of a dual-arm robot operator's mental workload using the heart rate variability (HRV) signal. An average classification accuracy of 98.77% was obtained using the K-Nearest Neighbor (KNN) method.

Ondáš et al. [10] introduced a novel pediatric audiometry application for hearing detection in the home environment. Conditioned play audiometry principles were adopted to create a speech audiometry application, where children help the virtual robot Thomas assign words to pictures. Several game scenarios together with the setting condition issues were created, tested, and discussed.

Agarwal et al. [11] focus on designing a grammar detection system that understands both the structural and contextual information of sentences in order to validate whether English sentences are grammatically correct. The paper proposes a new Lex-Pos sequencing approach that captures both the linguistic and syntactic information of a sentence. A Long Short-Term Memory (LSTM) neural network architecture was employed to build the grammar classifier. The study conducts nine experiments to validate the strength of the Lex-Pos sequences. The results showed that the Lex-Pos-based models give more accurate predictions and are more stable.

Trnka et al. [12] described a system for predicting the values of Activation and Valence (AV) directly from the sound of emotional speech utterances, without the use of their semantic content or any other additional information. The system uses X-vectors to represent the sound characteristics of the utterance and a Support Vector Regressor for the estimation of the AV values. The aim of the work was to test whether, on each unseen database, the predicted values of Valence and Activation place emotion-tagged utterances in the AV space in accordance with expectations based on Russell's circumplex model of affective space.

Gondi and Pratap [13] from Facebook AI Research presented an innovative performance evaluation of offline speech recognition on a Raspberry Pi CPU compared to a Jetson Nano GPU. It was shown that, after PyTorch mobile optimization and quantization, the models can achieve real-time inference on the Raspberry Pi CPU with a small degradation of the word error rate. On the other hand, the Jetson Nano GPU offers three to five times lower inference latency than the Raspberry Pi.
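
For readers interested in how this kind of edge deployment is typically prepared, the sketch below shows the generic PyTorch recipe of dynamic quantization followed by scripting and mobile optimization; the tiny placeholder model, layer sizes, and file name are illustrative assumptions, not the authors' actual model or code.

```python
import torch
import torch.nn as nn
from torch.utils.mobile_optimizer import optimize_for_mobile

# Placeholder network standing in for a trained acoustic model
model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 29)).eval()

# Dynamic int8 quantization of the linear layers: smaller model and faster
# CPU inference, typically at the cost of a small word-error-rate degradation
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# Script and optimize the model for mobile/embedded CPUs (e.g., Raspberry Pi)
scripted = torch.jit.script(quantized)
optimized = optimize_for_mobile(scripted)
optimized.save("asr_quantized.pt")
```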

Seo and Kim [14] presented self-attentive multi-layer aggregation with feature recalibration and deep length normalization for a text-independent speaker verification system. As a baseline, a ResNet with scaled channel width and layer depth was used to reduce the number of model parameters. A self-attention mechanism was applied to perform multi-layer aggregation with dropout regularization and batch normalization. Further, deep length normalization was applied to the recalibrated features during training. Experimental results on the VoxCeleb1 [15] evaluation dataset showed that the performance of the proposed methods was comparable to that of state-of-the-art models.

Bačíková et al. [16] used the term domain usability (DU) to describe the aspects of the user interface related to terminology and the domain. A new method called ADUE (Automatic Domain Usability Evaluation) for the automated evaluation of selected DU properties on existing user interfaces was introduced. The authors executed ADUE on several real-world Java applications and reported their findings.

Lin et al. [17] developed posting recommendation systems (RSs) to support users in composing reasonable posts and receiving effective answers. The posting RSs were evaluated in a user study with 27 participants and three tasks to examine whether users engaged more in the question-generation process. The results show that the proposed mechanism enables the production of question posts that are easier to understand, which leads experts to devote more attention to answering them.

Jinsakul et al. [18] presented an innovative approach to improving Thailand's government systems to include handicraft products with a 3D display option for smartphones. An evaluation with 1775 participants in this study showed that the proposed 3D handicraft product application attracted users' attention.

**Author Contributions:** Conceptualization, M.P.; methodology, M.P. and Y.-F.L.; writing—original draft preparation, M.P.; writing—review and editing, M.P. and P.B.; supervision, Y.-F.L.; funding acquisition, M.P. All authors have read and agreed to the published version of the manuscript.

**Funding:** This paper was supported by the Slovak Research and Development Agency (Agentúra na podporu výskumu a vývoja) under projects APVV-SK-TW-21-0002 and APVV-SK-TW-17-0005; the Scientific Grant Agency (Vedecká grantová agentúra MŠVVaŠ SR a SAV), project numbers VEGA 1/0753/20 and VEGA 2/0165/21; and the Cultural and Educational Grant Agency (Kultúrna a edukačná grantová agentúra MŠVVaŠ SR), project number KEGA 009TUKE-4-2019, the latter two funded by the Ministry of Education, Science, Research, and Sport of the Slovak Republic.

**Acknowledgments:** We would like to thank all the authors for the papers they submitted to this Special Issue. We would also like to acknowledge all the reviewers for their careful and timely reviews to help improve the quality of this Special Issue.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

## *Review* **A Review on Speech Emotion Recognition Using Deep Learning and Attention Mechanism**

**Eva Lieskovská \*, Maroš Jakubec, Roman Jarina and Michal Chmulík**

Faculty of Electrical Engineering and Information Technology, University of Žilina, Univerzitná 8215/1, 010 26 Žilina, Slovakia; maros.jakubec@feit.uniza.sk (M.J.); roman.jarina@uniza.sk (R.J.); michal.chmulik@uniza.sk (M.C.)

**\*** Correspondence: eva.lieskovska@feit.uniza.sk

**Abstract:** Emotions are an integral part of human interactions and are significant factors in determining user satisfaction or customer opinion. Speech emotion recognition (SER) modules also play an important role in the development of human–computer interaction (HCI) applications. A tremendous number of SER systems have been developed over the last decades. Attention-based deep neural networks (DNNs) have been shown to be suitable tools for mining information that is unevenly distributed in time in multimedia content. The attention mechanism has recently been incorporated into DNN architectures to also emphasise emotionally salient information. This paper provides a review of recent developments in SER and also examines the impact of various attention mechanisms on SER performance. An overall comparison of the system accuracies is performed on the widely used IEMOCAP benchmark database.

**Keywords:** speech emotion recognition; deep learning; attention mechanism; recurrent neural network; long short-term memory

#### **1. Introduction**

The aim of human–computer interaction (HCI) is not only to create a more effective and natural communication interface between people and computers; its focus also lies on creating aesthetic design, a pleasant user experience, help in human development, online learning improvement, etc. Since emotions form an integral part of human interactions, they have naturally become an important aspect of the development of HCI-based applications. Emotions can be technologically captured and assessed in a variety of ways, such as through facial expressions, physiological signals, or speech. With the intention of creating more natural and intuitive communication between humans and computers, emotions conveyed through signals should be correctly detected and appropriately processed. Throughout the last two decades of research focused on automatic emotion recognition, many machine learning techniques have been developed and constantly improved.

Emotion recognition is used in a wide variety of applications. Anger detection can serve as a quality measurement for voice portals [1] or call centres. It allows the provided services to be adapted to the emotional state of customers accordingly. In civil aviation, monitoring the stress of aircraft pilots can help reduce the rate of possible aircraft accidents. Many researchers, who seek to enhance players' experiences with video games and to keep them motivated, have been incorporating emotion recognition modules into their products. Hossain et al. [2] used multimodal emotion recognition for quality improvement of a cloud-based gaming experience through emotion-aware screen effects. The aim is to increase players' engagement by adjusting the game in accordance with their emotions. In the area of mental health care, a psychiatric counselling service with a chatbot is suggested in [3]. The basic concept consists of the analysis of input text, voice, and visual clues in order to assess the subject's psychiatric disorder and inform about diagnosis and treatment. Another suggestion for an emotion recognition application is a conversational chatbot, where speech emotion identification can play a role in better conversation [4]. A real-time SER application should find an optimal trade-off between low computing power, fast processing times, and a high degree of accuracy.

**Citation:** Lieskovská, E.; Jakubec, M.; Jarina, R.; Chmulík, M. A Review on Speech Emotion Recognition Using Deep Learning and Attention Mechanism. *Electronics* **2021**, *10*, 1163. https://doi.org/10.3390/ electronics10101163

Academic Editors: Matúš Pleva, Yuan-Fu Liao, Patrick Bours and Chiman Kwan

Received: 11 March 2021 Accepted: 9 May 2021 Published: 13 May 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).



In this review, we focus on works dealing with the processing of acoustic clues from speech to recognise the speaker's emotions. The task of speech emotion recognition (SER) is traditionally divided into two main parts: feature extraction and classification, as depicted in Figure 1. During the feature extraction stage, a speech signal is converted to numerical values using various front-end signal processing techniques. Extracted feature vectors have a compact form and ideally should capture essential information from the signal. In the back-end, an appropriate classifier is selected according to the task to be performed.


**Figure 1.** Block scheme of general speech emotion recognition system.

Examples of widely used acoustic features are mel-frequency cepstral coefficients (MFCCs), linear prediction cepstral coefficients (LPCC), short-time energy, fundamental frequency (F0), formants [5,6], etc. Traditional classification techniques include probabilistic models such as the Gaussian mixture model (GMM) [6–8], hidden Markov model (HMM) [9], and support vector machine (SVM) [10–12]. Over the years of research, various artificial neural network architectures have also been utilised, from the simplest multilayer perceptron (MLP) [8] through extreme learning machine (ELM) [13], convolutional neural networks (CNNs) [14,15], to deep architectures of residual neural networks (ResNets) [16] and recurrent neural networks (RNNs) [17,18]. In particular, long short-term memory (LSTM) and gated recurrent units (GRU)-based neural networks (NNs), as state-of-the-art solutions in time-sequence modelling, have been widely utilised in speech signal modelling. In addition, various end-to-end architectures have been proposed to learn jointly both extraction of features and classification [15,19,20].

Besides LSTM and GRU networks, the introduction of an attention mechanism (AM) in deep learning may be considered as another milestone in sequential data processing. The purpose of AM is, as with human visual attention, to select relevant information and filter out irrelevant ones. The attention mechanism, first introduced for a machine translation task [21], has become an essential component of neural architectures. Incorporating AM into encoder–decoder-based neural architectures significantly boosted the performance of machine translation even for long sequences [21,22]. Motivated by the success of attention on machine translation, many researchers have considered it as an essential component of neural architectures for a remarkably large number of applications including natural language processing (NLP) and speech processing. Since emotionally salient information is unevenly distributed across speech utterances, an integration of AM into NN architecture is also of interest among the SER research community.
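
To make the attention idea concrete, the following minimal PyTorch sketch (our illustration, not taken from any reviewed system; layer sizes are arbitrary) shows attention pooling that weights frame-level features by learned relevance scores before utterance-level classification.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Weights frame-level features with learned attention scores and
    aggregates them into one utterance-level vector."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)  # scalar relevance score per frame

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, feat_dim), e.g. outputs of an LSTM/GRU encoder
        weights = torch.softmax(self.score(frames), dim=1)   # (batch, time, 1)
        return (weights * frames).sum(dim=1)                 # (batch, feat_dim)

# Example: pool BiLSTM outputs and classify into four emotion classes
encoder = nn.LSTM(input_size=40, hidden_size=64, batch_first=True, bidirectional=True)
pool = AttentionPooling(feat_dim=128)
classifier = nn.Linear(128, 4)

x = torch.randn(8, 300, 40)       # 8 utterances, 300 frames, 40 log-mel bands
out, _ = encoder(x)
logits = classifier(pool(out))    # (8, 4)
```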

Although several review articles have been devoted to automatic speech emotion recognition [23–29], to the best of the authors' knowledge, a comprehensive overview of SER solutions containing attention mechanisms is lacking. Motivated by this finding, in this article, we provide a review of the recent development in the speech emotion recognition field with a focus on the impact of AM in deep learning-based solutions on SER performance.

The paper is organised as follows: Firstly, the scope and methodology of the survey are discussed in Section 2. In Section 3, we address some of the key issues in deep learning-based SER development. Section 4 provides a theoretical background of the most commonly used neural architectures incorporating AM. Then, we review recently proposed SER systems incorporating different types of AM. Finally, we compare the accuracy of the selected systems on the IEMOCAP benchmark database in Section 5. The section is concluded by a short discussion on the impact of AM on SER system performance.

#### **2. Scope and Methodology**

The paper is divided into two main parts: the first part discusses the general concept of SER and related works, including a description of novel and deep features, transfer learning, and generalisation techniques, while the second part focuses on DNN models incorporating the attention mechanism. We used the Scopus and Web of Science (WoS) citation databases to search for relevant publications. The number of published papers by year of publication is given in Table 1. This is the overall number of works found when searching with the keywords: speech, emotion, recognition, attention. Due to the excessive amount of research work dealing with this topic, only selected papers from the last 4 to 5 years of intensive research are reported in our study. In this review, mainly speech-related works were taken into consideration; papers dealing with other physiological signals such as EEG and heart rate variability, as well as the fusion of multiple modalities, were excluded.

**Table 1.** Number of publications during the initial search for speech emotion recognition and attention speech emotion recognition.


For an additional overview of research work dealing with SER from previous and recent years, we refer the reader to the reviews and surveys listed in Table 2. Note that our review does not cover all the topics related to SER, such as detailed descriptions of speech features, classifiers, and emotional models, which are addressed more closely in other survey papers. We assume the reader's knowledge of probabilistic and machine learning-based approaches to data classification as well as of the basic DNN architectures. To the best of the authors' knowledge, none of the other reviews or surveys (listed in Table 2) deal with the attention mechanism in more detail; hence, we consider it to be our main contribution.

**Table 2.** A brief summary of reviews and surveys related to SER.




#### *Evaluation Metrics*

In this section, common metrics for accuracy evaluation are listed. For a multiclass classification task, accuracy is assessed per class first, and then the average accuracy is determined. This is denoted as unweighted accuracy hereafter. If the class accuracies are weighted according to the number of per-class instances, then the evaluation metric may not reflect the unbalanced nature of the data (which is very common with databases of emotional speech). Therefore, the unweighted accuracy is often a better indicator of the system's accuracy. The common evaluation metrics for SER tasks are as follows:

• Precision is the ratio of all correctly positively classified samples (true positives—TP) to all positively classified samples (TP and false positives—FP). For K-class evaluation, the precision is computed as follows:

$$\text{precision} = \frac{\sum_{k=1}^{K} \frac{\text{TP}_k}{\text{TP}_k + \text{FP}_k}}{K} \,. \tag{1}$$

• Recall is the ratio of all correctly positively classified samples (TP) to the number of all samples in a tested subgroup (TP and false negative FN). Recall indicates a class-specific recognition accuracy. Similarly, as in the case of precision, the recall for a multiclass classification problem is computed as the average of recalls for individual classes.

$$\text{recall} = \frac{\sum_{k=1}^{K} \frac{\text{TP}_k}{\text{TP}_k + \text{FN}_k}}{K} \,. \tag{2}$$


• Weighted average recall (WAR) weights the recall of each class k by the number of samples in that class, |s_k|:

$$\text{WAR} = \frac{\sum_{k=1}^{K} |s_k| \cdot \text{recall}_k}{\sum_{k=1}^{K} |s_k|} \tag{3}$$

• F1 score is defined as the harmonic mean of the precision and recall.

$$\text{F1} = 2 \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \tag{4}$$

Note that all of the above-mentioned classification metrics are in the range of [0, 1] (×100%). A regression problem is often encountered when dealing with a continuous emotional scale. The appropriate metric for the regression is the correlation coefficient, determined in two ways:

• Pearson's correlation coefficient (PCC; ρ) measures the correlation between the true and predicted values (x and y, respectively). Given the pairs of values {(xn, yn)}, n = 1, 2, . . . , N, Pearson's correlation coefficient is computed as follows:

$$\rho = \frac{\sum_{n=1}^{N} (x_n - \mu_x)(y_n - \mu_y)}{\sqrt{\sum_{n=1}^{N} (x_n - \mu_x)^2 \sum_{n=1}^{N} (y_n - \mu_y)^2}}, \tag{5}$$

where n denotes the index of the current pair, and µx and µy are the mean values of xn and yn, respectively.

• Concordance Correlation Coefficient (CCC; ρc) examines the relationship between the true and predicted values from a machine learning model. CCC lies in the range of [−1, 1], where 0 indicates no correlation and 1 is perfect agreement or concordance.

$$\rho_c = \frac{2\rho\sigma_x\sigma_y}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2}, \tag{6}$$

where µ is the mean value, σ is the standard deviation, and ρ is Pearson's correlation coefficient.
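
For reference, the formulas in Equations (1)–(6) can be computed as in the following NumPy sketch (an illustrative implementation, not code from any of the reviewed systems); the confusion matrix and prediction arrays are hypothetical inputs.

```python
import numpy as np

def classification_metrics(cm: np.ndarray):
    """cm[i, j] = number of samples of true class i predicted as class j."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp                          # predicted as class k but wrong
    fn = cm.sum(axis=1) - tp                          # true class k but missed
    support = cm.sum(axis=1).astype(float)            # |s_k|, samples per class
    precision = np.mean(tp / (tp + fp))               # Eq. (1), macro-averaged
    recall_k = tp / (tp + fn)                         # per-class recall
    uar = recall_k.mean()                             # Eq. (2), unweighted accuracy
    war = (support * recall_k).sum() / support.sum()  # Eq. (3), weighted accuracy
    f1 = 2 * precision * uar / (precision + uar)      # Eq. (4)
    return precision, uar, war, f1

def pcc(x, y):
    """Pearson's correlation coefficient, Eq. (5)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    dx, dy = x - x.mean(), y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

def ccc(x, y):
    """Concordance correlation coefficient, Eq. (6)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return 2 * pcc(x, y) * x.std() * y.std() / (
        x.var() + y.var() + (x.mean() - y.mean()) ** 2)
```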

A comparison of published SER solutions is difficult due to the different experimental conditions used. Thus, we attempted at least an intuitive comparative analysis of the performance of the published DNN-based SER systems. We grouped the systems according to the emotional datasets used for conducting the experiments. Since the settings of the datasets differ significantly, we also group the compared works according to emotional labelling (discrete/continuous SER) and/or the number of classes being recognised and a common cross-validation scenario. For the evaluation, we use the most widely used IEMOCAP database, on which most of the state-of-the-art systems have been tested. For comparison, we also list the performance of the systems tested on the EmoDB and RECOLA datasets.

#### **3. Speech Emotion Recognition and Deep Learning**

In this section, we review the most relevant issues in today's SER system development in general: (1) emotional speech database development, (2) speech feature extraction and DL-based emotion modelling, and (3) selected techniques for SER performance improvement, such as data augmentation, transfer learning, and cross-domain recognition (the attention mechanism is addressed in Sections 4 and 5). A comparison of the state-of-the-art works (excluding AM) based on common criteria is provided at the end of this section.

#### *3.1. Databases of Emotional Speech*

Since the state-of-the-art SER solutions are exclusively based on data-driven machine learning techniques, the selection of a suitable speech database is naturally a key task in building such SER systems. Several criteria have to be taken into account when selecting a proper dataset, such as the degree of naturalness of emotions, the size of the database, and the number of available emotions. The databases can be divided into three basic categories:

• acted (simulated) emotions, produced by actors on demand;
• elicited (induced) emotions, evoked in subjects under laboratory conditions;
• natural (spontaneous) emotions, captured in real-life situations.

Naturally, speech databases are created in various languages, and they may consist of a variety of emotional states. However, emotion labelling is not unified. Recognised emotions can be labelled into several discrete emotional classes, as shown in Table 3. The common way is labelling into the six basic (known as the big six) emotional categories (anger, disgust, fear, happiness, sadness, and surprise) plus neutral. If SER is considered a regression problem, the emotions are mapped to continuous values representing the degree of emotional arousal, valence, and dominance. Valence is a continuum ranging from unhappiness to happiness, arousal ranges from sleepiness to excitement, and dominance ranges from submissiveness to dominance (e.g., control, influence) [31]. In Table 3, the most widely used databases of emotional speech are listed.

We would also like to draw attention to the following issue related to speech emotion rating and annotation. A distinction has to be made between perceived (or observed) emotion and elicited (induced) emotion. Unlike in music emotion recognition or affective analysis of movies, where attention is paid to the listener's or spectator's experience, in the case of speech emotion recognition the focus is on the speaker and his or her emotional state. The way the data are annotated is of great importance, especially in the case of annotation of spontaneous and induced emotions of the speaker. The emotion in speech is usually annotated by a listener. Another option is to use the rating provided by the speaker himself (felt or induced emotions) or obtained by analysis of the speaker's physiological signals. Since experimental studies have shown a considerable discrepancy between emotion ratings by the speaker and the observer, correct and unambiguous emotion rating is still an open issue [32].


**Table 3.** Comparison of databases of emotional speech.

The meanings of the acronyms are as follows: Num. of subjects: F—female, M—male; Discrete labels: A—anger, B—boredom, C—contempt, D—disgust, E—excitement, Em—emphatic, F—fear, H—happiness, He—helplessness, I—irritation, J—joy, M—motherese, N—neutral, O—other, R—reprimanding, S—sadness, Sr—surprise; Dim. labels: dimensional labels (arousal, valence, dominance); Modality: A—audio, V—video, T—text, MCF—motion capture of face, ECG—electrocardiogram, EDA—electrodermal activity. <sup>1</sup> Overall, 46 subjects participated in sample recording; however, only 27 subjects were available for the audio–visual emotion recognition challenge (AVEC) [43].

#### *3.2. Acoustic Features*

The purpose of SER is to automatically determine the emotional state of the speaker via the speech signal. Changes in the waveform's frequency and intensity may be observed when comparing differently emotionally coloured speech signals [9]. The aim of SER is to capture these variations using different discriminative acoustic features. Acoustic features (referred to as low-level descriptors (LLDs)) are often aggregated by temporal feature integration methods (e.g., statistical and spectral moments) in order to obtain features at a global level [44]. High-dimensional feature vectors can be transformed into a compact representation using feature selection (FS) techniques. The aim is to find substantial information in the feature set and simultaneously discard redundant values. In this way, it is possible to optimise the time complexity of the system while maintaining similar accuracy.

Over the many years of research, the focus has been placed on the selection of an ideal set of descriptors for emotional speech. MFCCs, originally proposed for speech/speaker recognition, are also well established for the derivation of emotional clues. Prosodic descriptors (such as pitch, intensity, rhythm, and duration), as well as voice quality features (jitter and shimmer), are common indicators of human emotions as well [8]. In addition, numerous novel features and feature selection techniques have been developed and successfully applied to SER [7,44–50]. For instance, Gammatone frequency cepstral coefficients proposed by Liu [45] yielded a 3.6% average increase in accuracy compared to MFCCs. Epoch-based features extracted by zero time windowing also provided emotion-specific and complementary information to MFCCs [46]. Ntalampiras et al. [44] proposed a multiresolution feature called the perceptual wavelet packet, based on critical-band analysis. It takes into account that not all parts of the spectrum affect human perception in the same way. In [7], the nonlinear Teager–Kaiser energy operator (TEO) was used in combination with MFCC for the detection of stressed emotions. Kerkeni et al. [47] proposed modulation spectral features and modulation frequency features, based on empirical mode decomposition of the input signal and TEO extraction of the instantaneous amplitude and instantaneous frequency of the AM–FM components. Yogesh et al. [48] extracted nonlinear bispectral features and bicoherence features from speech and glottal waveforms.
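
As an illustration of LLD extraction and simple temporal feature integration, the sketch below uses the librosa library (our choice for illustration; the file path and parameter values are placeholders) to compute MFCCs, short-time energy, and F0, and aggregates them with basic statistical functionals.

```python
import numpy as np
import librosa

def extract_llds(path: str, sr: int = 16000):
    """Frame-level low-level descriptors: MFCCs, RMS energy, and F0."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)     # (13, frames)
    energy = librosa.feature.rms(y=y)                      # (1, frames)
    f0 = librosa.yin(y, fmin=50, fmax=400, sr=sr)          # (frames,)
    return mfcc, energy, f0

def functionals(lld: np.ndarray) -> np.ndarray:
    """Aggregate an LLD matrix (dims, frames) or track (frames,) into statistics."""
    lld = np.atleast_2d(lld)
    return np.concatenate([lld.mean(axis=1), lld.std(axis=1),
                           lld.min(axis=1), lld.max(axis=1)])

mfcc, energy, f0 = extract_llds("utterance.wav")
utterance_vector = np.concatenate([functionals(mfcc), functionals(energy), functionals(f0)])
```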

However, despite great research efforts, there is still no single solution for the most appropriate features. For better comparability of SER systems and their obtained results, attempts to unify feature extraction have been made. When selecting appropriate audio features for SER, it is a common practice to use the openSMILE open-source audio feature extraction toolkit. It contains several feature sets intended for automatic emotion recognition, some of which were proposed in several emotion-related challenges and benchmark initiatives.
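
A minimal sketch of extracting one such standardized set with the openSMILE Python wrapper is given below; the chosen feature set (eGeMAPSv02 functionals) and the audio file name are assumptions for illustration.

```python
import opensmile

# One fixed-length vector of eGeMAPS functionals per utterance
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)
features = smile.process_file("utterance.wav")   # pandas DataFrame of 88 functionals
print(features.shape)
```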


#### *3.3. Data-Driven Features*

Apart from speech parameterisation with handcrafted features, another popular approach is to let a neural network (NN) perform the feature extraction. A typical example is the utilisation of a CNN to learn from 2D speech spectrograms, log-mel spectrograms, or even from the raw speech signals [19,55]. A CNN is usually supplemented by fully connected (FC) layers and softmax for classification [56]. An architecture which consists of multiple convolutional layers is often referred to in the literature as a deep CNN (DCNN). Huang and Narayanan [55] examined the ability of CNN to perform task-specific spectral decorrelation using log-mel filter-bank (MFB, or log-mel spectrogram) features as input. Since MFCCs are log-mels decorrelated by the discrete cosine transform (DCT), the authors demonstrated that the CNN module was a more effective task-specific decorrelation technique under both clean and noisy conditions (experiments were conducted on the eNTERFACE'05 [35] database). Aldeneh and Provost [14] experimentally proved that a system based on a minimum set of 40 MFB features and a CNN architecture can achieve results similar to an SVM trained on a large feature set (1560). Compared to a complex system based on deep feature extraction derived from 1582-dimensional features and an SVM classifier [10], the proposed 40 MFB-CNN provides a more effective, end-to-end solution. Fayek et al. [15] proposed various end-to-end NN architectures to model intra-utterance dynamics. CNN had better discriminative performance than DNN and LSTM architectures, all trained with MFB input features. Vrysis et al. [57] conducted a thorough comparison between standard features, temporal feature integration tactics, and 1D and 2D DCNN architectures. The designed convolutional algorithms delivered excellent performance, surpassing the traditional feature-based approaches. The best 2D DCNN architecture achieved higher accuracy than the 1D DCNN with a comparable number of parameters. Moreover, the 1D DCNN was four times slower in execution. Hajarolasvadi and Demirel [58] proposed a 3D CNN model for speech emotion recognition. The utterances, in the form of overlapping frames, were processed in two ways: 88-dimensional features and a spectrogram were extracted for each frame. The 3D spectrogram representation was based on the selection of the *k* most discriminant frames with the *k*-means clustering algorithm applied to the extracted features. Using this approach, it is possible to capture both spectral and temporal information. The proposed architecture was able to outperform a pretrained 2D CNN model transferred to the SER task. Mustaqeem and Kwon [59] proposed a plain CNN architecture called deep stride CNN, which uses strides for downsampling of input feature maps instead of pooling layers. The authors dealt with proper pre-processing in the form of noise reduction through novel adaptive thresholding and decreased computational complexity by utilising a simplified CNN structure. This stride CNN improved accuracy by 7.85% and 4.5% on the IEMOCAP and RAVDESS datasets, respectively, and significantly outperformed state-of-the-art systems.
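
In the spirit of the MFB/CNN systems discussed above, a minimal 2D CNN classifier over log-mel spectrogram inputs could be sketched as follows (a PyTorch illustration with arbitrary layer sizes, not a reproduction of any cited architecture).

```python
import torch
import torch.nn as nn

class MelCNN(nn.Module):
    """Small 2D CNN over log-mel spectrograms shaped (1, mel_bins, frames)."""
    def __init__(self, n_classes: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),   # fixed-size map regardless of utterance length
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(32 * 4 * 4, 64), nn.ReLU(), nn.Linear(64, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# 8 utterances, 40 mel bands, 300 frames
logits = MelCNN()(torch.randn(8, 1, 40, 300))   # (8, 4)
```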

#### *3.4. Temporal Variations Modelling*

Emotional content in speech varies through time; therefore, it is appropriate to leverage the techniques which are effective for temporal modelling, such as stochastic HMM or neural networks with recurrent units (e.g., LSTM or GRU).

Tzinis and Potamianos [17] studied the effects of variable sequence lengths for LSTM-based recognition (see Section 4 for an RNN–LSTM description). Recognition on sequences concatenated at the frame level yielded better results for phoneme-length sequences (90 ms). The best results were achieved over statistically aggregated segments at the word level (3 s): 64.16% WA and 60.02% UA (IEMOCAP). In this case, extraction of higher-level statistical functions from multiple LLDs over speech segments led to a more salient representation of the underlying emotional dynamics. The proposed solution yielded results comparable to a more complex system based on deep feature extraction and SVM classifiers [10,60].

Recurrent layers are often used in combination with CNN (referred to as CRNN) for the exploitation of temporal information from emotional speech [61]. In this way, both local and global characteristics are modelled. Zhao et al. [62] compared the performance of 1D and 2D-CNN LSTM architectures with raw speech and log-mel spectrograms as input, respectively. Moreover, 2D-CNN LSTM performed better in the modelling of local and global representations than its 1D counterpart. The 2D-CNN LSTM outperformed traditional approaches such as DBN and CNN. Luo et al. [63] proposed a two-channel

system with joint learning of handcrafted HSFs/DNN and log-mel spectrogram/CRNN learned features. In this way, it is possible to obtain different kinds of information from emotional speech. The authors also designed another jointly learned architecture—multi-CRNN with one CRNN channel learning from a longer time scale of spectrogram segment and a second CRNN channel for deeper layer-based feature extraction. Their CRNN baseline consisted of CNN–LSTM with a concatenation of three pooling layers (average, minimum, and maximum). Jointly learned SER systems extracted more robust features than the plain CRNN system and HSF–CRNN outperformed multi-CRNN. Satt et al. [64] proposed CNN–BiLSTM architecture with spectrogram as input and worked with two different frequency resolutions. The results indicated that lower resolution yields lower accuracy by 1–3%. The combination of CNN and BiLSTM achieved better results in comparison with the stand-alone CNN model. Moreover, unweighted accuracy was improved by the proposed two-step classification, where special emphasis was put on a neutral class. Ma et al. [65] dealt with the accuracy loss introduced by the speech segmentation process, i.e., division of longer utterances into segments of the same length. They proposed a similar approach to Satt et al. [64] (a combination of CNN and BiGRU), except that spectrogram of the whole sentence, was used as input. They introduced padding values and dealt with the appropriate processing of valid and padded sequences. Moreover, different weights were assigned to the loss so that the length of the sentence does not affect the bias of the model. There was a significant performance improvement over segmentation methods with fixed-length inputs. Compared to [64], the proposed model using variable-length input spectrograms achieved absolute improvements of 2.65% and 4.82%, in WA and UA.
A significant part of the works on SER prefers to model emotions on continuous scale (usually in the activation–valence emotional plane). Several works on continuous SER have also proven that CNN-based data-driven features outperform traditional hand-engineered features [19,66,67]. For example, authors of [19,67] proposed end-to-end continuous SER systems, in which 1D CNN was applied on the raw waveform and temporal dependencies were then modelled by the Bi-LSTM layers. Khorram et al. [66] proposed two architectures for continuous emotions recognition—dilated CNN with a varying dilation factor for different layers and downsampling/upsampling CNN—with different ways of modelling long-term dependencies. AlBadawy and Kim [68] further improved the accuracy of valence with joint modelling of the discrete and continuous emotion labels. Table 4 summarises the top performances of the continuous SER systems tested on the RECOLA dataset.

**Table 4.** Comparison of continuous SER on RECOLA datasets: A–V = activation–valence, ρc—concordance correlation coefficient.


#### *3.5. Transfer Learning*

The methods based on leveraging pretrained neural networks can often obtain better results than traditional techniques [11,12]. Some studies have also shown that pretrained neural networks outperform randomly initialised networks [69]. The use of transfer learning is especially appropriate for SER due to the lack of large speech emotion corpora. The deep spectrum features proposed in [12], derived by feeding spectrograms through AlexNet [70], a pretrained network designed for the image classification task, are reported to match and even outperform some of the conventional feature extraction techniques. Zhang et al. [11] proposed the use of the pretrained AlexNet DCNN model to learn from three-channel log-mel spectrograms extracted from emotional speech (the
additional two channels contained the first and second time derivatives of the spectra, known as delta features). The authors also proposed a discriminant temporal pyramid matching (DTPM) pooling strategy to aggregate segment-level features (obtained from the DCNN block) into discriminative utterance-level representations. According to the results obtained with four different databases, AlexNet fine-tuned on emotional speech performed better in comparison with the simplified DCNN model, and at the same time, DTPM-based pooling outperformed the conventional average pooling method. Xi et al. [16] conducted several experiments with the utilisation of a model pretrained for the speaker verification task. The authors proposed a residual adapter, which is the residual CNN ResNet20 trained on the VoxCeleb2 speaker dataset with adapter modules trained on IEMOCAP emotion data. The residual adapter outperformed ResNet20 trained on emotional data only. This proved the inadequacy of using a small dataset for training with the ResNet architecture.
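
The fine-tuning recipe described above can be sketched roughly as follows (a PyTorch/torchvision illustration assuming a recent torchvision release; the three-channel spectrogram input and the four-class output layer are placeholders, not the setup of the cited works).

```python
import torch
import torch.nn as nn
from torchvision import models

# AlexNet pretrained on ImageNet; replace the last layer for four emotion classes
model = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
model.classifier[6] = nn.Linear(model.classifier[6].in_features, 4)

# Optionally freeze the convolutional front-end and fine-tune only the classifier
for p in model.features.parameters():
    p.requires_grad = False

# Three-channel input: log-mel spectrogram plus its delta and delta-delta,
# resized to the 224 x 224 resolution that AlexNet expects
x = torch.randn(8, 3, 224, 224)
logits = model(x)   # (8, 4)
```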

#### *3.6. Generalisation Techniques*

The lack of sufficiently large datasets and their imbalanced nature are problems often encountered in SER. With the increase in complexity and size of DNNs, a large dataset is essential for their good performance. One of the solutions is to extend the dataset by various deformation techniques. This approach is limited by the possibility of losing the emotional content through inappropriate deformation of speech samples. The insufficient amount of data can also be addressed by utilising data from other emotional databases. However, this raises a problem of mismatched conditions between training and testing data, or, in other words, a problem of mismatched domains.

#### 3.6.1. Data Augmentation

Audio datasets can be effectively expanded (or augmented) using various deformation techniques such as pitch and/or time shifting, the addition of background noise, and volume control [71]. The addition of various noise levels can expand the dataset up to several times [72]. In this subsection, data augmentation techniques applied specifically for the SER task are briefly listed.
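
A few of these basic waveform deformations are illustrated below (librosa/NumPy sketch; the file path and parameter values are arbitrary examples).

```python
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)

def add_noise(y: np.ndarray, snr_db: float = 20.0) -> np.ndarray:
    """Add white noise at a chosen signal-to-noise ratio (in dB)."""
    noise = np.random.randn(len(y))
    scale = np.sqrt(np.mean(y ** 2) / (10 ** (snr_db / 10) * np.mean(noise ** 2)))
    return y + scale * noise

y_noisy = add_noise(y, snr_db=15.0)
y_pitched = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)   # up by two semitones
y_slower = librosa.effects.time_stretch(y, rate=0.9)           # 10% slower
y_louder = 1.5 * y                                             # simple volume control
```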

In [14], augmentation based on speed perturbation resulted in an improvement of 2.3% and 2.8% on the IEMOCAP and MSP–IMPROV datasets, respectively. Etienne et al. [73] applied several augmentation techniques to the highly unbalanced samples of the IEMOCAP database: vocal tract length perturbation based on rescaling of the spectrograms along the frequency axis, oversampling of classes (happiness and anger), and the use of a higher frequency range. Compared to the baseline, the application of all three techniques increased the UA by about 4% (absolute improvement). Vryzas et al. [74] pointed out the fact that changes in the timing and tempo characteristics could result in an undesired loss of emotional clues. They used pitch alterations with constant tempo, based on sub-band sinusoidal modelling synthesis, for data augmentation. Although augmentation did not increase the accuracy of the proposed CNN system (for the AESDD dataset [33]), it can improve its robustness and generalisation.

A popular approach to data augmentation is the use of generative adversarial networks (GANs) for generating new in-distribution samples. A GAN consists of two networks which are trained together: a generator for generating new samples and a discriminator for deciding the authenticity of samples (generated vs. true sample) [75]. Sahu et al. [76] employed vanilla and conditional GAN networks (trained on the IEMOCAP dataset) for generating synthetic feature vectors. The proposed augmentation made slight improvements to the SVM's performance when the real data were appended with synthetic data. The authors pointed out that a larger amount of data is needed for a successful GAN framework. Chatziagapi et al. [77] leveraged a GAN for spectrogram generation to address the data imbalance. Compared to standard augmentation techniques, the authors achieved 10% and 5% relative performance improvements on IEMOCAP and FEEL-25k, respectively.

Fu et al. [78] designed an adversarial autoencoder (AAEC) emotional classifier, through which the dataset was expanded in order to improve the robustness and generalisation of the classifier. The proposed model generated most of the new samples almost within the real distribution.

#### 3.6.2. Cross-Domain Recognition

In the domain adaptation approach, there is an effort to generalise the model for effective emotion recognition across different domains. The performance of a speech emotion recognition system tuned for one emotional speech database can deteriorate significantly for different databases, even if the same language is considered. One may encounter mismatched domain conditions such as different environments, speakers, languages, or various phonation modes. All these conditions worsen the accuracy of the SER system in a cross-domain scenario. Therefore, a tremendous effort has been made to improve the generalisation of the classifier.

Deng et al. [79] proposed unsupervised domain adaptation based on an autoencoder. The idea was to train the model on whispered speech from the GeWEC emotion corpus, while normal speech data were used for testing. Inspired by Universum learning, the authors enhanced the model by integrating a margin-based loss, which adds information from unlabelled data (from another database) to the training process. The results showed that the proposed method outperformed other domain adaptation methods. Abdelwahab and Busso [80] discussed the negative impact of mismatched data distributions between the training and testing datasets (source and target domains) on the emotion recognition task. To compensate for the differences between the two domains, the authors used a domain adversarial neural network (DANN) [81], an adversarial multitask training technique that performs the emotion classification and domain classification tasks jointly. DANN effectively reduced the gap in the feature space between the source and target domains. Zheng et al. [82] presented a novel multiscale discrepancy adversarial (MSDA) network for conducting multiple-timescale domain adaptation for cross-corpus SER. The MSDA is characterised by three levels of discriminators, which are fed with global, local, and hybrid levels of features from the labelled source domain and the unlabelled target domain. MSDA integrates multiple timescales of deep speech features to train a set of hierarchical domain discriminators and an emotion classifier simultaneously in an adversarial training network. The proposed method achieved the best performance over all other baseline methods. Noh et al. [83] proposed a multipath and group-loss-based network (MPGLN), which supports supervised domain adaptation from multiple environments. It is an ensemble learning model based on a temporal feature generator using a BiLSTM, a feature extractor transferred from a pretrained VGG-like audio classification model, and the simultaneous minimisation of multiple losses. The proposed MPGLN was evaluated over five multidomain SER datasets and efficiently supported multidomain adaptation and reinforced model generalisation.
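A minimal sketch of the gradient reversal idea at the core of DANN [81]: features flow forward unchanged, but gradients from the domain classifier are negated, pushing the encoder towards domain-invariant representations. Layer sizes and class counts are illustrative assumptions.

```python
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Negate (and scale) the gradient flowing back into the encoder.
        return -ctx.lambd * grad_output, None


class DANN(nn.Module):
    def __init__(self, feat_dim=40, hidden=128, n_emotions=4, n_domains=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        self.emotion_head = nn.Linear(hidden, n_emotions)
        self.domain_head = nn.Linear(hidden, n_domains)

    def forward(self, x, lambd=1.0):
        h = self.encoder(x)
        emotion_logits = self.emotion_head(h)
        # Reverse gradients before the domain classifier (adversarial branch).
        domain_logits = self.domain_head(GradReverse.apply(h, lambd))
        return emotion_logits, domain_logits
```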

Language dependency and emotion recognition across different languages are common issues encountered in SER. One solution is to identify the language first and then perform language-dependent emotion recognition [5]. Another solution is to share different language databases and process them jointly; this is denoted as a multilingual scenario. In a cross-lingual scenario, one dataset is used for training and another for testing. Tamulevičius et al. [72] put together a cross-linguistic speech emotion dataset with more than 10,000 emotional utterances. It consists of six emotion datasets of different languages. Moreover, data augmentation was performed by adding white noise and applying Wiener filtering (expanding the dataset up to nine times). For the representation of speech emotion, the authors chose several two-dimensional acoustic feature spaces (cochleagrams, spectrograms, mel-cepstrograms, and fractal dimension-based features) and used a CNN for classification. The results showed the superiority of cochleagrams over the other feature spaces and confirmed that emotions are language dependent. With an increasing number of different language datasets in the training partition, the results obtained by testing on the remaining datasets slightly increased.

#### *3.7. DNN Systems Comparison*

In this subsection, we provide at least a coarse comparison of the performance of the related works discussed above (note that an exact comparison is not possible due to different test conditions, even when the same dataset is used). This summary does not include works incorporating attention mechanisms; the attention mechanism is discussed in Section 4.

We focused on finding common criteria and selecting datasets for the comparative analysis. From the literature review, we selected the two most widely used databases—EmoDB and IEMOCAP—and sorted the related works by the number of emotions used for classification and the cross-validation scheme. The resulting comparisons of SER systems on EmoDB and IEMOCAP are given in Tables 5 and 6, respectively.

For the EmoDB dataset, we considered research works that used all emotion classes and the leave-one-speaker-out (LOSO) method of cross-validation, i.e., the speaker-independent scenario. The human evaluation of emotions from EmoDB, with an average recognition rate of 84.3%, was surpassed by most of the works under comparison.

As seen in Table 5, the system incorporating handcrafted features with a proper temporal feature integration method yielded state-of-the-art results (>90% WA) in [44]. Thus, the aggregation of different descriptors carries significant emotional information. The disadvantage, however, is that high-dimensional feature sets often increase computational complexity. The low accuracy of the pretrained AlexNet in [84] was caused by the reduction of bandwidth and µ-law companding applied for the development of a real-time SER system (7% reduction in accuracy). Table 5 shows that the end-to-end CRNN architecture [62] outperformed the other works under comparison.

**Table 5.** Comparison of SER systems based on classification using a complete EmoDB dataset.


In the case of IEMOCAP, expanding the highly underrepresented class Happiness by merging it with Excitement naturally yields better results, especially in the UA measure. This effect can be seen in the first part of Table 6 (emotions: A, E + H, N, S). The common procedure for dataset partitioning is leave-one-session-out cross-validation (fivefold). A common approach is to use data from one speaker for validation and data from the remaining speakers for testing. IEMOCAP contains both scripted and improvised scenarios. Scripted recordings are often not incorporated into SER systems due to their possible correlation with lingual content (systems working with improvised data are marked with an asterisk in Table 6). Note that the SER system trained on the improvised dataset outperformed the system applied to the scripted dataset [86,87]. The degree of naturalness of emotional speech has a significant impact on recognition accuracy. Learning on improvised data only can result in better performance than learning on the combination of improvised and scripted data. This means that better accuracies can often be achieved with a smaller dataset.

**Table 6.** Comparison of SER systems for IEMOCAP dataset. Meaning of acronyms: A—anger, E—excitement, H—happiness, N—neutral, S—sadness.


\* Improvised data only.

For the IEMOCAP database, with the fivefold cross-validation technique and four emotions for classification (anger, sadness, happiness, and neutral), DNN–ELM [13], based on deep feature extraction and an ELM classifier, yielded accuracies of about 52.13% WA and 57.91% UA. These results were considered a baseline for further evaluation. They were surpassed by the RNN architecture with proper extraction of higher-level statistical functionals from multiple LLDs over speech segments. Results of 64.16% WA and 60.02% UA were obtained even on the full dataset (improvised and scripted).

Deep features extracted by a CNN often surpass traditional feature-based approaches [57,89]. A combination of CNN and BiLSTM (CRNN) is effective in deriving both local and global characteristics. A CRNN often achieves better results than stand-alone CNN models [62,64]. Ma et al. [65] emphasised the importance of using whole sentences for classification, because the segmentation of utterances caused a degradation of accuracy. The proposed CRNN architecture with variable-length spectrograms as input features increased the baseline results by 19% and 6% in WA and UA, respectively. Compared to hybrid models, the CRNN end-to-end approach is more practical to implement.
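As a concrete illustration of this family of models, the following minimal sketch (our own, not taken from the cited works) combines a small 2D CNN front end with a BiLSTM and mean pooling over time; all layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn


class CRNN(nn.Module):
    def __init__(self, n_mels=40, n_emotions=4):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
        )
        self.rnn = nn.LSTM(input_size=32 * (n_mels // 4), hidden_size=64,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * 64, n_emotions)

    def forward(self, spec):                   # spec: (batch, 1, n_mels, frames)
        f = self.cnn(spec)                     # (batch, 32, n_mels//4, frames//4)
        f = f.permute(0, 3, 1, 2).flatten(2)   # (batch, frames//4, 32 * n_mels//4)
        out, _ = self.rnn(f)                   # temporal context via BiLSTM
        return self.fc(out.mean(dim=1))        # average over time, then classify
```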

There is also a discussion about the performance of 1D and 2D convolutions. In our study, the 2D DCNN outperformed the 1D DCNN with a similar number of parameters [57]; moreover, the 1D DCNN was four times slower in execution. In the case of CRNN, the 2D CNN–LSTM outperformed its 1D counterpart in [62]. Yenigalla et al. [4] used phoneme embeddings in addition to spectrograms as input to a model consisting of two separate CNN channels. This two-channel solution further improved the results obtained by the CRNN proposed by Ma et al. [65] (from 71.45% WA\* to 73.9% WA\* and from 64.22% UA\* to 68.5% UA\*). The approach based on transfer learning, utilising a pretrained model from a speaker verification task, yielded similarly high performance [16]. The authors further proved the benefits of applying domain-agnostic parameters for SER and the inadequacy of using a small dataset for training the ResNet architecture. According to Table 6, the deep stride CNN architecture [59] achieved the highest accuracy in both WA and UA. The proposed stride CNN increases accuracy by extracting salient features from raw spectrograms while reducing computational complexity. However, the experiments were conducted with an 80/20% split of the dataset, which differs from the LOSO scheme with an additional validation partition.

#### **4. Speech Emotion Recognition with Attention Mechanism**

Before discussing the attention mechanism, we provide the theoretical background of the LSTM recurrent networks, which were first used as the base architecture for AM.

#### *4.1. LSTM–RNN*

Let the input sequence **X** = (**x**1, **x**2, . . . , **x**T), **X** ∈ R<sup>T×d</sup>, be transformed by an RNN into the hidden state representation **H** = (**h**1, **h**2, . . . , **h**T), **H** ∈ R<sup>T×n</sup>. Here, d and n denote the dimension of the input vectors and the number of hidden units, respectively. A basic principle of the RNN lies in the fact that the previous information from the sequence, **h**t−1, contributes to shaping the current outcome **h**t. The output vector **y**t of the simple RNN is obtained as follows:

$$\mathbf{h}_t = f(\mathbf{W}\mathbf{x}_t + \mathbf{U}\mathbf{h}_{t-1}), \tag{7}$$

$$\mathbf{y}_t = g(\mathbf{V}\mathbf{h}_t), \tag{8}$$

where **W** ∈ R<sup>n×d</sup>, **U** ∈ R<sup>n×n</sup>, **V** ∈ R<sup>n×n</sup> are learnable weights, and *f*, *g* are activation functions.

Note that long-term dependencies in a sequence cannot be captured by a simple RNN unit due to the vanishing gradient problem [90]. Various recurrent units with different internal structures, such as the long short-term memory (LSTM) and the gated recurrent unit (GRU), were developed to enable capturing dependencies over longer periods.

LSTM [91] uses internal gates to overcome the above-mentioned constraints of simple recurrent units. The input sequence flows through three types of gates—the forget gate **f**t (9), the input gate **i**t (10), and the output gate **o**t (13). Another component of the LSTM is a memory cell **c**t (12), whose state is updated at each time step. The cell state update depends on the previous hidden state vector **h**t−1, the current input vector **x**t, and the previous cell state **c**t−1 (the previous cell state can also be included in the gates, which is called a peephole connection). The inner structure of the LSTM is shown in Figure 2. Here, **X** = (**x**1, **x**2, . . . , **x**T) denotes the input sequence, where T is the length of the sequence. The individual operations in the LSTM are formalised as follows:

$$\mathbf{f}_t = \sigma(\mathbf{W}_f\mathbf{x}_t + \mathbf{U}_f\mathbf{h}_{t-1} + \mathbf{V}_f\mathbf{c}_{t-1} + \mathbf{b}_f), \tag{9}$$

$$\mathbf{i}_t = \sigma(\mathbf{W}_i\mathbf{x}_t + \mathbf{U}_i\mathbf{h}_{t-1} + \mathbf{V}_i\mathbf{c}_{t-1} + \mathbf{b}_i), \tag{10}$$

$$\mathbf{z}_t = \tanh(\mathbf{W}_z\mathbf{x}_t + \mathbf{U}_z\mathbf{h}_{t-1} + \mathbf{b}_z), \tag{11}$$

$$\mathbf{c}_t = \mathbf{f}_t \circ \mathbf{c}_{t-1} + \mathbf{i}_t \circ \mathbf{z}_t, \tag{12}$$

$$\mathbf{o}_t = \sigma(\mathbf{W}_o\mathbf{x}_t + \mathbf{U}_o\mathbf{h}_{t-1} + \mathbf{V}_o\mathbf{c}_t + \mathbf{b}_o), \tag{13}$$

$$\mathbf{h}_t = \mathbf{o}_t \circ \tanh(\mathbf{c}_t). \tag{14}$$

Here, **W**l ∈ R<sup>n×d</sup>, **U**l ∈ R<sup>n×n</sup>, **V**l ∈ R<sup>n×n</sup>, and **b**l ∈ R<sup>n</sup>, l ∈ {f, i, z, o}, are weight matrices and bias terms, tanh and σ are the hyperbolic tangent and sigmoid functions, and the sign ◦ denotes the Hadamard product.
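To make the gating explicit, the following numpy sketch transcribes one peephole-LSTM step, Equations (9)–(14), directly; the random parameter initialisation at the end is purely illustrative.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, p):
    """One peephole-LSTM time step; p holds W_*, U_*, V_*, b_* for f, i, z, o."""
    f_t = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["Vf"] @ c_prev + p["bf"])  # (9)
    i_t = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["Vi"] @ c_prev + p["bi"])  # (10)
    z_t = np.tanh(p["Wz"] @ x_t + p["Uz"] @ h_prev + p["bz"])                     # (11)
    c_t = f_t * c_prev + i_t * z_t                                                # (12)
    o_t = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev + p["Vo"] @ c_t + p["bo"])     # (13)
    h_t = o_t * np.tanh(c_t)                                                      # (14)
    return h_t, c_t


# Example with random parameters: d = 3 input features, n = 5 hidden units.
d, n = 3, 5
rng = np.random.default_rng(0)
p = {f"W{g}": rng.normal(size=(n, d)) for g in "fizo"}
p |= {f"U{g}": rng.normal(size=(n, n)) for g in "fizo"}
p |= {f"V{g}": rng.normal(size=(n, n)) for g in "fio"}   # no peephole on z
p |= {f"b{g}": np.zeros(n) for g in "fizo"}
h, c = lstm_step(rng.normal(size=d), np.zeros(n), np.zeros(n), p)
```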


**Figure 2.** Detail of the inner structure of the LSTM. The peephole connections are depicted with red lines.

In contrast to the LSTM, which incorporates past information into the DNN, the ability to look into the future is added in the bidirectional LSTM architecture (BiLSTM). As the name implies, a BiLSTM is composed of forward and backward LSTM layers. The computation in each layer depends on the direction in which the input sequence is read.

#### *4.2. Attention Mechanism*


Incorporation of the attention mechanism (AM) into DNN-based SER systems was often motivated by research in the NLP field [18,91,92] and computer vision [92]. We give a brief explanation of the attention mechanism from the NLP point of view due to the similarity of the tasks. "Language" attention can be traced back to work related to neural machine translation [21]. Here, the typical encoder–decoder approach was supplemented by the network's ability to soft-search for salient information from a sentence to be translated. The authors used BiRNN/RNN as encoder/decoder, both with the GRU inner structure [93]. The machine translation decoding process can be described as the prediction of the new target word **y**t, which is dependent on the context vector **c** obtained from the current sentence and the previously predicted words [93].

$$P(\mathbf{y}_t \mid \mathbf{y}_{<t}, \mathbf{c}) = g(\mathbf{h}_t, \mathbf{y}_{t-1}, \mathbf{c}) \tag{15}$$

Fixed encoding of sentences, which was considered to be a drawback in performance, was substituted by a novel attention mechanism. The main idea behind attention is to obtain a context vector created as a weighted sum of encoded annotations (18), while the attention weights **a** are learned by the so-called alignment model (16)—i.e., a jointly trained feedforward neural network.

$$\mathbf{e}_{kj} = \mathbf{v}_a^{\mathrm{T}} \tanh\left(\mathbf{W}_a \mathbf{h}_{k-1} + \mathbf{U}_a \mathbf{h}_j\right) \tag{16}$$

$$\mathbf{a}_{kj} = \frac{\exp(\mathbf{e}_{kj})}{\sum_{\tau=1}^{T} \exp(\mathbf{e}_{k\tau})} \tag{17}$$

$$\mathbf{c}_k = \sum_{j=1}^{T} \mathbf{a}_{kj} \mathbf{h}_j \tag{18}$$

where **v**a ∈ R<sup>n</sup>, **W**a ∈ R<sup>n×n</sup>, and **U**a ∈ R<sup>n×2n</sup> are weight matrices. Assuming two RNNs as the encoder and decoder, the attention weights are obtained by considering the hidden states of the encoder **h**j and the hidden state of the decoder **h**k−1 of the last predicted word. A context vector is computed at each time step, and the proposed network architecture is trained jointly. Figure 3 shows a general scheme of the described process incorporating AM.
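The following numpy sketch illustrates Equations (16)–(18): additive alignment scores, softmax attention weights, and the resulting context vector. The shapes follow the dimensions given above (2n-dimensional encoder annotations, n-dimensional decoder state); the toy example at the end uses random values and is only an assumption for demonstration.

```python
import numpy as np


def additive_attention(H, h_prev, Wa, Ua, va):
    """H: (T, 2n) encoder annotations, h_prev: (n,) previous decoder state."""
    e = np.array([va @ np.tanh(Wa @ h_prev + Ua @ h_j) for h_j in H])  # (16)
    a = np.exp(e - e.max()); a /= a.sum()                              # (17), stable softmax
    c = a @ H                                                          # (18), weighted sum of annotations
    return a, c


# Toy example: T = 6 encoder steps, n = 4 hidden units.
rng = np.random.default_rng(1)
T, n = 6, 4
a, c = additive_attention(rng.normal(size=(T, 2 * n)), rng.normal(size=n),
                          rng.normal(size=(n, n)), rng.normal(size=(n, 2 * n)),
                          rng.normal(size=n))
```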

**Figure 3.** Encoder–decoder framework with an attention mechanism.

#### AM Modifications

As numerous AM concepts and variations have been proposed and implemented over the last years, several different taxonomies of AM already exist. The reader can find different strategies for classifying AM, e.g., in [94,95]. Here, we point out some of the key works addressing different implementations of AM.


Luong et al. [22] proposed implementing AM globally and locally. Global attention uses the whole information from a source sentence. In this case, the context vector was computed as the weighted average of all source hidden states, while the attention weights were obtained from the current target hidden state **h**k and each source hidden state **h**j. This approach works on a principle similar to Bahdanau et al. [21], but it differs in simplified computation. Moreover, various alignment functions were examined (see Table 7). As the name implies, *local attention* focuses only on a subset of the source sentence. It is a computationally more efficient method. The context vector takes into account a preselected range of source hidden states with an aligned position corresponding to each target word. Thus, this type of context vector has a fixed length. The aligned position is either the current target word at time *t* or can be learned to be predicted. According to the results, dot alignment worked well for global attention, and general alignment was better for local attention. The best performance was achieved by the local attention model with predictive alignments. The machine translation model with the attention mechanism outperformed conventional non-attentional models.

**Table 7.** Computation of different alignment scores.
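The content of Table 7 is not reproduced in this reprint. For orientation, the sketch below shows the alignment scores commonly attributed to Luong et al. [22], namely dot, general, and concat; the variable names are our own (h_t is a target hidden state, h_s a source hidden state, and Wa, va are learnable parameters).

```python
import numpy as np


def score_dot(h_t, h_s):
    # Dot alignment: simple inner product of target and source states.
    return h_t @ h_s


def score_general(h_t, h_s, Wa):
    # General alignment: bilinear form with a learnable matrix Wa.
    return h_t @ (Wa @ h_s)


def score_concat(h_t, h_s, Wa, va):
    # Concat alignment: feedforward scoring of the concatenated states.
    return va @ np.tanh(Wa @ np.concatenate([h_t, h_s]))
```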


Lin et al. [96] applied AM to the sentiment analysis task. This approach allowed the system to perform a standalone search for significant parts of a sentence, thus reducing redundant information. Firstly, a BiLSTM encoded the words of the source sentence into individual hidden states **H**, and then the attention weights were computed by an alignment model from **H**. A sentence embedding vector was created as a weighted sum of the hidden states. Because it is not enough to focus only on a single component of the sentence, a concept of multiple hops of attention was proposed, in which several such embeddings for different parts of the sentence are created. The sentence embeddings, in the form of 2D matrices, were then used for sentiment recognition. Moreover, the authors proposed a penalisation technique to ensure that the summation weights cannot be similar.

AM is also a powerful tool for fine-grained aspect-level sentiment classification. Based on the aspect information, the sentiment of a sentence can take on different meanings. Wang et al. [97] first proposed an embedding representation of each aspect. An attention-based LSTM then learns the sentiment of a given sentence and is able to focus on its important parts by considering the given aspect. Aspect embeddings were incorporated by concatenation to the hidden state vectors, and the attention weights were obtained subsequently. The embeddings could additionally be appended to the word vectors as well. In this way, the information from the aspect is preserved in the hidden vector. This novel approach to aspect-level sentiment classification outperformed baseline systems. In [98], the aspect expression from sentences was formed as a weighted summation of aspect embeddings. The number of aspects was preselected, and the weights were computed so that both context information and the aspect expression were included. An unsupervised objective was applied to improve the training procedure. Another way to improve the attention model was to include words in the vicinity of the target aspect expression. This method takes advantage of the fact that context words closer to the target offer complementary clues for sentiment classification. The application of both methods improved results in comparison with various LSTM attention systems.

Chorowski et al. [99] divided the encoder–decoder-based attention mechanism into three categories according to the parameters used during the alignment process. Here, the computation of the attention weight vector **a**k can be based on location, in the form of the previous attention vector **a**k−1, on the current content **H**, or on a combination of both in hybrid AM. Table 8 shows the different implementations of AM. Even though hybrid AM seems to be the best solution for encoder–decoder-based speech recognition [99], the decoder part is omitted in SER, and therefore the AM for the SER task is simplified.

**Table 8.** The implementations of the attention mechanisms.


#### *4.3. Attention Mechanism in Speech Emotion Recognition*

This section describes various implementations of AM for speech emotion recognition. In emotional speech, one label is often used to characterise the whole utterance, although the sentence may also contain unemotional and silent intervals. Therefore, techniques for searching for the important parts of emotional speech have been developed.

The first attempts to make the model focus on emotionally salient clues were proposed before the invention of attention weights. Han et al. considered the speech segments with the highest energy to contain the most prominent emotional information [100]. Lee and Tashev [13] proposed a BiLSTM–ELM system for SER in which the importance of each frame is decided using the expectation–maximisation algorithm. Moreover, to represent the uncertainty of emotional labels, a speech sample can acquire one of two possible states—the given emotion or a "zero" emotion. The benefit of this system was leveraging the RNN's ability to model long contextual information from emotional speech while addressing the uncertainty of emotional labels. The BiLSTM–ELM system outperformed the DNN–ELM system, implemented according to [100], with 12% and 5% absolute improvements in UA and WA, respectively.

Most attention mechanisms in the SER field are based on the previously described method of attention weight computation using Equations (16) and (17). However, various modifications of AM were proposed; e.g., different input features can be used (feature maps), and simplified computations were developed (the decoder part is omitted in SER systems).

#### 4.3.1. Attentive Deep Recurrent Neural Networks

Huang and Narayanan [101] implemented two types of attention weights: content-based AM (19), inspired by [21,99], and its simplified version (20).

$$\mathbf{a}_j = \text{softmax}\left(\mathbf{v}_a^{\mathsf{T}} \sigma_a(\mathbf{W}_a \mathbf{h}_j)\right) \tag{19}$$

$$\mathbf{a}_j = \text{softmax}\left(\mathbf{v}_a^{\mathsf{T}} \mathbf{h}_j\right) \tag{20}$$

In order to avoid overfitting, the authors proposed separate training of the BiLSTM and AM components, as well as the application of dropout before the summation of hidden vectors. According to the results, the simplified implementation of the attention weights defined by (20) yielded better results. The AM-based system outperformed the non-AM system—an improvement from 57.87% to 59.33% in WA and from 48.54% to 49.96% in UA was observed. Moreover, the authors experimentally proved that the attention selection distribution was not merely correlated with the frame energy curve.
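A minimal PyTorch sketch of this kind of simplified attention pooling (in the spirit of Equation (20), not the authors' exact implementation): a learnable vector scores each frame of the recurrent output, softmax turns the scores into weights, and the weighted average summarises the utterance.

```python
import torch
import torch.nn as nn


class AttentionPooling(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        # Learnable scoring vector, analogous to v_a in Equation (20).
        self.u = nn.Parameter(torch.randn(hidden_dim))

    def forward(self, H):                       # H: (batch, frames, hidden_dim)
        scores = H @ self.u                     # one score per frame
        weights = torch.softmax(scores, dim=1)  # attention weights over frames
        return (weights.unsqueeze(-1) * H).sum(dim=1)  # (batch, hidden_dim)
```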

In [18], Mirsamadi et al. pointed out that only a few words in a labelled utterance are emotional. They highlighted the importance of considering silence intervals and emotionless parts of the utterance as well. Here, the attention weights were computed using the softmax function on the inner product between a trainable attention vector **u** and the RNN output **y**t at each time step, similarly to (20). In the subsequent step, the weighted average in time was performed, and a softmax layer was applied for the final emotion classification. This deep RNN architecture with AM is able to focus on emotionally significant cues and on their temporal variations at the utterance level. The proposed combination of BiLSTM and the novel mean-pooling approach with local attention revealed improved performance over many-to-one training and slightly increased results over the mean-pooling method. With only 32 LLDs, absolute improvements of 5.7% and 3.1% (in WA and UA) were achieved over the traditional SVM model, which needed additional statistical functionals for satisfactory results. Tao and Liu [102] discussed the limitation of the time-dependent RNN model and proposed an advanced LSTM (A–LSTM) for better temporal context modelling. Unlike LSTM, which uses the previous state to compute a new one, A–LSTM makes use of multiple states by combining information from preselected time steps. The weights were learned and applied to the inner states of the LSTM. The authors proposed a DNN–BiLSTM model with the learning of multiple tasks—emotion, speaker, and gender classification. Moreover, the BiLSTM was followed by an attention-based weighted pooling layer. A relative improvement of 5.5% was achieved with A–LSTM compared to the conventional LSTM. Thus, the time dependency modelling capability of the LSTM was improved. The proposed solution did not, however, outperform Mirsamadi's attentive RNN [18].

AM was also introduced into the forget gate **f**t of the LSTM cell in [103]. Here, the update of the cell state (21) is viewed as a weighted sum of the previous cell state **c**t−1 and the current update value **z**t.

$$\mathbf{c}_t = \mathbf{f}_t \circ \mathbf{c}_{t-1} + (1 - \mathbf{f}_t) \circ \mathbf{z}_t \tag{21}$$

$$\mathbf{f}_t = \sigma(\mathbf{W}_f \tanh(\mathbf{V}_f \mathbf{c}_{t-1})) \tag{22}$$

The weights for the cell state update were obtained by training the self-attention model (20), with **W**f ∈ R<sup>n×n</sup> and **V**f ∈ R<sup>n×n</sup> as trainable parameters. The computational complexity of the proposed attention gate was reduced by taking into account only the cell state at the previous moment, **c**t−1. The ComParE frame-level features were used for classification, while the proposed network had the ability to learn high-level dependencies. A second AM was utilised in the output gate, in the form of weights applied in both the time and feature dimensions. Compared to the traditional LSTM, the obtained results showed absolute improvements of 2.8%, 13.8%, and 8.5% in UAR for CASIA, eNTERFACE, and GEMEP, respectively. Xie et al. [104] proposed a dense LSTM with attention-based skip connections between the layers. In order to address the variable distribution of significant emotional information in speech, attention weights were incorporated into the LSTM's output in the time dimension. This approach was inspired by the global attention described in [22]. Assuming that different speech features have different abilities to distinguish emotion categories, weighting in the feature dimension was also implemented. The results showed that attention applied to the output of each layer improved the unweighted average recall and accelerated convergence compared with the general LSTM approach.

#### 4.3.2. Attentive Deep Convolutional Neural Network

Neumann and Vu [86] performed a comparison of different speech features with an attentive CNN architecture. It contains an attention layer based on a linear scoring function. Additionally, the authors applied MTL for both categorical and continuous labels (activation and valence). The results indicated a small difference in performance between MFB, MFCC, and eGeMAPS features and a slight improvement in accuracy with the MTL approach. The best results were reported with the combination of MFB features and the attentive CNN with MTL. Li et al. [92] used two types of convolution filters for the extraction of time-specific and frequency-specific features from the spectrograms. Feature extraction was followed by a CNN architecture for modelling high-level representations. Inspired by the attention-based low-rank second-order pooling proposed for the task of action classification from single RGB images [105], the authors applied this novel pooling method after the last convolutional layer. It was based on a combination of two attention maps—the class-specific top-down and the class-agnostic bottom-up attention. The authors reported a strong emotional representation ability of the proposed architecture. In order to preserve the information from a variable-length utterance as a whole, without the need for segmentation, Zhang et al. [69] designed a fully convolutional network (FCN) architecture, an adapted AlexNet with the fully connected layers removed. The proposed pretrained FCN architecture takes spectrograms of variable length as input, without the need to divide utterances or pad them to a required length [64,65]. Furthermore, the attention mechanism identifies important parts of the spectrograms and ignores non-speech parts. The FCN architecture outperformed the non-attentive CNN–LSTM method proposed in [64] and achieved results comparable to the attention-based convolutional RNN [106]. Thus, the proposed FCN architecture is able to capture the temporal context without the need for additional recurrent layers.

#### 4.3.3. Attentive Convolutional–Recurrent Deep Neural Network

In many cases, the extraction of large feature sets is replaced by direct learning of emotional speech characteristics by deep CNN architectures. Satt et al. [64] first segmented utterances into 3 s intervals. Then, the spectrograms were extracted and fed directly to the CNN–LSTM architecture. Harmonic modelling was applied to the spectrograms to eliminate non-speech parts of the emotional utterance. This step was particularly useful for the classification of emotion in noisy conditions. Lastly, an attention mechanism was added to the LSTM layer, which did not improve the achieved results. Zhao et al. [107] used two streams for feature extraction—a fully convolutional network (FCN) with temporal convolutions and Attention–BiLSTM layers—and concatenated the outputs for further DNN-based classification. The results indicated improvements over the attention–BiLSTM and Att–CNN [86] architectures. Sarma et al. [20] proposed a raw-speech-waveform-based end-to-end time delay neural network (TDNN) with an LSTM–attention architecture. An accuracy improvement on the IEMOCAP database, as well as a reduction of confusion among individual categories, was observed with the use of AM. Huang and Narayanan [55] proposed a CLDNN architecture with convolutional AM. The system leveraged task-specific spectral decorrelation by the CNN applied to log-mel features and temporal modelling by BiLSTM layers. The main modules were frozen during the training of the attention weights. Improved results were achieved with the use of AM under clean test-set conditions. Chen et al. [106] discussed the negative impact that personalised features (containing the speaker's characteristics, content, etc.) have on the ability of the SER system to generalise well. Assuming that the time derivatives of the coefficients (delta features) reduce these undesirable effects, a 3D log-mel spectrogram (consisting of log-mels with delta and delta–delta features) was proposed to compensate for the personalised features. The authors proposed an attention-based convolutional RNN system (ACRNN) for emotion recognition. When compared with the DNN–ELM-based system [100], 3D-ACRNN achieved a significant improvement in recognition accuracy on the IEMOCAP and EmoDB databases. 3D-ACRNN also outperformed 2D-ACRNN based on standalone log-mels. Li et al. [108] proposed an end-to-end self-attentional CNN–BiLSTM model. The attention mechanism, based on the same procedure as in [96], concentrates on salient parts of speech. Additionally, a gender recognition task was added to improve emotion recognition in a multitask learning manner. As the gender of the speaker affects emotional speech, these variations can be taken advantage of. State-of-the-art results were reported, with increased overall accuracy on the IEMOCAP database. Dangol et al. [109] proposed an emotion recognition system based on a 3D CNN–LSTM with a relation-aware AM that integrates pairwise relationships between input elements. The 3D spectrogram representations provided both spectral and temporal information from the speech samples. In order to increase the accuracy of emotion recognition, the computation of the attention weights was modified and a synthetic individual evaluation oversampling technique was used to update the feature maps.

In [110], the authors used prosodic characteristics with a fusion of three classifiers working at the syllable, utterance, and frame levels. They used a combination of methods such as the attention mechanism and feature selection based on RFE. System performance was improved by the identification of relevant features, the incorporation of attention, and score-level fusion. Zheng et al. [111] performed ensemble learning by integrating three models/experts, each focusing on a different feature extraction and classification tactic. Expert 1 is a two-channel CNN model that effectively learns time- and frequency-domain features. Expert 2 is a GRU with AM that learns short-term speech characteristics from spectrograms processed by principal component analysis (PCA), with a further fusion of mean-value features of the spectrograms. Expert 3 performs end-to-end multilevel emotion recognition using a BiLSTM with an attention mechanism, combining local features (a CRNN model learning from the speech spectrum) and global features (HSFs). Each expert accesses emotional speech in a different way, and their combination reduced the negative effects of data imbalance and resulted in better generalisation ability.

For better clarity, the AM-based SER systems are also summarised in Table 9.


**Table 9.** Comparison of SER systems with an attention mechanism. Meaning of acronyms: A—anger, E—excitement, Fr—frustration, H—happiness, N—neutral, S—sadness; A/V—activation/valence.



#### **5. Impact of Attention Mechanism on SER**

We performed a comparison of related works based on the most common settings to study the impact of AM on speech emotion recognition. We applied the same methodology as in Section 3.7. Since IEMOCAP is the most commonly used database in the published works, we chose it for further analysis.

Tables 10 and 11 show the comparison of SER systems on IEMOCAP for two kinds of classes of emotions: (1) anger, happiness, neutral and sad and (2) an extension of the 'excitement' class. As previously explained, it is not possible to make an exact comparison of the systems due to different test conditions, even if the same dataset was used. Thus, the reported accuracies listed in Tables 10 and 11 provide only coarse information in terms of their performance comparison.


**Table 10.** Comparison of system accuracies on IEMOCAP database for four emotions. Meaning of acronyms: AM—attention mechanism, A—anger, H—happiness, N—neutral, S—sadness.

**Table 11.** Comparison of system accuracies on IEMOCAP database for additional combination of excitement and happiness. Meaning of acronyms: AM—attention mechanism, A—anger, E—excitement, H—happiness, N—neutral, S—sadness.




The following conclusions, in particular, can be drawn from the works under study:

	- The implementation of appropriate AM can be linked to various factors such as the derivation of accurate context information from speech utterances. As in NLP, the better the contextual information obtained from the sequence, the better the performance of the system. The duration of divided segments significantly influences the accuracy of emotion recognition [20,63,86]. Therefore, appropriate input sequence lengths must be determined in order to effectively capture the emotional context.
	- Proper representation of emotional speech is also an important part of deriving contextual information. RNNs are suitable for modelling long sequences. Extraction of higher-level statistical functionals from multiple LLDs over speech segments combined with an LSTM [18] can be compared to 32 LLDs with a BiLSTM and local AM [18]. Transfer learning is a suitable solution, particularly for small emotional datasets [16]; however, more works should be considered before drawing conclusions. End-to-end systems that combined a CNN as a feature extractor and an RNN for modelling long-term contextual dependencies achieved high performance on IEMOCAP data and on EmoDB [62,106]. Various combinations of RNN and CNN are able to outperform the separate systems [62,107]. The two-channel CNN taking phoneme embeddings and spectrograms as input seems to further improve the accuracy [4]. Thus, it can be beneficial to allow the model to learn different kinds of features. Moreover, leveraging multitask learning for both the discrete and continuous recognition tasks improves the accuracy of SER systems [10,112]. A CRNN architecture together with multitask learning was part of the state-of-the-art solution on IEMOCAP proposed in [108]. Here, AM clearly improved system performance.

	- Recurrent networks provide a temporal representation of the whole utterance, and better results are obtained when it is aggregated by pooling for further recognition [18,20]. Several works compare different pooling strategies. Attention pooling is able to outperform global max pooling and global average pooling (GAP) [18,102,107]. The same was true for the attention pooling strategy applied to convolutional feature maps in [92] (attention-based pooling outperformed GAP). It can be concluded that learning the attention weights indeed allows the model to adapt itself to changes in emotional speech.

#### **6. Conclusions**

This study provides a survey of speech emotion recognition systems from recent years. The aim of SER research can be summarised as the search for innovative ways to appropriately extract the emotional context from speech. We can observe a trend towards deep convolutional architectures that learn from spectrogram representations of utterances. Together with recurrent networks, they are considered a strong base for SER systems. Throughout the years, more complex SER architectures have been developed with an emphasis on deriving emotionally salient local and global contexts. As can be inferred from our study, the attention mechanism can improve the performance of SER systems; however, its benefit is not always evident. Although AM modules have become a natural part of today's SER systems, AM is not an indispensable element for achieving high accuracies or even state-of-the-art results.

**Author Contributions:** Conceptualisation, E.L., R.J. and M.J.; methodology, E.L. and M.J.; writing original draft preparation, E.L. and M.J.; writing—review and editing, R.J. and M.C.; supervision, R.J. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations and Acronyms**



#### **References**


## *Review* **Survey of Automatic Spelling Correction**

## **Daniel Hládek \* , Ján Staš and Matúš Pleva**

Department of Electronics and Multimedia Communications, Faculty of Electrical Engineering and Informatics, Technical University of Košice, Němcovej 32, 040 01 Košice, Slovakia; jan.stas@tuke.sk (J.S.); matus.pleva@tuke.sk (M.P.)

**\*** Correspondence: daniel.hladek@tuke.sk; Tel.: +421-055-602-2298

Received: 13 August 2020; Accepted: 6 October 2020; Published: 13 October 2020

**Abstract:** Automatic spelling correction has been receiving sustained research attention. Although each article contains a brief introduction to the topic, there is a lack of work that would summarize the theoretical framework and provide an overview of the approaches developed so far. Our survey selected papers about spelling correction indexed in Scopus and Web of Science from 1991 to 2019. The selected approaches are divided into three groups. The first group uses a set of rules designed in advance. The second group uses an additional model of context. The third group of automatic spelling correction systems in the survey can adapt its model to the given problem. The summary tables show the application area, language, string metrics, and context model for each system. The survey describes selected approaches in a common theoretical framework based on Shannon's noisy channel. A separate section describes evaluation methods and benchmarks.

**Keywords:** spelling correction; natural language processing; diacritization; error model; context model

## **1. Introduction**

There are many possible ways to write the same thing. Written text sometimes looks different from what the reader or the author expects. Creating comprehensible and clear text is not a matter of course, especially for people with a different mother language. An unusually written word in a sentence makes a spelling error.

A spelling error makes the text harder to read and, worse, harder to process. Natural language processing requires normalized forms of a word because incorrect spelling or digitization of text decreases informational value. A spelling error, for example, in a database of medical records, diminishes efficiency of the diagnosis process, and incorrectly digitized archive documents can influence research or organizational processes.

A writer might not have enough time or ability to correct spelling errors. Automatic spelling correction (ASC) systems help to find the intended form of a word. They identify problematic words and propose a set of replacement candidates. The candidates are usually sorted according to their expected fitness with the spelling error and the surrounding context. The best correction can be selected interactively or automatically.

Interactive spelling correction systems underline incorrectly written words and suggest corrections. A user of the system selects the most suitable correction. This scenario is common in computer-assisted proofreading that helps with the identification and correction of spelling errors. Interactive spelling correction systems improve the productivity of professionals working with texts, increase convenience when using mobile devices, or correct Internet search queries. They support learning a language, text input in mobile devices, and web search engines. Also, interactive spelling correction systems are a component of text editors and office systems, optical character recognition (OCR) systems, and databases of scanned texts.

Most current search engines can detect misspelled search queries. The suggestion is shown interactively for each given string prefix. A recent work by Cai and de Rijke [1] reviewed approaches for correcting search queries.

A large quantity of text in databases brought new challenges. An automatic spelling correction system can be a part of a natural language processing system. Text in the database has to be automatically corrected because interactive correction would be too expensive. The spelling correction system automatically selects a correction candidate according to the previous and following texts. Noninteractive text normalization can improve the performance of information retrieval or semantic analysis of a text.

Figure 1 displays the process of correction-candidate generation and correction. The error and context models contribute to ranking of the candidate words. The result of automatic correction is a sequence of correction candidates with the best ranking.

**Figure 1.** Interactive processes of error production and correction.

The next section explains the method we used to select and sort the articles in this survey. Subsequently, in Section 3, we describe the characteristic spelling errors and divide them into groups according to how they originated. Section 4 defines the task of correcting spelling errors and describes the ASC system. This survey divides the ASC systems into three groups, each with its own section: a priori spelling correction (Section 5), spelling correction in context (Section 6), and spelling correction with a learning error model (Section 7). Section 8 introduces the methods of evaluation and benchmarking. The concluding Section 9 summarizes the survey and outlines trends in the research.

#### **2. Methodology**

The survey selected papers about spelling correction indexed in Scopus (http://scopus.com) and Web of Science (https://apps.webofknowledge.com) (WoS) from 1991 to 2019. It reviews the state-of-the-art and maps the history from the previous comprehensive survey provided by Kukich [2] in 1992.

First, we searched the indices with a search query "spelling correction" for the years 1991–2019. Scopus returned 1315 results, WoS returned 794 results. We excluded 149 errata, 779 corrections, 7 editorials, 45 reviews, and around 140 papers without any citations from both collections. We removed 250 duplicates, and we received 740 results (440 journal articles and 300 conference papers). We read the titles and abstracts of the remaining papers and removed 386 works that are not relevant to the topic of automatic spelling correction.

We examined the remaining 354 documents. Then, we removed articles without a clear scientific contribution to spelling correction, without proper evaluation, or that merely repeated already known results. We examined, sorted, and put the remaining 119 items into tables. We included additional references that explain essential theoretical concepts, as well as survey papers about particular topics, in the surrounding text.

First, we defined the spelling correction problem and established a common theoretical framework. We described the three main components of a spelling correction system.

This work divides the selected papers into three groups. The first group uses a set of expert rules to correct a spelling error. The second group adds a context model to rearrange the correction candidates with the context. The third group learns error patterns from a training corpus.

Each group of methods has its own section with a summarizing table. The main part of the survey is the summary tables. The tables briefly describe the application area, language, error model, and context model of the spelling correction systems. The tables are accompanied by a description of the selected approaches.

The rows in the tables are sorted chronologically and according to author. We selected chronological order because it shows the general scientific progress in spelling correction in the particular components of the spelling correction system. An additional reference in the table indicates if one approach enhances the previous one.

Special attention is paid to the evaluation methods. This section identifies the most frequent evaluation methods, benchmarks and corpora.

#### **3. Spelling Errors**

The design of an automatic spelling correction system requires knowledge of the process by which a spelling error is created [3]. There are several works about spelling errors. A book by Mitton [4] analyzed spelling-error types and described approaches to constructing an automatic spelling correction system. Yannakoudakis and Fawthrop [5] demonstrated that the clear majority of spelling errors follow specific rules based on phonological and sequential considerations. The same paper [5] introduced and described three categories of spelling errors (consonantal, vowel, and sequential) and presented the analysis results of 1377 spelling error forms.

Moreover, the authors in Kukich [2], Toutanova and Moore [6], and Pirinen and Lindén [7] divided spelling errors into two categories according to their cause:

- typographic errors, caused by slips of the hand or finger during typing, where the author knows the correct spelling of the word;
- cognitive errors, caused by the author not knowing or misremembering the correct spelling.
Examples of typographic and cognitive spelling errors are in Table 1.


**Table 1.** Examples of cognitive and typographic errors.

OCR errors are a particular type of typographic error caused by software. The process of document digitization and optical character recognition often omits or replaces some letters in a typical way. Spelling correction is part of the postprocessing of the digitized document because OCR systems are usually proprietary and difficult to adapt. Typical error patterns appear in OCR texts [8]. The standard set for the evaluation of an OCR spelling correction system is the TREC-5 Confusion Track [9].

Some writing systems (such as Arabic, Vietnamese, or Slovak) use different character variants that change the meaning of the word. The authors in [10] confirmed that the omission of diacritics is a common type of spelling error in Brazilian Portuguese. Texts in Modern Standard Arabic are typically written without diacritical markings [11]. This is a typographic error when the author omits additional character markings and expects the reader to guess the original meaning. The missing marks usually present short vowels or modification of the letter. They are placed either above or below the graphemes. The process of adding vowels and other diacritic marks to Arabic text can be called diacritization or vowelization [11]. Azmi and Almajed [12] focused on the problem of Arabic diacritization (adding missing diacritical markings to Arabic letters) and proposed an evaluation metric, and Asahiah et al. [13] published a survey of Arabic diacritization techniques.

#### **4. Automatic Spelling Correction**

An automatic spelling correction system detects a spelling error and proposes a set of candidates for correction (see Figure 2). Kukich [2] and Pirinen and Lindén [7] divide the whole process into three steps:

- spelling error detection;
- correction-candidate generation;
- ranking of the correction candidates.

**Figure 2.** Process of automatic spelling correction.

#### *4.1. Error Detection*

A word could either be new or just uncommon, could be a less-known proper name, or could belong to another language. However, a correctly spelled word could be semantically incorrect in a sentence. Kukich [2] divided spelling errors according to the dictionary of correct words:

- non-word errors, where the erroneous string does not match any word in the dictionary;
- real-word errors, where the erroneous string is a valid dictionary word that is incorrect in the given context.
Most spelling correction systems detect a non-word error by searching for it in a dictionary of correct words. This step requires a fast lookup method, such as a hash table [14] or a search tree [15,16].
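A minimal sketch of such dictionary-based non-word error detection using a hash-based set with constant average lookup time; the tiny word list is, of course, only an illustrative assumption.

```python
# Illustrative dictionary of correct words (a real system would load a lexicon).
correct_words = {"the", "spelling", "correction", "system", "word", "error"}


def non_word_errors(tokens):
    """Return tokens that do not appear in the dictionary of correct words."""
    return [t for t in tokens if t.lower() not in correct_words]


print(non_word_errors("The speling correction system".split()))  # ['speling']
```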

Many non-word error spelling correction systems use open-source a priori spelling correction systems, such as Aspell or Hunspell, for error detection, correction-candidate generation, and preliminary candidate ranking.

An automatic spelling correction system identifies real-word errors by semantic analysis of the surrounding context. More complex error-detection systems may be used to detect words that are correctly spelled but do not fit into the syntactic or semantic context. Pirinen and Lindén [7] called it real-word error detection in context.

Real-word errors are hard to detect because detection requires semantic analysis of the context. The authors in [17] used a language model to detect and correct a homophonic real-word error in the Bangla language. The language model identifies words that are improbable with the current context.

Boytsov [18] examined methods for indexing a dictionary with approximate matching. Deorowicz and Ciura [19] claim that a lexicon of all correct words could be too large. A lexicon that is too large can lead to many real-word errors or to the misdetection of obscure spellings.

The situation is different for languages where words are not separated by spaces (for example, Chinese). The authors in [20] transformed characters into a fixed-dimensional word-vector space and detected spelling errors by conditional random field classification.

#### *4.2. Candidate Generation*

ASC systems usually select correction candidates from a dictionary of correct words after detection of a spelling error. Although it is possible to select all correct words as correction candidates, it is reasonable to restrict the search space and to inspect only words that are similar to the identified spelling error.

Zhang and Zhang [21] stated that the task of similarity joining is to find all pairs of strings for which similarities are above a predetermined threshold, where the similarity of two strings is measured by a specific distance function. Kernighan et al. [22] proposed a simplification to restrict the candidate list to words that differ by just one edit operation of the Damerau–Levenshtein edit distance—substitution, insertion, deletion, or transposition of succeeding letters [23].
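
Under the assumption of a small in-memory dictionary, this restriction can be realized by enumerating every string one Damerau–Levenshtein operation away from the misspelling and keeping only the strings that occur in the dictionary; the word list below is a hypothetical example.

```python
import string

DICTIONARY = {"spelling", "spilling", "swelling", "smelling"}
ALPHABET = string.ascii_lowercase

def edits1(word):
    """All strings one Damerau-Levenshtein operation away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    transposes = {left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1}
    substitutions = {left + c + right[1:]
                     for left, right in splits if right for c in ALPHABET}
    insertions = {left + c + right for left, right in splits for c in ALPHABET}
    return deletes | transposes | substitutions | insertions

def candidates(word):
    """Correction candidates: dictionary words within one edit of the misspelling."""
    return sorted(edits1(word) & DICTIONARY)

print(candidates("speling"))   # ['spelling']
```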

The spelling dictionary generates correction candidates for the incorrect word by approximately searching for similar words. The authors in [24] used a character-level language model trained on a dictionary of correct words to generate a candidate list. Reffle [25] used a Levenshtein automaton to propose the correction candidates. Methods of approximate searching were outlined in a survey published by Yu et al. [26].

An index often speeds up an approximate search in the dictionary. The authors in [19,27] converted the lexicon into a finite-state automaton to speed up searching for a similar string.

#### *4.3. Ranking Correction Candidates*

A noisy-channel model proposed by Shannon [28] described the probabilistic process of producing an error. The noisy channel transfers and distorts words (Figure 3).

**Figure 3.** Word distorted by noisy channel.

The noisy-channel model expresses similarity between two strings as a probability of transforming one string into another. Probability *P*(*s*|*w*) that a string *s* is produced instead of word *w* describes how similar the two strings are. The similarity between two strings is defined by an expert or depends on a training corpus with error patterns.

A more formal definition of automatic spelling correction uses the maximum-likelihood principle. Brill and Moore [29] defined the automatic spelling correction of a possibly incorrect word *s* as finding the best correction candidate *w<sup>b</sup>* from a list of possible correction candidates *w<sup>i</sup>* ∈ *W* with the highest un-normalized probability:

$$w_b = \arg\max_{w_i \in C(s)} P(s|w_i)P(w_i) \,. \tag{1}$$

where *P*(*s*|*wi*) is the probability of producing string *s* instead of word *w<sup>i</sup>* and *P*(*wi*) is the probability of producing word *w<sup>i</sup>* . *C*(*s*) is a function that returns valid words from dictionary *W* that serve as correction candidates for erroneous string *s*.
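
A toy rendering of Equation (1) is sketched below; the error-model and word probabilities are hand-picked numbers for illustration only, not estimates from any real corpus.

```python
# Hypothetical probabilities for illustrating Equation (1).
P_word = {"smelly": 2e-6, "smiley": 8e-6}                # P(w_i): word probability
P_error = {("smilly", "smelly"): 0.02,                   # P(s|w_i): error model
           ("smilly", "smiley"): 0.01}

def rank_candidates(s, candidates):
    """Order candidates w_i by the unnormalized score P(s|w_i) * P(w_i)."""
    scored = [(P_error[(s, w)] * P_word[w], w) for w in candidates]
    return sorted(scored, reverse=True)

print(rank_candidates("smilly", ["smelly", "smiley"]))
# "smiley" (score about 8e-8) is ranked above "smelly" (score about 4e-8)
```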

#### *4.4. Components of Automatic Spelling Correction Systems*

Equation (1) by Brill and Moore [29] identified three components of an automatic spelling correction system. The components are depicted in Figure 4:


**Figure 4.** Components of an automatic spelling correction system.

#### **5. Spelling Correction with a Priori Error Model**

A combination of error and context models is often not necessary. In some scenarios, a set of predefined transcription rules can correct a spelling error. An expert identifies characteristic string transcriptions. These rules are given in advance (a priori) by someone who understands the problem.

Approaches in this group detect non-word errors and propose a list of correction candidates that are similar to the original word (presented in Table 2). The a priori error model works as a guide in the search for the best-matching original word; best-matching words are proposed first, and it is easy to select the correction.

A schematic diagram for an ASC system with a priori error model is in Figure 5. The input of the a priori error model is an erroneous word. The spelling system applies one or several transcription operations to the spelling error to create a correction candidate. The rank of the correction candidate depends on the weights of the transcription rules. The output of the a priori error model is a sorted list with correction candidates.


**Table 2.** Summary of a priori spelling correction systems.

Note: DLD, Damerau–Levenshtein distance; FSA, finite-state automaton; LCS, longest common subsequence; LD, Levenshtein distance; OCR, optical character recognition.

**Figure 5.** A priori spelling correction.

The most commonly used open-source spelling systems are Aspell (http://aspell.net) and Hunspell (http://hunspell.github.io/). Hunspell is a variant of Aspell with a less restrictive license, used in LibreOffice word processor, Firefox web browser, and other programs. They are available as a standalone text filter or as a compiled component in other spelling systems or programs. The basic component of the Aspell system is a dictionary of correct words, available for many languages. The dictionary file contains valid morphological units for the given language (prefixes, suffixes, or stems). The dictionary is compiled into a state machine to speed up searching for correction candidate words.

Aspell searches for sounds-like equivalents (computed for English words by using the Metaphone algorithm) up to a given edit distance (the Damerau–Levenshtein distance) [50]. The detailed operation of the spelling correction of Aspell is described in the manual (http://aspell.net/man-html/Aspell-Suggestion-Strategy.html#Aspell-Suggestion-Strategy).

#### *5.1. Edit Distance*

Edit distance expresses the difference between two strings as a nonnegative real number by counting the edit operations that are required to transform one string into another. The two most commonly used edit distances are the Levenshtein edit distance [51] and the Damerau–Levenshtein distance [52]. The Levenshtein distance identifies the following atomic edit operations:

• Insertion, which adds a symbol;
• Deletion, which removes a symbol;
• Substitution, which replaces one symbol with another.

In addition, the Damerau–Levenshtein distance adds the operation of

• Transposition, which exchanges two subsequent symbols.

The significant difference between the Levenshtein distance (LD) and the Damerau–Levenshtein distance (DLD) is that the Levenshtein distance does not consider letter transposition. The edit operation set proposed by Levenshtein [51] did not consider transposition as an edit operation because the transposition of two subsequent letters can be substituted by a deletion and an insertion or by two substitutions. The Levenshtein distance allows for representation of the weights of edit operations by a single letter-confusion matrix, which is not possible for the DLD.
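
Both distances can be computed with standard dynamic programming; the sketch below implements the Levenshtein distance and, optionally, the restricted Damerau–Levenshtein variant in which adjacent transpositions count as a single edit.

```python
def edit_distance(a, b, allow_transposition=False):
    """Levenshtein distance; with allow_transposition=True, the restricted
    Damerau-Levenshtein distance (adjacent transpositions cost one edit)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (allow_transposition and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

print(edit_distance("acress", "caress"))                            # 2
print(edit_distance("acress", "caress", allow_transposition=True))  # 1
```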

Another variation of edit distance is longest common subsequence (LCS) [53]. It considers only insertion and deletion edit operations. The authors in [54] proposed an algorithm for searching for the longest common sub-string with the given number of permitted mismatches. More information about longest-common-subsequence algorithms can be found in a survey [55].

#### *5.2. Phonetic Algorithms*

Many languages have difficult rules for pronunciation and writing, and it is very easy to make a spelling mistake if rules for writing a certain word are not familiar to the writer. A word is often replaced with a similarly sounding equivalent with a different spelling.

An edit operation in a phonetic algorithm describes how words are pronounced. The algorithm recursively replaces phonetically important parts of a string with a special representation. If the phonetic representations of two strings are equal, the strings are considered equal. In other words, a phonetic algorithm is a binary relation over two strings that tells whether they are pronounced in a similar way:

$$D(s_i, s_j) \to \{0, 1\} \,. \tag{2}$$

The phonetic algorithm is able to identify a group of phonetically similar words to some given string (e.g., to some unknown proper noun). It helps to identify names that are pronounced in a similar way or to discover the original spelling of an incorrectly spelled word. Two strings are phonetically similar only if their phonetic forms are equal.

Phonetic algorithms for spelling corrections and record linkage are different from phonetic algorithms used for speech recognition because they return just an approximation of the true phonetic representation.

One of the first phonetic algorithms is Soundex (U.S. Patent US1435663). Its original purpose was the identification of similar names for the U.S. Census. The algorithm transforms a surname or name so that names with a similar pronunciation have the same representation. It allows for the identification of similar or possibly the same names. The most phonetically important letters are consonants. Most vowels are dropped (except at the beginning), and similar consonants are transformed into the same representation. Other phonetic algorithms are Shapex [56] and Metaphone [57]. Evaluation of several phonetic-similarity algorithms on the task of cognate identification was done by Kondrak and Sherif [58].
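
A simplified Soundex-style encoding is sketched below; real implementations handle further edge cases (for example, identical codes separated by 'h' or 'w'), so this is an illustration of the idea rather than the exact patented procedure.

```python
# Consonant groups and their Soundex digits.
_CODES = {c: d for letters, d in
          [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
           ("l", "4"), ("mn", "5"), ("r", "6")]
          for c in letters}

def soundex(word):
    """Simplified Soundex: first letter plus up to three consonant-group digits."""
    word = word.lower()
    codes = [_CODES.get(c, "") for c in word]
    kept = []
    for code in codes:
        if code and (not kept or code != kept[-1]):
            kept.append(code)       # keep a digit, collapsing adjacent repeats
        elif not code:
            kept.append("")         # vowels and ignored letters break repeats
    digits = "".join(c for c in kept[1:] if c)   # drop the first letter's own code
    return (word[0].upper() + digits + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))   # R163 R163
```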

#### **6. Spelling Correction in Context**

An a priori model is often not sufficient to find the best correction because it takes only the incorrect word into account. The spelling system would perform better if it could distinguish whether the proposed word fits its context. It is hard to decide which correction is more useful if we do not know the surrounding sentence. For example, both "smelly" and "smiley" are plausible corrections of the string "smilly", and only the context tells which one is more suitable.

Approaches in this group are summarized in Tables 3 and 4. The components and their functions are displayed in Figure 4. The authors in [59] described multiple methods of correction with context. This group of automatic spelling correction systems uses the probabilistic framework of Brill and Moore [29] defined in Equation (1). The error models in this group usually use a priori rules (edit distance and phonetic algorithms). The context model is usually an *n*-gram language model. Some approaches noted below use a combination of multiple statistical models.



**Table 3.** Spelling correction systems with learning of context model—part I.

Note: BLEU, bilingual evaluation understudy; BiLSTM, bidirectional long short-term memory; CFG, context-free grammar; CRF, conditional random fields; IR, information retrieval; k-NN, k-nearest neighbors; LM, language model; LR, linear regression; ME, maximum entropy; POS, part-of-speech tagging; PMI, pointwise mutual information; RF, random forests; SVM, support vector machine; WCS, word-confusion set; WFST, weighted finite-state transducer.




**Table 4.** Spelling correction systems with learning of context model—part II.

Note: ANN, artificial neural network; HMM, hidden Markov model; LM, language model; LSA, latent semantic analysis; ME, maximum entropy; OCR, optical character recognition; POS, part-of-speech; SMT, statistical machine translation; WCS, word-confusion set; WSD, word-sense disambiguation.

The edit distance *D*(*s*, *w*) of the incorrect word *s* and a correction candidate *w* in the a priori error model is a nonnegative real number. To fit the probabilistic framework of Equation (1), it can be converted into a probability-like score by taking a negative logarithm [100]:

$$P(s|w) = -\log D(s, w) \,. \tag{3}$$

Methods of spelling correction in context are similar to morphological analysis, and it is possible to use similar methods of disambiguation from part-of-speech taggers in a context model of automatic spelling correction systems.

#### *6.1. Language Model*

The most common form of a language model is the *n*-gram language model, calculated from the frequency of word sequences of size *n*. It gives the probability *P*(*w<sup>i</sup>* |*wi*−1,*i*−(*n*−1) ) of a candidate word given its history of (*n* − 1) words. If the given *n*-gram sequence is not present in the training corpus, the probability is calculated by a back-off that considers shorter contexts. The *n*-gram language model depends only on previous words, but other classifiers can make use of arbitrary features in any part of the context. The language model is usually trained on a corpus that represents language with correct spelling.
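
A minimal bigram model with a crude back-off to add-one unigram estimates is sketched below on an invented toy corpus; a real system would use a large corpus and proper smoothing.

```python
from collections import Counter

# Toy corpus standing in for a large corpus with correct spelling.
corpus = "the smiley face , the smelly sock , the smiley face".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab_size = len(unigrams)

def p_bigram(prev, word):
    """P(word | prev); backs off to an add-one unigram estimate for unseen bigrams."""
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    return (unigrams[word] + 1) / (len(corpus) + vocab_size)   # back-off

# Scoring two correction candidates for "smilly" after the word "the":
for candidate in ("smelly", "smiley"):
    print(candidate, p_bigram("the", candidate))   # "smiley" gets the higher score
```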

Neural language modeling brought new possibilities, as it can predict a word given arbitrary surrounding context. A neural network maps a word into a fixed-size embedding vector. Embedding vectors form a semantic space of words. Words that are close in the embedding space usually occur in the same context and are thus semantically close. This feature can be used in a spelling correction system to propose and rank a list of correction candidates [63,101,102].

#### *6.2. Combination of Multiple Context Models*

Context modeling often benefits from a combination of multiple statistical models. A spelling system proposed by Melero et al. [73] used a linear combination of language models, each with a certain weight. Each language model can focus on a different feature: lowercase words, uppercase words, part-of-speech tags, and lemmas.

The authors in [67] proposed a context model with multiple Bayesian classifiers. The first component of the context model is called "trigrams". This part uses parts of speech as a feature for classification and assigns the highest probability to a candidate word whose context contains the most probable part-of-speech tags. The second part of the context model is a naïve Bayes classifier that takes into account the surrounding words and collocations (the preceding word and the current tag).

Another form of a statistical classifier for the context modeling with multiple models is the Winnow algorithm [96,103]. This approach uses several Winnow classifiers trained with different parameters. The final rank is their weighted sum.

The model uses the same features (occurrence of a word in context and collocation of tags and surrounding word) as those in the previous approach [67]. The paper by Golding and Roth [96] was followed by Carlson et al. [97], which used a large-scale training corpus. Also, Li and Wang [95] proposed a similar system for Chinese spelling correction.

An approach published by Banko and Brill [90] proposed a voting scheme that utilized four classifiers. This approach focused on learning by using a large amount of data—over 100 million words. It uses a Winnow classifier, naïve Bayes classifier, perceptron, and a simple memory-based learner. Each classifier has a complementarity score defined by Brill et al. [104] and is separately trained. The complementarity score indicates how accurate the classifier is.

#### *6.3. Weighted Finite-State Transducers*

If components of an ASC system (dictionary, error model, or context model) can be converted into a state machine, it is possible to create a single state machine by composing individual components. The idea of finite-state spelling was formalized by Pirinen and Lindén [7]. They compared finite-state automatic spelling correction systems with other conventional systems (Aspell and Hunspell) for English, Finnish, and Icelandic on the corpus of Wikipedia edits. They showed that this approach had comparable performance to that of others.

A weighted finite-state transducer (WFST) is a generalization of a finite-state automaton, where each transcription rule has an input string, an output string, and a weight. One rule of the WFST system represents a single piece of knowledge about spelling correction—an edit operation of the error model or a probability of succeeding words in the context model.

Multiple WFSTs (dictionary, error model, and context model) can be composed into a single WFST by joining their state spaces and by removing useless states and transcription rules. After these three components are composed, the resulting transducer can be searched for the best path, which is the sequence of best-matching letters.

For example, the approach by Perez-Cortes et al. [105] took a set of hypotheses from the OCR. The output from OCR is an identity transducer (an automaton that transcribes the set of strings to the same set of strings) with weights on each transition that represent the probability of a character in the hypothesis. The character-level *n*-gram model represents a list of valid strings from the lexicon. The third component, the error model, is a letter-confusion matrix calculated from the training corpus. The authors in [106,107] used handcrafted Arabic morphological rules to construct a WFST for automatic spelling correction.

A significant portion of text errors involves running together two or more words (e.g., ofthe) or splitting a single word (sp ent, th ebook) [2]. Weighted finite-state transducer (WFST) systems can identify word boundaries if the spacing is incorrect (http://openfst.org/twiki/bin/view/FST/FstExamples). However, inserting or deleting a space is still considered problematic because spaces have the annoying characteristic of not being handled by edit-distance operations [106].

#### **7. Spelling Correction with Learning Error Model**

The previous sections presented spelling correction systems with a fixed set of rules, prepared in advance by an expert. This section introduces approaches where the error model learns from a training corpus. The optimization algorithm iteratively updates the parameters of the error model (e.g., weights of the edit operations) to improve the quality of the ASC system.

The diagram in Figure 6 displays the structure of a learning error model. The algorithm for learning the error model uses the expectation-maximization procedure. A complete automatic spelling correction system contains a context model that is usually learned separately. The authors in [108] proposed to utilize the context model in the learning of the error model; context probability is taken into account during the expectation step. Some approaches do not consider context at all. A comparison of approaches with the learning error model is shown in Tables 5 and 6.

**Figure 6.** Spelling correction with a learning error model.




**Table 5.** Spelling correction systems with a learning error model—part I.

Note: ANN, artificial neural network; BERT, bidirectional encoder representations from transformers; BiLSTM, bidirectional long short-term memory; CRF, conditional random fields; HMM, hidden Markov model; LCM, letter-confusion matrix; LSTM, long short-term memory; NER, named entity recognition; OCR, optical character recognition; POS, part-of-speech; seq2seq, sequence-to-sequence; WFST, weighted finite-state transducer.




**Table 6.** Spelling correction systems with a learning error model—part II.

Note: ANN, artificial neural network; FSA, finite-state automaton; LCM, letter-confusion matrix; LM, language model; LSTM, long short-term memory; ME, maximum entropy; OCR, optical character recognition; seq2seq, sequence-to-sequence; SMT, statistical machine translation; SVM, support vector machine; WFST, weighted finite-state transducer.

ASC systems with a learning error model often complement optical character recognition (OCR) systems. The digitized document contains spelling errors characteristic of the quality of the paper, the scanner, and the OCR algorithm. If the training database (original and corrected documents) is large enough, the spelling system adapts to the data. A training sample from the TREC-5 Confusion Track corpus [9] is displayed in Figure 7.

> Correct: bulletin
> Incorrect: bM.etin ,bWetin bMetinh bUletin
> Count: 2 2 4 23

**Figure 7.** Example misspellings of the word "bulletin" from optical character recognition (OCR).

#### *7.1. Word-Confusion Set*

The simplest method of estimating the learning error model is a word-confusion set that counts the cooccurrences of correct and incorrect words in the training corpus. It considers a pair of correct and incorrect words as one big edit operation. The word-confusion set remembers possible corrections for each frequently misspelled form (see Figure 7). This method was used by Gong et al. [145] to improve the precision of e-mail spam detection.
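
The counting itself is straightforward; the sketch below builds a word-confusion set from a hypothetical list of (incorrect, correct) training pairs, such as aligned OCR output and ground truth.

```python
from collections import Counter, defaultdict

# Hypothetical (misspelled form, correct form) training pairs.
pairs = [("bUletin", "bulletin"), ("bUletin", "bulletin"),
         ("bMetinh", "bulletin"), ("teh", "the")]

confusion_set = defaultdict(Counter)
for wrong, right in pairs:
    confusion_set[wrong][right] += 1

def correct(word):
    """Return the most frequently observed correction for `word`, or `word` itself."""
    if word in confusion_set:
        return confusion_set[word].most_common(1)[0][0]
    return word

print(correct("bUletin"))   # bulletin
print(correct("unseen"))    # unseen (no correction observed in the training pairs)
```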

Its advantages are that it can be easily created and manually checked. The disadvantage of this simple approach is that it is not possible to obtain a corpus that has every possible misspelling for every possible word. The second problem of the word-confusion set is that error probabilities are far from "real" probabilities because training data are always sparse. Shannon's theorem states that it is not possible to be 100% accurate in spelling correction.

#### *7.2. Learning String Metrics*

The sparseness problems of the word-confusion set are solved by observing smaller subword units (such as letters or morphemes). For example, Makazhanov et al. [130] utilized information about morphemes in the Kazakh language to improve automatic spelling correction. The smallest possible subword units are letters. Estimating parameters of edit operations partially mitigates the sparseness problem because smaller sequences appear in the training corpus more frequently. The authors in [29] presented an error model that learned general edit operations. The antecedent and consequent parts of the edit operations can be arbitrary strings called partitions. The partition of the strings defines the edit operations.

Generalized edit distance is another form of a learning error model. The antecedent and consequent part of an edit operation is a single symbol that can be a letter or a special deletion mark. Edit distance is generalized by considering the arbitrary weight of an operation. Weights of each possible edit operation of the Levenshtein distance (LD) can be stored in a single letter-confusion matrix (LCM). **∆** weights for generalized edit distance are stored in four matrices [128]. The generalized edit distance is not always a metric in the strict mathematical sense because the distance in the opposite direction can be different. More theory about learning string metrics can be found in a book [146] or in a survey ([147], Section 5.1).

Weights **∆** in an LCM express the weights of error types (Figure 8). If the LCM is a matrix of ones with zeros on the main diagonal, it expresses the Levenshtein edit distance: each edit operation has a value of 1, and the sum of edit operations is the Levenshtein edit distance. The edit distance with weights is calculated by a dynamic-programming algorithm [53,148].
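
A sketch of such a weighted distance is given below: substitution costs are read from a letter-confusion matrix (here a plain dictionary with a made-up entry), while insertions and deletions keep unit cost.

```python
def weighted_edit_distance(a, b, sub_cost):
    """Levenshtein-style distance with substitution weights taken from a
    letter-confusion matrix `sub_cost[(x, y)]` (unseen pairs cost 1)."""
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = float(i)
    for j in range(n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else sub_cost.get((a[i - 1], b[j - 1]), 1.0)
            d[i][j] = min(d[i - 1][j] + 1.0,       # deletion
                          d[i][j - 1] + 1.0,       # insertion
                          d[i - 1][j - 1] + sub)   # weighted substitution
    return d[m][n]

# Made-up matrix entry: confusing 'i' with 'e' is a cheap (frequent) error.
lcm = {("i", "e"): 0.2, ("e", "i"): 0.2}
print(weighted_edit_distance("spiling", "speling", lcm))   # 0.2 (one cheap substitution)
print(weighted_edit_distance("spiling", "spiting", lcm))   # 1.0 (one ordinary substitution)
```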

The LCM for a Levenshtein-like edit distance can be estimated with an expectation-maximization algorithm [100]. The learning algorithm calculates weights of operations for each training sample that are summed and normalized to form an updated letter confusion matrix.

If the training corpus is sparse (which it almost always is), the learning process brings the problem of overfitting. Hládek et al. [8] proposed a method for smoothing parameters in a letter-confusion matrix. Bilenko and Mooney [149] extended string-distance learning with an affine gap penalty (allowing for random sequences of characters to be skipped). Also, Kim and Park [150] presented an algorithm for learning a letter-confusion matrix and for calculating generalized edit distance. This algorithm was further extended by Hyyrö et al. [151].


**Figure 8.** Example of a letter-confusion matrix for the alphabet of symbols a, b, c, and d for Levenshtein distance (left) and arbitrary letter confusion matrix (right): the matrix gives a weight of transcription of the letter in the vertical axis to the letter in the horizontal axis.

#### *7.3. Spelling Correction as Machine Translation of Letters*

Spelling correction can be formulated as a problem of searching for the best transcription of an arbitrary sequence of symbols into another sequence. This type of problem can be solved with methods typical for machine translation. General string-to-string translation models are not restricted to the spelling error correction task but can also be applied to many problems, such as grapheme-to-phoneme conversion, transliteration, or lemmatization [122]. The machine-translation representation of the ASC overcomes the problem of joined and split words but requires a large corpus to properly learn the error model.

Zhou et al. [117] defined the machine-translation approach to spelling correction by the following equation:

$$\mathbf{s}' = \arg\max\_{\mathbf{s}} P(\mathbf{s}|\mathbf{S}),\tag{4}$$

where *S* is the given incorrect sequence, *s* is the possibly correct sequence, and *s*′ is the best correction.

Characters are "words" of "correct" and "incorrect" language. Words in the training database are converted into sequences of lowercase characters, and white spaces are converted into special characters. The machine-translation system is trained on a parallel corpus of examples of spelling errors and corrections. Sariev et al. [132] and Koehn et al. [152] proposed an ASC system that utilizes a statistical machine-translation system called Moses (http://www.statmt.org/moses/).

The authors in [125] cast spelling correction as machine translation of character bigrams. The spelling system is trained on logs of search queries, assuming that a user's corrected query follows the misspelled one. This heuristic creates a training database. To improve precision, character bigrams are used instead of single characters.

Statistical machine-translation models based on string alignment, translation phrases, and *n*-gram language models are being replaced by neural machine-translation systems. The basic neural-translation architecture, based on a neural encoder and decoder, was proposed by Sutskever et al. [110]. The translation model learns *P*(*y*1 . . . *yT*|*x*1 . . . *xT*) by encoding the given sequence into a fixed-size vector [117]:

$$s = f_e(x_1, \dots, x_T) = h_T \,. \tag{5}$$

The sequence-embedding vector is decoded into another sequence by a neural decoder [117]:

$$y_t = f_d(s, y_1, \dots, y_{t-1}) \,. \tag{6}$$

The decoder takes the encoded vector and generates the output sequence. Zhou et al. [117] showed that, by using *k*-best decoding in the string-to-string translation models, they achieved much better results on the spelling correction task than the three baselines, namely edit distance, weighted edit distance, and the Brill and Moore model [104].
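
A structural sketch of such a character-level encoder–decoder, in the spirit of Equations (5) and (6), is shown below; it assumes PyTorch is available, and the vocabulary size, tensor shapes, and the omitted training loop are placeholders for illustration.

```python
import torch
import torch.nn as nn

class CharSeq2Seq(nn.Module):
    """Toy character-level encoder-decoder: the encoder compresses the noisy
    sequence into its final hidden state h_T (Equation (5)); the decoder then
    emits the corrected sequence one character at a time (Equation (6))."""
    def __init__(self, vocab_size, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, noisy, target):
        _, h_T = self.encoder(self.embed(noisy))                 # encode x_1..x_T
        dec, _ = self.decoder(self.embed(target[:, :-1]), h_T)   # teacher forcing
        return self.out(dec)                                     # logits for the next characters

# Dummy usage with a hypothetical 30-symbol character vocabulary.
model = CharSeq2Seq(vocab_size=30)
noisy = torch.randint(0, 30, (8, 12))     # batch of 8 noisy character sequences
target = torch.randint(0, 30, (8, 12))    # corresponding corrected sequences
print(model(noisy, target).shape)         # torch.Size([8, 11, 30])
```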

#### **8. Evaluation Methods**

The development of automatic spelling correction systems requires a way to objectively assess the results. It is clear though that it is impossible to propose a "general" spelling benchmark because the problem is language- and application-dependent.

Three possible groups of methods exist for evaluating automatic spelling correction:

• Classification metrics, such as accuracy, that consider the single best correction candidate;
• Machine-translation metrics, such as the BLEU score, that compare whole corrected sentences;
• Information-retrieval metrics that evaluate the whole sorted list of correction candidates.

The most common evaluation metric is classification accuracy. The disadvantage of this method is that only the best candidate from the suggestion list is considered; the order and number of the other proposed correction candidates are ignored. Therefore, it is not suitable for evaluating an interactive system.

Automatic spelling correction is similar to machine translation. A source text containing errors is translated to its most probable correct form. The approach takes the whole resulting sentence, and it is also convenient for evaluating the correction of a poor writing style and non-word errors. It was used by Sariev et al. [132], Gerdjikov et al. [153] and Mitankin et al. [131].

Machine-translation systems are evaluated using the BLEU score, which was first proposed by Papineni et al. [154]:

"The task of a BLEU implementation is to compare *n*-grams of the candidate with the *n*-grams of the reference translation and to count the number of matches. These matches are position-independent. The more matches, the better the candidate translation."

The process of automatic spelling correction is also similar to information retrieval. An incorrect word is a query, and the sorted list of the correction candidates is the response. This approach evaluates the whole list of suggestions and favors small lists of good (highly ranked) candidates for correction. The two following evaluation methodologies are used to evaluate spelling:


Machine-translation and information-retrieval metrics are well suited for evaluating interactive systems because they consider the whole candidate list. A smaller candidate list is easier to comprehend, and the best correction can be selected faster from fewer words. On the other hand, the candidate list must be large enough to contain the correct answer.
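
Both views of the candidate list can be computed with a few lines of code; the sketch below uses an invented test set and reports the accuracy of the first suggestion together with the recall of the gold correction within the top *k* suggestions.

```python
# Hypothetical evaluation items: (misspelling, gold correction, ranked suggestions).
test_set = [
    ("smilly", "smiley", ["smelly", "smiley", "smile"]),
    ("speling", "spelling", ["spelling", "spilling"]),
    ("teh", "the", ["ten", "tea", "the"]),
]

def accuracy_at_1(items):
    """Fraction of items whose first suggestion equals the gold correction."""
    return sum(sugg[0] == gold for _, gold, sugg in items) / len(items)

def recall_at_k(items, k):
    """Fraction of items whose gold correction appears among the top k suggestions."""
    return sum(gold in sugg[:k] for _, gold, sugg in items) / len(items)

print(accuracy_at_1(test_set))    # 0.33...: one of the three items is corrected first try
print(recall_at_k(test_set, 3))   # 1.0: every gold correction appears in the top three
```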

#### *8.1. Evaluation Corpora and Benchmarks*

Several authors proposed corpora for specific tasks and languages, but no approach was broadly accepted. The authors in [12] proposed the Koran as a benchmark for the evaluation of Arabic diacritizations. Reynaert [156] presented an XML format and OCR-processed historical document set in Dutch for the evaluation of automatic spelling correction systems.

The most widely used evaluation set for automatic spelling correction of OCR output is the TREC-5 Confusion Track [9]. It was created by scanning a set of paper documents. The database consists of original and recognized documents, so it is possible to identify correct–incorrect pairs for system training and evaluation. The other common evaluation set is the Microsoft Speller Challenge (https://www.microsoft.com/en-us/download/details.aspx?id=52351).

Also, Hagen et al. [34] proposed a corpus of corrected search queries in English (https://www.uni-weimar.de/en/media/chairs/computer-science-and-media/webis/corpora) and provided an evaluation metric. They re-implemented the best-performing approach [157] from the Microsoft Speller Challenge (https://github.com/webis-de/SIGIR-17).

Tseng et al. [158] presented a complete publicly available spelling benchmark for the Chinese language, preceded by Wu et al. [159]. Similarly, the first competition on automatic spelling correction for Russian was published by Sorokin et al. [160].

#### *8.2. Performance Comparison*

Table 7 gives a general overview of the performance of automatic spelling correction systems. It lists approaches with well-defined evaluation experiments performed by the authors. The table displays the best value reached in the evaluation and summarizes the evaluation corpora. Only a few corpora were available that are suitable for evaluating an ASC system (such as TREC-5).

It is virtually impossible to compare the performance of state-of-the-art spelling correction systems. Each author solves a different task and uses their own methodology, a custom testing set, and evaluation corpora in different languages. The displayed values cannot be used for mutual comparison but are instead a guide for selecting an evaluation method. A solution would be a spelling correction toolkit that implements state-of-the-art methods for error modeling and context classification. A standard set of tools would allow for comparison of individual components, such as error models.



**Table 7.** Reported evaluation results.

#### **9. Conclusions**

The chronological sorting and grouping of the summary tables with references in this work reveal several findings. The research since the last comprehensive survey [2] has brought new methods for spelling correction. On the other hand, progress in all areas of spelling correction was slow until the introduction of deep neural networks.

New a priori spelling correction systems are often presented for low-resource languages. Authors propose rules for the a priori error model that extend an existing phonetic algorithm or adjust the edit distance to the specifics of the given language.

Spelling correction systems in context are mostly proposed for languages with sufficient language resources for language modeling. Most of them use *n*-gram language models, but some approaches use neural networks or other classifiers. Scientific contributions for spelling in context explore various context features with statistical classifiers.

Spelling correction with a learning error model shows the biggest progress. The attention of researchers has moved from statistical estimation of letter-confusion matrices to the utilization of statistical machine translation.

This trend is visible mainly in Tables 5 and 6, where we can observe the growing popularity of the use of encoder–decoder architecture and deep neural networks since 2018. New approaches move from word-level correction to arbitrary character sequence correction because new methods based on deep neural networks bring better possibilities. Methods based on machine translation and deep learning solve the weakest points of the ASC systems, such as language-specific rules, real-word errors, and spelling errors with spaces. The neural networks can be trained on already available large textual corpora.

The definition of spelling correction stated in Equation (1) is becoming outdated because of the new methods. Classical statistical context models (n-gram, log-linear regression, and naïve Bayes classifiers) based on the presence of word-level features in the context are no longer essential. Instead, feature extraction is left to the hidden layers of the deep neural network. The correction of spelling errors becomes the task of transcribing a sequence of characters into another sequence of characters using a neural network, as stated in Equation (4). Research in the field of spelling error correction thus converges with solutions to other tasks of speech and language processing, such as machine translation or fluent speech recognition.

On the other hand, the scientific progress of learning error models is restricted by the lack of training corpora and evaluation benchmarks. Our examination of the literature shows that there is no consensus on how to evaluate and compare spelling correction systems. Instead, almost every paper uses its own evaluation set and evaluation methodology. In our opinion, the reason is that most of the spelling approaches strongly depend on the specifics of the language and are hard to adapt to another language or a different application. Recent algorithms based on deep neural networks are not language dependent, but their weak point is that they require a large training set, often with expensive manual annotation. These open issues call for new research in automatic spelling correction.

**Author Contributions:** Conceptualization, D.H.; methodology, D.H.; formal analysis, J.S.; investigation, D.H.; resources, D.H.; writing—original draft preparation, D.H.; writing—review and editing, M.P. and J.S.; supervision, M.P.; project administration, J.S.; funding acquisition, M.P. All authors have read and agreed to the published version of the manuscript.

**Funding:** Research in this paper was supported by the Slovak Research and Development Agency (Agentúra na podporu výskumu a vývoja) under projects APVV-15-0517 and APVV-15-0731; the Scientific Grant Agency (Vedecká grantová agentúra MŠVVaŠ SR a SAV), project number VEGA 1/0753/20; and the Cultural and Educational Grant Agency (Kultúrna a edukaˇcná grantová agentúra MŠVVaŠ SR), project number KEGA 009TUKE-4-2019, both funded by the Ministry of Education, Science, Research, and Sport of the Slovak Republic.

**Acknowledgments:** The authors want to thank Jozef Juhár for the team leadership, and personal and financial support.

**Conflicts of Interest:** The authors declare no conflict of interest.


## **Abbreviations**

The following abbreviations were used in this manuscript:


## **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Review* **A Systematic Review of the Use of Art in Virtual Reality**

**Audrey Aldridge \* and Cindy L. Bethel**

Department of Computer Science and Engineering, Mississippi State University, Starkville, MS 39762, USA; cbethel@cse.msstate.edu

**\*** Correspondence: ala214@msstate.edu

**Abstract:** Brain injuries can create life-altering challenges and have the potential to leave people with permanent disabilities. Art therapy is a popular method used for treating many of the disabilities that can accompany a brain injury. In a systematic review, an assessment of how art is being used in virtual reality (VR) was conducted, and the feasibility of brain injury patients to participate in virtual art therapy was investigated. Studies included in this review highlight the importance of artistic subject matter, sensory stimulation, and measurable performance outcomes for assessing the effect art therapy has on motor impairment in VR. Although there are limitations to using art therapy in a virtual environment, studies show that it can feasibly be used in virtual reality for neurorehabilitation purposes.

**Keywords:** virtual reality; art therapy; rehabilitation; neurorehabilitation; neuroplasticity; brain injury

#### **1. Introduction**

Art has been used as part of the healing process for a variety of therapeutic practices, including: mental health treatment, social problems, language and communication difficulties, medical problems, physical disabilities, and learning difficulties [1]. Art therapy involves interacting with a form of art to help patients through recovery. It works by using personal artwork from therapy, third-party artwork, or the creative process to help people explore their emotions or improve social skills. The creative process refers to the stages involved in transforming an idea into its final form. In art therapy the process is more important than the final masterpiece. The act of making art encourages creative expression without placing constraints on experience level. It provides an outlet where there are no right or wrong answers, and one is free to release any internal struggles and frustration that can form in the beginning stages of recovery [2].

As with most aspects of life, one size does not fit all and this holds true for therapy and rehabilitation. Researchers agree that the individualized treatment capability offered by the creative aspect of art therapy is essential for accommodating specific needs of patients [3–5]. By not requiring an end goal in art therapy, people have the ability to make their own choices and express themselves at their own pace and skill level. The individualization aspect of creative art therapy permits a wider range of patients to be treated and unleashes the potential for more therapeutic applications, including neurorehabilitation. Neurorehabilitation is the process of restoring the functions of the brain, usually for people who suffer from a neurological disease or brain injury. One main focus in neurorehabilitation is the plasticity of the brain, or its ability to make adaptive changes or form new connections in place of damage when exposed to environmental stimuli. Although plasticity occurs more in younger ages (developmental years) [6], it has also been found to occur in older ages at reduced levels [7].

One way to ensure the promotion of neural plasticity, regardless of age, is to have participants enter a creative state of flow [8]. Flow, one of the psychometric measures of creativity highlighted in Jung et al.'s (2010) study, has implications for promoting neuroplasticity [9]. Entering the state of flow is said to feel like being in autopilot mode—all focus

**Citation:** Aldridge, A.; Bethel, C.L. A Systematic Review of the Use of Art in Virtual Reality. *Electronics* **2021**, *10*, 2314. https://doi.org/10.3390/ electronics10182314

Academic Editors: Matúš Pleva, Yuan-Fu Liao and Patrick Bours

Received: 6 September 2021 Accepted: 17 September 2021 Published: 20 September 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

is on one activity, and everything else seems to fade away [8]. In art therapy, reaching the state of flow not only means achieving the optimal experience but also performing the activity successfully [4,8,9]. Jung et al. (2010) also found that creativity involves the activation within and between multiple brain areas, which has implications for use-dependent plasticity, healing individual parts of the brain [9]. Similarly, Makuuchi et al. (2003) found in their fMRI study that the following brain areas, shown in Figure 1, are activated during creative behavior: the parietal lobe, the premotor cortex, and the sensorimotor area (primary motor cortex and somatosensory cortex), among others [10]. These areas are considered to be involved in motor cognition [11], suggesting that art therapy can be used for restoring damaged motor areas of the brain and for inducing use-dependent neuroplasticity.

**Figure 1.** Some of the brain areas activated during creative behavior [12].

Promoting plasticity is vital for rehabilitating brain injuries. There are two types of brain injuries: traumatic and acquired. An acquired brain injury (ABI) refers to any brain damage or alteration of brain function, i.e., stroke, tumor, or meningitis, that occurs after birth and is not hereditary or caused by a degenerative disease. A traumatic brain injury (TBI) refers to any brain damage or alteration of brain function caused by an external impact to the head, such as from a military blast. In 2016, roughly 27 million people suffered TBIs around the world [13]. In the United States, approximately 5.3 million people are currently living with a permanent disability caused by brain injury [14]. Typically after suffering from a TBI, patients are unable to recognize the injury's impact and cannot shift into a new sense of self [4]. Because they suffer from poor self-awareness, brain injury patients can potentially benefit from the creativity component of art therapy, which allows for the rehabilitation of self-awareness, helping patients adapt to their new disabilities [15]. Of the disabilities that can form after brain injury, including problems with behavioral and mental health, sensory processing, and communication, motor impairment will be the focus of this investigation.

Traditional methods of art therapy often require a hands-on approach that excludes many people suffering from cognitive and motor impairments. With the technological advancements happening in the realm of human–computer interaction, new and innovative systems are being created to provide treatment to those excluded from the traditional methods of art therapy. Virtual reality (VR) systems are being used as an alternative modality to the traditional methods of therapy. Because VR is a real-time simulation of an environment, it has the capacity to accommodate the specific needs of elderly and impaired populations. In an effort to rehabilitate impaired motor functioning, researchers have studied the effect of VR on motor rehabilitation and have found it to aid in the rehabilitation of physical impairment [16–23]. With evidence supporting the use of VR in rehabilitative practices for motor impairment, an investigation into the efficacy of using art therapy in VR for neurorehabilitation needs to be conducted.

#### **2. Objectives**

This systematic review consists of an exploratory analysis of how art therapy is being used in VR for neurorehabilitation in non-adolescent people. To formulate the research questions guiding this review, the PICO (Population, Intervention, Comparison, Outcome) format was used [24]. The following research questions will be investigated and answered:


To adequately assess the limitations presented by VR, studies involving art therapy for neurorehabilitation in a non-VR setting are also included in this review.

#### **3. Methods**

A systematic review conducted according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) was performed using Google Scholar, ScienceDirect, and PubMed. The following keywords were used to find relevant studies: "art", "art therapy", "brain injury", "virtual reality", "neurorehabilitation", and "motor rehabilitation". If an article's title related to the objectives of this systematic review, the abstract was read to determine further relevance. If the abstract contained helpful information for answering this review's objectives, then the article was added to a list for further review. Additionally, any relevant-sounding references found in previously reviewed articles were added to the list.

#### *3.1. Inclusion Criteria*

To be selected as relevant or helpful in reaching the objectives of this review, an article must meet the inclusion criteria. Studies were considered eligible for inclusion if they were written in English and used art or art therapy in non-VR or in VR applications, particularly for neurorehabilitation purposes or with implications toward rehabilitating motor impairment. The desired population for inclusion was healthy adults and brain injury patients. Brain injury must refer to an ABI or a TBI for inclusion.

#### *3.2. Exclusion Criteria*

Articles with no access to full text were excluded, along with review articles. Populations that included patients suffering from disorders such as cerebral palsy, which can be congenital or acquired, were excluded because the condition of occurrence is not always specified. Studies that only measured emotional and mental states were also excluded.

#### *3.3. Study Selection*

All articles that matched with the keyword "art" specifically because of the phrases "state of the art" or "state of art" were excluded from the initial search results. Studies published in English were eligible if they used art in a therapy setting, set the intervention in a non-VR or VR environment, used a non-adolescent population of healthy people or brain injury patients, and either focused on neurorehabilitation or had implications for use in neurorehabilitation. Once the list of potentially relevant articles was compiled, each article was read in full and evaluated for relevance by two researchers.

#### **4. Search Results**

Using art neurorehabilitation motor virtual "art therapy", "state of the art", "state of art"" in Google Scholar, 138 results were returned. Of the 138, only 1 article was deemed relevant. To increase search specificity, the following phrase was used in Google Scholar: "allintitle: art neurorehabilitation "state of the art", "state of art"". From this search, 6 results were returned, and 1 article was used in this review. The phrase "allintitle: art therapy "virtual reality", "state of the art", "state of art"" in Google Scholar returned 12 articles, 2 of which are reviewed below. Using the phrase "(art or "art therapy") and (neurorehabilitation or "neurological rehabilitation" or "motor rehabilitation") and "virtual reality" not "state of the art"" in ScienceDirect returned 36 results. Zero of the results were relevant for this review. To reduce the specificity and yield more results, the phrase ""art therapy" and ("neurorehabilitation" or "motor rehabilitation")" was used to match with words in the title, abstract, or keywords category. One result was returned but was a duplicate of an article found in the Google Scholar search. Using the phrase "art in neurorehabilitation" and filtering to full text available and non-review article types yielded 13 results from PubMed. Of the 13 results, 0 articles were used. Various other combinations of keywords were used to search the databases, especially Google Scholar as it always returned the largest number of results. The combinations of phrases used in Google Scholar that returned the most relevant articles were as follows: "art in neurorehabilitation", "art and brain injury", "art in virtual reality", and "art therapy and neuroplasticity". The phrase "state of the art" was used to eliminate many of the results from these searches. Along with the articles collected from these database searches, relevant articles found within reference lists of the approved articles were used in this systematic review.

#### *4.1. Article Exclusion*

Several studies were included in the initial potentially relevant article list but then later removed after reading the abstract or full paper. One example of this is the study conducted by Jones et al. (2019) [25]. The authors conducted a study using art therapy to treat military service members suffering from post-traumatic stress disorder and TBI. The reason for excluding the study is that the focus was on helping the participants understand lingering trauma symptoms and improve communication and quality of life [25]. Another example of an article excluded from the final list of relevant sources is one by Kline (2016) [4]. Kline's (2016) article, titled *"Art Therapy for Individuals With Traumatic Brain Injury: A Comprehensive Neurorehabilitation-Informed Approach to Treatment"* [4], was excluded for being a literature review-based approach that did not provide experimental data.

#### *4.2. Data Extraction*

Nine articles were found to be relevant for evaluating the feasibility of using art therapy for non-adolescent and brain injury patients. Table 1 shows the diversity of the research conducted in the nine studies being reviewed. Once the list of relevant articles was finalized, studies were briefly analyzed to compare similarities and differences for grouping. To more easily display characteristics of the studies, data including population features, art practice used, intervention setting, and results of the studies performed were collected and compiled into three tables: non-VR (Table 2), VR brain injury patients (Table 3), and VR healthy participants (Table 4).



**Table 1.** Overview of the nine studies included in this review.

\* denotes age range because average age statistics missing from article. (GTB) is the Google Tilt Brush program built for VR.


**Table 2.** Summary of studies in non-VR environment.

\* denotes age range because average age statistics missing from article.

**Table 3.** Summary of studies using brain injury patients in VR environment.


\* denotes age range because average age statistics missing from article. (GTB) is the Google Tilt Brush program built for VR.

**Table 4.** Summary of studies using healthy participants in VR environment.


\* denotes age range because average age statistics missing from article. (GTB) is the Google Tilt Brush program built for VR.

#### **5. Traditional Art Therapy**

Transitioning art therapy to neurorehabilitation therapy does not seem like a far stretch. Researchers are already using art therapy to address visuospatial dysfunction and related symptoms of Parkinson's disease [29], analyzing art to detect perspective and preferences of those with limited verbal capabilities [35], and using an interactive art application to provide movement feedback in therapy [26]. With the success of art therapy in treating mental illness and assisting in physical therapy, implications that art therapy may promote treatment progress and recovery in neurorehabilitation are apparent.

Table 2 shows the extracted data from each of the studies using art therapy in a non-VR setting. A brief summary as well as discussion of results and limitations are included for each study.

#### *5.1. Digital Art Application*

Worthen-Chaudhari et al. (2013) conducted a study assessing the feasibility of using an interactive art application in neurorehabilitation therapy [26]. Ranging from 19 to 86 years of age (average 57 ± 18 years), 21 patients suffering from motor impairment and requiring at least 75% assistance on cognitive and motor-related tasks participated in the study. Over 1–7 sessions of their assigned therapy (physical, occupational, or recreational), participants performed movements in the form of drawing in an interactive art application and were able to see their movements in real-time in the form of visual art feedback. The researchers concluded from user feedback and therapists' responses that interactive art applications are appropriate and helpful for use in neurorehabilitation [26].

The results of the feasibility study conducted by Worthen-Chaudhari et al. (2013) have implications on enhancing neurorehabilitation therapy [26]. The interactive art application kept the participants engaged and showed their movements from a different perspective. By seeing visual feedback in the form of art, participants were able to understand their movements. Another implication found from using the interactive art application was that the quality of engagement may allow participants to experience a longer period of flow [26], hence a higher chance of neuroplastic changes. A limitation with this study was the lack of measurable outcomes on performance or improvements. The participants and therapists reported the interactive art application having a positive effect on motor functioning. From this and other feedback provided by the participants and therapists, one can conclude that it is feasible for this type of art application to be used in the neurorehabilitation setting [26]. Any continuing or future work from this study should include an investigation into whether or not this type of interactive art application improves any measurable outcomes of performance or motor impairment.

#### *5.2. Art-Making Changes Brain Connectivity*

To investigate how visual art production affects functional connectivity in the brain, Bolwerk et al. (2014) recruited 28 healthy adults, 64 ± 4 years of age, to participate in one of two art interventions: art production or art evaluation [27]. With age, certain areas of the brain begin to lose specialized functioning and turn to alternative brain regions for compensation [36]. Although Bolwerk et al.'s (2014) study investigates several areas of the brain, the focus for this review will be on the sensorimotor cortex because it is involved in motor functioning [10,11,27]. From the results, Bolwerk et al. (2014) found a significant improvement in the intraregional connectivity strength of the sensorimotor cortex with less connectivity in surrounding regions for both groups of participants. However, the art production group yielded stronger changes and stronger connectivity, suggesting a reversal in the loss of specialization and a better improvement in the distinctiveness of the sensorimotor cortex. These results show that art-making promotes improved, efficient interaction between brain regions [27] and holds implications for using art therapy for neurorehabilitating motor impairment.

#### *5.3. Art Therapy for Parkinson's Disease*

In Cucca et al.'s (2018) study, 20 patients with Parkinson's Disease (Group 1) and 20 age-matched healthy people (Group 2) underwent 20 sessions of art therapy [29]. The researchers' main goals were to identify general characteristics of visuospatial dysfunction and the impact art therapy has on motor and non-motor symptoms of Parkinson's disease. Using various art mediums, including oils, pastels, clay, watercolor, and paint, participants in both groups completed 9 art therapy projects designed to build in complexity and focus on different processes of visuospatial functioning [29]. For example, projects 2, 3, 4, and 6 were all created for the purpose of assessing an aspect of motor functioning: physical control, physical and cognitive capacity, fine motor coordination, and perceptions of physical limitations and strengths.

Cucca et al. (2018) found art therapy to be a safe and reproducible rehabilitation practice for Parkinson's disease patients [29]. Due to their results, the researchers theorized that art therapy rehabilitates by either recruiting underlying neural networks of impaired visuospatial functions, similar to action-observation and motor imagery methodologies, or by recruiting compensatory networks associated with targeted visuospatial functions [29]. Both theories have implications for promoting neuroplasticity [9] and neurorehabilitating motor areas of the brain [11]. The results from the study conducted by Bolwerk et al. (2014) seem to follow Cucca et al.'s (2018) first theory of recruiting underlying neural networks and contradict the second theory because the connectivity strength of the compensatory networks in Bolwerk et al. (2014) was reduced after the art-making intervention [27,29]. If Cucca et al.'s (2018) first theory is correct, research on a combined art therapy and motor imagery intervention might yield significantly stronger motor improvement results.

#### *5.4. Personal Journey Back to Mobility*

Not many articles exist that discuss measurable outcomes of using art therapy for neurorehabilitation. Most of the studies on art therapy for neurorehabilitation or art therapy in VR test for feasibility and usability. McDonald's (2020) own personal experience with art therapy involves using various art forms to rehabilitate her mind and body after suffering a stroke [31]. In her journey back to almost full mobility, McDonald (2020) used a variety of art mediums including paint, charcoal, colored pencils, and watercolors. As her mobility improved, she moved on to more demanding movements in art-making. Each medium had its own particular movement required for proper use, e.g., charcoal on paper required full arm movement and was good for practicing control; colored pencils and brush strokes worked whole-hand and wrist extension; and dabbing paint with a paint brush worked the fine motor movements of the fingers and wrist [31]. Along with changing art mediums, the subject matter of the art changed. Art compositions moved from familiar nature scenes to self-portrait style brain-to-muscle pieces. She also began to incorporate visualization of movement, or motor imagery, into her drawing process. Prior to one of her brain-muscle drawings, an electromyography reading of her deltoid (shoulder) muscle revealed a lack of muscle activation (loss of muscle control). Within days of drawing the brain-to-deltoid muscle connection, McDonald (2020) was able to raise her arm thirty degrees higher. Similar results were seen after incorporating combined brain-muscle and physical activity, such as swimming, running, and smiling, into her artwork [31].

Although it is not a typical experimental study, the results of McDonald's (2020) efforts to perform art therapy on herself further verify how important participation and engagement are in the art activity. Having completed more than thirty types of therapy post-stroke with little to no improvement, McDonald (2020) underwent art therapy and acquired the confidence, enjoyment, and physical goals she desired [31]. One limitation of this self-styled art therapy treatment was that no specific protocol was followed. McDonald (2020) moved through art projects of varying media at her leisure and based her next move on feelings and observations. Another limitation of her personal journey article was the lack of measurable outcomes from her art therapy. There were, however, several implications for future research and practice involving the subject matter that she used in her art. Once she incorporated visualization or motor imagery and began drawing movements and brain–muscle connections, she started seeing significant improvements in her physical mobility. By the end of her journey in the article, she noted being able to lightly jog and freestyle swim [31]. Motor imagery has already been used successfully for neurorehabilitating motor functioning in brain injury patients [22,37–44]. More research needs to be done to see whether McDonald's (2020) improvements in physical mobility stem from the subject matter change to brain-muscle connection-based art or from the addition of motor imagery and mental practice to the new subject matter.

Because the article written by Alex et al. (2021) contains one experiment in a non-VR setting and a second experiment in VR, the article was split between Tables 2 and 3; the summary and results will be discussed in the next section.

#### **6. Art Therapy in Virtual Reality: Brain Injury**

This section includes a brief summary of studies consisting of brain injury patients interacting with a virtual art program. Each of the studies contained in Table 3 used stroke patients to observe different aspects of art-making in VR. Some of those include user experience, art content, and range of motion. Investigating these areas of virtual art therapy produced important points that should be considered in future research. Table 3 shows the extracted data from each of the studies using art therapy in a VR setting for brain injury patients, followed by a discussion of results and limitations for each study.

#### *6.1. Traditional Art-Making vs. Virtual Reality Art-Making*

Although this article does not focus on neurorehabilitation, it uses brain injury patients to directly compare art therapy interventions in VR to non-VR, and it highlights several important aspects and limitations of performing art therapy in both environments. The main goals of the study conducted by Alex et al. (2021) were to gain a better understanding of the art-making process in a therapeutic setting for stroke patients and to identify potential design opportunities for stroke rehabilitation using art therapy in VR [32]. The researchers observed 14 stroke patients, 55–84 years old, make art traditionally (non-VR) then make it in VR. From their notes and observations, the researchers established the following three themes for comparing traditional (non-virtual) art-making to virtual art-making: artistic subject matter, aesthetics of materials, and art-making process. Figure 2 shows an example of virtual art created by one of the authors of this review using Google's Tilt Brush [45].

**Figure 2.** Artwork from virtual art setting.

In the traditional art-making setting, the subject matter mostly consisted of landscapes, portraits, and animals while in the VR setting, the subject matter was described as abstract (random shapes and lines), intentional (specific objects), or emergent (inspired by characteristics of the VR paint) [32]. The artistic subject matter in the traditional setting seemed very intentional with most participants using the familiar as inspiration for their art pieces. The subject matter in the VR setting, however, seemed very fluid and less precise, even with the participants who painted specific objects. The participants' inexperience with VR and lack of control of the VR controllers could explain why the virtual subject matter came across as more abstract and whimsical.

The aesthetic nature of materials in both settings differed in the art mediums available, the color selection process, and the malleability of the medium. Although the VR system was designed specifically for painting, the traditional setting offered a variety of different mediums, including graphite pencils, paint, watercolor, crayons, colored pencils, rollers, sponges, etc. Another difference was seen in color availability. The traditional setting allowed participants to create their own colors, if not already provided, by mixing paints together. The colors in the VR system were luminescent, seen in Figure 2, and restricted to the participants' abilities to successfully select a desired color from the color wheel or from predetermined color choices displayed in small circles below the color wheel [32]. It was observed that in the VR environment some participants had to ask for assistance in navigating the color picker menu or for help with gauging the depth of an object they wanted to erase [32].

Regarding the final theme, the art-making process, used for comparing the two interventions, the participants had opposite approaches in the traditional and VR environments [32]. The participants were very socially interactive with other participants and facilitators in the traditional art setting, but when immersed in the virtual environment, they were more focused on creating art. This is likely due to the group setting of the non-VR environment and the virtual intervention being done individually. Another way that the art-making processes appear to be opposites is in the pace that was used to create the art. In the traditional art environment, the participants were settled and reflective. They made careful decisions before committing something to their artwork by taking the time to identify all their options, reflect upon previous choices, view their artwork from different perspectives, practice with the tools, and use different techniques to apply or shape their chosen medium. In the VR setting, however, the participants were more physical and lacked control. Because the virtual environment provided more space for creating, participants used more of their body in the process and were able to create art all around them instead of just right in front of them. The participants seemed out of control because they were very quick to fill the available space and reported that the controller was not doing what they wanted it to do. It was speculated that the participants did not have control of the VR controllers comparable to that of the traditional tools and that their lack of control might have come from the mid-air movements draining their physical capabilities [32].

In the traditional setting, the participants worked at tables or desks and were able to rest their arms while they created art [32]. In the virtual environment, the participants engaged more of their upper body in the art-making process. The researchers noticed the wider range of motion used in virtual art-making and hinted towards this increase in physical activity having implications for improving motor impairments in stroke patients. They believe VR offers the unique benefit of allowing for adaptability in the scale of movement translation [32]. Changing the movement scale to translate large movements to smaller brush strokes could encourage more physical movement and lead to greater improvements in physical ability. Alternatively, changing the scale in the opposite direction would allow individuals with smaller or shorter ranges of motion to see their brush strokes covering larger areas, potentially helping to overcome the feeling of being physically impaired. This switch in focus from disability to ability is important in promoting progress and recovery [3,4,46].

There were limitations in the speed at which art was made in the VR environment and in the virtual art program that was used. Participants spent only minutes creating artwork in VR but spent hours creating art in the traditional setting. In the short amount of time the participants were in the virtual environment, they would not have been able to experience the benefits of art therapy, such as Csikszentmihalyi and Csikszentmihalyi's (1992) state of flow [8], or even the same benefits experienced in the traditional setting [32]. Using the same virtual art program, participants from Kaimal et al. (2020) made comments about how navigating the virtual art environment was easier after they adjusted to the controllers [30]. The groups from both studies were taken through an exploratory session to familiarize themselves with the art software, VR controllers, and virtual environment [30,32]. It is unclear whether the participants from Alex et al.'s (2021) [32] study were finding difficulty in the art program itself or in using the VR controllers. It is clear, however, that they did not use the same careful approach to creating art in the virtual environment as they did in the traditional setting.

From the responses made by the participants and their preference for traditional art-making, it is likely that the participants were overwhelmed by their virtual art-making experience [32]. The VR intervention always took place after the traditional non-VR intervention, and some patients participated in more than one session of the traditional art therapy. It would be interesting to see results from a similar study that compares the same number of sessions and counterbalances the order of the art therapy environments. Another limitation, this one within the virtual art program, lies in the interaction between the participants and the art mediums. In the traditional (non-VR) art setting, participants gained a sort of physical connection from being able to touch the art mediums and tools and mix the paints. Part of the art therapy experience is the sensory stimulation that physical materials provide. It is especially important for brain injury patients to experience that sensory stimulation as it is known to enhance awareness and focus [47]. Having one controller in place of various art tools takes away the physical connection that was seen in other studies [29,31]. The resulting gap in feeling connected to virtual art-making caused by this limitation has implications for introducing haptic feedback to virtual art therapy. Iosa et al. (2021) tried to rectify the missing tactile information that comes with virtual environments by adding visual feedback of color and shadow to the virtual tool used in their VR program [33].

#### *6.2. Art Improves Performance in Virtual Reality*

Iosa et al. (2021) conducted two experiments, but the first was excluded due to the population used. In the second experiment, four (4) stroke patients with an average age of 60 ± 13 years performed four (4) sessions of virtually interacting with either an art masterpiece or a piece of control art [33]. The virtual art system consisted of a 2D canvas covered in a white film. Using the VR controller, participants were to "paint" over the canvas, revealing either an art masterpiece or the control art. The illusion of painting was provided by the white film disappearing when the virtual art tool came into contact with the canvas. To add visual feedback to the system, the virtual art tool (a sphere) would turn green when in contact with the canvas but would turn red when the participant moved beyond the canvas. The movements of the virtual sphere and the participant's hand were tracked and recorded for performance measures during the sessions. The two participants who interacted with the art masterpiece had significant improvements for all computed parameters compared to the two participants who were assigned the control artwork. The participants also reported high usability scores for the virtual reality task, hinting at implications for future use of VR-based rehabilitation. Limitations include the small sample size and differences in details of the art masterpieces used [33]. Some of the art masterpieces contained humans while others consisted of fluid nature scenes. Artistic subject matter used in art therapy needs to be further studied, as it seems to have made an impact in three of the five studies reviewed so far [31–33]. There are implications that if art therapy can be performed while the brain is monitored using an electroencephalogram (EEG), then certain details and aspects of artwork, such as landscapes versus people in motion, can be used to target specific areas of the brain for rehabilitation [33].

#### *6.3. Digital Art Program in Virtual Reality*

Paczynski et al. (2017) studied the interaction between elderly people and an art program designed for creating digital artwork in VR [28]. Fifteen older adults, ranging from 69 to 96 years of age (average 84 ± 8 years), living in an aged-care facility took turns using the digital art system for six weeks. On average, the participants engaged in four 11.6 min sessions where they were free to create art without trying to reach a specific goal. Right and left hand movement was tracked along with lower body movement to show changes in performance. To analyze how their digital art program impacted movement, cognitive stimulation, and creativity, Paczynski et al. (2017) separated the participants into the following categories of impairment: stroke, dementia or memory impairment, and depression [28]. For the purpose of this systematic review, only the stroke and dementia groups' results will be discussed. All five participants affected by stroke showed above average velocities and upper body movements. The majority of the stroke participants enjoyed interacting with the digital art program and felt a positive impact on their physical and cognitive states. The art program allowed the stroke participants to express themselves creatively, despite their mental or physical impairments. For the group of participants suffering from dementia or memory impairment, four of the nine felt a positive impact on their cognitive health, and five of the nine felt a positive impact on their physical health. Data results for movement and creativity were not provided for this group [28].

Paczynski et al.'s (2017) results revealed that art in VR can be enticing and flexible for many types of users if they can stay engaged long enough to reap the benefits. A trend of growing indifference toward the art program can be seen by comparing the recorded distances traveled by the hands and lower body in the first sessions with those traveled in the final sessions. Having seven participants who traveled furthest in their first session implies that the novelty of the art system and the initial excitement and engagement provided a strong motivation for interaction that appears to have slowly faded [28]. If an aspect of sensory stimulation were added to the digital art program, attention might have been more easily sustained [47]. Adding a goal or theme of subject matter to create might also entice participants to stay motivated over several sessions of use. Having only six participants complete four or more sessions raises the question of whether those six were able to reach the state of flow more easily than the other participants or whether they were the only six to reach the state of flow. The virtual art program presented in Paczynski et al. (2017) afforded accessibility to a creative outlet for people with multiple disabilities who otherwise might not be able to express themselves [28]. There are implications for cognitive motor repair in the results of the participants who saw an increase in average velocity of one or more body parts. Because this study was about learning about the interaction between participants and technology, any future work should investigate whether the increased velocity was due to improved motor functioning or to the excitement created by the new art program.

#### **7. Art Therapy in Virtual Reality: Healthy**

This section includes a summary and discussion of results and limitations for each study. The two studies reviewed in this section used healthy participants in virtual art-making to examine user experience and interaction with the same virtual art program. Based on the reports from both groups of participants, it is noticeable that the healthy participants had an easier time navigating the virtual art program than the brain injury participants. Table 4 summarizes the data extracted from each of the studies that used art therapy in a VR setting for healthy people.

#### *7.1. Experiencing Art Therapy in Virtual Reality*

The study performed by Kaimal et al. (2020) was included because of its implications toward using the specified art therapy system on individuals with motor impairment. Kaimal et al. (2020) studied 17 individuals, aged 18–65 years old, to gain an understanding of their experiences with art therapy in VR from one free-form art-making session [30]. From the feedback provided by the participants, the researchers identified key aspects art therapy in VR offers that traditional art-making does not. Creating art in the virtual environment engaged full body movements, which the participants found to be enjoyable. Being able to erase part of the artwork eased the sense of permanence typically associated with traditional art mediums. Participants did not have to worry about making mistakes and instead were able to focus on exploration and creative expression [30]. Many participants noted that once they familiarized themselves with the controllers, they felt in control and were able to feel the art flow from them without any distractions. They also expressed their enjoyment in the feeling of being transported to an alternative or imagined space away from the constraints, pressure, and stress of the real world [30].

A practical implication that can be drawn from Kaimal et al.'s (2020) study is using virtual art therapy on individuals lacking fine motor skills [30]. The experimental limitations stem more from the system used than from the virtual environment itself. The art program does not allow for changing the colors of the environment or background, and the art tools sometimes came across as clunky versions of traditional tools [30]. Similar comments were made by the participants in Alex et al.'s (2021) study [32]. Although Kaimal et al. (2020) and Alex et al. (2021) used the same art software in VR, they yielded conflicting results. The population in Kaimal et al.'s (2020) study consisted of younger healthy people [30] while Alex et al.'s (2021) study consisted of elderly stroke patients [32]. The younger population reported more enjoyment with regard to art therapy in VR and did not seem to have as much trouble navigating the menus or using the controller(s) [30]. Another difference is the approach to art-making in the virtual environment. The elderly stroke population seemed overwhelmed by their lack of control of the controller and rushed through creating an art piece [32]. The younger, healthy population seemed to take the time to master the controller and move through the space during the creating process to view their artwork from different perspectives [30]. It is unclear if the participants from Alex et al.'s (2021) [32] study underwent the VR intervention on the same day as their last non-VR art therapy session, but, if so, that could have influenced the quick pace seen from those participants in the VR session.

#### *7.2. Expert Art Therapists on Art Therapy in Virtual Reality*

To examine the potential for art therapy in VR, Hacmun et al. (2021) had seven expert art therapists, 42–75 years old, observe art-making and create their own art in a virtual environment [34]. Each participant was introduced to the VR medium prior to the creation and observation sessions. In the creation session, participants were allowed to make 3D art in a 360-degree space. In the observation session, participants simultaneously watched the creator in the real-world environment and viewed the virtual art on a computer screen. In the results from the study, the researchers found that most of the participants were surprised by how much they enjoyed creating art in VR and how user-friendly they found the medium. Participants reported missing the physical contact that traditional art-making provides but described the ability to freely move through the art as fun and unique. Some participants noted that they felt their body's physical movement to be a sort of tactile feedback even though there was a lack of physical substrate in the art created. All of the participants reported that VR was suitable for art therapy, but some stated that it should be used along with other creative media. Most of the participants agreed that the ideal population for using art therapy in VR is adolescents who are already familiar with and attached to technology and screens. They reported that they were unsure whether VR could be beneficial to the elderly or physically disabled [34].

Hacmun et al. (2021) point out that a major limitation of their study is that the participants mainly consisted of digital immigrants who do not consider themselves to be technologically savvy [34]. Another limitation that the authors mention is that the participants performed only one session of creating art in VR. They acknowledge that feedback from the art therapy experts might change with more practice and familiarization with the VR medium. In terms of movement during virtual art-making, the participants spoke a lot about the freeing feeling of using their whole body to create art but did not connect this feeling of embodied expression with implications toward motor neurorehabilitation or even physical rehabilitation. However, the researchers associate the participants' reporting on movement with results from other studies that have shown movement enhances the feeling of being present in VR due to the increased connection between the real and virtual worlds [34]. Having the connection between reality and VR can provide an alternative point of view for patients to establish a new sense of self or self-awareness [4,15].

#### **8. Discussion**

Brain injuries remain a serious public health concern and leave many individuals with long-lasting disabilities. Attempts to create new, innovative ways of using art therapy to treat and repair disabilities caused by brain injury have been made and show promising results in multiple areas of therapy [15,25,35,46,48] with implications toward using art therapy in neurorehabilitation practices [1,10,49–51]. When using art therapy for motor neurorehabilitation, especially for brain injuries, promoting brain plasticity needs to be considered. Neuroplastic changes of motor areas in the brain are thought to happen from a variety of stimuli, including: creative state of flow [9], subject matter [8], motor imagery, action observation, and action execution [22].

In answering the first research question, consider the results from the studies that were reviewed (see Figure 3). Bolwerk et al. (2014) showed that the process of art-making significantly improves intraregional connectivity strength in the sensorimotor cortex [27], which holds implications for using art therapy for motor neurorehabilitation. With the positive improvements in motor functioning seen from Worthen-Chaudhari et al. (2013) and Paczynski et al. (2017), it seems feasible that art therapy can be used for neurorehabilitation purposes outside VR [26,29] and inside VR [28] for patients suffering from brain injury. Hacmun et al. (2021) revealed that the freedom of movement offered in VR can help establish the connection between the real and virtual worlds [34], which can provide an alternative point of view for patients to establish a new sense of self or self-awareness [4,15].

McDonald (2020) performed art therapy in a traditional (non-VR) setting and began seeing significant improvements in her physical mobility once the artistic subject matter changed to brain-muscle connections and movement visualization was added [31]. From these results, though, the question is raised of whether it was the physical art-making or the combination of subject matter and movement visualization (motor imagery) that improved her motor functioning. If the answer is the latter, then those aspects of art therapy can easily be transferred to a virtual environment. Many researchers are already successfully using motor imagery for neurorehabilitation in VR [22,37,39,42,44]. If the answer is the former (physical art-making), such as that seen in the results from Cucca et al. (2018) [29], then the physical contact and skill required of specific art mediums might play a more significant role in rehabilitation and should be incorporated into virtual art therapy programs in the form of haptic feedback [20,34]. If it turns out to be due to a combination of art-making and subject matter or motor imagery, then the results from Iosa et al. (2021) show that it is possible to yield motor improvements, based on subject matter, in a virtual setting [33]. Although further research into artistic subject matter and combining art therapy with motor imagery needs to be conducted, it is evident that using art therapy in VR for rehabilitating motor functioning is feasible.

The studies performed by Alex et al. (2021), Kaimal et al. (2020), and Hacmun et al. (2021) all used the same VR art program on different populations but yielded varying results and limitations [30,32,34]. The elderly stroke population, though seated in a swivel chair, was quick and chaotic in filling the available space [32] while the younger, healthy population was slow and deliberate with their actions and placements [30]. It can be inferred that the physical limitations of the stroke patients affected their control when having to hold the VR controller in mid-air to paint [32]. Additionally, there is likely a limitation in the older stroke group being confined to a swivel chair [32] while the two healthy groups were free to walk around the space [30,34]. To attain the freedom to create art in VR using full body movements, like those seen in healthy participants, the available population of brain injury patients would have to be limited to those with a certain degree of physical impairment. Unless a mobility support system is used in conjunction with VR or an alternative way of making art in VR is created, it is unsafe to allow patients, specifically those with lower limb impairments, to physically move freely around the virtual environment.

A recurring theme appears in several of the studies that were reviewed. Patients seem to quickly lose engagement when art therapy is performed outside of the traditional setting [26,28,32]. It can be deduced that the participants in Iosa et al.'s (2021) study did not lose interest in the VR art task because of the added visual feedback on the virtual art tool [33]. Adding visual feedback follows the idea that sensory stimulation is engaging and draws focus to the task at hand [47]. Following the same principle, switching from traditional art mediums and tools to a VR controller causes a disconnect between the user and the art-making process. Haptic feedback has the potential to recreate a physical connection between user and art medium or tool in a virtual environment. The participant groups from Alex et al. (2021) and Hacmun et al. (2021) agreed on the missing physical connection to art mediums in VR. However, the group from Hacmun et al.'s (2021) study reported the virtual art program to be user-friendly [34] while the group from Alex et al.'s (2021) study seemed to struggle using the program [32]. Because the two healthy populations had an easier time using the virtual art program than the stroke population, future studies should allow brain injury patients extra time or practice sessions to familiarize themselves with navigating virtual art applications and VR controllers. In addition to balancing the learning curve, implementing alternative modalities of controlling virtual art programs has the potential to establish the missing connection between user and virtual art mediums. Adding that kind of sensory stimulation to virtual art programs might also be effective in helping brain injury patients gain control inside the virtual environment. Comparing the reviewed studies, most of the limitations and differences appear to stem from the experimental design(s) or the virtual art system used rather than from the virtual environment [26,28,30,32,34]. If adjustments can be made to the virtual art software, interactivity of materials, and experimental design to ensure a more usable, accessible, and stimulating VR experience, then there is a high probability that brain injury patients can enter the state of flow and induce neuroplasticity, making it feasible to use art therapy in VR for neurorehabilitation. Future work in virtual art therapy will need to assess the correlation between performance and artistic subject matter, as well as overcome the lack of measurable outcomes for showing performance and motor improvements. Utilizing mobility tests for pre- and post-study measurements, such as the Functional Independence Measure™ [26] and the Fugl–Meyer assessment [52] used to assess patients against inclusion criteria, is a way of reducing heterogeneity and allowing for comparable results between studies. Another way of producing measurable outcomes is by combining EEG and art therapy. Using EEG during art therapy could reveal how artistic subject matter influences activation in certain brain areas and promotes use-dependent plasticity [9,33] by studying power levels in different brain regions. It also presents a way of measuring neuroplastic changes [51]. In their study surveying cortical activation patterns after making art and after performing a physical task, King et al. (2017) revealed a statistically significant difference in cortical activation after art-making compared to baseline data. Their findings have implications for producing measurable outcomes from art therapy used in neurorehabilitation [51].

#### **9. Conclusions**

The systematic review conducted in this paper examined the conditions under which art therapy in VR can feasibly be used for the motor neurorehabilitation of brain injury patients and outlined the need for future research to use post-study assessments to reduce the heterogeneity of results. Although limitations exist, researchers are continually finding ways to advance the use of art therapy in VR. More research involving multiple sessions of art therapy in VR needs to be conducted to study the learnability and usability of virtual art programs. With further research into artistic subject matter and sensory stimulation in virtual art applications, approaches to art therapy in VR can be fine-tuned for targeting and rehabilitating motor areas of the brain to achieve results similar to those observed in more traditional art therapies.

**Author Contributions:** Conceptualization, A.A. and C.L.B.; methodology, A.A. and C.L.B.; validation, A.A. and C.L.B.; formal analysis, A.A.; investigation, A.A. and C.L.B.; resources, A.A.; data curation, A.A.; writing—original draft preparation, A.A.; writing—review and editing, C.L.B.; supervision, C.L.B.; project administration, A.A. and C.L.B. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Acknowledgments:** The authors acknowledge Kasee Gadke-Stratton for her assistance and information regarding the use of VR for art therapy that formed the basis for this research.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Using Augmented Reality and Internet of Things for Control and Monitoring of Mechatronic Devices**

#### **Erich Stark 1, Erik Kučera 2,\*, Oto Haffner 2, Peter Drahoš 2 and Roman Leskovský 2**


Received: 19 July 2020; Accepted: 6 August 2020; Published: 7 August 2020

**Abstract:** At present, computer networks are no longer used to connect just personal computers. Smaller devices can connect to them even at the level of individual sensors and actuators. This trend is due to the development of modern microcontrollers and single-board computers which can be easily connected to the global Internet. The result is a new paradigm—the Internet of Things (IoT) as an integral part of Industry 4.0; without it, the vision of the fourth industrial revolution would not be possible. In the field of digital factories it is a natural successor of machine-to-machine (M2M) communication. Presently, mechatronic systems in IoT networks are controlled and monitored via industrial HMI (human-machine interface) panels, console, web or mobile applications. These conventional methods of controlling and monitoring mechatronic systems within IoT networks may be fully satisfactory for smaller rooms. Since the list of devices fits on one screen, we can monitor the status and control these devices almost immediately. However, in the case of several rooms or buildings, which is the case of digital factories, ordinary ways of interacting with mechatronic systems become cumbersome. In such cases, it is possible to apply advanced digital technologies such as extended (computer-generated) reality. Using these technologies, digital (computer-generated) objects can be inserted into the real world. The aim of this article is to describe the design and implementation of a new method for control and monitoring of mechatronic systems connected to the IoT network using a selected segment of extended reality to create an innovative form of HMI.

**Keywords:** mechatronic devices; Internet of Things; cyber-physical systems; system control; augmented reality; mixed reality; Azure cloud

#### **1. Introduction**

Extended reality, as a modern technology, is used in Industry 4.0 to virtualize the efficient design of optimal production structures and work operations together with their ergonomic evaluation [1,2]. New forms of process monitoring, control, diagnostics and visualization are currently being sought in digital factories [3]. Extended reality brings such forms [4].

By extended reality we mean virtual, augmented and mixed reality [5]. At present, there is no general consensus on the distinction between augmented and mixed reality. There are several definitions.

The first definition is the definition from The Foundry, which develops software for 3D modeling and texturing [6]. This definition is often used in industrial practice.

Virtual reality (VR) replicates an environment that simulates a physical presence in places in the real world or an imagined world, allowing the user to interact in that world [6]. Devices for virtual reality are Oculus Rift, Oculus Quest, HTC Vive, and so forth [7].

Augmented reality (AR) is a live, direct or indirect view of a physical, real-world environment whose elements are augmented (or supplemented) by computer-generated sensory input such as sound, video, graphics or GPS data [6]. Augmented reality is an overlay of content on the real world, but that content is not anchored to or part of it. The real-world content and the CG content are not able to respond to each other [8].

Mixed reality (MR) is the merging of real and virtual worlds to produce new environments and visualisations where physical and digital objects co-exist and interact in real time. MR is an overlay of synthetic content on the real world that is anchored to and interacts with the real world. The key characteristic of MR is that the synthetic content and the real-world content can react to each other in real time [6]. Technologies for mixed reality are Microsoft HoloLens (Windows Mixed Reality platform), Apple ARKit and Android ARCore [8].

Another definition is used more often in scientific teams than in industrial practice. In 1994, the authors Milgram and Kishino [9] introduced the spectrum between the real and the virtual environment—the reality-virtuality continuum (Figure 1). This continuum defines a mix of the real and the virtual world. They understand mixed reality as anything between the real environment and full virtual reality. Between reality and virtual reality, they distinguish augmented reality, which is practically identical to augmented reality according to The Foundry definition, and augmented virtuality. It can be stated that augmented virtuality corresponds to mixed reality according to The Foundry definition.

**Figure 1.** Reality-virtuality continuum.

#### *1.1. Motivation*

The current emerging trend of the Internet of Things has an impact not only on applications for households [10], smart buildings and services, but also on industries and manufacturing [11,12]. The application of IoT principles in industry is called the Industrial Internet of Things (IIoT). In this case, individual machine parts, sensors and actuators act as interconnected devices [13]. In particular, the interconnection of devices should be wireless and bring about new possibilities for their mutual interaction as well as for their diagnostics, control and provision of advanced services.

The research included several meetings and discussions with industry partners who requested that the Internet of Things be connected to augmented or mixed reality. These discussions made it obvious that the integration of augmented and mixed reality technologies into production processes in digital factories is inevitable.

#### 1. **British company dealing with the implementation of Industry 4.0 principles**

One of the modern trends in the industrial field is the use of increasingly powerful, more durable and more affordable mobile devices, such as smartphones or tablets. Such devices allow the use of modern technologies that have not yet been widely used in industry, namely augmented and mixed reality and their applications for control and monitoring of devices. The use of augmented or mixed reality creates a qualitatively new and better way of solving HMI. In conventional approaches, it is necessary to select a specific device (sensor, actuator, etc.) on the display device, so we need to know its specific location in the production hall or its ID. After selecting the device, the required data (for example in the form of a graph or a table) is displayed on the display unit. When using an augmented or mixed reality application, it is possible to operatively search for individual sensors within the production hall environment and interactively display the required values or change the settings and parameters of the given device via the display unit. An interesting advanced feature would be the ability to see, directly in the environment, which devices the selected device is connected to or forwarding data to. The localization and identification of individual sensors and actuators is currently an open problem that can be solved by several approaches.

#### 2. **Slovak company dealing with tire diagnostics**

The use of augmented or mixed reality in the diagnosis of different devices is also an open question that is being addressed by several companies. During the meeting, one of the industrial partners formulated a request for a tire fault diagnosis system via a headset or mobile device for mixed reality. At the same time, this system should make it possible to display diagnostic information from various devices in the factory, which is a similar requirement as in the previous point.

#### 3. **Slovak manufacturer of advanced cutting machines**

The use of augmented or mixed reality for the maintenance and operation of complex machines, whose working area reaches several tens of meters, is currently also an open topic that requires a comprehensive multidisciplinary approach. International leaders in cutting technologies have already begun to implement such solutions. Therefore, a discussion about the possibilities of implementing maintenance and diagnostic systems using extended reality also took place with a Slovak company in this area (Figure 2).

#### 4. **Control of sophisticated industrial devices with limited access**

Another of the industrial partners demanded the use of augmented or mixed reality in the control and diagnostics of various sophisticated devices, to which only a limited group of employees have access. This eliminates the need to implement physical control panels, to which even a regular employee can have access. The requirement is that the device can be controlled only by an employee who has access to a mobile device (smartphone/tablet) with an augmented or mixed reality application. In addition to security, such an application also brings the advantages described in the previous points.

**Figure 2.** Testing of augmented and mixed reality for cutting machine maintenance.

#### *1.2. Related Research*

In the analysis of the state of the art, we focused on searching for scientific works and existing solutions in the subject field. The projects found showed the possibilities of using augmented or mixed reality for control and monitoring of mechatronic systems within IoT networks. Other important aspects considered were whether a project was developed using open source code and whether it was put into practice.

References [14,15] describe the use of standard communication protocols in the design of an IoT system implementing web standards. This creates the so-called Web of Things. The Web of Things is designed for easy integration of systems into the current web; the idea is to create a common application layer for IoT based on web technologies and protocols. Subsequently, this idea was extended in Reference [16] by the term Augmented Worlds. The Augmented World concept can be defined as a software application that adds digital objects to the surrounding physical environment (e.g., a city, building or room) that users or software agents can interact with. The combination of the Web of Things and the Augmented World created the concept of the Web of Augmented Things.

The concept of Augmented Things was presented in Reference [17]. The idea is to create a database of digital copies of real objects (typically consumer electronics) and assign various information to them. This can be, for example, maintenance information, instructions for use, and so forth. After capturing a real object whose digital copy is in the database of Augmented Things, information about this real object is displayed on the mobile device's screen in augmented reality.

Close to the focus of our research is the concept of the author Philippe Lewicki [18]. He created a demonstration application that could be used to control a Philips Hue smart light bulb using a Microsoft HoloLens headset. With the help of HoloLens, it was possible to select the color of the light of a given bulb with a simple gesture in augmented/mixed reality. The author realized that today's solutions allow light bulbs to be controlled through a mobile application, in which it is necessary to find the specific room and light bulb the user wants to control. This may not always be practical, and control with a headset provides greater convenience. However, the described concept has not been developed further.

There is also a concept by designer Ian Sterling and engineer Swaroop Pala [19]. This concept demonstrates the control of smart devices using gestures with a Microsoft HoloLens. The task was to provide a user interface for an Android music player and an Arduino microcontroller with a connected fan and light. As in the previous case, it is not a complete system, but a single-purpose demonstration application.

A better solution is presented in Reference [20]. The presented AR/MR-IoT framework uses standard and open-source protocols and tools such as MQTT (Message Queuing Telemetry Transport), HTTPS (Hypertext Transfer Protocol Secure) and Node-RED. The solution relies on QR codes. The article focuses mainly on the timing aspects of communication in the presented framework.

A comprehensive commercial software system for diagnosing and controlling mechatronic systems is Vuforia Studio [21], formerly called ThingWorx Studio. The rebranding took place after the Vuforia augmented and mixed reality library was purchased by the technology company PTC. Such an acquisition was a logical step, as PTC reacted very flexibly to the emergence of the Industry 4.0 and Industrial IoT concepts. Vuforia Studio is a closed-source tool in which it is possible to insert 3D and 2D objects that are displayed in augmented reality after the mechatronic device is captured and recognized. This technology does not recognize devices directly but uses its own 2D ThingMark tags, which are a conventional technology similar to a QR code. The content is then visualized using Vuforia View.

ŠKODA AUTO has introduced the Smart Maintenance project, which uses augmented reality for maintenance tasks [22]. The Microsoft HoloLens headset is used. It is a relatively simple software application that uses the HoloLens cameras to recognize a metal tube with handles. The goal is to diagnose the distances between the handles, which are likely to deviate over time. When the tube is detected, the real object is overlaid with a digital tube with the handles in the right places. Based on this visual information, it is possible to easily identify any displacement and then fix the handle so that it matches the position of its virtual counterpart. This method of maintenance simplifies and speeds up the work of technicians, as they are relieved of the need to constantly measure the distances. A custom 3D engine was developed for the application. However, after a real test of the application within the solution of Reference [23], it can be stated that the application reacted badly to lighting conditions and also suffered from the limitations of the HoloLens headset: the holograms were too pale and did not overlay the objects correctly, and the field of view was limited. The real use of the presented solution is therefore questionable.

Development of methods for control and monitoring of mechatronic systems using new information and communication technologies belongs to the modern directions in cybernetics, automation and mechatronics. Based on the analysis of available literature sources and recent research projects, it was found that control and monitoring methods for mechatronic systems connected to IoT using extended reality are implemented in the form of various prototype solutions for selected device types or as closed-source single-purpose application systems. These systems are dedicated and not easy to extend to control and monitor different mechatronic devices without modification of the client software application. Such systems cannot be considered generalized and modular solutions. Excursions and discussions with industrial partners have shown that there is interest in such comprehensive solutions. In the context of the ongoing Industry 4.0 industrial revolution, small and medium-sized enterprises are already interested in implementing modern digital technologies, such as the Internet of Things, the cloud and extended reality, into their manufacturing processes.

#### **2. Materials & Methods**

Control and monitoring of mechatronic systems connected in IoT networks using a selected segment of extended reality brings new challenges, as this concept combines hardware and its mechanical parts, microcontrollers and electronic systems, 3D engine for extended reality, mobile devices and communication protocols within the IoT and the cloud. With the proper design of the methodology of control and monitoring of mechatronic IoT systems and the supporting software module, it is possible to synergistically combine the above digital technologies bringing about a functional, original and modular system applicable for a selected class of mechatronic systems.

Nowadays, mechatronic systems in IoT networks are controlled and monitored mainly via industrial panels or console, mobile or web applications. When such conventional methods of controlling and monitoring mechatronic systems are used in a smaller room, the process can be simple and efficient. With the list of devices on one screen, we can set, monitor and control them almost immediately. However, if there are several rooms, buildings or a large digital factory, sorting these items can become confusing and cumbersome. In these cases, the developed methodology of control and monitoring of mechatronic IoT systems based on augmented reality can yield effective solutions.

In developing the system, it was important to determine how to recognize and identify the individual mechatronic devices. These devices are subsequently used to anchor computer-generated elements in augmented reality. There are several alternatives [8]:

• *Using a QR Code*—The name QR code comes from Quick Response, as this code was developed for quick decoding. It is a two-dimensional bar code printed either on paper or in digital form. Using a mobile device camera, we can decode the encoded information. The QR code is a square matrix consisting of square modules, colored black and white. The advantages of using QR codes include the rapid generation of a new QR code when building and extending the application system. Another advantage is that each device or sensor can have a unique QR code, so objects with the same shape can be distinguished. The drawback is that the mobile device must be kept parallel to the code while the recognition process is running, and close enough to the device.


After considering the advantages and disadvantages of the above alternatives, we decided to use a 3D model. Although creating and extending the system this way is more time-consuming, smooth running and a more intuitive application design were more important in our case. The use of a three-dimensional model and a three-dimensional map is original in the area of monitoring and control of mechatronic systems using augmented reality and is one of the benefits of the proposed solution.

Based on the analysis, a concept of the application system is proposed, which is shown in Figure 3.

#### 1. **The software application analyzes the image from the camera of the mobile device and recognizes the mechatronic system**

The augmented reality mobile app recognizes a real mechatronic device using a camera and a 3D map created in Wikitude Studio [26]. The 3D map of the mechatronic device is created using photographs of the device taken from several angles. Subsequently, the Wikitude SDK (software development kit) augmented and mixed reality library can interpret this 3D map from a database. The database is stored in a software application on an Apple iPad tablet. The advantage of this method is the ability to recognize the object from any angle. Consequently, even with reduced visibility, tracking does not have to be interrupted, as Wikitude can also store the close surroundings of the object. Thus, the implementation of the proposed solution can do without conventional methods of recognizing objects that rely on QR codes.

#### 2. **The mobile device connects to the server and the mechatronic system's device twin in the cloud**

The mobile device is connected to the cloud where the recognized mechatronic system has its digital copy (*device twin*).

#### 3. **The data from sensors of the mechatronic device is sent to the server, and the device twin in the cloud is synchronized**

The mechatronic device automatically sends data under its identifier from sensors to the server where the data is also stored. For this purpose, the InfluxDB database is used [27], designed for time-dependent data which can then be visualized in the Grafana environment [28]. At the same time, the digital copy of the mechatronic device is synchronized at the level of the Microsoft Azure Device Twin, which ensures the visibility of current data even in the cloud environment [29].
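
As an illustration of this step, the sketch below shows how a device-side script might publish an ultrasonic reading to the broker using the mqtt npm package. The topic name makeblock-tank/sensor/ultrasonic is taken from the flow described in Section 3.1.1; the broker address, the payload fields and the readUltrasonicCm() helper are assumptions made only for this example.

```javascript
// Minimal device-side sketch using the "mqtt" npm package:
// publish an ultrasonic reading to the broker once per second.
const mqtt = require('mqtt');

// Broker address is illustrative; in the described system it is the
// Raspberry Pi server running mosquitto.
const client = mqtt.connect('mqtt://192.168.1.10:1883');

// Hypothetical helper; on the real device the value comes from the sensor driver.
function readUltrasonicCm() {
  return 42.0;
}

client.on('connect', () => {
  setInterval(() => {
    const payload = JSON.stringify({
      device: 'makeblock-tank',   // device identifier
      value: readUltrasonicCm(),  // distance in centimetres
      timestamp: Date.now()       // milliseconds since epoch
    });
    client.publish('makeblock-tank/sensor/ultrasonic', payload);
  }, 1000);
});
```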

#### 4. **The application obtains information about the type of the mechatronic device, downloads the definition of the user interface and draws a graphical interface for control and monitoring of the system**

The proposed system works in such a way that the mobile application recognizes the mechatronic device and, according to its identifier, obtains a unique definition scheme of the user interface for the needs of its monitoring and control. The concept of definition schemes for the dynamic generation of a graphical user interface in augmented reality is one of the pillars of modularity of the implemented solution and at the same time one of its application benefits. The mobile application has access to these definition schemes due to its connection to the database. The connection is realized by means of visual flow-based programming in the Node-RED environment [30], where a suitable scheme is obtained based on the given parameter.
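
The article does not reproduce an actual definition scheme, so the object below is only a hypothetical illustration of what such a scheme might contain: the device identifier, the sensors shown in the diagnostics part and the control elements drawn in the control part (both parts are described in the next step). All field names are assumptions, not the authors' format.

```javascript
// Hypothetical definition scheme for the Makeblock Tank; the structure and
// field names are assumptions made for illustration only.
const definitionScheme = {
  deviceId: 'makeblock-tank',
  // Diagnostics part: sensors whose current values are displayed.
  diagnostics: [
    { sensor: 'ultrasonic', label: 'Distance', unit: 'cm', widget: 'gauge' },
    { sensor: 'light',      label: 'Light',    unit: 'lx', widget: 'gauge' }
  ],
  // Control part: elements rendered for interacting with the device.
  controls: [
    { command: 'move',  label: 'Direction', widget: 'joystick' },
    { command: 'speed', label: 'Speed',     widget: 'slider', min: 0, max: 100 }
  ]
};

// The mobile application would iterate over these arrays and draw one
// augmented reality widget per entry, which is what makes the approach modular.
```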

#### 5. **The user interacts with the mechatronic device through a graphical interface in the augmented reality—a new form of HMI**

Based on a unique definition scheme, the mobile application displays a graphical user interface in augmented reality consisting of two parts. The first part is diagnostics and displays current data from available sensors. The second part is control and shows the control elements directly designed for the mechatronic device. Subsequently, the user is allowed to interact with the mechatronic device through a graphical interface in augmented reality, which is one of the new modern forms of human-machine interface (HMI).

#### 6. **Control commands are sent to the server which sends them to the connected mechatronic device**

Control commands are sent from the mobile device to the server using the MQTT communication protocol. On the server, they are processed and executed. The software application on the mechatronic device listens on the MQTT topic and subsequently sends these requests to sensors and actuators via serial communication.
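
A minimal sketch of the device-side listener described in this step is shown below, assuming the mqtt npm package; the command topic, the message format and the serial-forwarding helper are illustrative, as the article does not list them.

```javascript
// Device-side sketch using the "mqtt" npm package: listen for control commands
// and hand them over to the serial interface. Topic and message format are assumptions.
const mqtt = require('mqtt');

const client = mqtt.connect('mqtt://192.168.1.10:1883'); // illustrative broker address

client.on('connect', () => {
  client.subscribe('makeblock-tank/command');
});

client.on('message', (topic, message) => {
  const command = JSON.parse(message.toString()); // e.g. { command: 'move', value: 50 }
  sendOverSerial(command);
});

// Placeholder for the serial write; the real system forwards the request to the
// sensors and actuators via serial communication.
function sendOverSerial(command) {
  console.log('forwarding to serial:', command);
}
```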

**Figure 3.** Proposal of mechatronic device monitoring and control using augmented reality.

To achieve the set objectives, it is necessary to design and implement a comprehensive hardware-software system for control and monitoring of mechatronic IoT systems based on the augmented reality and the concept of definition schemes for dynamic generation of graphical user interface. The developed system will be tested on a laboratory mechatronic system connected to the IoT.

#### **3. Results**

The development of such a complex system had to be done in cooperation with other workers and in several parallel lines to cover all four component parts of mechatronics (mechanics, electronics, automation and information-communication technologies). In what follows, the whole system is described through its individual parts: server, augmented reality mobile device (Apple iPad), laboratory mechatronic device and the Microsoft Azure cloud.

#### *3.1. Server*

The tools implemented on the server side (Figure 4) of the described project can be run on practically any Linux-based operating system. In the developed solution, a Raspberry Pi 3 microcomputer with the Raspbian operating system was used as the server. Raspbian is based on the Debian distribution with an emphasis on optimization for this type of microcomputer.

The MQTT broker (mosquitto) runs on the server and serves as the main central point through which all communication between the mobile device and the currently recognized mechatronic system takes place. Messages are sent to the broker using publish-subscribe communication on topics that do not have to be defined in advance; an application that wants to obtain data from a given topic simply has to subscribe to it.
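As an illustration of this publish-subscribe pattern, the following minimal sketch uses the common mqtt.js client for Node.js (not necessarily the client used in the project) to publish an ultrasonic reading and subscribe to the same topic; the topic name follows the convention used later in the article and the payload is illustrative.

```
// Minimal publish-subscribe sketch using the mqtt.js client (npm install mqtt).
// Broker address and topic follow the article's conventions; the payload is illustrative.
const mqtt = require('mqtt');

const client = mqtt.connect('mqtt://127.0.0.1:1883');

client.on('connect', () => {
  // Subscribe without any prior topic definition on the broker side.
  client.subscribe('makeblock-tank/sensor/ultrasonic');

  // Publish a sensor reading; every subscribed client will receive it.
  client.publish('makeblock-tank/sensor/ultrasonic', JSON.stringify({ distance: 42.5 }));
});

client.on('message', (topic, message) => {
  console.log(`${topic}: ${message.toString()}`);
});
```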

#### 3.1.1. Flow-Based Programming of Communication Interface Using Node-RED

Node-RED consists of a runtime based on Node.js and a visual editor. The program is created in the browser by dragging functional nodes from the palette into the workspace and interconnecting them. The application is then deployed with a single click of the *Deploy* button. Additional nodes created by the programming community can easily be installed. Flows created in the workspace can be exported and shared as JSON (JavaScript Object Notation) files.

First, a connection to the local MQTT broker is required. Additional settings, such as login details or connection messages, can also be filled in. In this case, however, the default IP address 127.0.0.1 and port 1883 are sufficient.

Once the connection to the MQTT broker is implemented, the broker can be used as an I/O (input/output) node. In Figure 5, there is an input MQTT node connected to the local MQTT broker and subscribed to incoming messages from the **makeblock-tank/sensor/ultrasonic** topic. It is followed by the transformation function **ultrasonicTransformDB** (Figure 6), which transforms the incoming message into the required format. When editing a block of this type, it is possible to insert ordinary JavaScript code. The transformation function is executed whenever a message arrives on the given topic. First, the body of the message that came from the mechatronic system is obtained, and then a new object is prepared in the required format, suitable for storage in the database. Such a coupling is created for all available sensors of one mechatronic system.
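Since Figure 6 is not reproduced here, the following is a minimal sketch of what such a Node-RED function node could contain; the field names and the unit are assumptions, not taken from the original figure.

```
// Sketch of a Node-RED function node such as "ultrasonicTransformDB".
// The incoming msg.payload is the raw MQTT message published by the device;
// the field names and the unit below are assumptions, not taken from Figure 6.
const reading = JSON.parse(msg.payload);

// The InfluxDB output node stores msg.payload as the measurement fields,
// so the payload is rebuilt in that shape.
msg.payload = {
    distance: Number(reading.distance),   // measured distance (assumed field name and unit)
    device: 'makeblock-tank'              // identifier of the source device
};

return msg;
```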

**Figure 5.** Connection of the input MQTT node to the output node of the InfluxDB database.


**Figure 6.** MQTT message transformation function.

In Figure 7 it is possible to see the flow in Node-RED (scheme of all interconnected I/O nodes) for mechatronic device Makeblock Tank. Due to the fact that the system is designed with emphasis on modularity, it is possible to add another mechatronic device to the new tab, where the flow for this new device will be located.

**Figure 7.** Node-RED flow for mechatronic device Makeblock Tank.

If we would like to add another mechatronic device to the system, it is necessary to add another flow here. The flow determines which data should be transformed and in what way and, if necessary, in which specific database it should be stored. The flow can also be exported and dynamically loaded into Node-RED using the available API (https://nodered.org/docs/api/) (application programming interface).
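As a rough illustration of such dynamic loading, assuming a local Node-RED instance on port 1880 with its Admin API enabled, an exported flow configuration can be pushed over HTTP:

```
// Sketch: push an exported flow configuration to Node-RED through its Admin HTTP API.
// POST /flows sets the active flow configuration, so a real deployment would first merge
// the exported flow into the currently deployed flows.
const exportedFlows = require('./makeblock-tank-flow.json');   // exported from the Node-RED editor

fetch('http://127.0.0.1:1880/flows', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Node-RED-Deployment-Type': 'flows'   // deploy only the changed flows instead of a full restart
  },
  body: JSON.stringify(exportedFlows)
}).then(res => console.log('Deployment status:', res.status));
```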

#### 3.1.2. Definition Scheme for Generating a Graphical User Interface

To generate a customizable graphical interface for the control and monitoring of a mechatronic device in an augmented reality mobile application, it was necessary to design how this interface generation would be implemented. The concept of definition schemes for the dynamic generation of the graphical user interface is one of the pillars of modularity of the proposed system for control and monitoring of mechatronic systems using IoT and augmented reality. This is an original concept developed by the authors of the article. The scheme had to be designed with an emphasis on versatility, so that it would also be applicable when a display technology other than augmented reality is used. The JSON document format was selected, as the JSON data format is supported in every relevant programming language. This means that it can be parsed in software applications and it is easy to work with its objects and attributes. The first part of the scheme holds the network settings of the master node (MQTT broker), that is, the IP address and port it is located on. These properties are defined at the top of the JSON document as **url** and **port**. They are followed by the **topic** property, which corresponds to the MQTT topic and unambiguously determines the path of the mechatronic IoT device within the MQTT channel.

In the next part, there are two types of element labels (**sensors** and **controls**) in terms of their functionality. **Sensors** is used for reading elements—for example for data from sensors. **Controls** is for control elements. Both elements may contain multiple nested GUI (graphical user interface) elements according to the needs of the recognized mechatronic device.


In the current version of the definition scheme, three different GUI elements are available—**joystick**, **button** and **text**. **Joystick** and **button** belong to the control elements. They contain the following *properties*:


The last element is **text**, which contains the same properties as controls. However, it can also have the **unit** property, which adds a unit to the read values.

When this JSON document is parsed, it creates a GUI element and then creates a connection to the MQTT broker. The path will then look like this: **url:port/topic/subTopic**.

In Table 1, there is an overview of elements and their data types for the correct operation of parsing and subsequent generation of a graphical user interface. A question mark for a specific element property indicates that it is optional.


**Table 1.** Overview of elements and their data types in the definition scheme.

Figure 8 shows the definition scheme for the mechatronic device Makeblock Tank. The scheme is written in JSON format. In the first part, the access data (**url**, **port** and **topic**) are defined, which determine the network access for the given device.

Subsequently, depending on the configuration, the mechatronic device may contain two types of elements on the screen: for read (**sensors**) and control (**controls**). The first type is intended only for reading and displaying data from sensors. The controls are those that actively interfere with the device—so they access its actuators.

The specific types of controls and information elements that appear on the screen of a mobile device in augmented reality may be diverse in terms of functionality, so that the system can be expanded in the future. Three types were created for the selected device: **joystick**, **button** and **text**. The first two elements are used to perform the action. The text element is read-only. In the case of this element, there is also the **label** attribute. With this one, it is possible to add a description—for example the type of a specific sensor.

The displaying of control elements also depends on the set position (**posX** and **posY**): the controls are generated from the lower right corner upwards, while the reading elements are generated, for clarity, from the top left corner.

**Figure 8.** JSON definition scheme for mechatronic device Makeblock Tank.

The common attribute for both types is **subTopic**. It determines the network path where the actuator or sensor is located. If you create the entire path from the **url**, **port**, **topic**, **subTopic** attributes (e.g., for a joystick), it would look like this: **192.168.100.72:1880/makeblock-tank/motor**.
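Since Figure 8 is not reproduced here, the following sketch only illustrates the shape such a definition scheme could take; all concrete values, element positions and the exact attribute spelling are assumptions based on the description above.

```
// Illustrative definition scheme for a Makeblock Tank-like device; all values are assumed.
const deviceSchema = {
  url: "192.168.100.72",
  port: 1880,
  topic: "makeblock-tank",
  controls: [
    { type: "joystick", subTopic: "motor", posX: 0, posY: 0 },
    { type: "button",   subTopic: "led",   posX: 0, posY: 1 }
  ],
  sensors: [
    { type: "text", subTopic: "sensor/ultrasonic", label: "Distance", unit: "cm", posX: 0, posY: 0 }
  ]
};
// Full path for the joystick (url:port/topic/subTopic): 192.168.100.72:1880/makeblock-tank/motor
```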

Definition schemes with user interfaces for all available mechatronic devices have their own flow in Node-RED, where they are connected to the communication. At the top of Figure 9, you can see the *inject* nodes, which store the current version of the definition scheme. When the scheme is changed, it is necessary to activate the *inject* node, which sends the saved definition scheme to the next block. The scheme is inserted into the MongoDB database. Figure 9 also shows two different collections—**raspberry-pi** and **makeblock-tank**. In each of these collections, there is a scheme for a given device, but its element layout may be different. For example, there could be two mechatronic devices of the type Makeblock Tank, but each has a different identifier and different equipment with sensors and actuators.

The blocks at the bottom of Figure 9 are used to obtain a definition scheme for the application on a mobile device that provides augmented reality. The input is an HTTP GET node, which allows an HTTP endpoint to be created without having to program a server. This node has its *url* set to **/schema/:device**; the IP address and port of the local server are prepended to this path. The path also contains the HTTP parameter **:device**, which represents the name of the mechatronic device whose definition scheme we want to obtain. The *switch* node then evaluates which device it is and sends a request to the MongoDB database. When using an HTTP input node, it is necessary to have an HTTP output node as well. The HTTP output node provides the response for the called service, where the output is the data and an HTTP status code. The green nodes are only used to display and check values in Node-RED.
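On the mobile side, obtaining a scheme then amounts to a single HTTP GET request. The sketch below shows the idea in JavaScript with an assumed server address; the actual application issues the request from C# inside the Unity engine.

```
// Sketch: request the definition scheme of a recognized device from Node-RED.
// The server address is illustrative; ':device' is replaced by the recognized device name.
async function getDeviceSchema(deviceType) {
  const response = await fetch(`http://192.168.100.72:1880/schema/${deviceType}`);
  if (!response.ok) {
    throw new Error(`No scheme found for device ${deviceType} (HTTP ${response.status})`);
  }
  return response.json();   // the parsed JSON definition scheme
}

// Usage: getDeviceSchema('makeblock-tank').then(schema => console.log(schema.controls));
```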


**Figure 9.** The flow in Node-RED for JSON definition schemes of mechatronic devices.

## *3.2. Mobile Device and Augmented Reality Application*

The mobile application includes the Wikitude SDK, which provides methods and algorithms for recognizing 3D objects in a real environment using a mobile device camera. A software package in Unity engine format (*.unitypackage*) is available on the official website of this tool.

Real-world 3D objects are detected by Wikitude by trying to match pre-created references in the video stream from the mobile device's camera. Such a prepared reference is called an *Object Target*; it can also be understood as a SLAM map. An *Object Target* is created from input images of a real model (mechatronic device). These are then converted into a so-called *Wikitude Object Target Collection*, which is saved as a **.wto** file. The procedure for creating a collection is as follows:


In Figure 10 we can see a list of created Object Targets. Each of the objects contains the *Point Cloud*, which can be seen in Figure 11. It represents a 3D map which Wikitude uses for recognition of a 3D object. If the user finds Point Cloud insufficient at some angles, it can be expanded with additional photos.

**Figure 10.** List of objects in the collection.

**Figure 11.** Point Cloud (map of significant points) of the processed 3D object—mechatronic device Makeblock Tank.

#### Generating the User Interface Using the Definition Scheme

The generation of a dynamic graphical user interface takes place at the Unity engine level via a definition scheme. Its format has been described in the previous text. A sequence diagram describing the method of generating a GUI in the Unity engine is shown in Figure 12.

After launching the mobile application, all necessary libraries are initialized. Each application created in Unity contains a void Start() method with initialization code. First, an asynchronous connection to the MQTT broker is made using the Connect() method. The Unity mobile application is then ready to recognize mechatronic devices. If a device is recognized, we get its name in the method named OnObjectrecognized(ObjectTarget). This name is sent as an HTTP GET request to Node-RED using the GetDeviceSchema(deviceType) method, where a connection is established in a separate thread of the mobile application and the scheme is obtained. Subsequently, Node-RED starts the programmed flow (Figure 9). The obtained definition scheme with GUI elements is processed in the mobile application using a JSON parser and then the methods InitUIDeviceControls(DeviceSchema) and InitUIDeviceSensors(DeviceSchema) are called.

These methods initialize prefabs of native UI elements in the Unity engine. UI element prefab is created in the hierarchy of game objects and all required visual properties are added to it. When this *GameObject* is set up as needed in the hierarchy, it is necessary to move it to the project structure in the *Prefabs* folder. This saves the prefab and can then be initialized at any time in the Unity application.

The generation itself works in such a way that the parsed definition schema (parsed JSON) is sent to the InitUIDeviceControls method. It determines which elements are available according to the attributes of the object. Depending on the element, it is possible to initialize a prefab named **button** or **joystick**.
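The dispatch logic can be sketched roughly as follows. The real implementation is C# in the Unity engine, so this JavaScript fragment is only an illustration and the helper name is hypothetical.

```
// Illustration of the control-generation logic (the real code is C# in the Unity engine).
// instantiatePrefab is a hypothetical helper wrapping prefab instantiation and placement.
function initUIDeviceControls(schema) {
  for (const control of schema.controls) {
    switch (control.type) {
      case 'joystick':
      case 'button':
        instantiatePrefab(control.type, control);   // create the prefab and place it on the canvas
        break;
      default:
        console.warn(`Unknown control type: ${control.type}`);
    }
  }
}
```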

During initialization, it is necessary to insert the generated object also on the canvas, which renders the UI objects. Otherwise, they would not appear in the mobile application.

The sequence diagram (Figure 12) makes clear how the whole process works in the mobile application. The user aims the mobile device at a mechatronic device/object that the application can recognize. After the whole process of device recognition and obtaining the necessary definition schemes and data, the GUI is displayed, as can be seen in Figure 13. In terms of *user experience*, the control part of the UI was situated in the right part of the generated GUI on the mechatronic device, since the lower right part can be reached easily with the thumb. In the described case of the Makeblock Tank device, only the **joystick** is needed; it is used to control the tracks and thus the movement of the device. Elements such as sensor data, which do not need to be tapped, are visualized in augmented reality at the top of the device.

**Figure 12.** Sequence diagram: How a user interface is generated in Unity engine.

**Figure 13.** Recognized mechatronic device with dynamically generated GUI in augmented reality.

#### *3.3. Laboratory Mechatronic Device*

To implement and verify the method of monitoring and control of mechatronic systems using IoT and augmented reality, it was necessary to build a suitable laboratory physical model. For this purpose, the mechatronic kit Makeblock Ultimate 2 was selected. The Robotic Arm Tank model was built, which best met the requirements for testing and demonstration of the developed software system. A diagram of the laboratory model showing the connected electronic systems, sensors and actuators can be seen in Figure 14.

The main electronic element of the laboratory model is the Makeblock MegaPi development board, which is built on the Arduino MEGA 2560 platform, while supporting programming using the Arduino IDE. The development board contains three ports for connecting motors with an encoder. Sensors for measuring the distance of the vehicle from the obstacle (ultrasonic sensor) and a humidity/temperature sensor were connected. RGB LED (Light-Emitting Diode) was connected as another actuator.

A key element of the laboratory mechatronic system is its wireless control. The kit contained a Bluetooth communication module, which was not suitable, as it was designed to control the device with a ready-made application from Makeblock. Therefore, another alternative was chosen—connecting the MegaPi to a Raspberry Pi 3 microcomputer, which has a built-in WiFi module. The MegaPi is prepared for connection to the Raspberry Pi, as it has three screw holes in the same places as the Raspberry Pi 3, and it also allows serial communication with this microcomputer.

Laboratory mechatronic device Makeblock Robotic Arm Tank can be seen in Figure 15.

#### **Mechatronic device**

**Figure 14.** Illustrative diagram of mechanical and electronic elements of mechatronic device.

**Figure 15.** Laboratory mechatronic device Makeblock Robotic Arm Tank.

After assembling the laboratory mechatronic system, MegaPi had to be programmed. The mBlock tool is available for teaching programming. However, in the described solution, it is necessary to have the complete MegaPi functionality available and to access the device control using the API interface. This is suitable to implement via the Arduino IDE, where the complete service code is loaded to access all sensors and actuators that Makeblock can work with. It is then possible to access the sensors and actuators by calling the API interface, specifically in our case by sending parameters from the Raspberry Pi via a serial line. These parameters can be sent, for example, using a library in JavaScript or Python. In our case we use JavaScript.

Before using the library, it is necessary to upload the service code to the MegaPi microcontroller via the Arduino IDE. This means that it is no longer necessary to program all the functionalities for actuators and sensors manually; instead, an interface for higher-level programming languages is made available, as mentioned above.

Once the MegaPi is physically connected to the Raspberry Pi 3 and the service code is loaded, the next step is to install the MegaPi control library. This can be done using the *Node Package Manager* (NPM) and the **npm install megapi** command. During development, it was found that this interface is not sufficiently maintained and some methods are no longer functional. It was therefore necessary to study how the communication works from the publicly available source code. Subsequently, adjustments had to be made to the service source code as well as to the JavaScript library code.

At the beginning of the JavaScript program, the MegaPi constructor is called, where two parameters are required:


Subsequently, it is possible to call functions for access to ports and sensors. All functions are described on GitHub (https://github.com/Makeblock-official/NodeForMegaPi).
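A minimal usage sketch could look as follows; the constructor and method names are assumptions inferred from the description above (two required parameters: the serial device and an on-ready callback) and should be checked against the GitHub repository, since the exact signatures are not reproduced in this article.

```
// Hypothetical usage sketch of the megapi library; the constructor and method names
// are assumptions and must be verified against the repository linked above.
const megapi = require('megapi');

// Assumed constructor parameters: the serial device of the MegaPi board
// and a callback invoked once the serial connection is ready.
const board = new megapi.MegaPi('/dev/ttyAMA0', () => {
  // Assumed API call: run the motor connected to port 1 at speed 100.
  board.motorRun(1, 100);
});
```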

#### *3.4. Device Twin in Azure Cloud*

The system design assumes the synchronization (Figure 16) of current data from the sensors of the mechatronic system to the cloud using *device twin*. This technology is available as part of the Microsoft Azure IoT Hub and is created for each added mechatronic device.

**Figure 16.** Device twin in Azure cloud.

Azure IoT Hub is a managed service that runs in the Microsoft Azure cloud. It serves as a central point for two-way communication between IoT applications and IoT devices. Azure IoT Hub can be used to create Internet of Things solutions with reliable and secure communication between the cloud-based backend and millions of IoT devices. It is possible to connect virtually any device to it.

Device twins are JSON documents that store information about the state of an IoT device, including configurations, metadata, and conditions. The Azure IoT Hub maintains such a device twin for each IoT device that is connected to the IoT Hub [31].
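For illustration, a device twin document has roughly the following shape. The top-level structure (deviceId, tags, properties.desired and properties.reported) follows the standard Azure IoT Hub device twin format; the concrete sensor fields shown for the laboratory device are assumptions.

```
// Sketch of an Azure IoT Hub device twin document for the laboratory device.
// The top-level structure (deviceId, tags, properties.desired/reported) is the standard
// device twin format; the concrete fields and values are assumptions.
const deviceTwin = {
  deviceId: "makeblock-tank",
  tags: { location: "laboratory" },
  properties: {
    desired: {                       // configuration pushed from the cloud to the device
      telemetryIntervalMs: 1000
    },
    reported: {                      // last state reported by the device
      ultrasonicDistanceCm: 42.5,
      temperatureC: 23.1,
      humidityPercent: 40
    }
  }
};
```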

#### **4. Conclusions**

The article deals with a modern form of control and monitoring of mechatronic systems connected within the Internet of Things networks using augmented reality.

Based on the analysis of available literature sources and recent research projects, it was found that methods for control and monitoring of mechatronic systems connected within IoT using extended reality are implemented in the form of various prototype solutions for dedicated devices or as closed-source single-purpose application systems. Such systems are not complete, and without a modification of the client software application it is not feasible to extend the control and monitoring to different mechatronic devices; they therefore cannot be considered generalized and modular solutions. Excursions and discussions with industrial partners have shown that there is a growing interest in such comprehensive solutions. In the context of the ongoing Industry 4.0 industrial revolution, small and medium-sized enterprises are already interested in implementing modern digital technologies, such as the Internet of Things, cloud and extended reality, into their manufacturing processes.

The result of the proposed research is a modular program system with a new form of human-machine interface (HMI) implemented in augmented reality. The graphical user interface uses a new concept of definition schemes for its dynamic generation. In the proposed solution, modern detection and recognition methods of 3D objects in the augmented reality are used instead of conventional methods of control and monitoring of mechatronic IoT systems based on scanning QR codes.

The scientific and application contributions can be summarized in the following points:

• Design of a modern form of control and monitoring of mechatronic systems using the Internet of Things and augmented reality.


An important result of this part of the presented work is the design and development of an application platform of a modular solution with a modern HMI form in augmented reality. The proposed solution complements and improves conventional augmented reality-based methods of control and monitoring of mechatronic IoT systems (relying on QR code scanning). The new application platform represents an original approach based on modern software for detecting and recognizing 3D objects. The generalizability and modularity of the developed solution is supported by the original concept of definition schemes for dynamic generation of a graphical user interface in augmented reality.

It turned out that the Wikitude SDK is sufficiently robust and advanced to scan the mechatronic device even in different lighting conditions, although this was not the primary aim of this research.

In the discussion [32], a situation is addressed in which it was necessary to recognize marks and objects under very low light conditions (less than 10 lux). The Wikitude SDK can handle this situation as well. Wikitude is able to recognize objects with high success [33] even in low light or in the presence of shadows, reflections or noisy surroundings.

If the lighting conditions are very poor, it is possible to use the Input Plugins API [34], as suggested in the discussion in Reference [32]. The input image is first preprocessed, for example, using the OpenCV library; the resulting image is then processed by Wikitude and evaluated in real time. In such a case, it is possible to obtain very acceptable recognition of marks and objects, as can be seen in the video in Reference [35].

Possible limitations of the presented system result from the fact that in a digital factory there can be several devices with a similar shape or devices with large dimensions. In this case, it would be possible to use custom 3D identifiers, unique for each machine. However, such identifiers are difficult to copy with sufficient accuracy using only a set of photographs. The 3D identifiers open up new ways and opportunities for future research and development of optimal shapes, dimensions and algorithms for generating them using generative modeling. It is expected that 3D identifiers will be produced using 3D printing technology, which makes it possible to create and test identifiers of different shapes and dimensions without the need to manufacture expensive conventional molds for plastic castings.

The scientific and application contribution to the field of the Internet of Things and extended reality, as declared in the four points above, consists in the developed original solution, which is generalizable and modifiable for further research and technical practice in Industry 4.0.

#### **5. Patents**

Proposed "Method for monitoring and control of mechatronic systems using augmented reality" is patent pending under application number 158-2019 in Industrial Property Office of the Slovak Republic [https://wbr.indprop.gov.sk/WebRegistre/Patent/Detail/158-2019].

**Author Contributions:** E.S. and E.K. proposed the idea in this paper and prepared the software application; O.H., P.D. and R.L. designed the experiments, E.S. and E.K. performed the experiments; O.H., and R.K. analyzed the data; E.K. wrote the paper; E.S., O.H., P.D. and R.L. edited and reviewed the paper. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work has been supported by the Cultural and Educational Grant Agency of the Ministry of Education, Science, Research and Sport of the Slovak Republic, KEGA 038STU-4/2018 and KEGA 016STU-4/2020, by the Slovak Research and Development Agency APVV-17-0190, and by the Tatra banka Foundation within the grant programme Quality of Education, project No. 2019vs056 (Virtual Training of Production Operators in Industry 4.0).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Lexicon-based Sentiment Analysis Using the Particle Swarm Optimization**

#### **Kristína Machová 1 , Martin Mikula <sup>1</sup> , Xiaoying Gao <sup>2</sup> and Marian Mach 1,\***


Received: 26 May 2020; Accepted: 13 August 2020; Published: 15 August 2020

**Abstract:** This work belongs to the field of sentiment analysis; in particular, to opinion and emotion classification using a lexicon-based approach. It solves several problems related to increasing the effectiveness of opinion classification. The first problem is related to lexicon labelling. Human labelling in the field of emotions is often too subjective and ambiguous, and so the possibility of replacement by automatic labelling is examined. This paper offers experimental results using a nature-inspired algorithm—particle swarm optimization—for labelling. This optimization method repeatedly labels all words in a lexicon and evaluates the effectiveness of opinion classification using the lexicon until the optimal labels for words in the lexicon are found. The second problem is that the opinion classification of texts which do not contain words from the lexicon cannot be successfully done using the lexicon-based approach. Therefore, an auxiliary approach, based on a machine learning method, is integrated into the method. This hybrid approach is able to classify more than 99% of texts and achieves better results than the original lexicon-based approach. The final hybrid model can be used for emotion analysis in human–robot interactions.

**Keywords:** sentiment analysis; opinion classification; lexicon-based approach; hybrid approach; lexicon generation; lexicon labelling; particle swarm optimization

#### **1. Introduction**

Online discussions generate a huge amount of data every day, which are hard to process manually by a human. The processing of this discourse content of social networks can bring useful information about the opinions of the crowd on some web trend, political event, person, or product. Approaches to sentiment analysis, particularly to opinion classification, can be used in recognition of antisocial behavior in online discussions, which is a hot topic at present. Negative opinions are often connected with antisocial behavior; for example, "trolling" posting behaviors. This approach can also be used in HRIs (Human–Robot Interactions), where a robot can use information about the polarity of an opinion or mood of the human, in order to communicate appropriately. When a robot communicates with a human (e.g., an elder), it must choose one from many answers which are suitable to the situation. For example, it can choose an answer which can cheer up the person, if it has information that the current emotional situation/mood of the person is sad. It can also adapt its movements and choose a movement from all possible movements, in order to cheer up the human. Therefore, understanding of the emotional moods of humans can lead to better acceptance of communication with robots by humans.

Opinion analysis can be achieved using either a lexicon-based approach or a machine learning approach. These approaches are used, in opinion analysis, to distinguish between positive or negative (sometimes also neutral [1]) opinions with respect to a certain subject. Machine learning approaches are most often based on the Naive Bayes classifier, Support Vector Machines, Maximum Entropy, k-Nearest Neighbors [2–4], or Deep Learning (i.e., based on neural network training) [5,6]. The study [5] presented an approach based on a new deep convolutional neural network which exploits character- to sentence-level information to perform sentiment analysis of short texts. This approach was tested on movie reviews (SSTb, Stanford Sentiment Treebank) and Twitter messages (STS, Stanford Twitter Sentiment). For the SSTb corpus, they achieved 85.7% accuracy in binary positive/negative sentiment classification, while for the STS corpus, they achieved a sentiment prediction accuracy of 86.4%. The study [6] showed that deep learning approaches have emerged as effective computational models which can discover semantic representations of texts automatically from data without feature engineering, and presented deep learning approaches as a successful tool for sentiment analysis tasks. They described methods to learn a continuous word representation: the word embedding. As a sentiment lexicon is an important resource for many sentiment analysis systems, the work also presented neural methods to build large-scale sentiment lexicons.

The study [7] presented an approach for classifying a textual dialogue into four emotion classes: happy, sad, angry, and others. In this work, sentiment analysis is represented by emotion classification. The approach ensembled four different models: bi-directional contextual LSTM (BC-LSTM), categorical Bi-LSTM (CAT-LSTM), binary convolutional Bi-LSTM (BIN-LSTM), and Gated Recurrent Unit (GRU). Two of these systems achieved Micro F1 = 0.711 and 0.712; merging the two systems by ensembling achieved Micro F1 = 0.7324.

However, machine learning methods have a disadvantage: they require a labelled training data set to learn models for opinion analysis. An interesting fact is that the lexicon approach can be used for the creation of labelled training data for machine learning algorithms.

On the other hand, the lexicon-based approach requires a source of external knowledge in the form of a labelled lexicon, which contains sentiment words with an assigned polarity of the opinion expressed by each word. The polarity of opinion has the form of a number, assigned to each word in the lexicon, which indicates how strongly the word is correlated with positive or negative polarity. However, the availability of such resources is very unbalanced across different languages. In this paper, we focus on adapting and modifying existing approaches to the Slovak language.

This work focuses on an opinion analysis based on a lexicon. In the process of lexicon creation, the lexicon must be labelled to find optimal values for the polarity of words in the lexicon. To assign correct polarity values to words, a human annotator is needed for manual labelling. The manual labelling is time-consuming and expensive. Thus, we tried to replace a human annotator by the Particle Swarm Optimization (PSO) algorithm, as lexicon labelling can be considered an optimization problem. The goal of labelling is to find an optimal set of polarity values; that is, labels for all words in the generated lexicon. These labels are optimized recursively until the opinion classification of texts in data sets using this lexicon with the new labels gives the best results. Therefore, the resulting values of the Macro F1 measure of the opinion classification represent the values of the fitness function in the optimization process. We compare the effectiveness of opinion classification using the lexicon labelled by PSO and using the lexicon annotated by a human.

On the other hand, even when we use the best lexicon, it may still not cover all sentiment words. For this reason, some analyzed posts could not be classified as having positive or negative opinion. To solve this problem, we extend the lexicon approach with a machine learning module, in order to classify unclassified posts using the lexicon-based approach. This module was trained on training data labelled using a lexicon approach for opinion classification. We applied the Naive Bayes classifier to build the module.

The contributions of the paper are as follows:


The proposed approach is focused on lexicon-based sentiment analysis. The effectiveness of a lexicon approach to sentiment analysis depends on the quality of the lexicon used. The quality of the lexicon is influenced by the selection of words in the lexicon, as well as by the precision of the estimated polarity values of the words in the lexicon. Our approach uses PSO and BBPSO for the optimal estimation of polarity values. A deep learning method cannot satisfactorily generate the polarity values of words in the lexicon, as clear information about these weight values is lost in the large number of inner layers involved. On the other hand, deep learning can be successfully used in the auxiliary model for the hybrid approach trained by machine learning methods. It is generally assumed that deep learning can achieve better results than the Naive Bayes method in the field of text processing.

#### **2. Related Works**

Lexicon-based approaches to opinion analysis require a sentiment lexicon to classify posts as having positive or negative opinion. The lexicon can be generated in three ways: manually, automatically, and semi-automatically. Manually generated lexicons are more accurate and usually involve only single words. They can be translated from another language or collected from a corpus of texts. The value of polarity can then be copied from the original lexicon or calculated from the corpus based on some metrics. However, this approach is time-consuming. A majority of lexicons separate words into positive and negative groups [8] or provide additional types of words, such as intensifiers (words that can shift polarity) [9,10]. On the other hand, lexicons such as the Warriner lexicon [11] or the crowdsourced word–emotion association lexicon [12] provide additional information about the value of polarity for each word. Polarity values allow us to compare the polarities of words and to find the more positive and negative words.

Automatically generated lexicons require less human effort. They assign polarity values based on relationships between words in existing lexicons (e.g., SentiWordNet) [13]. These lexicons contain automatically annotated WordNet synsets, according to their degrees of positivity, negativity, and neutrality. In the WordNet-Affect [14], emotional values were added to each WordNet synset. SenticNet [15] includes common-sense knowledge, which provides background information about words. The main weakness of automatically generated lexicons is that they might contain words without polarity or incorrectly assigned polarities. For this reason, the semi-automatic generation of lexicons was introduced. These lexicons are created automatically and are then manually corrected by a human.

Various optimization methods can be used for lexicon labelling. For example, the study [16] presented a global optimization framework which provides a unified way to combine several human-annotated resources for learning the 10-dimensional sentiment lexicon SentiRuc. By minimizing the error function, an optimal labelling of the lexicon can be found. The work also presented a sentiment disambiguation algorithm in order to increase the flexibility of this lexicon. The sentiment disambiguation experiments achieved good results (Accuracy up to 0.987), but the sentiment classification experiments based on different lexicons achieved F1 rate values between 0.383 and 0.726.

Several studies have also used nature-inspired algorithms for text classification. In the study [17], Particle Swarm Optimization was applied to find the most useful attributes, which were added as an input for a framework based on Conditional Random Field. PSO has also been used to select attributes and combined with Support Vector Machines to classify reviews [18]. In this paper, PSO is used to generate numbers which represent the polarity values of specific words in the lexicon.

Escalante et al. proposed an approach for increasing the effectiveness of learning term-weighting schemes using a genetic program [19]. The schemes were used to improve classification performance. Standard term-weighting schemes were combined with the new term-weighting schemes, which were more discriminative due to the use of the genetic algorithm. They reported an experimental study comprising a data set for thematic and non-thematic text classification, as well as for image classification. Unlike their approach, we use a genetic program to find not only the weights of words, but values of their opinion polarity as well, which is a different problem and cannot be computed only based on the frequency of words in the text. Nevertheless, the average result of their best-performing approaches for all data sets was F1 = 0.775. Our average result for all data sets was Macro F1 = 0.759, which is comparable with the results in [19].

The study [20] proposed an approach to simultaneously train a vanilla sentiment classifier and adapt word polarities to the target domain. The adaptation was based on tracking wrongly predicted sentences and using them for supervision. In contrast, our approach builds a domain-independent lexicon of labelled words. In that paper, the best result of testing on the Movie data set was Accuracy = 0.779. Our best results on the Movie data set (MacroF1 = 0.743, see Table 11) were comparable with the results in [20]. It is easier to achieve higher results in Accuracy than in the F1 rate, even though both measures of classification effectiveness consider both false positive and false negative classifications.

#### *2.1. Nature-Inspired Optimization*

Nature-inspired algorithms are motivated by biological systems such as beehives, anthills, and swarms of fish, birds, and so on. They investigate the behaviors of individuals in a population, their mutual interactions, and their interactions with an environment. For example, PSO was inspired by a flock of birds searching for food. We suppose that only some birds know about food and where it is situated. Therefore, the best strategy is to follow the individual nearest to the food. Every individual in a population represents a bird and has a fitness value in the search space.

Particle Swarm Optimization is an optimization algorithm which is inspired by a flock of birds. PSO converges to the final solution; in this case, it has the form of the best-labelled lexicon. The possible solutions are called particles, which are parts of the population. Each particle keeps its best solution (evaluated by the fitness function) called *pbest*, while the best value chosen from the whole swarm is called *gbest.* The standard PSO consists of two steps: change velocity and update positions. In the first step, each particle changes its velocity towards its *pbest* and *gbest* [21]. In the second step, the particle updates its position. A new position is calculated, based on previous position and a new velocity. Each particle is represented as a vector in a D-dimensional space. The ith particle can be represented as *X<sup>i</sup>* = (*xi*1,*xi*2, . . . ,*xiD*). The velocity of the ith particle is represented as *V<sup>i</sup>* = (*vi*1,*vi*2, . . . ,*viD*) and the best previous position of the particle is represented as *P<sup>i</sup>* = (*pi*1,*pi*2, . . . ,*piD*). The best particle in the swarm is represented by *g* and *w* is the inertia weight, which balances the exploration and exploitation abilities of the particles. The velocity and position are updated using Equations (1) and (2):

$$v\_{id}^{n+1} = w v\_{id}^n + c\_1 r\_1^n (p\_{id}^n - x\_{id}^n) + c\_2 r\_2^n (p\_{gd}^n - x\_{id}^n) \tag{1}$$

$$x\_{id}^{n+1} = x\_{id}^n + v\_{id}^{n+1} \tag{2}$$

where


The stopping criterion of the algorithm often depends on the type of problem. In practice, PSO is run until a fixed number of function evaluations has been carried out or an error bound is reached.
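As a concrete illustration of Equations (1) and (2), the update of one particle can be sketched as follows; the inertia and acceleration coefficients are placeholders rather than the values used in the experiments described later, and clamping the labels to the range from −3 to 3 is an assumption of this sketch.

```
// Sketch of one PSO iteration step for a single particle (Equations (1) and (2)).
// The coefficients w, c1 and c2 are placeholders, not the article's experimental settings.
function updateParticle(particle, gbest, w = 0.7, c1 = 1.5, c2 = 1.5) {
  for (let d = 0; d < particle.position.length; d++) {
    const r1 = Math.random();
    const r2 = Math.random();
    // Equation (1): new velocity from the inertia, cognitive and social components.
    particle.velocity[d] =
      w * particle.velocity[d] +
      c1 * r1 * (particle.pbest[d] - particle.position[d]) +
      c2 * r2 * (gbest[d] - particle.position[d]);
    // Equation (2): move the particle; here a position is one word's polarity label.
    particle.position[d] += particle.velocity[d];
    // Keep the label inside the polarity range from -3 to 3 (assumption of this sketch).
    particle.position[d] = Math.max(-3, Math.min(3, particle.position[d]));
  }
}
```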

PSO uses *pbest* and *gbest* to update the position of a particle. The impacts of these values were studied in [23]. In this work, *pbest* and *gbest* were set as constants and the trajectories of the particles were investigated. These results show that the trajectory can be determined by the difference between *pbest* and *gbest*. These positions can determine the particle's movement. Based on this knowledge, a new PSO method was designed; the so-called Bare-bones PSO (BBPSO). BBPSO uses a Gaussian distribution *N*(µ,σ) with the mean µ and standard deviation σ, as shown in Equation (3):

$$\mathbf{x}\_{\rm id}^{t+1} = \begin{cases} \ N(\mu, \sigma), & \text{rand}() < 0.5\\ p\_{\rm id}^t, & \text{otherwise} \end{cases} \tag{3}$$

where µ is the center of *pbest* and *gbest,* and σ is the absolute difference between *pbest* and *gbest*. The rand() function is used to speed up convergence by retaining the previous best position, *pbest*.
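A rough sketch of this BBPSO position update (Equation (3)) is shown below; the Gaussian sampling uses a Box-Muller transform, which is an implementation choice rather than something prescribed by the cited work.

```
// Sketch of the BBPSO position update (Equation (3)); mu and sigma come from pbest and gbest.
// The Gaussian sample is drawn with a Box-Muller transform (implementation choice).
function randomNormal(mean, stdDev) {
  const u1 = Math.random() || Number.MIN_VALUE;   // avoid log(0)
  const u2 = Math.random();
  const z = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
  return mean + stdDev * z;
}

function bbpsoUpdate(particle, gbest) {
  for (let d = 0; d < particle.position.length; d++) {
    const mu = (gbest[d] + particle.pbest[d]) / 2;          // center of pbest and gbest
    const sigma = Math.abs(gbest[d] - particle.pbest[d]);   // absolute difference of pbest and gbest
    // Equation (3): with probability 0.5 draw a new position, otherwise keep pbest.
    particle.position[d] = Math.random() < 0.5
      ? randomNormal(mu, sigma)
      : particle.pbest[d];
  }
}
```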

#### *2.2. Naive Bayes Learning Method*

Naive Bayes is a probabilistic classifier based on Bayes' theorem and the independence assumption between attributes.

For n observations or attributes (respectively, words) *x*1, . . . , *xn*, the conditional probability for any class *y<sup>j</sup>* can be expressed as Equation (4):

$$P(y\_j \mid x\_1, \ldots, x\_n) = \beta\, P(y\_j) \prod\_{i=1}^{n} P(x\_i \mid y\_j) \tag{4}$$

This model is called the Naive Bayes classifier. Naive Bayes is often applied as a baseline for text classification [2]. We used the Naive Bayes learning method in two ways:


When we used Naive Bayes for labelling words in a lexicon, all labelled words in the lexicon played the role of attributes in the Bayes learning method. We had to calculate a numerical value representing the measure of polarity of each word in the lexicon. This value is based on the probability that the word belongs to a class (positive or negative). We needed a training data set to calculate these probabilities. The training data set was used to calculate the probability *P* that a word *w* from the post text relates to each class *c* (positive or negative). Labels assigned using this probability were used to build a lexicon. The probability can be calculated by the simple probability method described by Equation (5):

$$P(w\_c) = \frac{\sum w\_c}{\sum w} \tag{5}$$

where


In the case that a word is not assigned to a specific class, the probability would be zero; therefore, a method which returns a very small number instead of zero was implemented.
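A minimal sketch of this word-probability computation (Equation (5)) with the zero-probability floor might look as follows; the floor value and the data structures are assumptions.

```
// Sketch of Equation (5) with a small floor value instead of a zero probability.
// counts[word] = { positive: <occurrences in positive posts>, negative: <occurrences in negative posts> }
const EPSILON = 1e-6;   // assumed "very small number" returned instead of zero

function wordClassProbability(counts, word, cls) {
  const entry = counts[word];
  if (!entry) return EPSILON;                        // the word never occurs in the training data
  const inClass = entry[cls] || 0;                   // occurrences of the word in class cls
  const total = (entry.positive || 0) + (entry.negative || 0);
  if (inClass === 0 || total === 0) return EPSILON;  // avoid a zero probability
  return inClass / total;                            // Equation (5): sum of w_c divided by sum of w
}
```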

#### **3. Lexicon Generation**

There are many approaches for the generation of a lexicon. A lexicon can be generated for a given domain. This lexicon is obviously very precise in this domain, but usually has a weak performance in different domains. Another way is to generate a general lexicon. This lexicon usually has the same effectiveness in all domains, which is mostly not very high.

We generated two lexicons to analyze opinion in Slovak posts using a lexicon approach. We used two different methods of generation: translation and composition from many relevant lexicons. The first (Big) lexicon was translated from English and then extended by a human. It was enlarged by domain-dependent words, in order to increase its effectiveness. The domain-dependent words were words which may be common, but which have different meanings in different domains. For example, the word "long" has a different opinion polarity in the electrical domain (i.e., "long battery life") than in the movie domain (i.e., "too long movie"). Thus, the Big lexicon was domain-dependent, which is its disadvantage. For this reason, we decided to generate another new (Small) lexicon. This lexicon was expected to be domain-independent, as it was extracted from six English lexicons in which only domain-independent words were included. Domain-independent words are words which have the same meaning in different domains. They were analyzed and only overlapping words from all lexicons were picked up. The advantage of the Small lexicon is its smaller size, in comparison with the Big lexicon. This is an important feature, as each particle in our PSO implementation represents the whole labelled lexicon; more precisely, a set of polarity values for all words in the lexicon. A smaller lexicon, thus, means that a smaller set of labels must be optimized. So, the size of the lexicon influences the time needed to find the optimal solution. The words in the lexicon were selected once, but the labels of those words (i.e., polarity values) were found by optimization using PSO and BBPSO in 60 iterations. During optimization, the labels of the words were recursively changed many times, until the fitness function gave satisfactory results.

For both lexicons, three versions were generated: the first version was labelled manually by a human annotator, the second version was labelled by PSO, and the third one by BBPSO. All versions were then used for opinion analysis of post texts in the Slovak language and tested. We could have engaged more annotators in the process of human labelling, but the subjectivity of labels would remain, and averaging the values of labels may obscure the accurate estimation of the word polarities by one expert human.

The Big lexicon was generated by manual human translation from an original English lexicon [10], which consists of 6789 words including 13 negations. The generated Slovak lexicon was smaller than the original lexicon, as some words have fewer synonyms in Slovak than in English. Only positive and negative words were translated to Slovak. Synonyms and antonyms of the original words were found in a Slovak thesaurus. The thesaurus was also used to determine intensifiers and negations. The final Big lexicon consisted of 1430 words: 598 positive words, 772 negative words, 41 intensifiers, and 19 negations. The first version of this lexicon was labelled by a human. A range of polarity from −3 (the most negative word) to +3 (the most positive word) was chosen to assign a polarity value to each word. For each word in the lexicon, the English form was verified by double translation. "Double translation" means that each word from the lexicon was translated into English and then back to Slovak; if the word had the same meaning before and after translation, the final form of the word was added to the lexicon.

The Small *lexicon* was derived from six different English lexicons, as used in the works [10,15,24–27]. The English lexicons were analyzed and only overlapping words were chosen to form the new lexicon. To translate these words to Slovak, the English translations from the Big lexicon were used. Overlapping words were found, and their Slovak forms were added to the lexicon. This new lexicon contained 220 words, including 85 positive words and 135 negative words. Intensifiers and negations were not added, as they were not included in all original lexicons. The first version of the lexicon was labelled manually, with a range of polarity from −3 to 3. The details of the lexicons used for the creation of the Small lexicon are as follows:


## **4. Lexicon Labelling Using Particle Swarm Optimization**

Particle Swarm Optimization (PSO) was chosen as a method for the lexicon labelling, as labelling is an optimization problem where a combination of values of labels for all words in a lexicon has to create the best overall evaluation of the polarity of a given text. PSO is an efficient and robust optimization method, which has been successfully applied to solve various optimization problems. In the optimization process of lexicon labelling using PSO, each particle represents one version of the lexicon for labelling. A lexicon can be encoded as a vector *X<sup>i</sup>* = (*xi*1, *xi*2, . . . , *xiD*). Each word of the lexicon is labelled by a number, representing the measure of polarity from negative to positive *xij* ∈ {−3,3}, *i* = 1, 2, . . . , *N*, where *N* is the number of particles and *j* = 1,2, . . . ,*D*, where *D* denotes the number of words in the lexicon. Thus, the particle size depends on the size of the lexicon.

From the Big lexicon, only positive and negative words were used. Therefore, the particle size was decreased from 1430 to 1370 polarity values. The particle representing the Small lexicon had 220 polarity values. The designed approach is described in the following Algorithm 1:


The goal of labelling a lexicon using PSO is to find an optimal set of polarity values for all words in this lexicon. One position of a particle represents one potential solution (one set of labels of words), which is recursively changed during the process of optimization. Each potential solution can be represented as a vector in D-dimensional space, where D is the number of words in the lexicon. In our approach, the initial population was generated randomly and then the PSO method was applied. In PSO optimization, each particle was evaluated based on the fitness function (values of Macro F1). For each particle, *pbest* (particle best) was set, and *gbest* (global best) was searched for over the whole swarm. For the next iteration, the velocity of each particle was calculated using Equation (1) based on its *pbest* and *gbest*, and then the position of the particle was updated using Equation (2). The particle was then evaluated again, and *pbest* and *gbest* were updated again. This process was run recursively until a fixed number of iterations was reached. For the experiments with standard PSO, the following parameters were used:


## *4.1. Labelling by Bare-Bones Particle Swarm Optimization*

Bare-Bones PSO uses a different approach to find an optimal polarity value for each word in a lexicon. BBPSO works with a mean and standard deviation of a Gaussian distribution. The mean and deviation are calculated from *pbest* and *gbest*. The process of labelling is shown in the following Algorithm 2:

#### **Algorithm 2:** BBPSO algorithm

```
generate the initial population
for number of iterations do
     for each particle_i do
          φi ← evaluate particle_i using the fitness function
          ζi ← value of the fitness function for pbest of particle_i
          if φi > ζi then update pbest
          end if
          if fitness(pbest) > fitness(gbest) then update gbest
          end if
     end for
     for each particle_i do
          for each dimension d do
               compute new position using the Gaussian distribution
          end for
     end for
end for
return the value of the gbest particle
```
BBPSO uses a Gaussian distribution *N*(µ*id*,σ*id*) with mean µ*id* and standard deviation σ*id*. These values are calculated using Equations (6) and (7):

$$\mu\_{id} = \frac{gbest\_d + pbest\_{id}}{2} \tag{6}$$

$$\sigma\_{id} = \left| gbest\_d - pbest\_{id} \right| \tag{7}$$

where


#### *4.2. Fitness Function for Optimization*

The fitness function was based on the lexicon approach used to classify the opinion of post texts in the data sets. This classification was performed repeatedly with all lexicons generated by PSO (or BBPSO). Every opinion classification using a particular lexicon was evaluated by the F1 rate, which is the harmonic mean of Precision and Recall, calculated by Equation (8). The F1 rate played the role of the fitness function.

$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \tag{8}$$

The opinion classification was implemented in the following way: Each input post text was tokenized. Each word was compared with words in the temporary test lexicon. If the word was found in the dictionary, the polarity value of the post was updated. If the word was positive, the polarity of the post was increased and if the word was negative, the post polarity was decreased, according to Equation (9):

$$P\_p = \sum p\_{vw} \tag{9}$$

where


Precision and Recall were calculated based on the comparison of the automatically assigned labels with the gold-standard class labels. These were applied to calculate the F1 rate. However, the final values of the fitness function were not derived from F1 rate, but instead from the MacroF1 rate (10), which better evaluates the performance on an unbalanced data set. The MacroF1 rate (10) shows the effectiveness in each class, independent of the size of the class:

$$\text{MacroF1} = \frac{F1\_p + F1\_n}{2},\tag{10}$$

where


The use of this fitness function was based on the defined measure of classification effectiveness (Macro F1). All words in the lexicon were repeatedly labelled and the effectiveness of opinion classification of texts from the data set using this lexicon (with the new labels) was evaluated, until the labels were found to be optimal. In this way, the labelled polarity of words in the lexicon could be domain-dependent if the data set of texts was domain-dependent. Therefore, we used two data sets: one domain-dependent (Movie) and one domain-independent (Slovak-General). The labelling of a lexicon is an optimization problem and, in this case, supervised learning was used only for computing the values of the fitness function. However, supervised learning was also used for model training in the hybrid approach for opinion classification (see Section 7.2). This model was then used for the classification of texts whose words are not covered by the lexicon.
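The fitness evaluation described above can be sketched as follows; the tokenization and the handling of posts without lexicon words are simplified assumptions.

```
// Sketch of the lexicon-based classification (Equation (9)) and the Macro F1 fitness (Equation (10)).
// 'lexicon' maps a word to its polarity label; 'posts' is an array of { text, label: 'pos' | 'neg' }.
function classifyPost(text, lexicon) {
  const score = text.toLowerCase().split(/\s+/)                 // naive whitespace tokenization (assumption)
    .reduce((sum, word) => sum + (lexicon[word] || 0), 0);      // Equation (9): sum of word polarities
  if (score === 0) return null;    // no lexicon words found; handled by the auxiliary model in the article
  return score > 0 ? 'pos' : 'neg';
}

function macroF1Fitness(posts, lexicon) {
  const stats = { pos: { tp: 0, fp: 0, fn: 0 }, neg: { tp: 0, fp: 0, fn: 0 } };
  for (const post of posts) {
    const predicted = classifyPost(post.text, lexicon);
    if (predicted === null) continue;                           // skip unclassified posts in this sketch
    if (predicted === post.label) stats[predicted].tp++;
    else { stats[predicted].fp++; stats[post.label].fn++; }
  }
  const f1 = cls => {
    const { tp, fp, fn } = stats[cls];
    const precision = tp + fp ? tp / (tp + fp) : 0;
    const recall = tp + fn ? tp / (tp + fn) : 0;
    return precision + recall ? (2 * precision * recall) / (precision + recall) : 0;   // Equation (8)
  };
  return (f1('pos') + f1('neg')) / 2;                           // Equation (10): Macro F1
}
```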

#### **5. Experiments with Various Labelling**

#### *5.1. Data Sets*

Experiments with the different labelling methods were tested on two data sets. The General data set contained 4720 reviews from different websites in the Slovak language. It consisted of 2455 positive and 2265 negative comments; neutral comments were removed. The reviews referred to different domains, such as electronics reviews, book reviews, movie reviews, and politics. The data set included 155,522 words. The Slovak-General data set is available at http://people.tuke.sk/kristina.machova/useful/.

The Movie data set [2] contained 1000 positive and 1000 negative posts collected from rottentomatoes.com. The data set was pre-processed and translated to Slovak. All data sets were labelled as positive or negative by human annotators. Each data set was randomly split with a ratio of 90:10—90% for training and 10% unseen posts for validation. All results were obtained on this validation set. The same subsets were applied in all experiments, including the human-labelled lexicon.

#### *5.2. Experiments with PSO and BBPSO Labelling*

In the process of optimizing the labelling of the Big and Small lexicons, an initial labelling (a set of polarity values for all words in the lexicon) was first generated: 1370 values for the Big and 220 values for the Small lexicon (the Big lexicon originally had 1430 words, but only positive and negative words were included in the experiments). This set of values (1370 or 220) represented one particle for PSO optimization. The set of values was then changed with the aid of *pbest* and *gbest* until the effectiveness of using the particle (the set of label values for each word in the lexicon) in the lexicon-based opinion classification was the highest. Within the labelling optimization, not one but 30 labellings for the Small lexicon and 30 labellings for the Big lexicon were generated, in order to achieve statistically significant results.

A set of experiments was carried out in which both data sets (General and Movie) were used to test the labelling of both lexicons (Big and Small). Two labelling methods were tested: PSO and BBPSO.

Each experiment was repeated 30 times, in order to achieve statistically significant results. The following tables show only the results for the best experiment and the average results of all 30 repeats. The achieved results of these experiments were obtained on the respective validation sets. The results are presented in Tables 1–8. The results of these experiments are measured by Precision in the positive class (Precision Pos.), Precision in the negative class (Precision Neg.), Recall in the positive class (Recall Pos.), Recall in the negative class (Recall Neg.), F1 Positive, F1 Negative, and Macro F1.

Tables 1 and 2 represent experiments on the Movie data set using the Big lexicon. Table 1 shows that using the lexicon labelled by BBPSO was more precise for opinion classification than PSO in all cases, with only one exception. Another observation was that, in all experiments, Precision in classification of positive posts was better than Precision in classification of negative posts; for Recall, the observation was the opposite. The Macro F1 rate in Table 2 provides further insight. There were no significant differences between classification of positive and negative posts. The important result is that labelling by BBPSO led to a more precise lexicon than labelling by PSO.


**Table 1.** The results of Precision and Recall in positive and negative classes on **Movie** data set using **Big** lexicon labelled by PSO and BBPSO.

| Labelling | Precision Pos. | Precision Neg. | Recall Pos. | Recall Neg. |
|---|---|---|---|---|
| PSO best | 0.795 | 0.734 | 0.780 | 0.822 |
| PSO average | 0.702 | 0.691 | 0.687 | 0.703 |
| BBPSO best | 0.814 | 0.779 | 0.769 | 0.822 |
| BBPSO average | 0.758 | 0.730 | 0.719 | 0.767 |




Tables 3 and 4 represent experiments on the Movie data set using the Small lexicon. Comparison of Table 1 with Table 3, and of Table 2 with Table 4, shows that using the Small and Big lexicons gave very similar results, in terms of Precision, Recall, and Macro F1 rate, on the Movie data set.

**Table 3.** The results of Precision and Recall in positive and negative classes on **Movie** data set using **Small** lexicon labelled by PSO and BBPSO.


**Table 4.** Results of F1 rate in positive and negative classes and Macro F1 rate on **Movie** data set using **Small** lexicon labelled by PSO and BBPSO.


We also provide results for four similar experiments on the General data set, which are presented in Tables 5–8. The results in Table 5 show that the experiments on the General data set led to results similar to those on the Movie data set. Precision in classification of positive posts was better than Precision in classification of negative posts; for Recall, the observation was again the opposite.

The results in Tables 5 and 6 show that using the lexicon labelled by BBPSO was more precise for opinion classification than PSO, in most cases. Table 7 demonstrates that using the Small lexicon on the General data set led to very poor results; only Recall in positive posts gave good results.

**Table 5.** The results of Precision and Recall in positive and negative classes on **General** data set using **Big** lexicon labelled by PSO and BBPSO.


**Table 6.** The results of F1 rate in positive and negative classes and Macro F1 rate on **General** data set using **Big** lexicon labelled by PSO and BBPSO.


**Table 7.** The results of Precision and Recall in positive and negative classes on **General** data set using **Small** lexicon labelled by PSO and BBPSO.


The Macro F1 rate, presented in Table 8, confirms this finding. The reason for this failure could be that the Small lexicon was generated from six English lexicons and only words overlapping across all lexicons were chosen. Thus, the Small lexicon may not have contained specific words which were important for polarity identification in a given text; that is, it did not contain all the words with sentiment polarities needed for successful sentiment classification of general texts.


**Table 8.** Results of F1 rate in positive and negative classes and Macro F1 rate on **General** data set using **Small** lexicon labelled by PSO and BBPSO.

A significance test was also performed. A paired-sample *t*-test was used to verify the statistical significance of the improvement of BBPSO over PSO, with a 95% confidence level and 29 degrees of freedom. We tested the Macro F1 measure, and the results (see Table 9) showed that BBPSO was significantly better than PSO.

**Table 9.** Results of significance test of Macro F1 rate in experiments on **Movie** and **General** data sets using **Big** and **Small** lexicons.


Under the null hypothesis of no difference between the methods, the *p*-value is the probability of observing a difference at least as large as the one measured. The *p*-values were small in all cases, so the differences between the PSO and BBPSO results presented in Tables 1–8 are statistically significant. This statement is valid at the 95% confidence level.
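
Such a paired test can be reproduced, for example, with SciPy; the arrays below are placeholders standing in for the 30 Macro F1 values obtained with each labelling method:

```python
import numpy as np
from scipy import stats

# Placeholder values standing in for the Macro F1 of the 30 repeated runs
# (the real values come from the experiments summarised in Tables 1-8).
rng = np.random.default_rng(0)
macro_f1_pso = 0.70 + 0.02 * rng.random(30)
macro_f1_bbpso = 0.75 + 0.02 * rng.random(30)

# Paired-sample t-test; 30 pairs give 29 degrees of freedom.
t_stat, p_value = stats.ttest_rel(macro_f1_bbpso, macro_f1_pso)
print(f"t = {t_stat:.3f}, p = {p_value:.4g}, significant at the 0.05 level: {p_value < 0.05}")
```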

The complexity of the automatic labelling of lexicons using the optimization methods PSO and BBPSO is O(IMAX·N·D), where IMAX is the maximum number of iterations, N is the total number of words in the training set, and D is the number of words in the lexicon. This means that the complexity is linear in the size of the training set and the size of the lexicon. In our case, the General data set contained 155,522 words and the Big lexicon contained 1370 words. The lexicon approach for opinion classification, which was used for computing the values of the fitness function, is linear in the number of posts in the training set M and the number of words in the lexicon D, i.e., its complexity is O(M·D).

#### *5.3. Comparison of PSO and BBPSO Labelling with Human Labelling*

In the previous section, it was shown that BBPSO was better than simple PSO. We also wanted to compare this approach with human labelling. Within this experiment, we decided to evaluate the results of the opinion classification only in terms of Macro F1 rate. The results are illustrated in Table 10. They show that automatic labelling using nature-inspired optimization algorithms, especially BBPSO, was better than human labelling of lexicons for the lexicon approach to opinion classification.


**Table 10.** The comparison of labelling by human, PSO, and BBPSO in Macro F1 rate on **Movie** and **General** (Slovak) data sets using **Big** and **Small** lexicons labelled by PSO and BBPSO.


The results in Table 10 confirm the findings in Tables 7 and 8: that using the Small lexicon on the General (Slovak) data set led to very poor results, not only when using PSO and BBPSO labelling but also for human labelling. The most important fact is that BBPSO was able to find the best polarity values for the words in the lexicon, independently of the used lexicon (Big or Small) and data set (Movie or General—Slovak). These results are illustrated also in Figure 1a,b for the Big and Small lexicons, respectively.

**Figure 1.** The comparison of the **Big** lexicon in part (**a**) and **Small** lexicon in part (**b**) labelling by a human, Particle Swarm Optimization (PSO), and Bare-bones Particle Swarm Optimization (BBPSO) in Macro F1 rate.

We found seven other approaches which used the Movie data set. Table 11 contains a comparison of our approach to those other related works with experiments on the Movie data set. For our needs, the data set was automatically translated into the Slovak language, which had an impact on the overall results of our tests. The last row of Table 11 contains the results of our hybrid approach (Section 7.2) on the Movie data set; the results of the same approach on the General data set were better (Accuracy = 0.865).

**Table 11.** The comparison of effectiveness of our approaches with seven other related approaches tested on **Movie** data set. The last row contains results of our hybrid approach (Section 7.2), which uses the lexicon approach composed with a machine learning approach (Naive Bayes).


(Last row of Table 11) Lexicon approach & Naive Bayes, BBPSO labelling, our approach: 0.807.

#### **6. Distribution of Values of Polarities in Generated Lexicons**

The main purpose of this section is the comparison of human labelling and automatic labelling, in order to answer the following questions: Which integer labels are preferred in PSO and BBPSO labelling, in comparison with human labelling? Can the subjectivity of human labelling cause a decrease of effectiveness of lexicon-based opinion classification?

We worked under the assumption that automatic labelling, unlike human labelling, is not subjective. This is because, in the process of PSO labelling, the effectivity of lexicon use in lexicon-based opinion classification is a decisive factor. Many human annotators can easily agree on whether an opinion is positive or negative, but when determining the intensity degrees of the polarity of opinions, it is difficult to reach an agreement. So, labelling using PSO optimization seemed to be a good solution. This assumption was supported by our results, which are shown in Figures 2–5.

We also examined the distribution of polarity values, which were assigned by the automatic labelling in the interval of integers <−3, +3> in labelled lexicons. We wanted to know if there were some differences between labelling by human and automatic labelling (PSO, BBPSO); in other words, we wanted to find some integer values in the interval <−3, +3> which are preferred by a human or an automatic annotator, respectively. Our findings are illustrated in Figures 2–5.

The models of PSO and BBPSO labelling were generated using both General and Movie data sets. Of course, human labelling is independent of any data set. In Figures 2 and 3, the result distributions of the intensity of polarity values in the Big lexicon are shown. Figure 2 illustrates the results of comparison of PSO and human labelling, while Figure 3 illustrates the comparison of BBPSO and human labelling. We can see, in these two figures, that the human annotator avoided labelling words with zero. They expected only positive or negative words in the lexicon and no neutral ones. On the other hand, PSO frequently used a zero-polarity label. BBPSO also used zero polarity values, but they were not applied as often.

**Figure 2.** Distribution of values of the intensity of polarities acquired in the process of the **Big** lexicon labelling (1370 words) using PSO, in comparison to human labelling. Axis X represents the intensity of a word polarity and axis Y represents the number of words with the given intensity of polarity.


**Figure 3.** Distribution of values of the intensity of polarities acquired in the process of **Big** lexicon labelling (1370 words) using BBPSO, in comparison to human labelling. Axis X represents the intensity of a word polarity and axis Y represents the number of words with the given intensity of polarity.

We ran similar experiments with the Small lexicon. Figures 4 and 5 illustrate the resulting distributions of the intensity of polarity values in the Small lexicon. These results confirm similar findings as for the Big lexicon, in that PSO labelling most often used intensity polarity labels equal to zero. BBPSO labelling of the Small lexicon often used extreme (−3 and 3) polarity values.

We must point out that labelling some words with a zero value means rejecting this word from the lexicon, as the word is not helpful in the process of opinion classification. An interesting discovery is the fact that labelling by nature-inspired algorithms (PSO, BBPSO) achieved very good results, despite the fact that they rejected some words in the process of the opinion classification. PSO labelling rejected from 21% to 25% of all words from the Big lexicon and from 25% to 28% of all words from the Small lexicon. BBPSO labelling rejected approximately 12% of words from the Big lexicon and from 12% to 16% of words from the Small lexicon.

**Figure 4.** Distribution of values of the intensity of polarities acquired in the process of the **Small** lexicon labelling (220 words) using PSO, in comparison to human labelling. Axis X represents the intensity of a word polarity and axis Y represents the number of words with the given intensity of polarity.


**Figure 5.** Distribution of values of the intensity of polarities acquired in the process of the **Small** lexicon labelling (220 words) using BBPSO, in comparison to human labelling. Axis X represents the intensity of a word polarity and axis Y represents the number of words with the given intensity of polarity.


#### **7. New Lexicon Approach to Opinion Analysis**

The new approach proposes a new means for negation processing, by combining switch and shift negation. It also incorporates a new means for intensifier processing, which is dependent on the type of negation.

Besides summing the polarities of words when analyzing the opinion of posts according to (9), intensifiers and negations should be processed in the opinion classification. Intensifiers are special words which can increase or decrease the intensity of polarity of connected words. In our approach to opinion analysis, the intensifiers are processed using a special part of the lexicon. In this part of the lexicon, words are accompanied by numbers, which represent a measure of increasing or decreasing the polarity of connected words. This means that words with strong polarity (positive or negative) are intensified more than words with weak polarity. The value of the intensification is initially set to 1. A connected word's polarity is then multiplied by the actual value of the intensifier; for example, in the sentence "It is a very good solution", the polarity P = 1 of the word "good" is increased by the word "very" to the final polarity P = 2 × 1 = 2.
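
As a small illustration of this intensifier mechanism, consider the following Python sketch (with an invented two-part mini-lexicon, not the lexicon used in the experiments):

```python
# Invented fragments of the polarity lexicon and of its intensifier part.
polarity_lexicon = {"good": 1, "excellent": 3, "bad": -1}
intensifiers = {"very": 2.0, "slightly": 0.5}

def word_polarity(prev_word, word):
    """Multiply the word's polarity by the value of a preceding intensifier, if any."""
    polarity = polarity_lexicon.get(word, 0)
    intensity = intensifiers.get(prev_word, 1.0)  # the intensification value starts at 1
    return polarity * intensity

# "It is a very good solution": P("good") = 1 is increased by "very" to 2 * 1 = 2.
print(word_polarity("very", "good"))  # -> 2.0
```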

1 of the word "good" is increased by the word "very" to the final polarity of P = 2\*[1] = 2. In our approach, negations are processed in a new way, using the interactivity of switch and shift negation [26]. Switch negation only turns polarity to its opposite (e.g., from +2 to −2), as illustrated in Figure 6a. Shift negation is more precise than switch negation, only shifting the polarity In our approach, negations are processed in a new way, using the interactivity of switch and shift negation [26]. Switch negation only turns polarity to its opposite (e.g., from +2 to −2), as illustrated in Figure 6a. Shift negation is more precise than switch negation, only shifting the polarity of a connected word towards the direction to opposite polarity, as illustrated in Figure 6b. In our approach, negations are processed in a new way, using the interactivity of switch and shift negation [26]. Switch negation only turns polarity to its opposite (e.g., from +2 to −2), as illustrated in Figure 6a. Shift negation is more precise than switch negation, only shifting the polarity of a connected word towards the direction to opposite polarity, as illustrated in Figure 6b.

**Figure 6.** Illustration of switch negation (**a**) and shift negation (**b**) of polarity intensity of connected words in analyzed posts.

We designed the interactive so-called 'combined negation processing', where shift negation is applied to extreme values of polarity of connected words (+/−3) and switch negation is used for processing the most obvious polarities (with absolute value 1 or 2). The combined approach significantly increases the effectivity of opinion classification, as illustrated in Table 12.
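
A minimal sketch of this combined rule (Python; the size of the shift is an assumption made for illustration, since the exact shift value is not restated in this section):

```python
def combined_negation(polarity, shift=4):
    """Combined negation: shift negation for the extreme polarities (+/-3),
    switch negation for the remaining polarities (absolute value 1 or 2)."""
    if abs(polarity) == 3:
        # Shift the polarity towards the opposite end of the scale.
        return polarity - shift if polarity > 0 else polarity + shift
    # Switch negation: simply flip the sign.
    return -polarity

print(combined_negation(3))   # shift:  +3 -> -1
print(combined_negation(2))   # switch: +2 -> -2
```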


**Table 12.** The effectivity of composite approach to negation processing of words in analyzed posts, in terms of Macro F1 rate.

After involving intensifiers and negations, the polarity of the whole post is not calculated using the simple approach (9) but, instead, using the new approach (11):

$$P\_p = \sum P\_w (\prod P\_i) (\prod P\_n) \tag{11}$$

where P\_p is the polarity of the whole post, P\_w is the polarity of a word in the post according to the lexicon, P\_i is the value of an intensifier connected to the word, and P\_n is the value of a negation connected to the word.


#### *7.1. Topic Identification in Opinion Classification*

Another method that we used to increase the effectiveness of the opinion classification was topic identification. Topic identification can be helpful in increasing the influence of posts concentrated on the topic of an online discussion. The polarity of these posts was increased using greater weights. We tested two methods for topic identification: Latent Dirichlet Allocation (LDA) and Term Frequency (TF). LDA is a standard probabilistic method, based on the Dirichlet distribution of the probability of a topic for each post. The output of LDA is a list of words accompanied by their relevancy to all texts in the data set (so-called topics). An experiment was carried out in which all texts in the data set were processed using LDA. The output was a list of 50 words relevant to topics present in the texts. As there were too many "topics", the list was reduced to the 15 words with the highest relevancy to the topics of the processed texts.

The second method was topic identification based on the term frequencies of words in posts. We assumed that the words relevant to the topic of an online discussion should have a higher occurrence in its posts. A disadvantage of this method is that stop words have the highest occurrence in texts. For this reason, stop words must be excluded during pre-processing. We did not create this list ourselves, instead using a known list of stop words.

First, the opinion polarity of posts was estimated. In the second step, words relevant to the identified topic of discussion were searched for in the post. If such words were found in the post, then the opinion polarity of the post was increased (by multiplication with value 1.5). The value of 1.5 was set experimentally, after experiments with three values: 1.5, 2, and 3. The double and triple changes of polarity led to slight decreases in the quality of the results obtained. Results of experiments with topic identification in the opinion analysis are illustrated in Table 13.
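
A minimal sketch of the term-frequency variant of this weighting (Python; the stop-word list and helper names are illustrative, and the LDA variant would differ only in how the topic words are obtained):

```python
from collections import Counter

STOP_WORDS = {"a", "the", "is", "and", "to"}  # in practice, a known Slovak stop-word list

def topic_words(posts, top_n=15):
    """Topic identification by term frequency: the most frequent non-stop words."""
    counts = Counter(w for post in posts for w in post.lower().split() if w not in STOP_WORDS)
    return {w for w, _ in counts.most_common(top_n)}

def boost_polarity(post, polarity, topics, factor=1.5):
    """Increase the post polarity when the post mentions a topic word (factor 1.5 worked best)."""
    return polarity * factor if any(w in topics for w in post.lower().split()) else polarity
```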

The results of the experiments, as presented in Table 13, show that the implementation of topic identification increased the Precision and Recall of the opinion classification of posts in online discussions. Topic identification using LDA achieved better results than using the term frequency method. People often express negative opinions while talking about the main topic. This negative opinion is usually compensated for by more positive posts related to less important aspects of the discussed problem. In this case, topic identification can significantly increase the precision of opinion classification.


**Table 13.** Results of Precision and Recall in positive and negative classes and Macro F1 without Topic Identification (TI) and with TI using Term Frequency (TF) and Latent Dirichlet Allocation (LDA) methods.

#### *7.2. A Hybrid Approach to Opinion Classification*

The proposed hybrid approach to opinion classification combines the advantages of two different techniques for creating an opinion classification model. The first technique is the lexicon-based approach, which is simple and intuitive; however, it can fail when the lexicon is not sufficiently expressive. The second technique, the machine learning approach, does not depend on the quality of the lexicon, but requires a labelled training set as an input for training an opinion classification model. Therefore, we designed the hybrid approach to increase the effectiveness of the opinion analysis of posts when the lexicon approach fails (see Figure 7). The posts which were successfully classified as having positive or negative opinion using the lexicon (and, in this way, were labelled) were put into the training data set, in order to train a probability model based on the Naive Bayes machine learning method. This model was then able to classify posts that did not contain words from the lexicon and could not be classified using the lexicon approach.

**Figure 7.** Illustration of new hybrid approach for opinion classification. In this approach, there is co-operation between the lexicon approach for opinion classification (LOC) and the machine learning approach for opinion classification (MLOC), represented by the Naïve Bayes model (NB).

At first, Lexicon-based Opinion Classification (the "LOC" block) is applied to classify all posts in the data set. Once all posts are classified either successfully (YES) or unsuccessfully (NO), the dataset is split into two groups: labelled and unlabelled posts (Dataset\_2). Labelled posts represent the training set, which is used for Naïve Bayes model training (the "NB" block). The trained Naïve Bayes model is then applied to classify posts in Dataset\_2 which were not classified by Lexicon-based Opinion Classification; this is the "MLOC" (Machine Learning Opinion Classification) block. All classified posts, as classified by LOC and MLOC, are then saved (the "RESULTS" block).
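
A compact sketch of this pipeline (Python with scikit-learn; `lexicon_classify` stands in for the LOC block and is assumed to return +1, -1, or None when the post contains no lexicon words):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def hybrid_classify(posts, lexicon_classify):
    """Hybrid approach: posts labelled by the lexicon (LOC) train a Naive Bayes model (NB),
    which then classifies the posts the lexicon could not decide (MLOC)."""
    labelled, unlabelled = [], []
    for post in posts:
        label = lexicon_classify(post)  # +1, -1, or None
        (labelled if label is not None else unlabelled).append((post, label))

    vectorizer = CountVectorizer()
    X_train = vectorizer.fit_transform([p for p, _ in labelled])
    nb = MultinomialNB().fit(X_train, [l for _, l in labelled])

    results = dict(labelled)
    if unlabelled:
        X_rest = vectorizer.transform([p for p, _ in unlabelled])
        results.update(zip([p for p, _ in unlabelled], nb.predict(X_rest)))
    return results  # post -> predicted opinion label
```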

The hybrid approach for opinion classification was tested and compared with the original lexicon approach. The results of testing are presented in Table 14, in the form of F1 rates—particularly, F1 Positive (F1 rate on positive posts), F1 Negative (F1 rate on negative posts), and Macro F1 rates—on the General data set. The results in Table 14 show a strong increase in efficiency, as measured by F1 rate, when the hybrid approach for opinion classification was used. This increase in efficiency can be explained by the decrease in the number of unclassified posts. Unclassified posts decreased the precision of classification, as positive and negative posts were classified as neutral due to the absence of positive or negative words in the lexicon. Using the hybrid approach, the share of unclassified posts was reduced from 18% to 0.03%. In future research, we would like to use deep learning [35] for the machine learning-based opinion analysis method in the hybrid approach.

**Table 14.** Effectivity of hybrid approach, in comparison to simple lexicon approach, for labelling by human, PSO, and BBPSO in F1 Positive, F1 Negative, and Macro F1 rates.


The complexity of the lexicon approach for opinion classification is O(M·D), thus being linear in the number of posts in the training set M and the size of the lexicon D. The complexity of the machine learning approach for opinion classification using the Naive Bayes algorithm is O(M·N), thus depending linearly on the total number of posts in the training set M and the number of attributes (words) in the training set N. It follows that the complexity of the hybrid approach, which runs both stages, is O(M·D + M·N).

#### **8. Discussion**

The main purpose of this paper is to find the best method for labelling a lexicon for a lexicon-based approach to opinion classification. Therefore, it is natural that our baseline was the lexicon-based approach, not a machine learning approach. This is the reason for comparing the effectiveness of the hybrid approach with the lexicon approach as a baseline. This basic lexicon approach was extended by a machine learning approach, in order to achieve better results in the cases when the lexicon (dictionary) could not cover the texts.

It was not our goal to test all machine learning methods but, instead, to discover whether a supplementary model trained by machine learning can decrease the number of failures in the opinion classification of problematic texts. The use of Naïve Bayes was a natural choice, as this method also gives weights to words (i.e., labels) in the form of a conditional probability of the word belonging to a given class in the data. Deep learning also trains the weights of attributes (words, in this case), but clear information about these weights is lost due to the large number of inner layers used. In the field of text processing, we also often use Random Forest or kernel SVM methods. However, these machine learning methods do not provide intuitive and explainable solutions with clear information about the measure of sensitivity of words in a model either.

The findings of the presented work are useful for our research in the field of antisocial behavior recognition in online communities and in the field of human–robot interaction. Our approach can provide the results of opinion and mood analysis of texts for use in these fields. The presented work is focused on opinion classification using a lexicon approach and, so, we needed to generate a high-quality lexicon using effective labeling.

In this paper, an automated method for lexicon labelling was proposed. It used nature-inspired optimization algorithms—Particle Swarm Optimization (PSO) and Bare-bones Particle Swarm Optimization (BBPSO)—to find optimal polarity values for words in the lexicon. The results of numerous tests on two data sets (Movie and General) were provided and presented in the paper. These tests showed that BBPSO labelling is better than PSO labelling, and that both are better than human labelling. Two lexicons (Big and Small) were created, in order to achieve good performance, which were labelled by both PSO and BBPSO. The experiments showed another finding: the human annotator avoided labelling words with a number close to zero, whereas PSO or BBPSO assigned zero values to some words.

We tested the labelling of lexicons using our new lexicon approach. The novelty of this approach comprised a new approach for intensifier processing and an interactive approach for negation processing. This new approach also involved topic identification and a hybrid approach for opinion classification, using not only lexicon-based, but also machine learning-based opinion classification methods. The hybrid approach was applied to classify the posts which were not classified by the lexicon approach.

For the future, we would like to extend our automatic lexicon labelling to learn polarity values representing the concept-domain pair. In some cases, the polarity of the word can be different in different domains. In that case, the polarity value represents the polarity of the word in the given domain. Furthermore, we would like to focus on the statistical analysis of words labelled by PSO and BBPSO, respectively. On one hand, the optimized labels will be compared with human labelling. On the other hand, removed words (i.e., the words labelled with zero) will be analyzed deeper, in order to answer the following questions: Which words were removed from the lexicons? How often are they removed?

The final hybrid model for sentiment analysis can be used in our research in the field of emotion analysis in human–robot interactions, where understanding of human mood by a robot can increase the acceptance of a robot as an assistant.

We are also using our work in sentiment analysis in the field of recognition of antisocial behavior in online communities. We would like to model the sentiment and mood of society in connection with the phenomenon of CoViD-19 [36].

**Author Contributions:** Conceptualization, K.M. and M.M. (Martin Mikula); methodology, K.M.; software, M.M. (Martin Mikula); validation, M.M. (Martin Mikula); formal analysis, X.G.; investigation, X.G.; resources, M.M. (Martin Mikula); data curation, M.M. (Martin Mikula); writing—original draft preparation, K.M.; writing—review and editing, K.M., M.M. (Marian Mach) and X.G.; visualization, M.M. (Marian Mach); supervision, K.M.; project administration, K.M.; funding acquisition, K.M. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the Slovak Research and Development Agency under the contract No. APVV-17-0267 "Automated Recognition of Antisocial Behavior in Online Communities" and the contract No. APVV-16-0213 "Knowledge-based Approaches for Intelligent Analysis of Big Data".

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Emotion Analysis in Human–Robot Interaction**

#### **Martina Szabóová, Martin Sarnovský, Viera Maslej Krešňáková and Kristína Machová \***

Department of Cybernetics and Artificial Intelligence, Technical University of Košice, Letná 9, 040 01 Košice, Slovakia; martina.szaboova@tuke.sk (M.S.); martin.sarnovsky@tuke.sk (M.S.); viera.maslej.kresnakova@tuke.sk (V.M.K.)

**\*** Correspondence: kristina.machova@tuke.sk

Received: 25 September 2020; Accepted: 20 October 2020; Published: 23 October 2020

**Abstract:** This paper connects two large research areas, namely sentiment analysis and human–robot interaction. Emotion analysis, as a subfield of sentiment analysis, explores text data and, based on the characteristics of the text and generally known emotional models, evaluates what emotion is presented in it. The analysis of emotions in human–robot interaction aims to evaluate the emotional state of the human being and, on this basis, to decide how the robot should adapt its behavior to the human being. There are several approaches and algorithms to detect emotions in text data. We decided to apply a combined method of a dictionary approach with machine learning algorithms. As a result of the ambiguity and subjectivity of labeling emotions, it was possible to assign more than one emotion to a sentence; thus, we were dealing with a multi-label problem. Based on the overview of the problem, we performed experiments with the Naive Bayes, Support Vector Machine and Neural Network classifiers. Results obtained from classification were subsequently used in human–robot experiments. Despite the lower accuracy of emotion classification, we proved the importance of expressing emotional gestures based on the words we speak.

**Keywords:** sentiment analysis; human–robot interaction; dictionary approach; machine learning approach; social robotics

#### **1. Introduction**

The population is getting older. According to the World Health Organization (WHO), it is estimated that by the year 2050, the elderly will account for 25% of the world population (35% of the population in Europe) (https://www.un.org/en/development/desa/population/publications/pdf/ageing/WPA2017\_Highlights.pdf). Caring for these seniors—physically, emotionally and mentally—will be an enormous undertaking, and experts say there will be a shortage of trained professionals and those willing to take on the job. Robots may fill the gap, taking care of older people. The shortage of trained professionals and the desire to age in place can be addressed by socially assistive robotics. While assistive robots exist [1] (e.g., intelligent walkers, wheelchair robots, manipulator arms and exoskeletons), they lack the social aspect as well as the affective component.

In this situation, it is essential to devote research to approaches that go beyond the concept of assistive robotics and focus on the development of a robot that would also be a companion for the elderly. In this type of robot, the key factor is its acceptance by humans. We need to equip the robot with abilities that would make it a pleasant companion, and thus a companion who can at least partially understand the emotional mood of the elderly. This means that, based on what the person says, what the person looks like and how the person behaves, the robot will be able to choose the right answers and movements or gestures. We focused on estimating the emotional state of the elderly, mainly from what the person says. We also focused on the analysis of speech, specifically in its written form, as today numerous speech-to-text systems are able to reliably transform speech into text. We used the text as the input and analyzed it in terms of emotions, which falls into a very current area of research: sentiment analysis.

Wada et al. [2] studied the psychological effects of a seal robot, PARO, used to engage seniors at a day service center. Results show that the moods of elderly people were improved by interaction with the robots over the course of a 6-week period. Šabanovic et al. [3] used PARO in a study with older adults with dementia. They showed that PARO provides indirect benefits for users by increasing their activity in particular modalities of social interaction, including visual, verbal and physical interaction. PARO also had positive effects on older adults' activity levels over the duration of the study, suggesting these are not due to short-term 'novelty effects'. Huang and Huang [4] conducted a study to explore the elderly's acceptance of companion robots from the perspective of user factors. They found that the elderly living with parents, with master's (or doctor's) education, a medical professional background and experience in the use of scientific and technological products expressed more positive attitudes in the responses to the items on the constructs of attitude and perceived usefulness, while the attitude of those with primary school education and a humanities professional background, with no experience in scientific and technological products, was relatively negative.

The presented studies indicate that the communication of older adults with a robot can be beneficial: it can improve their emotional mood and increase their activity in particular modalities of various kinds of interactions. On the other hand, there is a big obstacle in the negative approach to communication with the robot, especially in the group of people with only primary education and with no experience with scientific and technological products. We focused on this problem and tried to help break down these people's prejudices about robots, for example, by equipping the robot with the ability to be sensitive to the emotions that an older adult expresses in some way. The scenario in which we wanted to verify the achieved results was as follows. A robot can use information about the polarity of the mood of an elderly person to communicate with him/her in a friendly, sensitive and appropriate manner. When a robot communicates with a human (e.g., an elder), it must choose one from many answers which are suitable for the situation. For example, it can choose an answer which can cheer up the person, if it has information that the current emotional mood of the person is sad. It can also adapt its movements and choose a movement from all possible ones to cheer up this elderly person. The robot should have prepared answers and movements for all possible basic emotions of an elderly person. Finally, the understanding of the emotional moods of humans can lead to better acceptance of communication with robots.

The main contributions of the paper can be summarized as follows:


### **2. Background**

## *2.1. Sentiment Analysis*

Sentiment analysis is an interdisciplinary field connecting natural language processing (NLP), computational linguistics and text mining. As we can see from the number of papers published at reputable conferences and in journals in NLP and computational linguistics, it is an admittedly hot topic. Its vital role is to deal with opinion, sentiment and subjectivity in text. It attempts to analyze and take advantage of extensive quantities of user-generated content and enables the computer to 'understand' text.

#### 2.1.1. Research Tasks in Sentiment Analysis

Sentiment analysis involves various research tasks [5], such as:


## *2.2. Emotion Analysis*

Emotion analysis can be viewed as a natural evolution of sentiment analysis and its more fine-grained model. Digging deeper into psychology, we have to differentiate between the terms *emotion, mood, feeling*. *Emotion* is an instantaneous perception of a feeling; emotions can be over in a matter of seconds to minutes, at most [10]. *Mood* is considered as a group of persisting feelings associated with evaluative and cognitive states which influence all future evaluations, feelings and actions [11]. Unlike emotions, moods are non-intentional, though they may be elicited by a particular event or thing. It is challenging to identify the triggers causing a mood; however, while in the state of a certain mood, the threshold for arousing a related emotion is lowered. *Feeling* refers to mental associations and reactions to an emotion that are personal and acquired through experience.

How can we determine emotions? To be able to identify emotions in text, firstly, we need emotion models to estimate them.

#### 2.2.1. Emotion's Models

According to Grandjean et al. [12], three major directions in affect computing are recognized: categorical/discrete, dimensional and appraisals-based approaches.



**Table 1.** Listing emotion models and their appertaining emotions.

**Figure 1.** Plutchik's wheel of emotion.

Despite the existence of various other models, the categorical and dimensional approaches are the most commonly used models for automatic analysis and prediction of affect in continuous input.

It is worth mentioning the survey made by Ekman [17]. The authors surveyed 248 scientists working in the field of emotion, looking for an answer to whether and how the understanding of the nature of emotion has changed over time: which proposal—either Darwin's [18] (emotions are discrete) or Wundt's [19] (emotions differentiate into dimensions of pleasant–unpleasant and low–high intensity)—is most used nowadays? Findings from this survey indicate that scientists agreed upon five emotions (all of which were described by both Darwin and Wundt): anger (91%), fear (90%), disgust (86%), sadness (80%) and happiness (76%). Shame, surprise and embarrassment were endorsed by 40–50%. The least agreed-upon basic emotions are guilt (37%), contempt (34%), love (32%), awe (31%), pain (28%), envy (28%), compassion (20%), pride (9%) and gratitude (6%).

Recent advances in the field of sentiment analysis and computational linguistics in general, allow us to accomplish more advanced tasks such as emotion detection in documents. To detect emotion, researchers use generally known algorithms created for sentiment analysis. There are three major approaches to detecting emotions in text:

• **Keyword-based methods**—the most intuitive approach. The main goal is to find patterns similar to emotion keywords and match them. The first task is to find the word which expresses the emotion in a sentence. This is usually done by tagging the words of a sentence with a Part-of-Speech tagger and then extracting the Noun, Verb, Adjective and Adverb (NAVA) words—the most probable emotion-carrying words. These words are then matched against a list of words representing emotions according to a specific emotion model. Whichever emotion matches with the keyword is considered as the emotion of the specific sentence. Different approaches can be applied when the word matches multiple emotions from the list. In some keyword dictionaries, each word has a probability score for each emotion, and the emotion with the highest score is picked as the emotion of the word. In some other works, the first emotion matched with the word is picked as the primary emotion of the word. The reference list of keywords or the keyword dictionary differs depending on the researcher. A minimal illustrative sketch of this approach is given below.
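
A minimal sketch of such keyword matching (Python; the keyword lists are invented for illustration, and Part-of-Speech filtering is omitted for brevity):

```python
# Invented emotion keyword lists (real dictionaries differ per researcher and emotion model).
EMOTION_KEYWORDS = {
    "joy": {"happy", "delighted", "glad"},
    "sadness": {"sad", "unhappy", "miserable"},
    "anger": {"angry", "furious", "annoyed"},
}

def keyword_emotions(sentence):
    """Match candidate emotion-carrying words against the keyword lists; every matched
    emotion is returned, so the result can be multi-label."""
    words = {w.strip(".,!?").lower() for w in sentence.split()}
    return {emotion for emotion, keys in EMOTION_KEYWORDS.items() if words & keys}

print(keyword_emotions("I was so happy, yet a bit sad."))  # -> {'joy', 'sadness'}
```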


#### *2.3. Human–Robot Interaction*

Human–robot interaction (HRI) is a study of interaction dynamics between humans and robots, a multidisciplinary field that includes engineering (electrical, mechanical, industrial and design), computer science (human–computer interaction, artificial intelligence, robotics, natural language understanding, computer vision and speech recognition), social sciences (psychology, cognitive science, communications, anthropology and human factors) and humanities (ethics and philosophy) [20].

Robots are poised to fill a growing number of roles in today's society, from factory automation to service applications, medical care and entertainment. While robots were initially used for repetitive tasks where all human direction is given a priori, they are becoming involved in increasingly more complex and less structured tasks and activities, including interaction with the humans required to complete those tasks. The fundamental goal of HRI is to develop the principles and algorithms for robot systems that enable safe and effective interaction with humans [20].

The appearance and function of a robot affect the way that people perceive it, interact with it and build long-term relationships with it [21]. As every person is different, the success of robot acceptance lies in its capability to act as a social entity and its adaptability to differentiate behavior within appropriate response times and tasks.

Interaction, by definition, means "communication with each other or reacting to each other" (https://dictionary.cambridge.org/dictionary/english/interaction). There are several possibilities for robots to communicate with humans. The way of communication is largely influenced by whether the human and robot are in close proximity to each other or not. Therefore, the interaction can be categorized into remote and proximate interaction. Within these two general categories, we can differentiate applications that require mobility, physical manipulation and social interaction [22].

#### 2.3.1. Socially Assistive Robotics

Social interaction includes social, emotive and cognitive aspects of interaction. It involves the research areas of assistive robotics, social robotics and socially assistive robotics. Socially Assistive Robotics (SAR) is defined as the intersection of assistive robotics and socially interactive robotics. It is a comparatively new field of robotics that focuses on developing robots capable of assisting users through social rather than physical interaction. Social robots have to be able to perceive, interpret and respond appropriately to verbal and nonverbal cues from the human. SAR, compared with social robots, focuses on the challenges of providing motivation, education, therapy, coaching, training and rehabilitation through nonphysical interaction. An effective socially assistive robot must understand and interact with its environment, exhibit social behavior, focus its attention and communication on the user, sustain engagement with the user and achieve specific assistive goals. The robot must do all of this in a way that is safe, ethical and effective for the potentially vulnerable user. SAR has been shown to have promise as a therapeutic tool for children, the elderly, stroke patients and other special-needs populations requiring personalized care.

#### 2.3.2. Long-Term Interaction

Many applications with social robots involve only short-term interactions. However, short-term interaction is not enough. Many real-world applications (e.g., education, therapy, companionship and elderly care) call for keeping people interested for longer. We have to maintain human engagement and build a relationship and trust between human and robot through adaptation and personalization. An important aspect of long-term interaction is *memory*. As the robot memorizes information, it can better execute personalized behavior. Zheng [23] proposed four types of memory information (factual information: personal facts like names; intention: knowledge of the user's plans and future actions; interaction history: representation of past events; and meta-behavior: metadata of the user's behaviors during interactions). Their preliminary results show that meta-behavior elicits stronger positive feelings in comparison to the other three types of memory information. Richards and Bransky [24] performed an experiment about forgetting and recalling information (4 levels: complete recall; total loss of recall; partial recall; and incorrect recall). By exhibiting forgetting, either explicitly stating forgetfulness or not mentioning it at all, the believability of the character was raised. The study also suggests that forgetting affects the level of trust the user feels.

Talking about long-term interaction, we have to take into account the *novelty effect*. The novelty effect, in the context of HRI, can be explained in such a way that interaction with the robot can initially be highly triggering and engaging, but after a couple of interactions, the newness wears off and people can lose interest in interacting with the robot. To avoid such behavior, the challenge is to keep people engaged in the interaction and motivate them to interact longer (weeks, months or even years). This is not as simple as it may sound.

#### 2.3.3. Personalization

Personalization is closely associated with the long-term interaction discussed above and is another important research area in SAR. Personalization is the ability of the robot to adapt its behavior to a specific human, context, environment and task. There are numerous studies researching the impact of personalization on HRI [25–29].

However, there are studies that contradict this claim. Kennedy et al. [30] implemented a robot tutoring system. Their idea was to determine to what extent social and adaptive robot behavior is desirable for supporting children in their learning. The task objective was to identify prime numbers, and the participants were 45 children aged 7–8. Three scenarios were introduced: no robot (screen only), an asocial robot and a social personalized robot. The results show that learning with a robot boosts learning gains compared with the screen-only condition; however, learning with the social personalized robot did not improve learning further. Gao et al. [31] built a reinforcement learning framework for personalization that allows a robot to select supportive verbal behavior to maximize the user's task progress and positive reactions. Their conclusion was that people preferred robots that exhibited more varied behaviors over a robot whose behavior converged to a specific (personalized) one over time.

Nevertheless, we implemented personalized robot behavior in our user-case scenario described in Section 6.

#### 2.3.4. Artificial Companionship

So far, robot companions lack many important social and emotional abilities (e.g., recognizing social and affective expressions and states, understanding intentions, accounting for the context of the situation and expressing appropriate social and affective behavior) needed to engage with humans in natural interaction.

An artificial companion should be capable of evaluating how humans feel about the interaction and how they interpret the agent's actions, and should use this information to adapt its behavior accordingly [32]. For instance, a robotic companion (Figure 2) should act empathically towards users if it detects that they are sad or not willing to engage in an interaction; e.g., it would not disturb them by trying to engage them in some activity if they do not approach it.

**Figure 2.** Robot companions. Humanoids in top row—from left to right (1) Zeno (Hanson Robotics), (2) NAO (Aldebaran Robotics), (3) Pepper (Aldebaran Robotics), (4) iCub (Italian Institute of Technology); Middle row—from left to right (1) Leonardo (MIT), (2) Kismet (MIT), (3) iCat (Philips), (4) Buddy (Blue Frog Robotics); Bottom row—from left to right (1) Paro (AIST), (2) TEGA (MIT), (3) New AIBO (Sony).

#### 2.3.5. Affective Loop

Another challenging research task in SAR is endowing the robot with emotional intelligence. It is important that the interaction between human and robot be affective; thus, the robot must have the ability to perceive, interpret, express and regulate emotions.

Understanding human emotions by the robot while at the same time having the option to express emotions back to the human was defined by Höök [33] as the affective loop (AL). The AL (see Figure 3) is the interactive process in which "the user [of the system] first expresses her emotions through some physical interaction involving her body, for example, through gestures or manipulations; and the system then responds by generating affective expression, using, for example, colours, animations, and haptics" which "in turn affects the user (mind and body) making the user response and step-by-step feel more and more involved with the system" [34].

**Figure 3.** Affective loop adopted from Paiva et al. [34].

Emotion detection is part of the broader area of affective computing (AC), which aims to enable computers to recognize and express emotions [35]. AC regards emotion as playing an essential role in decision making and learning; emotions influence the mechanisms of rational thinking. Picard [35] highlighted several results from the neurological literature indicating that emotions play a necessary role in human creativity and intelligence, as well as in rational human thinking and decision-making.

Computers that interact naturally and intelligently with humans need at least the ability to recognize and express affect. Affect plays a crucial role in understanding phenomena such as attention, memory and aesthetics, and emotion is necessary for creative behavior in humans. Neurological studies indicate that decision-making without emotion can be as impaired as decision-making with too much emotion. Picard [35] argues that affective computers would not only provide better performance in assisting humans but might also enhance computers' own abilities to make decisions.

Therefore, one of the main goals of AC is enabling computers to understand the human emotional state and adjust their responses accordingly. The human emotional state can be expressed non-verbally, verbally or both. A pioneering researcher in body language [36] found that, when interpreting the affect or emotional state of others, we perceive 55% non-verbally (facial expression) and 45% verbally, of which 38% comes from speech (tone of voice, inflection and other sounds) and 7% from the words themselves.

Automatic affect recognition is a challenging task due to the various modalities emotions can be expressed with.


On the other hand, how and when machines should exhibit emotions is also an important research question. Closely linked to this is synthetic emotion, i.e., an emotion produced by a robot. Integration of different modalities, when they are congruent and synchronous, leads to a significant increase in human emotion recognition accuracy [47]. However, when information is incongruent across different sensory modalities, integration may lead to a biased percept, and emotion recognition accuracy is impaired [47].

#### **3. Related Work**

There are numerous studies focusing on detecting emotion from text. Desmet and Hoste [48] used Support Vector Machines to differentiate between 15 different emotions (abuse, anger, blame, fear, forgiveness, guilt, happiness, hopefulness, hopelessness, information, instructions, love, pride, sorrow, thankfulness), using lexical and semantic features (viz. Bags-of-Words of lemmas, Part-of-Speech tags and trigrams) and information from external resources that encode semantic relatedness and subjectivity. Wicentowski and Sydes [49] detected the same 15 emotions using maximum entropy classification. Luyckx et al. [50] presented experiments in fine-grained emotion detection into 15 categories using a Support Vector Machine (SVM). Pak et al. [51] combined a machine learning algorithm (SVM with features: n-grams, POS-tags, the General Inquirer dictionary, the Affective Norms for English Words lexicon, dependency graphs and, lastly, heuristic features) with hand-written rules. Bandhakavi et al. [52] proposed a generative Unigram Mixture Model (UMM) to learn a word-emotion association lexicon from an input document. Alm et al. [53] used Ekman's six basic emotions (fear, joy, sadness, disgust, anger, surprise +/−). Data were classified by a linear classifier—a variation of the Winnow update rule implemented in the Sparse Network of Winnows (SNoW) learning architecture [54]—into two categories, either emotional/non-emotional or positive emotion/negative emotion.

Much attention these days centers on applying deep learning to varied tasks, and emotion detection is no exception; hence we see a burst of research papers in this area. Kratzwald et al. [55] proposed bi-directional LSTM networks (BiLSTMs) together with an extension of transfer learning called sent2affect: the network is first trained on sentiment analysis and, after exchanging the output layer, is then tuned to the task of emotion recognition. Khanpour and Caragea [56] detected Ekman's six emotions in Online Health Community messages. They proposed a computational model that combines the strengths of CNNs, LSTMs and lexicon-based approaches to capture the hidden semantics in messages. Kim and Klinger [57] used Plutchik's eight emotions and 'no emotion' as emotion categories. They applied several models: rule-based (feature: dictionary), multi-layer perceptron (feature: Bag-of-Words), conditional random fields (POS-tags, the National Research Council (NRC) dictionary, an English pronoun list) and BiLSTM-CRF (feature: FastText embeddings with dimension 300). Furthermore, it is worth mentioning that, besides the emotions, the experiencers, causes and targets of the emotions were also annotated. Gupta et al. [58] and Chatterjee et al. [59] proposed a deep learning approach called "Sentiment and Semantic LSTM (SS-LSTM)". Detection of emotions was viewed as a multi-class classification problem with four classes—happy, sad, angry and others.

Table 2 shows emotion datasets widely used by the research community in emotion analysis. As our aim was to use text data in human–robot interaction (in contrast to the works mentioned above), we could not use any of the presented corpora. The text should be neither too long nor too short, and it should be intriguing enough to keep the participants focused. Therefore, we chose fables, as they are interesting short stories, and compiled our own corpus, which is described in Section 4.1.

We see our problem as a multi-label classification task. Therefore, we decided to use Plutchik's eight emotions as the emotional model, together with a 'no emotion' category. We applied a lexicon-based approach (using the NRC emotional dictionary for feature extraction) together with supervised machine learning methods such as Naive Bayes and SVM. Because our dataset is small, we also decided to apply the semi-supervised k-Means algorithm to expand our training data.


#### **Table 2.** Overview of datasets used in emotion detection.

<sup>1</sup> https://www.unige.ch/cisa/index.php/download\_file/view/395/296/; <sup>2</sup> https://data.world/crowdflower/ sentiment-analysis-in-text.

#### **4. Methodology**

We propose a learning algorithm based on lexicon methods and machine learning methods. The workflow of our approach is shown in Figure 4. The specifics of each block are explained in the following sections.

**Figure 4.** Emotion detection flow chart.

#### *4.1. Block: Data*

We built our own English corpus consisting of Aesop's fables. The fables were downloaded (http://www.aesopfables.com, http://read.gov/aesop/), cleaned and saved into .txt documents, each document containing one fable. In total, we have 740 English fables.

We wanted the stories to be read in the human–robot experiment scenario. To keep the audience interested and focused, the text should be neither too long nor too short, and it should be interesting. Therefore, we chose fables, as they are short stories with a moral, using animals as the main characters.

The corpus of English fables consisted of 393 annotated sentences and 2999 unannotated sentences. In the following, we discuss only the annotated sentences. Sentences were annotated into eight categories (Plutchik's eight emotions: joy, trust, sadness, fear, disgust, anger, anticipation, surprise). The number of emotions assigned to each sentence was arbitrary. Figure 5 depicts the count of each emotion across the dataset. Figure 6 displays the number of sentences by the number of emotions they contain. As we can see, sentences were mostly rated with one emotion, followed by neutral sentences. Having more than one emotion per sentence means that we are dealing with a multi-label classification problem. There is no evidence of a positive/negative relationship between the emotion classes (Figure 7).

**Figure 5.** Number of emotions in annotated dataset.

**Figure 6.** Number of sentences with multiple emotions.

**Figure 7.** Correlation of emotion's classes in the dataset.

#### *4.2. Block: Processing of The Data*

The process of data preparation is shown in Figure 8. The first row in the picture represents the processing of a sentence. The second row shows where in the process the features are extracted (e.g., punctuation is gathered from the raw sentence; matching of emotional words from the dictionary and Part-of-Speech (POS) tagging are done after tokenization and removal of high-occurrence words). Fables were formatted as follows: one sentence = one row in a document. Firstly, we converted every character to lower case, applied a function for expanding shortened forms of words into two words (grammatical contractions, e.g., *we're* → *we are*) and removed punctuation (question marks, colons and exclamation marks were first recorded as features). Every sentence was tokenized into words. Afterwards, the POS tagger was applied. Next, we applied the National Research Council (NRC) dictionary to find out whether a given word is contained in its vocabulary; if so, we assigned the corresponding emotion to the word. Finally, we performed stopword removal and lemmatization of the words (keeping words in their root form).

**Figure 8.** Process of cleaning and preparing data for vectorization.
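A minimal sketch of such a cleaning pipeline is given below, assuming NLTK is used; the contraction list is a simplified placeholder, and the exact resources and ordering used in the paper may differ.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Required NLTK resources (downloaded once):
# nltk.download('punkt'); nltk.download('stopwords')
# nltk.download('wordnet'); nltk.download('averaged_perceptron_tagger')

CONTRACTIONS = {"we're": "we are", "don't": "do not", "it's": "it is"}  # simplified list

def preprocess(sentence):
    """Clean one sentence and return lemmas, POS tags and punctuation features."""
    text = sentence.lower()
    for short, full in CONTRACTIONS.items():             # expand grammatical contractions
        text = text.replace(short, full)
    punct_features = {p: text.count(p) for p in "?!:"}   # keep ?, ! and : as features
    text = re.sub(r"[^\w\s]", " ", text)                 # strip remaining punctuation
    tokens = nltk.word_tokenize(text)
    pos_tags = nltk.pos_tag(tokens)                      # POS tagging after tokenization
    tokens = [t for t in tokens if t not in stopwords.words("english")]
    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(t) for t in tokens]   # keep words in their root form
    return lemmas, pos_tags, punct_features

print(preprocess("We're not afraid of the lion!"))
```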

#### *4.3. Block: Feature Extraction and Word Embeddings*

We used a vector space representation of the text, and every sentence was represented by a vector of features. Each sample in the dataset was described as follows:

	- **–** noun: NN noun, singular, NNS noun, plural;
	- **–** adjective: JJ adjective, JJR adjective, comparative, JJS adjective, superlative;
	- **–** verb: VB verb, base form, VBD verb, past tense, VBG verb, gerund/present participle, VBN verb, past participle, VBP verb, sing. present, non-3d, VBZ verb, 3rd person sing. present;
	- **–** adverb: RB adverb, RBR adverb, comparative, RBS adverb, superlative.
	- **–** Bag-of-Words (BoW) representation (number of features was dependent on thresholding occurrence of tokens in input): each sentence was represented as a number of occurrence of given words in the vocabulary. Vocabulary was generated from all tokens in sentences.
	- **–** Term Frequency-Inverse Document Frequency (TF-IDF) (number of features was dependent on thresholding occurrence of tokens in input): similar to BoW, but instead of the number of occurrences, each token was represented as a proportion between the number of occurrence in given sentence and occurrence in the whole corpus.
	- **–** sentence embeddings (300 features): every word (token) in a sentence is represented by its vector obtained from the pretrained ConceptNet Numberbatch model. We used the word embeddings to create sentence embeddings; a sentence embedding is essentially the average of the word embedding vectors belonging to the sentence (see the sketch below).
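The averaging step could look as follows; this is a sketch that assumes the English Numberbatch vectors have been loaded via gensim, and the file name is a placeholder.

```python
import numpy as np
from gensim.models import KeyedVectors

# Assumed: English ConceptNet Numberbatch vectors in word2vec text format (file name is a placeholder).
vectors = KeyedVectors.load_word2vec_format("numberbatch-en.txt", binary=False)

def sentence_embedding(tokens, dim=300):
    """Average the word vectors of all tokens found in the vocabulary."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:                      # no known token: fall back to a zero vector
        return np.zeros(dim)
    return np.mean(found, axis=0)

emb = sentence_embedding(["fox", "grapes", "sour"])
print(emb.shape)                       # (300,)
```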

#### *4.4. Block: Clustering*

Annotation of sentences is exhausting and time-consuming; therefore, we decided to utilize the k-Means algorithm to annotate additional data. We selected k-Means because it is a reliable and fast clustering algorithm, frequently adopted in many real-world applications. In addition to its performance, another aspect was the fast processing of new, unknown samples by the trained model, which was an important factor during run-time.

The *k-Means* clustering algorithm is a well-known algorithm that approximates the maximum-likelihood solution for determining the locations of the means of a mixture of component densities.

$$E(em\_1, \ldots em\_K) = \frac{1}{S} \sum\_{k=1}^{K} \sum\_{w\_n \in EM\_k} ||w\_n - em\_k||^2 \tag{1}$$

where *S* is the number of samples, *K* is the number of clusters, *EM_k* is the set of samples assigned to cluster *k* and *em_k* is the centroid (mean) of that cluster.
The outcomes of the algorithm are clustered data annotated according to the centroid where they belong.

Our usage of k-Means can be described as follows: for each class, we randomly chose five representatives of each category (e.g., for the class joy, five representatives of the "0" category and five representatives of the "1" category) and calculated the centroids. Centroids were calculated as the average of the sum of the vectors (from the vector representation of the data). We ended up with 18 centroids. Before each pair of centroids was fed into the k-Means algorithm, we calculated the distance of every sentence from the given centroids and removed the furthest and the closest one. After that, labels for every class were predicted (see the sketch below). The acquired data gave us the option to expand the training dataset if needed.
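A minimal sketch of this label-expansion step for a single emotion is shown below, using scikit-learn's KMeans seeded with the class centroids; the function and variable names are illustrative, and the removal of the furthest and closest sentences is omitted for brevity.

```python
import numpy as np
from sklearn.cluster import KMeans

def expand_labels(X_labeled, y_labeled, X_unlabeled, n_repr=5, seed=0):
    """Label unannotated sentence vectors for one emotion (binary: 0/1)."""
    rng = np.random.default_rng(seed)
    centroids = []
    for label in (0, 1):
        idx = np.where(y_labeled == label)[0]
        chosen = rng.choice(idx, size=n_repr, replace=False)   # five random representatives
        centroids.append(X_labeled[chosen].mean(axis=0))       # centroid = averaged vectors
    km = KMeans(n_clusters=2, init=np.vstack(centroids), n_init=1)
    km.fit(X_unlabeled)                                        # clusters seeded at class centroids
    return km.predict(X_unlabeled)                             # 0/1 labels per nearest centroid

# Usage (shapes illustrative): X_* are sentence vectors, y_labeled holds 0/1 for one emotion.
# new_labels = expand_labels(X_labeled, y_labeled, X_unlabeled)
```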

#### *4.5. Block: Model Learning*

Since we are working with a multi-label classification problem, we give a brief overview of three methods. In general, we focused on selecting stable methods that provide reliable results while also performing well from the run-time perspective. The multi-label classification problem can be approached in the following ways:


The following sections will describe the methods used and evaluated in our methodology.

#### 4.5.1. Support Vector Machine Model

SVM is a classification model based on the idea of support vectors. The model separates the sample space into two or more classes with the widest margin possible. SVM is originally a linear classifier; however, it can relatively efficiently perform non-linear classification by using a kernel function [65]. A kernel is a method which maps the features into a higher dimensional space specified by the used kernel function. For model building, we need training samples labeled −1 or 1 for each class. SVM then attempts to divide the classes with a parameterized (non)linear boundary in such a way as to maximize the margin between the given classes. The decision values for the two classes are constrained as follows: for a sample of class 1, the value should be greater than or equal to 1; for a sample of class −1, it should be less than or equal to −1:

$$w\mathbf{x}\_{+} + b \geq 1, \quad w\mathbf{x}\_{-} + b \leq -1\tag{2}$$

Both of these conditions ensure that samples are on the correct side of the 'street'. To obtain the widest margin between samples, observe that only the two points nearest to the separating street determine its width. The width can be expressed as the difference vector of these points projected onto the unit vector perpendicular to the street, *w*/||*w*||.

$$width = (\mathbf{x}\_{+} - \mathbf{x}\_{-}) \frac{w}{||w||}\tag{3}$$

The objective is to maximize the width of the street, which is known as the primal problem of SVM. In our case, we used the Radial Basis Function (RBF) kernel.

#### 4.5.2. Multi-Class Naive Bayes Model

The Naive Bayes classifier is a probabilistic classifier based on Bayes' theorem and an independence assumption between the features. The conditional probability of event A given event B is defined according to Bayes' theorem:

$$P(A|B) = \frac{P(A) \* P(B|A)}{P(B)}\tag{4}$$

In practice, *P*(*B*) can be estimated as a constant calculated from the dataset. Replacing *P*(*B*) with a constant $\beta^{-1}$, the previous formula is then expressed as:

$$P(A|B) = \beta \ast P(A) \ast P(B|A) \tag{5}$$

Let us assume that *A* represents class and *B* represents a feature relating to the class *A*. This equation then handles only one feature. Let us extend the rule with more features. Then the conditional probability of class *A* on features *B*, *C* is the following:

$$P(A|B,\mathcal{C}) = \beta \ast P(A) \ast P(B,\mathcal{C}|A) = \beta \ast P(A) \ast P(B|A) \ast P(\mathcal{C}|A) \tag{6}$$

This assumes that features *B* and *C* are independent of each other, so that *P*(*B*, *C*|*A*) can be replaced with *P*(*B*|*A*)*P*(*C*|*A*). For *n* observed features $x_1, \ldots, x_n$, the conditional probability of any class $y_j$ can be expressed as below:

$$P(y\_j|\mathbf{x}\_1, \dots, \mathbf{x}\_n) = \beta \* P(y\_j) \prod\_{i=1}^n P(\mathbf{x}\_i | y\_j) \tag{7}$$

This classification model is called the Naive Bayes classifier. Naive Bayes is often applied as a baseline for text classification [66]. In this work, we used a multi-class Naive Bayes classifier.
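A minimal sketch of a multinomial Naive Bayes text classifier over a Bag-of-Words representation with scikit-learn is shown below; the toy sentences and labels are purely illustrative and not part of the fable corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy example: classify sentences as containing "joy" (1) or not (0); data is illustrative only.
sentences = ["the fox laughed with delight", "the lamb trembled in the dark woods",
             "they danced and sang all night", "the wolf growled at the traveller"]
labels = [1, 0, 1, 0]

# Bag-of-Words counts feed the multinomial Naive Bayes model.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(sentences, labels)
print(model.predict(["the children sang happily"]))   # likely [1] on this toy data
```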

#### 4.5.3. Feed-Forward Neural Network Model

Neural networks are other popular models used in text classification tasks [67,68]. In our experiments, we used a feed-forward neural network model. It proved to be the most suitable neural network model for the given task, as more advanced neural models (CNN, LSTM) require significantly more data to be trained properly. Neural networks are flexible models composed of computational units—neurons—arranged in interconnected layers. Connections between neurons correspond to the numerical parameters of the model—weights. The primary predictive model is a feed-forward neural network [69], which consists of the following layers:


The calculation for all neurons in the hidden and output layers is identical—the output value of each neuron (activation) is calculated as a weighted sum of the neuron's inputs transformed by the activation function. In the hidden layers, we used the ReLU activation function [70]. The output of the ReLU function can be represented as:

$$f(\mathbf{x}) = \max(0, \mathbf{x}). \tag{8}$$

On the output layer, we used the sigmoid activation function [71], which transforms the output into a probability estimate:

$$f(\mathbf{x}) = \frac{1}{1 + \mathbf{e}^{-\mathbf{x}}}.\tag{9}$$

We used Adaptive Moment Estimation (Adam) [72] as an optimization method during the training. RMSProp [73] and Momentum [74] methods are based on different approaches. Momentum accelerates the training in the direction of the minimum, while RMSProp reduces the oscillations by adaptive change of the learning rate. Adam algorithm combines both Momentum and RMSProp heuristics.

The loss function expresses the magnitude of the loss that the model will make in the prediction. By minimizing the loss function, we can obtain the weights for all network layers. In our work, we used Binary Cross-Entropy (BCE):

$$BCE = -(y\log(\hat{y}) + (1 - y)\log(1 - \hat{y})),\tag{10}$$

where *y* is the actual value and $\hat{y}$ is the predicted value.

Based on the prediction and weights, we obtain an output loss which propagates back to the previous layers using the backpropagation algorithm [75]. The weights are then modified to minimize the output error.

In the experiments, we used a feed-forward neural network. The architecture of the network comprised the input layer, four hidden fully connected layers with 32, 64, 128 and 256 neurons and the ReLU activation function, and an output layer containing nine neurons (one per class) with the sigmoid activation function. The model included 55,881 trainable parameters.
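A sketch of such a network in Keras is shown below. The input dimension of 315 is our assumption (it reproduces the reported 55,881 trainable parameters with the stated layer sizes); the actual feature layout used in the paper may differ.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Feed-forward network: four hidden ReLU layers (32, 64, 128, 256) and nine sigmoid outputs.
# The input dimension of 315 is an assumption that reproduces the reported 55,881 parameters.
model = keras.Sequential([
    keras.Input(shape=(315,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.Dense(256, activation="relu"),
    layers.Dense(9, activation="sigmoid"),   # one independent probability per emotion class
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()   # prints 55,881 trainable parameters with this input size
```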

#### *4.6. Block: Classification*

Our approach to the classification lies in transforming our problem into nine separate problems (eight emotion classes and one class without emotion). Based on the observation that the emotions are not dependent on each other (Figure 7), we trained a classifier for each emotion separately. When a new sample comes into the classification, each classifier estimates the probability of its class; each classifier has only one vote. The threshold for accepting a label is set to a probability of 50%.
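A minimal sketch of this one-vs-rest scheme is given below; the choice of base classifier (an RBF-kernel SVM with probability estimates) and the variable names are illustrative, not the exact configuration used for every class.

```python
import numpy as np
from sklearn.svm import SVC

EMOTIONS = ["joy", "trust", "sadness", "fear", "disgust",
            "anger", "anticipation", "surprise", "no_emotion"]

def train_per_emotion(X, Y):
    """Train one binary classifier per class; Y has shape (n_samples, 9)."""
    clfs = {}
    for i, emotion in enumerate(EMOTIONS):
        clf = SVC(kernel="rbf", probability=True)   # RBF kernel, probability estimates enabled
        clf.fit(X, Y[:, i])
        clfs[emotion] = clf
    return clfs

def predict_labels(clfs, x, threshold=0.5):
    """Accept every label whose estimated probability reaches the 50% threshold."""
    x = np.asarray(x).reshape(1, -1)
    # predict_proba columns follow clf.classes_; column 1 is the positive class when classes are [0, 1]
    return [e for e, clf in clfs.items() if clf.predict_proba(x)[0, 1] >= threshold]
```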

#### *4.7. Block: Evaluation of Results*

To evaluate the results, we used statistical metrics commonly used in text classification: precision, recall, F1 score, Matthews Correlation Coefficient and subset accuracy. The dataset was split into training and testing sets in a 70/30 ratio. We used stratified sampling for multi-label classification implemented in the scikit-multilearn library (http://scikit.ml/stratification.html).
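The stratified multi-label split could be performed as sketched below with scikit-multilearn; the array shapes are illustrative only.

```python
import numpy as np
from skmultilearn.model_selection import iterative_train_test_split

# X: feature matrix, Y: binary label matrix with one column per class (shapes illustrative).
X = np.random.rand(393, 315)
Y = np.random.randint(0, 2, size=(393, 9))

# 70/30 split that keeps the label distribution balanced across both sets.
X_train, Y_train, X_test, Y_test = iterative_train_test_split(X, Y, test_size=0.3)
print(X_train.shape, X_test.shape)
```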

Firstly, we define the confusion matrix. The confusion matrix summarizes the classification performance of a classifier with respect to test data. It is a two-dimensional matrix, where one dimension represents the true class of a document and the second dimension represents class label predicted by the classifier. Table 3 presents an example of confusion matrix.

**Table 3.** Confusion matrix for two classes.


• **Precision**—defined as the fraction of the number of texts correctly labeled as belonging to the positive class among the total number of retrieved texts annotated as belonging to the positive class.

$$Precision = \frac{TP}{TP + FP} \tag{11}$$

• **Recall**—defined as the fraction of the number of texts correctly annotated as belonging to the positive class among the total number of texts that actually belong to the positive class.

$$Recall = \frac{TP}{TP + FN} \tag{12}$$

• **F1 score**—the harmonic mean of precision and recall. This score takes both false positives and false negatives into account.

$$F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall} \tag{13}$$

• **Matthews Correlation Coefficient** (MCC)—in comparison with F1 score, it is a more reliable statistical rate which produces a high score only if the prediction obtained good results in all of the four confusion matrix categories (true positives, false negatives, true negatives and false positives), proportionally both to the size of positive elements and the size of negative elements in the dataset [76]. It returns a value between −1 and +1. A coefficient of +1 represents a perfect prediction, *0* no better than random prediction and −1 indicates total disagreement between prediction and actual class.

$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{14}$$
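All of these metrics are available in scikit-learn; the short sketch below illustrates their use, with toy labels that are purely illustrative.

```python
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             matthews_corrcoef, accuracy_score)

# Toy ground truth and predictions for one emotion class (binary labels).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1 score: ", f1_score(y_true, y_pred))
print("MCC:      ", matthews_corrcoef(y_true, y_pred))

# Subset accuracy for multi-label output: a row counts only if every label in it is correct.
Y_true = [[1, 0, 0], [0, 1, 1]]
Y_pred = [[1, 0, 0], [0, 1, 0]]
print("Subset accuracy:", accuracy_score(Y_true, Y_pred))   # 0.5
```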


#### **5. Experiments with Text Data**

#### *5.1. Baseline*

Our baseline model consisted of the NRC dictionary and the 393 annotated sentences. We matched every word against the dictionary and assigned the number of matching occurrences to each emotion. We then transformed the number of occurrences into a binary representation ("0" if an emotion is not present, "1" if the emotion has one or more occurrences). Table 4 shows that, out of the eight emotions, *Joy* is classified most accurately and *Disgust* and *Trust* the worst. The reason lies in our data: looking back at Figure 5, we can see that trust and disgust are the least represented classes.


**Table 4.** Accuracy of emotion dictionary, lexicon approach.

#### *5.2. Building of the Naive Bayes Model Using Bag-of-Words*

We first tested our data with the Bag-of-Words representation (Table 5). As we can see, the precision is rather low; above-average results are obtained only for the *No emotion* class. *Trust* and *Disgust* scored 0; however, looking at the subset accuracy, we see that they achieve scores of 97% and 91%, respectively. That means that even though we did not classify a positive case, we obtained a good estimate of the overall class. We experimented with several model setups, such as:


**Table 5.** Accuracy of Bag-of-Words representation, Multi-nomial Naive Bayes classifier.


By fine-tuning different pre-processing settings, such as removing or keeping stopwords, using uni-grams or bi-grams, and the threshold for the minimal/maximal count of a word to be excluded from the vocabulary, we improved the precision of *Joy* and *Anticipation* (Table 6).


**Table 6.** Accuracy of fine-tuned settings in Bag-of-Words representation, Multi-nominal Naive Bayes classifier.

Then, we extended the features with emotions from the NRC dictionary, POS tags and punctuation and continued tuning our model. We saw an improvement in the *No emotion* and *Anger* classes (Table 7).

**Table 7.** Accuracy of fine-tuned setting in Bag-of-Words representation and added features, Multi-nomial Naive Bayes classifier.


We also tried every feature individually. We noticed an increase in the accuracy of *Fear* to 25%, *Surprise* to 50%, *Sadness* to 50% and *Disgust* to 25%. To increase the accuracy, we needed to use a different setup for every class.

When we added the k-Means-annotated data to the training set, the *Disgust* accuracy rose to 67%; all other accuracy metrics remained at the same level.

#### *5.3. Building of the Naive Bayes Model Using TF-IDF*

The foundation of the second experiment was the TF-IDF representation of sentences. The results of this experiment can be seen in Table 8. The highest score was obtained in the *Joy* class; the lowest in the *Trust* and *Disgust* classes.


**Table 8.** Accuracy of Term Frequency-Inverse Document Frequency (TF-IDF) representation, Multi-nomial Naive Bayes classifier.

After fine-tuning the parameters of our model, we trained the model and compared the results. Table 9 summarizes the results of the Multi-nomial Naive Bayes classifier with TF-IDF after fine-tuning, Table 10 summarizes the fine-tuning of the model trained using the extended set of features.


**Table 9.** Accuracy of fine-tuned settings in TF-IDF representation, Multi-nomial Naive Bayes classifier.



Adding more semi-automatically labeled data further raised the accuracy of the *Sadness* class to 67%.

#### *5.4. ConceptNet Numberbatch Converted to Sentence Embeddings*

The basis of this experiment was sentence embeddings: we used ConceptNet Numberbatch word embeddings and converted them into *sentence embeddings*. On top of that, we added the NRC emotional dictionary, punctuation and POS tag features. As Table 11 shows, the per-class accuracy is low, but all classes except one (*Trust*) are covered.


**Table 11.** ConceptNet Numberbatch—sentence embeddings.

Adding features to the model did not help to raise its accuracy significantly. Adding data labeled by k-Means helped to improve accuracy in the class *No emotion* to 68% by using SVM classifier. The average accuracy for the rest of the classes was 20%.

#### *5.5. Neural Network Classifier*

In this experiment, we trained a feed-forward neural network classifier to compare the performance of the neural network approach with the standard machine learning methods used in the previous experiments. The architecture of the network is described in Section 4.5.3. The performance of the model is summarized in Table 12. As we can see from the results, the neural network classifier achieved slightly better performance (when considering averaged metrics) than the standard machine learning models. However, the lack of training data meant that more advanced deep learning approaches (such as CNN or LSTM models) or popular language models (e.g., BERT) could not be properly trained to solve this task.


**Table 12.** Feed-forward neural network.

#### *5.6. Ensemble Classifier*

We combined the best-obtained models for each class and integrated them into the ensemble classifier, as shown in Table 13. We can see an increase in exact accuracy, which is the strictest metric and expresses how many completely correct rows (all labels correct) we obtained from the classifier. We did not include the neural network model in the ensemble: the ensemble members were selected as binary classifiers for each particular class, which in the case of the neural network would require re-training it in a one-vs-rest approach. Therefore, the neural network was used primarily as a comparison for the performance of the ensemble model.

**Table 13.** Ensemble of binary classifiers. NB: Multi-nominal Naive Bayes, SVM: Support Vector Machine, NRC: emotion dictionary, POS: Part-of-Speech tags, PUNC: punctuation, SW: stop words.


During the experiments, besides the initial base classifiers, we compared the ensemble model's performance with other machine learning algorithms. For comparison purposes, we used the feed-forward neural network model described in Section 4.5.3 as well as frequently used models from the popular Python machine learning library scikit-learn. The comparison included baseline classifiers (Logistic Regression, SVM, Decision Trees, k-NN) and other ensemble models (e.g., AdaBoost). As the proposed ensemble model combines different ensemble members, trained on different feature subsets or on an expanded set of attributes, we compared the ensemble with other machine learning models trained both on the TF-IDF representation and on TF-IDF extended with the expanded attributes. Table 14 summarizes the performance of the ensemble and the other ML models. The results represent the averaged values of the 10-fold cross-validated models on the testing set. The inclusion of the extended set of features in the TF-IDF representation brings a slight improvement to some of the models. In general, the performance of the base models is rather poor in comparison to the ensemble model.


**Table 14.** Comparison of the ensemble model with other machine learning (ML) models.

#### **6. Experiments with Humanoid Robot NAO**

We propose scenarios with the humanoid robot NAO and humans (either children or adults). The group of participants was the same for each experiment. It consisted of 8 participants (7 adults and 1 child), aged from 3 to 50 years. In these experiments, we focused on creating a small yet diverse group of subjects, with participants from different age groups. Each participant interacted with the robot alone; thus, it was a one-on-one interaction. The participants were not accustomed to humanoid robots, so it was their first such interaction. All except one were educated people. The experimenter was behind the wall. During the experiments, we paid attention to two variables: the length of the interaction and the number of fables read.

Throughout the experiments, we used the NAO robot v.5. NAO is a humanoid robot often utilized in HRI experiments. It can move its hands, walk, talk and listen. Given its very limited facial expressiveness, it can use its eye LED lights to blink, and changing their color can suggest different emotional states (e.g., red LEDs = anger). A pre-trained classifier was running on a server (a standard desktop PC configuration) connected to the NAO robot. During run-time, the classifier processed the sentences/fables. A computer was used to invoke the scripts for speech and movements on NAO.

#### *6.1. Experiment 1A—Basic Setup*

The setup of the first experiment is straightforward (Figure 9). NAO is presented as a "Narrator". It greets the participant and asks them to sit down, facing it. Subsequently, it offers to tell a story and starts narrating as soon as it hears "yes". The input to NAO is the fable without any emotional markup; thus, NAO reads the fable without any expression (either movement or vocal). The recipient faces NAO and listens to the story. After telling the whole story, NAO gives the option either to continue with another story or to finish; the number of stories is fully dependent on the participant. At the end of the experiment, we give every participant the questions shown in Table 15.


**Table 15.** Survey about robot performance in the first two scenarios.

**Figure 9.** Setup for the experiment 1A.

We can break down our system to the following parts:


#### *6.2. Experiment 1B—Setup with Emotional Movements and Gestures*

The setup for the second experiment (Figure 10) is the same as for the first experiment, with three exceptions. First: the input to NAO is an Aesop's fable marked with emotions. The second is closely connected to the first: NAO narrates the story with movements and changes in pitch. The third difference concerns the case when the participant wants to hear another story: after a second story is requested, NAO says that it is tired and asks whether the participant really wants to hear another story. If it gets a positive response, it continues; otherwise it thanks the participant and the experiment is finished. At the end of the experiment, the participant fills in the survey with the same questions as before (Table 15).

**Figure 10.** Setup for the experiment 1B.

We can break down our system to the parts similar to experiment 1. On top of the used block we added:


#### *6.3. Experiment 1C—Setup with Random Movements*

We took the setup from experiment 1B, removed the classification block and modified the block *Generating script for NAO text-to-speech and emotional gestures* to generate arbitrary gestures, incongruent with the emotions in the written text (Figure 11).

**Figure 11.** Setup for the experiment 1C.

#### *6.4. Results of the Experiment 1*

The results of the experiments are shown in Table 16. For the responses, we used a five-point Likert scale with the options: 5—I agree extremely; 4—I agree very much; 3—I agree moderately; 2—I agree slightly; 1—I do not agree. We took the average of the scores for each question. The average length of the interaction was measured from the point where the NAO robot greeted the person until it finished narrating its last fable, rounded to minutes. The average number of fables read indicates how many fables were read during one session.

**Table 16.** Results from the experiment 1.


From the results above, we can conclude that the robot with emotional/random cues (experiments 1B, 1C) achieved a better overall rating than the robot without emotional cues (experiment 1A). We demonstrated that adding emotional or random expressive behavior to the robot changes how humans perceive the text narrated by the robot. The question then is whether it is really necessary to add emotional cues to the robot, or whether any cues, i.e., randomly generated movements, would be sufficient. Hence, we adjusted experiment 1B so that the gestures generated by the robot were assigned randomly (experiment 1C). Experiments 1B and 1C show that the difference between emotional movements and random gestures is not substantial; however, emotional movements give slightly better results. Only in Q4 did random gestures top emotional ones. We assume the reason was the randomness of the generated movements: participants were surprised by the sudden movements and therefore saw the robot as interesting.

#### *6.5. Experiment 2—Robot Interaction to Human Spoken Words*

The setup for experiment 2 (Figure 12) is as follows: the participant is greeted by NAO and asked to sit down. After that, NAO asks the participant to tell it a story. The participant is given the story to read beforehand. While the participant reads the story to the robot, the Google Cloud Speech-to-Text service is used to transcribe the speech into written text. Afterwards, our emotion classifier detects emotions in the transcribed text. The text is processed into sentences, and emotional gestures are automatically assigned to the text based on the detected emotions. NAO executes the script and makes the emotional gestures. After the fable has been read, the robot asks whether the participant would like to read it another story. If it gets a negative response, it says thanks and that it is looking forward to the next session. At the end of the experiment, participants fill in the survey (Table 17).

**Figure 12.** Setup for experiment 2.

#### *6.6. Results of the Experiment 2*

The results in Table 17 suggest that the robot reacting to human spoken words had a positive impact on the perception of the robot (Q5). The robot even appeared capable of understanding what it was told (Q2). What surprised us was the low score of Q3, but it can be explained in two ways: either the participants did not see the point in reading to the robot, or they would have preferred to tell the robot their own text (Q4). Despite this, in the current scenario, participants enjoyed reading to the robot. Q3 was also reflected in the average number of fables read and in the length of the interaction.


**Table 17.** Results from experiment 2.

#### **7. Conclusions**

The presented work connects two large areas of research, namely sentiment analysis and human–robot interaction. We saw a gap in HRI years ago that SA could fill. Usually, there is no automation whatsoever in HRI when processing the texts spoken by a robot: if a robot is able to speak, everything it says is scripted beforehand. Two problems arise from this. Firstly, script writing is tedious work and cannot cover every possibility. Secondly, the robot cannot react adequately when surprised, which lowers its positive perception by humans. As we are heading towards the era of socially assistive robotics (such as artificial companions), we need to incorporate emotion detection from text, which receives less attention from the scientific community than the other modalities (face, voice, gestures).

To demonstrate our claim for emotion detection in text within HRI, we conducted experiments with the humanoid robot NAO. We combined quantitative research, with surveys and variables tracked during the experiment (length of interaction and number of fables read), and qualitative research, asking our participants about the experiment, to measure the improvement in robot-to-human interaction. The results of the experiments show that there is indeed positive feedback on the human side. From the questionnaire results, it is evident that adding gestures to the robot increases the positivity of the interaction.

We used a lexicon approach and a machine learning approach for the emotion detection. Models for emotion classification were trained using various machine learning methods, such as the Naïve Bayes classifier and a feed-forward neural network, using various data representations such as Bag-of-Words, TF-IDF and sentence embeddings (ConceptNet Numberbatch). Finally, the ensemble classifier, which consisted of the nine best models, one for each emotion, was used in the scenarios with the humanoid robot NAO.

The results of emotion detection in text using machine learning approaches show an increase in precision and accuracy for each label. Adding additional features from the emotional dictionary raised the accuracy in some classes more than in others. The biggest increases in accuracy can be seen in the classes *Disgust* (90%), followed by *Joy* (83%), *Anger* (80%), *No emotion* (71%), *Anticipation* (67%) and *Sadness* (67%). The remaining classes have an accuracy equal to or lower than 50%. Compared to the baseline, the improvement is small but still present. Lastly, we observed a change in testing precision and accuracy when we added the new data annotated by the k-Means algorithm.

Based on the obtained results, we see potential in utilizing automatic emotion detection from text in human–robot interaction. As experiment 1C showed, the system does not have to be 100% accurate to arouse a positive response from the human. We can also look at it from another angle: the system should at least avoid showing happy gestures when the perceived emotion should be sad, and vice versa. That can be reframed as a classification problem in which no occurrence of the wrong emotion should be observed.

**Author Contributions:** Conceptualization, M.S. (Martina Szabóová) and K.M.; methodology, M.S. (Martina Szabóová); software, M.S. (Martina Szabóová) and V.M.K.; validation, M.S. (Martina Szabóová) and V.M.K.; formal analysis, M.S. (Martina Szabóová) and M.S. (Martin Sarnovský); investigation, M.S. (Martina Szabóová), M.S. (Martin Sarnovský) and K.M.; resources, M.S. (Martina Szabóová) and M.S. (Martin Sarnovský); data curation, M.S. (Martina Szabóová); writing—original draft preparation, M.S. (Martina Szabóová), M.S. (Martin Sarnovský) and V.M.K.; writing—review and editing, M.S. (Martina Szabóová), K.M., M.S. (Martin Sarnovský) and V.M.K.; visualization, M.S. (Martina Szabóová); supervision, K.M.; project administration, K.M.; funding acquisition, K.M. All authors have read and agreed to the published version of the manuscript.

**Funding:** The work was supported by the Slovak Research and Development Agency under the contract No. APVV-16-0213 Knowledge-based approaches for intelligent analysis of big data and No. APVV-17-0267 Automated Recognition of Antisocial Behaviour in Online Communities.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Research of HRV as a Measure of Mental Workload in Human and Dual-Arm Robot Interaction**

**Shiliang Shao 1,2,\* , Ting Wang 1,2, Yongliang Wang 1,2,3 , Yun Su 1,2,3, Chunhe Song 1,2 and Chen Yao 1,2**


Received: 3 November 2020; Accepted: 13 December 2020; Published: 17 December 2020

**Abstract:** Robots work in unstructured environments instead of humans, expanding the scope of human work. The interaction between humans and robots is indirect, through operating terminals. The mental workload of the human increases with the lack of direct perception of the real scene. Thus, mental workload assessment is important, as it can effectively help avoid serious accidents caused by mental overload. In this paper, the operated object is a dual-arm robot. The classification of the operator's mental workload is studied using the heart rate variability (HRV) signal. First, two kinds of electrocardiogram (ECG) signals are collected from six subjects who performed tasks or maintained a relaxed state. Then, HRV data are obtained from the ECG signals and 20 kinds of HRV features are extracted. Last, six different classifiers are used for mental workload classification. Using each subject's HRV signal to train the model, the subject's mental workload is classified; an average classification accuracy of 98.77% is obtained using the K-Nearest Neighbor (KNN) method. By using the HRV signals of five subjects for training and that of one subject for testing with the Gentle Boost (GB) method, the highest average classification accuracy (80.56%) is obtained. This study has implications for the analysis of HRV signal characteristics of mental workload in different subjects, which could improve operators' well-being and safety in the human-robot interaction process.

**Keywords:** human-robot interaction; mental workload; heart rate variability; machine learning

#### **1. Introduction**

In unstructured environments, robots replace humans in performing some complex tasks, which expands the scope of human work [1,2]. The dual-arm robot, a typical kind of robot, has been widely studied [3,4]. Dual-arm robots can imitate the movement of a human's two arms, an important step towards humanoid operation, and research on dual-arm robots has consistently moved towards more human-like operation. In this paper, a dual-arm robot is studied as the operated object; it is controlled by a wearable exoskeleton controller in master-slave mode. The dual-arm robot's performance is not only limited by the performance of the system but is also closely related to the current state of the operator. Sometimes, a large mental workload can still lead to improper or wrong operation, even when the system is stable and the operator has a good sense of presence. Therefore, it is crucial to monitor the mental workload of the operator. On this basis, the human-robot task assignment could be dynamically adjusted based on the mental workload. This kind of research improves human-robot system performance and safety and refines the subjective experience of operators. Therefore, it is of great theoretical significance and practical value to study the mental workload measurement of operators of dual-arm robots.

In recent years, mental workload has gradually become a hot research topic. The concept was first proposed in the 1940s [5]; its purpose was to optimize the human–machine system. There are various definitions of mental workload; however, the primary content of the definitions is the relationship between 'requirement of resources for tasks' and 'ability of the operator to provide those resources' [6]. In reality, the traditional methods of evaluating mental workload are mainly subjective. However, the main defect of the subjective scale method is the lack of objectivity and continuity of measurement. Undeniably, the evaluation of mental workload by physiological signals, such as electroencephalogram (EEG) [7,8], respiration rate (RR) [9], blood pressure (BP) [10], skin temperature (ST) [11], galvanic skin response (GSR) [12], blink frequency (BF) [13], and heart rate variability (HRV) [14], has achieved some progress. Although more effective information can be obtained by using multi-sensors fusion to analyze mental workload, it causes great inconvenience to operators because they have to use a large number of electrodes, sensor units, and so on. HRV is the physiological phenomenon of fluctuation in the time interval between heartbeats. It is the most convenient and common physiological measurement method for mental workload. Thus, in this paper, HRV is studied as a measure of mental workload in human and dual-arm robot interaction.

The traditional mental workload measurement methods using HRV are based on time domain and frequency domain features. Among the time domain features, the best performing ones are the standard deviation of the R-R interval (SDNN), the root mean square of the successive R-R interval differences (RMSSD), the proportion of beats with a successive R-R interval difference exceeding 50 ms (PNN50), and the sum of all R-R intervals divided by the maximum of the density distribution (HRVTi) [15,16]. In the frequency domain analysis methods, the HRV signal is decomposed into multiple frequency components, and the power spectrum of each frequency component and the sum of the power spectra of all frequency bands are regarded as features for mental workload measurement. In detail, these features include the power spectrum of the very low frequency band (VLF: 0.003–0.040 Hz), low frequency band (LF: 0.04–0.15 Hz), high frequency band (HF: 0.15–0.4 Hz), and the total power spectrum (TP: ≤0.4 Hz) [17,18]. However, the time domain indices cannot show the time-varying characteristics of HRV, which limits their ability to reflect the response of the autonomic nervous system. Meanwhile, the frequency domain indices can only provide global frequency information, lacking coupling information between local and different frequencies. A human body can be abstracted as a complex nonlinear system; the time domain and frequency domain features of HRV signals cannot fully express its nonlinear characteristics [19,20]. At present, relevant studies have used nonlinear analysis methods to analyze HRV signals for mental workload. Castaldo et al. [21] extracted nonlinear features of HRV signals to measure mental workload while playing games; specifically, these included the Poincaré plot, detrended fluctuation analysis, the recurrence plot, sample entropy, approximate entropy, and Shannon entropy, among others. Tiwari et al. [22] proposed an improved multi-scale permutation entropy analysis method to measure and analyze HRV signals, and accomplished the classification of mental workload during the MATB task. Delliaux et al. [23] analyzed a variety of nonlinear features of HRV signals through statistical analysis.

However, there has been no research on HRV as a measure of mental workload in human and dual-arm robot interaction. In this paper, HRV is studied as such a measure. The main contributions of this work are summarized as follows: first, this paper extracts time domain, frequency domain, and nonlinear features of HRV signals, exploring the hidden neural activity information in depth; then, the mapping relationship between the HRV signal and mental workload is analyzed; in addition, models trained on data from the same subject and across different subjects are investigated, respectively.

The rest of the paper is structured as follows: in Section 2, the process of ECG data acquisition is described, the HRV signal extraction algorithm is presented, and the feature extraction method is introduced. Section 3 shows the experimental results, which reflect the statistical analysis of the features and the mental workload measures. The discussion of the results is presented in Section 4. In Section 5, the conclusion of this paper is presented.


#### **2. Data and Methods**

Firstly, the process of mental workload recognition in this paper is presented and shown in Figure 1. Then, the subjects that participated in the data acquisition are introduced. Subsequently, the data acquisition process is introduced and the features are extracted. Finally, the mental workload identification results based on the extracted features are presented.

**Figure 1.** The process of mental workload classification based on HRV.

#### *2.1. Participants*

The study employed a total of six male participants, who were, on average, 25.16 years old, as shown in Table 1. They were selected from the Shenyang Institute of Automation, Chinese Academy of Sciences. All had normal or corrected vision, were right-handed and in good health, and had no heart, cerebrovascular, or nervous system problems. All participants were informed about the experiment and were asked to wear loose and comfortable clothing.


**Table 1.** A description of the subjects.

#### *2.2. Data Acquisition and Processing*

The dual-arm robot utilized in this paper is shown in Figure 2a. The robot has six independent driving wheels; therefore, it can adapt to various complex terrain structures. Moreover, the robot is equipped with two arms, each with seven degrees of freedom, to imitate the number and structure of human arms. The end of each arm is an open-close clamp, which can be used for precision operation. At the same time, the robot is equipped with a binocular camera, which can be used to enhance the operator's sense of presence. In order to facilitate operation, the manipulator of the dual-arm robot is controlled by a wearable controller, which is shown in Figure 2b. The wearable controller has the same structure as the arms of the dual-arm robot. Between the wearable controller and the dual-arm robot, the master-slave control mode is used, as shown in Figure 2c.

(**a**) Dual-arm robot.

(**b**) Wearable robot controller.

(**c**) Master-slave control mode.

**Figure 2.** Dual-arm robot and wearable controller.

The ECG signal acquisition sensor and software used in this paper are shown in Figure 3a,b. The sensor is a portable chest strap that can be attached to the operator's chest. The sensor is based on the BMD101 chip, which is the most widely used ECG signal acquisition chip at present, and it avoids interfering with the operator's normal operation. The ECG data are transmitted via Bluetooth to a computer for collecting and displaying the ECG signals.

**Figure 3.** The operator's ECG signal acquisition system.

The flow chart of data acquisition is shown in Figure 4. First, subjects read and sign the informed consent form. They are then trained in operating the robot; only after passing the set assessment criteria can they participate in the experiment. Before the experiment begins, the ECG acquisition device is placed on the subject's chest, and the Karolinska Sleepiness Scale (KSS) is filled in to determine the operator's sleepiness state; the KSS is filled in again once the operator has completed the mission. After giving the operator a minute to concentrate, the experiment starts. ECG signals of each operator are collected in two mental workload states. The tasks performed under each level of mental workload are defined as follows: (1) mental workload level 1: the operator does not perform any task and maintains a relaxed state; (2) mental workload level 2: the operator operates the arms of the robot to follow a specified trajectory. ECG signals of the operator are collected for 10 min in each task. At the end of the task, the data records are checked and the ECG acquisition equipment is removed from the subject, which ends the experiment. A 3 min sliding window that slides by 10 s each time is used to process the data. The sliding window segments the 10 min recording of each state of each subject, and the resulting three-minute segments are used for the identification and classification of the two mental workload states.
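As a minimal sketch of this segmentation step (the array names and layout are illustrative assumptions, not the authors' code), the following Python routine cuts a 10 min RR-interval recording into overlapping 3 min windows advanced by 10 s:

```python
import numpy as np

def sliding_window_segments(rr_times, rr_values, win_s=180, step_s=10, total_s=600):
    """Cut one 10 min HRV recording into 3 min segments with a 10 s step.

    rr_times  -- timestamps (s) of each RR interval within the recording
    rr_values -- RR interval durations (ms)
    """
    segments = []
    start = 0.0
    while start + win_s <= total_s:
        mask = (rr_times >= start) & (rr_times < start + win_s)
        segments.append(rr_values[mask])
        start += step_s
    return segments
```

With a 600 s recording, a 180 s window, and a 10 s step, this yields (600 - 180)/10 + 1 = 43 segments per state per subject.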

**Figure 4.** Flow chart of sample data collection.

The HRV signal shown in Figure 5 is obtained from the ECG signal collected by the sensor. The HRV signal is defined as the fluctuation in consecutive RR intervals. Hence, to obtain the HRV sequence from the ECG signal, a QRS complex detection method is used to detect the Q wave, R wave, and S wave [24]. Nevertheless, abnormal points may be present in the HRV signal output by the QRS detection method, so a median filtering method is used to remove these exception values [25].
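A simplified stand-in for this pipeline is sketched below; the QRS detection method of [24] is replaced by a generic peak detector, and the sampling rate and thresholds are assumptions for illustration only:

```python
import numpy as np
from scipy.signal import find_peaks, medfilt

def ecg_to_hrv(ecg, fs=512):
    """Extract an RR-interval (HRV) series from a raw ECG trace.

    ecg -- 1-D array of ECG samples; fs -- sampling rate in Hz (assumed value).
    """
    # Detect R peaks: prominent deflections at least 0.3 s apart (< 200 bpm).
    peaks, _ = find_peaks(ecg, distance=int(0.3 * fs),
                          height=np.mean(ecg) + 2 * np.std(ecg))
    rr_ms = np.diff(peaks) / fs * 1000.0      # RR intervals in milliseconds
    rr_clean = medfilt(rr_ms, kernel_size=5)  # median filter removes abnormal points [25]
    return rr_clean
```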

**Figure 5.** The ECG signal and HRV signal. (**a**) The obtained ECG signal. (**b**) The extracted HRV signal.

#### *2.3. Feature Extraction*

This sub-section presents the features extracted from the obtained HRV data. During dual-arm robot operation tasks, the change in the operator's mental workload is closely related to fluctuations of the sympathetic and parasympathetic nervous systems. The time domain features of the HRV signal reflect the overall variability of the autonomic nervous response. Frequency domain features in the high frequency band are related to the intensity of parasympathetic modulation, whereas the low frequency band is influenced more by sympathetic regulation. In addition, nonlinear features express the chaotic and dynamic characteristics of the HRV signal.

#### 2.3.1. Linear Features


1. Time domain features

The main features used in the time domain are shown in Table 2: SDNN, RMSSD, PNN50, and HRVTi. In addition, the mean and median of the HRV signal are also extracted as features.


PNN50 is defined as
$$\mathrm{PNN50} = \frac{num\left(RRs_{i+1} - RRs_i > 50\right)}{N - 1},$$
where *N* is the number of RR intervals.
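A hedged sketch of these time domain features follows; the conventional absolute-difference form of PNN50 and the 1/128 s (7.8125 ms) histogram bin width for HRVTi are assumptions, since the paper does not state them explicitly:

```python
import numpy as np

def time_domain_features(rr_ms):
    """Time domain HRV features listed in Table 2 (common definitions assumed)."""
    rr_ms = np.asarray(rr_ms, dtype=float)
    diff = np.diff(rr_ms)
    feats = {
        "MeanRR": np.mean(rr_ms),
        "MedianRR": np.median(rr_ms),
        "SDNN": np.std(rr_ms, ddof=1),                          # std of all RR intervals
        "RMSSD": np.sqrt(np.mean(diff ** 2)),                   # root mean square of successive differences
        "PNN50": np.sum(np.abs(diff) > 50) / (len(rr_ms) - 1),  # fraction of successive differences > 50 ms
    }
    bins = np.arange(rr_ms.min(), rr_ms.max() + 7.8125, 7.8125)
    hist, _ = np.histogram(rr_ms, bins=bins)
    feats["HRVTi"] = len(rr_ms) / hist.max()                    # HRV triangular index
    return feats
```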

#### 2. Frequency domain features

All frequency domain features used in this paper are based on the power spectral density. A Lomb–Scargle periodogram is used to calculate the power spectral density, which achieves higher estimation accuracy than FFT-based methods for the unevenly sampled RR series [26]. The detailed descriptions and definitions are shown in Table 3.
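The sketch below estimates band powers from a Lomb–Scargle periodogram of the RR series; the band limits are the conventional HRV bands (VLF 0.003–0.04 Hz, LF 0.04–0.15 Hz, HF 0.15–0.4 Hz), assumed here because Table 3 is not reproduced:

```python
import numpy as np
from scipy.signal import lombscargle

def frequency_domain_features(rr_ms):
    """Band powers of the RR series via a Lomb-Scargle periodogram."""
    rr_ms = np.asarray(rr_ms, dtype=float)
    t = np.cumsum(rr_ms) / 1000.0              # beat times in seconds (unevenly spaced)
    y = rr_ms - np.mean(rr_ms)
    f_hz = np.linspace(0.003, 0.4, 400)
    pxx = lombscargle(t, y, 2 * np.pi * f_hz)  # scipy expects angular frequencies

    def band_power(lo, hi):
        m = (f_hz >= lo) & (f_hz < hi)
        return np.trapz(pxx[m], f_hz[m])

    vlf = band_power(0.003, 0.04)
    lf = band_power(0.04, 0.15)
    hf = band_power(0.15, 0.4)
    return {"VLF": vlf, "LF": lf, "HF": hf, "LF/HF": lf / hf, "aTotal": vlf + lf + hf}
```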


**Table 3.** Statistical features in the frequency domain.

#### 2.3.2. Nonlinear Features

1. Sample Entropy (SaEn):

SaEn is a method that can be used to measure the complexity of physiological signals. It estimates the probability that two subsequences of the HRV signal that match at a length of *m* also match at a length of *m* + 1, where a tolerance parameter *r* determines whether two points match. In this paper, the value of *m* is set to 2, and the value of *r* is defined as 0.2 × *std*, where *std* represents the standard deviation of the input HRV data [27].
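A direct implementation of this definition is sketched below, using the paper's settings *m* = 2 and *r* = 0.2 × *std*; it is an O(N²) sketch that is adequate for a few hundred RR intervals:

```python
import numpy as np

def sample_entropy(x, m=2, r_factor=0.2):
    """Sample entropy with m = 2 and r = 0.2 * std of the input series."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    r = r_factor * np.std(x)

    def match_count(mm):
        # All templates of length mm starting at the same n - m positions.
        templ = np.array([x[i:i + mm] for i in range(n - m)])
        dist = np.max(np.abs(templ[:, None, :] - templ[None, :, :]), axis=2)
        return (np.sum(dist <= r) - len(templ)) / 2.0   # exclude self-matches

    b = match_count(m)       # matches of length m
    a = match_count(m + 1)   # matches of length m + 1
    return -np.log(a / b) if a > 0 and b > 0 else np.inf
```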

2. Detrended Fluctuation Analysis (DFA):

DFA can be used to assess the statistical self-affinity of a physiological signal by removing the trend from a series of events. In particular, it reflects information about long-term correlations in the HRV signal and has been widely used in HRV signal analysis [28]. The fluctuations of the HRV signal can be expressed as a function of the time interval: *F*(*n*) = *p* · *n*^*Alpha*, where *p* is a constant and *Alpha* is a scale exponent; *F* represents the fluctuation of the HRV and *n* is the time interval (window size). The HRV signal fluctuations change as the parameter *n* is varied. The two parameters *Alpha1* and *Alpha2* are defined as the slopes of log *F*(*n*) as a function of log *n* over the short-term and long-term ranges of *n*, respectively.
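The following sketch estimates *Alpha1* and *Alpha2* by detrended fluctuation analysis; the short-term (4–16 beats) and long-term (16–64 beats) box ranges are the conventional choices and are assumed here, since the paper does not state its ranges:

```python
import numpy as np

def dfa_alphas(rr_ms, short=(4, 16), long=(16, 64)):
    """Detrended fluctuation analysis of an RR series, returning (Alpha1, Alpha2)."""
    rr_ms = np.asarray(rr_ms, dtype=float)
    y = np.cumsum(rr_ms - np.mean(rr_ms))          # integrated, mean-centred series

    def fluctuation(n):
        n_boxes = len(y) // n
        f2 = []
        for b in range(n_boxes):
            seg = y[b * n:(b + 1) * n]
            t = np.arange(n)
            coeffs = np.polyfit(t, seg, 1)          # local linear trend per box
            f2.append(np.mean((seg - np.polyval(coeffs, t)) ** 2))
        return np.sqrt(np.mean(f2))

    def alpha(lo, hi):
        ns = np.arange(lo, hi + 1)
        fs = np.array([fluctuation(n) for n in ns])
        slope, _ = np.polyfit(np.log(ns), np.log(fs), 1)   # log-log slope = scale exponent
        return slope

    return alpha(*short), alpha(*long)
```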

#### **3. Results**

Using the time domain, frequency domain, and nonlinear analysis methods above, the HRV signals are analyzed for the performing-task and relaxing states, respectively. First, a t-test is used to analyze the statistical significance of the extracted time domain, frequency domain, and nonlinear features. The features with statistically significant differences are then selected for the classification of mental workload. Furthermore, to exclude the effects of differences in classifier performance, six classification algorithms are used to identify and classify mental workload: Support Vector Machine (SVM), Linear Discriminant Analysis (LDA), K-Nearest Neighbor (KNN), Decision Tree (DT), Gentle Boost (GB), and Naive Bayes (NB). The default parameters of the six classification algorithms are used in this paper. In addition, the HRV signals under different mental workloads are divided into training and testing sets based on 10-fold cross-validation, and the classification of mental workload levels is evaluated by three indicators, defined as follows:

$$\text{Accuracy}: \text{Acc} = \frac{TP + TN}{TP + FP + TN + FN} \times 100\%;$$

$$\text{Sensitivity}: \text{Sen} = \frac{TP}{TP + FN} \times 100\%;$$

$$\text{Specificity}: \text{Spe} = \frac{TN}{FP + TN} \times 100\%.$$

where *TP* (true positives) denotes samples whose predicted and actual values are both positive; *FP* (false positives) denotes samples classified as positive that are actually negative; *FN* (false negatives) denotes samples predicted to be negative whose actual values are positive; and *TN* (true negatives) denotes samples whose predicted and actual values are both negative. In this paper, the performing-task samples are defined as positive samples and the relaxing-state samples are defined as negative samples.
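As an illustration of this evaluation protocol (not the authors' code), the sketch below runs 10-fold cross-validation with scikit-learn and computes Acc, Sen, and Spe from the confusion matrix; `X` and `y` are hypothetical names for the segment feature matrix and labels, and AdaBoost is used as a rough stand-in for Gentle Boost, which scikit-learn does not provide:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB

def evaluate(clf, X, y, n_splits=10):
    """10-fold CV returning mean Acc, Sen, Spe (task state = positive class 1)."""
    accs, sens, spes = [], [], []
    for tr, te in StratifiedKFold(n_splits=n_splits, shuffle=True).split(X, y):
        clf.fit(X[tr], y[tr])
        tn, fp, fn, tp = confusion_matrix(y[te], clf.predict(X[te])).ravel()
        accs.append((tp + tn) / (tp + fp + tn + fn))
        sens.append(tp / (tp + fn))
        spes.append(tn / (fp + tn))
    return np.mean(accs), np.mean(sens), np.mean(spes)

# Default parameters, as in the paper; "GB" is approximated by AdaBoost here.
classifiers = {
    "SVM": SVC(),
    "LDA": LinearDiscriminantAnalysis(),
    "KNN": KNeighborsClassifier(),
    "DT": DecisionTreeClassifier(),
    "GB": AdaBoostClassifier(),
    "NB": GaussianNB(),
}
```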

#### *3.1. Statistical Analysis of Features*

#### 3.1.1. Statistical Difference Analysis of Features from the Same Subject

Using the t-test, the statistical differences of time domain, frequency domain, and nonlinear features are analyzed within the same subject under different states (performing-task state and relaxing state). The sample set of subject1's performing-task state is defined as S1-M and the sample set of subject1's relaxing state as S1-R; the sample sets of subject2's to subject6's different states are defined following the same rule.
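A minimal sketch of this per-feature test is given below; the dictionary layout of the feature sets is a hypothetical convenience, not the authors' data format:

```python
from scipy.stats import ttest_ind

def feature_significance(task_feats, relax_feats):
    """Two-sample t-test per feature between the two workload states.

    task_feats / relax_feats: dicts mapping feature name -> array of segment values.
    Returns the p-value and the significance marker used in Tables 4 and 5.
    """
    results = {}
    for name in task_feats:
        _, p = ttest_ind(task_feats[name], relax_feats[name])
        stars = "***" if p < 0.001 else "**" if p < 0.01 else "*" if p < 0.05 else "n.s."
        results[name] = (p, stars)
    return results
```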

Table 4 shows the statistical differences among the six subjects, each of whom has two mental workload states (performing-task state and relaxing state). In detail, Table 4 shows the statistical differences of the time domain, frequency domain, and nonlinear features. It can be seen from Table 4 that a total of 87 features show the most significant differences (*p* < 0.001) between the two mental workload states.

Among them, subject1 has 20 features with most significant differences (*p* < 0.001), which consist of six time domain features, 10 frequency domain features, and four nonlinear features.

Subject2 has 13 features with most significant differences (*p* < 0.001), which consist of six time domain features, five frequency domain features, and two nonlinear features.

Subject3 has 15 features with most significant differences (*p* < 0.001), which consist of six time domain features, seven frequency domain features, and two nonlinear features.

Subject4 has 16 features with most significant differences (*p* < 0.001), which consist of five time domain features, seven frequency domain features, and four nonlinear features.

Subject5 has nine features with most significant differences (*p* < 0.001), which consist of five time domain features, and four frequency domain features.

Subject6 has 14 features with most significant differences (*p* < 0.001), which consist of five time domain features, five frequency domain features, and four nonlinear features.


**Table 4.** Statistical analysis results of HRV time domain, frequency domain, and nonlinear features.

\*, \*\*, \*\*\* represent *p* < 0.05, *p* < 0.01, *p* < 0.001, respectively.

#### 3.1.2. Statistical Difference Analysis of Features across Different Subjects

Using the t-test, the statistical differences of time domain, frequency domain, and nonlinear features are analyzed across different subjects under different states (performing-task state and relaxing state). The sample set combining subject1's–subject6's performing-task states is defined as the CM group, and the sample set combining subject1's–subject6's relaxing states is defined as the CR group.

Table 5 shows the statistical differences between the two mental workload sample sets for the time domain, frequency domain, and nonlinear features. It can be seen from Table 5 that there are 18 features with the most significant differences (*p* < 0.001) between the CM and CR groups.

**Table 5.** Statistical analysis results of HRV time domain, frequency domain, and nonlinear features.


\*, \*\*, \*\*\* represent *p* < 0.05, *p* < 0.01, *p* < 0.001, respectively.

#### *3.2. Mental Workload Classification Based on the Same Subject*

The classification and identification of mental workload are carried out for each of the six subjects. The features with statistically significant differences are selected for the classification of mental workload. The sample dataset for each experiment is divided into a training set and a testing set. In order to verify the classification performance of the features, a total of six classification algorithms are used in this paper, so six models are trained for each subject; with six experimental subjects, 6 × 6 = 36 models are trained in total. The average value of the 10-fold cross-validation is used as the final experimental result. In order to ensure the reliability of the experimental results, the 10-fold cross-validation is repeated 100 times.
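For the repeated cross-validation described above, scikit-learn's `RepeatedStratifiedKFold` offers a convenient shortcut; the feature matrix below is random placeholder data (the 3 min windows with a 10 s step imply roughly 43 segments per state per subject), not real measurements:

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data standing in for one subject's selected HRV features.
rng = np.random.default_rng(0)
X = rng.normal(size=(86, 12))   # ~43 segments per state x 2 states, 12 features (assumed shape)
y = np.repeat([0, 1], 43)       # 0 = relaxing state, 1 = performing-task state

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=100)
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=cv, scoring="accuracy")
print(f"Mean Acc over 100 repetitions of 10-fold CV: {scores.mean():.4f}")
```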

Figure 6 and Table 6 show the classification results for each subject using different classifiers. As can be seen from Figure 6a and Table 6, SVM, KNN, and GB show better classification results for subject1; the KNN classification algorithm shows the highest Spe, Sen, and Acc of 99.26%, 98.86%, and 98.91%, respectively. From Figure 6b and Table 6, SVM, KNN, GB, NB, and DT show better classification results for subject2, and the KNN classification algorithm shows the highest Spe, Sen, and Acc of 99.99%, 98.94%, and 99.95%, respectively. From Figure 6c and Table 6, for subject3, LDA shows the worst classification performance and the KNN classification algorithm shows the highest Spe, Sen, and Acc of 99.15%, 99.07%, and 98.84%, respectively. As Figure 6d and Table 6 demonstrate, SVM, KNN, GB, and DT show better classification results for subject4; the SVM algorithm shows the highest Spe (98.43%) and the KNN algorithm shows the highest Sen and Acc of 97.61% and 96.45%, respectively. As can be seen from Figure 6e and Table 6, for subject5, all five classification algorithms except LDA perform well, with the SVM algorithm showing the best Spe, Sen, and Acc of 99.97%, 99.99%, and 99.97%, respectively. Finally, Figure 6f and Table 6 show that, for subject6, the KNN classification algorithm shows the highest Spe, Sen, and Acc of 98.61%, 99.34%, and 98.64%, respectively.

**Figure 6.** The classification results of each subject under different classifiers.


**Table 6.** The classification results of each subject under different classifiers.

Finally, the Spe, Sen, and Acc of the six subjects under the different classifiers are presented as box plots (Figure 7). Box plots show not only the average values but also the distribution of the computed values, with abnormal values marked by red points. As can be seen from the figure, when using the KNN classifier, all six subjects exhibit the highest Spe, Sen, and Acc with the least overall dispersion, although outliers appear in Spe and Acc. When using the SVM classifier, the six subjects also achieve high Spe, Sen, and Acc, and the data are less dispersed. Compared with the KNN and SVM classifiers, the GB classifier shows a larger degree of dispersion, but its classification results are stable. The classification performance of the DT classifier is slightly worse than that of GB. The classification results of the LDA and NB classifiers are the least satisfactory: the Spe, Sen, and Acc of LDA are lower, while the Spe, Sen, and Acc of the NB classifier are the most dispersed.

**Figure 7.** The box plots of Spe, Sen, Acc for six subjects.

#### *3.3. Mental Workload Classification across Subjects*

In this sub-section, the performance of cross-subject mental workload classification is analyzed. The features with statistically significant differences are selected for the classification of mental workload. Samples of five subjects are used as the training set, and samples of the left-out subject, who is not involved in training, are used as the testing set. Since there are six subjects, the validation process is performed six times.
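This leave-one-subject-out protocol maps directly onto scikit-learn's `LeaveOneGroupOut`; in the sketch below, `groups` is a hypothetical array holding the subject ID of each segment, and the classifier is interchangeable:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.neighbors import KNeighborsClassifier

def cross_subject_accuracies(clf, X, y, groups):
    """Train on five subjects, test on the sixth; one accuracy per left-out subject."""
    accs = []
    for tr, te in LeaveOneGroupOut().split(X, y, groups):
        clf.fit(X[tr], y[tr])
        accs.append(np.mean(clf.predict(X[te]) == y[te]))
    return accs
```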

Figure 8 and Table 7 show the cross-subject classification results using different classifiers. As can be seen from Figure 8 and Table 7, for subject1, the KNN classification algorithm shows the highest Sen (100%), and the GB method shows the highest Spe (100%) and Acc (91.18%). For subject2, the SVM and GB methods show the highest Spe (100%); at the same time, the GB method also shows the highest Sen (100%) and Acc (100%). For subject3, the LDA classification algorithm shows the best classification performance, with Spe, Sen, and Acc of 78.43%, 100.00%, and 89.22%, respectively. For subject4, SVM shows the highest Sen (98.43%), the KNN method shows the highest Acc (95.1%), and the NB method shows the highest Spe (100%). For subject5, SVM shows the highest Acc (81.76%), GB and DT show the highest Spe (100%), and the NB method shows the highest Sen (100%). For subject6, both the SVM and KNN methods show the highest Spe (84.31%); SVM shows the best Acc (91.18%) and the NB method shows the highest Sen (100%).

**Figure 8.** The cross-subject classification results under different classifiers.


**Table 7.** The classification results of cross-subject under different classifiers.

Finally, the results of cross-subject classification under the different classifiers are presented as box plots (Figure 9). As can be seen from the figure, high maximum values of Spe, Sen, and Acc are reached regardless of the classifier used. However, the figure also shows a more dispersed distribution, with the red points representing abnormal values, and the difference between the maximum and minimum values is large. In addition, for each subject, there is a classifier that achieves better classification results.

**Figure 9.** The box plots of Spe, Sen, and Acc for cross-subject classification.

#### **4. Discussion**

To the best of our knowledge, this is the first work to measure the operator's mental workload in the human and dual-arm robot interaction process based on a wearable exoskeleton controller. At present, many studies on mental workload are aimed at the n-back paradigm, simulated driving scenarios, and so on. In the process of interaction between the human and the dual-arm robot in this paper, the operator uses the wearable controller, and the two arms of the dual-arm robot imitate the arms of a human. This master-slave control mode aims to reduce the operator's burden in the process of human and dual-arm robot interaction as much as possible. In addition, this control mode also excludes differences in the operators' limb coordination ability, which helps the operator focus on the task. No study of mental workload in the process of human and dual-arm robot interaction has been found, and there are no corresponding public datasets. Thus, in this paper, ECG signal data are collected, and the HRV used for analysis is extracted from the ECG signals.

Studies have shown that a stress response occurs [29] when the mental workload of the human increases. First, the sympathetic nervous system will be activated. Then the entire nervous system will respond to the increase of mental workload and improve human alertness. Furthermore, blood is transferred from the internal organs and skin to the skeletal muscles. Then the heart rate and heart contraction increase rapidly. These changes allow the body to accumulate large amounts of energy in a short period of time to prepare for external threats. Furthermore, the HRV signal contains information about the regulation of the cardiovascular system by body fluid factors, which can reflect fluctuations of the autonomic nervous system. Therefore, it is feasible to use the HRV signal for mental workload analysis.

More specifically, existing studies show that the aTotal feature reflects the overall activity of the autonomic nervous system. LF-related features are thought to be associated with sympathetic activity, whereas HF-related features are thought to be correlated with parasympathetic activity. The physiological significance of the VLF-related features has been identified with long-period rhythms. The relationship between the LF and HF components (LF/HF) is an important indicator of the sympathetic and parasympathetic balance in the body [30,31]. The SDNN index and the HRVTi feature are believed to primarily measure autonomic influence on HRV [32]. Both RMSSD and PNN50 reflect parasympathetic (vagal) activity. Nonlinear features represent the fluctuation characteristics of the autonomic nervous system [33].

In this paper, among the time domain, frequency domain, and nonlinear features compared between the two mental workload states, whether within the same subject or across different subjects, most features show statistically significant differences. Only a few individual features show no statistical difference, which may be due to personalized differences between subjects; this does not affect the classification of the two mental workload states. First, this paper analyzes the different mental workload states of the same subject. The results show that, for subject1–subject6, the highest Acc values are 98.91% (KNN), 99.95% (KNN), 98.84% (KNN), 96.45% (KNN), 99.97% (SVM), and 98.64% (KNN), respectively. The KNN classifier has the highest average recognition accuracy (98.77%) when the same classifier is used to identify the six subjects separately. The SVM and GB classifiers also show good classification, with Acc of 97.54% and 95.90%, respectively. None of the remaining three classifiers (LDA, NB, DT) reach a classification accuracy of more than 90%. Therefore, the KNN algorithm is more suitable for human and dual-arm robot interaction when the model is trained on sample data of the same subject and used to classify the mental workload of that subject. Then, the different mental workload states are classified across subjects. The results show that, for subject1–subject6, the highest Acc values are 91.18% (GB), 100% (GB), 89.22% (LDA), 95.10% (KNN), 81.76% (SVM), and 91.18% (SVM). Thus, when the best classifier is selected for each subject, the average classification accuracy over the six subjects is 91.41%. When the same classifier is used for all six subjects, the highest average cross-subject accuracy is 80.56% (GB). SVM and KNN also show good classification results, with classification accuracies of 78.51% and 73.53%, respectively. When identifying across subjects, each subject has a classifier that classifies it better. Therefore, in the future, the use of multiple classifiers with a voting method to select the best classifier's results should be considered.

The analysis of mental workload is related to specific tasks, and the study of mental workload in the process of master-slave interaction between a wearable controller and a dual-arm robot has not been reported. Therefore, this paper compares studies related to mental workload or stress in other scenarios. In [34], a pilot study is conducted on whether machine learning can predict stress decrease after relaxation on the basis of a wearable sensor; the states before and after relaxation are classified using the ECG and GSR signals with 79.2% classification accuracy. In [35], the detection of drivers' anxiety based on physiological signals is studied; the results show that classification on the basis of EEG alone gives the best accuracy, 77.01%. In [36], cross-subject mental workload classification is studied on the basis of kernel spectral regression and transfer learning techniques; an average Acc of 72.66% is obtained for six subjects, with per-subject Acc of 73.15%, 77.32%, 78.63%, 65.40%, 71.08%, and 70.36%, respectively. In [37], the mental workload of human and robot collaboration is analyzed using wearable sensors; however, only a statistical analysis of HRV signals in different mental workload states is performed, without classification and identification. In this paper, data of two different mental workload states are collected and 20 kinds of HRV features are extracted. Then, the statistical significance of the HRV features in the different states is analyzed, and the features with statistically significant differences (*p* < 0.05) are selected for the identification and analysis of mental workload. Models trained with the same subject's data and models trained across different subjects all obtain higher Acc compared with [34–37].

In addition, in this paper, the heart beat data collection device is a custom one. Its functionality can be modified based on demand. Furthermore, it is cheap. However, with the rapid development of consumer electronics devices, most of the existing smart watches have heart beat monitoring capabilities. This will be more conducive to long-term detection. Thus, in the future, smart watches will be considered as the heart beat data collection device for research.

#### **5. Conclusions**

A human remote-controlled robot performs complex or dangerous tasks in unstructured environments, which expands the scope of human work. In the process of completing these tasks, the mental workload of the operator changes with the different tasks of the robot. However, excessive mental workload will not only affect the robot's working efficiency and safety, but also impact the operator's physical and mental health. In order to assess the mental workload during human interaction with a dual-arm robot, HRV is the measure studied in this paper. First, the ECG signals of two mental workload states (performing-task state and relaxing state) are collected from six subjects using a custom device, and the HRV signal is obtained from the ECG signal. Then, 20 kinds of HRV features (time domain, frequency domain, and nonlinear features) are extracted. Finally, six different classifiers are used for mental workload classification. The results show that, first, when each subject's own HRV signals are used to train the model, the subject's mental workload can be classified with an average classification accuracy of 98.77% using the KNN method. Then, when the HRV signals of five subjects are used for training and the remaining subject is used for testing, the GB method obtains the highest average classification accuracy, with the average over the six subjects being 80.56%. This study has demonstrated that HRV can be used to measure the mental workload during human interaction with a dual-arm robot.

**Author Contributions:** Conceptualization: S.S. and T.W.; methodology: S.S., T.W., C.S. and C.Y.; software: S.S.; formal analysis: S.S. and C.S.; investigation: S.S., Y.W. and Y.S.; data curation: S.S., Y.W. and Y.S.; writing—original draft preparation: S.S. and C.S.; writing—review and editing: S.S. and C.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research is funded by the National Natural Science Foundation of China (grant number U20A20201), the Doctoral Scientific Research Foundation of Liaoning Province (grant number 2020-BS-025), the LiaoNing Revitalization Talents Program (grant number XLYC1807018), and the National key research and development program of China (grant number 2016YFE0206200).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Pediatric Speech Audiometry Web Application for Hearing Detection in the Home Environment**

**Stanislav Ondáš 1,\* , Eva Kiktová 2 , Matúš Pleva <sup>1</sup> , Mária Oravcová 2 , Lukáš Hudák 1 , Jozef Juhár <sup>1</sup> and Július Zimmermann <sup>2</sup>**


Received: 8 May 2020; Accepted: 11 June 2020; Published: 13 June 2020

**Abstract:** This paper describes the development of the speech audiometry application for pediatric patients in Slovak language and experiences obtained during testing with healthy children, hearing-impaired children, and elderly persons. The first motivation behind the presented work was to reduce the stress and fear of the children, who must undergo postoperative audiometry, but over time, we changed our direction to the simple game-like mobile application for the detection of possible hearing problems of children in the home environment. Conditioned play audiometry principles were adopted to create a speech audiometry application, where children help the virtual robot Thomas assign words to pictures; this can be described as a speech recognition test. Several game scenarios together with the setting condition issues were created, tested, and discussed. First experiences show a positive influence on the children's mood and motivation.

**Keywords:** pediatric speech audiometry; hearing tests; conditioned play audiometry; human–computer interaction

#### **1. Introduction**

Audiometry is generally aimed at measuring the perception of the audio signal by the human auditory system. Audiometric tests can be divided into two main categories: pure tone audiometry and speech audiometry. Speech audiometry is a standard part of the audiological test battery. It is usually done after the pure tone audiometry and helps the audiologist answer questions regarding a patient's ability to be involved in speech communication. In other words, speech audiometry enables testing of the speech processing abilities at different levels within the auditory system [1].

Speech audiometry contains several types of speech tests which focus on different aspects of speech perception and processing. Tests can measure the patient's most comfortable and uncomfortable listening levels, their range of comfortable listening, and their ability to recognize and discriminate speech sounds. A typical setting for the speech audiometry is similar to the pure tone audiometry. Usually, a two-room setting is needed. A two-channel audiometer can be used to present the stimulus through a microphone (monitored live voice) or through an external device, in case of recorded speech. The patient's role in speech audiometry is to react to the provided stimuli. They can repeat proposed words, write down their response or point to the picture.

Speech audiometry tests can be divided into two main categories: threshold level testing and suprathreshold testing [2]. In the case of threshold level testing, audiologists try to find the lowest level of speech where patients can detect and recognize the speech stimulus. The speech detection threshold (SDT) can be seen as the basic parameter. SDT is the lowest level of speech that a patient is able to detect for at least 50% of the time. The next important measure focuses on the speech recognition threshold (SRT), which is the lowest level that a person recognizes and repeats speech back to the audiologist.

The next group of speech audiometry tests falls under suprathreshold tests. After finding the SDT and SRT, in suprathreshold testing the speech material is presented to the patient at a normal conversation level, and we try to assess the person's speech recognition and understanding ability. There are several suprathreshold speech tests that can be done. One is the most comfortable loudness (MCL) test, which tries to identify the most comfortable speech level for the patient. Another important measure is the uncomfortable loudness level (UCL). The UCL represents the maximum level at which word recognition testing can be performed and, together with the SDT, enables the dynamic range for speech to be determined.

The next important speech audiometry tests are word recognition tests or speech recognition tests. Their purpose is to determine the person's ability to understand and repeat words presented at a conversation level. Speech recognition testing is performed with a specific set or list of words known as phonetically balanced words. These lists consist of commonly occurring words in their normal proportion in everyday speech. There exist a few standardized lists. The most known are the PB-50 (phonetically balanced 50 words), CID (Central Institute of Deaf) W-22 list or Northwestern University (NU-6) list. Words from these lists can be presented by a live voice, but the usage of a prerecorded voice is a better choice. Another advantage of pre-prepared sounds is that they allow identical measurements to be made repeatedly, thus they are preferred in audiometry. Typically, 25 or 50 words are presented to the patient and they are instructed to repeat them or point to the correct picture. Words are usually presented on a fixed level (sound pressure level (SPL)) at about 30 dB to 40 dB above the patient's SDT. The result is expressed in the percentage of words they get correct. The person without any hearing impairment should have a word recognition score (WRS) of 90–100%.

The current research in the pediatric audiometry domain is mainly focused on phonetically balanced sentence/word lists in different languages, such as Greek [2], German (German Oldenburg Sentence Test for Children [3] or Mainz speech test for children 3–7 years old [4]), Thai [5] or Chinese [6]. However, these word lists are not combined with pictures and audio samples to build an automated speech application for the home environment. On the other hand, a very interesting THear framework for mobile audiometry [6] is designed mainly for adults, where written text and Chinese traditional character pictures are used to choose the word heard. This is not suitable for pediatric patients.

When we tried to review the latest pediatric audiometry applications in Slavic languages, as there is no Slovak word list to our knowledge, we found the following: Polish hearing screening of school children is tone-based [7]; Serbian speech audiometry authors published a good speech audiometry word re-evaluation lately [8] but it is not oriented to the pediatric domain and has no picture set associated; Czech preschool children testing [9] was based on a whispered voice performed by pediatricians; Russian speech audiometry materials and SRT tests recorded high-quality audio samples [10] but are still not suitable for child audiometry as they do not have a picture set associated; Ukrainian phonetic tables [11] are not designed for speech audiometry and no word list for pediatric audiometry was found; for Bulgarian language, we found only newborn screening program results [12] with no Bulgarian phonetically-balanced word list available.

In the present paper we focus our attention on pediatric speech audiometry, which has its own specifics. Children have a limited, specific vocabulary, which must be considered. A closed set of words, instead of an open set, is preferred. This means that the patient can select the correct result from a set of options, which can be done using picture cards, where a child can point to the picture related to what he or she heard. For children of kindergarten age, phonetically balanced kindergarten word lists exist (e.g., [13]).

During the audiometry testing of pediatric patients, we need to consider their abilities and limitations in the language area, which relate to their age. The above-mentioned audiometry tests need to be modified to suit pediatric patients. Pediatric audiometry is performed in cases of hearing loss and deafness of children at prelingual and post-lingual ages, in the context of postoperative, medication, and compensatory therapy. A high level of stress and distrust of a pediatric patient towards the therapist and the therapy is a commonly observed issue that arises during the therapy, which results in a situation where checking the sound perception during an interview with the child becomes ineffective or even impossible. The level of stress in a child patient, low motivation, and low involvement in therapy can be positively influenced using game-like approaches, smart technologies, or by involving robotic systems. Generally, studies indicate a positive effect of robots on therapy (see [14–17]).

Therapists often report problems with children concerning motivation and involvement during audiometry after the implantation of a cochlear implant [18]; this became the motivation behind the current study. The initial idea lay in the use of a robot during the audiometry process, in order to positively influence the therapy by decreasing the stress and distrust of the pediatric patient towards the therapist and therapy. Our goal was to prepare a research platform that would enable us to study aspects of virtual robot-supported audiometry. After collecting first experiences, we extended our focus to involve other smart devices, such as smartphones and tablets, in speech audiometry. Our attention was focused on the development of a simple game-like mobile application for the detection of possible hearing loss problems in children in home conditions. Conditioned play audiometry (CPA) principles were adopted to create speech audiometry applications, where children help a virtual robot known as Thomas to assign words to pictures. We previously described a similar web-based audiology application [19] in which a telemetry application is presented with remote audiology measuring devices. The advantage of our proposed system is that it does not depend on special devices, and the purpose of home environment testing is different. We mainly propose a simple child-acceptable application that is easy to play with and can provide short everyday testing of the current state of hearing with cochlear implants or Otitis media (middle ear inflammation).

Our initial aim to support child audiometry using robots was extended to use modern technologies for the above-mentioned purpose. One of the main reasons for addressing this problem was the experience of therapists with post-operative therapy after cochlear implant implantation, where they described problems with fear and low motivation of children. Another reason was that no commonly accepted pre-recorded speech stimuli for kindergarten children existed, nor any tool for the diagnostics of children with hearing problems in the Slovak language, particularly in the home environment. Therefore, we started to develop the audiometry application, which can serve both therapists and parents and can be easily used in the home environment. Testing the application brought many ideas and findings, which were used to further improve the application.

Our work consists of several tasks. The first task was to select a child audiometry method well suited to pediatric speech audiometry and the desired application. The second task was to define the scenario of the audiometry tests and prepare resources in the Slovak language. The next tasks were focused on the design and development of the research platform for robot-assisted child audiometry and the speech audiometry application for smart devices.

The paper is organized as follows: Section 2 describes the design and development of the pediatric speech audiometry application including speech stimuli preparation. Section 3 presents details about the performed experiments, their results, and collected observations.

#### **2. Development of Speech Audiometry Application for Pediatric Patients**

#### *2.1. Speech Stimuli*

Speech stimuli are the most fundamental part of speech audiometry testing. To test speech perception and processing a therapist provides speech stimuli to the patient. Speech stimuli can be provided by a live voice or by a pre-recorded voice (recordings). Both live and pre-recorded voices have their own advantages and disadvantages. In the case of a live voice, the rapport between a patient

and therapist can be established more easily. On the other hand, such a measurement is difficult to repeat under the same conditions. In the case of a pre-recorded voice, the measurement can easily be repeated in the future.


Speech stimuli need to cover the phonetic inventory of the language in order to assess the speech understanding ability of the patient. Moreover, in the case of pediatric audiometry, they need to take into account the significant differences in speech and language ability in comparison with adult patients.

Speech audiometry can be performed with very young children (from approximately two and a half years), and pediatric audiometry methods are replaced by adult audiometry for ten- to twelve-year-old patients. Pediatric audiometry methods can also be used for adult patients with specific kinds of mental disabilities or for elderly patients. The child's mental processes up to two years of age depend on his or her experiences: what they see, hear, and touch. This period is based on sensorimotor thinking and the development of practical intelligence [20,21]. The vocabulary of a child of this age is very limited: a two-year-old child can actively use around 200–300 words, although his or her understanding capacity is larger [22].

Speech stimuli often form word lists, which contain words that are pronounced by the therapist or played through an audiometer or CD player. Several word lists exist that were developed for other languages. The best-known word list for children's speech audiometry is the Phonetically Balanced Kindergarten list (PBK) defined by Haskins in 1949 [13]. It consists of 50 phonetically balanced word items, which were selected from the spoken vocabulary of normal-hearing kindergarten children [1]. There are also other lists, such as the Isophonemic Word Lists designed by Boothroyd in 1968 [23] or the Northwestern University Children's Perception of Speech (NU-CHIPS) list, which consists of 50 words with pictures [24].

To perform children's speech audiometry for Slovak children, Dr. Hapčo and Dr. Bargár designed a Slovak set of 80 words; this set is well suited to older (school-age) children. Another set also exists, which is used by audiologists during behavioral audiometry, but it is not standardized or publicly available. Behavioral audiometry is usually performed with very young pediatric patients (from six months old) and is very interactive and subjective.

Due to the lack of an appropriate word list for Slovak kindergarten children with hearing disabilities, we decided to develop a new list well suited to children two years old and older, although predominantly kindergarten children. The newly designed unique Slovak kindergarten word list (SKWL) for child audiometry consists of 50 word items, related pictures, and audio files with recorded speech stimuli. The acquaintance criterion was the most important in the process of word selection. Words were separated into the following groups: transport/vehicles (5), colors (4), animals (10), toys/things (8), human body (5), food (5), and combinations (13). More information about the SKWL word list can be found elsewhere [25]. The group of animals is the biggest group (10) because animals are usually among the first words in a child's vocabulary [26]. In early childhood, children imitate animal sounds, play with animal toys, and are happy to watch them; thus, this category is representative. The human body category contains only one- and two-syllable words, so we consider it to be the least demanding for perception. Conversely, the food category contains one-, two-, three-, and four-syllable words with the Slovak phonemes *č*, *dž*, *ĺ*, *ch*, which are taught in the second part of the primer [27]. For these reasons, we consider this category to be the most challenging. A specific category is the combination of words and phrases. We graded the terms as two-word, three-word, and sentence items, so that the audiometric measurement can gradually distinguish what the patient hears and understands and what is already too difficult for him or her.

There are several words in the database, which can serve as a distinctive element for stratifying the patient's audio capabilities [28], for example:



A total of 23 words have their diminutive equivalent (e.g., *krava*-*kravička*; *pes*-*psík*-*havo*), so that we can get as close as possible to the child's speech in each household, where the same subject may be named differently.


Due to our focus on conditioned play audiometry and behavioral audiometry, we decided to prepare a picture card for each word in the Slovak kindergarten word list. These pictures were carefully drawn by the artist for this research to be child-friendly and suitable for pediatric patients. Five types of picture tests from our speech recognition test are depicted in Figures 1–5. Each of them focuses on a specific task connected with hearing capability.

**Figure 1.** Different category: Played sound: *červené auto* (red car), same color, different pictures.

**Figure 2.** Category: Played sound: *mačka* a *pes* (cat and dog), two words.

**Figure 3.** Phonic differentiation: Played sound—*vlak*/*vláčik* (train), similarly sounding words in Slovak (*vlak*, *vták*), different category.

**Figure 4.** Difficulty of phone: Played sound—*lietadlo*/*lietadielko* (airplane), 4-syllable sound, [ď]; -dl-, same category.


**Figure 5.** Combinations: Played sound—*bábika spí* (doll is sleeping), three pictures, high similarity.

A picture identification speech recognition test is suitable for children up to 10 years old. The pediatric patient is asked to mark the appropriate image representation based on the heard sound stimulus. In this case, it is a closed-set test.

#### *2.2. Previous Experiences with HRI Audiometry*

As mentioned in the abstract, the first motivation behind the presented work was to improve the user acceptance level and user experience and to reduce the stress and fear of children who must undergo postoperative audiometry; this motivation was induced by the experience of therapists. Our first idea was to involve real robots in the speech audiometry process, which is repeatedly performed after cochlear implant surgery. Previously, we started to design and develop a small application with a humanoid robot, where the robot prompts a child to help it match pictures on the table with sounds. This robot-assisted speech audiometry ran on the VoMIS system (see [29]) and is described in detail elsewhere [25]. During the experiments with the robot in this role, we collected new ideas and experiences. One of the key findings was that healthy children liked to interact with the robot. The next experiments also revealed several drawbacks:


In other words, the idea of using robots looks very nice, but the experience obtained showed that such a system was not usable or helpful for the therapy. Therefore, we turned to something more usable, simple, and helpful. We focused our attention on hearing detection in the home environment without any humanoids needed. We developed the idea of preparing a simple game-like mobile application, which can be easily used by parents when they have doubts about the speech and sound perception of their child. Because users will not be able to set accurate acoustic conditions at home, the application focuses on suprathreshold speech tests instead of threshold level testing. The designed application falls into the category of closed-set word recognition audiometry tests.

#### *2.3. Web-Based Application for Children Speech Audiometry*

Conditioned play audiometry principles were adopted to create a speech audiometry application where children help robot Thomas to assign words (sounds) to pictures, which can be regarded as a kind of speech recognition. The selected test is a part of behavioral audiometry.

The designed application was prepared as a web application, which enables it to run on any device with an Internet connection and a web browser, without any other requirements. The application can be used for free field speech audiometry and with a headset. The design is currently simple HTML code with a short task description and pictures to choose from, based on the heard voice command. The pictures were designed and completely sketched by one of the co-authors, and together with the SKWL word list and audio recordings they are a unique and significant contribution to the Slovak audiology clinicians' community.

The application is organized into levels. In each level, five screens with a set of pictures are presented to the patient with randomly generated speech stimuli. After performing all levels, the word recognition score is computed. Speech stimuli are presented at the supposed most comfortable loudness level, which is around 50 dB. The application offers an introduction, which helps the therapist or parent to set the required acoustic conditions (MCL). Cold running speech is used to set the MCL: the parent/therapist, together with the patient, sets the MCL by adjusting the volume while listening to a short story.

The first screens of the application contain the story description, basic settings page, and entry form. The story (Figure 6, screen 2) about robot Thomas was designed to motivate a child to undergo the audiometry. A child patient is invited to help robot Thomas to organize his collection of pictures and sounds, which is broken. We tried to engage emotions by placing a GIF animation of the sad robot on the screen. The next screen offers the basic setting instructions, which help parents/therapists to set the MCL. The last initial screen is an entry form, where the user fills in his/her name (or nickname), gender, and age. He/she can also provide information about the sound level which was set for the experiment (in the case when audiometry is done at the most comfortable loudness level). Then, the application is ready to start the game.

**Figure 6.** Initial screens of the speech audiometry application.

The game is divided into several levels. In each level, a set of pictures is provided with an appropriate randomly generated audio file with speech stimuli. On each screen with pictures, a word is played which belongs to one of the pictures (see Figure 7). The task of the patient is to select the correct picture. After each level, the overall score is calculated, but it stays hidden from the patient. We decided to hide the partial score because initial testing showed that children become demotivated in the case of a bad score.

**Figure 7.** One level of the audiometry game.
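To make the scoring logic above concrete, the following Python sketch illustrates one possible implementation of the level flow and of the word recognition score (WRS). It is a minimal illustration only: the toy word set, the number of levels, and all function names are assumptions, not the actual application code.

```python
import random

# Hypothetical per-level layout: five screens, each with a small set of pictures.
SCREENS_PER_LEVEL = 5
LEVELS = 3

def play_level(screens, ask_user):
    """Play one level: for each screen, pick a random target word, play its
    recording, and record whether the child tapped the matching picture."""
    correct = 0
    for pictures in screens:
        target = random.choice(pictures)      # randomly generated speech stimulus
        chosen = ask_user(pictures, target)   # child selects a picture
        correct += int(chosen == target)
    return correct, len(screens)

def word_recognition_score(results):
    """WRS = correctly recognized items / all presented items, in percent."""
    correct = sum(c for c, _ in results)
    total = sum(t for _, t in results)
    return 100.0 * correct / total

# Example run with a toy word set and a simulated child that always answers correctly.
screens = [["auto", "vlak", "pes"]] * SCREENS_PER_LEVEL
results = [play_level(screens, lambda pics, target: target) for _ in range(LEVELS)]
print(f"WRS = {word_recognition_score(results):.0f}%")  # score reported only after all levels
```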

Two versions of the game were developed: the first for free field audiometry and the second for testing each ear separately. Although free field audiometry is more comfortable and less stressful for pediatric patients, it brings less precise results. If possible, better results can be obtained using a headset, where each ear can be tested separately. A cross-hearing problem may prevent proper diagnosis [30,31]; therefore, a special version of the speech recordings was prepared for right and left ear testing. During testing of one ear, masking noise is played into the second ear at a level 10 dB below the speech stimuli.
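As a rough sketch of how such ear-specific recordings could be produced (an assumption for illustration, not the authors' actual tooling), the speech stimulus can be routed to the tested ear while the masking noise, scaled 10 dB lower, is routed to the other ear:

```python
import numpy as np

def make_ear_specific_stereo(speech, noise, test_ear="right", masking_offset_db=10.0):
    """Build a stereo signal: speech stimulus in the tested ear, masking noise in
    the untested ear attenuated by `masking_offset_db` dB relative to the speech.
    Assumes `noise` is at least as long as `speech`."""
    noise = noise[:len(speech)]
    # Scale the noise so its RMS sits masking_offset_db below the speech RMS.
    speech_rms = float(np.sqrt(np.mean(speech ** 2)))
    noise_rms = float(np.sqrt(np.mean(noise ** 2))) or 1e-12
    noise = noise * (speech_rms * 10 ** (-masking_offset_db / 20.0) / noise_rms)
    left, right = (noise, speech) if test_ear == "right" else (speech, noise)
    return np.stack([left, right], axis=1)

# Toy example: 1 s of a 440 Hz tone as "speech" and white noise as the masker.
sr = 16000
t = np.arange(sr) / sr
speech = 0.1 * np.sin(2 * np.pi * 440 * t)
noise = 0.1 * np.random.randn(sr)
print(make_ear_specific_stereo(speech, noise, test_ear="right").shape)  # (16000, 2)
```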

#### **3. Experiments and Results**

Several game scenarios together with the setting condition issues were created, tested, and discussed. First experiences show a positive influence on the children's mood and motivation.

Eleven child participants were involved in the interaction with the audiometry application. Nine were healthy children, mainly around kindergarten age. The two remaining testing subjects were a 4-year-old boy and a 16-year-old girl with hearing impairment. The last participant was an elderly patient (a 72-year-old woman with a hearing aid in the right ear and hearing problems in both ears). The total number of test participants was 12.

All tests were performed in the home environment in a relatively quiet place. One of the parents played the role of a therapist. He set the sound pressure level (SPL) and read motivation stories and instructions to the child. Before testing, the child was not affected by any louder sound. Each test consists of three game levels. The role of the child is to pick up the correct picture from the provided set of pictures according to provided speech stimuli in the form of prerecorded words from the Slovak kindergarten word list. The audiometry application ran on mobile devices (Samsung Galaxy A70 and Xiaomi Redmi 4X) and tablets (Huawei MediaPad M5 lite). In the beginning, the application requires adjusting the SPL volume before playing the game.

The first version of our audiometry application used another mechanism to set the SPL, and each level was played with a different SPL. To set the desired sound level, the therapist or parent needed to use a second device (smartphone) with a sound meter application to measure and set the correct SPL. Achieving stable acoustic conditions was very difficult. Additional problems were identified during the experiments:

• variable position between the child and the sound source/smartphone (the child tends to be as close as possible to the sound source);

• a movement or even the presence of another person causes a disturbance;

• it is impossible to test the left and right ear separately, and cross hearing occurs.


Therefore, in the second version of the application, we decided to perform testing at the most comfortable loudness level, which can be easily set at the beginning. Following the analyzed literature on speech audiometry, we abandoned strict adherence to acoustic conditions, because some background noise can lead to more realistic audiometry results that more closely reflect situations in the real environment.

Each game level has a different difficulty and allows us to test various aspects of the cognitive ability of patients. Tests can evaluate several distinctive levels of hearing and subsequent understanding (e.g., it includes phonetic similarity of words, the visual similarity of presented pictures, the same word base, different word length, etc.). All mentioned aspects focus on a specific task connected with the hearing capability and each of them can influence the perception results.

#### *3.1. Experiments with Healthy Children*

In these experiments, we considered as healthy children those who were not clinically diagnosed with any hearing problems before. From the testing of healthy children, two main observations were collected:


When we decreased the SPL to approximately 30 dB, the word recognition score decreased to 65% for children #1 and #2, which is still higher than the threshold score for healthy patients (WRS = 50%) [1].

According to the obtained observations, we decided to remove the backchannel after each picture set. Instead of a negative backchannel, the application provided a positive backchannel after each level in every situation.

The test routine in the second version of the application consisted of setting the MCL and SRT volume levels and selecting the test method (via the loudspeaker, so-called free field, or via headphones for the right and left ear). Both volume levels were adjusted subjectively by the parent in cooperation with the child. The precondition for such a setting is that the parent has no hearing impairment. The MCL level is set correctly if the sound stimuli are well audible (neither too loud nor too quiet). The child completes the test, and based on the final score the parent obtains information about the child's hearing abilities; in cases where the parent performed the test too, he/she can compare the achieved results. The minimum audible level (SRT) was again set by the parent. He/she continuously increased the volume of the presented sounds from the zero level while observing the child's reactions and ability to repeat the presented sound. This setting can be simplified by the fact that the SRT is usually the lowest level that can be heard through the device used (computer, mobile phone, tablet). Similarly, if a parent completed the test, he/she could compare the obtained results with his/her child's results to get an idea of his/her hearing abilities.

#### *3.2. Experiment with Hearing-Impaired Child*

The third testing subject was a 4-year-old boy with hearing impairment. He interacted with the audiometry application several times, both in the free field scenario and with headphones (see Figure 8). The first interactions were made at the MCL level interactively set in cooperation between the child and his parent. In these tests, all speech stimuli were recognized correctly and a WRS equal to 100% was achieved. Then we decided to change the sound pressure level in the range from 70 dB to the lowest possible level, which can be reached by the device (Samsung Galaxy A70). This level was around 35 dB. Recognition problems started to occur at such a low level, and the word recognition score declined below 50%. According to our observations, the incorrectly recognized words were those from the group of short words and phonetically similar words.

Since hearing problems were suspected, we decided to continue in the audiometry testing with headphones (Marshall Major III Bluetooth closed headphones). The same game scenario was performed. The tested subject was able to perform individual levels of the game without errors for SPL from 70 dB to 30 dB. All presented recordings were in mono mode, and although the sound was present on one side (for one ear), both ears participated in the process of perception of the sound stimulus via vibrations through the bone conduction.

**Figure 8.** Hearing-impaired 4-year-old boy interacting with the audiometry application.

The last part of the experiment with the hearing-impaired boy was performed with in-ear headphones, which enabled us to partially reduce sound stimulation of the healthy ear via vibrations through the bone conduction. Speech stimuli in this scenario were provided only into the tested ear without precise masking of the untested ear.

The results for the left ear were very good; we obtained a WRS higher than 90%. A completely different situation occurred in the case of the right ear, where the word recognition score was very low even for a higher sound pressure level (above 50 dB the score was only 30%). When we decreased the sound pressure level below 50 dB, he became angry and demotivated, did not want to continue, and demanded that the volume be increased.

For reliable evaluation of hearing in each ear separately, it is necessary to mask the untested ear with noise. Therefore, later, we performed testing where the healthy ear was masked by cocktail-party noise. This noise pressure level was set to 10 dB below the speech stimuli provided into the tested ear.

During testing of the hearing-impaired child, we also focused our attention on observing the mood and motivation of the patient. The result was that during the audiometric game, the child was very motivated and really enjoyed the game. Some disappointment was observed when the child was unable to hear and correctly label multiple consecutive test sound items. The volume of the presented sound stimulus at which a child starts to become disappointed by failures in the game is close to his or her speech detection threshold (SDT).

Table 1 contains results from all tested children who participated in our research. Most of the tested children managed both MCL and SRT levels very well in all tested scenarios. In child #3, the deterioration of hearing quality in the case of the right ear was confirmed. Child #7 (with cochlear implants (CI)) achieved very good results in the tests, which indicate the correct functioning of her cochlear implant.

**Table 1.** Results of the speech audiometry with the developed web-based audiometry application. CI \* = cochlear implants; SPL = sound pressure level; MCL = most comfortable loudness; SRT = speech recognition threshold.


#### *3.3. Experiments with Hearing-Impaired Elderly Woman*

The last participant of the audiometry tests was a 72-year-old hearing-impaired woman. She has a diagnosed 60% hearing loss and wears a hearing aid in her right ear. We decided to involve her in the experiment for several reasons. The first was that no other participant with a hearing aid or cochlear implant was currently available to us. The second supporting idea was that we supposed that pediatric speech audiometry methods could be beneficial for testing elderly patients too. It was interesting to test a subject from this group and collect first observations.

Two main testing scenarios were conducted with this patient: with a hearing aid and without a hearing aid. We performed tests in a free field environment and with closed headphones. Results of all experiments are presented in Table 2.

**Table 2.** Results of the speech audiometry (word recognition score) tests of the elderly patient.


First tests were performed in the free field scenario using a hearing aid. In the case of presentation of speech stimuli at MCL, we obtained a word recognition score of around 75% WRS. When we decreased the sound pressure level to 35 dB, the recognition score decreased to 50% WRS. It is necessary to note that she needed to listen to speech stimuli several times to be able to recognize the word. Phonetically similar words (e.g., "*vlak*" and "*vták*") were the most difficult for her to recognize. In the case of words that partially overlap, she anticipated the correct answer from the combination of pictures and the heard part of the word. This situation was observed and reported for the words "*auto*" and "*autobus*", where "*auto*" is part of both.

When we tested without her hearing aid, the situation was completely different. She was able to detect the sound only with very loud stimuli around 70 dB, and the word recognition score was very poor—under 20% WRS. Tests with closed headphones were performed too, but only with a hearing aid and with SPL equal to MCL. The result of this testing was 89% WRS.

These results show that the hearing aid works at an acceptable level, as the lowest acceptable WRS of 50% is already reached near the SRT. The overall impression of using the designed audiometry application was interesting for us. Initially there was a reluctance to participate. After overcoming the initial rejection, she passed the whole testing without any problems, even for testing without her hearing aid. The overall length of the test was acceptable for her, but the provided pictures seemed too childish to her.

#### *3.4. Results Summary*

The evaluation was performed several times with healthy children, two children with hearing impairment, and one elderly (72-year-old) individual with a hearing aid. In these experiments, we considered as healthy children those who had not been clinically diagnosed with any hearing problems before. The hearing-impaired children were 4 and 16 years old. In this study, we instructed the parents to contact the clinician when the results of the test fell under 50%, as described elsewhere [1]. More accurate results can be obtained using headphones, when each ear is measured separately, which eliminates the problem of cross-hearing.

Testing the app with an elderly person showed us that it can be easily used for speech audiometry testing in this group of patients. Both children and the elderly were able to easily interact with the application thanks to pointing gestures on the touchscreen. The large size of the pictures seems to be important too. The selection of words, which covers words known by children, is also suitable for testing elderly patients with reduced mental capabilities.

#### **4. Conclusions**

In this work, the web-based pediatric speech audiometry application for hearing impairment detection was described and evaluated. The designed speech audiometry application is suitable for use in the home environment. It enabled us to measure the word recognition score (WRS) in a free field scenario and also to measure each ear separately using headphones. The application adopts conditioned play audiometry principles and can be classified as a speech recognition test. Recordings from the newly designed Slovak kindergarten word list (SKWL) were used as speech stimuli. The SKWL meets all requirements for audiometric data and, together with the corresponding images and speech audio recordings, creates a unique, novel database especially suitable for pediatric otological patients during long-term therapy, with a high user acceptance level among pediatric and elderly patients.

The evaluation shows that the designed application can detect hearing problems at an early stage to support better intervention. More accurate results can be obtained using headphones, when each ear is measured separately, which eliminates the cross-hearing problem. Children accepted the application very well. They liked the application and did not want to stop playing it. Some stress was observed when the child was not successful several times in a row or in situations when he or she perceived the presentation volume level as too low. In comparison with the classical speech audiometry methodology using live speech as a stimulus, the designed application removes the problem of lip reading. The application can be used to measure different levels and to evaluate the hearing loss or to verify the functionality of the hearing aid. Even though we initially intended to develop the application to support speech audiometry performed by therapists, experimentation with the application showed us many other cases where the application can be used:


In the future, we plan to improve the application in several areas: extending the number of levels, adding more phonetically similar word pairs, and enabling parents to identify words which are unknown to their child. We also plan to add other types of tests, such as testing of the speech detection and speech recognition thresholds, and to develop an application for the Ling 6-word test. Following the proposed web application, we developed an Android-based application, which will soon be available on Google Play for free. The next idea is to use an automatic speech recognition system and natural language processing tools (see [32]) to enable the child to react using his/her voice, or to prepare more sophisticated audiometric games. We plan to test the application with autistic pediatric patients and with a larger group of elderly patients. We have already started a collaboration with the Bulgarian Academy of Sciences and EPU University on a Bulgarian version of this application for elderly people [33].

**Author Contributions:** Conceptualization, S.O. and J.Z.; data curation, E.K. and L.H.; formal analysis, E.K., M.O., and J.J.; funding acquisition, S.O., M.P., and J.Z.; investigation, M.P.; methodology, M.P. and M.O.; project administration, J.Z.; resources, E.K.; software, S.O. and L.H.; supervision, S.O.; visualization, L.H.; writing—original draft, S.O. and E.K.; writing—review and editing, M.P. and J.J. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the Slovak Research and Development Agency project numbers APVV-0077-11, APVV-15-0492, APVV-15-0731, the Cultural and Educational Grant Agency of the Slovak Republic project number KEGA 009TUKE-4-2019, and Scientific Grant Agency of the Ministry of Education, Science, Research and Sport of the Slovak Republic project number VEGA 1/0753/20.

**Conflicts of Interest:** The authors declare no conflicts of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Lex-Pos Feature-Based Grammar Error Detection System for the English Language**

#### **Nancy Agarwal, Mudasir Ahmad Wani \* and Patrick Bours**

Department of Information Security and Communication Technology, Norwegian University of Science and Technology (NTNU), 2815 Gjøvik, Norway; nancy.agarwal@ntnu.no (N.A.); patrick.bours@ntnu.no (P.B.)

**\*** Correspondence: mudasir.a.wani@ntnu.no

Received: 6 September 2020; Accepted: 1 October 2020; Published: 14 October 2020

**Abstract:** This work focuses on designing a grammar detection system that understands both structural and contextual information of sentences for validating whether English sentences are grammatically correct. Most existing systems model a grammar detector by translating the sentences into sequences of either the words appearing in the sentences or syntactic tags holding the grammar knowledge of the sentences. In this paper, we show that both these sequencing approaches have limitations. The former model is over-specific, whereas the latter model is over-generalized, which in turn affects the performance of the grammar classifier. Therefore, the paper proposes a new sequencing approach that contains both kinds of information, linguistic as well as syntactic, about a sentence. We call this sequence a Lex-Pos sequence. The main objective of the paper is to demonstrate that the proposed Lex-Pos sequence has the potential to imbibe the specific nature of the linguistic words (i.e., lexicals) and the generic structural characteristics of a sentence via Part-Of-Speech (POS) tags, and so can lead to a significant improvement in detecting grammar errors. Furthermore, the paper proposes a new vector representation technique, Word Embedding One-Hot Encoding (*WEOE*), to transform this Lex-Pos sequence into mathematical values. The paper also introduces a new error induction technique to artificially generate POS-tag specific incorrect sentences for training. The classifier is trained using two corpora of incorrect sentences, one with general errors and another with POS-tag specific errors. A Long Short-Term Memory (LSTM) neural network architecture has been employed to build the grammar classifier. The study conducts nine experiments to validate the strength of the Lex-Pos sequences. The Lex-Pos-based models are observed to be superior in two ways: (1) they give more accurate predictions; and (2) they are more stable, as smaller accuracy drops were recorded from training to testing. To further prove the potential of the proposed Lex-Pos-based model, we compare it with some well-known existing studies.

**Keywords:** Natural Language Processing; deep learning; grammar error detection; word embedding

#### **1. Introduction**

With the advent and continuous advancement of Natural Language Processing (NLP), which aims to enable a machine to understand human language, the problem of designing a grammar error detector for natural language is also gaining much attention from researchers [1–4]. Non-native speakers of a language have a hard time writing grammatically correct sentences. For example, there is a large section of English language learners who need a tool to check if their written content contains grammatical errors [3]. The primary task of a grammar classifier is to predict whether a sentence is grammatically valid or not. The automatic grammar detector can also be applied to grade the writing style of a person by counting the incorrect sentences in their content [1]. Furthermore, a grammar detector can be employed to evaluate the output of Machine Translation (MT) systems, which are designed to produce grammatically correct sentences, by highlighting the translated sentences which contain errors [2].

The language error detection problem is mostly considered as a sequence labeling task where a supervised learning approach is adopted to predict whether the input sequence is grammatically correct or not. Most of the existing studies use one out of two approaches to convert an English sentence into a sequence for the classification task. In the first approach, the sentence is processed as a sequence of words as they appear in the text [1,5]. We refer to this sequence as a lexical sequence. For example, the sentence *"I am reading a book"* will be transformed into the sequence <*I*> <*am*> <*reading*> <*a*> <*book*>. In the second approach, a sentence is converted into the sequence of tokens which indicate its structural or syntactic information [6,7]. We call these types of sequences syntactic. For example, the syntactic sequence of the same sentence will be <*subject*> <*helping* − *verb*> <*verb*> <*article*> <*object*>. This is more like specifying the grammar-domain of words used in a sentence. Researchers use various tools such as dependency parser and Part-Of-Speech (POS) tagger to obtain the structural information of a sentence.

However, we observe that both types of sequences have their inherent limitations. The model trained on lexical sequences is highly specific to the vocabulary domain of the sentences. Therefore, these models do not generalize well. This implies that, if the sentences in a training set are not enough to cover a large portion of the English language, the words in test sequences will appear strange to the model. On the other hand, the model trained on syntactic sequences overcomes this limitation by relying on the structural characteristics of the sentences, which allows the model to generalize the rules. However, too much generalization is also not good for the model, as it often provides insufficient knowledge about the grammar used in a sentence. For example, both words *"a"* and *"an"* are articles, but they are used in different contexts (e.g., *"an apple"*, *"a banana"*), which cannot be reflected by a syntactic sequence only.

We address this problem by proposing a novel sequence named as Lex-Pos sequence that attempts to capture the specific nature of the lexical sequence and generic nature of the syntactic sequence of a sentence. The structural organization of a sentence in the Lex-Pos format is represented using Part-Of-Speech tags. The required linguistic knowledge is added to the structural knowledge of the sentence to prevent the grammar error classifier from over-generalization.

Since the proposed Lex-Pos sequence contains both lexical tokens and POS-tag tokens, we introduce a new vector representation to represent this sequence in a machine-understandable format. We fused two vector representation techniques, viz., word embedding and one-hot encoding, to derive the vector of Lex-Pos sequences. We named this representation Word Embedding One-Hot Encoding (*WEOE*). In this *WEOE* vector representation, the lexical tokens in a sequence are converted into embedding vectors, whereas syntactic tokens are converted into binary vectors.
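The following NumPy sketch illustrates the WEOE idea under stated assumptions (toy vocabularies, a random stand-in for the learned word embeddings, and a common padded width); it is not the authors' exact implementation.

```python
import numpy as np

# Assumed toy vocabularies; the real model would use a learned embedding matrix
# and the full NLTK tag set.
POS_TAGS = ["PRP", "VBP", "DT", "NN"]
EMB_DIM = max(8, len(POS_TAGS))  # pad both token types to the same width

rng = np.random.default_rng(0)
word_embeddings = {w: rng.normal(size=EMB_DIM) for w in ["i", "have", "an", "a", "umbrella", "cat"]}

def encode_token(token):
    """POS-tag tokens -> one-hot (binary) vectors; lexical tokens -> embedding vectors."""
    if token in POS_TAGS:
        vec = np.zeros(EMB_DIM)
        vec[POS_TAGS.index(token)] = 1.0
        return vec
    return word_embeddings.get(token, np.zeros(EMB_DIM))  # zeros for unseen words

def encode_sequence(lex_pos_tokens):
    return np.stack([encode_token(t) for t in lex_pos_tokens])

# A Lex-Pos style sequence mixing POS tags with lexical hints around the <DT> tag.
seq = ["PRP", "VBP", "DT", "an", "umbrella", "NN"]
print(encode_sequence(seq).shape)  # (6, 8)
```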

In order to design the grammar error detection algorithm, a large corpus containing a satisfactory quantity of both correct and incorrect sentences is required. The correct sentences are acquired from the Lang-8 English learner corpus (https://sites.google.com/site/naistlang8corpora/). However, for designing a dataset of grammatically invalid sentences, an artificial error corpus is created by inducing grammatical errors into the correct sentences of the Lang-8 dataset. Regarding grammar error types, there is a variety of errors in the English language, and we divide them into two categories, viz., syntactic errors and semantic errors. Syntactic errors are caused by varied reasons; for example, a word in a sentence is not spelled correctly (misspelling error), a verb does not conform to the subject (subject-verb agreement error), or a preposition is incorrectly used (preposition error). On the other hand, sentences with semantic errors are structurally correct but do not make any sense in real life, for example, 'I am eating water' or 'we are running a banana'. The proposed approach can detect all the syntactic errors in an English sentence and verifies the grammatical structure of a sentence, but it does not ensure that a sentence is semantically valid (i.e., that the sentence is meaningful).

Since our target is to train the classifier to differentiate between a valid and an invalid Lex-Pos sequence, which contains two kinds of tokens (i.e., lexical and POS-tag), two sets of incorrect sentences are designed, one with general errors and another with POS-tag specific errors. The general errors are induced to make the model aware of lexical-specific mistakes. The existing error introduction techniques such as *missing verb errors*, *repeated word errors*, *subject-verb agreement errors*, etc. [1] have been used to create different types of such ungrammatical sentences. However, for designing the second error corpus, a new error induction method has been implemented that induces POS-tag specific errors in the correct English sentences.

In this paper, the major focus is to show that the proposed Lex-Pos sequence which incorporates both linguistic and structural information of a sentence can markedly enhance the performance of the grammar error detection classifier. The source code for the proposed approach has been made available for the researchers (https://github.com/Machine-Learning-and-Data-Science/Lex-POS-Approach). The main contributions of the work are summarised as follows.


The remainder of the paper is structured as follows: Section 2 discusses the literature about grammar detection and correction systems. The proposed Lex-Pos sequence is explained in Section 3, and the datasets and pre-processing are presented in Section 4. In Section 5, different error induction methods are discussed, including the newly introduced tag-specific error induction. Section 6 presents the novel sequence representation technique that has been used for designing a grammar error detector in this study. The experimental setup and results are discussed in Section 7. Section 8 provides a comparison with existing studies. In Section 9, we discuss a few limitations of our study, and finally, Section 10 concludes the overall work on the Lex-Pos feature-based grammar error detection system for the English language.

#### **2. Background Study**

In the grammar detection problem, the sentences are mostly converted into some sequence to obtain a feature set for experiments. Prior works have mainly focused either on considering the sentence itself as a sequence of words or on extracting a sequence of tokens which depicts the structure of a sentence. For example, the authors of [1] combined the POS tags of the sentence and the output of the XLE parser (https://ling.sprachwiss.uni-konstanz.de/pages/xle/) to extract the feature set for identifying grammatically ill-formed sentences. The authors also proposed the design of an artificial error corpus for training the model by introducing four types of grammatical mistakes including missing word errors, extra word errors, spelling errors, and agreement errors. The work is further extended in [2], where probabilistic parsing features are incorporated with the POS *n*-grams and XLE-based features to improve the results. In [6], the authors propose a classifier to detect grammatical mistakes in the output produced by Statistical Machine Translation (SMT) systems. The structure of the sentences has been captured using multi-hot encoding, where the word vector represents three types of information: *POS tag*, *morphology* and *dependency relation*.

A large section of researchers has focused on representing the sentences using word embedding vectors. The authors of [8] propose a Grammar Error Corrector (GEC) model using a convolutional encoder-decoder architecture, which was trained on word embeddings of the sentences. Another work [3] proposes word embeddings that consider both the grammaticality of the target word and the error patterns. To create incorrect sentences in the corpus, the target word in the sentence has been replaced with a similar but different word that often confuses learners, for example, replacing 'peace' with 'piece'. The authors of [9] have designed a translation model that assists in understanding an unseen word using its context; an encoder-decoder model capable of handling Out Of Vocabulary (OOV) words has been employed. The authors of [10] also utilize a Convolutional Neural Network (CNN) to build a GEC model. However, the problem is considered as a binary classification rather than a sequence-to-sequence problem. The task of the model is to predict the grammatical correctness of a word based on the context where it has been used in the sentence. The authors also implement word embeddings to represent the sequence of a sentence and a substitution error induction method to artificially create the negative samples in the training set.

There are also several studies that attempt to integrate a different level of information of the sentence in the sequence. For example, in [11], word-based sequences represented using word embedding are applied to build a neural GEC model. They also infuse character-level information in the neural network where the word embedding representation of OOV words depends on their character sequences. Study [12] attempts to detect the prepositional mistakes in the sentences by extracting the contextual information of the prepositions. The authors in this study integrated the prepositional words (e.g., *into* or *at*) with the noun or verb phrases to predict the probability of their correct usage in the sentences. Similarly, [13] worked on identifying prepositional errors by combining POS-tagged and parsed information with English words. In our work, we convert the complete sentence into a sequence that contains both structural as well as contextual information. The structural tokens are represented using one-hot encoding and context tokens are represented using word embedding.

Other studies on grammatical error detection focus only on specific errors, such as article errors, adjective errors or preposition errors [7,14,15]. The authors of [7] proposed four error generation methods to introduce article mistakes statistically in English sentences to create negative samples that resemble grammar errors naturally occurring in second language learner texts. A model has been designed to detect and correct article errors. Similarly, the authors in [16] put their efforts into selectively correcting article errors in the sentences. Instead of using all the words in sentences, the model is trained on the sequence of words surrounding the articles only, i.e., *n* words before and after the article. Article [14] focuses on the mistakes committed by the learner while using adjectives with nouns in sentences. In our study, an attempt is made to target all kinds of errors with special attention to POS-tag specific errors. Therefore, our work utilizes two corpora for negative samples, one with general errors and another with tag specific errors.

#### **3. Lex-Pos Sequence**

Earlier studies have mainly focused on either lexical knowledge of the sentences, such as the words appearing in the text, or syntactic knowledge of the sentences, such as POS tags, as features for training the grammar detection model. In a lexical-based approach, an English sentence can mostly be directly converted into a sequence of words by splitting it on spaces, whereas in a syntactic-based approach, the sentence is first converted into a grammatical structure using tools like a dependency parser (http://www.nltk.org/howto/dependency.html) or tagger (https://www.nltk.org/book/ch05.html), and then a sequence is designed by extracting the relevant information.

However, lexical-based models highly depend on the vocabulary of the sentences in the training set; therefore, these models are difficult to generalize. For example, a model trained on sentence *S*1: *"I have an umbrella"* might fail to understand the grammaticality of the sentence *S*2: *"I have a cat"* during testing, as the words *"a"* and *"cat"* appear new to the model. Therefore, a model trained on the word vocabulary of the sentences is highly vulnerable to categorizing unseen sentences as incorrect.

On the other hand, learning the structure of the sentences allows the model to generalize the rules. For example, the *NLTK pos-tagger* converts both of the above sentences (*S*1 and *S*2) into the same sequence of POS tags, i.e., <*PRP*> <*VBP*> <*DT*> <*NN*>, denoting the *personal pronoun*, *present tense verb*, *determiner* and *noun*, respectively. Therefore, the model trained on syntactic features of the sentence *"I have an umbrella"* can easily predict the structure of the sentence *"I have a cat"* as correct. However, too much generalization can also increase the false alarms. For example, the *pos-tagger* tool generates the same sequence for the two sentences *"I have a umbrella"* and *"I have an umbrella"*, i.e., <*PRP*> <*VBP*> <*DT*> <*NN*>. Here the articles *a* and *an* are both categorized under the same tag <*DT*>.
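The behaviour described above can be checked with a few lines of NLTK (assuming the averaged perceptron tagger model has been downloaded); all three sentences, including the ungrammatical *"I have a umbrella"*, collapse to the same POS sequence:

```python
import nltk

# Download the tagger model once (a no-op if it is already present).
nltk.download("averaged_perceptron_tagger", quiet=True)

for sentence in ["I have an umbrella", "I have a umbrella", "I have a cat"]:
    tags = [tag for _, tag in nltk.pos_tag(sentence.split())]
    print(sentence, "->", tags)

# All three sentences yield ['PRP', 'VBP', 'DT', 'NN'], so a purely syntactic
# model cannot distinguish "a umbrella" from "an umbrella".
```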

Therefore, in this paper, we introduce a new sequence, viz., Lex-Pos, which combines the specificity of the lexical approach with the generalization of the structural characteristics of sentences. In this feature set, we embed the required linguistic knowledge in the POS-tag sequence of the sentence so that the model can learn to generalize the structure of the sentence *"I have an umbrella"* to *"I have a cat"* and, at the same time, also distinguish it from the sentence *"I have a umbrella"*.

In order to construct the Lex-Pos sequence, we first need to identify the problematic POS tags which overgeneralize the structure of a sentence. For example, in the sentences *"I have an umbrella"* and *"I have a umbrella"*, <*DT*> is the tag which causes the problem. Once we identify these problematic POS tags, we embed additional linguistic knowledge into such tags. For example, the <*DT*> tag is integrated with two tokens: first the article (i.e., a/an/the) itself, and second the pronouncing alphabet of the word that follows the article, as shown in sentences 1, 2 and 3 in Table 1. The *pronouncing* (https://pypi.org/project/pronouncing/) library of Python has been used to obtain the pronounced letter of the word.

In the case of the *NLTK pos-tagger*, the other tags found to be problematic include <*PRP*>, representing a personal pronoun (e.g., *he*, *she*, *I*, *we*, or *you*), <*VBP*>, representing a verb such as *am*, *are*, or *have*, and <*IN*>, representing a preposition/subordinating conjunction, e.g., *in*, *at*, or *on*. All these tags in the syntactic sequence of a sentence are provided with extra linguistic information. Algorithm 1 illustrates the step-wise design of the Lex-Pos sequence.

#### **Algorithm 1:** Lex-Pos Sequence.





**Table 1.** Examples of Lex-Pos sequences.

Table 1 shows a few instances of Lex-Pos sequences. It can be seen in Table 1 that the two tags <*PRP*> and <*VBP*> are appended with the information of the personal pronoun and the helping verb in sentences 4, 5 and 6. In the case of the <*IN*> tag, three lexical tokens are appended, namely, the preposition, the word preceding the preposition and the word following the preposition. However, if the preceding or following word is a <*DT*>-tag word such as *the* or *some*, then it is ignored and the next word in the sequence is appended instead, as shown for the last sentence in Table 1.
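As an illustration (a sketch of ours, not the authors' Algorithm 1), the snippet below shows how a <*DT*> tag can be enriched with the article and the first phoneme of the following word using the *NLTK pos-tagger* and the *pronouncing* library mentioned above; the token format is a hypothetical placeholder and only the <*DT*> case is handled.

```python
# Minimal sketch of DT-tag enrichment; requires nltk resources "punkt" and
# "averaged_perceptron_tagger" to be downloaded beforehand.
import nltk
import pronouncing

def lex_pos_dt(sentence):
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    sequence = []
    for i, (word, tag) in enumerate(tagged):
        if tag == "DT" and i + 1 < len(tagged):
            next_word = tagged[i + 1][0]
            phones = pronouncing.phones_for_word(next_word.lower())
            # first phoneme of the following word, stress digit stripped
            first_phone = phones[0].split()[0].rstrip("012") if phones else "UNK"
            # "DT_<article>_<phoneme>" is a hypothetical token format
            sequence.append(f"DT_{word.lower()}_{first_phone}")
        else:
            sequence.append(tag)
    return sequence

print(lex_pos_dt("I have an umbrella"))  # e.g. ['PRP', 'VBP', 'DT_an_AH', 'NN']
print(lex_pos_dt("I have a umbrella"))   # the <DT> token now differs from the one above
```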

#### **4. Datasets and Pre-Processing**

Training grammar classifiers requires both correct and incorrect sentences in a dataset. We used the Lang-8 Corpus of Learner English as the source of grammatically valid English sentences for our experiments. The dataset contains over 5 million sentences, with sentence lengths ranging from 1 to 80 words. We selected sentences of fewer than 15 words in order to reduce the variation in sentence length during training of the model, which yielded around 1 million correct sentences. The incorrect sentences are obtained from the correct corpus by error induction programs, which are explained in detail in Section 5.

Although the sentences in the Lang-8 corpus are already verified as grammatically correct, we performed a few pre-processing steps so as to design an efficient dataset for training. First, we converted the sentences into lower case. Then, we replaced the contracted forms of auxiliaries with their long forms (e.g., *"I'm not"* → *"I am not"*). Numbers in the sentences are also replaced with the keyword *digit* to reduce variation (e.g., *"I am 16 years old"* → *"I am digit years old"*). However, we did not remove any punctuation marks from the sentences, as they hold significant knowledge of the structure of the sentences. The python libraries *nltk* (https://www.nltk.org/) and *re* (https://docs.python.org/3/library/re.html) were used to pre-process the sentences.
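A rough sketch of these steps, using only the *re* module, is shown below; the contraction map is a small illustrative subset rather than the authors' complete list.

```python
import re

# Illustrative contraction map only; the authors' full list is not reproduced here.
CONTRACTIONS = {"'m": " am", "'re": " are", "n't": " not", "'ve": " have", "'ll": " will"}

def preprocess(sentence):
    s = sentence.lower()
    for short, full in CONTRACTIONS.items():
        s = s.replace(short, full)       # "i'm not" -> "i am not"
    s = re.sub(r"\d+", "digit", s)       # "16 years" -> "digit years"
    return s                             # punctuation is deliberately kept

print(preprocess("I'm 16 years old."))   # -> "i am digit years old."
```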

#### **5. Error Induction Methods**

In this section, we describe the procedure used to generate an artificial error corpus from the Lang-8 dataset, which has been made available for researchers (https://github.com/Machine-Learning-and-Data-Science/Lex-POS-Approach). Our target is to train a machine-learning-based model to differentiate correct sentence sequences from wrong ones. Various researchers have used the notion of breeding artificial error data for training a grammar detector model [1,2]. A sentence can be grammatically invalid for varied reasons; for example, a word is misspelled, a verb does not agree with its subject, or a preposition is used incorrectly. Training requires a large set of grammatically incorrect sentences containing enough samples of each kind of error, which is hard to collect from sentences produced by native speakers or writers. However, a sufficiently large dataset of grammatically incorrect sentences can be created by performing certain transformations on the grammatically correct sentences (e.g., *inserting*, *replacing*, *repeating* or *deleting* words). While inserting the errors, proper linguistic knowledge is required in order to ensure that the sentence produced by the script is indeed grammatically unacceptable. For example, consider the sentence *"she bought two fresh apples"*: merely deleting the word *"fresh"* does not make the sentence incorrect.

In this work, two types of error induction methods, namely General Error Induction and Tag-Specific Error Induction, are employed. Sentences with general errors mainly assist the detector in learning lexical mistakes, while tag-based errors help the model learn about POS-tag-related mistakes. Both error induction methods are discussed in the following sub-sections.

#### *5.1. General Error Induction Methods*

General errors comprise the error types that have mostly been adopted by earlier studies for creating incorrect sentences. In our dataset, we introduce 5 types of errors, i.e., *misspelled error*, *repeated word error*, *subject-verb agreement error*, *word order error*, and *missing verb error*. Table 2 provides a brief description of these general errors.

In order to ensure that the sentences created by the error induction procedure are grammatically invalid, a few things were taken into consideration. First, we ensure that we do not misspell words that are proper nouns. Proper nouns, such as the names of persons, form an open class for which no dictionary is complete. For example, *"Alice is having tea"*, *"Aliceee is having tea"*, and *"Ali is having tea"* are all correct sentences. Therefore, we avoid misspelling proper nouns while creating negative references.


**Table 2.** Examples of General Errors.

For creating sentences with subject-verb agreement errors, we replace singular verbs with plural verbs or the other way round. For example, *are* is replaced with *is*, or *"has"* is replaced with *"have"*.
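A toy sketch of this swap is shown below; the verb pairs are a small illustrative subset, not the authors' full mapping.

```python
# Illustrative subject-verb agreement error induction.
SWAPS = {"is": "are", "are": "is", "has": "have", "have": "has", "was": "were", "were": "was"}

def induce_sva_error(sentence):
    words = sentence.split()
    corrupted = [SWAPS.get(w, w) for w in words]
    # return None when the sentence contains no verb that can be corrupted
    return " ".join(corrupted) if corrupted != words else None

print(induce_sva_error("she has a cat"))   # -> "she have a cat"
```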

While generating the repeated errors, we avoided repeating words like *very* or *so*, as a repetition of such words does not make a sentence grammatically incorrect. For example, both sentences *"I like you very much"* and *"I like you very, very much"* are treated as correct in grammar.

While creating word-order errors, we avoid swapping the helping verb with its subject if the sentence is interrogative, as both sentences *"am I working"* and *"I am working"* are correct in the English language.

Table 3 provides the distribution of errors in the incorrect dataset. It can be noted from the table that the numbers of subject-verb agreement and missing-verb errors are smaller than those of the other types, as these errors are limited to the verbs in the sentences, whereas the domain of the other error types is not limited to verbs only. It should be noted that multiple errors can be introduced in a single sentence, i.e., a sentence can have more than one kind of error. Consequently, even though the total number of errors created is 85,092, the total number of negative samples produced by the general error method is only 62,899.


**Table 3.** Distribution of General Errors.

#### *5.2. Tag-Specific Error Induction Methods*

As discussed in the earlier sections, <*DT*>, <*PRP*>, <*VBP*> and <*IN*> are the tags which provide insufficient knowledge about the structure of a sentence; therefore, these tags must be provided with additional linguistic knowledge when training a machine-learning-based model to differentiate between grammatically correct and incorrect sentences. While creating the tag-specific errors, we introduce errors particularly for these problematic tags to obtain enough negative examples for assisting the model to learn such errors in the sequence structure. Table 4 provides examples of tag-specific instances of the sentences.

The total number of negative samples produced by the tag-specific error method is 50,015, which is less than the total number of errors in Table 5 for the same reason as for the general errors. Table 5 provides the distribution of errors in the incorrect dataset.


**Table 4.** Examples of Tag-specific Errors.



#### **6. Feature Representation**

In the proposed work, we convert every sentence (correct and incorrect) in the dataset to the Lex-Pos sequence as discussed in the earlier section. However, for training the machine-learning-based model, the Lex-Pos sequence needs to be converted into some machine-understandable (mathematical) form. Researchers have employed a variety of ways to represent a linguistic sequence as useful features, e.g., Bag of Words (BoW), *N*-grams, TF-IDF, word embedding, and one-hot encoding [17]. Approaches such as Bag of Words (BoW), *N*-grams and TF-IDF rely on the set of tokens and their frequency in the dataset and are therefore insufficient to capture the exact structure of a sentence.

In one-hot encoding, in contrast, each word in the vocabulary is assigned a unique binary vector, so all distinct words receive distinct representations, and the length of the one-hot vector is determined by the number of words in the vocabulary. The size of a POS-tag vocabulary is usually small, and hence one-hot encoding is a good choice for representing POS-tag sequences. However, one-hot vectors are an inefficient way to represent English words, as the binary vectors would become extremely long due to the large size of the English vocabulary.

Word embedding is another feature representation technique in which every distinct word in the vocabulary is mapped to a numeric vector so that semantically similar words share similar representations in the vector space. One good advantage of using word embedding is that the words can be represented in a much lower dimension than the one-hot encoding. Therefore, word embedding seems an optimal choice to represent the English tokens.

Earlier studies have represented the sequences using either the one-hot vector or embedded-word vector. Since the proposed Lex-Pos sequence consists of both POS tags and English words, we present the feature representation that combines both techniques, named as *WEOE*. In this technique, we first maintain a list called *tag-list* which contains all the POS-tag tokens generated by *NLTK pos-tagger* along with their index values. The *tag-list* assists in identifying the tokens in the Lex-Pos sequence which need to be represented in one-hot encoded form. We also appended pronouncing alphabets of the words to the *tag-list* for adding the linguistic information to the <*DT*> tag as mentioned in Section 3. In order to obtain the word embedding vectors of the English tokens in the Lex-Pos sequence, Google's pre-trained *Word2Vec* model has been utilized. The model includes 300-dimensional word-vectors for around 3 million English words. The *Gensim* library (https://pypi.org/project/gensim/) of python has been used to extract the embeddings from the *Word2Vec* model.

Algorithm 2 explains the procedure of changing the Lex-Pos sequence into the Word Embedding and one-hot Encoding (*WEOE*) representation. Three arguments are passed to the algorithm as input: (1) the Lex-Pos sequence; (2) the *tag-list*; and (3) the *Word2Vec* model. The *sentVector* variable is initialized to store the vector representation of each token in the sequence. Also, every token of the Lex-Pos sequence is initialized with a fixed-length (*n*) vector having all zero entries. In our case, the size of the *tag-list* is less than the size of the embedding vectors of the *Word2Vec* model. Therefore, the value of *n* ranges from *min* to *max*, where the minimum value is the number of tokens in the *tag-list*, and the maximum value is the length of the embedded vector of the *Word2Vec* model.

**Algorithm 2:** *WEOE* representation of a Lex-Pos Sequence.

    begin
        Input: Lex_Pos_Seq, POS_tag_list, word2vec;
        Output: Word Embedding and one-hot Encoding Vector WEOE_Sent_Vec;
        WEOE_Sent_Vec = [];
        foreach token in Lex_Pos_Seq do
            Initialize a zero vector of length n (WEOE_token_Vec);
            if token in POS_tag_list then
                WEOE_token_Vec[POS_tag_list[token]] = 1;
            else if token in word2vec_model then
                WEOE_token_Vec = word2vec(token)[:n];
            end
            Append WEOE_token_Vec to WEOE_Sent_Vec;
        end
    end

In order to generate the *WEOE* feature vector representation, every token of the Lex-Pos sequence is first passed through a filter to check if the token exists in the *tag-list*. If found, the zero vector of the token is replaced with the respective binary vector; otherwise, the token is searched for in the *Word2Vec* model. If a token is found in the model, the zero vector is replaced with its embedded representation. The token is considered as unknown if it is not found in either of the lists. Finally, the vector values of a token are appended to the *sentVector*.
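A simplified sketch of this encoding is given below, assuming gensim's KeyedVectors is used to load the pre-trained *Word2Vec* model; the file name, the toy *tag-list* and the vector length are illustrative assumptions.

```python
import numpy as np
from gensim.models import KeyedVectors

# Assumed local copy of Google's pre-trained 300-dimensional Word2Vec vectors.
w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
tag_list = {"PRP": 0, "VBP": 1, "NN": 2, "DT_an_AH": 3}   # token -> one-hot index (toy example)

def weoe(lex_pos_seq, n=300):
    sent_vec = []
    for token in lex_pos_seq:
        vec = np.zeros(n)
        if token in tag_list:            # POS-tag (or pronouncing-alphabet) token -> one-hot
            vec[tag_list[token]] = 1.0
        elif token in w2v:               # English word -> word-embedding vector
            vec = np.asarray(w2v[token][:n], dtype=float)
        # tokens found in neither list are treated as unknown and stay zero vectors
        sent_vec.append(vec)
    return np.vstack(sent_vec)

features = weoe(["PRP", "VBP", "DT_an_AH", "NN"])
print(features.shape)   # (4, 300)
```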

#### **7. Experiments and Results**

In Section 4, we discussed the corpus of grammatically correct sentences, and in Section 5, we presented two types of error induction methods for creating two different corpora of negative samples, i.e., one with incorrect sentences containing general errors and another with incorrect sentences having POS-tag specific errors. All three corpora are utilized to create three datasets in the following manner.

**Dataset 1:** Correct sentences + incorrect sentences with general errors;
**Dataset 2:** Correct sentences + incorrect sentences with tag-specific errors;
**Dataset 3:** Correct sentences + incorrect sentences with both general and tag-specific errors.

Earlier studies on grammar classifiers have employed either lexical sequences or POS-tag sequences of a sentence for grammar classification. This work presents a Lex-Pos sequence which combines the specificity of lexical sequences with the generalization of POS-tag sequences. Therefore, we compare the efficiency of a classifier trained on Lex-Pos sequences with classifiers modeled using lexical and POS-tag sequences. We evaluate the performance of the proposed work on detecting grammatical errors using the 3 datasets described above. Sentences of each dataset are converted into the three types of sequences, lexical sequence, POS-tag sequence and Lex-Pos sequence, as shown earlier in Table 1. With 3 datasets and 3 types of sequences to represent a sentence in each dataset, a total of nine experiments are conducted for comparing the performances of the proposed grammar detector model.

In Section 6, we discussed the one-hot encoding and word embedding representations for denoting a linguistic sequence in numeric form. In the experiments, we represent the lexical sequences of sentences using word embedding vectors, as this allows a word to be represented in lower dimensions. The POS-tag sequences are represented using one-hot encoded vectors, as the list of POS-tags is very limited. The Lex-Pos sequences are represented using the *WEOE* feature vector. Before training the model, all three datasets were balanced by randomly removing extra instances from the dataset where required. The final size of each dataset used in the experiments is shown in Table 6.


**Table 6.** Statistics of Corpora.

The Long Short-Term Memory (LSTM) neural network architecture has been employed to build the classifier. An LSTM network is a variant of the Recurrent Neural Network (RNN) and is extensively used in solving NLP problems, as it is capable of learning the structure of sequential data. All the datasets are split in the ratio of 80:20 for training and testing, respectively. The *Keras* framework has been used for implementation. In all nine experiments, we used *sparse-categorical-crossentropy* as the loss function and *adam* as the optimizer with a batch size of 2000. The outermost layer of the network is a *dense* layer with 2 nodes and a *softmax activation* function. Since we are using balanced datasets, the accuracy metric has been evaluated to assess the performance of the models.
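For orientation, a minimal Keras sketch of this configuration is given below; the number of LSTM units, the padded sequence length, the feature dimension and the dummy data are assumptions, not values reported in this work.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

max_len, feat_dim = 15, 300                      # assumed sequence length and feature size
model = Sequential([
    LSTM(64, input_shape=(max_len, feat_dim)),   # 64 units is an assumed value
    Dense(2, activation="softmax"),               # correct vs. incorrect sentence
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

# Illustrative dummy data only; in the experiments the WEOE vectors of the datasets are used.
x = np.random.rand(100, max_len, feat_dim)
y = np.random.randint(0, 2, size=100)
model.fit(x, y, batch_size=2000, epochs=1, verbose=0)
```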

The results shown in Tables 7 and 8 are for the grammar classifiers trained on lexical and POS-tag sequences of the sentences, respectively. Comparing the vocabulary size (i.e., the number of unique tokens in the training sets) of the datasets for both sequence types, the vocabulary size of the POS-tag sequences (38 or 39) is much smaller than that of the lexical ones (15,796 to 20,725). This reflects the generalization obtained by keeping the structure of the sentences in syntactic form. The accuracy obtained on the testing sets of lexical sequences is 80%, 96% and 80% for datasets 1, 2 and 3, respectively. On the other hand, the accuracy values obtained on the testing sets of POS-tag sequences are 79%, 75% and 73%, which are significantly lower than the accuracy recorded for the lexical-based classifier. This indicates that the classifier performs better with lexical sequences than with POS-tag sequences on all the datasets.



**Table 8.** POS-tag Sequence—One-hot Encoding Representation.

Also, as mentioned while creating the tag-specific errors, these are precisely the errors that a POS-tag classifier finds difficult to discriminate. This is reflected in the results of Table 8, where it can be seen that the POS-tag classifier achieves better accuracy on dataset 1, which contains general errors in the negative samples (79%), than on dataset 2, which contains tag-specific errors in the negative samples (75%).

However, it can also be noticed in the results shown in Tables 7 and 8 that the accuracy drops from the training to the testing sets are higher for lexical sequences by significant margins. For example, the accuracy obtained on the training set for dataset 3 of lexical sequences (87%) is reduced to 80% on the testing set, i.e., a 7% decrement in accuracy. The value of the loss also increases from 0.29 (training) to 0.48 (testing), i.e., a 19% increase in the loss value. On the other hand, when evaluating the performances of the POS-tag-based classifiers on the training and testing sets of dataset 3, there is a 4% reduction in the accuracy value and a 7% increment in the loss value. This indicates that although the POS-based model is not as accurate as the lexical-based model, it is more stable.

The objective of this paper is to combine the effectiveness and stability characteristics into one model by converting English sentences into Lex-Pos sequences. Table 9 shows the results of the classifiers trained on the Lex-Pos sequences of the sentences with the *WEOE* feature representation. It can be seen that the vocabulary size of Lex-Pos sequences (1026 to 2122) in the training set lies between the vocabulary sizes of the lexical (15,796 to 20,725) and POS-tag sequences (38 to 39). This indicates that the Lex-Pos sequences tend to maintain a balance between the generalization and specialization of the two sequence types. It is evident from the results that the Lex-Pos classifier outperforms both the lexical and POS-tag-based classifiers on all three datasets. The accuracies obtained by the Lex-Pos models on datasets 1, 2 and 3 are 84%, 97% and 87%, respectively.

The results also put the Lex-Pos sequences on top from the aspect of stability, as they obtain lower values for both metrics, the *increment in the loss* and the *decrement in the accuracy*, when moving the classifiers from the training to the testing environment. For example, in dataset 3, the loss values of the Lex-Pos system for training and testing are 32% and 26%, respectively (see Table 9), i.e., a difference of only 6% in the loss. This value is significantly smaller than the loss increments for the lexical (19%, see Table 7) and POS-tag systems (7%, see Table 8). A similar pattern is observed for the accuracy drop. In dataset 3, the accuracy decreases from 89% in training to 87% in testing in the case of Lex-Pos, a drop of only 2%. This accuracy drop of 2% is also markedly lower than the values obtained by the lexical (7%) and POS-tag (4%) classifiers.

**Table 9.** Lex-Pos Sequence—*WEOE* Representation.


#### **8. Comparative Study**

In this section, we compare the proposed work with two well-known existing studies in order to further demonstrate the potential of Lex-Pos sequences. The experimental results show that Lex-Pos sequences represented using *WEOE* feature vectors have more potential to capture the grammatical structure of English sentences than POS-tag sequences and lexical sequences and are thus more suitable for designing grammar-aware systems. We compare our work with two other existing studies, [5,16]. In each comparison, we replicate the models proposed by the authors in their work and conduct two sets of experiments. In the first experimental setup, we feed the sequence mentioned by the authors of [5,16] as input to the implemented model, and in the other setup, we feed the Lex-Pos sequence as input to the implemented model to compare the results.

In [5], the authors designed an essay scoring system to evaluate writing skills. The objective of the system is to assign a rating (i.e., 0–5) to an English essay that reflects the quality of its content based on various parameters, including grammatical correctness. The authors experimented with several deep learning models, such as CNN, RNN, LSTM and LSTM+CNN, and observed that the LSTM-based system outperformed the others. For comparison, we implemented a similar LSTM-based system, which the authors claimed to be the best. The values of the hyper-parameters are set the same as those used by the authors. Table 10 lists the settings of these hyper-parameters used for training the model, referred to here as the Essay model.

**Table 10.** Hyper-parameters settings for Author Model [5].


The authors evaluated the quality of English essays, including short ones, on a scale of 0 to 5. Here, however, we evaluate the quality of English sentences based on their grammatical structure on a scale of 0 or 1, where a score of 0 refers to a correct and a score of 1 to an incorrect sentence. The three datasets discussed in Section 7 have been used for training and testing the Essay model. For comparison, two experimental setups have been established. In the first round of experiments, the Essay model takes the word embeddings of the lexical sequences of sentences (as mentioned by the authors in their work) as input. In the other round, we provided our proposed *WEOE* feature vectors of Lex-Pos sequences as input features to the model.

The results obtained from the two rounds of experiments are shown in Tables 11 and 12, respectively. It can be seen that on the training set, the author methodology (lexical sequence and LSTM model), with accuracies of 0.89, 0.98 and 0.91, shows slightly better performance than the same model trained using Lex-Pos sequences on datasets 1, 2 and 3, respectively. However, in testing, the model trained on Lex-Pos sequences outperforms it on all three datasets, with accuracy values of 0.84, 0.96 and 0.87, respectively. This confirms that models learn more efficiently on Lex-Pos sequences of sentences. Also, looking at the accuracy drops from training to testing, we observe that they are smaller for the author model trained on Lex-Pos-based features. The accuracy drops are 7%, 3% and 6% for the author model trained on lexical sequences of the three datasets, respectively, whereas values of 4%, 1% and 3% have been recorded for the author model trained on Lex-Pos sequences. These results further confirm that the Essay model trained on Lex-Pos sequences is more capable of generalization and is therefore more stable and efficient.


**Table 11.** Performance of Model [5] on lexical sequences.


**Table 12.** Performance of Model [5] on Lex-Pos sequences.

For the second comparison, we carry out experiments on the work proposed by the authors of [16]. In that paper, the authors developed a deep learning model with convolution and pooling layers for detecting article errors in the English sentences. We refer here to this model as the Article model. The Article model takes a sequence of *k* words before and after the article as input in order to learn the surrounding context of the articles. The sequence is translated into a mathematical vector using pre-trained word embeddings. In order to replicate the Article model, we also design a similar CNN model with the same parameters as mentioned in the paper. Table 13 provides the values of these hyper-parameters.

**Table 13.** Hyper-parameters settings for Author Model [16].


The output of the Article model is multiclass, with labels *a*, *an*, *the* and *e*, where *e* indicates no article. The three datasets that have been used so far for training cannot be applied to training this author model, as these datasets have labels 0 and 1 denoting correct and incorrect sentences, respectively. Therefore, for this comparative study, we design a new dataset from the correct sentences that have been used earlier for training the models and assign three labels, 0, 1, and 2, depending on whether the sentences contain *a*, *an* or no article, respectively. We do not consider the article *"the"* for prediction, as the dataset contains instances of single sentences only, which are not sufficient to provide enough knowledge of specific and non-specific nouns. Similar to the first comparison study, we conduct two rounds of experiments. In the first experiment, we extract the context words from the sentences with window size 6, as mentioned by the authors, and translate this sequence into numeric vectors using word embeddings. Afterward, these feature values are supplied to the CNN-based author model for training and testing. In the second setup, we provide the Lex-Pos sequences of sentences transformed using *WEOE* feature vectors as input to the Article model.
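A rough sketch of how such context windows could be extracted is shown below; the interpretation of "window size 6" as three words on each side, the omission of the no-article samples and the helper name are our assumptions, not the exact script used in the experiments.

```python
# Build (context window, label) pairs for article prediction: 0 for "a", 1 for "an".
def article_samples(sentence, k=3):
    tokens = sentence.lower().split()
    samples = []
    for i, word in enumerate(tokens):
        if word in ("a", "an"):
            label = 0 if word == "a" else 1
            context = tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k]
            samples.append((context, label))
    return samples

print(article_samples("I have an umbrella in my bag"))
# -> [(['i', 'have', 'umbrella', 'in', 'my'], 1)]
```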

Table 14 displays the results of both experiments where it can be clearly noticed that the author model performed extremely well on the Lex-Pos sequences of the new dataset by obtaining 99% accuracy, significantly higher than the accuracy yielded by the context-based sequence model, i.e., 90%. The high performance of the Lex-Pos model could be the result of adding phonetic information of the word used immediately after the article into the syntactic sequence of a sentence.

**Table 14.** Performance of Context-based and Lex-Pos Sequence on Author Model [16].


In order to make sure that the proposed approach is statistically significant, we further conducted a number of experimental trials to determine if the Lex-Pos-based classifier can be trusted over the author model. In this regard, 15 pairs of training and testing subsets were constructed by randomly selecting 10,000 and 2000 instances for each pair from the main training and testing set respectively. Afterward, on each pair, both Lex-Pos-based and author model [5] were trained and the respective accuracy values have been recorded. Figure 1 presents a plot drawn from these accuracy values where the x-axis and y-axis represent values obtained by author classifier [5] and Lex-Pos-based classifier respectively. The graph clearly shows that for every pair of subset, Lex-Pos based classifier has performed better by obtaining a higher accuracy score. We also applied paired Student's *t*-tests (https://www.ruf.rice.edu/~bioslabs/tools/stats/ttest.html) on the two sets of accuracy scores to know if the distribution difference is statistically significant. We recorded the *t*-value as 11.516 with a *p*-value less than 0.05 which implies that the accuracy distribution of the two models is statistically different. Therefore, there is sufficient evidence to consider that the Lex-Pos model is better than the author model [5]. It is to be noted that statistical significance test was not conducted for comparing Lex-Pos-based grammar detector with author model [16] as we observed a considerably large improvement in the results, i.e., 9% increment in accuracy.
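A sketch of this paired t-test with the SciPy library is shown below; the accuracy lists are illustrative dummy values, not the scores recorded in the 15 trials.

```python
from scipy.stats import ttest_rel

lexpos_acc = [0.87, 0.88, 0.86, 0.89, 0.87]   # accuracies of the Lex-Pos-based classifier (dummy)
author_acc = [0.84, 0.85, 0.83, 0.86, 0.84]   # accuracies of the author model [5] (dummy)

t_value, p_value = ttest_rel(lexpos_acc, author_acc)
print(t_value, p_value)   # the difference is considered significant if p_value < 0.05
```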

**Figure 1.** Accuracy values on 15 trials.

#### **9. Discussion and Limitations**

In this work, we have proposed the concept of converting an English sentence into a Lex-Pos sequence represented using a *WEOE* feature vector in order to design a grammar detector that is capable of taking advantage of both kinds of sequences, i.e., the specific nature of lexical sequences and the generic nature of syntactic sequences. We compare the performance of the Lex-Pos classifier with models trained individually on lexical and POS-tag sequences of sentences. Lexical sequences were represented using word embedding vectors, while POS-tag sequences were represented using one-hot vector encoding. It is evident from the results that, in terms of accuracy, the lexical-based models perform better than the POS-tag-based models, whereas, in the context of stability, the POS-tag-based model proved to be more trustworthy. However, Lex-Pos sequence-based classifiers have proven to be the best systems in both aspects, accuracy and stability. This confirms the usefulness of providing additional linguistic knowledge to the POS-tag sequences of sentences and shows that the Lex-Pos sequences are more efficient in capturing the grammar structure of the English language.

In order to further demonstrate the potential of Lex-Pos, two grammar-aware models from existing studies have been replicated. The first replica (the LSTM-based Essay model) is designed to score an English sentence based on the correctness of its grammar, and the second replica (the CNN-based Article model) is modeled to classify article errors in the sentence. The experiments show that both author models performed better on the Lex-Pos sequences than on the sequences used in the respective papers. Furthermore, in these experiments too, the author models trained on Lex-Pos sequences are observed to be more stable, with lower accuracy drops from training to testing.

Although the Lex-Pos models are found to be more efficient and trustworthy, there are also a few limitations associated with the present work. First, it does not ensure that a sentence is semantically valid, i.e., that the sentence is meaningful. The proposed model only verifies the grammatical structure of the sentence, and therefore, it will not be able to discriminate between the two sentences *S*1: *"I am eating a banana"* and *S*2: *"I am running a banana"*. Both sentences are valid on syntactic grounds, but the second sentence fails in the semantic context, since *"I am running a banana"* does not make any sense in real life. Secondly, the proposed model is limited to individual sentences only and does not consider dependency between sentences. For example, consider the two sentences *S*3: *"I talked to a boy"* and *S*4: *"She is great"*. If these two sentences are considered independently, then both are correct. But if they are considered in combination, where the second sentence follows the first one, then instead of *"She"* as the subject of *S*4, *"He"* should have been used. These limitations are considered as the future scope of the proposed work.

#### **10. Conclusions and Future Scope**

In this paper, our main aim was to demonstrate that the proposed sequence, namely Lex-Pos, which incorporates both linguistic and structural information of a sentence, can lead to a significant improvement in the performance of grammar error detection. Since the Lex-Pos sequences contain both lexical and POS-tag tokens, these sequences have been translated into numerical values by providing a new embedding technique, i.e., the *WEOE* encoding. Also, two types of error corpora have been designed for making the model learn about lexical and POS-tag-specific mistakes, respectively. A total of three datasets have been used for conducting the experiments, where an LSTM architecture was employed to design the grammar detection system.

In the experiments, we found that classifiers trained on lexical sequences yield more accurate results than classifiers trained on POS-tag sequences. On the contrary, POS-tag-based models are observed to be more stable than the lexical ones. However, Lex-Pos-based classifiers outperform the others in both parameters, accuracy and stability. Lex-Pos sequences are also found to be more efficient and trustworthy on the replica systems designed on the basis of existing studies. The comparative study shows that the Lex-Pos sequences can be further employed to design grammar-aware systems other than error detection, e.g., essay scoring systems and grammar error correction systems. Future work can extend these sequences by imbibing semantic information using methods like named entity recognition in order to make the model learn about semantically valid or invalid sentences.

**Author Contributions:** N.A. and M.A.W. conceived and designed the experiments; N.A. performed the experiments; N.A. and M.A.W. analyzed the data; N.A. prepared the first draft of the paper; M.A.W. edited the paper; P.B. proofread the paper and supervised the overall work. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Acknowledgments:** This work was carried out during the tenure of an ERCIM Alain Bensoussan Fellowship Program.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Mapping Discrete Emotions in the Dimensional Space: An Acoustic Approach**

**Marián Trnka 1,\* , Sakhia Darjaa <sup>1</sup> , Marian Ritomský 1 , Róbert Sabo <sup>1</sup> , Milan Rusko 1,\*, Meilin Schaper <sup>2</sup> and Tim H. Stelkens-Kobsch <sup>2</sup>**

	- meilin.schaper@dlr.de (M.S.); Tim.Stelkens-Kobsch@dlr.de (T.H.S.-K.)

**Abstract:** A frequently used procedure to examine the relationship between categorical and dimensional descriptions of emotions is to ask subjects to place verbal expressions representing emotions in a continuous multidimensional emotional space. This work chooses a different approach. It aims at creating a system predicting the values of Activation and Valence (AV) directly from the sound of emotional speech utterances without the use of their semantic content or any other additional information. The system uses X-vectors to represent sound characteristics of the utterance and a Support Vector Regressor for the estimation of the AV values. The system is trained on a pool of three publicly available databases with dimensional annotation of emotions. The quality of regression is evaluated on the test sets of the same databases. Mapping of categorical emotions to the dimensional space is tested on another pool of eight categorically annotated databases. The aim of the work was to test whether in each unseen database the predicted values of Valence and Activation will place emotion-tagged utterances in the AV space in accordance with expectations based on Russell's circumplex model of affective space. Due to the great variability of speech data, clusters of emotions create overlapping clouds. Their average location can be represented by centroids. A hypothesis on the position of these centroids is formulated and evaluated. The system's ability to separate the emotions is evaluated by measuring the distance of the centroids. It can be concluded that the system works as expected and the positions of the clusters follow the hypothesized rules. Although the variance in individual measurements is still very high and the overlap of emotion clusters is large, it can be stated that the AV coordinates predicted by the system lead to an observable separation of the emotions in accordance with the hypothesis. Knowledge from training databases can therefore be used to predict AV coordinates of unseen data of various origins. This could be used to detect high levels of stress or depression. With the appearance of more dimensionally annotated training data, the systems predicting emotional dimensions from speech sound will become more robust and usable in practical applications in call-centers, avatars, robots, information-providing systems, security applications, and the like.

**Keywords:** emotion recognition; dimensional to categorical emotion representation mapping; activation; arousal and valence regression; X-vectors; SVM

### **1. Introduction**

According to Scherer's component process definition of emotion [1], vocal expression is one of the components of emotion fulfilling the function of communication of reaction and behavioral intention. It is therefore reasonable to assume that some information on the speaker's emotion can be extracted from the speech signal.

We dared to call our article "Mapping discrete emotions into the dimensional space: An acoustic approach", paraphrasing the title of the work [2], to draw attention to the fact

**Citation:** Trnka, M.; Darjaa, S.; Ritomský, M.; Sabo, R.; Rusko, M.; Schaper, M.; Stelkens-Kobsch, T.H. Mapping Discrete Emotions in the Dimensional Space: An Acoustic Approach. *Electronics* **2021**, *10*, 2950. https://doi.org/10.3390/ electronics10232950

Academic Editor: Chiman Kwan

Received: 30 September 2021 Accepted: 25 November 2021 Published: 27 November 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

that many authors attempt to identify the relationship between categorical and dimensional descriptions of emotions by trying to place a verbal term (label) expressing an emotion (i.e., the name of the category) in the dimensional space ([2,3], and others). This position could wrongly be taken automatically as typical also for the vocal (acoustic) realizations of speech utterances produced under the particular emotion. Evaluating word terms designating emotions is a different task from evaluating the emotion contained in the sound of speech utterances; nevertheless, a correlation between the placement of emotion labels and the placement of the respective emotional utterances can intuitively be assumed. This work presents a system capable of predicting continuous values of Activation and Valence from the acoustic signal of an utterance and thus finding a position in the AV space for the emotion presented vocally in the particular segment of speech.

Affect, in psychology, refers to the underlying experience of feeling, emotion or mood [4]. The AV space can be used to represent the affective properties not only of emotional but also of stressful, insisting, warning, or calming speech, as well as of vocal manifestations of a physical condition, such as pain, or a mental condition, such as depression. Coordinates in the AV space can be used to map and compare different types of affective manifestations. For example, one can try to use emotional databases to train a speech stress indicator or an anxiety and depression detector. This work offers a system for predicting such coordinates from the sound of an emotional utterance. However, it must always be kept in mind that representation in two-dimensional space greatly reduces affective (and acoustic) information, and the functionality of such indicative mapping must always be well verified with respect to the needs of the application.

#### **2. Discrete (Categorical) versus Dimensional (Continuous) Characterization of Emotions**

The properties of emotions are usually described either categorically, by assigning the emotion to one of the predefined categories or classes, or dimensionally, by defining the coordinates of the emotion in a continuum of a multidimensional emotional space [5]. Affective states (i.e., emotion, mood, and feeling) are structured in two fundamental dimensions: Valence and Arousal [6]. Russell proposed a circumplex model of affect and categorized verbal expressions in the English language in the two-dimensional space of Arousal–Valence (AV) [3]. The degree-of-arousal dimension is also called activation–deactivation [7], or engagement–disengagement. In this work, we adopt this two-dimensional approach.

As all three dimensionally annotated databases have dimensions called Activation and Valence, from now on we use this terminology, and the difference between the terms Arousal and Activation is neglected. The term Arousal will be used when referring to Russell's work.

In many application scenarios, such as automatic information via voice, using avatars, customer services, etc., it would be useful to have an estimate available of the emotion or stress in the speaker's voice. The system could take the affective state of the customer into account and adapt the mode of communication.

#### *2.1. Issues in Predicting Emotional Dimensions from the Sound of an Utterance*

The possibilities of the human articulatory system are physiologically limited. The acoustic cues of emotions are highly non-specific; the vocal realization of an utterance can be very similar in the presence of different emotions. Affective states form a continuum, and dividing emotions into disjoint classes is an extreme oversimplification. Real emotions are complex; they almost never appear in pure form but rather in mixtures. The meaning of terms describing emotions is ambiguous and culturally and linguistically dependent. Projections of various utterances into the AV space cannot therefore be expected to be well separable with respect to emotion category. However, certain trends in their placement can be expected.

As noted by Gunes and Schuller [5], Activation is known to be well accessible in particular by acoustic features and Valence or positivity is known to be well accessible by linguistic features. Estimating Valence from the sound itself can therefore be particularly challenging. Oflazoglu and Yildirim [8] even claim that the regression performance for the Valence dimension of their system is low and that "This result indicates that acoustic information alone is not enough to discriminate emotions in Valence dimension" ([8], page 9 of 11).

A special issue is that very little is known about the mutual dependency of the dimensions of the emotional space [9,10]. The authors of this research have noticed that it is very hard for the annotators to evaluate Valence independently of Activation when the semantic information is unavailable. The emotions with low activation are often assigned Valence values in the center of the range.

Activation and Dominance show even higher interdependencies. In the analysis of their Turkish emotional database, Oflazoglu and Yildirim [8] show in Figure 8 of their paper the distribution of Activation and Dominance, which appears as a narrow cloud lying on the diagonal, indicating a strong dependence between the ratings of the Activation and Dominance dimensions. Nevertheless, extending the representation of the space to three dimensions (Activation, Valence, Dominance) can help to differentiate emotions (for example, to distinguish Anger from Fear). In this work, Dominance is not addressed.

Ekman argued that emotion is fundamentally genetically determined, so that facial expressions of discrete emotions are interpreted in the same way across most cultures or nations [11,12]. However, the inner image of an emotion in a person's mind and the idea of how it is to be presented in speech depend largely on his or her experience and education, and on the culture in which he or she lives. Lim argues that culture constrains how emotions are felt and expressed and that cross-cultural differences in emotional arousal level have consistently been found. "Western culture is related to high-arousal emotions, whereas Eastern culture is related to low-arousal emotions" [12]. In this work, we examine the vocal manifestations of emotions in four Western languages (English, German, Italian, and Serbian), and as a first approximation we consider the task of automatic prediction of Activation and Valence from sound to be culture independent. One of the results of this work may be information on whether the proposed approach also works on languages other than the one it was trained on.

The biggest problem is that there is no ground truth information available. One has to rely on the values estimated by annotators and consider them as ground truth. However, the number of annotators is often small and the reliability of the evaluation is debatable.

The available emotional speech databases were designed for various purposes, which also means they differ in methodology and annotation convention, instructions to annotators, choice of emotional categories, or even language. Moreover, the annotation of emotions was often done with the help of video, face and body gestures, text or semantic information. This information may be absent (not reflected) in the sound modality. The sound-based predictor then misses this information in the training process.

Another problem is the small volume and limited representativeness of the data available for emotional training. To obtain as large an amount of data as possible for regressor training, and to cover more variability, three publicly available databases with annotated Activation and Valence (AV) dimensions were combined into one pool.

Different emotional databases contain different choices of emotions. In this work, only the emotions that occur in the majority of the available emotional databases are addressed, namely, Angry, Happy, Neutral, and Sad.

The differences in definitions, methodology, and conditions of creation of individual databases have to be taken into account when evaluating the reliability and informative value of the obtained results.

#### *2.2. Hypothesis*

Emotional space is a multidimensional continuum. The cues of emotions in the voice are highly non-specific. Emotions are often present in mixtures, and the meaning (inner representation) of the emotional terms in both speakers and raters is culture dependent. So, the areas into which the individual realizations of emotions are projected in the dimensional space largely overlap. Nevertheless, we assume that the centroids of the clusters of points to which the utterances are projected in the AV space should meet certain basic expectations considering their emotion category.

In order to illustrate the expected distribution of emotions in the AV space, we present in Figure 1 the placement of the stimulus words Anger, Happy, and Sad in the space of pleasure–displeasure and degree of arousal according to Russell [3]. Neutral emotion was not addressed in his work. For simplicity, it can be assumed that Neutral emotion should be located at the origin of the coordinate system.

**Figure 1.** Placement of the stimulus words Anger, Happy, and Sad in the space of pleasure–displeasure (x-axis) and degree of arousal (y-axis) according to Russell [3].

Due to the various sources of uncertainty in dimension prediction and the early phase of research, the hypothesis can only be formulated very vaguely. Our working hypothesis is that when predicting the values of Activation and Valence, the centroid of the cluster of Angry emotion utterances should have a higher Activation value and a lower Valence value than the centroid of Neutral utterances. The centroid of the cluster of Happy emotion utterances should have a higher Activation value and a higher Valence value than the centroid of Neutral utterances. Sad emotion is less pronounced, and its centroid may lie close to the Neutral utterances; anyway, it should have observably lower Valence than Neutral and considerably lower Arousal than the Angry emotion.

#### **3. The Data Used in the Experiments**

#### *3.1. Training Databases*

Three databases were available to the authors, in which values of Activation and Valence were annotated. Each of these three "training databases" was randomly divided into its training set (90% of data) and test set (remaining 10%). This ratio was chosen to preserve as much training data as possible.

IEMOCAP [13]. The Interactive Emotional Dyadic Motion Capture database is an acted, multimodal and multispeaker database in English (10 speakers, 10,000 utterances). It contains 12 h of audiovisual data. The actors perform improvisations or scripted scenarios. The IEMOCAP database is annotated by multiple annotators into categorical labels, such as anger, happiness, sadness, and neutrality, as well as dimensional labels: Valence, Activation, and Dominance.

MSP IMPROV [14]. The MSP-IMPROV corpus is a multimodal emotional database in English (12 speakers, 8500 utterances). Pairs of actors improvised emotion-specific situations. Categorical labels, such as anger, happiness, sadness, and neutrality, as well as dimensional labels—Valence, Activation, and Dominance—are provided.

VaM [15]. The database consists of 12 h of audio-visual recordings of the German TV talk show Vera am Mittag (47 speakers, 1000 utterances). This corpus contains spontaneous and emotional speech in German recorded from unscripted, authentic discussions. The emotion labels are given on a continuous valued scale for three emotion primitives: Valence, Activation, and Dominance.

Recognizing emotions from facial expressions is a common research topic nowadays (see, e.g., [16,17]), and categorical annotation is often based on facial expressions. A part of the VaM database, "VaM Faces", includes such a categorical annotation of emotion based on the facial expression, which can be linked to the corresponding speech utterance. However, this information is available only for a very small number of utterances, and the emotion information contained in the facial expression may not be present in the vocal presentation. Therefore, this categorical annotation of VaM was not used in this work.

The AV dimensions in all three databases were annotated using a five-point self-assessment manikin [18] scale. The final rating is the mean of the ratings of all raters. The values on the AV axes were mapped to the range from 1 to 5 in this work.

In addition to training on individual databases, we also trained on a mixture of all three databases, which we will refer to as MIX3, and on a mixture of two larger databases, IEMOCAP and MSP-IMPROV, which we will call MIX2.

#### *3.2. Testing Databases*

The ability of the regressor to differentiate between emotions, i.e., to place the emotions in the AV space, was tested on ten publicly available databases: EmoDB [19], EMOVO [20], RAVDESS [21], CREMA-D [22], SAVEE [23], VESUS [24], eNTERFACE [25], JL Corpus [26], TESS [27], and GEES [28]. These databases are categorically annotated and do not include information on AV values.

Their content used in this work is briefly listed in Table 1 (Abbreviations used in the table are: Ang—angry; bor—bored; anx—anxious; hap—happy; sad—sad; disg—disgusted; neu—neutral; fear—feared; surp—surprised; calm—calm; exc—excited; Au—audio; Vi—video)


**Table 1.** List of testing databases for cross-corpus experiments.

#### **4. System Architecture**

In the areas of applied machine learning, such as text or vision, embeddings extracted from discriminatively trained neural networks are the state-of-the-art. They are now also used in speaker recognition [29]. The approaches that have been successfully applied in speaker recognition are often adopted in emotion recognition (see e.g., [30–32]).

#### *4.1. X-Vector Approach to Signal Representation*

The approach used in this work is based on neural network embeddings called X-vectors [29]. The X-vector extractor is based on Deep Neural Networks (DNNs), and its training requires large amounts of training data. Ideally, the training data should also include information describing emotions. However, to the knowledge of the authors of this work, no extra-large training database with annotated emotions that would be suitable for training an emotion-focused extractor from scratch is available.

#### 4.1.1. X-Vector Extractor Training Phase

The X-vectors generated by an extractor trained on speaker verification datasets primarily provide information on speaker identity. However, it was shown that they can also serve as a source of information on the age, sex, language, and affective state of the speaker [33]. Therefore, the X-vector extractor was trained on the speaker-verification databases VoxCeleb [34], with 1250 speakers and 150,000 utterances, and VoxCeleb2 [35], with 6000 speakers and 1.1 million utterances. The volume of training data was further augmented using reverberation and noising [36]. The feature extraction module transforms sound into representative features: 30-dimensional Mel Frequency Cepstral Coefficients (MFCCs) with a frame length of 25 ms, mean-normalized over a sliding window of up to 3 s [29]. An energy-based Voice Activity Detector (VAD) was used to filter out silence frames. The result of the training is a DNN (the X-vector extractor model). In the X-vector extraction process, an MFCC feature matrix is fed to the input of this DNN, and an X-vector of size 512 is output.

#### 4.1.2. Regression Model Training Phase

The training and test sets for regression are organized in pairs: an X-vector representing a particular utterance and the corresponding value of the perceived Valence (for the Valence regressor) or Activation (for the Activation regressor). The Scikit-learn library was used for training the Support Vector Regressor (SVR) [37]. Default settings were used for the SVR.

The regression models trained in this phase are able to predict the value of Valence or Activation, respectively, from the input X-vector representing the incoming utterance.

Various types of regressors were tested: AdaBoost regressor, Random Forest regressor, Gradient Boosting regressor, Bagging regressor, Decision Tree regressor, K-neighbors regressor, and Multi-layer Perceptron regressor, but none of them gave consistently better results than Support Vector Regressor.
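As an illustration of the regression setup described above, the following sketch trains a Valence SVR with scikit-learn default hyperparameters on X-vectors. The file names and array shapes are assumptions for illustration, not the original experimental code.

```python
import numpy as np
from sklearn.svm import SVR

# Hypothetical inputs: one 512-dimensional X-vector per utterance and the
# corresponding mean annotator rating of Valence on the 1-5 scale.
X_train = np.load("xvectors_train.npy")      # shape: (n_utterances, 512)
y_valence = np.load("valence_train.npy")     # shape: (n_utterances,)

# Default settings, as stated above (RBF kernel, C=1.0, epsilon=0.1).
valence_regressor = SVR()
valence_regressor.fit(X_train, y_valence)

# An analogous regressor would be trained on the Activation ratings.
X_test = np.load("xvectors_test.npy")
predicted_valence = valence_regressor.predict(X_test)
```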

#### 4.1.3. Prediction Phase

In the prediction phase, the utterances from the pool of test databases undergo X-vector extraction and prediction of the Valence and Activation values. The result is a pair of values indicating the coordinates of each utterance in the AV space.

#### *4.2. Overall Architecture*

The overall architecture of the system is shown in Figure 2.

As we have shown in Section 4.1, the whole process has three phases. In the first phase, we trained the X-vector extractor (or X-vector model) on large speaker verification databases. In the second phase, we trained regressors for Valence and Activation on dimensionally annotated databases. In the third phase, the prediction of AV dimension values for the addressed emotion categories in the categorically annotated test databases was performed. In real-world application operation, the test databases in the prediction phase will be replaced by a speech signal audio input.

**Figure 2.** Schematic diagram of the system estimating the Activation and Valence values from speech utterances.

#### **5. Results**

#### *5.1. Visualization of Results*

The results are presented in the form of figures and tables. The figures show the position of utterances in the AV plane. The Seaborn statistical data visualization library [38] was used for visualization. Due to variability, the utterances belonging to one emotion in a certain database create clouds or clusters in the AV space. The center of gravity of each cluster is a centroid, marked with a small circle of the corresponding color. The clusters are depicted as clouds with contour lines representing iso-proportion levels. The graphs were plotted using the kdeplot function, with the lowest iso-proportion level at which to draw a contour line set to 0.3 [39].
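A minimal sketch of this visualization, using synthetic data in place of the real regressor outputs, might look as follows; the column names and random values are purely illustrative.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in for the predicted AV coordinates of the utterances.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "valence": rng.normal(3.0, 0.5, 400),
    "activation": rng.normal(3.0, 0.5, 400),
    "emotion": rng.choice(["Angry", "Happy", "Neutral", "Sad"], 400),
})

ax = plt.gca()
for emotion, color in [("Angry", "blue"), ("Happy", "red"),
                       ("Neutral", "green"), ("Sad", "orange")]:
    sub = df[df["emotion"] == emotion]
    # Contours start at the 0.3 iso-proportion level, as described above.
    sns.kdeplot(x=sub["valence"], y=sub["activation"], color=color,
                thresh=0.3, ax=ax)
    # Cluster centroid marked with a small circle of the same color.
    ax.plot(sub["valence"].mean(), sub["activation"].mean(), "o", color=color)

ax.set_xlabel("Valence")
ax.set_ylabel("Activation")
plt.show()
```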

#### *5.2. Ground Truth—Original AV Values Indicated by Annotators*

The original AV values indicated by annotators (perceptual Activation and perceptual Valence values) are considered in our work as ground truth. Figure 3 presents the emotions as they were rated in the original annotations. As the various corpora contain different sets of emotions, only the four emotions present in all databases were chosen for comparison: Angry, Happy, Neutral, and Sad.

The granularity of IEMOCAP data is caused by the fact that there were very few annotators. It can be seen that the layout of centroids of emotion clusters is similar for IEMOCAP and MSP-IMPROV. The graph for VaM original annotation is absent as VaM does not include annotation of emotion categories for vocal modality.

**Figure 3.** Clusters of emotions, as rated by annotators: (**a**) IEMOCAP-full (train + test); (**b**) MSP-IMPROV-full (train + test). Color code: blue—Angry; red—Happy; green—Neutral, orange—Sad.

#### *5.3. Regression Evaluation—AV Values Estimated on Combinations of the Test Sets*

Figure 4 presents clusters of emotions, estimated by the regressor trained on the mixture of the IEMOCAP-train and MSP-IMPROV-train sets (MIX2-train), and tested on the IEMOCAP-test and MSP-IMPROV-test sets.

**Figure 4.** Clusters of emotions, estimated by regressor trained on the mix of the IEMOCAP-train and MSP-IMPROV-train sets, and tested on (**a**) IEMOCAP-test and (**b**) MSP-IMPROV-test sets. The color code identifying the emotions in the figure is as follows: blue—Angry; red—Happy; green—Neutral, orange—Sad.

Comparing the figures, it can be seen how the knowledge from the annotated values in the training datasets (Figure 3) is reflected in the values predicted on the test set (Figure 4).

It can be seen that the distances between the centroids are considerably reduced. Either the scales are transformed, or the resolution, i.e., the ability to separate the emotions, was affected by the regression. This can be caused by the fact that the training set does not include samples representing the whole AV plane; for some values it has many realizations, while for others they are completely missing. The training set is neither representative nor balanced.

As it is not sufficient to validate the regressor just from the figures, the Concordance Correlation Coefficient (*CCC*) and Mean Absolute Error (*MAE*) were used as regression quality measures to compare the annotated and predicted values of Activation and Valence.

*CCC* is a correlation measure that was used for instance in the OMG-Emotion Challenge at the IEEE World Congress on Computational Intelligence in 2018 [39].

Let *N* be the number of testing samples, {*yi*} *N i*=1 be the true Valence (Arousal) levels, and {*y*ˆ*i*} *N i*=1 be the estimated Valence (Arousal) levels. Let *µ* and *σ* be the mean and standard deviation of {*yi*}, respectively; *µ*ˆ and *σ*ˆ be the mean and standard deviation of {*y*ˆ*i*}, respectively; and *ρ* be the Pearson correlation coefficient between {*yi*} and {*y*ˆ*i*}. Then, the *CCC* is computed as:

$$CCC = \frac{2\rho\sigma\hat{\sigma}}{\sigma^2 + \hat{\sigma}^2 + (\mu - \hat{\mu})^2} \tag{1}$$

*CCC* is still being used by many authors together with the traditional error measure *MAE*.

$$MAE = \frac{\sum_{i=1}^{n} |y_i - \hat{y}_i|}{n} \tag{2}$$

where $y_i$ is the true value, $\hat{y}_i$ is the predicted value, and *n* stands for the total number of data points. The results of further experiments, evaluating the regression quality with various training and test sets by means of *CCC* and *MAE*, are presented in Table 2.
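For reference, Equations (1) and (2) translate directly into a few lines of NumPy; this is a straightforward sketch, not the evaluation code used in the experiments.

```python
import numpy as np

def ccc(y_true, y_pred):
    """Concordance Correlation Coefficient, Equation (1)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mu, mu_hat = y_true.mean(), y_pred.mean()
    sigma, sigma_hat = y_true.std(), y_pred.std()
    rho = np.corrcoef(y_true, y_pred)[0, 1]
    return 2 * rho * sigma * sigma_hat / (sigma**2 + sigma_hat**2 + (mu - mu_hat)**2)

def mae(y_true, y_pred):
    """Mean Absolute Error, Equation (2)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))
```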

**Table 2.** Evaluation of regression quality by means of *CCC* and *MAE*. (Dim stands for Dimension, Val for Valence, and Act for Activation. MIX2 is the mixture of the IEMOCAP and MSP-IMPROV datasets, and MIX3 is the mixture of the IEMOCAP, MSP-IMPROV, and VaM datasets.)


The SVR trained on MIX2 gives slightly better results in general than the SVR trained on MIX3, i.e., on all three datasets. This may indicate that the vocal manifestation of emotions in VaM is less pronounced and less prototypical; its data and annotation may differ more from those of the other two databases. Moreover, VaM is in German, whereas IEMOCAP and MSP-IMPROV contain English speech.

The results also show that the model obtained by training on a mixture of databases is more universal and achieves better results on the mixed test set. In some cases, it also achieves better results for individual databases than a model trained on their own training set.

Both *CCC* and *MAE* show that the quality of prediction is better for Activation than for Valence, which is in line with the observation of Oflazoglu and Yildirim [8].

#### *5.4. Cross-Corpus Experiments, AV Values Estimated by Regression on "Unseen" Corpora*

In these experiments, the utterances from the categorically annotated emotional speech corpora are input to the AV predictor. The result is represented by predicted values of Activation and Valence for each utterance.

Cross-corpus emotion recognition has been addressed by many works, but most of them focus on a categorical approach or they try to identify to which quadrant of the AV space the utterance belongs (see e.g., [40]). Our approach tries to predict continuous values of the AV dimensions. Figure 5 presents clusters of emotions, estimated by the regressor trained on MIX2 and tested on different unseen emotional corpora. Experiments were also performed with MIX3, but the regressor using MIX2 performed better (Table 3).



**Figure 5.** Clusters of emotions, estimated by regressor trained on MIX2 training set and tested on: (**a**) EmoDB; (**b**) EMOVO; (**c**) CREMA-D; (**d**) RAVDESS; (**e**) eNTERFACE; (**f**) SAVEE; (**g**) VESUS; (**h**) JL Corpus; (**i**) TESS and (**j**) GEES. The color code identifying the emotions in the figure is as follows: blue—Angry; red—Happy; green—Neutral, orange—Sad.

**Table 3.** Evaluation of the regression quality using distances between centroids.



Based on the figures, it is now possible to try to interpret the results obtained by the regressor on the corpora with annotated emotion categories:

The results of the EmoDB database confirm the observation that it contains strongly prototypical emotions [41]. The overlap of emotion clusters is smaller compared to other corpora. The clusters are significantly more differentiated, especially on the axis of Activation, which suggests that the actors performed full-blown emotions with a large range of arousal.

The results of the EMOVO database suggest, as also observed in other databases, that Valence for Sad does not reach values as low as expected. The Sad cluster is located even further towards higher values on the Valence axis than the Neutral emotion cluster. According to the predicted AV values, the sound realization of Sad utterances seems to be hardly distinguishable from that of Neutral ones in this database. It can be speculated that one of the possible sources of variance may be inter-cultural difference, as the regressor was trained on English databases and EMOVO is Italian, but this possibility would need more extensive research.

The CREMA-D, RAVDESS, eNTERFACE, and JL Corpus databases give roughly the expected results (see Section 2.2), although the cluster differentiation is relatively small. The centroids of Sad in CREMA-D and JL Corpus have a similar position on the Valence axis to that of Neutral. The eNTERFACE database does not contain the Neutral emotion; therefore, the other three emotions cannot be compared to it.

Although the differentiation of clusters is not pronounced for the SAVEE database, it basically meets the expected trends. The exception is again the Sad emotion, which has a higher mean value of Activation than one might expect and approximately the same mean value of Valence as the Neutral emotion.

The Canadian TESS database has the mutual placement of Angry, Happy, and Neutral emotions fully in line with the hypothesis. However, the centroid of the Sad cluster again achieves a higher value of Activation and Valence than expected.

GEES is a Serbian database intended for speech synthesis, which means that the prototypical emotions are presented very clearly and with high intensity. Therefore, the emotion centroids are placed at the expected positions. It is no surprise that these positions are practically identical to those of other highly prototypical databases, such as the German EmoDB.

#### *5.5. Centroid Distance as a Measure of Regression Quality*

In the following experiment, the distance of centroids of Angry and Happy emotion clusters on the Valence axis (for Valence regression) and the distance of centroids of Angry and Sad emotion clusters on the Activation axis (for Activation regression) were taken for an ad hoc objective measure of the ability of the regressor to differentiate between emotions. The evaluation of the regression quality using distances between centroids is presented in Table 3.
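A sketch of this ad hoc measure, assuming the predicted values and the categorical emotion labels are available as arrays, could be:

```python
import numpy as np

def centroid_distance(values, labels, emotion_a, emotion_b):
    """Distance between the centroids of two emotion clusters along one AV axis."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    return abs(values[labels == emotion_a].mean()
               - values[labels == emotion_b].mean())

# Valence resolution: Angry vs. Happy; Activation resolution: Angry vs. Sad.
# valence_res = centroid_distance(pred_valence, emotions, "Angry", "Happy")
# activation_res = centroid_distance(pred_activation, emotions, "Angry", "Sad")
```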

The two regressors have similar results, but in 15 of 20 cases the one trained on MIX2 (without VaM) has better resolution, and in two cases the results were the same. So, the conclusion could be that adding VaM data to the training set does not improve the universality of regression models and slightly degrades the performance of the regressors.

As was said in Section 3.1, due to the small amount of data in the corpora, we allocated only 10% of the data for regression quality testing. To evaluate the possible impact of test data selection, we performed a 10-fold regression test on the "winning" mixture MIX2. The results of the individual folds showed only negligible differences, with very low standard deviations both for Valence and Activation (see Table 4), and confirmed that 10% of the data is in this case a sufficiently representative sample for testing.



**Table 4.** Results of the 10-fold regression on MIX2.

#### *5.6. Overall Picture of Emotion Positions in the AV Space*

We displayed the emotion centroids for each database in one figure to assess whether the same emotion category from different databases has a similar location in the AV space, and whether that location corresponds to the hypothesized positions (Figure 6).

**Figure 6.** Centroids of the emotions contained in the 10 testing databases, obtained by regression (each centroid belongs to the particular emotion in one database). The numeric code identifying the databases in the figure is as follows: 1 CREMA-D, 2 EMO-DB, 3 EMOVO, 4 eNTERFACE, 5 JL Corpus, 6 RAVDESS, 7 SAVEE, 8 VESUS, 9 TESS, 10 GEES.

Centroids of the Angry, Happy, and Neutral emotion clusters form well-distinguishable groups located in the AV space in an expected manner. This fact confirms that the system can evaluate the position of the perceived emotion in the AV space from the sound of utterances.

However, the group of the Sad emotion shows considerable variance and largely overlaps with the Neutral emotion. Sad utterances from some of the databases also achieve higher Valence values than expected.


#### **6. Discussion and Conclusions**

Due to the small volume and small number of training databases, the "ground truth" data are very sparse and unreliable. They cover only a small fraction of the variety of possible manifestations of emotions in speech. Moreover, the training data are not available for all parts of the AV plane, and the frequencies of occurrence of training samples representing different points of the AV space are far from balanced. A substantial part of the data belongs to less intensely expressed emotions, which hardly differ from neutral speech. Examples of intense emotions, with extremely low or high Valence and Activation values, are rare. This also leads to a certain narrowing of the range of predicted AV values, which is well observable when comparing the positions of the emotional category centroids from annotator ratings in Figure 3 with the positions of the respective centroids estimated by the regressor in Figure 4.

It is not possible to make general statements about the absolute position of individual emotions in the AV space, but it is reasonable to evaluate their relative position.

From the results obtained by the proposed system, it can be seen that in general Anger has higher Activation and lower Valence, and Happy has higher Activation and higher Valence, than the Neutral emotion. Valence predicted by the proposed system for Sad utterances does not reach such low values as could be expected with respect to the values in original annotations (Figure 3) of the training databases and with respect to Russell's circumplex model. A valuable observation is that, despite the fact that the training data were in English, the emotions from the German, Serbian, and Italian databases were also placed in accordance with the hypothesis.

Due to the variety of sources of uncertainty in speech data and non-specificity of vocal cues of emotion, the clusters of emotions acquired by regression are close to each other and they overlap considerably. However, centroids of corresponding emotion clusters from various unseen databases form observable groups, which are well separable for Angry–Happy–Sad and Angry–Happy–Neutral triplets of emotions. The locations of these groups in the AV space correspond to hypothesized expectations for Angry, Happy, and Neutral emotions.

Some models (e.g., LSTM model as presented by Parry et al. [42] in Figure 2a of their paper) seem to be more successful in determining the affiliation of utterances to individual databases than in identifying emotions. This only confirms the fact that the utterances reflect various technical and methodological aspects of the design of databases, cultural and linguistic differences, and the like. It is therefore difficult to identify emotions from acoustic characteristics of voice. However, we have proven in our experiments that measurement of coordinates of speech utterances in emotional space is in principle feasible, but the resolution and the ability to differentiate various emotions is better for high-activity emotions (Angry–Happy), than for low-activity ones (Sad–Neutral). This may be caused by technical aspects of the solution, but also by the lack of reliable training data, inconsistencies in annotation, diversity of inner psychological interpretation of emotional categories, cultural and linguistic differences, and differences in methodology. At the same time, however, it is highly probable that the sound of speech expressing low-activity emotions contains much less marked distinctive features and is very similar to neutral speech.

In the meantime, the authors have obtained access to an additional dimensionally annotated database, the OMG-Emotion Behavior Dataset [39], so one of the future steps will be analyzing, processing, and incorporating this dataset into the training database pool. Other areas of possible improvement are: fine-tuning the X-vector extractor for the emotion recognition task, experimenting with combinations of different analysis timeframes, experimenting with various representative features, as well as experiments with new machine learning algorithms and architectures of regressors. Normalization of the axis scales and finding the position of the origin (center) of the AV space also need to be implemented.

The research on the measurement of AV dimensions from speech sound is in its infancy; the predicted values have high variance, and the ranges and units of the dimension axes are not well defined. However, with new databases, an increasing volume of training data, more precise and representative annotation, and improved regression techniques, it will certainly be possible to achieve significantly higher accuracy and better applicability of AV dimension estimation. Such a system could be used in practical applications in call centers, avatars, robots, information-providing systems, security applications, and many more.

The designed regressor is currently utilized for Valence prediction in a stress detector from speech in the Air Traffic Management security tools developed in the European project SATIE (Horizon 2020, No. 832969), and in a depression detection module developed in the Slovak VEGA project No. 2/0165/21.

**Author Contributions:** Conceptualization, M.R. (Milan Rusko) and T.H.S.-K.; methodology, M.R. (Milan Rusko); software, M.T., M.R. (Marian Ritomský) and S.D.; validation, R.S. and M.S.; formal analysis, T.H.S.-K.; investigation, M.T., S.D. and M.R. (Milan Rusko); resources, R.S. and M.S.; data curation, R.S. and M.S.; writing—original draft preparation, M.R. (Milan Rusko); writing—review and editing, M.R. (Milan Rusko) and T.H.S.-K.; visualization, M.T., S.D., and M.R. (Marian Ritomský); supervision, M.R. (Milan Rusko); project administration, M.R. (Milan Rusko) and T.H.S.-K.; funding acquisition, T.H.S.-K. All authors have read and agreed to the published version of the manuscript.

**Funding:** This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 832969. This output reflects the views only of the authors, and the European Union cannot be held responsible for any use which may be made of the information contained therein. For more information on the project, see: http://satie-h2020.eu/. The work was also funded by the Slovak Scientific Grant Agency VEGA, project No. 2/0165/21.

**Data Availability Statement:** Only publicly available databases VoxCeleb, Voxceleb2, IEMOCAP, MSP IMPROV, VaM, EmoDB, EMOVO, CREMA-D, RAVDESS, eNTERFACE, SAVEE, VESUS, JL Corpus, TESS, and GEES were used in this research.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


*Article*

## **Performance Evaluation of Offline Speech Recognition on Edge Devices**

### **Santosh Gondi \* and Vineel Pratap**


**Abstract:** Deep learning–based speech recognition applications have made great strides in the past decade. Deep learning–based systems have evolved to achieve higher accuracy while using simpler end-to-end architectures, compared to their predecessor hybrid architectures. Most of these state-of-the-art systems run on backend servers with large amounts of memory and CPU/GPU resources. The major disadvantage of server-based speech recognition is the lack of privacy and security for user speech data. Additionally, because of network dependency, this server-based architecture cannot always be reliable, performant and available. Nevertheless, offline speech recognition on client devices overcomes these issues. However, resource constraints on smaller edge devices may pose challenges for achieving state-of-the-art speech recognition results. In this paper, we evaluate the performance and efficiency of transformer-based speech recognition systems on edge devices. We evaluate inference performance on two popular edge devices, Raspberry Pi and Nvidia Jetson Nano, running on CPU and GPU, respectively. We conclude that with PyTorch mobile optimization and quantization, the models can achieve real-time inference on the Raspberry Pi CPU with a small degradation to word error rate. On the Jetson Nano GPU, the inference latency is three to five times better, compared to Raspberry Pi. The word error rate on the edge is still higher, but it is not too far behind, compared to that on the server inference.

**Keywords:** ASR; speech-to-text; edge AI; Wav2Vec; transformers; PyTorch

#### **1. Introduction**

Automatic speech recognition (ASR) is a process of converting speech signals to text. It has a large number of real-world use cases, such as dictation, accessibility, voice assistants, AR/VR applications, captioning of videos, podcasts, searching audio recordings, and automated answering services, to name a few. On-device ASR makes more sense for many use cases where an internet connection is not available or cannot be used. Private and always-available on-device speech recognition can unblock many such applications in healthcare, automotive, legal and military fields, such as taking patient diagnosis notes, in-car voice command to initiate phone calls, real-time speech writing, etc.

Deep learning–based speech recognition has made great strides in the past decade [1]. It is a subfield of machine learning which essentially mimics the neural network structure of the human brain for pattern matching and classification. It typically consists of an input layer, an output layer and one or more hidden layers. The learning algorithm adjusts the weights between different layers, using gradient descent and backpropagation until the required accuracy is met [1,2]. The major reason for its popularity is that it does not need feature engineering. It autonomously extracts the features based on the patterns in the training dataset. The dramatic progress of deep learning in the past decade can be attributed to three main factors [3]: (1) large amounts of transcribed data sets; (2) rapid increase in GPU processing power; and (3) improvements in machine learning algorithms and architectures. Computer vision, object detection, speech recognition and other similar fields have advanced rapidly because of the progress of deep learning.

**Citation:** Gondi, S.; Pratap, V. Performance Evaluation of Offline Speech Recognition on Edge Devices. *Electronics* **2021**, *10*, 2697. https:// doi.org/10.3390/electronics10212697

Academic Editors: Matúš Pleva, Yuan-Fu Liao and Patrick Bours

Received: 23 September 2021 Accepted: 1 November 2021 Published: 4 November 2021


The majority of speech recognition systems run in backend servers. Since audio data need to be sent to the server for transcription, the privacy and security of the speech cannot be guaranteed. Additionally, because of the reliance on a network connection, the server-based ASR solution cannot always be reliable, fast and available.

On the other hand, on-device-based speech recognition inherently provides privacy and security for the user speech data. It is always available and improves the reliability and latency of the speech recognition by precluding the need for network connectivity [4]. Other non-obvious benefits of edge inference are energy and battery conservation for on-the-go products by avoiding Bluetooth/Wi-Fi/LTE connection establishments for data transfers.

Inferencing on the edge can be achieved either by running computations on the CPU or on hardware accelerators, such as a GPU, a DSP, or dedicated neural processing engines. The benefits of and demand for on-device ML are driving modern phones to include dedicated neural engines or tensor processing units. For example, Apple iOS 15 will support on-device speech recognition for iPhones with the Apple neural engine [5]. The Google Pixel 6 phone comes equipped with a tensor processing unit to handle on-device ML, including speech recognition [6]. Though dedicated neural hardware might become a general trend in the future, at least in the short term, a large majority of IoT, mobile or wearable devices will not have this dedicated hardware for on-device ML. Hence, training the models on the backend and then pre-optimizing them for CPU- or general-purpose GPU-based edge inferencing is a practical near-term solution for on-edge inference [4].

In this paper, we evaluate the performance of ASR on Raspberry Pi and Nvidia Jetson Nano. Since the CPU, GPU and memory specifications of these two devices are similar to those of typical edge devices, such as smart speakers, smart displays, etc., the evaluation outcomes in this paper should be similar to the results on a typical edge device. Related to our work, large vocabulary continuous speech recognition was previously evaluated on an embedded device, using CMU SPHINX-II [7]. In [8], the authors evaluated the on-device speech recognition performance with DeepSpeech [9], Kaldi [10] and Wav2Letter [11] models. Moreover, most on-the-edge evaluation papers focus on computer vision tasks, using CNNs [12,13]. To the best of our knowledge, there have been no evaluations done for any type of transformer-based speech recognition models on low-power edge devices, using both CPU- and GPU-based inferencing. The major contributions of this paper are as follows:


The rest of the paper is organized as follows: In the background section, we discuss ASR and transformers. In the experimental setup, we go through the steps for preparing the models and setting up both the devices for inferencing. We highlight some of the challenges we faced while setting up the devices. We go over the accuracy, performance and efficiency metrics in the results section. Finally, we conclude with the summary and outlook.

#### **2. Background**

ASR is the process of converting audio signals to text. In simple terms, the audio signal is divided into frames and passed through fast Fourier transform to generate feature vectors. This goes through an acoustic model to output the probability distribution of phonemes. Then, a decoder with a lexicon, vocabulary and language model is used to generate the word *n*-grams distributions. The hidden Markov model (HMM) [14] with a Gaussian mixture model (GMM) [15] was considered a mainstream ASR algorithm until a decade ago. Conventionally, the featurizer, acoustic modeling, pronunciation modeling, and decoding all were built separately and composed together to create an ASR system. Hybrid HMM–DNN approaches replaced GMM with deep neural networks with significant

performance gains [16]. Further advances used CNN- [17,18] and RNN-based [19] models to replace some or all components in the hybrid DNN [1,2] architecture. Over time, ASR model architectures have evolved to convert audio signals to text directly; these are called sequence-to-sequence models. These architectures have simplified the training and implementation of ASR models. The most successful end-to-end ASR systems are based on connectionist temporal classification (CTC) [20], the recurrent neural network (RNN) transducer (RNN-T) [19], and attention-based encoder–decoder architectures [21].

Transformer is a sequence-to-sequence architecture originally proposed for machine translation [22]. When used for ASR, the input of the transformer is audio frames instead of the text input used in the translation use case. Transformer uses multi-head attention and positional embeddings. It learns sequential information through a self-attention mechanism instead of the recurrent connections used in RNNs. Since their introduction, transformers have increasingly become the model of choice for NLP problems. Powerful natural language processing (NLP) models, such as GPT-3 [23], BERT [24], and AlphaFold 2 [25], which is the model that predicts the structures of proteins from their genetic sequences, are all based on the transformer architecture. The major advantages of transformers over RNN/LSTM [26] are that they process the whole sequence at once, enabling parallel computation and hence reducing the training time. They also do not suffer from long dependency issues; hence, they are more accurate. Since the transformer processes the whole sequence at once, it is not directly suitable for streaming-based applications, such as continuous dictation. In addition, its decoding complexity is quadratic over the input sequence length because the attention is computed pairwise for each input. In this paper, we focus on the general viability and computational cost of transformer-based ASR on audio files. In future, we plan to explore streaming-supported transformer architectures on the edge.

#### *2.1. Wav2Vec 2.0 Model*

Wav2Vec 2.0 is a transformer-based speech recognition model trained using a self-supervised method with contrastive training [27]. The raw audio is encoded using a multilayer convolutional network, the output of which is fed to the transformer network to build latent speech representations. Some of the input representations are masked during training. The model is then fine-tuned with a small set of labeled data, using the connectionist temporal classification (CTC) [20] loss function. The great advantage of Wav2Vec 2.0 is the ability to learn from unlabeled data, which is tremendously useful in training speech recognition for languages with very limited labeled audio. For the remaining part of this paper, we refer to the Wav2Vec 2.0 model as Wav2Vec to reduce verbosity. In our evaluation, we use a pre-trained base Wav2Vec model, which was trained on 960 hr of unlabeled LibriSpeech audio. We evaluate a 100 hr and a 960 hr fine-tuned model.

Figure 1 shows the simplified flow of the ASR process with this model.

**Figure 1.** Wav2Vec2 inference.

#### *2.2. Speech2Text Model*

The Speech2Text model is a transformer-based speech recognition model trained using the supervised method [28]. The transformer architecture is based on [22]. In addition, it has an input subsampler. The purpose of the subsampler is to downsample the audio sequence to match the input dimensions of the transformer encoder. The model is trained with the LibriSpeech 960 hr labeled training data set. Unlike Wav2Vec, which takes raw audio samples as input, this model accepts 80-channel log Mel filter bank extracted features with a 25 ms window size and 10 ms shift. Additionally, utterance-level cepstral mean and variance normalization (CMVN) [29] is applied on the input frames before feeding them to the subsampler. The decoder uses a 10,000 unigram vocabulary.

Figure 2 shows the simplified flow of the ASR process with this model.

**Figure 2.** Speech2Text inference.

#### **3. Experimental Setup**

#### *3.1. Model Preparation*

We use PyTorch models for evaluation. PyTorch is an open-source machine learning framework based on the Torch library. Figure 3 shows the steps for preparing the models for inferencing on edge devices.

**Figure 3.** Model preparation steps.

We first go through a few of the PyTorch tools and APIs used in our evaluation.

#### 3.1.1. TorchScript

TorchScript is the means by which PyTorch models can be optimized, serialized and saved in intermediate representation (IR) format. *torch.jit* (https://pytorch.org/docs/ stable/jit.html (accessed on 30 October 2021)) APIs are used for converting, saving and loading PyTorch models as ScriptModules. TorchScript itself is a subset of the Python language. As a result, sometimes, a model written in Python needs to be simplified to convert it into a script module. The TorchScript module can be created either using tracing or scripting methods. Tracing works by executing the model with sample inputs and capturing all computations, whereas scripting performs static inspection to go through the model recursively. The advantage of scripting over tracing is that it correctly handles the loops and control statements in the module. A saved script module can then be loaded either in a Python or C++ environment for inferencing purposes. For our evaluation, we generated ScriptModules for both Speech2Text and Wav2Vec models after applying any valid optimizations for specific devices.
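The following sketch illustrates the scripting, saving and loading workflow with a toy module in place of the real ASR models; it is not the actual export code used in this work.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Toy stand-in for an ASR model, used only to show the TorchScript flow."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(80, 32)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.linear(feats))

model = TinyEncoder().eval()

# Scripting statically inspects the module and preserves control flow;
# tracing (torch.jit.trace) would instead record one concrete execution.
scripted = torch.jit.script(model)
torch.jit.save(scripted, "tiny_encoder.pt")

# On the edge device, the serialized ScriptModule is loaded for inference.
loaded = torch.jit.load("tiny_encoder.pt", map_location="cpu")
output = loaded(torch.randn(1, 100, 80))
```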

#### 3.1.2. PyTorch Mobile Optimizations

PyTorch provides a set of APIs for optimizing the models for mobile platforms. It uses module fusing, operator fusing, and quantization among other things to optimize the models. We apply dynamic quantization for models used in this experiment. During this quantization, the scale factors are determined for activations dynamically based on the data range observed at runtime. By quantization, a neural network is converted to use a reduced precision integer representation for the weights and/or activations. This saves on model size and allows the use of higher throughput math operations on CPU or GPU.
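A minimal sketch of these steps, assuming dynamic quantization of the Linear layers of a toy module, is shown below; the module and file names are illustrative.

```python
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile

# Toy module standing in for the real ASR model.
model = torch.nn.Sequential(torch.nn.Linear(80, 32), torch.nn.ReLU()).eval()

# Dynamic quantization: Linear weights are stored as int8, and activation
# scale factors are determined at runtime from the observed data range.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

scripted = torch.jit.script(quantized)

# Module/operator fusing and other mobile-specific rewrites.
optimized = optimize_for_mobile(scripted)
optimized._save_for_lite_interpreter("asr_model_mobile.ptl")

# On ARM devices the quantized backend is selected before inference:
# torch.backends.quantized.engine = "qnnpack"
```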

#### 3.1.3. Models

We evaluated the Speech2Text and Wav2Vec transformer-based models on Raspberry Pi and Nvidia Jetson Nano. Inference on Raspberry Pi happens on CPU, while on Jetson Nano, it happens on GPU, using CUDA APIs. Given the limited RAM, CPU, and storage on these devices, we make use of Google Colab for importing, optimizing and saving the model as a TorchScript module. The saved modules are copied to Raspberry Pi and Jetson Nano for inferencing. On Raspberry Pi, which uses CPU-based inference, we evaluate both quantized and unquantized models. On Jetson Nano, we only evaluate unquantized models since CUDA only supports floating point operations.

#### Speech2Text Model

The Speech2Text pre-trained model is imported from *fairseq* (https://github.com/ pytorch/fairseq/tree/master/examples/speech\_to\_text (accessed on 30 October 2021)). Fairseq is a sequence modeling toolkit that allows researchers and developers to train custom models for speech and text tasks. We needed to make minor syntactical changes, such as Python type hints, to export the generator model as a TorchScript module. We have used *s2t\_transformer\_s* small architecture for this evaluation. The decoding uses a beam search decoder with a beam size of 5 and a SentencePiece tokenizer.

#### Wav2Vec Model

Wav2Vec pre-trained models are imported from *huggingface* (https://huggingface. co/transformers/model\_doc/wav2vec2.html (accessed on 30 October 2021)) using the *Wav2Vec2ForCTC* interface. We have used *Wav2Vec2CTCTokenizer* to decode the output indexes into transcribed text.
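A minimal inference sketch with the Hugging Face interfaces named above might look as follows; the checkpoint name and audio file are illustrative, and the Processor wrapper is used here instead of calling the tokenizer directly.

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Illustrative checkpoint fine-tuned on 960 hr of LibriSpeech.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").eval()

# The model expects 16 kHz mono audio.
waveform, sample_rate = torchaudio.load("sample.flac")
inputs = processor(waveform.squeeze(0).numpy(), sampling_rate=16000,
                   return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding of the most likely token at every frame.
pred_ids = torch.argmax(logits, dim=-1)
transcript = processor.batch_decode(pred_ids)[0]
print(transcript)
```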

#### *3.2. Raspberry Pi Setup*

Raspberry Pi 4 B is used in this evaluation. The device specs are provided in Table 1. The default Raspberry Pi OS is 32 bit, which is not compatible with PyTorch. Hence, we installed a 64 bit OS.


**Table 1.** Raspberry Pi 4 B specs.

The main Python package required for inferencing is *PyTorch*. The default prebuilt wheel files of this package are mainly for Intel architecture, which depend on *Intel-MKL* (math kernel library) for math routines on CPU. The ARM-based architectures cannot use Intel MKL. They instead have to use *QNNPACK/XNNPACK* backend with other BLAS (basic linear algebra subprograms) libraries. QNNPACK (https://github.com/pytorch/ QNNPACK (accessed on 30 October 2021)) (quantized neural networks package) is a mobile-optimized library for low-precision, high-performance neural network inference. Similarly, XNNPACK (https://github.com/google/XNNPACK (accessed on 30 October 2021)) is a mobile-optimized library for higher precision neural network inference. We built and installed the torch wheel file on Raspberry Pi from source with XNNPACK and QNNPACK cmake configs. We needed to set the device backend to QNNPACK during inference as *torch.backends.quantized.engine='qnnpack'*. Note that with the latest PyTorch release 1.9.0, the wheel files are available for ARM 64-bit architectures. Hence, there is no need to build *torch* from source anymore.

The lessons learnt during setup are as follows:

• Speech2Text transformer models expect Mel-frequency cepstral coefficients [30] as input features. However, we could not use *Torchaudio*, *PyKaldi*, *librosa* or *python\_speech\_features* libraries for this because of dependency issues. *Torchaudio* has a dependency on Intel MKL. Building *PyKaldi* on device was not feasible because of memory limitations. The *librosa* and *python\_speech\_features* packages produced different outputs for MFCC, which were unsuitable for PyTorch models. Therefore, the MFCC features for the LibriSpeech data set were pre-generated, using *fairseq audio\_utils* (https://github.com/pytorch/fairseq/blob/master/fairseq/data/audio/audio\_utils.py (accessed on 30 October 2021)) on the server, and saved as NumPy files. These NumPy files were used as model input after applying CMVN transforms.


#### *3.3. Nvidia Jetson Nano Setup*

We configured Jetson Nano using the instructions on the Nvidia website. The Nano flash file comes with JetPack pre-installed, which includes all the CUDA libraries required for inferencing on GPU. The full specs of the device are provided in Table 2.

**Table 2.** Jetson Nano specs.


For Nano, we needed to build *torch* from source with CUDA cmake option. Further, an upgrade was needed to Clang and LLVM compiler toolchain to use Clang for compiling PyTorch.

The lessons learnt during setup are as follows:


#### *3.4. Evaluation Methodology*

This section explains the methodologies used for collecting and presenting the metrics in this paper. The LibriSpeech [31] test and dev datasets were used to evaluate ASR performance on both Raspberry Pi and Jetson Nano. The test and dev datasets together contain 21 hr of audio. To save time, for these experiments we randomly sampled 300 (∼10%) of the audio files in each of the four data sets for inference. The same set for each configuration was used so that the results would be comparable. Typically, ML practitioners only report the WER metric for server-based ASR. So, we did not have a server side reference for latency and efficiency metrics, such as memory, CPU or load times. Unlike backend servers, the edge devices are constrained in terms of memory, CPU, disk and energy. To achieve on-device ML, the inferencing needs to be efficient enough to fit within the device's resource budgets. Hence, we measured these efficiency metrics along with the accuracy to assess the plausibility of meeting these budgets on typical edge devices.

#### 3.4.1. Accuracy

Accuracy is measured using word error rate (WER), a standard metric for speech-to-text tasks. It is defined as in Equation (1):

$$WER = (S + I + D) / N \tag{1}$$

where *S* is the number of substitutions, *D* is the number of deletions, *I* is the number of insertions and *N* is the number of words in the reference.

WER for a dataset is computed as the total number of errors over the total number of reference words in the dataset. We compare the on-device WER on Raspberry Pi and Jetson Nano with the on-server-based WER as reported in Speech2Text [28] and Wav2Vec [27] papers. In both papers, the WER for all models was computed on LibriSpeech test and dev data sets with GPU in standalone mode. On server, the Speech2Text model used a beam size of 5 and vocabulary of 10,000 words for decoding, whereas the Wav2Vec model used a transformer-based language model for decoding. The pre-trained models used in this experiment have the same configuration as that of the server models.
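As a reference implementation of Equation (1) at the dataset level, the following sketch computes word-level edit distance and aggregates errors over all utterances; it is a plain Python illustration rather than the scoring code used in the experiments.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance: substitutions + insertions + deletions."""
    d = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        d[i][0] = i
    for j in range(len(hyp_words) + 1):
        d[0][j] = j
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1]

def dataset_wer(references, hypotheses):
    """Total errors over total reference words, aggregated over the dataset."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    words = sum(len(r.split()) for r in references)
    return errors / words

print(dataset_wer(["the cat sat on the mat"], ["the cat sat on mat"]))  # ≈ 0.167
```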

#### 3.4.2. Latency

The latency of ASR is measured using the real time factor (RTF). It is defined in Equation (2). In simple terms, with an RTF of 0.5, two seconds of audio will be transcribed by the system in one second.

$$\text{RTF} = (\text{read time} + \text{inference time} + \text{decoding time}) / \text{total utterance duration} \tag{2}$$

We compute the avg, mean, pctl 75 and pctl 90 RTF over all the audio samples in each data set. We also used PyTorch profiler to visualize the CPU usage of various operators and functions inside the models.

#### 3.4.3. Efficiency

We measure the CPU load and memory footprint during the entire data set evaluation, using the Linux *top* command. The top command is executed in the background every two minutes in order to avoid side effects on the main inference script.

The model load time is measured by collecting the *torch.jit.load* API latency to load the scripted model. We separately measured the load time by running 10 iterations and took an average. We ensured that the load time measurements were from a clean state, i.e., from the system boot, to discount any caching in the Linux OS layer for subsequent model loads.
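A simplified sketch of the load-time measurement is shown below; unlike the procedure described above, a single-process loop does not discount OS-level file caching, so in practice each iteration would be run from a clean boot. The model path is illustrative.

```python
import time
import torch

MODEL_PATH = "asr_model.pt"  # hypothetical serialized TorchScript module

load_times = []
for _ in range(10):
    start = time.perf_counter()
    model = torch.jit.load(MODEL_PATH, map_location="cpu")
    load_times.append(time.perf_counter() - start)

print(f"avg load time: {sum(load_times) / len(load_times):.3f} s")
```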

#### **4. Results**

In this section, we present the accuracy, performance and efficiency metrics for Speech2Text and Wav2Vec model inference.

#### *4.1. WER*

Tables 3 and 4 show the WER on Raspberry Pi and Jetson Nano, respectively.


**Table 3.** WER on Raspberry Pi.

The WER is slightly higher for the quantized models, compared to the unquantized ones, by an avg of ∼0.5%. This is a small trade off in accuracy for better RTF and efficient inference. The *test-other* and *dev-other* data sets have a higher WER, compared to the *test-clean* and *dev-clean* data sets. This is expected because the *other* datasets are noisier, compared to the *clean* ones.

The WER on device for unquantized models is generally higher than what is reported on the server. We need to investigate further to understand this discrepancy. One plausible reason could be the smaller sampled dataset used in our evaluation, whereas the server WER is calculated over the entire dataset.


**Table 4.** WER on Jetson Nano.

WER for the Wav2Vec case is higher because of batching of the input samples at the 64 K (4 s audio) boundary. If a sample duration is longer than 4 s, we divide it into two batches. See Section 3.3 for the reasoning. So, words at the boundary of 4 s can be misinterpreted. We plan to investigate this batching problem in future. We report the WER figures here for the purpose of completeness.

#### *4.2. RTF*

In our experiments, RTF is dominated by the *model inference time* (>99%), compared to the other two factors in Equation (2). Tables 5 and 6 show the RTF for Raspberry Pi and Jetson Nano, respectively. RTF does not vary between different data sets for the same models. Hence, we show the RTF (avg, mean, pctl 75 and pctl 90) per model instead of one per data set.


**Table 5.** RTF of Raspberry Pi.

RTF is improved by ∼10% for quantized models, compared to unquantized floating point models. This is because the CPU has to load less memory and can run tensor computations more efficiently in int8 than in floating point. The inferencing of the Speech2Text model is three times faster than that of the Wav2Vec model. This can be explained by the fact that Wav2Vec has three times more parameters than the Speech2Text model (refer to Table 7). There is no noticeable difference in RTF between the 100 hr and 960 hr fine-tuned Wav2Vec models because the number of parameters does not change between them.

**Table 6.** RTF on Jetson Nano.


**Table 7.** Model size.


RTF on Jetson Nano is three times better for the Speech2Text model and five times better for the Wav2Vec model, compared to Raspberry Pi. Nano is able to make use of a large number of CUDA cores for tensor computations. We do not evaluate quantized models on Nano because CUDA only supports floating point computations.

Wav2Vec RTF on Raspberry Pi is close to real time, whereas in every other case, the RTF is far below 1. This implies that on-device ASR can be used for real-time dictation, accessibility, voice based app navigation, translation and other such tasks without much latency.

#### *4.3. Efficiency*

For both CPU and memory measurements over time, we use the Linux *top* command. The command is executed in loop every 2 min in order to not affect the main processing.

#### 4.3.1. CPU Load

Figures 4 and 5 show the CPU load of all model inferences on Raspberry Pi and Jetson Nano, respectively. The CPU load in Nano for both the Speech2Text and Wav2Vec models is ∼85% in steady state. It mostly uses one of the four cores during operation. Most of the CPU processing on Nano is for copying the input to memory for GPU processing and also copying back the output. On Raspberry Pi, the CPU load is ∼380%. Since all the tensor computations happen on CPU, all CPU cores are utilized fully during model inference. On Nano, the initial few minutes are spent loading and benchmarking the model. That is why the CPU is not busy during the initial few minutes.

**Figure 4.** CPU load on Raspberry Pi.

**Figure 5.** CPU load on Jetson Nano.

#### 4.3.2. Memory Footprint

Figures 6 and 7 show the memory of all model inferences on Raspberry Pi and Jetson Nano, respectively. The memory values presented here are *RES (resident set size)* values from top command. On Raspberry Pi, the quantized Wav2Vec model consumes ∼50% less memory (from 1 GB to 560 MB), compared to the unquantized model. Similarly, the Speech2Text model consumes ∼40% less memory (from 480 MB to 320 MB), compared to the unquantized model. On Nano, memory consumption for the Speech2Text model is ∼1 GB, and the Wav2Vec model is ∼500 MB. On Nano, the same memory is shared between GPU and CPU.

**Figure 6.** Memory footprint on Raspberry Pi.

**Figure 7.** Memory footprint on Jetson Nano.

#### 4.3.3. Model Load Time

Table 8 shows the model load times on Raspberry Pi and Jetson Nano. A load time of 1–2 s on Raspberry Pi seems reasonable for any practical application where the model is loaded once and then serves inference requests multiple times. The load time on Nano is 15–20 times longer than on Raspberry Pi. Nano *cuDNN* has to allot a certain amount of cache for loading the model, which takes time.

**Table 8.** Model load times.


#### *4.4. PyTorch Profiler*

PyTorch profiler (https://pytorch.org/tutorials/recipes/recipes/profiler\_recipe.html (accessed on 30 October 2021)) can be used to study the time and memory consumption of the model's operators. It is enabled through Context Manager in Python. The profiler is used to understand the distribution of CPU percentage over model operations. Some of the columns from the profiler are not shown in the table for simplicity.
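A sketch of enabling the profiler through its context manager, with a toy scripted module in place of the evaluated models, might look like this:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Toy module standing in for the evaluated ASR models.
model = torch.jit.script(torch.nn.Sequential(torch.nn.Linear(80, 32),
                                              torch.nn.ReLU()).eval())
feats = torch.randn(1, 500, 80)  # placeholder input features

with profile(activities=[ProfilerActivity.CPU], profile_memory=True) as prof:
    with torch.no_grad():
        model(feats)

# Operators sorted by total CPU time, similar to the tables below.
# (On Jetson Nano, ProfilerActivity.CUDA would be added to the activities.)
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```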

#### 4.4.1. Jetson Nano Profiles

Tables 9 and 10 show the profiles of Wav2Vec and Speech2Text models on Jetson Nano. For Wav2Vec model, the majority of the CUDA time is spent in *aten::cudnn\_convolution* for input convolutions followed by matrix multiplication (*aten::mm*). Additionally, the CPU and GPU spend a significant amount of time transferring data between each other, *aten::to*.

For the Speech2Text model, the majority of the CUDA time is spent in decoder *forward* followed by *aten::mm* for tensor multiplication operations.

#### 4.4.2. Raspberry Pi Profiles

Tables 11–14 show the profiles of Wav2Vec and Speech2Text models on Raspberry Pi.


**Table 9.** Jetson Nano profile for the Wav2Vec model.

**Table 10.** Jetson Nano profile for Speech2Text model.


**Table 11.** Raspberry Pi profile for the Wav2Vec quantized model.


The CPU time is dominated by *linear\_dynamic* for linear layer computations followed by *aten::addmm\_* for tensor add multiplications.


**Table 12.** Raspberry Pi profile for Wav2Vec non-quantized model.

Compared to the quantized model, the non-quantized model spends 5 s more time in linear computations, *prepacked::linear\_clamp\_run*.


**Table 13.** Raspberry Pi profile for Speech2Text quantized model.

**Table 14.** Raspberry Pi profile for Speech2Text non-quantized model.


CPU time is dominated by the forward function, linear layer computations, and batched matrix multiplication in both the quantized and unquantized models. The unquantized model spends ∼40% more time in linear layer processing than the quantized version.

#### **5. Conclusions**

We evaluated the ASR accuracy, performance and computational efficiency of transformer-based models on edge devices. By applying quantization and PyTorch mobile optimizations for CPU-based inference, we gain a ∼10% improvement in latency and a ∼50% reduction in the memory footprint at the cost of a ∼0.5% increase in WER, compared to the original model. Running the inference on the Jetson Nano GPU improves the latency by a factor of 3 to 5. With 1–2 s load times, a ∼300 MB memory footprint and an RTF < 1.0, the latest transformer models can be used on typical edge devices for private, secure, reliable and always-available ASR processing. For applications such as dictation, smart home control and accessibility, a small trade-off in WER for latency and efficiency gains is mostly acceptable, since small ASR errors will not hamper the overall task completion rate for voice commands such as turning off a lamp or opening an app on a device. By offloading inference to a general-purpose GPU, we can potentially gain 3–5× latency improvements.

In future work, we plan to explore other optimization techniques, such as pruning, sparsity, 4-bit quantization and different model architectures, to further analyze the WER vs. performance trade-offs. We also plan to measure the thermal and battery impact of various models on CPU and GPU platforms on mobile and wearable devices.

**Author Contributions:** Conceptualization—S.G. and V.P.; methodology—S.G. and V.P.; setup and experiments—S.G.; original draft preparation—S.G.; review and editing—S.G. and V.P. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Data Availability Statement:** Publicly available LibriSpeech datasets were used in this study. This data can be found here: https://www.openslr.org/12 (accessed on 30 October 2021).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**


#### **References**


*Article*

## **Self-Attentive Multi-Layer Aggregation with Feature Recalibration and Deep Length Normalization for Text-Independent Speaker Verification System**

### **Soonshin Seo and Ji-Hwan Kim \***

Department of Computer Science and Engineering, Sogang University, Seoul 04107, Korea; ssseo@sogang.ac.kr

**\*** Correspondence: kimjihwan@sogang.ac.kr; Tel.: +82-2-705-8924

Received: 19 August 2020; Accepted: 15 October 2020; Published: 17 October 2020

**Abstract:** One of the most important parts of a text-independent speaker verification system is speaker embedding generation. Previous studies demonstrated that shortcut connections-based multi-layer aggregation improves the representational power of a speaker embedding system. However, model parameters are relatively large in number, and unspecified variations increase in the multi-layer aggregation. Therefore, in this study, we propose a self-attentive multi-layer aggregation with feature recalibration and deep length normalization for a text-independent speaker verification system. To reduce the number of model parameters, we set the ResNet with the scaled channel width and layer depth as a baseline. To control the variability in the training, we apply a self-attention mechanism to perform multi-layer aggregation with dropout regularizations and batch normalizations. Subsequently, we apply a feature recalibration layer to the aggregated feature using fully-connected layers and nonlinear activation functions. Further, deep length normalization is used on a recalibrated feature in the training process. Experimental results using the VoxCeleb1 evaluation dataset showed that the performance of the proposed methods was comparable to that of state-of-the-art models (equal error rate of 4.95% and 2.86%, using the VoxCeleb1 and VoxCeleb2 training datasets, respectively).

**Keywords:** text-independent speaker verification system; self-attentive pooling; multi-layer aggregation; feature recalibration; deep length normalization; speaker embedding; shortcut connections; convolutional neural networks; ResNet

#### **1. Introduction**

Speaker recognition aims to analyze the speaker representation from input audio. A subfield of speaker recognition is speaker verification, which determines whether the utterance of the claimed speaker should be accepted or rejected by comparing it to the utterance of the registered speaker. Speaker verification is divided into text-dependent and text-independent approaches. Text-dependent speaker verification aims to recognize only the specified utterances when verifying the speaker. Examples include Google's "OK Google" and Samsung's "Hi Bixby." Meanwhile, text-independent speaker verification is not limited to the type of utterances to be recognized. Therefore, the problems to be solved using text-independent speaker verification are more difficult. If the performance is guaranteed, text-independent speaker verification can be utilized in various biometric systems and e-learning platforms, such as biometric authentication for chatbots, voice ID, and virtual assistants.

Owing to advances in computational power and deep learning techniques, the performance of text-independent speaker verification has been improved. Text-independent speaker verification using deep neural networks (DNN) is divided into two streams. The first one is an end-to-end system [1]. The input of the DNN is a speech signal, and the output is the verification result. This is a single-pass operation in which all processes can be operated at once. However, the input speech of a variable length is difficult to handle. To address this problem, several studies have applied a pooling layer or temporal average layer to an end-to-end system [2,3]. The second is a speaker embedding-based system [4–14], which generates an input of variable length into a vector of fixed length using a DNN. The generated vector is used as an embedding to represent the speaker. The speaker embedding-based system can handle input speech of variable length and can generate speaker representations from various environments.

As shown in Figure 1, a DNN has been used as a speaker embedding extractor in a speaker embedding-based system. In general, a speaker embedding-based system executes the following processes [4–7]:


In addition, back-end methods, for example, probabilistic linear discriminant analysis, can be used [8–10].

**Figure 1.** Overview of speaker embedding-based text-independent speaker verification system.

The most important part in the above system is the speaker embedding generation [13]. Speaker embedding is a high-dimensional feature vector that contains speaker information. An ideal speaker embedding maximizes inter-class variations and minimizes intra-class variations [10,14,15]. The component that directly affects the speaker embedding generation is the encoding layer. The encoding layer takes a frame-level feature and converts it into a compact utterance-level feature. It also converts variable-length features to fixed-length features.

Most encoding layers are based on various pooling methods, for example, temporal average pooling (TAP) [10,14,16], global average pooling (GAP) [13,15], and statistical pooling (SP) [6,14,17,18]. In particular, self-attentive pooling (SAP) has improved performance by focusing on the frames for a more discriminative utterance-level feature [10,19,20], and pooling layers provide compressed speaker information by rescaling the input size. These are mainly used with convolutional neural networks (CNN) [10,13–17,20]. The speaker embedding is extracted using the output value of the last pooling layer in a CNN-based speaker model.

To improve the representational power of the speaker embedding, residual learning derived from ResNet [21] and squeeze-and-excitation (SE) blocks [22] were adapted for the speaker models [10,13–16,20,23]. Residual learning maintains input information through mappings between layers called "shortcut connections." A large-scale CNN using shortcut connections can avoid gradient degradation. The SE block consists of a squeeze operation (which condenses all of the information on the features) and an excitation operation (which scales the importance of each feature). Therefore, a channel-wise feature response can be adjusted without significantly increasing the model complexity in the training.

The main limitation of the previous encoding layers is that the model uses only the output feature of the last pooling layer as input. In other words, the model uses only one frame-level feature when performing speaker embedding. Therefore, similar to [14,24], a previous study presented a shortcut connection-based multi-layer aggregation to improve the speaker representations when calculating the weight at the encoding layer [13]. Specifically, the frame-level features are extracted from between each residual layer in ResNet. Then, these frame-level features are fed into the input of the encoding layer using shortcut connections. Consequently, a high-dimensional speaker embedding is generated.

However, the previous study [13] has limitations. First, the model parameter size is relatively large, and the model generates high-dimensional speaker embeddings (1024 dimensions, about 15 million model parameters). This leads to inefficient training and thus requires a sufficiently large amount of data for training. Second, the multi-layer aggregation approach increases not only the speaker's information but also the intrinsic and extrinsic variation factors, for example, emotion, noise, and reverberation. Some of these unspecified factors increase variability while generating the speaker embedding.

Hence, we propose a self-attentive multi-layer aggregation with feature recalibration and deep length normalization for a text-independent speaker verification system, as shown in Figure 2. We present an improved version of the previous study, as described in the following steps:


The remainder of this paper is organized as follows. Section 2 describes a baseline system using shortcut connections-based multi-layer aggregation. Section 3 introduces the proposed self-attentive multi-layer aggregation method with feature recalibration and normalization. Section 4 discusses our experiments, and conclusions are drawn in Section 5.

**Figure 2.** Overview of proposed network architecture: Self-attentive multi-layer aggregation with a feature recalibration layer and a deep length normalization layer (We extract a speaker embedding after the normalization layer on each utterance).

#### **2. Baseline System: Shortcut Connections-Based Multi-Layer Aggregation**

#### *2.1. Prior System*

In a previous study [13], a shortcut connections-based multi-layer aggregation with ResNet-18 was proposed. Its main difference from the standard ResNet-18 [21] is the manner in which the speaker embedding is aggregated. Multi-layer aggregation uses not only the output feature of the last residual layer but also the output features of all previous residual layers. These features are concatenated into one feature through shortcut connections. The concatenated feature is fed into several fully-connected layers to construct a high-dimensional speaker embedding. The prior system improved the performance with this simple method.

However, it has a large number of parameters because the system uses multi-layer aggregation, as presented in Table 1. Standard ResNet-18 and standard ResNet-34 have approximately 11.8 million and 21.9 million model parameters, respectively. Conversely, the prior system based on ResNet-18 and ResNet-34 has approximately 15.6 million and 25.7 million parameters, respectively. In addition, the forward–backward training times of standard ResNet-18 and standard ResNet-34 are approximately 6.025 ms and 10.326 ms per batch, respectively, whereas those of the prior system based on ResNet-18 and ResNet-34 are approximately 6.576 ms and 10.820 ms, respectively (the forward–backward training time was measured using three GTX1080Ti units and a mini-batch size of 96).
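As a quick sanity check of the standard ResNet parameter counts quoted above, the following minimal sketch counts parameters with PyTorch. It uses the stock torchvision ImageNet ResNets as stand-ins, not the speaker models themselves, so the multi-layer-aggregation variants are not reproduced here.

```python
import torch
from torchvision.models import resnet18, resnet34

# Minimal sketch: count trainable parameters of the stock ResNets.
# The speaker models use spectrogram inputs and extra aggregation layers,
# so these numbers only approximate the "standard ResNet" rows of Table 1.
def count_params(model: torch.nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())

print(f"ResNet-18: {count_params(resnet18()) / 1e6:.1f} M parameters")
print(f"ResNet-34: {count_params(resnet34()) / 1e6:.1f} M parameters")
```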

**Table 1.** Comparison of model parameters and computational time in training between standard ResNet models and the prior system (MLA = multi-layer aggregation; Dim = speaker embedding dimension; Params = model parameters; FBTT = forward–backward training time (ms/batch)).


#### *2.2. Modifications*

As discussed in Section 2.1, the prior system improved the performance; however, the number of model parameters was too large. The prior system is therefore modified considering scaling factors, such as layer depth, channel width, and input resolution, for efficient learning in the CNN [26]. First, we used high-dimensional log-Mel filter banks with data augmentation for the input resolution. We extracted an input feature map of size *D* × *L*, where *D* is the number of single-frame spectral features and *L* is the number of frames. Here, the Mel filter banks, spanning zero to 8000 Hz, determine dimension *D*. Subsequently, the channel width is reduced and the layer depth is expanded, because ResNet can improve the performance without significantly increasing the number of parameters when the layer depth is increased.

Consequently, the scaled ResNet-34 was constructed, as shown in Table 2. The scaled ResNet-34 is composed of three, four, six, and three residual blocks. It reduces the number of channels by half compared to the standard ResNet-34 [21]. In addition, shortcut connections-based multi-layer aggregation is added to the model using the GAP encoding method. The output features of each GAP are concatenated and fed into the output layer. Then, a high-dimensional speaker embedding is generated from the penultimate layer of the network. Thus, the scaled ResNet-34 has only approximately 5.9 million model parameters, compared to the prior system, as presented in Table 3. In addition, the forward–backward training time of the scaled ResNet-34 is shorter than that of the prior system based on ResNet-34 (approximately 5.658 ms per batch).


**Table 2.** Architecture of scaled ResNet-34 using multi-layer aggregation as a baseline (*D* = input dimension; *L* = input length; *N* = number of speakers; GAP = global average pooling; SE = speaker embedding).

**Table 3.** Comparison of model parameters and computational time in training between the prior system and the scaled ResNet model (MLA = multi-layer aggregation; Dim = speaker embedding dimension; Params = model parameters; FBTT = forward–backward training time (ms/batch)).


#### **3. Self-Attentive Multi-Layer Aggregation with Feature Recalibration and Normalization**

As discussed in Section 1, the previous study has two problems. The model parameter problem is addressed by building a scaled ResNet-34. However, the problem of multi-layer aggregation remains. Multi-layer aggregation uses the output features of multiple layers to develop the speaker embedding system. It is assumed that not only speaker information but also other unspecified factors exist in the output features of the layers, and these unspecified factors lower the speaker verification performance. Therefore, we propose three methods: self-attentive multi-layer aggregation, feature recalibration, and deep length normalization.

#### *3.1. Model Architecture*

As presented in Figure 2 and Table 4, the proposed network mainly consists of a scaled ResNet and an encoding layer. Frame-level features are trained in the scaled ResNet, and utterance-level features are trained in the encoding layer.

In the scaled ResNet, given an input feature *X* = [*x*1, *x*2, . . . , *x<sup>l</sup>* , . . . , *xL*] of length *L* (*x<sup>l</sup>* ∈ R*<sup>d</sup>* ), output features *P<sup>i</sup>* = [*p*1, *p*2, . . . , *p<sup>c</sup>* , . . . , *pC*] (*p<sup>c</sup>* ∈ R) from each residual layer of the scaled ResNet are generated using SAP. Here, the length *C<sup>i</sup>* is determined by the number of channels in the *i*-th residual layer. Then, the generated output features are concatenated into one feature *V* as in Equation (1) (where [+] indicates concatenation).

$$\mathbf{V} = \mathbf{P}\_1 \, [+] \, \mathbf{P}\_2 \, [+] \, \mathbf{P}\_3 \, [+] \, \mathbf{P}\_4 \, [+] \, \mathbf{P}\_5 \tag{1}$$

The concatenated feature *V* = [*v*1, *v*2, . . . , *vc*, . . . , *vC*] (length *C* = *C*<sup>1</sup> + *C*<sup>2</sup> + *C*<sup>3</sup> + *C*<sup>4</sup> + *C*5, *v<sup>c</sup>* ∈ R) is a set of frame-level features and is used as the input of the encoding layer.

The encoding layer comprises a feature recalibration layer and a deep length normalization layer. In the feature recalibration layer, the concatenated feature *V* is recalibrated by fully-connected layers and nonlinear activations. Consequently, a recalibrated feature *V*´ = [*v*´1, *v*´2, . . . , *v*´*c*, . . . , *v*´*C*] (*v*´*c* ∈ R) is generated. Then, the recalibrated feature is normalized according to the length of input *V*´ in the deep length normalization layer. The normalized feature is used as a speaker embedding and is fed into the output layer. Further, a log probability for the speaker classes *s*, *P*(*spk<sup>s</sup>* | *x*1, *x*2, . . . , *x<sup>l</sup>* , . . . , *xL*), is generated in the output layer.

**Table 4.** Architecture of proposed scaled ResNet-34 model using self-attentive multi-layer aggregation with feature recalibration and deep length normalization layers (*D* = input dimension; *L* = input length; *N* = number of speakers; *P* = output features of pooling layers; *V* = output features of concatenation layer; *V*´ = output features of feature recalibration layer; FR = feature recalibration; DLN = deep length normalization; SAP = self-attentive pooling; SE = speaker embedding).


#### *3.2. Self-Attentive Multi-Layer Aggregation*

As shown in Figures 2 and 3, SAP is applied to each residual layer using shortcut connections. For every input feature, given an output feature of the first convolution layer or the *i*-th residual layer after conducting an average pooling, *Y<sup>i</sup>* = [*y*1, *y*2, . . . , *y<sup>n</sup>* , . . . , *yN*] of length *N* (*y<sup>n</sup>* ∈ R*<sup>c</sup>* ) is obtained. The number of dimensions *c* is determined by the number of channels.

**Figure 3.** Overview of self-attentive pooling procedure.

Then, the average feature is fed into a fully-connected hidden layer to obtain *H<sup>i</sup>* = [*h*1, *h*2, . . . , *hn*, . . . , *hN*] using a hyperbolic tangent activation function. Given *h<sup>n</sup>* ∈ R*<sup>c</sup>* and a learnable context vector *u* ∈ R*<sup>c</sup>* , the attention weight *w<sup>n</sup>* is measured by training the similarity between *h<sup>n</sup>* and *u* with a softmax normalization as in Equation (2).

$$w\_n = \frac{\exp\left(\mathbf{h}\_n^{T}\mathbf{u}\right)}{\sum\_{n=1}^{N} \exp\left(\mathbf{h}\_n^{T}\mathbf{u}\right)} \tag{2}$$

Then, the embedding *e* ∈ R*<sup>c</sup>* is generated using the weighted sum of the normalized attention weights *w<sup>n</sup>* and *y<sup>n</sup>* as in Equation (3).

$$e = \sum\_{n=1}^{N} y\_n w\_n \tag{3}$$

The embedding vector *e* can be rewritten as *P<sup>i</sup>* = [*p*1, *p*2, . . . , *p<sup>c</sup>* , . . . , *pC*] (*p<sup>c</sup>* ∈ R) in the order of the dimensions. Consequently, the SAP output feature *P<sup>i</sup>* is generated. This process helps generate a more discriminative feature while focusing on the frame-level features of each layer. Moreover, dropout regularization and batch normalization are used in *P<sup>i</sup>* . Then, the generated features are concatenated into one feature, *V*, as in Equation (1).
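To make the pooling step concrete, the following is a minimal PyTorch sketch of the SAP encoding in Equations (2) and (3), under the assumption that the residual-layer output has already been average-pooled into *N* frame-level vectors of dimension *c*; layer sizes and names are illustrative, not taken from the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentivePooling(nn.Module):
    """Minimal sketch of the SAP encoding in Equations (2) and (3)."""
    def __init__(self, c):
        super().__init__()
        self.fc = nn.Linear(c, c)              # hidden layer producing h_n
        self.u = nn.Parameter(torch.randn(c))  # learnable context vector u

    def forward(self, y):                      # y: [batch, N, c]
        h = torch.tanh(self.fc(y))             # h_n from the tanh hidden layer
        w = F.softmax(h @ self.u, dim=1)       # attention weights w_n, Eq. (2)
        e = (y * w.unsqueeze(-1)).sum(dim=1)   # weighted sum, Eq. (3)
        return e                               # utterance-level embedding P_i

# The per-layer embeddings would then be concatenated as in Equation (1), e.g.
# V = torch.cat([P1, P2, P3, P4, P5], dim=-1)
```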

#### *3.3. Feature Recalibration*

After the self-attentive multi-layer aggregation, the concatenated feature *V* is fed into the feature recalibration layer. The feature recalibration layer aims to train the correlations between each channel of the concatenated feature; this is inspired by [22].

Given an input feature *V* = [*v*1, *v*2, . . . , *vc*, . . . , *vC*] (*v<sup>c</sup>* ∈ R, where *C* is the sum of all channels), the feature channels are recalibrated using two fully-connected layers and nonlinear activations, as in Equation (4).

$$\dot{\mathbf{V}} = f\_{\text{FR}}(\mathbf{V}, \mathbf{W}) = \sigma(\mathbf{W}\_2 \delta(\mathbf{W}\_1 \mathbf{V})) \tag{4}$$

Here, δ refers to the leaky rectified linear unit activation; σ refers to the sigmoid activation; *W*1 is the front fully-connected layer, *W*1 ∈ R*<sup>C×C/r</sup>*; and *W*2 is the back fully-connected layer, *W*2 ∈ R*<sup>C/r×C</sup>*. According to the reduction ratio *r*, a dimensional transformation is performed between the two fully-connected layers, such as a bottleneck structure, while channel-wise multiplication is performed. The rescaled channels are then multiplied by the input feature *V*. Consequently, an output feature *V*´ = [*v*´1, *v*´2, . . . , *v*´*c*, . . . , *v*´*C*] (*v*´*c* ∈ R) is generated. This generated feature *V*´ is the result of recalibration according to the importance of the channels.
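A minimal sketch of the feature recalibration in Equation (4) could look as follows; the channel count *C* and reduction ratio *r* are illustrative values, and the module name is ours rather than the authors'.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureRecalibration(nn.Module):
    """Minimal sketch of Equation (4): a bottleneck of two fully-connected
    layers followed by channel-wise rescaling of the aggregated feature V."""
    def __init__(self, C, r=8):
        super().__init__()
        self.w1 = nn.Linear(C, C // r)   # front fully-connected layer W1
        self.w2 = nn.Linear(C // r, C)   # back fully-connected layer W2

    def forward(self, v):                # v: [batch, C]
        s = torch.sigmoid(self.w2(F.leaky_relu(self.w1(v))))
        return v * s                     # recalibrated feature V'
```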

#### *3.4. Deep Length Normalization*

As in [11], deep length normalization was applied to the proposed model. The L2 constraint is applied to the length axis of the recalibrated feature *V*´ with a scale constant, α, as in Equation (5).

$$f\_{\rm DLN} \left( \dot{\mathbf{V}} \right) = \frac{\alpha \dot{\mathbf{V}}}{\|\dot{\mathbf{V}}\|\_{2}} \tag{5}$$

Then, the normalized *V*´ is fed into the output layer for speaker classification. This feature is used as a speaker embedding, as shown in Figure 4.

**Figure 4.** Overview of deep length normalization procedure.
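A minimal sketch of the deep length normalization in Equation (5), assuming the recalibrated feature is a batch of *C*-dimensional vectors and using an illustrative scale constant α:

```python
import torch
import torch.nn.functional as F

def deep_length_normalization(v_recal, alpha=10.0):
    """Minimal sketch of Equation (5): L2-normalize each recalibrated feature
    vector along its length axis and rescale by the constant alpha."""
    return alpha * F.normalize(v_recal, p=2, dim=-1)

# Illustrative usage on a batch of 4 recalibrated 256-dimensional features.
embedding = deep_length_normalization(torch.randn(4, 256))
```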

#### **4. Experiments and Discussions**

#### *4.1. Datasets*

In our experiments, we used the VoxCeleb1 [27] and VoxCeleb2 [16] datasets presented in Table 5. These datasets comprise various utterances of celebrities collected in real environments from YouTube, including noise, laughter, cross talk, channel effects, music, and other sounds [27]. All utterances were encoded at a 16-kHz sampling rate with 2 bytes per sample. These are large-scale text-independent speaker verification datasets, comprising more than 100 thousand and 1 million
