**Artificial Neural Networks for IoT-Enabled Smart Applications**

Editors

**Andrei Velichko Dmitry Korzun Alexander Meigal**

MDPI • Basel • Beijing • Wuhan • Barcelona • Belgrade • Manchester • Tokyo • Cluj • Tianjin

*Editors* Andrei Velichko Petrozavodsk State University Petrozavodsk, Russia

Dmitry Korzun Petrozavodsk State University Petrozavodsk, Russia

Alexander Meigal Petrozavodsk State University Petrozavodsk, Russia

*Editorial Office* MDPI St. Alban-Anlage 66 4052 Basel, Switzerland

This is a reprint of articles from the Special Issue published online in the open access journal *Sensors* (ISSN 1424-8220) (available at: https://www.mdpi.com/journal/sensors/special_issues/IoT_SmartApp).

For citation purposes, cite each article independently as indicated on the article page online and as indicated below:

LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. *Journal Name* **Year**, *Volume Number*, Page Range.

**ISBN 978-3-0365-8428-7 (Hbk) ISBN 978-3-0365-8429-4 (PDF)**

© 2023 by the authors. Articles in this book are Open Access and distributed under the Creative Commons Attribution (CC BY) license, which allows users to download, copy and build upon published articles, as long as the author and publisher are properly credited, which ensures maximum dissemination and a wider impact of our publications.

The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons license CC BY-NC-ND.


## **About the Editors**

#### **Andrei Velichko**

Dr. Andrei Velichko received a Ph.D. degree in physics and mathematics with the specialization "Physical Electronics" from Petrozavodsk State University, Petrozavodsk, Russia, in 2002. From 2001 to 2015, he worked as a senior lecturer in the Department of Electronic and Ion Devices, and since 2016, he has worked as a leading research scientist at PetrSU. Dr. Velichko's research mainly focuses on artificial neural networks, oxide ReRAM, and metal–insulator transitions. He is the author of the LogNNet neural network configuration for IoT devices as well as the time series entropy estimation algorithm called Neural Network Entropy (NNetEn).

#### **Dmitry Korzun**

Dr. Dmitry Korzun received his B.Sc. (1997) and M.Sc. (1999) degrees in Applied Mathematics and Computer Science from Petrozavodsk State University (PetrSU, Russia). He received a Ph.D. degree in Physics and Mathematics from St. Petersburg State University (Russia) in 2002. He has been an Associate Professor in the Department of Computer Science of PetrSU since 2003. He was a Visiting Research Scientist at the Helsinki Institute for Information Technology HIIT, Aalto University, Finland (2005–2014). In 2014–2016, he performed the duties of Vice Dean for Research in the Faculty of Mathematics and Information Technology of PetrSU. Since 2014, he has acted as a Leading Research Scientist at PetrSU, originating research and development activity within fundamental and applied research projects on emerging topics in ubiquitous computing, smart spaces, and Internet technology. Dmitry Korzun serves on technical program committees and editorial boards of a number of international conferences and journals. His research interests include the modeling and evaluation of distributed systems, mathematical modeling and concept engineering of cyber–physical systems, ubiquitous computing and smart spaces, Internet of Things and its applications, software engineering and programming methods, algorithm design and complexity, linear Diophantine analysis and its applications, and the theory of formal languages and parsing. His educational activity started in 1997 in the Faculty of Mathematics of PetrSU.

#### **Alexander Meigal**

Prof. Alexander Meigal received a Candidate of Sciences degree in Biology in the field of "Physiology" from the Institute of Experimental Physiology, St. Petersburg, Russia (1991), and a Doctor of Science degree in Medicine from Archangelsk State Medical University, Russia (1997). He has been a Head of the Department of Human and Animal Physiology and Pathophysiology at the Medical Institute of PetrSU since 20018, and Head of the laboratory of novel methods in physiology, PetrSU, since 2014. He was also a Visiting Research Scientist at the University of Eastern Finland (Kuopio, Finland, 2006–2014). His research interests are in the field of signal processing and evaluation of electromyogram (EMG), accelerogram, heart rate variability (HRV), kinematics, nonlinear dynamics of EMG and HRV, diagnostics of Parkinsonism and multiple sclerosis, muscle tone disorders, tremor, gravitational and space physiology, thermoregulation, wearable and textile sensors, and the instrumentalization of tools and gadgets, including smartphones.

**Andrei Velichko <sup>1,</sup>\*, Dmitry Korzun <sup>2</sup> and Alexander Meigal <sup>3</sup>**


In the age of neural networks and the Internet of Things (IoT), the search for new neural network architectures capable of operating on devices with limited computing power and small memory size is becoming an urgent agenda. Trends in the development of artificial intelligence (AI) applications in the field of the Internet of Things include smart healthcare services [1–5], smart object-recognition [6–8], smart environment monitoring [9,10], and smart disaster rescue [11]. Traditionally, such applications operate in real time. For example, security camera-based object-recognition tasks operate with detection intervals of 500 ms to capture and respond to target events. Data on human health and physiological parameters from different sensors (heart rate monitoring, glucose monitoring, oxygen saturation, etc.) generally require immediate processing. Often, commercial smart IoT devices transfer information to the cloud for subsequent intelligent processing. However, stable network connections are not available everywhere, which limits the ability to meet real-time requirements. One solution is to process the information using neural networks installed directly on IoT devices, so that the quality of the Internet connection no longer has a significant impact. Enabling artificial intelligence directly on the device is a challenge because of the limited computing power and small memory size of IoT devices. Frequently, smart applications need to run on a lightweight OS with a minimal set of libraries, which imposes limitations on the operation of resource-intensive neural networks.

AI technologies for IoT devices and edge computing are in demand in mobile healthcare (m-Health), as well as in closely related application domains. Ambient intelligence (AmI) environments are constructed in IoT domains to provide smart services based on real-time analysis of human cognitive and motion functions. This Special Issue focuses on recent developments in the constantly growing application field of computing technologies and artificial intelligence algorithms. It includes new approaches to the organization of artificial intelligence on edge devices, as well as the organization of modular, feedforward, distributed, reservoir, recurrent, convolutional, and deep neural networks for various IoT-enabled smart applications. The guest editors are Andrei Velichko (Institute of Physics and Technology), Dmitry Korzun (Institute of Mathematics and Information Technology), and Alexander Meigal (Medical Institute); they are all from Petrozavodsk State University, Russia.

The Special Issue collects eleven papers to provide a multi-domain overview of the trends and developments in edge computing and starts with an illustration of achievements in smart healthcare services. The digitalization of healthcare driven by the IoT and AmI leads to the effective use of sensors, with various parameters of the human body instantly tracked and processed in daily life [1,2]. The concept of machine learning sensors is applied to the diagnosis of COVID-19 as an IoT application in healthcare and ambient assisted living. An important task is to determine the status of infection with COVID-19 using various diagnostic tests. This study provides a fast, reliable, and cost-effective alternative tool for diagnosing COVID-19 based on routine blood values measured at clinic admission. Popular machine learning classifiers were studied and their important features were identified to ensure the high accuracy of disease diagnostics. The study [2] continues the topic of COVID-19 diagnostics and reviews the routine blood values using a backward feature elimination algorithm and the LogNNet reservoir neural network. The proposed method reduces the pressure on the health sector and helps doctors to understand the pathogenesis of COVID-19 using the key blood values. The method demonstrates the strong potential of the LogNNet network for IoT smart applications.

**Citation:** Velichko, A.; Korzun, D.; Meigal, A. Artificial Neural Networks for IoT-Enabled Smart Applications: Recent Trends. *Sensors* **2023**, *23*, 4853. https://doi.org/10.3390/s23104853

Received: 10 May 2023 Accepted: 15 May 2023 Published: 18 May 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

An IoT-enabled system to monitor gait in subjects with Parkinson's disease is presented in study [3]. Parkinson's disease is one of the most studied pathologies in the field of neurology and is suitable for the application of science-intensive study methods, virtual reality technologies and even robotics. Wearable sensors and IoT-enabled technologies look promising for monitoring motor activity and gait in Parkinson's disease patients. The gait was measured and characterized with the help of the accelerometer signal acquired from the inertial measurement unit of a smartphone attached to the head during the timed up and go test. Smartphones as measuring devices are well suited for the creation of IoT-enabled systems. The use of accelerometer signals received from a smartphone inertial measurement unit creates high potential for AI-supported systems and makes the proposed method applicable not only in healthcare laboratories but also in daily life settings.

Smart healthcare applications, the Internet of Things (IoT), and artificial intelligence are arguably the most appropriate customized solutions for shortcomings of traditional healthcare systems such as long waiting times, unnecessarily long trips to health centers, high costs, and mandatory periodic doctor visits. The comprehensive literature review [4] determines the impact of IoT, AI, various communication technologies, sensor networks, and disease detection in cardiac healthcare. The results of the review show that deep learning, combined with IoT, is emerging as a promising technology in the field of cardiac care, offering increased accuracy and real-time clinical monitoring. In addition, this study points out the main advantages and major challenges of e-cardiology in the areas of IoT and AI.

Another illustration of the effective application of neural networks in healthcare is presented in study [5]. Speech is a complex mechanism that allows us to communicate our needs, desires, and thoughts. In some cases of nervous dysfunction, this ability is severely affected, making daily activities that require communication difficult. This study explores various options for an intelligent imaginary speech recognition system that can be installed on low-cost devices with limited resources. The authors used a method based on covariance in the frequency domain, which performed better than other methods in the time domain. Several architectures of convolutional neural networks have been studied and it has been demonstrated that a more complex architecture does not necessarily lead to better results. The results prove that cheap IoT devices can be effectively used in speech recognition and contribute to the development of IoT-enabled smart applications.

The realm of smart object-recognition applications of AI systems is presented in the subsequent articles [6–8]. Driver assistants have become an increasingly popular class of IoT-enabled smart applications, as illustrated in study [6], which detects distracting actions in driver activities. According to the World Health Organization, the increase in car accidents is a major problem in today's transportation systems, and is the eighth leading cause of death worldwide. More than 80% of traffic accidents are caused by distraction while driving. A practical approach to solving this problem is to introduce quantitative indicators of driver activity and develop a classification system that identifies distracting activities. The authors implemented a portfolio of different ensemble deep learning models that have been proven to effectively classify driver distractions and provide in-vehicle recommendations to minimize distraction levels and improve safety. Another lifesaving application based on deep learning is a child drowning prevention system [7]. The proposed deep convolutional neural network-based models can be used to automatically detect the possible distractions of a caregiver who is supervising a child and generate alerts to warn them. The system was tested in a swimming pool, and we think it could be implemented in natural water reservoirs to avoid possible child drowning. Such smart applications for the rapid detection of dangerous situations are of critical importance, as they are able to observe persons and their activity more effectively than humans.

For the IoT-enabled smart applications, point clouds are one of the most widely used data formats created by depth sensors. Research on feature extraction from disordered and irregular point cloud data has advanced recently. The overview [8] of the different types of models is presented, and studies of point clouds and remote sensing problems have been carried out using deep learning methods. It is concluded that convolutional neural networks achieve the best performance in various remote sensing applications that operate directly with raw cloud data. The lightweight models are especially important for IoT edge computing.

The research direction of smart environment monitoring is presented by the study [9], in which an artificial neural network model was integrated into a Raspberry Pi-based sensor to implement edge computing for hourly river level prediction. The model, which consists of a three-layer perceptron, is able to predict river levels with a high degree of accuracy using only previously observed water levels, precipitation, and runoff information as input, without the need for other hydrological and meteorological parameters. This study is a first attempt to combine real-time customized sensors and artificial neural network algorithms in practice. The model was built into a low-cost, open-source, and low-energy-consumption custom sensor to forecast the water level. The high model performance on real events and the low cost of the system make it of interest for environmental monitoring. Another potential reference case for the development of smart IoT-enabled systems for environmental monitoring is presented in the study of determining the carcinogenicity of thousands of wide-variety classes of real-life exposure chemicals [10]. The authors have developed carcinogen prediction models based on the hybrid neural network deep learning method. The proposed model has a high potential for use in various IoT environmental projects.

Smart disaster rescue is presented by an interesting development of a wearable device for search dogs that recognizes the behavior of a dog when a victim is found, using deep learning models [11]. With their exceptional sense of smell and hearing, search and rescue dogs are important in first aid because they are able to locate a victim in conditions that are difficult for humans to reach. The authors propose an implementation of a wearable device that supports deep learning, including a base station, a mobile application, and a cloud infrastructure. The device can, firstly, track the activity, sounds and location of the search and rescue dog in real time, and, secondly, recognize and alert the rescue team whenever the dog spots a victim. Deep convolutional neural networks were used both for activity recognition from inertial sensor data and for classifying dog sounds. The developed deep learning models operated on a wearable IoT device. The functioning of the system was tested in two separate search and rescue scenarios, in which it successfully located the victim and informed the rescue team in real time based on IoT technology.

In conclusion, this Special Issue illustrates advanced cases of using the AI technology for IoT-enabled smart applications. Each case demonstrates a promising trend for applying AI in IoT environments, making a step towards the effective use of modern technologies in our everyday life.

**Author Contributions:** All authors contributed equally to this editorial. All authors have read and agreed to the published version of the manuscript.

**Funding:** The first part of this research is implemented with financial support by Russian Science Foundation, project no. 22-11-20040 (https://rscf.ru/en/project/22-11-20040/, accessed on 1 May 2023) jointly with Republic of Karelia and funding from Venture Investment Fund of Republic of Karelia (VIF RK). The concept of machine learning sensor is studied using practical examples of COVID-19 infectious disease and IoT-enabled monitoring of human gait. Additionally, edge-oriented sensorics are studied for implementing smart object-recognition. The second part of the research was supported by the Russian Science Foundation (grant no. 22-11-00055, https://rscf.ru/en/project/22-11-00055/, accessed on 30 March 2023). Diagnosis and prognosis of COVID-19 disease using the LogNNet Neural Network and an overview of the directions of smart environment monitoring and smart disaster rescue were made.

**Acknowledgments:** The authors express their gratitude to Andrei Rikkiev for valuable comments made in the course of the article translation and revision.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

## *Article* **Deep Learning Empowered Wearable-Based Behavior Recognition for Search and Rescue Dogs**

**Panagiotis Kasnesis <sup>1,</sup>\*, Vasileios Doulgerakis <sup>1</sup>, Dimitris Uzunidis <sup>1</sup>, Dimitris G. Kogias <sup>1</sup>, Susana I. Funcia <sup>2</sup>, Marta B. González <sup>2</sup>, Christos Giannousis <sup>1</sup> and Charalampos Z. Patrikakis <sup>1</sup>**


**Abstract:** Search and Rescue (SaR) dogs are important assets in the hands of first responders, as their inherent olfactory and auditory talents allow them to locate a victim even when vision and/or sound is limited. In this work, we propose a deep-learning-assisted implementation incorporating a wearable device, a base station, a mobile application, and a cloud-based infrastructure that can first monitor in real-time the activity, the audio signals, and the location of a SaR dog, and second, recognize and alert the rescuing team whenever the SaR dog spots a victim. For this purpose, we employed deep Convolutional Neural Networks (CNN) both for the activity recognition and the sound classification, which are trained using data from inertial sensors, such as a 3-axial accelerometer and gyroscope, and from the wearable's microphone, respectively. The developed deep learning models were deployed on the wearable device, while the overall proposed implementation was validated in two discrete search and rescue scenarios, managing to successfully spot the victim (i.e., obtaining an F1-score of more than 99%) and inform the rescue team in real-time for both scenarios.

**Keywords:** deep learning; canine activity recognition; bark detection; wearable computing; search and rescue system

#### **1. Introduction**

Animal Activity Recognition (AAR) and monitoring is an emerging research area enhanced mainly by the recent advances in computing, Deep Learning (DL) algorithms, and motion sensors. AAR has attracted considerable attention, as it can provide significant insights about the behavior, health condition, and location of the observed animal [1]. In addition, if a proper network implementation is considered (e.g., with the proper devices, software, and communication protocol), the monitoring of the animal can be performed in real-time to allow exploitation of AAR for various purposes, e.g., study of the interaction between different animals, search and rescue missions [2], protection of animals from poaching and theft, etc. [3]. This requires inertial sensors, such as accelerometers, gyroscopes, and magnetometers, together with a Machine Learning (ML) method that, after proper training, can accurately classify the animal activity [4].

AAR is a rich source of information that provides insights not only into animals' lives and well-being but also into their environment. Accordingly, over the past years, several works reporting on the use of animal activity recognition were published, increasingly focusing on the use of ML [5], while several open access datasets [6] became available, assisting the development of models and tools for accurate activity recognition of different animals.

**Citation:** Kasnesis, P.; Doulgerakis, V.; Uzunidis, D.; Kogias, D.G.; Funcia, S.I.; González, M.B.; Giannousis, C.; Patrikakis, C.Z. Deep Learning Empowered Wearable-Based Behavior Recognition for Search and Rescue Dogs. *Sensors* **2022**, *22*, 993. https://doi.org/10.3390/s22030993

Academic Editors: Andrei Velichko, Dmitry Korzun and Alexander Meigal

Received: 20 December 2021 Accepted: 25 January 2022 Published: 27 January 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

In this work, we focus on Dog Activity Recognition (DAR) for search and rescue (SaR) missions. SaR dogs are important assets in the hands of first responders due to their inherent olfactory and auditory talents. However, in some cases it is impossible for the dog handler to be present in the same spot as the SaR dog, and thus, a life-critical amount of time is spent as the dog must return to the trainer and guide them to the victim [7]. To solve this problem, we introduce a novel implementation comprising a wearable device, a base station, a cloud server, a mobile application, and Deep Convolutional Neural Networks (CNN), which were shown in [8–10] to be more accurate than other ML algorithms due to their ability to extract features automatically. More specifically, we developed a back-mounted wearable device for the SaR dogs that can:


The proposed implementation is validated in two SaR scenarios, successfully locating the victim and communicating this message to the first responders in real-time with an F1-score of more than 99%.
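For readers less familiar with the metric, the F1-score is the harmonic mean of precision and recall. The following minimal Python sketch shows the computation from raw counts; the function name and the counts are hypothetical and chosen only for illustration, they are not results from this work.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, computed from raw counts."""
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only (not taken from the paper).
example = f1_score(tp=198, fp=1, fn=2)
```

Because the harmonic mean is dominated by the smaller of the two terms, a high F1-score requires both few false alarms and few missed detections.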

In the rest of the paper, we analyze the related work in the field to provide a wider view of the problem we address (Section 2). In Section 3, we propose the core modules of our implementation along with their details and specifications, and we illustrate the overall network architecture developed to communicate the messages between the first responder and the SaR dog. Section 4 elaborates the data collection/annotation steps as well as the employed CNN architectures, while in Section 5 we evaluate the algorithmic results in terms of efficiency and efficacy. Next, in Section 6 the validation of the proposed solution is discussed, proving that our prototype satisfies all the desired functional and nonfunctional requirements. Finally, Section 7 discusses the obtained results, the limitations of the approach, and the future steps, while Section 8 concludes the paper.

#### **2. Related Work**

In this section, we review related work on canine behavior recognition, audio classification, and existing SaR systems based on animal wearables.

#### *2.1. Activity Recognition*

In prior research, animal activity recognition and monitoring was exploited to study various types of animals, spanning from livestock animals [10–15] to wild animals [16–18]. In the former case, animal monitoring can (a) optimize asset management, as the animals can always be kept within preset "virtual fences", (b) provide insights about the animals' health by tracking fluctuations in their activity levels, and (c) designate the optimal pastures. For wild animals, activity monitoring can (a) reduce illegal poaching and stock theft, (b) extract the state of health of the observed populations, and (c) assist observations about the behavior of the wild animals and the interactions between them and other species.

In the category of pet animals, a literature review which analyzes the different technologies used to monitor various target features, such as location, health, behavior, etc. can be found in [19]. In the domain of DAR, these results can aid us towards a better interpretation of the everyday routine of the animals and their needs, which in turn can directly benefit the interaction with their handlers or can be exploited to perceive the behavior of SaR units, providing valuable information to their trainers (e.g., victim discovery). The field of DAR emerged over the last decade due to the availability of low-cost sensors and smart devices that can acquire data and perform the ML algorithmic procedure in real-time [6,20–28]. Usually, the sensors are located on the back, collar, withers, and tail of the dog, while the employed sensors are mainly 3-axial accelerometers, 3-axial gyroscopes and sensors which monitor biometric data (e.g., heart rate). After the data are collected from the various sensors and properly preprocessed, they are fed into an ML algorithm, which is trained to classify any forthcoming activity.
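As an illustration of what such preprocessing can look like, the sketch below segments a multi-channel inertial recording into overlapping windows and computes simple per-window statistics. This is a generic example under stated assumptions, not the procedure of any cited work: the 100 Hz sampling rate, the 2 s window with 50% overlap, the mean/standard-deviation features, and the function names are all hypothetical.

```python
import numpy as np


def sliding_windows(signal: np.ndarray, window: int, step: int) -> np.ndarray:
    """Split a (samples, channels) recording into overlapping windows.

    Returns an array of shape (n_windows, window, channels).
    """
    n_samples, n_channels = signal.shape
    if n_samples < window:
        return np.empty((0, window, n_channels))
    starts = range(0, n_samples - window + 1, step)
    return np.stack([signal[s:s + window] for s in starts])


def window_features(windows: np.ndarray) -> np.ndarray:
    """Per-channel mean and standard deviation for each window."""
    return np.concatenate([windows.mean(axis=1), windows.std(axis=1)], axis=1)


# Synthetic 6-channel recording (3-axial accelerometer + 3-axial gyroscope):
# 10 s at an assumed 100 Hz sampling rate.
rng = np.random.default_rng(0)
recording = rng.normal(size=(1000, 6))

windows = sliding_windows(recording, window=200, step=100)  # 2 s windows, 50% overlap
features = window_features(windows)                         # one feature row per window
```

Each row of `features` could then serve as one training example for a classical classifier, whereas a CNN would typically consume the raw `windows` directly.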

For the purposes of DAR, various ML algorithms were utilized to attain sufficient accuracy. A k-NN classifier was employed in [21] to classify 17 different activities by studying the naturalistic behavior of 18 dogs, attaining an accuracy of about 70%. In [25], the SVM (Support Vector Machines) classifier was applied to a dataset that comprised 24 dogs performing seven discrete activities and attained an accuracy of above 90%. Further, in [28], the accuracy of various ML classification algorithms was evaluated on a dataset comprising 10 dogs of different breeds, ages, sizes and gender performing seven different activities. The employed algorithms were Random Forest, SVM, k-NN, Naïve Bayes, and Artificial Neural Network (ANN). ANN outperformed the other four algorithms in activity detection, whilst Random Forest outperformed the other four in emotion detection. The attainable accuracy exceeded 96% in all cases. A recent study [6] in dog behavior recognition examined the optimal sensor placement on the dog, through a comparison of various algorithms (e.g., SVM). In particular, the authors attached two sensor devices to each dog, one on the back of the dog in a harness and one on the neck collar. The movement sensor at the back yielded up to 91% accuracy in classifying the dog activities and the sensor placed at the collar yielded 75% accuracy at best. These results informed the sensor placement in the current work, in which the developed device is mounted in a harness on the back of the SaR dog. Finally, the authors in [29] created a huge dataset exploiting a 3-axial accelerometer and collecting data from more than 2500 dogs of multiple breeds. They then trained a deep learning classifier, which was validated for real-world detection of eating and drinking behavior. The validated results attained a true positive rate of 95.3% and 94.9% for eating and drinking activities, respectively.
The details of the related work on DAR are shown in Table 1.


**Table 1.** Summary of related work on Dog Activity Recognition (DAR).



#### *2.2. Audio Classification*

Similar to wearable-based activity recognition, several audio signal processing techniques relying on DL algorithms have been proposed in recent years and have been shown to achieve better results than baseline ML algorithms [30]. DL algorithms, such as Deep CNNs, possess the ability to increase their performance as the training dataset grows; thus, the authors in [31] applied well-known CNN architectures, which were employed successfully in computer vision tasks, to test their effectiveness on classifying large-scale audio data. The network architectures they used were a fully connected ANN, an AlexNet [32], a VGG [33], an Inception V3 [34], and a ResNet-50 [35]; these networks were trained and evaluated using AudioSet, which consists of 2,084,320 human-labeled 10-second audio clips drawn from YouTube videos. The audio classes are based on an audio ontology [36], which is specified as a hierarchical graph of event categories, covering a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds. Their experiments showed that the ResNet-50 model, which had the most layers (i.e., it was deeper than the others), achieved the best results.

In addition to this, CNNs are also state-of-the-art models even for relatively small audio datasets consisting of a few thousand samples. Salamon and Bello [37] compared a baseline system (i.e., using MFCC features) with unsupervised feature learning performed on patches of PCA-whitened log-scaled mel-spectrograms using the UrbanSound8K dataset. In particular, they utilized the spherical k-means algorithm [38] followed by the Random Forests algorithm and achieved an average classification accuracy 5% higher than the baseline system. Furthermore, Karol J. Piczak [30] obtained state-of-the-art results for the UrbanSound8K dataset, training a relatively shallow CNN (two convolutional layers), which had as input the log-scaled mel-spectrograms of the audio clips. The proposed CNN model had an average accuracy of about 73.1% against the 68% average accuracy of the baseline model, despite the fact that it seemed to overfit the training data. A deeper, VGG-like CNN model (five convolutional layers) was implemented by A. Kumar [39] and used on the UrbanSound8K dataset, reaching a 73.7% average accuracy.
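To make the input representation concrete, the following NumPy-only sketch computes a log-scaled mel-spectrogram from a mono audio clip. It is a simplified illustration rather than the pipeline of the cited works: the frame length, hop size, number of mel bands, the common `2595 * log10(1 + f / 700)` mel formula, and the decibel scaling are assumptions, and all function names are hypothetical.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters evenly spaced on the mel scale, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    bank = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def log_mel_spectrogram(audio, sr, n_fft=1024, hop=512, n_mels=40):
    """Return a (n_mels, n_frames) log-scaled mel-spectrogram in decibels."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2        # (n_frames, n_fft // 2 + 1)
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T       # (n_frames, n_mels)
    return 10.0 * np.log10(mel + 1e-10).T                   # small offset avoids log(0)


# One second of a synthetic 440 Hz tone at an assumed 22,050 Hz sampling rate.
sr = 22050
tone = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)
spec = log_mel_spectrogram(tone, sr)
```

The resulting two-dimensional array can be treated like a single-channel image, which is why CNN architectures from computer vision transfer naturally to audio classification.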

Finally, data augmentation techniques were adopted by the researchers to increase the number of audio samples. To this end, Salamon and Bello [40] explored the influence of different augmentation techniques ((a) Time Stretching; (b) Dynamic Range Compression; (c) Pitch Shifting; and (d) Adding Background Noise) on the performance of a proposed CNN architecture, and they obtained an average accuracy close to 79% using recordings of the ESC-50 (2000 clips) and ESC-10 (400 clips) datasets [41]. Moreover, Karol J. Piczak [41] applied random time delays to the original recordings of the ESC-50 and ESC-10 datasets. The CNN architecture achieved better accuracy than the baseline model for both datasets, while in the case of the ESC-50, the difference between the average accuracies was over 20% (baseline accuracy: 44%, best CNN: 64.5%).
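Two of the augmentations mentioned above, adding background noise and applying a random time delay, are simple enough to sketch in a few lines of NumPy. The code below is a generic illustration and not the implementation used in the cited works; the target signal-to-noise ratio, the maximum delay, and the function names are assumptions.

```python
import numpy as np


def add_background_noise(clip, noise, snr_db):
    """Mix `noise` into `clip` so that the signal-to-noise ratio equals `snr_db`."""
    noise = np.resize(noise, clip.shape)        # repeat or trim the noise to the clip length
    clip_power = np.mean(clip ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(clip_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clip + scale * noise


def random_time_delay(clip, max_delay, rng):
    """Delay the clip by a random number of samples, padding the start with silence."""
    delay = int(rng.integers(0, max_delay + 1))
    padded = np.concatenate([np.zeros(delay, dtype=clip.dtype), clip])
    return padded[:len(clip)]                   # keep the original clip length


rng = np.random.default_rng(0)
clip = np.sin(2 * np.pi * 5.0 * np.linspace(0.0, 1.0, 8000, endpoint=False))
noise = rng.normal(size=3000)                   # synthetic stand-in for a noise recording

noisy = add_background_noise(clip, noise, snr_db=10.0)
delayed = random_time_delay(clip, max_delay=400, rng=rng)
```

Both transforms keep the class label unchanged, so each original recording can yield several additional training samples.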

#### *2.3. Existing SaR Solutions Based on Animal Wearables*

SaR systems are vital components when it comes to disaster recovery due to the fact that every second might be life-critical. Trained animals, such as dogs (i.e., K9s), are exploited by SaR teams due to their augmented senses (e.g., smell), and their small size is ideal for searching under the debris for survivors.

The authors in [2] developed a two-part system consisting of a wearable computer interface for working SaR dogs that communicates with the handler via a mobile application. The wearable comprised a bite sensor and a GPS to display the K9's location in the mobile application. The SaR dog bites the bringsel, which is equipped with the bite sensor, to notify its handler. In addition, the work in [42] demonstrates several interfaces developed for animal–computer interaction purposes, which could be used in SaR missions for notifying the canine handler, such as bite sensors, proximity sensors, and tug sensors. Furthermore, in [7,43] the use of head gestures is examined to establish communication between SaR dogs and their handlers. The developed wearable is attached to a collar and comprises motion sensors (3-axial accelerometer, gyroscope and magnetometer), while the system analyzes the motion signals produced by the canine wearable using dynamic time warping. Each detected head gesture is paired with a predetermined message that is voiced to the humans by a smartphone. To this end, the participating K9s were specifically trained to perform the appropriate gesture.

Existing patented canine wearables, such as [44,45], could also be used for SaR purposes. A wirelessly interactive dog collar is presented in [45]; it allows voice commands and tracking over long distances, along with features that facilitate tracking and visualization, exploiting its embedded sensors (GPS, microphone, speaker, light). Moreover, in [44] an enhanced animal collar is presented. This device consists of extra sensors, such as camera, thermographic camera, and infrared camera to enable the transmission of the captured images, in addition to audio signals.

Finally, animal-machine collaboration has also been explored. The authors in [46] introduce a new approach that uses a robot snake to overcome the mobility problem of canines on narrow paths through the debris. The SaR dog carries this small robot, and when it is close to the victim it barks to release the robot, which then locates the trapped person. The robot snake is equipped with a microphone and a camera. A rat cyborg is another option for SaR missions [47]. In this system, microelectrodes are implanted in the brain of a rat, through which external electrical stimuli can be delivered in vivo to control its behavior. The authors state that the cyborg system could be useful in search and rescue missions, where the handler can navigate the rat through the debris by exploiting a camera mounted on the animal.

Table 2 summarizes the aforementioned works including our solution, in terms of equipped sensors/actuators and their capabilities (i.e., communication with handler and victim, edge data processing, delivery package, search through debris, extra animal training, no welfare concerns, no rescuer guidance needed).


**Table 2.** Summary of the existing SaR solutions based on animal wearables.


#### **3. Network Architecture**

The overall system architecture for the SaR dog real-time monitoring is illustrated in Figure 1. The architecture is divided into two levels (i.e., layers):


The communication from the EDGE layer to the FOG takes place mainly between the wearable device and the Secure IoT Middleware (SIM), which contains an encrypted KAFKA (https://kafka.apache.org/, accessed on 22 September 2021) Publish-Subscribe broker using Wi-Fi connectivity. When such connectivity is not available at the area of operation, a secondary communication path is deployed. This path represents communication at the EDGE layer and includes an RF (Radio Frequency) connection between the wearable and the Base Station (BS). Once the data are collected by the BS, it uses a Wi-Fi/3G/4G connection to publish them at the FOG layer's KAFKA pub-sub broker through the SIM.

Finally, the smartphone of the first responder, which is the handler of the animal, is notified in real-time about the SaR dog's behavior by the KAFKA broker via his/her mobile application. All these data flows can be seen in Figure 1, while the developed EDGE-level modules are explained in detail in the following subsections.

**Figure 1.** Wearable for animals—system architecture and communication flows.

#### *3.1. Wearable Device*

The wearable device is the most important module, as it collects various types of data, such as 3-axial accelerometer and 3-axial gyroscope data, audio recordings, and localization data, while it can also provide feedback to the dog via vibration and audio signals. The wearable was developed by our team with the guidance of K9-SaR experts solely for the purposes of the SaR task; however, it can be exploited for various other tasks involving animal activity monitoring and recognition, e.g., in dogs or even in livestock animals for the purposes of behavioral analysis.

The designed harness is a back-mounted vest instead of a neck-mounted one (i.e., a collar), which further improves the animal's comfort by moving the center of mass to a more suitable place and achieves higher accuracy for the activity recognition task [6]. The new design is completely modular, since all the components are attached to the wearable with Velcro. A Velcro strip is also included at the belly of the animal to provide further grip. In general, the detachability requirement is related to the dog's safety, as it ensures that the dog will break free from the wearable if it becomes tangled.

Figure 2 shows a sketch of the animal wearable design, including the strips with Velcro attachments (points *A* and Γ), the pouch for electronics (point *B*) and the mini-camera position at the animal's front. In particular, point *B* (back of the animal) contains the main computational platform, the custom board and the battery, while the camera is placed on the animal's chest, always facing forward.

In addition to the pouch for the electronics, another, optional, smaller pouch/pocket was introduced, able to fit a small device (e.g., a mobile phone) or any small item considered useful to be carried by the animal (Figure 3a). This design pertains to rescue scenarios where the delivery of a small item to an unreachable trapped person could be of great importance and contribute to an efficient rescue.

The main features of the wearable device, which is pictorially described in Figure 3b, are the:


low-voltage audio power amplifier, a mono amplifier capable of conducting up to 750 mW of RMS power to an 8Ω speaker continuously. The amplifier's footprint measures 5 mm × 3.1 mm, although the complete audio circuitry also includes an electrolytic capacitor measuring 4.3 mm × 4.3 mm × 5.5 mm.


**Figure 2.** Drawing displaying placement of animal wearable.

**Figure 3.** Designed animal harness (**a**) and electronic device (**b**).

The custom board was designed to meet or exceed the predefined specifications covering device functionality and achieve a balance between battery life, weight, and physical dimensions. Therefore, in most cases, the chosen modules are the smallest that would satisfy the consumption and functional requirements. The total weight of the device is 121 g including the battery (47 g), while the total cost for ordering the components and assembling them was equal to 260€. Table 3 provides details on the electrical characteristics, maximum ratings and recommended operating conditions of the device.


**Table 3.** Electrical characteristics of device.

#### *3.2. SaR Base Station*

The BS device is a portable wireless device based on the Raspberry Pi Zero W (https://www.raspberrypi.com/products/raspberry-pi-zero-w/, accessed on 19 January 2022) and powered by an internal power bank. It is equipped with an XBee SX 868 RF module (https://www.digi.com/xbee, accessed on 19 January 2022), similar to the wearable devices, and creates an XBee network to which all animal wearable devices in range can connect. This results in an extended range of coverage for the animal wearables. The BS device includes a pocket Wi-Fi module, granting 4G connectivity. Any messages sent from the wearables are received by the BS through the XBee network and forwarded to the SIM either over the Wi-Fi connection to the pocket Wi-Fi device, and from there over the 4G network, or over any other known Wi-Fi hotspot. Likewise, any commands issued by the rest of the modules to the wearables (e.g., initiate data collection) are either received directly by the devices through Wi-Fi or received by the BS and relayed to the devices via the XBee network. The existence of a BS is critical in a disaster scenario, as public telecommunications networks cannot be taken for granted. For this reason, the animal wearable device cannot rely solely on mobile network coverage.

#### *3.3. Smartphone Application*

The animal wearable view is one feature of a wider application developed for the FASTER (First responder Advanced technologies for Safe and efficienT Emergency Response) EU Horizon 2020 project (https://cordis.europa.eu/project/id/833507, accessed on 16 December 2021). As a result, the application contains four tabs for displaying: (a) biometrics of first responders, (b) environmental data, (c) the behavior and location of SaR dogs, and (d) upcoming notifications (e.g., a victim was found). In general, it is an Android application (supporting Android version 8.0 and above) that uses Google Maps (https://www.google.com/maps, accessed on 16 December 2021) to depict the location of the dog. The application receives the information from a KAFKA broker with the aid of a Quarkus Reactive Streams (https://quarkus.io/, accessed on 16 December 2021) service. The information flows continuously from KAFKA to the screen of the user. Reactive streams work by publishing messages whenever they receive new information from a source. This makes the information flow "seamless" and, most importantly, avoids spamming the server with HTTP requests every few seconds. The Android system can consume these streams using a library called okSse (https://github.com/biowink/oksse, accessed on 16 December 2021), which helps to establish a connection with a reactive streams service.

Once we get the information, we feed it to our system using LiveData (https://developer.android.com/topic/libraries/architecture/livedata, accessed on 16 December 2021). LiveData is an observable data holder class. Unlike a regular observable, LiveData is lifecycle-aware, meaning it respects the lifecycle of other application components, such as activities, fragments, or services. This awareness ensures LiveData only updates the application component observers that are in an active lifecycle state. With the use of an observer, we "observe" any changes to the state of the information, and when something new arrives we draw the new location or behavior of the dog on the map. The dog actions describe the state of the animal at a particular time and place (Figure 4), for example, whether the dog is walking/running or standing still.

**Figure 4.** Developed smartphone application displaying SaR dog behavior: (**a**) canine is not moving and barks; (**b**) canine is moving and does not bark.

#### **4. Data Collection, Processing, and Deployment of Deep Learning Algorithms**

#### *4.1. Data Collection Process*

The tests were performed in an arena covered with ruins to mimic a real search and rescue operation as best as possible (Figure 5). The tests included search and rescue missions both during the day and night. In the former case, adequate vision is considered, while in the latter, only limited vision can be attained. The resulting AI algorithms are trained in both cases, as in a real operation both cases can be encountered.

The testing procedure is as follows. First, a member of the rescue team, the "victim", hides somewhere in the arena among the ruins, in one of the various spots designed for this purpose. Then, after the wearable on the SaR dog is activated by its trainer, the dog is allowed to search for the victim. The test is successfully completed when the SaR dog finds the "victim". In this case, the SaR dog makes a characteristic barking sound, which lasts for some seconds, while it stands and stares at the "victim". Depending on the location of the "victim" in the arena, the search and rescue test may last from half a minute up to a few minutes.

**Figure 5.** Arena used to conduct search and rescue tests.

#### *4.2. Labeling Process*

The labeling process was performed offline using video and audio recordings. The videos were recorded using a smartphone camera which was positioned on a high place on one side of the arena to capture almost the entire search and rescue field. The audio recordings were performed using the wearable device's microphone. Only segments longer than 2 s were considered during the labeling process, which means that a single activity needs to last more than two consecutive seconds to be labeled. The recorded videos were synchronized with sensor data using metadata (e.g., timestamp) and via exploiting the plotted time series of the sensors (e.g., accelerometer). Four activities were considered:


In cases where it was not possible to identify the dog activity, either due to insufficient light during the night operation or because the dog was not clearly visible to the camera (e.g., it was behind an obstacle), a "missing" label was assigned. These data were omitted from the Artificial Intelligence (AI) training procedure. The audio recordings, in turn, include only two classes, barking and nonbarking, as barking is the state that indicates that the SaR dog has spotted the "victim". Examples of the four dog activities are shown in Figure 6.

**Figure 6.** Instances of four activities.

#### *4.3. Details of the Created Dataset*

The complete dataset comprises seven dog search and rescue sessions. After the labeling process, each session is divided into segments, where each segment comprises only one activity, considering a minimum segment duration of 2 s. Each second of raw data consists of 100 values per axis for the two 3-axial sensors (3-axial accelerometer and 3-axial gyroscope), forming a total of 600 values. Next, each segment is divided into samples of 2 s length with a 50% overlap. An example of samples for the four SaR dog activities from both the accelerometer and the gyroscope is illustrated in Figure 7. Evidently, the amplitudes of the accelerometer and the gyroscope signals increase as the activity becomes more intense, which means that the lowest amplitude is found in standing and the highest in running.
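The segmentation into 2 s samples with 50% overlap can be illustrated with a short NumPy sketch. It is only a minimal illustration under the figures given in the text (100 Hz, 2 s, 50%); the function name `sliding_windows` and the demo segment are hypothetical and not taken from the authors' code.

```python
import numpy as np

SAMPLING_RATE_HZ = 100   # 100 values per second per axis (from the text)
WINDOW_SECONDS = 2       # sample length used in the text
OVERLAP = 0.5            # 50% overlap between consecutive samples


def sliding_windows(segment, rate=SAMPLING_RATE_HZ,
                    seconds=WINDOW_SECONDS, overlap=OVERLAP):
    """Split one single-activity segment into fixed-length windows.

    segment: array of shape (n_timesteps, n_channels).
    Returns an array of shape (n_windows, window_len, n_channels).
    Segments shorter than one window yield zero windows.
    """
    segment = np.asarray(segment)
    window_len = int(rate * seconds)              # 200 time steps
    step = int(window_len * (1.0 - overlap))      # 100 time steps
    n_windows = (len(segment) - window_len) // step + 1
    if n_windows <= 0:
        return np.empty((0, window_len) + segment.shape[1:])
    starts = np.arange(n_windows) * step
    return np.stack([segment[s:s + window_len] for s in starts])


# Hypothetical 5 s segment with 6 channels (3 accelerometer + 3 gyroscope axes).
demo_segment = np.arange(500 * 6, dtype=float).reshape(500, 6)
demo_windows = sliding_windows(demo_segment)
print(demo_windows.shape)  # (4, 200, 6)
```

A 5 s segment therefore yields four overlapping samples, while any segment shorter than 2 s yields none, which matches the minimum segment duration used during labeling.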

Further, the dataset details for all seven search and rescue testing sessions are tabulated in Table 4. Evidently, the most frequent activities are standing and trotting. This is expected, as during the search and rescue operation the dog trots while searching for the "victim" between the ruins and, when the "victim" is found, remains in a standing position and barks. Moreover, only one of the K9s provided a sufficient number of "running" examples (session 4), and only two canines provided a sufficient number of "walking" examples (sessions 4 and 6). Thus, with a leave-one-subject-out approach, it is impossible to check the model's generalizability on the classes "running" and "walking", and, as a result, we merged the motion activities "running", "walking" and "trotting" into one class, called "searching".

Turning our attention to the bark detection, similar to SaR dog activity detection, the labeling process was performed offline using the provided audio recordings and it was compared with the video recordings to verify the annotations. The annotated data were afterwards segmented into 2 s audio clips. This window size was selected to reduce the throughput to the developed model.

Another reason for selecting 2 s was to match the window size of the Inertial Measurement Unit (IMU) data and to give a more timely picture of the SaR dog's situation. For example, in the case of real-time inference with a 4 s window, if the dog barks in the first second of the audio stream, the model would still classify the window as bark, even though the bark occurred 3 s ago.

The dataset we built consists of 1761 examples (i.e., audio clips), of which 258 contain bark and 1503 do not, leading to an unbalanced dataset, which, however, reflects a real-world search and rescue operation. Before introducing the data to the Deep CNN, we split them into three subsets, namely training set, validation set, and test set, following the standard procedure for training a neural network. The training set contains around 74% of the data, the validation set around 10% of the data and the test set around 16% of the data. The split was performed based on the search and rescue sessions, i.e., audio signals recorded during a specific search and rescue session belong to the same subset, avoiding in this way overlapping samples or characteristic bark patterns being shared between the different sets.
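A session-based split of this kind can be sketched in a few lines of plain Python. The function name `split_by_session`, the session identifiers and the clip names are hypothetical; the text does not state which sessions were assigned to which subset.

```python
from collections import defaultdict


def split_by_session(clips, val_sessions, test_sessions):
    """Assign every clip to train/val/test according to its session id.

    clips: iterable of (session_id, clip) pairs.
    All clips from one session land in the same subset, so overlapping
    windows from a session never straddle two subsets.
    """
    val_sessions, test_sessions = set(val_sessions), set(test_sessions)
    if val_sessions & test_sessions:
        raise ValueError("a session cannot be in both validation and test")
    subsets = defaultdict(list)
    for session_id, clip in clips:
        if session_id in test_sessions:
            subsets["test"].append(clip)
        elif session_id in val_sessions:
            subsets["val"].append(clip)
        else:
            subsets["train"].append(clip)
    return dict(subsets)


# Hypothetical toy data: session ids and clip names are illustrative only.
demo_clips = [(1, "a"), (1, "b"), (2, "c"), (3, "d"), (3, "e"), (4, "f")]
demo_split = split_by_session(demo_clips, val_sessions=[2], test_sessions=[3])
print(demo_split)
```

Splitting on the session identifier rather than on individual clips is what prevents two 50%-overlapping clips of the same bark from appearing in both the training and the test set.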

**Figure 7.** Sensor samples of 2 s duration for four activities.

**Table 4.** Number of samples for each search and rescue session for four monitored activities.


#### *4.4. Developed DL Algorithms*

#### 4.4.1. Activity Recognition

The Deep CNN employed for dog activity recognition is a lightweight architecture (around 21,400 parameters) designed to be deployed on the animal wearable. It is based on late sensor fusion [8] (i.e., the first convolutional layers process the input signals individually) and consists of the following layers (Figure 8):

• layer 1: sixteen convolutional filters with a size of (1, 11), i.e., *W*<sup>1</sup> has shape (1, 11, 1, 16).

This is followed by a ReLU activation function, a (1, 4) strided max-pooling operation and a dropout probability equal to 0.5.

• layer 2: twenty-four convolutional filters with a size of (1, 11), i.e., *W*<sup>2</sup> has shape (1, 11, 16, 24).

Similar to the first layer, this is followed by a ReLU activation function, a (1,2) strided max-pooling operation and a dropout probability equal to 0.5.

• layer 3: thirty-two convolutional filters with a size of (2, 11), i.e., *W*<sup>3</sup> has shape (2, 11, 24, 32).

The 2D convolution operation is followed by a ReLU activation function, a 2D global max-pooling operation and a dropout probability equal to 0.5.

• layer 4: thirty-two hidden units, i.e., *W*<sup>4</sup> has shape (32, 1), followed by a sigmoid activation function.
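As a sanity check on the stated model size, the weight shapes listed above can be turned into a parameter count. The sketch below assumes that every layer carries one bias term per output unit, which the text does not state explicitly; the helper names are illustrative.

```python
def conv_params(kernel_h, kernel_w, in_channels, filters, bias=True):
    """Trainable parameters of a 2D convolution layer."""
    return kernel_h * kernel_w * in_channels * filters + (filters if bias else 0)


def dense_params(in_units, out_units, bias=True):
    """Trainable parameters of a fully connected layer."""
    return in_units * out_units + (out_units if bias else 0)


activity_cnn_params = [
    conv_params(1, 11, 1, 16),    # layer 1: W1 has shape (1, 11, 1, 16)
    conv_params(1, 11, 16, 24),   # layer 2: W2 has shape (1, 11, 16, 24)
    conv_params(2, 11, 24, 32),   # layer 3: W3 has shape (2, 11, 24, 32)
    dense_params(32, 1),          # layer 4: W4 has shape (32, 1)
]
total_activity_params = sum(activity_cnn_params)
print(activity_cnn_params, total_activity_params)
# [192, 4248, 16928, 33] 21401
```

Under that assumption the total is 21,401, which agrees with the "around 21,400 parameters" quoted for the architecture. Pooling, dropout and activation layers add no trainable parameters.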

Before feeding the algorithms with the collected data, we performed a preprocessing routine as follows. To acquire orientation independent features, we calculated a 3D vector (the l2-norm) from the sensors' individual axes [48]. The orientation-independent magnitude of the 3D-vector is defined as:

$$S(i) = \sqrt{s_x^2(i) + s_y^2(i) + s_z^2(i)}\tag{1}$$

where *s*<sub>*x*</sub>(*i*), *s*<sub>*y*</sub>(*i*), and *s*<sub>*z*</sub>(*i*) are the three respective axes of each sensor (accelerometer and gyroscope) for the *i*-th sample. Then, the dataset is divided into seven folds (i.e., one per session). To obtain subject-independent results and evaluate the generalization of the algorithms, we used five folds as a training set, one as a validation set, and one as a test set. Afterwards, a circular rotation between training, validation and test subsets was performed to ensure that the data from all sessions are tested. Finally, each sensor's values (obtained by Equation (1)) were normalized by subtracting the mean value and dividing by the standard deviation (both calculated only from the examples in the training set), defined as:

$$Z(i) = \frac{S(i) - \mu}{\sigma} \tag{2}$$

where *S*(*i*) denotes the *i* th sample of a particular sensor (e.g., accelerometer), *Z*(*i*) its normalized representation and *μ* and *σ* denote their mean and standard deviation values, respectively.
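The preprocessing routine of Equations (1) and (2), together with the circular fold rotation, can be sketched as follows. This is an illustrative NumPy sketch rather than the authors' implementation: the function names, the random demo data, and the choice of the fold following the test fold as validation fold are assumptions.

```python
import numpy as np


def magnitude(xyz):
    """Equation (1): orientation-independent l2-norm of a 3-axial signal.

    xyz has shape (..., 3); the result drops the last axis.
    """
    return np.sqrt(np.sum(np.square(xyz), axis=-1))


def standardize(train, other):
    """Equation (2): z-score using mean/std of the training examples only."""
    mu, sigma = train.mean(), train.std()
    return (train - mu) / sigma, (other - mu) / sigma


def rotating_folds(n_folds=7):
    """Yield (train, val, test) fold indices with a circular rotation.

    Assumed layout: test fold k, validation fold k+1 (mod n), rest training.
    """
    for k in range(n_folds):
        test, val = k, (k + 1) % n_folds
        train = [f for f in range(n_folds) if f not in (test, val)]
        yield train, val, test


# Hypothetical accelerometer windows: (n_windows, 200 time steps, 3 axes).
rng = np.random.default_rng(0)
acc_train = magnitude(rng.normal(size=(8, 200, 3)))
acc_test = magnitude(rng.normal(size=(2, 200, 3)))
z_train, z_test = standardize(acc_train, acc_test)
folds = list(rotating_folds())
```

Computing *μ* and *σ* on the training folds only, and reusing them for the validation and test folds, keeps information about the held-out dog from leaking into the normalization.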

**Figure 8.** Overall architecture of developed Deep CNN for activity recognition task. Input tensor has two rows representing produced *Z*(*i*) for accelerometer and gyroscope, each one of them containing 200 values and one channel. Every convolutional operation is followed by a ReLU activation function, and pooling layers are followed by a dropout equal to 0.5. Final dense layer outputs one value and is followed by a sigmoid operation that represents probability of SaR dog searching or standing.

#### 4.4.2. Bark Detection

For the task of bark detection, we evaluated two different strategies. The first one is based on a large pretrained model to which we applied transfer learning, i.e., we fine-tuned its weights using the dataset we collected. In particular, we selected the model introduced in [49], which achieved state-of-the-art results on the ESC dataset [41]. The code for reproducing the model is publicly available (https://github.com/anuragkr90/weak\_feature\_extractor, accessed on 12 September 2021). The second one is a custom lightweight (i.e., 10,617 parameters) Deep CNN architecture that consists of the following layers (Figure 9):

• layer 1: sixteen convolutional filters (i.e., kernels) with a size of (3, 3), i.e., *W*<sup>1</sup> has shape (3, 3, 1, 16)

This is followed by the ReLU activation function, a strided (2, 2) max-pooling operation and a dropout probability equal to 0.5.

• layer 2: twenty-four convolutional filters with a size of (3, 3), i.e., *W*<sup>2</sup> has shape (3, 3, 16, 24).

Similar to the first layer, this is followed by a ReLU activation function, a (2,2) strided max-pooling operation and a dropout probability equal to 0.5.

• layer 3: thirty-two convolutional filters with a size of (3, 3), i.e., *W*<sup>3</sup> has shape (3, 3, 24, 32).

The 2D convolution operation is followed by a ReLU activation function, a global max-pooling operation, and a dropout probability equal to 0.5.

• layer 4: thirty-two hidden units, i.e., *W*<sup>4</sup> has shape (32, 1), followed by a sigmoid activation function.

Before feeding the collected audio data to the CNN, we normalized each sample by dividing all its values by the maximum value in the sample. Afterwards, the log-scaled mel-spectrograms were extracted from the audio clips using a window size of 1024, a hop length of 512 and 128 mel bands. Moreover, the segments of each clip overlapped 50% with the previous and the next one, and we discarded many silent segments since they significantly increased the number of not-bark examples without improving the model's performance.
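The normalization step and the resulting spectrogram size can be checked with a small NumPy sketch. Two assumptions are made here that the text does not state: a sampling rate of 44.1 kHz, and a centered short-time Fourier transform (the signal padded by half a window on each side). Dividing by the maximum absolute value rather than the raw maximum is also an assumption. The function names are illustrative.

```python
import numpy as np

N_FFT = 1024          # window size (from the text)
HOP_LENGTH = 512      # hop length (from the text)
N_MELS = 128          # mel bands (from the text)
SAMPLE_RATE = 44_100  # ASSUMPTION: the sampling rate is not stated in the text
CLIP_SECONDS = 2      # clip length (from the text)


def peak_normalize(clip):
    """Divide a clip by its maximum absolute value (silent clips unchanged).

    ASSUMPTION: the absolute maximum is used so that values fall in [-1, 1].
    """
    clip = np.asarray(clip, dtype=float)
    peak = np.max(np.abs(clip))
    return clip if peak == 0 else clip / peak


def n_frames(n_samples, hop_length=HOP_LENGTH):
    """Frame count of a centered STFT (signal padded by n_fft // 2 per side)."""
    return 1 + n_samples // hop_length


frames = n_frames(SAMPLE_RATE * CLIP_SECONDS)
print((frames, N_MELS))  # (173, 128)
```

Under these assumptions a 2 s clip yields 173 frames of 128 mel values, which matches the input tensor size reported in the caption of Figure 9.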

Figures 10 and 11 visualize the transformation of a clip containing bark and a clip including nonbarking activity, respectively. The comparative difference between the barking and the nonbarking state is obvious both in the raw data representation and in the mel-spectrogram.

**Figure 9.** Overall architecture of developed Deep CNN for bark detection task. Input tensor is log-scaled mel-spectrogram, with 173 rows, each one of them containing 128 values (mels) and one channel. Every convolutional operation is followed by a ReLU activation function, and pooling layers are followed by a dropout equal to 0.5. Final dense layer outputs one value and is followed by a sigmoid operation that represents probability of SaR dog barking or not barking.

**Figure 10.** Raw representation (**a**) and log-scaled mel-spectrogram (**b**) of a dog barking audio signal.

**Figure 11.** Raw representation (**a**) and log-scaled mel-spectrogram (**b**) of a dog not barking audio signal (dog running).

#### **5. Results**

#### *5.1. Results on the Activity Recognition*

In this section, we benchmark the proposed CNN against four other machine learning algorithms, namely Logistic Regression (LR), k-Nearest Neighbours (k-NN), Decision Tree (DT), and Random Forest (RF). For these algorithms we opted to extract the same seven time-dependent features for each sensor (accelerometer and gyroscope), resulting in 14 features in total (see Table 5). The ML experiments were executed on a computer workstation equipped with an NVIDIA GTX 1080Ti GPU, which has 11 GB of memory, 3584 CUDA cores, and a bandwidth of 484 GB/s. Python was used as the programming language, specifically NumPy for matrix multiplications, data preprocessing, segmentation, and transformation, and the Keras high-level neural networks library with TensorFlow as the backend. To accelerate the tensor multiplications, the CUDA Toolkit was used together with cuDNN, the NVIDIA GPU-accelerated library for deep neural networks. The software was installed on an Ubuntu 16.04 Linux operating system.

The proposed CNN model was trained using the Adam optimizer [50] with the following hyper-parameters: learning rate = 0.001, *beta*<sub>1</sub> = 0.9, *beta*<sub>2</sub> = 0.999, epsilon = 10<sup>−8</sup>, decay = 0.0. Moreover, we set the maximum number of epochs to 500; however, the training procedure terminated automatically if the best training accuracy had not improved within a threshold of 100 epochs. The weights of the training epoch that achieved the lowest error rate on the validation set were saved and used to obtain the accuracy of the model on the test set.
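One possible reading of this stopping rule is sketched below in plain Python. It replays pre-recorded per-epoch metrics instead of training a network, treats the epoch budget as an upper bound, and uses a patience counter on the training accuracy while checkpointing on the validation error; these choices, the function name and the demo values are assumptions, not the authors' code.

```python
def train_with_early_stopping(train_acc, val_error, max_epochs=500, patience=100):
    """Replay per-epoch metrics and return (best_val_epoch, stop_epoch).

    train_acc / val_error: per-epoch sequences (hypothetical inputs standing
    in for a real training loop).
    Stops once training accuracy has not improved for `patience` epochs;
    remembers the epoch with the lowest validation error seen so far.
    """
    best_acc, epochs_since_improvement = float("-inf"), 0
    best_val, best_val_epoch = float("inf"), None
    stop_epoch = 0
    for epoch in range(min(max_epochs, len(train_acc))):
        stop_epoch = epoch
        if val_error[epoch] < best_val:            # checkpoint on validation
            best_val, best_val_epoch = val_error[epoch], epoch
        if train_acc[epoch] > best_acc:            # patience on training acc
            best_acc, epochs_since_improvement = train_acc[epoch], 0
        else:
            epochs_since_improvement += 1
            if epochs_since_improvement >= patience:
                break
    return best_val_epoch, stop_epoch


# Hypothetical metrics; patience shortened to 2 so the rule triggers quickly.
demo_result = train_with_early_stopping(
    train_acc=[0.5, 0.6, 0.7, 0.7, 0.7, 0.7],
    val_error=[0.5, 0.3, 0.4, 0.2, 0.6, 0.1],
    patience=2,
)
print(demo_result)  # (3, 4)
```

In the demo, training stops at epoch 4 because the training accuracy stalled for two epochs, and the weights from epoch 3 (lowest validation error up to that point) would be the ones evaluated on the test set.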

Table 6 presents the accuracy results obtained by applying the aforementioned algorithms and the developed Deep CNN architecture to the SaR dog activity recognition dataset. The results are reported per dog, with a different fold in the test set each time (i.e., 7-fold cross-validation); for each fold we made five runs to reduce the dependency on the weight initialization and averaged the results afterwards. The highest accuracy was achieved by the Deep CNN model (93.68%), which significantly outperformed the baseline algorithms, especially DT and k-NN. Moreover, the algorithms achieved their best results (98.57% averaged accuracy) with dog five in the test set and their worst ones when they were evaluated on the examples of dog seven (83.57% averaged accuracy). In addition, Table 6 shows that k-NN had the largest deviation in accuracy among the seven subjects (i.e., dogs), ranging from 73.34% to 100%, while RF had the smallest one, ranging from 84.34% to 100%.


**Table 5.** Description of selected features.


**Table 6.** Per dog accuracy of each Machine Learning (ML) model on dog activity recognition dataset.

Figure 12 displays the confusion matrix of the developed deep CNN averaged over the different test sets. The false positives (i.e., examples falsely predicted as "stand") outnumber the false negatives (i.e., examples falsely predicted as "search"), which is somewhat unexpected since the "search" class contains more examples than the "stand" class. However, after performing error analysis on the obtained results, we noticed that 11 out of the 65 walking activities were falsely classified as "stand". This misclassification of the SaR dogs' low-intensity activities adds around 1.52 false positives, and without it, the proportions of false positives and false negatives would be almost equal.

**Figure 12.** Confusion matrix (averaged over different test sets) of developed deep CNN.

#### *5.2. Results on the Bark Detection*

We followed the same experimental set-up as described in Section 5.1 for activity recognition regarding the workstation, the libraries, and the optimizer. The Deep CNN was trained with the Adam optimizer using the following hyper-parameters: learning rate = 0.001, *beta*<sub>1</sub> = 0.9, *beta*<sub>2</sub> = 0.999, epsilon = 10<sup>−8</sup>, decay = 0.0. Moreover, we set the maximum number of epochs to 1000; however, the training procedure terminated automatically if the best training accuracy had not improved within a threshold of 100 epochs. Similar to the case of the CNN for activity recognition, the weights of the training epoch that achieved the lowest error rate on the validation set were saved and used to obtain the accuracy of the model on the test set.

The results on the test set of the developed search and rescue dataset are presented in Table 7. The best results were achieved by the Deep CNN of [49] after applying transfer learning (named Deep CNN TL). Our model attains an accuracy of 99.13% and an F1-score of 98.41%, while the Deep CNN TL achieved 99.34% accuracy and a 98.73% F1-score.

Furthermore, Figure 13 shows the confusion matrix for the bark and nonbark classes of the lightweight CNN model. Evidently, the model produced on average more false negatives (2.1 bark examples were classified as not bark) than false positives (0.4 not-bark examples were classified as bark), probably because the dataset is imbalanced, containing significantly more nonbarking examples (a ratio of roughly 6:1).
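For reference, accuracy and F1-score follow directly from the four cells of such a confusion matrix, with "bark" taken as the positive class. The sketch below uses hypothetical counts chosen only to illustrate the arithmetic; they are not the values behind Figure 13 or Table 7.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive ("bark") class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, NOT the values behind Figure 13 or Table 7.
demo_metrics = binary_metrics(tp=40, fp=1, fn=2, tn=240)
print(demo_metrics)
```

Because the F1-score ignores true negatives, it is more sensitive than accuracy to the handful of missed barks in an imbalanced test set, which is why both metrics are reported.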


**Table 7.** Performance of developed ML models on SaR dog bark detection dataset.

Apart from the performance metrics, since we were interested in deploying the selected DL model on the wearable device, we measured the inference time of the models. Table 8 presents the response times for both DL models measured on (a) a workstation equipped with an Intel(R) Core(TM) i7-7700K CPU (4 cores) running at a maximum turbo frequency of 4.20 GHz and (b) a Raspberry Pi 4 computing module (quad-core ARM Cortex-A72 processor). We converted the developed models to the TensorFlow Lite format. TensorFlow Lite is a set of tools that enables on-device machine learning on mobile, embedded, and IoT devices. As a result, the TensorFlow models were converted into an efficient portable format known as FlatBuffers (identified by the .tflite file extension), which provides several advantages over TensorFlow's protocol buffer model format (identified by the .pb file extension), such as reduced size and faster inference. The performance of the models did not decrease after the conversion to the .tflite format.

For our measurement purposes, we ran the models 10,000 times and then computed the average inference time. The first inference run, which takes longer due to loading overheads, was discarded. As expected, the inference times for the .tflite format are significantly lower than those for the .pb format. Moreover, since the objective was to deploy the model to a Raspberry Pi 4, we selected our Deep CNN. Even though it achieved a 0.32% lower F1-score, it is significantly faster (almost 7 times) than the Deep CNN TL model, enabling real-time inference at the edge of the network.
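A measurement loop of this kind can be sketched with the Python standard library alone. The `predict` callable below is a hypothetical stand-in for a model's inference call, and the sketch performs one extra warm-up call before the timed runs, which is a slight variation on discarding the first of the 10,000 runs.

```python
import time


def mean_inference_ms(predict, sample, runs=10_000):
    """Average latency of predict(sample) in milliseconds.

    One extra warm-up call is made first and excluded from the average,
    mirroring the discarded first run described in the text.
    """
    predict(sample)                      # warm-up: loading overheads
    start = time.perf_counter()
    for _ in range(runs):
        predict(sample)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / runs


# Hypothetical stand-in for a model's inference call.
calls = []


def dummy_predict(x):
    calls.append(1)
    return sum(x)


latency_ms = mean_inference_ms(dummy_predict, [0.0] * 16, runs=1000)
print(f"{latency_ms:.4f} ms per call")
```

Using a monotonic high-resolution clock such as `time.perf_counter` and averaging over many runs reduces the influence of scheduler jitter, which matters on a shared-core device like the Raspberry Pi 4.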

**Figure 13.** Confusion matrix of Deep CNN on test set of SaR dog search and rescue dataset.


**Table 8.** Mean inference time measured in milliseconds for each model.

#### **6. Validation of the Proposed Implementation**

The proposed system was validated in an abandoned and demolished hospital southwest of Madrid, running two scenarios with the assistance of different SaR dogs. Similarly to the data collection process, a first responder had the role of the "victim", and was hidden somewhere in the arena among the ruins. Then, a SaR dog with the developed wearable mounted on its back started its SaR mission.

During this process we measured the accuracy and F1-scores of the developed bark detection and activity recognition models separately. Moreover, we estimated the overall F1-score for notifying the K9 handlers whether the victim was found or not. This is achieved by adding an alert rule to the mobile application that is triggered when the SaR dog is barking and standing simultaneously, which is the behavior the dog is trained to show when it has found a missing person.
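The alert rule amounts to a logical AND over the two model outputs for the same time window. The sketch below is a minimal illustration that assumes time-aligned 2 s windows and the 0.5 decision threshold shown in Figure 14; the function names and the demo probabilities are hypothetical.

```python
THRESHOLD = 0.5  # decision threshold shown in Figure 14


def victim_found(bark_prob, stand_prob, threshold=THRESHOLD):
    """True when the dog is classified as barking AND standing."""
    return bark_prob > threshold and stand_prob > threshold


def alerts(bark_probs, stand_probs):
    """Indices of time-aligned 2 s windows that would trigger the alert."""
    return [i for i, (b, s) in enumerate(zip(bark_probs, stand_probs))
            if victim_found(b, s)]


# Hypothetical model outputs for five consecutive windows.
demo_alerts = alerts([0.1, 0.9, 0.8, 0.2, 0.95], [0.2, 0.3, 0.9, 0.9, 0.7])
print(demo_alerts)  # [2, 4]
```

Requiring both conditions suppresses alerts when the dog barks while moving or stands silently, at the cost of inheriting the errors of both models.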

Table 9 presents the classes of the collected signals (IMU and audio). Not all of the motion signals were annotated, either because the SaR dog was out of sight (e.g., behind the debris) or because activities overlapped within a 2 s window (e.g., the SaR dog was searching for the first 800 ms and then stopped moving for the remaining 1200 ms). Thus, fewer examples are presented than the total number collected.


**Table 9.** Number of samples for two SaR evaluation scenarios.

The obtained F1-scores and the corresponding accuracy results are presented in Table 10. The developed deep CNN activity recognition model achieved an F1-score of 91.21% and an accuracy of 91.26%, while the bark detection model achieved a 99.08% F1-score and 99.77% accuracy. In particular, the latter produced only two false positives (i.e., "not barking" misclassified as "barking"); these also triggered the alert notification, yielding the same F1-score and accuracy for the overall victim detection task.
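
For reference, the sketch below shows how accuracy and the F1-score of the positive class follow from confusion counts. The counts are hypothetical and are not the paper's data, and the paper may use a different averaging scheme (e.g., macro-averaged F1) than this binary formulation.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Accuracy and F1-score for the positive class from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1


# Hypothetical counts, NOT the paper's data: 2 false positives, no false negatives.
acc, f1 = accuracy_and_f1(tp=98, fp=2, fn=0, tn=900)
print(f"accuracy={acc:.4f}, F1={f1:.4f}")
```

With few positive examples, a couple of false positives lower the F1-score noticeably more than the accuracy, which is dominated by the many true negatives.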

**Table 10.** Obtained F1-score and accuracy results for tasks of activity recognition, bark detection, and victim found recognition regarding two evaluation SaR scenarios.


Moreover, the developed solution was able to operate in real-time on the field, exploiting data processing at the edge, and it enabled the first responders to be aware of the K9 position and its behavior. Figures 14 and 15 display a summary plot of the outputs of the DL models and the received smartphone notifications with respect to the received KAFKA messages, respectively. A video displaying these results and the whole validation procedure can be found here (https://www.youtube.com/watch?v=704AV4mNfRA, accessed on 20 January 2022).

**Figure 14.** Plots displaying outputs of developed DL models during 275 s of 1st SaR scenario (**left**). Red line denotes threshold value (i.e., 0.5) for classifying an audio signal as "bark" and IMU signal as "stand". The final 25 s of this scenario are displayed on (**right**) plot.

**Figure 15.** Mobile screenshot displaying generated alert in 1st SaR scenario (**left**), and corresponding KAFKA messages, with last one triggering alert rule (**right**).

#### **7. Discussion**

One of the main advantages of the current work is that it exploits edge computing to process the generated data in real time before transmitting them through the network. This matters in particular when no Wi-Fi is available: the RF module cannot efficiently stream audio data, since its maximum data rate is 250 kbit/s and a medium-quality audio signal alone requires 192 kbit/s, let alone the need to transmit the IMU signal and the GPS coordinates. In addition, to extend the data transmission range, we reduced the data rate to 10 kbit/s, making it impossible to transmit the raw signals.
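
The bandwidth argument can be checked with simple arithmetic. The sketch below uses only the three rates quoted in the text; the variable names are illustrative.

```python
AUDIO_KBPS = 192          # medium-quality audio stream, from the text
RF_MAX_KBPS = 250         # maximum RF data rate, from the text
RF_LONG_RANGE_KBPS = 10   # reduced rate used to extend range, from the text

headroom_at_max = RF_MAX_KBPS - AUDIO_KBPS          # left for IMU and GPS data
overload_factor = AUDIO_KBPS / RF_LONG_RANGE_KBPS   # audio alone vs. reduced link

print(f"headroom at full rate: {headroom_at_max} kbit/s")
print(f"audio needs {overload_factor:.1f}x the reduced link capacity")
```

Even at the full rate, only 58 kbit/s would remain for everything else, and at the reduced rate the audio stream alone exceeds the link capacity by a factor of about 19, which is why only the model outputs are transmitted.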

Moreover, the inclusion of IMU sensors is significant since the participating SaR dogs are trained to bark and stand still when identifying a missing/trapped person. Thus, it reduces the false positives (i.e., victim found recognition) when the algorithm outputs that the dog is barking but it is not standing, or when the dog produces a false bark. Furthermore, micro-movements where the dog is confused (e.g., makes small circles) or is sniffing are not noticeable through the GPS signal (i.e., the displayed coordinates will indicate it as standing) because of its estimation error (which can reach up to 5 m), but they are classified as searching by our algorithm.

However, one limitation of the approach is the activity recognition algorithm's performance. Even though the overall accuracy is high, an average of 7.32 misclassifications in a 100 s time span is not considered low for a mission-critical application where every second matters. This is mainly because the algorithm has not "seen" examples from the dogs included in the test set. In other words, the behavioral patterns of some dogs differ from those of others, and more training data would benefit the algorithm's performance [51], a case that will be explored in the future.

Another possible limitation is the generalizability of the activity recognition algorithm to different dog breeds and environments. The SaR dogs included in training and evaluation were German Shepherds, American Labrador Retrievers, Golden Retrievers, Belgian Malinois, or mixed breeds (of the aforementioned) and weighed from 20 kg to 32 kg. Moreover, the training and evaluation environments (arenas) were relatively small areas with many obstacles, such as debris. Thus, the algorithm's performance on bigger SaR dogs (e.g., Saint Bernards) and in wide open areas (e.g., a snow-covered forest) was not tested.

Finally, the current work followed the guidelines of the Ethics Code (https://escuelasalvamento.org/wp-content/uploads/2021/04/Codigo-Etico\_vf.pdf, accessed on 5 August 2021) for K9 training, and the participating SaR dogs did not undergo any extra training for the purposes of this paper.

#### **8. Conclusions**

In this paper, we proposed a novel implementation that performs dog activity recognition and bark detection in real time to alert the dog handler (a) about the dog's position and (b) whether it has found the victim during a search and rescue operation. The proposed solution can significantly aid first responders in search and rescue missions, especially in places that the rescuers cannot enter, e.g., below debris, or where they cannot keep the rescue dog within their line of sight. To realize this, the candidate implementation incorporates CNNs, which can extract features automatically and attained the highest accuracy compared with other known ML algorithms. In particular, it attained an accuracy of more than 93% in both activity recognition and bark detection on the collected test datasets and, in both validation scenarios, managed to classify and alert the rescuer at the time the dog found the victim.

**Author Contributions:** Conceptualization, P.K. and C.Z.P.; methodology, P.K. and D.U.; software, P.K., V.D. and C.G.; validation, P.K., V.D. and D.U.; formal analysis, P.K., V.D. and D.U.; investigation, P.K., S.I.F. and M.B.G.; data curation, P.K. and D.U.; resources, S.I.F. and M.B.G.; writing—original draft preparation, P.K., V.D. and D.U.; visualization, P.K. and D.U.; supervision, D.G.K. and C.Z.P.; project administration, D.G.K. and C.Z.P.; funding acquisition, C.Z.P. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the European Commission's H2020 program (under project FASTER) grant number 833507.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Acknowledgments:** This research was funded by the European Commission's H2020 program (under project FASTER) grant number 833507.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **References**


**Mustafa Aljasim and Rasha Kashef \***

Electrical, Computer, and Biomedical Engineering, Ryerson University, Toronto, ON M5B 2K3, Canada; mustafa.aljasim@ryerson.ca

**\*** Correspondence: rkashef@ryerson.ca

**Abstract:** The increasing number of car accidents is a significant issue in current transportation systems. According to the World Health Organization (WHO), road accidents are the eighth leading cause of death around the world. More than 80% of road accidents are caused by distracted driving, such as using a mobile phone, talking to passengers, and smoking. Many efforts have been made to tackle the problem of driver distraction; however, no optimal solution has been provided. A practical approach to solving this problem is implementing quantitative measures for driver activities and designing a classification system that detects distracting actions. In this paper, we have implemented a portfolio of various ensemble deep learning models that have been proven to efficiently classify driver distracted actions and provide an in-car recommendation to minimize the level of distractions and increase in-car awareness for improved safety. This paper proposes E2DR, a new scalable model that uses stacking ensemble methods to combine two or more deep learning models to improve accuracy, enhance generalization, and reduce overfitting, with real-time recommendations. The highest performing E2DR variant, which included the ResNet50 and VGG16 models, achieved a test accuracy of 92% as applied to state-of-the-art datasets, including the State Farm Distracted Drivers dataset, using novel data splitting strategies.

**Keywords:** deep learning; stacking; ensemble learning; distracted driving

**Citation:** Aljasim, M.; Kashef, R. E2DR: A Deep Learning Ensemble-Based Driver Distraction Detection with Recommendations Model. *Sensors* **2022**, *22*, 1858. https://doi.org/10.3390/s22051858

Academic Editors: Andrei Velichko, Dmitry Korzun and Alexander Meigal

Received: 25 December 2021 Accepted: 23 February 2022 Published: 26 February 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### **1. Introduction**

With the continuous growth of the population, new technologies and transportation methods need to emerge to serve people effectively and efficiently. Building an efficient and safe transportation system can positively affect economies, the environment, and human mental and physical health. The increasing number of accidents is a major issue in our current transportation system. According to the World Health Organization, road accidents are the eighth leading cause of death worldwide [1]. According to [2], 1.35 million people die every year in car accidents, and up to 50 million people are injured. This makes traffic safety a major concern worldwide. More than 80% of road accidents are caused by distracted driving, such as using a mobile phone, talking to passengers, and smoking [2]. Therefore, more attention has been directed to driver action analysis and monitoring. Many efforts have been made to tackle the problem of driver distraction with effective approaches using countermeasures for distracted driver actions [3]. The measures can be divided into three categories: (1) distraction prevention before distraction occurs, (2) distracting action detection (through alertness) after distraction occurs, and (3) collision avoidance when a potential collision is expected [3]. Imposing strict fines, government legislation, and raising public awareness are methods used to decrease the number of accidents caused by distracted driving by preventing distraction sources before they arise. When a potential collision is expected, collision avoidance systems are implemented in most newly manufactured cars through lane control, automatic emergency braking, and forward-collision warning. Distraction alertness is critical and can be more effective in preventing driver distraction; thus, accurate real-time driver action detection methods are essential for this driver distraction category. Distraction alertness can be approached with different modalities. Cameras have been used in many applications to detect distracted driving behavior. A Controller Area Network Bus can assess the vehicle's performance, such as wheel angle and brake level. Moreover, the system can detect if the driver is distracted or focused on driving based on the in-car collected information. Finally, sensors, such as the electrocardiograph and the electroencephalograph, can be used to estimate the emotional and physiological states of a driver, which can be associated with the level of distraction and fatigue in drivers [4]; however, the fact that these sensors are invasive for the driver is often overlooked. The driver's action or behavior, such as gaze, head pose, and hand position, can be detected through deep learning models and the analysis of the car information [5]. Existing methods for detecting driver distraction fall short in providing accurate detection and recommendations in real time. Ensemble learning has shown better classification performance compared to individual models, as it combines the benefits of multiple models while overcoming their drawbacks [6]. While the authors in [6] used a fixed architecture of only three deep learning models, namely the residual network (ResNet), the hierarchical recurrent neural network (HRNN), and the Inception network, there is a research gap in the literature in using scalable and incremental stacking-based ensemble learning with real-time recommendations to achieve high accuracy in detecting distracted driving activities with minimal computational overhead. Thus, in this paper, we propose a novel scalable model that uses ensemble learning, focusing on stacking that combines two or more baseline models and generates an ensemble with better performance than the adopted models.
Our method aims to enhance generalization, reduce overfitting, increase performance, and provide real-time recommendations. This paper first examines the performance of several state-of-the-art deep learning image classification methods. An Ensemble-Based Distraction Detection with Recommendations model, namely E2DR, is then designed with the goal of improving the accuracy of distracted behavior detection. In the proposed E2DR model, two or more deep learning models are aggregated in a stacking ensemble. A recommendation layer is also provided for real-time recommendations to drivers in each case of distracted behavior, allowing drivers or autonomous vehicles to take the best course of action when distracted behavior is detected. Experimental results show that state-of-the-art image classification models achieve a test accuracy ranging between 82% and 88% in detecting driver distraction. Furthermore, results show an average improvement of 5–8% in detection accuracy when the proposed E2DR is used with a real-time data splitting based on the driver IDs. Similar results are obtained for other metrics such as Precision and F1 score. The rest of this paper is organized as follows: Section 2 discusses deep learning driver distraction detection systems and the related work. Sections 3 and 4 introduce the adopted and proposed models, respectively. Section 5 presents the experimental results and analysis. We conclude the paper with future directions in Section 6.
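
As an illustration of splitting data by driver ID, the sketch below holds out whole drivers so that no driver contributes images to both the training and the test set. It assumes this is what the driver-ID-based split refers to; the function `split_by_driver`, the 20% test fraction, and the sample records are hypothetical and not taken from the paper.

```python
import random


def split_by_driver(samples, test_fraction=0.2, seed=0):
    """Split (driver_id, image_path) pairs so no driver appears in both sets."""
    driver_ids = sorted({driver_id for driver_id, _ in samples})
    rng = random.Random(seed)
    rng.shuffle(driver_ids)
    n_test = max(1, round(len(driver_ids) * test_fraction))
    test_drivers = set(driver_ids[:n_test])
    train = [s for s in samples if s[0] not in test_drivers]
    test = [s for s in samples if s[0] in test_drivers]
    return train, test


# Hypothetical records: 5 drivers with 4 images each.
samples = [(f"driver_{d}", f"img_{d}_{i}.jpg") for d in range(5) for i in range(4)]
train, test = split_by_driver(samples)
train_ids = {d for d, _ in train}
test_ids = {d for d, _ in test}
print(len(train), len(test), train_ids & test_ids)
```

Splitting at the driver level rather than the image level prevents near-duplicate frames of the same person from leaking into the test set, which would otherwise inflate the reported accuracy.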

#### **2. Related Work on Deep Learning Driver Distraction Detection**

There are three main types of distracted driving: cognitive, visual, and manual distraction. Cognitive distraction occurs when the driver's mind is not entirely focused on driving. Driver gaze and talking to passengers are examples of cognitive distraction. Even drivers listening to music or the radio are at risk, since the audio can shift the driver's attention away from driving and the overall surroundings. Visual distraction occurs when the driver is not looking at the road ahead. Drivers who observe shop signs and billboards on the side of the road are considered visually distracted. Looking at electronic devices such as GPS devices, smartphones, and digital entertainment devices while driving also falls under the category of visual distraction. Finally, manual distraction occurs when the driver, for any reason, takes their hands off the steering wheel. Drivers who smoke while driving, eat and drink in the car, or try to reach for something in the vehicle are at risk of manual distraction. Texting while driving is the most dangerous driver distraction as it combines all three types of distraction. When drivers take their eyes off the road to send a message or check a notification, the time involved is long enough to cover the length of a football field at 80 km/h [5]. Various research studies have addressed the driver distraction detection problem. This section surveys the most recent state-of-the-art research using single-based vs. hybrid-based deep learning to address this problem.

#### *2.1. Single-Based Deep Learning Models*

A gaze estimation model called X-Aware is introduced in [7] to analyze the driver's face along with contextual information. The model visually improves the fusion of the captured environment of the driver's face, where the contextual attention mechanism is directly attached to the output of convolutional layers of the InceptionResNetV2 networks. The accuracy of their best model outperformed the other baseline models in the literature. The dynamics of the driver's gaze and their use to understand other attentional mechanisms are addressed in [8]. The model is built based on two questions, where and what is the driver looking at. The model is trained through coarse-to-fine convolutional networks on short sequences from the DR(eye)VE dataset [8]. Their experiments showed that the driver's gaze could be learned to some extent, considering its highly subjective challenges and the scene's irreproducibility showing the driver's gaze for each sequence. The results showed that the model could achieve accurate results and could be integrated into practical applications. In [9], the authors proposed a deep learning model that detects drivers' behavior and actions during travel. The deep learning model classifies the driver actions into ten classes. The first class represents safe driving, and the other nine classes represent unsafe drivers' actions such as fixing makeup and texting. The driver receives an alert if an unsafe action is detected. They used Convolutional Neural Networks (CNNs) to perform training and detection. The core of the deep learning system is ResNet50. A dense net architecture followed the ResNet50 architecture to make classifications. The dataset used is the State Farm dataset and included images of different drivers' actions that cause distracted driving. The model achieved high accuracy in detecting the driver's actions. A facial expression recognition model in [10] monitors drivers' emotions and operates in low specification devices installed in cars. 
A Hierarchical Weighted Random Forest Classifier (HRFC) is used and trained on the similarity of sample data. Geometric features and facial landmarks are detected and extracted from input images. The features are vectorized and fed to the Hierarchical Random Forest Classifier to detect facial expressions. The method was evaluated on the MMI dataset, the Keimyung University Facial Expression of Drivers (KMU-FED) dataset, and the Cohn-Kanade dataset. The results showed that the proposed model had similar performance to other state-of-the-art methods. The study in [11] introduced a computationally efficient distracted driver detection system based on convolutional neural networks. The authors proposed a new architecture called mobileVGG, which is based on depth-wise separable convolutions. The authors used a simplified version of the VGG16 model, making the proposed architecture suitable for real-time applications with decent classification accuracy. The datasets used are the State Farm dataset and the American University in Cairo Distracted Driver (AUCDD) detection dataset. The results showed that the proposed architecture outperformed other approaches while being computationally simple. In [12], the driver's face pose is detected by training CNNs; the CNNs then identify whether the driver's head position falls under the category of distracted driving. The model consists of five CNNs followed by three fully connected layers. The results showed that the proposed model has better accuracy when compared to non-linear and linear embedding algorithms. In [13], the authors proposed a driver action recognition system called dilated and deformable Faster Region-Based Convolutional Neural Networks (R-CNN). It detects driver actions by detecting motion-specific objects exhibiting inter-class similarity and intra-class differences. 
The irregular and small features, such as cell phones and cigarettes, are extracted through the dilated and deformable residual block. Then, the region proposal optimization network algorithm decreases the number of features and improves the model's efficiency. Finally, the feature pooling module is replaced with a deformable one, and the R-CNN network is trained as the classifier of the network. The authors built their own dataset, which contained images of different driver actions. Results showed that the model achieves acceptable results. The authors in [14] implemented a driver distraction detection model that uses a lightweight octave-like convolutional neural network. The network, called OLCMNet, consists of octave-like convolutional blocks (OLCM blocks). The OLCM block splits the feature map into two branches through point-wise convolution. Average pooling and depth-wise convolution are performed on the feature map. A DC operator captures the fine details in the high-frequency branches. Lastly, the OLCMNet exchanges further information between layers. The model performed well on the Lilong Distracted Driving Behavior dataset while being implemented on a limited computation budget. A unique approach is proposed in [15] that uses both spatial and temporal information of electroencephalography (EEG) signals as an input to a deep learning model. The relationship between the driver distraction and the EEG signal in the time domain is mapped through gated recurrent units (GRUs) and CNNs. Twenty-four volunteers were tested while doing activities that cause a distraction while driving, and their EEG response was recorded. Then, the proposed deep learning network was trained based on the EEG information. The deep learning approach consisted of a temporal–spatial information network (TSIN), combining CNNs and GRUs to better detect spatial and temporal features from EEG signals. The authors of [16] proposed a modified deep learning approach for distraction detection. 
They used the OpenPose library for a two-category problem of distraction detection. The library draws 43 points on the facial skeleton to detect the human face. The detection is sent to a deep neural network that uses the ResNet50 model. The results demonstrated good accuracy and outperformed other residual network architectures. The work in [17] introduced a new approach to distracted driver detection using wearable sensing and deep neural networks. The study collected data from twenty participants through wearable motion sensors attached to their wrists. The participants performed five distraction activities under instructions in a driving simulator. The captured data were sent to a deep learning model consisting of recurrent neural networks with long short-term memory (RNN-LSTM), which classified the distraction tasks. The results showed good potential for the proposed wearable sensing approach.

#### *2.2. Hybrid-Based Deep Learning Models*

A hybrid model is designed in [18] with an ensemble of weighted CNNs for driver posture classification. The authors showed that weighting the ensemble of classifiers with a genetic algorithm resulted in a better confidence score for classification. Additionally, the effect of visual elements such as the face and hands on detecting distracted driver actions is analyzed through face and hand localization. The dataset used is the Distracted Driver dataset, which contains ten classes of driver actions. The best model has an accuracy of 96%, and a smaller version of the ensemble model achieved an accuracy of 94.29%. In [19], a distracted driver detection technique using pose estimation is introduced. The model is an ensemble of ResNets and classifies drivers through pose estimation images. ResNet and HRNet are used to generate pose images. Then, ResNet50 and ResNet101 classify the original and pose estimation images. The grid search method identifies the optimum weight for predictions from both models. Classifying pose estimation images is useful when combined with the original image classification model, as it increases classification accuracy. The dataset used is the AUCDD dataset. The results showed that the introduced model achieved an accuracy of 94.28%. The study in [20] detected driver–vehicle volatilities using driving data to detect the occurrence of critical events and give appropriate feedback to drivers and surrounding vehicles by analyzing multiple real-time data streams such as vehicle movements, instability of driving, and driver distraction. The deep learning model consisted of a Convolutional Neural Network (CNN) model and a long short-term memory (LSTM) model. The data were collected from more than 3500 drivers and included 1315 severe and 7566 normal events. The model achieved high accuracy and was effective in detecting accidents. A CNN-based model to identify drivers' activities is introduced in [21]. The driving activities are divided into seven classes, of which four are considered normal driving activities and the other three are classified as distracted driving. A Gaussian mixture model segments the driver's body from the background before the image is sent to the CNN model. The authors used transfer learning to fine-tune three pre-trained state-of-the-art CNN models: ResNet50, AlexNet, and GoogLeNet. The model was trained as a binary classifier to detect whether the driver was distracted or not. The authors collected the data from 10 drivers involved in the most common driving activities. The results showed that the model was effective as a binary classifier. In [22], a model for distracted driving action recognition is proposed using a hybrid of two convolutional neural network architectures, Xception and Inception V3, to detect 10 classes of driver actions. The authors used ImageNet weights for transfer learning. The performance of both models was analyzed under different weighting schemes. Using pre-trained weights helped the network learn basic shapes and edges without starting training from scratch, which allowed the model to achieve good results in under 10 epochs of training on the State Farm Distracted Driver dataset. The results showed that the Inception model had a better performance compared to the Xception model. A distracted driver action recognition system is introduced in [23] based on the Discriminative Scale Space Tracking (DSST) and Deep Predictive Coding Network (PCN) algorithms, dynamic face tracking, location, and face detection. Then, the YOLOv3 object detection model detects distracting behavior around the driver's face, such as phone calls and smoking. The dataset used is a self-built dataset of people making phone calls and smoking. The results demonstrated that the model could detect a driver's behavior with high accuracy. A hybrid driver distraction detection model is presented in [24]. 
The model uses CNNs and Bidirectional Long Short-Term Memory (BiLSTM). The proposed model captures the spatio-spectral features of the images and consists of two steps: (1) the spatial features of different postures are detected automatically using CNNs, and (2) the spectral components are extracted from the stacked feature map through the BiLSTMs. They used the AUCDD dataset. Results showed that their model performed better than most state-of-the-art models. The work in [25] used deep learning to detect inattentive and aggressive driver behavior. They classified inattentive driver behavior into driver fatigue, drowsiness, driver distraction, and other risky driver behavior such as driving aggressiveness. All these risky driving behaviors are associated with various factors that include driving age, experience, illness, and gender. The authors used CNNs, RNNs, and LSTMs. They showed that the CNNs achieved the best performance. The algorithm in [26] detects manual driver distraction using two modules. In the first module, the bounding boxes of the driver's right ear and right hand are detected from RGB images through YOLO, a deep learning object detection model. Then, the bounding boxes are taken as input by the second module, a multi-layer perceptron, to predict the distraction type. The dataset consisted of 106,677 frames extracted from videos obtained from 20 participants in a driving simulator. The proposed algorithm achieved comparable results with other models in the same field. Table 1 provides a comparative study of the recent work in driver distraction detection systems. There is a research gap in the literature in using stacking-based ensemble learning to achieve high accuracy in detecting distracted driving activities with minimal computational overhead. Thus, in this paper, we propose a framework that uses ensemble learning, focusing on stacking that combines two baseline models and generates an ensemble with better performance than the adopted models.


**Table 1.** Comparative analysis of driver distraction detection systems.

#### **3. Adopted Deep Learning Models**

#### *3.1. ResNet50 Model*

ResNet50 is a 50-layer deep convolutional neural network (Figure 1) [27]. A version of the network pretrained on more than a million images from the ImageNet database [28] can be loaded. The pretrained network classifies images into 1,000 different object categories, including pencils, tables, mice, and various animals. As a result, the network has learned rich feature representations for a large variety of images [29]. In the final classification model, ResNet50 is used as the convolutional base, and the pre-trained model is used to learn the patterns in the data. ResNet50 requires input images of size 224 × 224 (width, height) [29]. All the experiments were conducted with color images. The dimensions of each image sent to the classification model were (224, 224, 3), where 3 indicates the number of color channels: red, green, and blue.

**Figure 1.** The ResNet50 blocks [27].

#### *3.2. VGG16 Model*

Oxford's VGG16 architecture from the Visual Geometry Group (VGG) (Figure 2) improves on AlexNet by replacing the large kernel-sized filters (11 × 11 and 5 × 5 in the first and second convolutional layers, respectively) with several 3 × 3 kernel-sized filters one after another [30]. Similar to the ResNet50 model, the input to the network is a 224 × 224 × 3 image, where 3 indicates the RGB channels. The image is passed through a series of convolutional layers with small receptive field filters of size 3 × 3 [31]. This is the smallest filter size that captures the notions of center, up/down, and left/right. Additionally, the configuration utilizes 1 × 1 convolutional filters, which can be considered a linear transformation of the input channels [32].

**Figure 2.** The VGG16 blocks [33].

#### *3.3. Inception Model*

The Inception model (Figure 3) is a significant milestone in the evolution of CNN classifiers. Inception departed from the traditional approach of adding more and deeper convolutional layers to improve performance [34]. Inception went through several versions over time. Inception V1 [35] uses multiple filters that operate simultaneously, making the network wider rather than deeper. Then, the authors introduced Inception V2 and V3 [36]. Inception V2 significantly improved performance and computational speed by using two 3 × 3 convolutional operations instead of a single 5 × 5 convolution, which is 2.78 times more computationally expensive than a 3 × 3 convolution [37]. Inception V3, the model used in the experiments, includes all the upgrades of Inception V2, adds factorized 7 × 7 convolutions, which improved performance even further, and uses label smoothing to decrease the chances of overfitting [38]. The main contribution of Inception V4 is adding reduction blocks to change the width and height of the grid [39].
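
The cost figures behind this factorization can be reproduced with a few lines of arithmetic. The sketch counts weights per input/output channel pair and ignores biases; the helper name `conv_weights` is illustrative.

```python
def conv_weights(kernel, channels_in=1, channels_out=1):
    """Number of weights in a kernel x kernel convolution (bias ignored)."""
    return kernel * kernel * channels_in * channels_out


five = conv_weights(5)       # 25 weights
three = conv_weights(3)      # 9 weights
two_threes = 2 * three       # 18 weights, same 5 x 5 receptive field

print(f"5x5 vs one 3x3: {five / three:.2f}x")
print(f"5x5 vs two stacked 3x3: {five / two_threes:.2f}x")
```

A single 5 × 5 filter therefore costs 25/9 ≈ 2.78 times as much as one 3 × 3 filter, and about 1.39 times as much as the two stacked 3 × 3 filters that cover the same receptive field.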

**Figure 3.** The Inception v3 architecture [39].

#### *3.4. MobileNet Model*

MobileNet (Figure 4) is the first mobile computer vision model based on TensorFlow [40]. The name "Mobile" implies the ability of the model to function in mobile applications [41]. MobileNet is based on depth-wise separable convolutions, which significantly reduce the number of parameters compared to networks of the same depth that use regular convolutions. This makes MobileNet a lightweight deep neural network suitable for mobile applications [42]. The depth-wise separable convolution is made from two primary operations: depth-wise convolution and point-wise convolution. The depth-wise convolution came from the idea that the filter's spatial and depth dimensions can be separated: a separate spatial (width × height) filter is applied to each input channel, so the depth dimension is handled separately from the spatial dimensions. The point-wise convolution is a 1 × 1 convolution that changes the depth (number of channels) of the previous layer's output.
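The parameter saving of a depth-wise separable convolution can be checked with a short calculation. The sketch below is illustrative only; the function names and the layer size (3 × 3 kernel, 128 input channels, 256 output channels) are assumptions, and biases are ignored.

```python
def standard_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights of a regular convolution: one k x k x c_in filter per output channel."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights of a depth-wise convolution followed by a 1 x 1 point-wise convolution."""
    depthwise = kernel * kernel * c_in  # one k x k filter per input channel
    pointwise = c_in * c_out            # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


std = standard_conv_params(3, 128, 256)
sep = depthwise_separable_params(3, 128, 256)
reduction = std / sep

print(std, sep, round(reduction, 2))
```

For this hypothetical layer, the separable version needs roughly one ninth of the weights, which is in line with the reduction described above.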

**Figure 4.** The MobileNet Architecture [43].

#### **4. The Proposed E2DR Model**

Existing driver distraction detection systems in the literature only use a single model trained for classification. Moreover, most recent work uses a single state-of-the-art classifier or a network of convolutional neural layers to get the best performance. In this paper, an Ensemble-Based Distraction Detection with Recommendations system is designed, namely E2DR, to improve the accuracy of driver distraction detection and provide recommendations. In the proposed E2DR model, two deep learning models are aggregated in a stacking ensemble. A recommendation layer is also provided for real-time recommendations in each case of distracted behaviors. The E2DR model enhances the generalization of the detection process and reduces overfitting. The E2DR model allows drivers or autonomous vehicles, depending on the technology in the vehicle, to take the best action when drivers are detected under distracted behaviors.

#### *4.1. The Ensemble-Based Distraction Detection with Recommendations Model (E2DR)*

Stacked generalization (SG) was first introduced in [44], where it was shown that stacking reduces the bias of a single model with respect to the training set; here, bias is the average difference between actual and predicted results. This reduction results from the stacking model's ability to harness the capabilities of more than one well-performing model on a regression or classification task to generate better predictions than the base classifiers in the ensemble, which reduces the error and bias. Inspired by the theory of stacked generalization, the E2DR model combines two or more detection models to provide better and more accurate predictions than individual models. The stacking algorithm takes the outputs of the base models as input to another model, sometimes called a meta-learner, which learns how to combine the predictions of the base models to generate better predictions. In more detail, the stacking architecture has two levels. The first level contains the base models, and the second level includes a meta-learner that concatenates the outcomes of the base models to provide final predictions. Figure 5 shows the pair-wise stacking in the E2DR architecture (i.e., only two base models are combined).

**Figure 5.** The E2DR (Pair-wise stacking) Architecture.

#### *4.2. E2DR Variants*

In this paper, six models are developed using variants of base model 1 and base model 2, denoted E2DR (A1, A2), where A1 and A2 ∈ {ResNet, VGG16, MobileNet, Inception} and A1 ≠ A2. The E2DR model combines two of these models using the Stacked Generalization (SG) ensemble method. The SG ensemble method takes the outputs of the pre-trained base models, concatenates them, and sends them to a meta-learner model at level 2 consisting of a dense layer for classification. Once the distracted behavior is classified, a set of recommended actions is provided to ensure safety. We assess each E2DR (A1, A2) variant using Accuracy, Loss, F-measure, Precision, and Recall.
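The count of six variants follows from choosing two distinct base models out of four. As a brief illustration (the variable names are hypothetical), the pairs can be enumerated as follows:

```python
from itertools import combinations

BASE_MODELS = ["ResNet50", "VGG16", "MobileNet", "Inception"]

# Every unordered pair of distinct base models defines one E2DR variant.
e2dr_variants = [f"E2DR({a}, {b})" for a, b in combinations(BASE_MODELS, 2)]

for name in e2dr_variants:
    print(name)
```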

#### *4.3. Computational Complexity*

Assume TA1 and TA2 are the computational times taken by base models A1 and A2, respectively. Assume the meta-learner adds an overhead of TM and the recommendation layer needs O(k) to retrieve recommendations based on the output classification, where k is the number of recommendations. We assume a linear search algorithm for recommendation extraction. The overall computational complexity of the E2DR model is computed as the maximum of TA1 and TA2 plus the overhead of concatenation and recommendation retrieval, as shown in Equation (1).

$$T_{\mathrm{E2DR}} = O(\max(T_{A1}, T_{A2})) + T_M + O(k) \tag{1}$$
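Equation (1) can be made concrete with a small timing model. The sketch below is an illustration under stated assumptions: the base models run in parallel, each recommendation lookup costs a fixed time, and all numeric values are hypothetical rather than measurements from the paper.

```python
def e2dr_time(t_a1, t_a2, t_meta, k, t_lookup):
    """Equation (1): parallel base models, meta-learner, linear search over k items."""
    return max(t_a1, t_a2) + t_meta + k * t_lookup


def sequential_time(t_a1, t_a2, t_meta, k, t_lookup):
    """Same pipeline if the two base models had to run one after the other."""
    return t_a1 + t_a2 + t_meta + k * t_lookup


# Hypothetical timings in milliseconds.
parallel_ms = e2dr_time(12.0, 14.0, 0.5, k=10, t_lookup=0.01)
sequential_ms = sequential_time(12.0, 14.0, 0.5, k=10, t_lookup=0.01)

print(round(parallel_ms, 2), round(sequential_ms, 2))
```

The comparison shows why the maximum, rather than the sum, of the base-model times appears in Equation (1): the slower base model dominates when both run concurrently.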

#### *4.4. Adopted Base Model Architectures*

All adopted architectures use CNNs, a deep learning model that learns from spatial features of images by creating feature maps using filters and kernels (sliding windows). Many variations of CNNs have been studied recently to detect driving postures and actions. The models adopted in this paper are among the highest-performing CNN models. For all models, we used a learning rate of 0.001 and Categorical Cross Entropy as the loss function. The models are explained in detail in Section 3.

*ResNet50 Model:* When building the CNN model, the classification top of the ResNet50 Model (Figure 6), which was originally designed to classify 1000 classes [45], was dropped from the network to adapt the new CNN architecture to the dataset used that included only 10 classes. To avoid the problem of overfitting, a drop-out layer was added, and performance on the validation set was observed after every epoch. The hyperparameters, such as batch size, learning rate, and the number of neurons, were adjusted to optimize the model's performance and enhance the model's generalizability. We used different batch sizes: 128, 64, 32, and 16 to understand the significance of selecting the batch size as well as its effect on the network, with 32 as the best performing batch size.

**Figure 6.** ResNet50 Architecture.

*VGG16 Model*: The VGG16 model (Figure 7) is among the largest models, with many parameters. The padding is 1 pixel for the 3 × 3 convolutional layers to preserve the spatial resolution after convolution. Five max-pooling layers, each following some of the convolutional layers, carry out the spatial pooling [46]. Max pooling is applied over a 2 × 2 pixel window with a stride of 2. The same filter size is applied several times, allowing the network to represent complex features [47]. This "blocks" concept became more common after VGG was introduced [46]. As with ResNet50, the classification layer (top layer) is adapted to the dataset used, which has 10 classes. The hyperparameters were chosen to optimize performance and training time, with a batch size equal to 32.
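The effect of the padding and pooling settings on the feature-map size can be traced with the standard output-size formula. The following sketch is illustrative; it treats each VGG16 block as one 3 × 3 convolution followed by one pooling step, which is sufficient because same-padded convolutions do not change the spatial size.

```python
def conv_output_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


size = 224
sizes_after_pool = []
for _ in range(5):
    # 3x3 convolution with 1-pixel padding keeps the spatial size unchanged.
    size = conv_output_size(size, kernel=3, stride=1, padding=1)
    # 2x2 max pooling with stride 2 halves the spatial size.
    size = conv_output_size(size, kernel=2, stride=2)
    sizes_after_pool.append(size)

print(sizes_after_pool)
```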

**Figure 7.** Dense net architecture (VGG16).

*Inception model:* The Inception model (Figure 8) is a widely-used image classification model with medium complexity. Its continual evolution resulted in the creation of multiple versions of the model. The one used in the experiments is the third version of the network, which has similar parameters to the ResNet50 model discussed earlier. The batch size equals 32, and the model is trained for 5 epochs. The model uses smaller convolutions which can be significant in decreasing computational time. In addition, factorizing convolutions reduces the number of parameters. Therefore, the Inception model has a lower training time compared to the ResNet50 and VGG16 models.

*MobileNet model:* We used the MobileNet architecture (Figure 9) to classify drivers' actions and examine its performance, considering that it is the smallest state-of-the-art image classification network in terms of size and number of parameters. The main difference compared to other models is that MobileNet uses a 3 × 3 depth-wise convolution and a 1 × 1 point-wise convolution instead of the traditional 3 × 3 convolution layer used in most CNN models. The dense network is similar to the ResNet and Inception model networks discussed earlier. The batch size is 32, and the model was trained for 5 epochs. As with the other models, the top layer was removed from the network to adapt the classification to the dataset used.

**Figure 8.** The Inception Model Architecture.

**Figure 9.** Model Architecture (MobileNet).

#### *4.5. Recommendations*

Adding a recommendation after the classification of each class can be a great improvement to distracted driver detection implementations. In most cases, the best action, especially in driving, is as simple as keeping the driver's focus on the road with no distraction. Reminding drivers is the best way to keep them from losing attention while driving. An alert system that offers alternatives and actions can effectively remind drivers of options that ensure their safety, especially drivers who need to take an important phone call or send an urgent text. In the case of autonomous vehicles, the recommendations can be sent to the vehicle to perform the best course of action. However, this is limited by the technology available in the vehicle. Table 2 provides the list of recommended actions as alertness signals to the driver based on the class of distracting activity while driving. The category of the distracting activity is retrieved from the ensemble classification layer in the E2DR model.

**Table 2.** Recommendations for distracted drivers.


#### **5. Experimental Analysis and Results**

#### *5.1. Dataset*

The dataset used in the experiment is the State Farm Distracted Drivers dataset [48]. There are 22,424 images of drivers in distracted positions that can be used for training and testing. The images in the dataset were taken with the contribution of 26 unique subjects in different cars (random numbering (i.e., not in consecutive order)). The dataset has 10 classes: safe driving, texting—right, talking on the phone—right, texting—left, talking on the phone—left, operating the radio, drinking, reaching behind, hair and makeup, and talking to a passenger. A representation of each class in the dataset is shown in Figure 10.

**Figure 10.** State Farm Distracted Driver dataset class representation [47].

#### *5.2. Experimental Setup*

A MacBook Pro and Google Colab (Pro) were used to train and test the models.

#### *5.3. Preprocessing and Splitting Strategy*

The Distracted Driver dataset is ingested and then preprocessed as follows: (1) the images and the driver\_imgs\_list.csv file are stored; (2) the images are loaded, converted from BGR to RGB, and resized to 224 × 224 × 3, as this is the size used by most models and architectures for transfer learning; (3) the data are split into training, validation, and test sets. The validation and test sets were created based on the subject (Driver ID); the subjects chosen for validation were p18, p27, and p39, and the subjects chosen for testing were p015, p022, p050, and p056. Finally, the labels were converted into categorical values. As a result, the training set had 15,963 images, the validation set had 2769 images, and the test set had 3692 images.
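The subject-based split and the label conversion described above can be sketched as follows. This is a minimal illustration, not the authors' code: the column names (subject, classname, img), the class-name format ("c0" to "c9"), and all rows in the toy CSV are assumptions made for the example.

```python
import csv
import io

# Toy stand-in for driver_imgs_list.csv; column names and rows are assumed.
TOY_CSV = """subject,classname,img
p002,c0,img_1.jpg
p002,c3,img_2.jpg
p012,c1,img_3.jpg
p014,c8,img_4.jpg
p015,c7,img_5.jpg
p016,c9,img_6.jpg
"""


def split_by_subject(rows, val_subjects, test_subjects):
    """Assign each image to a set according to its driver ID only."""
    train, val, test = [], [], []
    for row in rows:
        if row["subject"] in test_subjects:
            test.append(row)
        elif row["subject"] in val_subjects:
            val.append(row)
        else:
            train.append(row)
    return train, val, test


def one_hot(class_name, num_classes=10):
    """Convert a label such as 'c3' into a categorical (one-hot) vector."""
    index = int(class_name.lstrip("c"))
    return [1 if i == index else 0 for i in range(num_classes)]


rows = list(csv.DictReader(io.StringIO(TOY_CSV)))
train, val, test = split_by_subject(rows, val_subjects={"p014"}, test_subjects={"p015", "p016"})
train_labels = [one_hot(r["classname"]) for r in train]
```

Splitting on the driver ID rather than on individual images guarantees that no driver appears in more than one set, so the test images come from people the model has never seen.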

#### *5.4. Evaluation Metrics*

Precision, Recall, F1 score, and Accuracy [49–54] are the most well-known evaluation metrics to assess the performance of a classifier. Precision finds pertinent instances among the gathered instances. It can be defined as the ratio between the True Positives (TP) and the sum of True Positives and False Positives (FP) as shown in Equation (2).

$$\text{Precision} = (\text{TP}) / (\text{TP} + \text{FP}) \tag{2}$$

Recall, also known as Sensitivity, is defined as the ratio of True Positives and the sum of True Positives and False Negatives (FN).

$$\text{Recall} = (\text{TP}) / (\text{TP} + \text{FN}) \tag{3}$$

F1 score: It is defined as the harmonic mean of Precision and Recall as shown in Equation (4).

$$\text{F1 Score} = (2 \times \text{Precision} \times \text{Recall}) / (\text{Precision} + \text{Recall}) \tag{4}$$

Accuracy is the proportion of correct predictions among all predictions. It is defined as the ratio between the sum of True Positives and True Negatives (TN) and the sum of True Positives, False Positives, True Negatives, and False Negatives, as shown in Equation (5).

$$\text{Accuracy} = (\text{TP} + \text{TN}) / (\text{TP} + \text{FP} + \text{TN} + \text{FN}) \tag{5}$$
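Equations (2)–(5) can be evaluated directly from the four confusion counts. The sketch below uses hypothetical counts for a single class treated in a one-vs-rest manner; the function name and the numbers are illustrative and are not results from the paper.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Precision, Recall, F1 score, and Accuracy as in Equations (2)-(5)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts for one class out of 1000 test images.
metrics = classification_metrics(tp=80, fp=20, tn=880, fn=20)
print(metrics)
```

Note how accuracy is much higher than precision and recall here because the many true negatives of a one-vs-rest setting dominate the count, which is why the per-class Precision, Recall, and F1 score are reported alongside it.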

#### *5.5. Performance Evaluation: Base Models*

The performance of all base models is shown in Table 3. ResNet50 performs best with the highest accuracy and recall on the test set. VGG16 performs just as well as ResNet50, but its training time was significantly longer than that of ResNet50 and all other tested models since it has many parameters. The Inception model had a test accuracy of 0.83, lower than ResNet50 and VGG16. The Inception model also had the highest loss, and its training time was close to that of ResNet50. Finally, MobileNet performed similarly to the Inception model, with a lower loss and significantly faster training. This is because MobileNet is a low-power, low-latency, light model parameterized to meet the computational and time constraints of various applications. Choosing the optimum model depends on the trade-off between performance and computational complexity. ResNet50 and VGG16 can be used when powerful computational resources and flexible time constraints are available. MobileNet can be used for faster training with decent accuracy, which is not as good as that of VGG16 and ResNet50 but still makes it a good choice when computational resources are limited. The Inception model did not perform well and lagged the other models in most aspects, making it less favorable than the other tested models.

Figure 11 shows how each model performed on the training, validation, and test sets. As shown in Figure 11, all models have a gradual increase in validation accuracy with the increasing number of epochs except for the MobileNet model, which has its highest validation accuracy at the third epoch and decreases afterwards. Similarly, the loss of all models decreases as epochs increase except for MobileNet, which had its lowest loss value at epoch three, showing that the model was overfitting after epoch three. We also increased the number of epochs beyond five and observed that the individual baseline models suffered from overfitting and did not improve even when training was continued for a larger number of epochs. The performance report (Figure 12) shows how each model performs for each class. All models perform well for classes 1–5 and perform poorly for class 8, which is "hair and makeup", because it is a challenging class to detect and is usually confused with class 7, as shown in the confusion matrices in Figure 13. Moreover, the data quality for this class might not be on the same level as that of the other classes.


**Table 3.** Deep learning image classification models performance.

**Figure 11.** Accuracy and loss graphs for the models on the training and validation sets.


**Figure 12.** Performance reports for deep learning models on the test set.

**Figure 13.** Confusion matrices for the deep learning models.

#### *5.6. The E2DR Performance Evaluation*

#### 5.6.1. Settings

In the first layer of the E2DR model, the individual models are trained beforehand. When loaded into the ensemble, their layers are executed in parallel (after optimizations), and their weights are not altered during the training of the ensemble. Each of the two base models has 10 outputs, one for each class in the dataset. The outputs of the base models are sent to a concatenation layer, whose output is passed to a dense layer with 10 neurons (equal to the number of classes). A SoftMax activation function is used to perform classification. A learning rate of 0.001 and the Adam optimizer are used to compile the model, with a batch size equal to 32. The loss function used is the Categorical Cross Entropy loss function. The E2DR was trained for five epochs, as were the base models.
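The forward pass of the meta-learner described above (concatenation of two 10-dimensional outputs followed by a 10-neuron dense layer with SoftMax) can be sketched in plain Python. This is an illustration only: the weights are random placeholders rather than trained values, the base-model outputs are simulated, and all names are hypothetical.

```python
import math
import random

NUM_CLASSES = 10


def softmax(logits):
    """Numerically stable SoftMax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def meta_learner_forward(probs_a, probs_b, weights, biases):
    """Concatenate two base-model outputs and apply a dense SoftMax layer."""
    features = probs_a + probs_b  # concatenation: 10 + 10 = 20 inputs
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)


rng = random.Random(0)
# Placeholder (untrained) dense-layer parameters: 10 neurons, 20 inputs each.
weights = [[rng.uniform(-0.1, 0.1) for _ in range(2 * NUM_CLASSES)] for _ in range(NUM_CLASSES)]
biases = [0.0] * NUM_CLASSES
# Simulated class probabilities from two frozen base models.
probs_a = softmax([rng.random() for _ in range(NUM_CLASSES)])
probs_b = softmax([rng.random() for _ in range(NUM_CLASSES)])

ensemble_probs = meta_learner_forward(probs_a, probs_b, weights, biases)
predicted_class = max(range(NUM_CLASSES), key=ensemble_probs.__getitem__)
```

In training, only the dense-layer weights and biases would be updated, since the base-model weights are kept fixed.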

#### 5.6.2. Results and Discussion

The results of the E2DR models showed a remarkable improvement in performance compared to the individual models. The best-performing E2DR model was the stacked ensemble combination of ResNet and VGG16, with a test accuracy of 92%. The lowest-performing E2DR model, with a test accuracy of 88%, was the MobileNet–Inception E2DR variant, which was expected, as these base models did not perform very well individually. The performance of each variant of the E2DR model is shown in Table 4. The improvement in accuracy was around 5–8%, which is significant considering that the accuracy of the base models could not exceed the high 80s in percent. The fact that E2DR models reached accuracies exceeding 90% indicates that the E2DR models effectively improved generalization compared to the individual base models. Other metrics, such as Precision, Recall, and F1 score, showed improvements similar to those in accuracy, which further validates the model's performance. The loss function used in all experiments is the Categorical Cross Entropy function, which represents the confidence of the predictions made by the model. The losses of the base models and the E2DR variants were in the same range, with a small improvement in the E2DR models. This is because the classification confidence did not significantly improve: despite making more correct classifications, which led to an increase in accuracy, the model did not have high certainty in making those predictions. Although the ensemble model in [6] achieved a higher performance with the traditional percentage-based data split, our method provides further credibility as it was tested on completely new data, which simulates real-world scenarios. This was achieved by choosing subjects (Driver IDs) that are not included in the training set when constructing the validation and test sets, allowing the model to be tested on data it had not seen before. This approach is not followed in other implementations.
The batch size used when recording the training duration is 32. The performance and confusion matrix of the highest-performing E2DR model, which includes ResNet50 and VGG16, are presented in Figure 14. In the performance reports of the base models for this variant, the strongest classes of the ResNet50 model were 0 and 6, while VGG16 performed best for classes 3 and 4. The performance report of the E2DR model shows that our method combined the strong classes from each model in a single robust model. This is one of the most useful advantages of the E2DR model: it combines the skills and strong points of different models into one model, allowing the base models to complement each other in terms of performance. Similarly, Figure 15 shows the performance report and the confusion matrix of the lowest-performing E2DR variant, which uses MobileNet and Inception as base models. Although it is the lowest-performing variant, it still showed a large performance boost compared to its base models. The E2DR model effectively addressed the weak points of the Inception model (classes 0 and 7) and the MobileNet model (classes 2 and 9) by boosting the classification performance for those classes. Comparing the confusion matrices of the base models and the E2DR variants also shows improved classification performance, especially for class 8, where the confusion rate with class 7 has decreased compared to the base models. Figures 16–18 visualize the performance evaluation across different metrics for the base models and the E2DR variants. The E2DR variants outperform the baseline models in terms of test Accuracy, Precision, Recall, F1 score, and Loss value, as shown in Figures 16–18. As given in Equation (1), the computational time to fully develop the E2DR models is the maximum training duration of the combined base models in addition to the overhead of concatenation and recommendation retrieval.
The execution of the E2DR models after training can be applied in real time. The additional overhead in training the E2DR models is shown in Figure 19, which indicates an average overhead of 7% when using the E2DR models. However, given the limited GPU computational power of the experimental hardware discussed in Section 5.2, we anticipate that this overhead will be significantly reduced if additional GPUs are used in the training phase. The recognition time for one image is 14.45 ms on average with the baseline models and 15.21 ms on average with the ensemble E2DR models.
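The reported recognition times can be turned into a relative inference overhead and an approximate throughput with simple arithmetic. The sketch below uses only the two averages quoted above; the variable names are illustrative, and the throughput figure assumes images are processed one at a time.

```python
baseline_ms = 14.45  # average recognition time per image, baseline models
ensemble_ms = 15.21  # average recognition time per image, E2DR models

extra_ms = ensemble_ms - baseline_ms
relative_overhead = extra_ms / baseline_ms
images_per_second = 1000.0 / ensemble_ms

print(f"extra time per image: {extra_ms:.2f} ms")
print(f"relative overhead: {relative_overhead:.1%}")
print(f"approximate throughput: {images_per_second:.1f} images/s")
```

Under these assumptions, the ensemble adds well under one millisecond per image, which supports the statement that the trained E2DR models can be applied in real time.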


**Table 4.** E2DR models performance on the test set.

**Figure 14.** Best performing E2DR model classification report and confusion matrix.


**Figure 15.** Lowest performing E2DR model classification report and confusion matrix.

**Figure 16.** Accuracy of base models and E2DR models.

**Figure 17.** The Precision, Recall, and F1 score of base models and E2DR models.

**Figure 18.** The Loss of base models and E2DR models.

**Figure 19.** Training Time (Proposed E2DR vs. baseline deep learning models).

#### **6. Conclusions and Future Directions**

This paper examines different deep learning classification models for distracted driver classification [55–59] and proposes a model that improves performance and provides recommendations. We explored the performance of different models: ResNet50, VGG16, MobileNet, and Inception. All models provided viable means of detecting distracted driver actions. This paper proposes E2DR, a new model that uses stacking ensemble methods to improve accuracy, enhance generalization, and reduce overfitting. Additionally, the model provides a set of recommendations. The highest-performing E2DR variant, which included the ResNet50 and VGG16 models, achieved an accuracy of 92%, while the highest-performing single model was ResNet50 with 88% accuracy. The lowest-performing E2DR model was the MobileNet–Inception variant, which achieved an accuracy of 88%, and the lowest-performing individual model was MobileNet, with an accuracy of 82%. Comparing the highest- and lowest-performing E2DR variants with the highest- and lowest-performing individual models shows a significant increase in performance when using our proposed E2DR model. Other metrics, such as Recall, Precision, and F1 score, were recorded and presented to evaluate the classification performance of the tested base models and E2DR variants, and they showed a similar increase in performance. Furthermore, the performance reports and confusion matrices showed that the E2DR models effectively addressed the weak points of the base models and boosted their classification performance. The computational complexity of developing the E2DR models from scratch is considered a limitation. Since computational speed is important in real-time applications, a light model such as MobileNet can be integrated with ResNet50 or VGG16, which in our experiment showed a significant boost in performance without adding much computational complexity.

For future work, the performance of more than two models combined in the stacking ensemble method can be examined; it was infeasible to test multiple combinations with the limited computational resources used to conduct the experiments. Furthermore, the model could be linked with police departments to fine violators and identify drivers' actions in case of accidents. The model can also be integrated with face recognition and alarm system capabilities [60,61], allowing it to be used in a wide range of applications, such as driver authentication and theft prevention. Finally, the model can be developed and used in autonomous vehicles to detect critical conditions or situations that might endanger the driver's health and safety, such as strokes, heart attacks, and other illnesses. The model can then recommend actions to the vehicle to ensure the safety and health of the driver and others.

**Author Contributions:** Conceptualization, M.A. and R.K.; methodology, M.A. and R.K.; software, M.A.; validation, M.A. and R.K.; formal analysis, M.A. and R.K.; investigation, M.A. and R.K.; resources, M.A.; data curation, M.A.; writing—original draft preparation, M.A. and R.K.; writing review and editing, M.A. and R.K.; visualization, M.A. and R.K.; supervision, R.K.; project administration, R.K.; funding acquisition, R.K. All authors have read and agreed to the published version of the manuscript.

**Funding:** This paper was funded by Ryerson University.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The State Farm Distracted Drivers dataset can be accessed in [47].

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **CNN Architectures and Feature Extraction Methods for EEG Imaginary Speech Recognition**

**Ana-Luiza Rusnac \* and Ovidiu Grigore \***

Department of Applied Electronics and Information Engineering, Faculty of Electronics, Telecommunications and Information Technology, Polytechnic University of Bucharest, 060042 Bucharest, Romania **\*** Correspondence: ana\_luiza.dumitrescu@upb.ro (A.-L.R.); ovidiu.grigore@upb.ro (O.G.)

**Abstract:** Speech is a complex mechanism allowing us to communicate our needs, desires and thoughts. In some cases of neural dysfunctions, this ability is highly affected, which makes everyday life activities that require communication a challenge. This paper studies different parameters of an intelligent imaginary speech recognition system to obtain the best performance according to the developed method that can be applied to a low-cost system with limited resources. In developing the system, we used signals from the Kara One database containing recordings acquired for seven phonemes and four words. We used in the feature extraction stage a method based on covariance in the frequency domain that performed better compared to the other time-domain methods. Further, we observed the system performance when using different window lengths for the input signal (0.25 s, 0.5 s and 1 s) to highlight the importance of the short-term analysis of the signals for imaginary speech. The final goal being the development of a low-cost system, we studied several architectures of convolutional neural networks (CNN) and showed that a more complex architecture does not necessarily lead to better results. Our study was conducted on eight different subjects, and it is meant to be a subject's shared system. The best performance reported in this paper is up to 37% accuracy for all 11 different phonemes and words when using cross-covariance computed over the signal spectrum of a 0.25 s window and a CNN containing two convolutional layers with 64 and 128 filters connected to a dense layer with 64 neurons. The final system qualifies as a low-cost system using limited resources for decision-making and having a running time of 1.8 ms tested on an AMD Ryzen 7 4800HS CPU.

**Keywords:** imaginary speech; convolutional neural network; electroencephalography; signal processing; Kara One database

#### **1. Introduction**

Communication is the basis of interpersonal relationships and is one of the most important ways to connect with other people and to express your needs and feelings. The most common forms of communication are writing or speaking, but the latter is the most natural mechanism involved in the transmission of thoughts. This relatively easy to gain ability is often taken for granted; however, it hides a complex mechanism. Speaking involves translating thoughts into the desired words and transmitting them with the help of motor neurons to a large number of muscles and joint components of the vocal tract that must be positioned differently for each spoken sound. This is why speech takes a large part of cortical motor homunculus [1].

Unfortunately, there are cases when this ability is lost, or speech cannot be articulated, due to conditions such as cerebral stroke, locked-in syndrome, amyotrophic lateral sclerosis, cerebral palsy, etc. In order to overcome this dysfunction, a series of alternative methods have been proposed. The purpose of the research in this field is to find an easy and natural way of communication.

The activity of the brain can be measured using different methods such as electroencephalography (EEG), magnetoencephalography (MEG), electrocorticography (ECoG), functional magnetic resonance imaging (fMRI) and stereoelectroencephalography (sEEG). However, when it comes to developing brain–computer interface (BCI) systems, the most common methods for brain activity recording are EEG and MEG due to their considerable advantages of being non-invasive techniques and more accessible for signal acquisition. The ECoG signals are also widely used in BCI systems, even though they are invasive. The major advantage of the ECoG signals is the quality of the brain activity measurements: the signals are recorded directly from the cortex, which eliminates the attenuation caused by the tissues between the cortex and the electrodes in EEG. fMRI signals are harder to acquire and more expensive than EEG and MEG, even though fMRI is also a non-invasive method. Nevertheless, the best quality of brain activity measurements is obtained using the sEEG technique because the electrodes are implanted deep into the brain. This method is the least used for BCI due to its invasive approach.

**Citation:** Rusnac, A.-L.; Grigore, O. CNN Architectures and Feature Extraction Methods for EEG Imaginary Speech Recognition. *Sensors* **2022**, *22*, 4679. https://doi.org/10.3390/s22134679

Academic Editors: Andrei Velichko, Dmitry Korzun, Alexander Meigal and Raffaele Bruno

Received: 4 April 2022 Accepted: 17 June 2022 Published: 21 June 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

In our study, we chose to focus on the EEG signals for their advantages in developing a low-cost, non-invasive portable device.

#### **2. State of the Art**

One of the first studies that tried to reconstruct speech from EEG signals dates back to 1967, when the scientist Edmond M. Dewan [2] discovered that we can voluntarily control the alpha wave of the EEG signal. Starting from this point, the scientist used Morse code in his system in order to obtain letters and, finally, construct words.

Later studies also focused on creating words from letters so that subjects could silently communicate with the computer. For example, in 2000, P. R. Kennedy et al. [3] used implanted neurotrophic electrodes on patients with amyotrophic lateral sclerosis (ALS) or brain stroke and obtained a functional system that uses the movement of a cursor as a form of communication. One of the system paradigms was to form words from letters by moving a cursor on the monitor and choosing the desired letter. A similar approach presented in [4] concentrated on finding the trigger of the P300 event-related potential (ERP) when the desired line and column of a matrix with letters and numbers were highlighted. Both methods work properly; however, these approaches are an inconvenient way to communicate since it takes a long time to form a word.

Recent studies focused on finding patterns in EEG signals acquired during imaginary speaking of words or phonemes rather than finding a trigger, trying to obtain a more fluent way of communicating thoughts. One attempt at unspoken speech recognition was made by Marek Wester in 2006 [5] for his PhD thesis, with results that reached 50% accuracy in multi-class classification. However, the group later revealed in [6] that the experimental procedure favored the results because the signal acquisition protocol required the subject to speak or think the same stimulus several consecutive times, which inadvertently creates temporal correlations in the EEG signals. This was an important finding for the data acquisition protocols of subsequently created databases.

In 2015, an open-source database, Kara One, acquired by Shunan Zhao and Frank Rudzicz at the Toronto Rehabilitation Institute was released [7]. The database contains signals collected from 14 healthy subjects during thinking and speaking of seven phonemes: /iy/, /uw/, /piy/, /diy/, /tiy/, /m/, /n/ and four words: "pat", "pot", "knew", "gnaw". These stimuli were chosen to have a relatively even number of vowels, plosives, and nasals as well as voiced and unvoiced phonemes. The researchers further created five binary classification tasks: consonant versus vowel (C/V), presence or absence of nasal (±Nasal), presence or absence of bilabial (±Bilabial), presence or absence of the /iy/ phoneme (±/iy/) and presence or absence of the /uw/ phoneme (±/uw/). In their study, the researchers computed various statistical features over 10% of the segment windows with 50% overlap, including mean, median, standard deviation, variance, etc. (the details are specified in Table 1). They used the SVM-quad classifier and obtained the maximum accuracy for the /uw/ phoneme task (79.16%) and the minimum accuracy when classifying consonants versus vowels (18.08%).

Later, in 2017, using the same database, the researchers Pengfei Sun and Jun Qin [8] conducted an experimental evaluation of three neural networks based on EEG-speech (NES) with the purpose of recognizing all the eleven phonemes. The three neural network models were: imagined EEG-speech (NES-I), biased imagined-spoken EEG-speech (NES-B) and gated imagined-speech (NES-G), with the last two introducing the EEG signals acquired during actual speech. The best results in this multi-classification problem were obtained using the NES-G network with an overall accuracy of 41.5%.

Another approach to the Kara One binary classification tasks was proposed by Pramit Saha and Sidney Fels at the University of British Columbia [9]. In their study, the researchers used a mixed deep neural network strategy composed of a convolutional neural network (CNN), a long short-term memory (LSTM) network and a deep autoencoder. The hierarchical deep neural network used the cross-covariance matrix as the input feature matrix, with this method of feature extraction aiming to encode the connectivity of the electrodes. The obtained results increased the overall accuracy of the above binary tasks by 22.5%, achieving an average accuracy of 77.9% across the five known tasks [7].

However, when it comes to multi-class classification of the phonemes and words, the results decrease significantly. In 2018 [10], a group of researchers applied methods from speech recognition to imaginary speech recognition from EEG signals, using mel-frequency cepstral coefficients (MFCC) for feature extraction and an SVM classifier for recognition. They broke the ice with an average accuracy of 20.80%, a value 9% over the chance level. The results slightly improved when using MFCC for feature extraction and a CNN as a classifier in the study [11]. The CNN improved the overall accuracy, obtaining 24.19%.

Nevertheless, the highest accuracy over the multi-class classification of the Kara One phonemes and words was also obtained by the researchers from the University of British Columbia [12]. In their study, the researchers used the cross-covariance matrix (CCV) as feature extraction and a hierarchical combination of deep neural networks. In the first level of the final classifier architecture, a CNN was used to extract the spatial features from the covariance matrix. In parallel with the CNN, they applied a temporal CNN (TCNN) to explore the hidden temporal features of the electrodes. Further, the last fully connected layers of the CNN and TCNN were concatenated to compose a single feature vector, which was fed to the second level of the hierarchy, consisting of a deep autoencoder (DAE). In the third level of the hierarchy, they fed the latent vector of the DAE into an extreme gradient boosting classification layer. The final architecture was first trained on all six phonological tasks of Kara One, and the gained information was then combined to predict individual phonemes from all eleven categories.

Recent studies reported more encouraging results on multi-class classification for imaginary speech recognition. Developing an impressive database of eight different Russian words and one pseudoword acquired from 270 subjects, the researchers [13] obtained a maximum accuracy of 85% when classifying the nine collected words and 88% for binary classification. The results were obtained using the frequency domain of the signals, which were classified with ResNet18 + 2GRU (gated recurrent unit).

Significant results for the imaginary speech recognition community were also obtained by using MEG signals. In 2020, Debadatta Dash, Paul Ferrari and Jun Wang [14] conducted a study based on MEG signals in order to recognize imagined and articulated speech of three different phrases of the English language. To achieve the final goal, the researchers used the discrete wavelet transform (DWT) in the feature extraction stage using a Daubechies (db)-4 wavelet with a seven-level decomposition. Further, they compared artificial neural networks (ANNs) and different configurations of CNNs. The best results were recorded using Spatial Spectral Temporal CNN, reaching an accuracy for the specific three classes of imaginary speech of 93.24%.

ECoG signals were also used for speech recognition and synthesis by Christian Herff et al. in [15]. The researchers managed to synthesize the vocal signals after analyzing motor, premotor and inferior frontal cortices and obtained an accuracy of 66.1% ± 6% in the correct identification of the word of 55 volunteered subjects. This approach offered very encouraging results for a real-time system; however, the brain signals were acquired in articulated speech (not imaginary speech), and the signals were collected using an invasive method.

Another recent study on ECoG signals for imaginary speech recognition, published in 2022, was conducted by Timothée Proix et al. [16]. For binary classification using an SVM classifier, they obtained an accuracy of over 60% for a patient-specific system. In the feature extraction stage, they used the analytic Morlet wavelet transform. The bands of interest were theta (4–8 Hz), low-beta (12–18 Hz), low-gamma (25–35 Hz) and broadband high-frequency activity (80–150 Hz).

Long short-term memory (LSTM) neural networks have also played an important role in the research community for EEG signal classification. LSTM networks are considered an improvement over recurrent neural networks (RNNs) due to the inclusion of "gates" in the algorithm. These "gates" are intended to mitigate the vanishing gradient problem, and they allow more precise control over the information that is kept in the network's memory [17]. Considering the highly dynamic behavior of EEG signals, LSTM networks have often offered significantly better performance in different applications of EEG signals, such as emotion recognition, confusion detection and decision-making prediction [18–20]. A notable success of LSTM neural networks for articulated speech recognition from EEG signals was presented in [21] for an automatic speech recognition (ASR) system. The researchers used MFCC as features and predicted the coefficients using different types of recognition systems: generative adversarial networks (GAN), Wasserstein generative adversarial networks (WGAN) and LSTM regression. The results showed an average root mean square (RMS) of 0.126 for the LSTM regression compared with 0.193 and 0.188 registered for the GAN and WGAN networks, respectively.

The most significant results from the state-of-the-art, regarding the imaginary speech recognition systems using surface EEG of the Kara One Database are presented in Table 1, along with the most relevant characteristics of the systems: pre-processing method, feature extractions and the classifier used.

This paper contains a study of EEG signals with the main purpose of recognizing seven phonemes and four words acquired during the development of the Kara One database. Our study was conducted on eight different subjects and is meant to be a subject's shared system. By a subject's shared system, we mean a system that can only be used by subjects in the database. However, it is not a subject-specific device that would require different training for each new subject but assumes that only a fine-tuning will be performed when adding a new subject.

This paper also aims to develop a study of two different features computed over different windows of a signal. We used as feature extraction the cross-covariance over the channels in time and frequency domains for data reduction and to encode the variability of the electrodes during the imaginary speech. This hypothesis is based on the fact that speaking is a complex mechanism, implying the connectivity of different areas of the brain during the entire process. We also studied the results obtained after applying a mean filter over the spectrum band with different window dimensions (3 and 5 samples).

Another study conducted in this paper was based on analyzing three different timeframes: 0.25, 0.5 and 1 s. With this study, we aimed to determine the best analysis window dimension for EEG imaginary speech phoneme and word recognition. In a time series, the statistics of the entire signal are different from the statistics of smaller windows, a fact that can have a significant impact on the final results of the system.

In the second part of the study, we focused on different CNN architectures for feature classification in order to determine which one fits our data best.


**Table 1.** State-of-the-art EEG speech recognition of Kara One database phonemes and words.

#### **3. Materials and Methods**

#### *3.1. Preparing Database*

In this paper, we used the Kara One database described in [7]. The database contains signals acquired from 12 healthy subjects in 14 sessions during rest, speaking and thinking eleven different stimuli from which seven are phonemes (/iy/, /uw/, /piy/, /tiy/, /diy/, /m/, /n/) and four are words ("gnaw", "knew", "pat", "pot"). Each prompt was presented 12 times, meaning a total of 132 recorded signals for each subject, except for the subjects MM05 and P02, with a total of 165 trials.

The signals were acquired following a given protocol in order to obtain repeatability in the database signals. The protocol started with a 5 s state of rest in which the subject needed to relax for the next stage. Afterward, the stimulus appeared on the prompt for 2 s, and the utterance of the prompt was heard by the subject. This was followed by a 5 s stage in which the subject was instructed to imagine speaking the prompt. Finally, the subject was also asked to speak the prompt aloud.

Our goal was to identify the imagined speech, so in this paper, we only used the signals corresponding to the 5 s state of imaginary speaking of the prompt. Next, we eliminated the first and last 0.5 s of the signal, considering that these intervals correspond to a transition state, obtaining a 4 s signal in the end.

The signals resulting from the database were visually analyzed by an expert. In the first step of visual data analysis, it was discovered that six of the fourteen sessions presented signals with high noise levels or unattached ground wires. This situation was also discussed by the developers of the database, Shunan Zhao and Frank Rudzicz, in their paper [7]. Considering that discovery, we discarded all signals from the six contaminated sessions. Afterward, the expert visually analyzed all signals corresponding to the thinking intervals and eliminated from the study the ones with high noise contamination. After this process of data analysis, we finally obtained a database with 624 signals to work with during the study. All signals from the database were collected using the 10-20 system for electrode positioning. In this paper, we used 62 electrodes. The electrodes and their positions in the 10-20 system are detailed in [7]. Finally, the signals were filtered using a notch filter in order to remove the 60 Hz power line artifact and all multiples of 60 Hz smaller than the Nyquist frequency.
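As a small illustration of which frequencies such a notch stage would target, the sketch below lists the power-line frequency and its multiples that lie below the Nyquist frequency. The function name `notch_frequencies` and the 1 kHz sampling rate in the usage line are assumptions made for this example only; the sampling rate is not stated in this section.

```python
def notch_frequencies(fs_hz: float, base_hz: float = 60.0) -> list[float]:
    """Return base_hz and its integer multiples that are below Nyquist."""
    nyquist = fs_hz / 2.0
    freqs = []
    k = 1
    while k * base_hz < nyquist:
        freqs.append(k * base_hz)
        k += 1
    return freqs


# Assumed sampling rate (hypothetical value, used only for illustration).
harmonics = notch_frequencies(1000.0)
print(harmonics)
```

Each returned frequency would correspond to one notch in the filtering stage; the filter design itself is not shown here.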

#### *3.2. Feature Extraction*

In the feature extraction stage, we aimed to analyze the performance of the system when using the time- versus frequency-domain feature extraction methods for silent speech recognition. Another comparison study conducted in this stage was based on computing the features using different timeframes: 0.25, 0.5 and 1 s without overlapping. During this study, we aimed to find the time window in which the signal is quasi-stationary, but also contains all the needed information regarding the utterance.

All signals were segmented using these timeframes, and 50% of the timeframes from each recording were randomly distributed in the training set and 50% in the testing set.
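The trimming, segmentation and random split described above can be sketched as follows. This is a minimal example under stated assumptions: the helper names `trim_and_segment` and `split_half`, the 1 kHz sampling rate and the random input data are all hypothetical and are not taken from the original pipeline.

```python
import numpy as np


def trim_and_segment(trial, fs_hz, window_s, trim_s=0.5):
    """Drop trim_s seconds at both ends, then cut non-overlapping windows.

    trial: array of shape (channels, samples) for one imagined-speech trial.
    Returns an array of shape (n_windows, channels, window_samples).
    """
    trim = int(round(trim_s * fs_hz))
    win = int(round(window_s * fs_hz))
    core = trial[:, trim:trial.shape[1] - trim]
    n_windows = core.shape[1] // win
    core = core[:, :n_windows * win]
    # (channels, n_windows, win) -> (n_windows, channels, win)
    return core.reshape(trial.shape[0], n_windows, win).transpose(1, 0, 2)


def split_half(windows, rng):
    """Randomly send half of the windows to training and half to testing."""
    order = rng.permutation(len(windows))
    half = len(windows) // 2
    return windows[order[:half]], windows[order[half:]]


fs = 1000  # assumed sampling rate in Hz (not given in the text)
trial = np.random.default_rng(0).standard_normal((62, 5 * fs))  # synthetic 5 s trial
windows = trim_and_segment(trial, fs, window_s=0.25)
train_set, test_set = split_half(windows, np.random.default_rng(1))
print(windows.shape, train_set.shape, test_set.shape)
```

With these assumed values, a 5 s trial is reduced to 4 s and yields sixteen 0.25 s windows, eight of which go to each set.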

EEG data usually produce a high-dimension time series due to the multiple electrodes. To decrease the dimension of EEG data, usually a data compression stage is conducted based on feature selection in order to extract the essential information from the signals [10] or to reduce the number of channels based on their informational relevance in relation to the system goal [22]. A new approach to reducing the data dimension of the EEG signals was presented by Pramit Saha and Sidney Fels in their study [9], where they computed the cross-covariance between the channels in the time domain in order to encapsulate the variability of the electrodes. In this study, we also used this technique of feature extraction and expanded it in the frequency domain.

The cross-covariance between two channels (c1 and c2) was defined in this study as:

$$\mathrm{Cov}\left(X^{c1}(t), X^{c2}(t)\right) = \mathbb{E}\left[\left(X^{c1}(t) - \mathbb{E}\left[X^{c1}(t)\right]\right)\left(X^{c2}(t) - \mathbb{E}\left[X^{c2}(t)\right]\right)\right],\tag{1}$$

where X<sup>c1</sup>(t) represents the EEG signal acquired for channel c1, X<sup>c2</sup>(t) is the EEG signal acquired for channel c2, and E[X<sup>ch</sup>(t)] represents the expected value (where ch corresponds to the specific channel c1 or c2) and is computed as:

$$\mathbb{E}\left[X^{ch}(t)\right] = \frac{1}{W} \sum_{i=0}^{W-1} x_{i}^{ch} \tag{2}$$

The W value of Equation (2) corresponds to the window dimension for which the features are computed.
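A minimal NumPy sketch of Equations (1) and (2), applied to every pair of channels in one analysis window, is given below. The function name `channel_cross_covariance` and the random input are assumptions for illustration; the expectation is taken as the mean over the W samples of the window, which is how Equation (2) defines it.

```python
import numpy as np


def channel_cross_covariance(window):
    """Cross-covariance between every pair of channels.

    window: array of shape (channels, W).
    Returns a (channels, channels) feature matrix.
    """
    n_samples = window.shape[1]                               # W in Equation (2)
    centered = window - window.mean(axis=1, keepdims=True)    # subtract E[X^ch]
    return centered @ centered.T / n_samples                  # mean of products, Equation (1)


rng = np.random.default_rng(0)
window = rng.standard_normal((62, 250))   # synthetic 62-channel window
features = channel_cross_covariance(window)
print(features.shape)
```

For 62 channels this produces a 62 × 62 symmetric matrix, regardless of the window length.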

The second method of feature extraction analyzed in this paper transforms the time-domain series of EEG signals into the frequency domain using the Fast Fourier Transform (FFT). The Fourier transform is a method used to decompose the signal into sine and cosine waves.

The FFT of a channel was computed using the following:

$$FX^{ch}(f) = \sum_{t=0}^{n-1} X_{t}^{ch} \, e^{-\frac{j2\pi f t}{n}} \tag{3}$$

where X<sup>ch</sup><sub>t</sub> represents the sample at time index t of the EEG signal acquired for channel ch, f is the frequency index and n is the number of samples in the analysis window.

After computing the signals corresponding to the frequency-domain of desired channels using Equation (3), we computed the cross-covariance between the Fourier transform of the channels.
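The frequency-domain variant can be sketched in the same way. The text does not say how the complex FFT values enter the covariance, so this example assumes the magnitude spectrum is used; that choice, the function name `spectrum_cross_covariance` and the random input are assumptions for illustration only.

```python
import numpy as np


def spectrum_cross_covariance(window):
    """Cross-covariance between the channel spectra of one window.

    window: array of shape (channels, n) in the time domain.
    Returns a real (channels, channels) feature matrix.
    """
    # Equation (3) per channel; taking the magnitude is an assumption.
    spectrum = np.abs(np.fft.fft(window, axis=1))
    centered = spectrum - spectrum.mean(axis=1, keepdims=True)
    return centered @ centered.T / spectrum.shape[1]


rng = np.random.default_rng(0)
window = rng.standard_normal((62, 250))   # synthetic 62-channel window
freq_features = spectrum_cross_covariance(window)
print(freq_features.shape)
```

The output has the same 62 × 62 shape as the time-domain feature matrix, so both can feed the same classifier input layer.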

Figures 1 and 2 present examples of a 2D feature matrix with a 62 × 62 dimension, corresponding to the time and frequency domain, respectively, for a 0.25 s window timeframe.

**Figure 1.** Example of a 2D feature matrix computed in the time domain for 0.25 s time window.

**Figure 2.** Example of a 2D feature matrix computed in the frequency domain for a 0.25 s time window.

#### *3.3. Classification*

Convolutional neural networks (CNNs) are powerful networks when applied to images. They can capture the image content and extract the deep information encoded in the input data. Nowadays, many systems are based on this type of neural network. CNNs have shown great success in understanding biomedical images for classification, segmentation, detection and localization [23] for different types of input images, have offered a low false prediction rate in seizure prediction systems based on EEG signals [24], and are widely used in BCI systems for imaginary motion recognition [25–27] and for assisting in the diagnosis of Parkinson's disease [28]. In the imaginary speech recognition domain, CNNs have been a valuable resource for EEG signal classification [9,27].

The success of CNNs is due to the design of the hidden convolutional layers, which work as a decoder for the essential information hidden in the two-dimensional matrix offered as input. They extract features and feed them to the dense layers designed to classify these computed features. A CNN starts with an input layer that receives the given data. It continues with the hidden convolutional layers, which interpret the data received from the input. The output of the last convolutional layer is flattened and introduced into one or multiple dense layers whose purpose is to learn the extracted features. Finally, the neural network contains an output layer, which usually has the role of classifying the data into the desired classes [29–31]. A general CNN block diagram is presented in Figure 3.

**Figure 3.** Block diagram of general convolutional neural networks.

In our research, we tested different CNN architectures with the purpose of finding the one offering the best performance with respect to complexity, memory and running time. We started with a low-complexity architecture with one convolutional layer and one dense layer, and we increased the complexity up to three convolutional layers and one dense layer, with a larger number of filters and neurons.

In the training phase, we used a learning rate of 0.0001, categorical cross-entropy as loss and Adam as optimizer. We divided the training set into 75% training and 25% validation and used k-fold cross-validation in order to obtain a more accurate performance result. Figure 4 presents an example of the architecture used in the classification stage, with two convolutional layers of 64 and 128 filters, respectively, and one dense layer with 64 neurons.

**Figure 4.** Block diagram of the CNN architecture used in the classification stage.

#### **4. Results**

During the development of the system, we aimed to study six different variables capable of influencing the performance of the imaginary speech recognition. Our analysis of system performance included: (a) the influence of CNN hyperparameters; (b) modification of the network architecture; (c) the impact of the different activation functions that can be used in the CNN; (d) different features capable of encoding the hidden speech information, obtained by computing the covariance of the signals over the channels in the time and frequency (B0) domains; (e) different window dimensions for the feature extraction method; (f) an average filter with kernels of three (B3) and five (B5) samples applied over the computed spectrum of the data.

To simplify the presentation of the results for the different architecture models, we used the abbreviations explained in Table 2. As an example, the architecture Conv2D (64, 128, 64)-Dense (64) corresponds to a CNN with three convolutional layers, with 64, 128, and 64 filters, respectively, and one dense layer with 64 neurons. For all architectures, the dense layer was followed by an output layer with 11 neurons, corresponding to the 11 different classes.

**Table 2.** Convolutional Neural Network architecture abbreviations.


The Kara One database does not show a significant class imbalance. The number of samples in each class ranges from a minimum of 83 (phoneme /m/) to a maximum of 95 (word "pot") out of a total of 993. The a priori probability therefore ranges from 0.083 for the /m/ phoneme to 0.095 for the word "pot".

#### *4.1. Comparison of Activation Function: Tanh vs. Relu*

The results obtained over the test set using different architectures of the CNN and different activation functions for the convolutional layers (hyperbolic tangent vs. rectified linear unit) using the covariance of the spectrum without an average filter (B0) computed over 0.5 s windows are detailed in Table 3.


**Table 3.** Results obtained using different CNN architectures for the covariance of spectrum features computed over a 0.5 s window comparing the hyperbolic tangent activation function of convolutional layers with the rectified linear unit.

#### *4.2. Comparison of Features: Time vs. Frequency*

Further in our study, we also compared the features computed over the signal in the time and frequency domains. The results obtained using the different tested architectures are presented in Table 4. The accuracy is significantly lower when using time-domain cross-covariance than when using frequency-domain features. The difference between the two feature extraction methods reaches approximately 16%, with the frequency features reaching a maximum accuracy of 37% and the time-domain features a maximum of only 21%. These differences suggest that speech information is more easily decoded by the neural network in the frequency domain than in the time domain. The main advantage of the covariance in the frequency domain is the elimination of possible delays in the propagation of the stimulus over the channels, starting from the source activation of the specific imaginary articulation of the phoneme.

A study of different architectures of the neural network shows (Figure 5) that a CNN with two convolutional layers with 64 and 128 filters, connected to a dense layer with 64 neurons, works best for the frequency-domain features (the features that provided the best accuracy rate), obtaining an accuracy of 37%. When it comes to the time domain, the best results were obtained using less complex architectures, and the best performance of the system was recorded using only one convolutional layer with 64 filters and one dense layer with 64 neurons.


**Table 4.** The results obtained after computing the different feature extractions: in the time domain and frequency domain over windows of 0.25 s.

**Figure 5.** Accuracy for different architectures with a 0.25 s time window.

The mean confusion matrices over all k folds for the time and frequency features are presented in Figure 6. In both images, we can see a distinction between phonemes and words. The system has a difficult time distinguishing one phoneme from another but makes a clearer distinction between the phonemes and the words. We can also observe that the phoneme /diy/ is often confused with similar phonemes such as /iy/ and /piy/. It can also be seen that there is no significant imbalance in the recognition of any of the phonemes and words; however, the words have a higher recognition accuracy.

**Figure 6.** Mean of k-fold confusion matrix in the time and frequency domains.

#### *4.3. Comparison of Time Window Length: 0.25, 0.5 and 1 s*

After we concluded that the system works better with a rectified linear unit as an activation function for the convolutional layers in the frequency domain, we tested the network with different window lengths for the input data. The results are presented in Table 5.

**Table 5.** Comparison of different window length (0.25, 0.5, 1 s) results for the covariance of spectrum features without an average filter (B0).


The 4 s EEG signal containing imaginary speech includes multiple imaginations of the specific stimulus. It is hard to precisely determine the moment containing the desired signal in the whole four seconds of recording, which is why we chose to segment the signal over different window lengths and observe the system behavior. Table 5, as well as Figure 7, shows that the best analysis window is 0.25 s, reaching an accuracy of 37%. Looking at the 0.5 and 1 s window lengths, we can observe that the 0.5 s offered an accuracy close to the 0.25 s window, meaning that the signals are still easier to decode compared to the 1 s window in which the accuracy significantly dropped to 29%. The mean confusion matrices for the 0.5 s window and 1 s window are presented in Figure 8.

**Figure 7.** Maximum accuracy for the analyzed windows.

**Figure 8.** Mean of k-fold confusion matrix in the frequency domain for 0.5 s and 1 s windows.

#### *4.4. Comparison of Mean Filter Kernel: B0, B3 and B5*

Another study conducted in this paper focused on applying different average filter lengths over the spectrum before computing the covariance matrix. We tested two different filter lengths: three samples and five samples. We will further refer to the spectrum without a mean filter as B0, the spectrum with an applied mean filter length of three samples as B3, and the spectrum with an applied mean filter length of five samples as B5. The obtained results can be seen in Table 6. The main motivation for this approach was developed on the assumption that the analysis of multiple values of the spectrum, as opposed to analyzing only the local values, can offer a better perspective of the frequency distribution regarding different classes. This assumption did not stand up because, as can be seen in Table 6 and in Figure 9, the better accuracy results were obtained using the unmodified spectrum. These results imply that every frequency is important for the phoneme and word recognition problem.
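The B0, B3 and B5 variants can be illustrated with a moving-average filter applied along the frequency axis of each channel spectrum. This is only a sketch: the function name `mean_filter_spectrum` is hypothetical, and the zero-padded edge handling (`mode="same"`) is an assumption, since the text does not describe how the spectrum borders were treated.

```python
import numpy as np


def mean_filter_spectrum(spectrum, kernel_size):
    """Moving average along the frequency axis of a (channels, bins) array.

    kernel_size <= 1 returns the unmodified spectrum (the B0 case).
    """
    if kernel_size <= 1:
        return spectrum.copy()
    kernel = np.ones(kernel_size) / kernel_size
    # Zero-padded borders are an assumption made for this example.
    return np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), 1, spectrum
    )


toy_spectrum = np.array([[1.0, 2.0, 3.0, 4.0, 5.0]])  # toy values, not EEG data
b0 = mean_filter_spectrum(toy_spectrum, 1)
b3 = mean_filter_spectrum(toy_spectrum, 3)
b5 = mean_filter_spectrum(toy_spectrum, 5)
print(b0, b3, b5)
```

The filtered spectrum would then replace the raw spectrum as the input to the cross-covariance computation.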


**Table 6.** Comparison between the results obtained after applying different kernels for the average filter of the spectrum. The analysis window is 0.5 s.

**Figure 9.** Maximum accuracy for the feature extraction methods analyzed.

#### *4.5. Performance Evaluation Metrics*

For a better understanding of the recorded results and the system performance, we introduced the computed values for all extracted features: the balanced accuracy, kappa and recall [32]. The obtained values are presented in Table 7.



According to Table 7, the balanced accuracy and recall are not significantly different from the computed accuracies for the features, and only the kappa score dropped to a value of approximately 0.7 for all features.
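For reference, the three metrics can be derived from a confusion matrix as in the sketch below. The function name `metrics_from_confusion` and the 2 × 2 toy matrix are assumptions for illustration; the toy values are not results from this study. Balanced accuracy is computed as the mean per-class recall, and kappa as Cohen's kappa.

```python
import numpy as np


def metrics_from_confusion(cm):
    """cm[i, j] = number of samples of true class i predicted as class j."""
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    accuracy = np.trace(cm) / total
    recall = np.diag(cm) / cm.sum(axis=1)            # per-class recall
    balanced_accuracy = recall.mean()                # mean of per-class recalls
    # Agreement expected by chance from the row and column totals.
    expected = (cm.sum(axis=1) @ cm.sum(axis=0)) / total**2
    kappa = (accuracy - expected) / (1.0 - expected)
    return accuracy, balanced_accuracy, kappa, recall


toy_cm = [[8, 2], [4, 6]]  # toy example, not data from the paper
acc, bal_acc, kappa, recall = metrics_from_confusion(toy_cm)
print(acc, bal_acc, kappa, recall)
```

Because kappa subtracts the chance agreement, it is lower than the raw accuracy whenever the classifier is imperfect and chance agreement is non-zero.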

#### *4.6. Complexity and Memory Measurements*

This paper aimed to develop a low-cost system working with limited resources. To achieve this goal, we tested different architectures of CNN networks for different types of features and windows. This research helped us to determine the best CNN architecture, features and window frame that can be implemented on a device with limited resources.

Given the application, the most significant resource consumer is the neural network. For the CNN architecture, the complexity of the algorithm can be estimated as O(k × N × M × nF<sub>L−1</sub> × nF<sub>L</sub>), where k is the size of the kernel matrix, N is the number of rows of the input matrix, M is the number of columns of the input matrix, nF<sub>L−1</sub> is the number of filters in the previous convolutional layer and nF<sub>L</sub> is the number of filters in the current layer. In our case, the input matrix has the same number of rows and columns (M = N = 62), and we can write the complexity as O(k × N<sup>2</sup> × nF<sub>L−1</sub> × nF<sub>L</sub>). The details of the complexity, memory and time for the feature extraction stage and the best-performing CNN architecture are presented in Table 8.
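To make the estimate concrete, the sketch below evaluates the product k × N × M × (filters in the previous layer) × (filters in the current layer) for each convolutional layer and sums the results. Several values are assumptions for illustration only: a 3 × 3 kernel (k = 9), a single-channel 62 × 62 input, and feature maps that keep the 62 × 62 size through the layers. The function names are hypothetical.

```python
def conv_layer_cost(k, n_rows, n_cols, filters_prev, filters_cur):
    """Multiply-accumulate estimate for one convolutional layer."""
    return k * n_rows * n_cols * filters_prev * filters_cur


def network_cost(filters, k=9, n=62, input_channels=1):
    """Sum the per-layer estimates for a stack of convolutional layers.

    Assumes (for illustration) that every layer sees an n x n feature map.
    """
    total = 0
    prev = input_channels
    for cur in filters:
        total += conv_layer_cost(k, n, n, prev, cur)
        prev = cur
    return total


# Two convolutional layers with 64 and 128 filters.
cost_two_layers = network_cost([64, 128])
# Adding a third layer with 64 filters, as in the most complex model family.
cost_three_layers = network_cost([64, 128, 64])
print(cost_two_layers, cost_three_layers)
```

Under these assumptions the second layer dominates the two-layer estimate, because its cost scales with the product of both filter counts.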


**Table 8.** Detailed complexity, memory and time computation for the system with the best results.

Using an AMD Ryzen 7 4800HS CPU with a 2.9 GHz clock frequency and 16 GB of RAM, we obtained an average time per recognized input vector of 1.8 × 10<sup>−3</sup> s, from the feature extraction stage up to the decision making. The time was estimated (Table 8) using the characteristics of the best system in terms of performance, meaning computing the output for the 0.25 s window vector with the C64-128/D64 neural network architecture (Table 5).

A comparison of the methods in terms of execution time, as can be observed in Table 9, shows that there are no significant differences between the execution times of the different features; however, there is approximately an order of magnitude between the best-performing architectures and the most complex one tested.

**Table 9.** Execution time for all tested architectures and features.


#### **5. Discussion**

This paper aims to compare different parameters of an intelligent imaginary speech recognition subject's shared system to observe the performance variation when using different mechanisms of feature extraction and different architectures of CNN in the classification stage.

We used the Kara One database in our study, designed and conducted at the Toronto Rehabilitation Institute by Shunan Zhao and Frank Rudzicz [7], which contains signals acquired during speech and imaginary speech of seven phonemes and four words.

During the recognition process, we pre-processed the signals, and after the visual inspection, eliminated all data from subjects containing electrodes with bad connectivity and the signals with high noise. Furthermore, in the pre-processing stage, we applied a notch filter to remove the 60 Hz power line artifact and all multiples of 60 Hz smaller than the Nyquist frequency. It is worth mentioning that, in our study, we kept all high-frequency information.

#### *5.1. Time vs. Frequency Features*

After the pre-processing stage, we went through a feature extraction stage where we focused on comparing the feature extraction based on cross-covariance over the channels in the time and frequency domains. The cross-covariance method is based on the fact that speech is a complex mechanism, requiring thinking of the speech stimulus, preparing the vocal tract for the actual vocalization and giving the signal to all components of the vocal tract involved in the actual speaking of the stimulus. For different stimuli, there are different positions and components involved in the process. This mechanism demands the activation of multiple areas of the brain that communicate in a very short time. The connections between different areas are best highlighted by the cross-covariance between the channels. The results presented in Table 4 show that there is a considerable difference between the results obtained using time-domain feature extraction and frequency-domain feature extraction. When using frequency-domain features, the accuracy increases by approximately 16% to a value of 0.37 compared to 0.21 obtained when using features in the time domain. This difference is explained by the fact that the signal spectrum eliminates the delays of the stimulus propagation over the channels, starting from the activation focus of the specific imaginary articulation of the phoneme.

#### *5.2. Time-Window Analysis*

Another study conducted in this paper compares different sizes of the analysis window in order to observe the signal statistics over different time spans. During this study, we aimed to find the time window in which the signal is quasi-stationary but also contains all the needed information regarding the utterance. We compared three analysis window sizes: 0.25, 0.5 and 1 s. The obtained results can be seen in Table 5. Comparing the window dimensions, we observed that the best time window length was 0.25 s. The accuracy is significantly higher when using 0.25 s, increasing to a value of 0.37, compared to 0.29 when using a 1 s window. The difference between the accuracy of the 0.25 s window and the 0.5 s window is 1%, which is not very significant. This means that for a 0.5 s window, the utterance of the phonemes and words is still captured by the frame.

Analyzing the results in Table 5 also shows that the maximum accuracy for all timeframe windows was obtained using a low-complex architecture for the CNN.

#### *5.3. Mean Filter over the Spectrum Analysis*

During our research for improvement, we also tried to average the spectrum of the signals with a filter of three and five samples. The main motivation for this approach was developed on the assumption that the analysis of multiple values of the spectrum, compared to analyzing only the local values, can offer a better perspective of the frequency distribution regarding different classes. The details of this research are presented in Table 6. As can be seen, applying an average filter over the spectrum did not increase the accuracy; on the contrary, the accuracy dropped by approximately 9% when using filters with three and five samples.
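A minimal sketch of such a mean (moving-average) filter is given below. The function name `mean_filter`, the edge handling (only positions where the kernel fully overlaps the input are kept), and the toy spectrum are assumptions; the text does not specify these details.

```python
def mean_filter(spectrum, kernel_size):
    """Average each run of `kernel_size` neighbouring spectral values.

    Only positions where the kernel fully overlaps the input are returned,
    so the output is shorter than the input by kernel_size - 1 samples.
    """
    if kernel_size < 1 or kernel_size > len(spectrum):
        raise ValueError("kernel_size must be between 1 and len(spectrum)")
    return [
        sum(spectrum[i:i + kernel_size]) / kernel_size
        for i in range(len(spectrum) - kernel_size + 1)
    ]


toy_spectrum = [0.0, 3.0, 6.0, 3.0, 0.0, 9.0, 0.0]  # hypothetical magnitudes
smoothed_3 = mean_filter(toy_spectrum, 3)  # three-sample kernel
smoothed_5 = mean_filter(toy_spectrum, 5)  # five-sample kernel
print(smoothed_3)
print(smoothed_5)
```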

#### *5.4. CNN Architectures Analysis*

In our final study, we tested different architectures for the CNN network to observe the system performance and shape the way for the future development of similar systems. We concluded that when it comes to the frequency-domain features (the features that provided the best accuracy rate), the best architecture is two convolutional layers with 64 and 128 filters connected to a dense layer with 64 neurons. More complex architectures do not improve the performance of the system, and on the contrary, the performance decreases.

#### **6. Conclusions**

This paper analyses the EEG signals for imaginary speech recognition of seven phonemes and four words. To accomplish our purpose, we developed an intelligent subject's shared system using a processing chain applied to the Kara One database [7]. The first stage in the analysis chain started with pre-processing the input signals in order to obtain better quality data. Further in the feature extraction stage, we compared the results obtained after computing the cross-covariance over the channels in the time and frequency domains. During our research, we also studied different time window lengths: 0.25, 0.5 and 1 s to find the time window in which the signal is quasi-stationary but also contains all the information needed regarding the utterance. We also studied the system behavior when applying a mean filter with kernel sizes of three and five samples assuming that the analysis of multiple values of the spectrum, compared to analyzing only the local values, can offer a better perspective of the frequency distribution regarding different classes. Finally, in the classification stage, we tested multiple architectures of the CNN neural network to determine the best performance of the system.

The best results were obtained using the cross-covariance over channels in the frequency domain using a 0.25 s window length. The best performance of the system was recorded when using a CNN with two convolutional layers and 64 and 128 filters, connected to a dense layer with 64 neurons. With these system characteristics, we achieved an accuracy of 37%, a significant improvement compared to using the Mel-Cepstral Coefficients for feature extraction, where the best accuracy recorded was 20.80% when using an SVM as the classifier [10] and 24.19% when using a CNN as the classifier [11]. During our study, we also showed that cross-covariance in the frequency domain offers a better understanding of the imaginary speech, reporting a better accuracy in comparison to the study made by Pramit Saha, Muhammad Abdul-Mageed and Sidney Fels in [12] where, using the cross-covariance in time and hierarchical deep learning (without phonological features), the best reported accuracy was 28%. However, when using phonological features, the accuracy increased to 54%, but this compromised the complexity and the memory of the system and is more difficult to implement in a low-complexity portable device.

The main limitation of our proposed system is the need to acquire new data from each new subject before that subject can use the system. The collected data must be added to the database, and the network must then be fine-tuned on it. However, this limitation can be overcome in time by enriching the database with new examples.

In this study, we proposed a feature extraction method based on cross-covariance in the frequency domain that offered a significant improvement for the system performance compared to features computed in the time domain. We are confident that these features can be further exploited to obtain even more precise systems for imaginary speech recognition.

In this paper, we achieved our goal of highlighting the importance of using frequency in the feature extraction stage in contrast to the time domain. The advantage of using the frequency domain is given by the elimination of the delays caused by the propagation of the stimulus from one channel to another during the imaginary articulation of the speech. We also showed that a quicker analysis of the signal offers a better understanding of the thinking speech.

Finally, we can say that the proposed system qualifies as a portable, low-cost system using limited resources for decision making. The running time for the best performance CNN architecture was 1.8 ms tested on an AMD Ryzen 7 4800HS CPU.

**Author Contributions:** Conceptualization, O.G.; Data curation, A.-L.R.; Methodology, A.-L.R. and O.G.; Software, A.-L.R.; Supervision, O.G.; Writing – original draft, A.-L.R.; Writing – review & editing, O.G. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Ethical approval was obtained from both the University of Toronto and the University Health Network, of which Toronto Rehab is a member.

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Data Availability Statement:** http://www.cs.toronto.edu/~complingweb/data/karaOne/karaOne.html (accessed on 16 June 2022).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Diagnosis and Prognosis of COVID-19 Disease Using Routine Blood Values and LogNNet Neural Network**

**Mehmet Tahir Huyut 1,\* and Andrei Velichko 2,\***


**Abstract:** Since February 2020, the world has been engaged in an intense struggle with the COVID-19 disease, and health systems have come under tragic pressure as the disease turned into a pandemic. The aim of this study is to obtain the most effective routine blood values (RBV) in the diagnosis and prognosis of COVID-19 using a backward feature elimination algorithm for the LogNNet reservoir neural network. The first dataset in the study consists of a total of 5296 patients with the same number of negative and positive COVID-19 tests. The LogNNet-model achieved the accuracy rate of 99.5% in the diagnosis of the disease with 46 features and the accuracy of 99.17% with only mean corpuscular hemoglobin concentration, mean corpuscular hemoglobin, and activated partial prothrombin time. The second dataset consists of a total of 3899 patients with a diagnosis of COVID-19 who were treated in hospital, of which 203 were severe patients and 3696 were mild patients. The model reached the accuracy rate of 94.4% in determining the prognosis of the disease with 48 features and the accuracy of 82.7% with only erythrocyte sedimentation rate, neutrophil count, and C reactive protein features. Our method will reduce the negative pressures on the health sector and help doctors to understand the pathogenesis of COVID-19 using the key features. The method is promising to create mobile health monitoring systems in the Internet of Things.

**Keywords:** COVID-19; biochemical and hematological biomarkers; routine blood values; feature selection method; LogNNet neural network; Internet of Medical Things; IoT

#### **1. Introduction**

The new severe acute respiratory syndrome coronavirus (SARS-CoV-2), first identified in 2019, has rapidly affected the world and caused a pandemic [1,2]. The disease, identified as coronavirus 2019 (COVID-19), can cause severe pneumonia and fatal acute respiratory distress syndrome (ARDS) [3–6]. While the disease may be asymptomatic, severe ARDS is thought to be caused by an inflammatory cytokine storm that may be encountered during the disease period [6,7]. The pathogen can cause a serious respiratory disorder that requires special intervention in intensive care units (ICUs) and, in some cases, may cause death [6,7]. Moreover, the symptoms of COVID-19 induced by the new SARS-CoV-2 are difficult to distinguish from known infections in the majority of patients [6,8,9].

Previous studies have demonstrated the clinical importance of changes in routine blood parameters (RBV) in the diagnosis and prediction of prognosis of infectious diseases [1–4,10–12]. Similarly, many abnormalities have been reported in the peripheral blood of patients infected with COVID-19 [6,7,11]. However, Jiang et al. [13] and Zheng et al. [14] emphasized that information on early predictive factors for particularly severe and fatal COVID-19 cases is relatively limited and further research is needed. Huyut et al. [6] and Lippi et al. [15] described that the rapid spread of disease in pandemics overwhelms health systems and raises concerns about the need for intensive care treatment [6,15]. In addition, the detection of severe and mild patients in COVID-19 is an important and clinically difficult process in terms of morbidity and mortality [6]. Despite these clinical features of COVID-19, studies with large samples representing laboratory abnormalities of patients are needed [3,16]. Therefore, the relationship between COVID-19 disease and RBVs should be supported by large datasets.

**Citation:** Huyut, M.T.; Velichko, A. Diagnosis and Prognosis of COVID-19 Disease Using Routine Blood Values and LogNNet Neural Network. *Sensors* **2022**, *22*, 4820. https://doi.org/10.3390/s22134820

Received: 24 May 2022 Accepted: 23 June 2022 Published: 25 June 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Studies have sought how to determine whether patients who are likely to benefit from supportive care and early intervention are at risk and how to identify them [6,11]. While new tests are being developed for the diagnosis of COVID-19, Banerjee et al. [8] stated that these applications require specialized equipment and facilities. Estimating the diagnosis and prognosis of diseases without using advanced devices and methods can help with various problems, such as patient comfort, as well as health system and economic inefficiencies. For this purpose, Beck et al. [17] and Xu et al. [18] have reported that more economical and faster alternative methods are being developed to assist clinical procedures.

Uncertainties in the routine blood values of COVID-19 patients, in addition to difficulties in diagnosis and treatment have increased the interest in machine learning (ML) and artificial intelligence (AI) approaches. Artificial intelligence models have the power to reveal hidden relationship structures between features [19]. Artificial intelligence approaches are frequently used in real-time decision making to reduce drug costs, improve patient comfort, and improve the quality of healthcare services [5,19].

There are several artificial intelligence methods to predict the diagnosis and mortality of COVID-19 [4,17]. Most of these studies have relied on computed tomography (CT) [19], while far fewer studies relied on RBVs [4,5,20]. Imaging-based solutions are costly, time-consuming, and require specialized equipment [20]. Diagnosis based on RBV values can provide an effective, rapid, and cost-effective alternative for the early detection and prognosis of COVID-19 cases [5,20,21].

Previous AI studies did not use most of the RBV parameters and reported relatively poor classifier performance compared to the current study [2,3,5,6]. In addition, previous studies [8,19–25] have generally focused on the early diagnosis of COVID-19 disease and have addressed relatively smaller samples. Artificial intelligence studies on predicting the prognosis of the disease and detecting severely or mildly infected patients in the early period based on RBVs alone are insufficient. New studies could reduce the intensity of the ICU and help health services by detecting severe and mildly infected patients with COVID-19 early [2,5,19,20].

Most ML approaches involve the process of transforming the feature vector from the first multidimensional space to the second multidimensional space and detecting the vector by a linear classifier [26]. The differences between ML models generally lie in the transformation algorithms and their number and order. In addition, transformation algorithms can be in the form of reducing and increasing the space dimension. The popular machine learning classifier algorithms used for data analysis are: multilayer perceptron (feedforward neural network with several layers, linear classifier) [27], support vector machine [28], K-nearest neighbors method [29], XGBoost classifier [30], random forest method [31], logistic regression [32], and decision trees [33].

ML algorithms typically require a sufficiently large number of samples. However, in our case, the dataset has to be reduced to avoid dimensionality problems by finding a matrix that has fewer columns and is similar to the original matrix. Since the new matrix consists of fewer features, it can be used more efficiently than the original matrix. Dimensionality reduction is the process of finding matrices with fewer columns. Feature selection is one of the techniques used to reduce dimensionality, when irrelevant and redundant features are discarded [26,34]. In addition, the selection of appropriate features can reduce the measurement cost and provide a better understanding of the problem [26]. Feature selection methods can be classified as filters, embedded methods, and wrappers (forward selection, backward elimination, recursive feature elimination) [26,34]. Because feature selection is part of the training process in embedded methods, our method lies between filters and wrappers. Searching for the best subset of features is performed during training of the classifier, e.g., when optimizing weights in a neural network. Therefore, embedded methods present a lower computational cost than wrappers [26].

Most of the feature selection methods are filters, although we can find representative methods for all three categories [26]. The large number of available feature selection methods complicates the selection of the best method for a given problem [34]. The latest methods that have become popular among researchers are correlation-based feature selection (CFS) [35], consistency-based filtering [36], INTERACT [37], information gain (InfoGain) [38], ReliefF [39], recursive feature elimination for support vector machines (SVM-RFE) [40], Lasso regularization [41], and the minimum redundancy maximum relevance (mRMR) algorithm (developed specifically for dealing with microarray data) [26].

In [42], a classifier based on the LogNNet neural network was described using a handwriting recognition example from the MNIST database. Velichko [43] demonstrated the use of the LogNNet to calculate risk factors for the presence of a disease based on a set of medical health indicators. The LogNNet neural network is a feedforward network that improves classification accuracy by passing the feature vector through a special reservoir matrix and transforming it into a feature vector of different size [44]. Previous studies have shown that the higher the entropy of a chaotic mapping that fills a reservoir matrix, the better the classification accuracy [45]. Therefore, the procedure for optimizing chaotic map parameters plays an important role in the presented data analysis method using the LogNNet neural network. In addition, due to the characteristics of chaotic mapping, RAM usage by a neural network can be significantly reduced. In [43], the operation of the LogNNet algorithm on a device with 2 kB of RAM was presented. This result demonstrated that LogNNet can be used in Internet of Things (IoT) mobile devices.

In this study, we apply the LogNNet neural network for the diagnosis and prognosis of COVID-19 using the RBV values measured at the time of admission to the hospital. The wrapper-type backward feature elimination algorithm has been successfully adapted to LogNNet. The novelty of the presented method is the approach to the diagnosis and prognosis of COVID-19 using routine blood values.

The paper has the following structure. Section 2 describes the data collection procedure, the basic LogNNet architecture, and K-fold cross-validation technique. Section 3 presents examples of using the feature selection methodology for two datasets. In this section, the most important RBVs (features) effective in the diagnosis and prognosis of the disease were selected. Using various feature combinations, the performance of the LogNNet model in the diagnosis and prognosis of the disease was calculated. Section 4 discusses the results and compares them with known developments. In conclusion, a general description of the study and its scientific significance are given.

#### **2. Materials and Methods**

This study was conducted in accordance with the Declaration of Helsinki, 1989. Data were collected retrospectively from the information system of Erzincan Binali Yıldırım University Mengücek Gazi Training and Research Hospital (EBYU-MG) between April and December 2021. The study had three main stages: data collection, LogNNet training with selection of main features, and testing of feature combinations (Figure 1).

The RBV of the patients consisted of biochemical, hematological, and immunological tests. Patients admitted to the ICU were defined as severely infected, while patients who could not be admitted to the ICU (non-ICU, subjects in all wards) were defined as mildly infected. The dataset SARS-CoV-2-RBV1 included information on *n* = 2648 COVID-19 positive outpatients and *n* = 2648 COVID-19 negative (control group), for a total of 5296 patients. The dataset SARS-CoV-2-RBV2 contained information of *n* = 203 ICU and *n* = 3696 non-ICU COVID-19 patients. Raw data records included patients' diagnoses (COVID-19, heart disease, asthma, etc.), treatment units (ICU or non-ICU), age, and RBV data. The entire recording process took 20 h. In the raw data, RBV data were on a quantitative scale, diagnostic data were on a multinomial scale, and treatment units were on a binomial scale. In the data preprocessing stage, the string data were converted into numerical data. Categorical data were coded, repeated measurements were averaged, duplicates were removed, and quantitative data were normalized. The missing RBV data were complemented by the mean of the respective parameter distribution.

**Figure 1.** The main stages of the study for the diagnosis and prognosis of COVID-19 using the routine blood values: data collection, LogNNet training with the selection of main features, testing combinations of the most important features that influence the diagnosis and prognosis of the disease.

#### *2.1. Characteristic of Participants, Workflow and Define Datasets*

In the EBYU-MG hospital, only the cases that were detected as SARS-CoV-2 by real-time reverse transcriptase polymerase chain reaction (RT-PCR) in nasopharyngeal or oropharyngeal swabs during the dates covered by this study were diagnosed with COVID-19. The research only included individuals over the age of 18. In order to prevent various complications, RBV results at the first admission were recorded.

The first SARS-CoV-2-RBV dataset (SARS-CoV-2-RBV1) includes the information of 2648 patients diagnosed with COVID-19 and receiving outpatient treatment in hospital on the specified dates, and the same number of patients (control group) whose COVID-19 tests were negative. The control group was randomly selected from individuals over the age of 18 who had applied to the emergency COVID-19 service but had a negative RT-PCR test. With the feature selection procedure, the most important RBV features that are effective in the diagnosis of the disease were selected from the SARS-CoV-2-RBV1 dataset. The selected features were fed into LogNNet neural network to examine the method's performance in diagnosing COVID-19 disease.

The second SARS-CoV-2-RBV dataset (SARS-CoV-2-RBV2) includes the information of 3899 patients who were treated for COVID-19 in hospital on the specified dates. The treatment units of these patients at the first admission were examined. The SARS-CoV-2-RBV2 dataset contains *n* = 203 ICU and *n* = 3696 non-ICU COVID-19 patients. Then, with the feature selection procedure, the most influential RBV traits in the prognosis of the disease were selected from the SARS-CoV-2-RBV2 dataset. Selected features were fed into the LogNNet neural network to examine the performance of this method in determining the prognosis and severity of COVID-19 disease.

The SARS-CoV-2-RBV1 and SARS-CoV-2-RBV2 datasets are presented in Tables 1 and 2. SARS-CoV-2-RBV1 and SARS-CoV-2-RBV2 datasets include immunological, hematological, and biochemical RBV parameters and each dataset consists of 51 features. In the SARS-CoV-2-RBV1 dataset, positive COVID-19 test results were coded as 1 and negative as 0 (COVID-19 = 1, non-COVID-19 = 0).


**Table 1.** Feature numbering for SARS-CoV-2-RBV1 datasets.

CRP: C-reactive protein; INR: international normalized ratio; PT: prothrombin time; PCT: Procalcitonin; ESR: erythrocyte sedimentation rate; aPTT: activated partial prothrombin time; LYM: lymphocyte count; NEU: neutrophil count; PLT: platelet count; WBC: white blood cell count; BASO: basophil count; EOS: eosinophil count; HCT: hematocrit; HGB: hemoglobin; MCH: mean corpuscular hemoglobin; MCHC: mean corpuscular hemoglobin concentration; MCV: mean corpuscular volume; MONO: monocyte count; MPV: mean platelet volume; PDW: platelet distribution width; RBC: red blood cells; RDW: red cell distribution width; ALT: alanine aminotransaminase; AST: aspartate aminotransferase; ALP: alkaline phosphatase; CK-MB: creatine kinase myocardial band; D-Bil: direct bilirubin; GGT: gamma-glutamyl transferase; HDL-C: high-density lipoprotein-cholesterol; CK: creatine kinase; LDH: lactate dehydrogenase; LDL: low-density lipoprotein; T-Bil: total bilirubin; TP: total protein; eGFR: estimating glomerular filtration rate; UA: uric acid.

**Table 2.** Feature numbering for SARS-CoV-2-RBV2 datasets.


ALT: alanine aminotransaminase; AST: aspartate aminotransferase; ALP: alkaline phosphatase; CK-MB: creatine kinase myocardial band; D-Bil: direct bilirubin; GGT: gamma-glutamyl transferase; HDL-C: high-density lipoprotein-cholesterol; CK: creatine kinase; LDH: lactate dehydrogenase; LDL: low-density lipoprotein; T-Bil: total bilirubin; TP: total protein; eGFR: estimating glomerular filtration rate; UA: uric acid; BASO: basophil count; EOS: eosinophil count; HCT: hematocrit; HGB: hemoglobin; LYM: lymphocyte count; MCH: mean corpuscular hemoglobin; MCHC: mean corpuscular hemoglobin concentration; MCV: mean corpuscular volume; MONO: monocyte count; MPV: mean platelet volume; NEU: neutrophil count; PDW: platelet distribution width; PLT: platelet count; RBC: red blood cells; RDW: red cell distribution width; WBC: white blood cell count; CRP: C-reactive protein; INR: international normalized ratio; PT: prothrombin time; PCT: procalcitonin; ESR: erythrocyte sedimentation rate; aPTT: activated partial prothrombin time.

In the SARS-CoV-2-RBV2 dataset, severely infected (ICU) COVID-19 patients were coded as 1, while mildly infected (non-ICU) COVID-19 patients were coded as 0. Datasets are available for download in the Supplementary Materials.

#### *2.2. LogNNet Architecture*

Figure 2 demonstrates the principle of operation of the neural network LogNNet [43].

**Figure 2.** LogNNet architecture [43].

An object in the form of a feature vector, denoted as *d*, is inputted to LogNNet. The feature vector contains *N* coordinates (*d1*, *d2*, ... , *dN*), where the number *N* is defined by the user. The classifier output determines the object class to which the input feature vector *d* belongs. The number of possible classes is denoted as *M*. LogNNet contains a reservoir with a special matrix, denoted as *W*. The matrix *W* was filled in a row-by-row pattern with numbers generated by the chaotic mapping *xn*. We use chaotic mapping based on the congruential generator Equation (1) (see Table 3) and the algorithm of matrix *W* filling shown in Algorithm 1. Vector *d* is converted into a vector *Y* of dimension *N* + 1 with an additional coordinate *Y0* = 1, and each component is normalized by dividing by the maximum value of this component in the training base. The next step is a multiplication of a special matrix *W* with the dimension (*N +* 1) × *P* and a vector *Y*. The result is a vector *S'* with *P* coordinates, which is normalized [42] and converted into a vector *Sh* of dimension *P + 1* with zero coordinate *Sh* [0] = 1, which plays the role of a bias element. In this way, the primary transformation of the feature vector *d* into the second *(P* + 1)-dimensional space is completed. Then, the vector *Sh* is fed to a two-layer linear classifier, with the number of neurons *H* in the hidden layer *Sh*2, and the number of outputs *M* in the output layer *Sout*. To indicate the parameters of the neural network, the following designation LogNNet *N*:*P*:*H*:*M* is used.
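The transformation of *d* into *Sh* can be sketched as follows. This is only an illustration under stated assumptions: the generator `congruential_sequence` is a generic linear congruential generator with placeholder constants, not the chaotic map of Equation (1) with the optimized parameters of Table 3; the normalization of *S'* described in [42] is omitted; and the function names are hypothetical.

```python
def congruential_sequence(count, x0=1, a=1103515245, c=12345, m=2**31):
    """Generic linear congruential generator, scaled to [0, 1).

    The constants are illustrative placeholders, not the optimized
    parameters of Equation (1) / Table 3.
    """
    values, x = [], x0
    for _ in range(count):
        x = (a * x + c) % m
        values.append(x / m)
    return values


def fill_reservoir(n_features, p):
    """Fill W of shape (N + 1) x P row by row from the generated sequence."""
    rows, cols = n_features + 1, p
    seq = congruential_sequence(rows * cols)
    return [seq[r * cols:(r + 1) * cols] for r in range(rows)]


def reservoir_transform(d, train_max, w):
    """Map a feature vector d (length N) to Sh (length P + 1)."""
    # Y0 = 1; the other components are divided by the training-set maximum.
    y = [1.0] + [value / peak if peak else 0.0 for value, peak in zip(d, train_max)]
    p = len(w[0])
    s_prime = [sum(y[i] * w[i][j] for i in range(len(y))) for j in range(p)]
    # The normalization of S' described in [42] is omitted in this sketch.
    return [1.0] + s_prime  # Sh[0] = 1 plays the role of the bias element


N, P = 3, 5
W = fill_reservoir(N, P)
sh = reservoir_transform([2.0, 10.0, 0.5], train_max=[4.0, 20.0, 1.0], w=W)
print(len(W), len(W[0]), len(sh))
```

The vector `sh` would then be passed to the two-layer linear classifier, which is not shown here.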

**Table 3.** Chaotic map equation and list of optimized parameters with limits.



The training of the linear classifier LogNNet was carried out using the backpropagation method [42].

#### *2.3. Optimization of Reservoir Parameters*

The optimal chaotic mapping parameters were selected using a special algorithm. The ranges of the parameters are indicated in Table 3. Before optimization, it is necessary to set the following values of the constant parameters of the model: the value *P* + 1, which determines the dimension of the vectors *Sh* and *Sh2*, the number of layers in the linear classifier, the number of epochs *Ep* for backpropagation training, and the number of neurons in the classifier's hidden layer, in the case of a two-layer classifier. The training of the LogNNet network is performed by two nested iterations [46]. The inner iteration trains the output LogNNet classifier by backpropagation of error on the training set, and the outer iteration optimizes the model parameters.

During the optimization process, the training and validation bases coincided and were equivalent to the initial datasets (SARS-CoV-2-RBV1 or SARS-CoV-2-RBV2). The outer iteration implements the particle swarm method with fitness function equal to classification accuracy. Outer iteration ends either when the desired values of the classification accuracy are reached, or when the specified number of iterations in the particle swarm method is completed. As a result, the optimized model parameters (chaotic mapping parameters) at the output allow us to obtain the highest classification accuracy on the validation set.

#### *2.4. Classification Accuracy, K-Fold Cross-Validation and Balancing Techniques*

The K-fold cross-validation technique was used to test LogNNet. This method is well suited for the medical databases, which are not split into test and training sets. The elements of the set (SARS-CoV-2-RBV1 or SARS-CoV-2-RBV2) are divided into *K* parts (*K* = 5). One of the parts is taken as the test sample, and the remaining *K*-1 parts are used for the training sample. Then, the average value of the metrics is calculated for all *K* cases when one of the *K* parts of the set becomes the test sample in turn. A distinctive feature of the method is that the separate test data are not needed for the training process. Applying the K-fold cross-validation technique, we calculate the classification metrics: classification accuracy, *A*, precision, recall, and F1-metric. Wherever we talk about the classification accuracy *A* in this article, we imply the value obtained by the K-fold cross-validation method.
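The splitting scheme can be sketched as below. The helper names `k_fold_indices` and `cross_validated_accuracy` are hypothetical, and the use of contiguous, unshuffled folds is an assumption; the text does not state how the elements are assigned to the *K* parts.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Contiguous folds whose sizes differ by at most one sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validated_accuracy(n_samples, evaluate, k=5):
    """Average the metric returned by `evaluate(train, test)` over the k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


folds = list(k_fold_indices(12, k=5))
for train, test in folds:
    print(len(train), test)
```

Here `evaluate` stands for any routine that trains the classifier on the `train` indices and returns a metric measured on the `test` indices.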

To obtain a higher value of *A*, the training *K*-1 parts of the sets were balanced as in [43]. The balancing implies equalizing the number of objects for each class, supplementing the classes with copies of already existing objects, and sorting the training set in sequential order. The balancing process can be illustrated by the following example. The training set consists of 10 objects divided into 2 classes. Each object is assigned a feature vector *d*<sub>z,m</sub>, where *z* = 1, ... , 10 is the object number and *m* = 1, 2 is the class number. For example, we have seven objects of class 1 (*d*<sub>1,1</sub>, *d*<sub>2,1</sub>, *d*<sub>4,1</sub>, *d*<sub>5,1</sub>, *d*<sub>6,1</sub>, *d*<sub>7,1</sub>, *d*<sub>10,1</sub>) and three objects of class 2 (*d*<sub>3,2</sub>, *d*<sub>8,2</sub>, *d*<sub>9,2</sub>). We find the maximum number of objects (*MAX*) in the classes, and *MAX* equals 7 for class 1. We supplement the remaining groups with copies of the already existing objects (duplication) to equalize the number to *MAX*. Therefore, for class 2, we acquire the group (*d*<sub>3,2</sub>, *d*<sub>8,2</sub>, *d*<sub>9,2</sub>, *d*<sub>3,2</sub>, *d*<sub>8,2</sub>, *d*<sub>9,2</sub>, *d*<sub>3,2</sub>). Then, we compose a balanced training data set, choosing one object from each group in turn. As a result, we achieve the following training set: (*d*<sub>1,1</sub>, *d*<sub>3,2</sub>, *d*<sub>2,1</sub>, *d*<sub>8,2</sub>, *d*<sub>4,1</sub>, *d*<sub>9,2</sub>, *d*<sub>5,1</sub>, *d*<sub>3,2</sub>, *d*<sub>6,1</sub>, *d*<sub>8,2</sub>, *d*<sub>7,1</sub>, *d*<sub>9,2</sub>, *d*<sub>10,1</sub>, *d*<sub>3,2</sub>), which consists of 14 vectors and has the same number of objects in every class.
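The balancing example above can be reproduced with a short sketch. Objects are represented as hypothetical (object number, class number) pairs rather than feature vectors, and the function name `balance` is an assumption.

```python
from itertools import cycle, islice


def balance(training_set):
    """Balance a list of (object_number, class_number) pairs.

    Smaller classes are padded with copies of their own objects (cycling in
    order) up to the size of the largest class, and the result takes one
    object from each class in turn.
    """
    groups = {}
    for object_number, label in training_set:
        groups.setdefault(label, []).append(object_number)
    max_size = max(len(members) for members in groups.values())
    padded = {
        label: list(islice(cycle(members), max_size))
        for label, members in groups.items()
    }
    balanced = []
    for position in range(max_size):
        for label in sorted(padded):
            balanced.append((padded[label][position], label))
    return balanced


# Objects z = 1..10; objects 3, 8 and 9 belong to class 2, the rest to class 1.
example = [(z, 2 if z in (3, 8, 9) else 1) for z in range(1, 11)]
balanced = balance(example)
print(balanced)
```

The resulting sequence contains 14 pairs in the same order as the balanced training set listed in the example.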

#### *2.5. Threshold Approach*

The simplest approach for classifying by one feature in the presence of only two classes is based on determining the threshold value separating the classes *Vth*. For the SARS-CoV-2-RBV1 dataset, we introduce an additional designation of the type of threshold value Type 1 or Type 2 in accordance with the rule:

$$\begin{array}{l} \text{Type 1: if feature value} > V_{th} \text{ then "COVID-19" else "non-COVID-19"} \\ \text{Type 2: if feature value} > V_{th} \text{ then "non-COVID-19" else "COVID-19"} \end{array} \tag{2}$$

The threshold type indicates which side of the threshold the sick and healthy classes are on.

For the SARS-CoV-2-RBV2 dataset (after balancing, see Section 2.4), we introduce a similar relation for the type of threshold value:

$$\begin{array}{l} \text{Type 1: if feature value} > V_{th} \text{ then "ICU" else "non-ICU"} \\ \text{Type 2: if feature value} > V_{th} \text{ then "non-ICU" else "ICU"} \end{array} \tag{3}$$

Threshold accuracy after balancing datasets (see Section 2.4) is determined as

$$A_{th} = \frac{TP + TN}{TP + TN + FP + FN} \tag{4}$$

where *TP* denotes true positive, *TN* true negative, *FP* false positive, and *FN* false negative.

K-fold validation is not used when calculating *Ath*.

The threshold value *Vth* was determined by stepwise enumeration and finding the maximum value of *Ath*.
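A minimal sketch of this stepwise search is shown below. The function names, the number of enumeration steps, and the toy feature values are assumptions; label 1 stands for the positive class ("COVID-19" or "ICU") and label 0 for the negative class.

```python
def threshold_accuracy(values, labels, v_th, threshold_type):
    """Accuracy of a one-feature threshold rule, as in Equation (4).

    Type 1 predicts the positive class (label 1) when value > v_th;
    Type 2 predicts the negative class (label 0) when value > v_th.
    """
    correct = 0
    for value, label in zip(values, labels):
        above = value > v_th
        predicted = int(above) if threshold_type == 1 else int(not above)
        correct += predicted == label
    return correct / len(values)  # (TP + TN) / (TP + TN + FP + FN)


def best_threshold(values, labels, steps=100):
    """Enumerate candidate thresholds and both types; return the best triple."""
    low, high = min(values), max(values)
    best = (0.0, low, 1)  # (accuracy, threshold, type)
    for i in range(steps + 1):
        v_th = low + (high - low) * i / steps
        for threshold_type in (1, 2):
            a_th = threshold_accuracy(values, labels, v_th, threshold_type)
            if a_th > best[0]:
                best = (a_th, v_th, threshold_type)
    return best


# Hypothetical feature values; label 1 = positive class, 0 = negative class.
feature = [1.0, 2.0, 2.5, 3.0, 6.0, 7.0, 8.0, 9.0]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
a_th, v_th, threshold_type = best_threshold(feature, labels)
print(a_th, v_th, threshold_type)
```

Because the search keeps the first threshold that attains the maximum, ties are resolved in favour of the smallest candidate value; other tie-breaking rules are equally plausible.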

The threshold method reflects the dependence between a single feature and COVID-19 and indicates how successfully that feature alone classifies the data (Equations (2)–(4)). In practical applications, LogNNet is a more powerful classification tool than the simple threshold method, revealing more information about the relationship between the features and COVID-19.

#### *2.6. Feature Selection Method*

The feature selection method is based on a wrapper-type backward feature elimination algorithm and has two consecutive steps. First, redundant features and features that make training of the neural network difficult are removed. In backward elimination, the algorithm starts with all the features and removes the least significant feature at each iteration. The features are removed by zeroing the corresponding components of the input vectors *d*. The second stage includes sorting the remaining features according to their contribution to the classification metric.

Feature selection for the SARS-CoV-2-RBV2 dataset illustrates this method. Suppose that reservoir optimization has been carried out and an accuracy of *A*51 = 93.665% has been obtained (using K-fold cross-validation), where *A*NF denotes the classification accuracy when using *NF* features (here *NF* = 51). We denote the set of removed features by *FR* and the set of selected features by *FS*. For example, *A*49(*FR* [3,33]) denotes the accuracy at *NF* = 49 features with features *z* = 3 and *z* = 33 removed, and *A*5(*FS* [1,22,33,41,55]) denotes the accuracy at *NF* = 5 features using only the features *z* = 1, 22, 33, 41, 55 from the set *FS*. Next, we plot the dependence of *dA*51 on the number of the removed feature *z* (see Figure 3a), where

$$dA\_{51}(z) = A\_{51} - A\_{50}(FR[z])\tag{5}$$

**Figure 3.** Function of the feature strength *dA*51(*z*) (**a**), *dA*50(*z*) (**b**), *dA*49(*z*) (**c**), *dA*48(*z*) (**d**).

The dependence *dA*(*z*) is a function of the feature strength. The value *A*50(*FR* [*z*]) characterizes the classification accuracy of the neural network using *NF* = 50 features, after deleting the feature with number *z*. A positive feature strength *dA*51 (Figure 3a and Equation (5)) means that removing the feature reduces the classification accuracy of the network, so the feature is useful. A negative *dA*51 means that the feature is redundant and interferes with learning, so its removal improves the classification accuracy of the neural network. After the first selection iteration, the seven most useful features can be identified as numbers *z* = 49, 36, 42, 19, 12, 3, 21 (Figure 3a). The feature that makes learning the most difficult is number *z* = 44 (in Figure 3 it is indicated by the index 'Minimum'). Its removal gives *A*50(*FR* [44]) = 94.075%, which exceeds the previous value *A*51 = 93.665%.

The next iteration involves calculating the dependence of *dA*50(*z*) (Figure 3b), where

$$dA\_{50}(z) = A\_{50}(FR[44]) - A\_{49}(FR[44, z])\tag{6}$$

Equation (6) implies the exclusion of the worst feature *z* = 44 and the exclusion of all other features in turn. As a result, the next feature to exclude will be the feature *z* = 45, and the best accuracy will be *A*49(*FR* [44,45]) = 94.28%.

Iterations continue until all *dA* values are greater than or equal to zero. Figure 3c,d show the graphs for Equations (7) and (8):

$$dA\_{49}(z) = A\_{49}(FR[44, 45]) - A\_{48}(FR[44, 45, z])\tag{7}$$

$$dA\_{48}(z) = A\_{48}(FR[44, 45, 14]) - A\_{47}(FR[44, 45, 14, z])\tag{8}$$

The graph in Figure 3d reflects the dependence *dA*48(z) that has positive values. Thus, the best classification accuracy corresponds to *A*48(*FR* [14,44,45]) = 94.434%, after removing the features z = 44, 45, 14. During the selection, the set of the seven best features with highest feature strength *dA* also changed from the set [3,12,19,21,36,42,49] (Figure 3a) to [3,12,36,39,40,42,49] (Figure 3d, red circle).

The second stage arranges the features according to their strength in descending order of peak values *dA*. For the considered example, the sequence contains the following first 12 values [3,4,9,12,21,29,35,36,39,40,42,49] (Figure 3d).
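The two stages described above can be sketched as follows. This is a hedged illustration rather than the authors' code: `accuracy` is a hypothetical callable that is assumed to train and evaluate the network (for example, with K-fold cross-validation) with the listed features zeroed, and the function names are our own.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

# Hypothetical evaluator: maps the set of removed (zeroed) features to accuracy.
AccuracyFn = Callable[[FrozenSet[int]], float]


def backward_elimination(n_features: int, accuracy: AccuracyFn
                         ) -> Tuple[List[int], Dict[int, float]]:
    """Stage 1: repeatedly remove the feature with the most negative dA.

    Returns the removed features (in removal order) and the feature
    strengths dA(z) of the remaining features at the final iteration.
    """
    removed: List[int] = []
    while True:
        base = accuracy(frozenset(removed))
        # dA_N(z) = A_N(FR) - A_{N-1}(FR + [z]) for every remaining z,
        # following the pattern of Equations (5)-(8).
        strength = {
            z: base - accuracy(frozenset(removed + [z]))
            for z in range(1, n_features + 1) if z not in removed
        }
        if not strength:
            return removed, strength
        worst = min(strength, key=strength.get)
        if strength[worst] >= 0:  # no remaining feature hurts accuracy
            return removed, strength
        removed.append(worst)


def rank_features(strength: Dict[int, float]) -> List[int]:
    """Stage 2: order the remaining features by descending strength dA."""
    return sorted(strength, key=strength.get, reverse=True)
```

Each iteration requires one evaluation per remaining feature, so the cost of stage 1 is dominated by the number of training runs of the network.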

#### **3. Results**

#### *3.1. Dataset SARS-CoV-2-RBV1*

The LogNNet 51:50:20:2 architecture was used for the SARS-CoV-2-RBV1 dataset. Reservoir optimization following the method from Section 2.3 with the number of epochs *Ep* = 50 led to the parameters of the congruential generator listed in Table 4.

**Table 4.** Optimal reservoir parameters.


Feature selection was performed with the number of epochs *Ep* = 100. Prior to selection, the *dA*51(*z*) shape is plotted in Figure 4a. After feature selection, the redundant features have the numbers z = 21, 37, 42, 49, 40, and the *dA*46(*z*) plot is shown in Figure 4b. The influence of features with numbers *z* = 20, 19, 10, 17 has increased.

**Figure 4.** Function of the feature strength *dA*51(*z*) (**a**), *dA*46(*z*) (**b**).

The dependence of *A*46(*FR* [21,37,40,42,49]) on the number of epochs is shown in Figure 5, and the values of other metrics are shown in Table 5.

*Ep* = 100 is taken as the optimal number of epochs. The RBVs found to be most important in the diagnosis of COVID-19 are the features listed in Table 6. The most important of these are MCHC, MCH, and aPTT. The MCHC value in a blood test indicates the average hemoglobin concentration in erythrocytes.

**Figure 5.** Dependence of *A*46(*FR* [21,37,40,42,49]) on the number of epochs *Ep*.

**Table 5.** Classification metrics depending on the number of training epochs *Ep*.


**Table 6.** The seven features found to be most important in the diagnosis of COVID-19.


MCH: mean corpuscular hemoglobin; MCHC: mean corpuscular hemoglobin concentration; aPTT: activated partial thromboplastin time; HCT: hematocrit; HDL-C: high-density lipoprotein-cholesterol; MONO: monocyte count; RBC: red blood cells.

The efficiency of LogNNet in determining the diagnosis of COVID-19 using only seven features and their combinations is shown in Table 7.

Using only one feature, 20 (MCHC) or 36 (HDL-C), to diagnose COVID-19 provides a high classification accuracy of *A*1(*FS* [20]), *A*1(*FS* [36]) ~94%. The combination of the two features 20 (MCHC) and 19 (MCH) reaches an accuracy of *A*2(*FS* [19,20]) ~99.15%.

The accuracy of the model in diagnosing the disease with seven features was almost equal to the accuracy rate in using all 46 features (*A*7~99.4 vs. *A*46~99.59) (Table 7).


**Table 7.** LogNNet efficiency for various combinations of features.

#### Threshold Accuracy on One Feature

Table A1 in Appendix A contains the threshold accuracy *Ath*, threshold values *Vth*, type, and change limits for all features. Rows are sorted in descending order of threshold accuracy *Ath*. Case distribution histograms for the features with the highest threshold accuracy (LDL, HDL-C, Cholesterol, MCHC, Triglyceride, Amylase) are shown in Figure 6. An LDL level lower than 116.1 mg/dL, HDL-C level lower than 43.1 mg/dL, Cholesterol level lower than 206.3 mg/dL, Triglyceride level lower than 163.3 mg/dL, MCHC level higher than 31.3 g/dL, and Amylase level higher than 76.3 U/L are critical levels for the detection of sick individuals. Using any one of these critical levels, patients and healthy individuals could be distinguished with an accuracy between *Ath* = 85% and *Ath* = 94%.

**Figure 6.** Case distribution histograms for LDL (**a**), HDL-C (**b**), Cholesterol (**c**), MCHC (**d**), Triglyceride (**e**), Amylase (**f**) from sick and healthy individuals and the threshold values *Vth* of these features (blue line) in the diagnosis of the disease. Histogram bin sizes are listed in Table A1.

For features from Table 6 not included in Figure 6, case distribution histograms (MCH, aPTT, HCT, MONO, RBC) are shown in Figure 7. The accuracy of each of these features alone in detecting sick and healthy individuals was less than 60% (Figure 7). However, the performance of the combination of MCHC with MCH and of MCHC with HDL-C in detecting sick and healthy individuals is higher than their individual performance (Table 7). The high level of mutual information revealed among these variables helps LogNNet to diagnose COVID-19. The combination of the MCH, aPTT, HCT, MONO, and RBC features is not effective in the diagnosis of the disease (*A*5(*FS* [10,17,19,22,25]), Table 7). We think that there is a low correlation between these features and COVID-19.

**Figure 7.** Case distribution histograms for MCH (**a**), aPTT (**b**), HCT (**c**), MONO (**d**), RBC (**e**) from sick and healthy individuals and the threshold values *Vth* of these features in the diagnosis of the disease. Histogram bin sizes are listed in Table A1.

#### *3.2. Dataset SARS-CoV-2-RBV2*

The LogNNet 51:50:20:2 architecture was used for the SARS-CoV-2-RBV2 dataset. Reservoir optimization following the method from Section 2.3 with the number of epochs *Ep* = 50 led to the parameters of the congruential generator indicated in Table 4. Feature selection was carried out with the number of epochs *Ep* = 150. Prior to selection, the feature strength corresponded to *dA*51(*z*) (Figure 3a). After feature selection, the redundant features are numbers *z* = 44, 45, and 14, and the *dA*48(*z*) graph is shown in Figure 3d.

The dependence of *A*48(*FR* [14,44,45]) on the number of epochs is shown in Figure 8, and the values of other metrics are shown in Table 8.

*Ep* = 150 is taken as the optimal number of epochs. The metrics for the "ICU" case are significantly worse than for the "non-ICU" case because of limited data for the "ICU" case. The most important RBVs in identifying severely and mildly infected COVID-19 patients are the features listed in Table 9. The most important of these are ESR and NEU.

**Figure 8.** Dependence of *A*48(*FR* [14,44,45]) on the number of epochs *Ep*.





ESR: erythrocyte sedimentation rate; NEU: neutrophil count; CRP: C-reactive protein; RBC: red blood cells; RDW: red cell distribution width; ALP: alkaline phosphatase; TP: total protein; MPV: mean platelet volume; HGB: hemoglobin.

The efficiency of LogNNet when using only the 12 features and their combinations to identify severely and mildly infected COVID-19 patients is shown in Table 10.


**Table 10.** LogNNet efficiency for various combinations of features.

The recall value indicates what percentage of individuals diagnosed as mild or severe patients by the specialist could be recognized as mild or severe patients by our model. In other words, the recall value indicates the success of our model in distinguishing mild or severe patients. The precision value indicates the percentage of the individuals diagnosed as mild or severe patients by our model who were also defined as mild or severe patients by the specialist. In other words, the precision value shows the success of our model in diagnosing mild or severe patients.
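The difference between the two metrics can be illustrated with a short calculation. The confusion counts below are hypothetical values chosen only to mimic an unbalanced "ICU"/"non-ICU" split; they are not taken from Table 10, and the helper name is our own.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)


# Hypothetical counts: 100 ICU and 1000 non-ICU patients (labels by specialist).
icu_as_icu = 80          # ICU patients the model also labels ICU
icu_as_non_icu = 20      # ICU patients the model labels non-ICU
non_icu_as_icu = 120     # non-ICU patients the model labels ICU
non_icu_as_non_icu = 880 # non-ICU patients the model also labels non-ICU

# Treat each class in turn as the "positive" class.
icu = precision_recall(tp=icu_as_icu, fp=non_icu_as_icu, fn=icu_as_non_icu)
non_icu = precision_recall(tp=non_icu_as_non_icu, fp=icu_as_non_icu,
                           fn=non_icu_as_icu)

print(f"ICU:     precision={icu[0]:.3f}, recall={icu[1]:.3f}")
print(f"non-ICU: precision={non_icu[0]:.3f}, recall={non_icu[1]:.3f}")
```

With these assumed counts, a modest share of misclassified majority-class patients is enough to outnumber the correctly detected minority-class patients, so the minority class can combine a high recall with a low precision.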

The accuracy of the model run with 12 features to identify mildly and severely infected patients was close to the accuracy of the model run with 48 features (*A*12~90.9 vs. *A*48~94.94) (Table 10). The accuracy of the model run with seven features was 89.3%, where the model's success in diagnosing the mildly infected (precision value) was 99.1%, and its success in recognizing mildly infected patients (recall value) was 89.6%. The metrics for the "ICU" case are significantly worse than for the "non-ICU" case. Here, our model decided in favor of the diagnosis of mildly infected (high precision for non-ICU, low precision for ICU) because of the imbalance between the numbers of mildly infected and severely infected patients in our sample.

#### Threshold Accuracy on One Feature

Table A2 in Appendix A contains values of threshold accuracy *Ath*, threshold values *Vth*, as well as types and limits of change for all features. Rows in the table are sorted in descending order of threshold accuracy *Ath*. Case distribution histograms for features with the highest threshold accuracy (NEU, Albumin, WBC, CRP, Urea, Calcium) are shown in Figure 9.

**Figure 9.** Case distribution histograms for NEU (**a**), Albumin (**b**), WBC (**c**), CRP (**d**), Urea (**e**), Calcium (**f**) from mildly and severely infected COVID-19 patients and the threshold values *Vth* of these features (blue line) in the prognosis of the disease. Histogram bin sizes are listed in Table A2.

Cases with an NEU level higher than 6.2 × 103/μL, WBC level higher than 7.93 × 103/μL, CRP level higher than 15 mg/dL, Urea level higher than 46.9 mg/dL, Albumin level lower than 32.2 g/L, and Calcium level lower than 8.5 mg/dL most likely require intensive care treatment (Figure 9). Considering any of these critical levels, patients requiring intensive care and patients not requiring intensive care could be correctly identified with the accuracy between *Ath* = 72% and *Ath* = 78%.

For features from Table 9 not included in Figure 9, case distribution histograms (ESR, RBC, Chlorine, RDW, ALP, TP, Glucose, MPV, HGB) are shown in Figure 10. The success of these features alone in detecting mildly and severely infected patients varies between *Ath* = 54.3% and *Ath* = 71.5% (Figure 10). However, the performance of the combination of the ESR, NEU, and CRP features in detecting mildly and severely infected patients was higher than their individual performance (Table 10). In addition, combining these features with the Albumin, RBC, Chlorine, and RDW features improved performance in detecting severely and mildly infected patients (*A*3(*FS* [36,42,49]) = 82.7% vs. *A*7(*FS* [3,12,36,39,40,42,49]) = 89.3%, Table 10). We think that there is a low level of correlation between the ALP, TP, Glucose, MPV, and HGB features and the severity of COVID-19 (*A*7(*FS* [3,12,36,39,40,42,49]) = 89.4% vs. *A*12(*FS* [3,4,9,12,21,29,35,36,39,40,42,49]) = 90.9%, Table 10). Therefore, the combination of the ESR, NEU, CRP, Albumin, RBC, Chlorine, and RDW blood values is an important source of variation in determining the severity of the disease, and high-level hidden information may be found among these variables. The combination of these features may have important effects in the prognosis of COVID-19 disease and in identifying patients in need of intensive care.

**Figure 10.** Case distribution histograms for ESR (**a**), RBC (**b**), Chlorine (**c**), RDW (**d**), ALP (**e**), TP (**f**), Glucose (**g**), MPV (**h**), HGB (**i**) from mildly and severely infected COVID-19 patients and the threshold values *Vth* of these features (blue line) in the prognosis of the disease. Histogram bin sizes are listed in Table A2.

#### **4. Discussion**

COVID-19 is a systemic disease with multi-organ damage that causes severe acute respiratory syndrome and death, and it continues to spread [3,47]. Despite the use of vaccines, the spread of the disease cannot be stopped, and important mutations have been detected in the structure of the virus [1]. It is likely that COVID-19 will continue to be present in our lives. Despite the large number of studies on COVID-19, some of these studies were contradictory and pathological aspects of the disease could not be fully determined [48]. Changes in many RBVs and hematological abnormalities were observed during the course of the disease [6,48]. The fact that most patients lost their lives in case of severe infection has led to a fight against the disease all over the world [10,49]. In addition, Brinati et al. [19] and Zhang et al. [49] pointed out that various complications may occur during the treatment of COVID-19, which makes it important to predict the prognosis of the disease at an early stage. Similarly, Mertoğlu et al. [1] and Huyut and İlkbahar [3] stated that early prediction of the diagnosis and prognosis of the disease is important in the first response to severely infected COVID-19 patients.

As with immunodiagnostic testing, RT-PCR testing may present difficulties in identifying true positive and negative individuals infected with COVID-19 [4,50]. Indeed, Teymouri et al. [50] and D'Cruz et al. [51] suggested that to increase the sensitivity of the RT-PCR test, the test should be repeated on multiple samples and the application methodology should be improved. However, these procedures are burdensome for health personnel and patients. These difficulties in diagnosing COVID-19 have further increased the importance of RBV-based methods [1,2]. In this context, it is possible to determine both the diagnosis and the prognosis of the disease with RBVs (biomarkers), which are easier to obtain, more economical, and faster to measure [1–6].

In an ML study for the diagnosis of COVID-19 based on RBVs, Brinati et al. [19] explained that AI models are based on clinical features and can be used for processes, such as disease diagnosis and prognosis. AI models that use the RBVs can be both an adjunct and an alternative method to rRT-PCR [20]. In addition, AI application results can provide information about the infection risk level and can be used in the rapid triage and quarantine of high-risk patients [20].

In this study, the most effective RBV biomarkers in the diagnosis and prognosis of COVID-19 were determined by a two-step feature selection procedure for use in peripheral IoT devices with low computing resources. Our LogNNet neural network model, fed with selected features, identified sick and healthy individuals, and especially mildly infected patients, with high accuracy.

In the first dataset used in this study, the RBVs of COVID-19 positive (*n* = 2648) patients and COVID-19 negative (*n* = 2648) individuals were recorded. In the second dataset, the RBVs of 3899 patients (*n* = 203 ICU and *n* = 3696 non-ICU) hospitalized with the diagnosis of COVID-19 were recorded. Hence, 51 features of all patients were identified (Tables 1 and 2). A two-stage feature selection procedure (see Section 2.6) was applied to the datasets, and features were selected for each dataset. The features selected for the first dataset were fed into the LogNNet neural network, and the accuracy of the method in the diagnosis of COVID-19 was calculated. Then, the selected features for the second dataset were fed into the LogNNet neural network, and the performance of the method in identifying mildly and severely infected patients (determining the prognosis of the disease) was assessed.

Previous studies on the diagnosis and prognosis of COVID-19 have indicated changes in most of the RBV parameters and biomarkers [1–3,5]. Mertoğlu et al. [1] and Yang et al. [52] reported that the most effective RBV biomarkers in the diagnosis and prognosis of COVID-19 are CRP and LYM. However, other studies conducted for this purpose have reported blood values of CRP, procalcitonin, ferritin, ALT, aPTT, and ESR [3,4,6]. Banerjee et al. [8] used random forest, glmnet, generalized linear models, and artificial neural network (ANN) models to diagnose COVID-19 with 14 RBVs of 81 COVID-19 positive and 517 healthy individuals. Glmnet was found to be the most successful model in the diagnosis of the disease with 92% sensitivity and 91% accuracy [8]. Brinati et al. [19] used various ML methods with 13 RBVs for diagnosis of the disease (102 COVID-19 negative, 177 positive) and noted that the models with the highest accuracy were random forest (82%) and logistic regression (78%). Similarly, Cabitza et al. [20] used various ML models to rapidly detect COVID-19 using many RBV parameters and found that the models with the highest accuracy were random forest (88%), support vector machine (SVM) (88%), and k-nearest neighbor (86%). Joshi et al. [22] developed a trained logistic regression model using some RBVs on a dataset of 380 cases, reporting good sensitivity (93%) but low specificity (43%). Yang et al. [21] applied various ML models to 27 RBV parameters of a large patient population of 3356 individuals (42% COVID-19 positive) and found the gradient boost tree model to be the most successful in the diagnosis of the disease, with 76% sensitivity and 80% specificity. In a COVID-19 study using chest computed tomography (CT) data and RBV parameters, Mei et al. [23] presented a model combining a CNN and a multilayer perceptron that diagnosed the disease with 84% sensitivity and 83% specificity. Soares [24] proposed a model combining SVM, ensembling, and SMOTE Boost models to diagnose COVID-19 using 15 RBV parameters in a population of 599 individuals; the model diagnosed the disease with 86% specificity and 70% sensitivity. Running various ML models to diagnose COVID-19 with the RBV parameters, Soltan et al. [25] found the XGBoost method to be the most successful model with 85% sensitivity and 90% precision. Huyut [53] applied a variety of supervised ML models to 28 routine blood values together with age to distinguish severely and mildly infected COVID-19 patients in a large population. The models with the highest AUC in identifying mildly infected patients were locally weighted learning (0.95), KStar (0.91), naive Bayes (0.85), and k-nearest neighbor (0.75).

This study identified the seven most important biomarkers in the diagnosis of COVID-19 (Table 6). Among these features, the most important biomarkers were MCHC, MCH, and aPTT. The overall accuracy of the LogNNet model run with seven features was *A*7(*FS* [10,17,19,20,22,25,36]) ~99.3%, and the precision of patient identification was 99.6%. In addition, different combinations of features that are important in the diagnosis of patients were examined. The overall accuracy of the LogNNet model run with only the MCHC and MCH features was *A*2(*FS* [19,20]) ~99.1%, and the precision of patient identification was 99.4%. The overall accuracy of our model using only the MCHC feature was 94.2%, while the overall accuracy of the model using only the HDL-C feature was 94.4%. According to the calculated critical levels of the main features, such as LDL, HDL-C, Cholesterol, Triglyceride, MCHC, and Amylase (Figure 6), the health and sickness status of individuals could be determined accurately. The fact that the combination of MCHC and MCH and the combination of MCHC and HDL-C performed better in the detection of sick and healthy individuals than the individual features suggests that there is a high level of hidden information linking these blood feature combinations and COVID-19. This information was revealed by the LogNNet neural network method. These combinations of features can be used by LogNNet to diagnose COVID-19 with high accuracy.

Studies indicate that the ALT, AST, LDH, direct bilirubin, and aPTT RBVs are increased in severe COVID-19 patients, while the hemoglobin values are decreased significantly compared to mildly infected patients [6,23,54]. However, in other studies, the LYM, NEU, WBC, MCH, MPV, and RDW hematological RBVs were higher in severe COVID-19 patients than in mildly infected patients [1–3,6]. Mousavi et al. [16], Zhang et al. [54], and Zheng et al. [55] determined that patients with severe COVID-19 had lower EOS, MONO, RBC, hematocrit, hemoglobin, and MCHC hematological values than mild patients. Huyut et al. [6], in a study of patients who died from COVID-19, showed that the ESR, INR, PT, CRP, D-dimer, and ferritin biomarkers are the most important biomarkers for detecting the mortality of the disease. Luo et al. [56] proposed a multi-criteria decision making (MCDM) algorithm combining the technique for order of preference by similarity to ideal solution (TOPSIS) and naive Bayes (NB) as a feature selection procedure to predict the severity of COVID-19 from initial RBV values. With the MCDM model, the WBC, LYM, NEU values, and age were the most effective features in determining the severity of the disease, with 82% accuracy obtained by ROC analysis [56]. Similarly, Ma et al. [57] and Lai et al. [58] noted that high WBC and NEU values are important manifestations of bacterial infection and indicate a serious disease state that complicates the clinical situation. Numerous studies have shown that other proinflammatory marker levels, including CRP, ferritin, and IL-6, are associated with worse outcomes [59–61]. Cheng et al. [62] reported that high levels of inflammatory markers, such as ESR, CRP, and procalcitonin, may indicate hyperinflammatory reactions in COVID-19 patients. Cavalcante-Silva et al. [63] stated that the neutrophil count was increased in severe COVID-19 patients and that neutrophils are the main effector cells in the development of COVID-19. The different neutrophil mechanisms, e.g., neutrophil enzymes and cytokines, are potential targets for treating particularly severe cases of COVID-19 [63].

This study identifies the twelve most important biomarkers to determine the prognosis of COVID-19 (detecting severely and mildly infected patients) (Table 9). The most important of them are ESR, NEU, CRP, albumin, and RBC biomarkers. The overall accuracy of the LogNNet model, which was run with twelve features, was 90.9%, the success rate in diagnosing mildly infected patients (precision rate) was 99.0%, and the success rate in diagnosing severely infected patients (precision rate) was 36.6% (Table 10). However, the success of the LogNNet model, which was run with twelve features, in distinguishing mild and severe patients according to their real conditions (recall value), was 91.4% and 83.1%, respectively (Table 10).

The calculated critical levels of the NEU, WBC, CRP, Urea, Albumin, and Calcium features are important levels in determining the severity of infection of the patients (Figure 9). Moreover, the fact that the combination of the ESR, NEU, CRP, Albumin, RBC, Chlorine, and RDW features performs better in detecting infected patients than the individual features indicates a high level of hidden information about COVID-19 among these blood features. This information was revealed by the LogNNet neural network. The combinations of features can be used as important biomarkers in the prognosis of the COVID-19 disease and in identifying patients in need of intensive care.

Our model decided in favor of the diagnosis of mildly infected patients (high precision for non-ICU, low precision for ICU) because of the unbalanced sample size of mildly infected and severely infected patients. However, our model showed a high recall value in identifying mildly and severely infected patients. The model run with only three features showed an average of 82.6% agreement with the expert opinion in distinguishing mildly or severely infected patients (Table 10). However, severe patient diagnosis of our model showed low agreement with expert opinion (low precision "ICU") (Table 10), and the success of our model in diagnosing severe patients is low. As a result, the LogNNet model, which is run with the features in Table 10, can be used safely with high sensitivity (recall) to confirm the expert opinion in recognizing mild and severely infected patients. In addition, our model can be an alternative tool for diagnosing mildly infected patients using the features in Table 10. Furthermore, the success of the LogNNet model using few features in distinguishing mild and severe patients and diagnosing mildly infected patients is high.

Other studies [19,64,65] confirming the association of RBV features with COVID-19 highlight the importance of the clinical research direction that our model takes. The poor performance of our model in diagnosing severe patients (low precision for the ICU) is an expected situation. Several studies have stated that severe COVID-19 patients experienced more changes in the RBV values than mildly infected patients, and that various complications could occur during the severe disease process [1–3,6]. There are many factors affecting the intensive care need of an individual with COVID-19 and difficulties in determining this process with only RBV values [1–6]. However, there are few studies on determining the severity of infection in patients with COVID-19 based on the RBV values alone.

Cabitza et al. [20], Soltan et al. [25], and Rabanser et al. [66] stated that the reported performance values are good enough, especially in terms of screening, considering the economic benefits and rapid results of the developed artificial intelligence models. Moreover, Brinati et al. [19] suggested the necessity of conducting studies on the predictability of arterial blood gas tests in addition to routine blood values for the diagnosis of COVID-19. In this context, we plan our next studies as follows. The first phase is to identify the diagnosis and prognosis of COVID-19 with LogNNet model using the arterial blood gases. The next phase is to determine the mortality of COVID-19 with the LogNNet model using the RBV values.

Velichko [43] reported a method for estimating the RAM occupied by the LogNNet implementation on Arduino microcontrollers. The LogNNet 51:50:20:2 model discussed above takes about 13.7 kB of RAM. As the matrix *W* occupies ~10.4 kB, this memory can be freed by the RAM-saving algorithm, and the algorithm will then use ~3.3 kB. Therefore, the model can be placed on microcontrollers with a RAM size of 16 kB, e.g., Arduino Nano.
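The memory figures quoted above can be checked with a few lines of arithmetic. The sketch below rests on assumptions that the text does not state: that the reservoir matrix *W* has one row per reservoir neuron and one column per input plus one bias element, that each element is a 4-byte float, and that 1 kB is taken as 1000 bytes.

```python
# Architecture 51:50:20:2 -> 51 inputs, 50 reservoir neurons (from the text).
N_INPUTS = 51
N_RESERVOIR = 50

BYTES_PER_FLOAT = 4   # assumption: 32-bit floating-point elements
BYTES_PER_KB = 1000   # assumption: decimal kilobytes

# Assumed shape of W: N_RESERVOIR rows x (N_INPUTS + 1) columns (bias included).
w_elements = N_RESERVOIR * (N_INPUTS + 1)
w_bytes = w_elements * BYTES_PER_FLOAT
w_kb = w_bytes / BYTES_PER_KB

total_kb = 13.7                  # total RAM quoted in the text
remaining_kb = total_kb - w_kb   # RAM still needed once W is not stored

print(f"W: {w_elements} elements, {w_kb:.1f} kB")
print(f"Remaining after freeing W: {remaining_kb:.1f} kB")
```

Under these assumptions the size of *W* agrees with the ~10.4 kB quoted above, and the difference matches the ~3.3 kB figure.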

With recent advancements in information and communication technologies due to the adoption of IoT technology, smart health monitoring and support systems have a higher development and acceptability margin to improve wellness [67,68]. The integration of medical technologies into IoT is called the Internet of Medical Things (IoMT) [69].

In this context, the availability of low-cost, single-chip microcontrollers and advances in wireless communication technology have encouraged researchers to design low-cost embedded systems for healthcare monitoring applications [67]. Doctors can use patients' data to remotely monitor their physiological health status and diagnose their disorders [68]. In a study designed for mobile health applications, Hu et al. [70] used various graphical biosensors to monitor conditions such as heart attack, brain problems (seizures, mental disorder, etc.), and high blood pressure. In a study with a similar purpose, Vizbaras et al. [71] reported that the stretching and bending vibrations of various chemical bonds are molecule-specific. Therefore, certain infrared spectral ranges are of particular interest in biomedical sensing. In addition, this approach can be used to selectively detect important biomolecules, such as glucose, lactate, urea, ammonia, serum albumin, and so on. Clifton et al. [72] demonstrated the use of wearable sensors for routine healthcare in their study of the large-scale clinical adoption of "intelligent" predictive monitoring systems.

Mobile sensors for measuring routine blood parameters for the real-time detection of various diseases are being developed rapidly as technology advances [73–76]. The RBV values can be measured using a low-cost, mobile microscope, an ocular camera, and a smartphone [73]. In a proof-of-concept study, Chan et al. [74] determined PT and INR blood values by monitoring the micro-mechanical movements of a copper particle using the vibration motor and camera of a smartphone. Farooqi et al. [75] followed diabetic patients with telemonitoring and Bluetooth-enabled self-monitoring devices and produced new solutions for the glycemic control of the patients. Zhang et al. [76] determined various biochemical parameters by electrochemical controls.

In the future, the data can be obtained in real time and used to provide immediate medical advice before patients' health problems occur and progress. The technique presented in this study can be used to create mobile health monitoring systems.

The output of the LogNNet model can be used in different scenarios. The presented feature selection method can be used in conjunction with molecular testing to obtain high sensitivity and certainty regarding suspected cases. In this way, more positive patients can be identified, isolated, and treated in a timely manner. Likewise, the outputs of our model can be used while the results of other tests are awaited. The results of this study demonstrate that the LogNNet neural network model can be used effectively in clinical decision support systems and mobile diagnostics.

Various independent biomarkers used in the study need to be tested in the diagnosis and prognosis of other infectious diseases. The small number of ICU patients compared with non-ICU patients was one of the limitations of this study.

#### **5. Conclusions**

Determining the mild or severe infection status of COVID-19 patients using various diagnostic tests and imaging results can be costly and time consuming, and it can be subject to different complications during the process. In this case, the patient's health may be at higher risk and health services may face tragic situations under intense pressure. This study provides a fast, reliable, and economical alternative mobile tool for the diagnosis and prognosis of COVID-19 based on the RBV values measured only at the time of admission to the hospital.

In this study, the most effective RBVs in the diagnosis and prognosis of COVID-19 were determined using a feature selection method for the LogNNet reservoir neural network. The most important RBVs in the diagnosis of the disease were MCHC, MCH, and aPTT. The most important RBVs in the prognosis of the disease were ESR, NEU, CRP, albumin, and RBC. The LogNNet deep neural network model accurately and precisely detected almost all COVID-19 patients using only a few RBV features.

The health and sickness status of individuals could be determined largely accurately using threshold levels of the LDL, HDL-C, Cholesterol, Triglyceride, MCHC, and Amylase features. In addition, the LogNNet neural network revealed that the performance of the combination of MCHC and MCH and the combination of MCHC and HDL-C in the detection of sick and healthy individuals was higher than the individual performances of these features.

Threshold levels of the NEU, WBC, CRP, Urea, Albumin, and Calcium main properties were found to be significant in the detection of severely and mildly infected patients. As revealed by the LogNNet network, the combination of ESR, NEU, CRP, Albumin, RBC, Chlorine, and RDW features is an important source of variation in the prognosis of COVID-19. We propose to use this combination of the features with LogNNet as important biomarkers in the prognosis of the disease and in identifying patients in need of intensive care.

The results of this study can be effectively used in medical peripheral devices of the IoT (IoTM) with low RAM resources, including clinical decision support systems, remote internet medicine, and telemedicine.

**Supplementary Materials:** The following supporting information can be downloaded at: https://data.mendeley.com/datasets/8hdnzv23x7, SARS-CoV-2-RBV1.sav, SARS-CoV-2-RBV2.sav, SARS-CoV-2-RBV3.sav.

**Author Contributions:** Conceptualization, M.T.H. and A.V.; methodology, M.T.H. and A.V.; software, A.V.; validation, M.T.H. and A.V.; formal analysis, M.T.H.; investigation, A.V.; resources, M.T.H.; data curation, M.T.H.; writing—original draft preparation, M.T.H. and A.V.; writing—review and editing, M.T.H. and A.V.; visualization, M.T.H. and A.V.; supervision, M.T.H.; project administration, M.T.H.; funding acquisition, A.V. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was supported by the Russian Science Foundation (grant no. 22-11-00055, https://rscf.ru/en/project/22-11-00055/ (accessed on 22 June 2022)).

**Institutional Review Board Statement:** The dataset used in this study was collected in order to be used in various studies in the estimation of the diagnosis, prognosis and mortality of COVID-19. The necessary permissions for the collected dataset were given by the Ministry of Health of the Republic of Turkey and the Ethics Committee of Erzincan Binali Yıldırım University. This study was conducted in accordance with the 1989 Declaration of Helsinki. Erzincan Binali Yıldırım University Human Research Health and Sports Sciences Ethics Committee Decision Number: 2021/02-07.

**Informed Consent Statement:** In this study, a dataset including only routine blood values, RT-PCR results (positive or negative), and treatment units of the patients was downloaded retrospectively in digital form from the information system of our hospital. No new samples were taken from the patients. The dataset contains no information that includes identifying characteristics of individuals. It was stated that routine blood values would only be used in academic studies, and written consent was obtained from the institutions for this. Therefore, written informed consent was not obtained from each patient.

**Data Availability Statement:** The data used in this study can be shared with the parties, provided that the article is cited.

**Acknowledgments:** We thank the method of Erzincan Mengücek Gazi Training and Research Hospital for their support in reaching the material used in this study. Special thanks to the editors of the journal and to the anonymous reviewers for their constructive criticism and improvement suggestions.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**

**Table A1.** Threshold method parameters for SARS-CoV-2-RBV1 dataset and histogram bin sizes for Figures 6 and 7.



**Table A2.** Threshold method parameters for SARS-CoV-2-RBV2 dataset and histogram bin sizes for Figures 9 and 10.

#### **References**

1. Mertoglu, C.; Huyut, M.; Olmez, H.; Tosun, M.; Kantarci, M.; Coban, T. COVID-19 is more dangerous for older people and its severity is increasing: A case-control study. *Med. Gas Res.* **2022**, *12*, 51–54.


## *Article* **Deep Learning and 5G and Beyond for Child Drowning Prevention in Swimming Pools**

**Juan Carlos Cepeda-Pacheco and Mari Carmen Domingo \***

Department of Network Engineering, BarcelonaTech (UPC) University, 08860 Castelldefels, Spain **\*** Correspondence: cdomingo@entel.upc.edu

**Abstract:** Drowning is a major health issue worldwide. The World Health Organization's global report on drowning states that the highest rates of drowning deaths occur among children aged 1–4 years, followed by children aged 5–9 years. Young children can drown silently in as little as 25 s, even in the shallow end or in a baby pool. The report also identifies that the main risk factor for children drowning is the lack of or inadequate supervision. Therefore, in this paper, we propose a novel 5G and beyond child drowning prevention system based on deep learning that detects and classifies distractions of inattentive parents or caregivers and alerts them to focus on active child supervision in swimming pools. In this proposal, we have generated our own dataset, which consists of images of parents/caregivers watching the children or being distracted. The proposed model can successfully perform a seven-class classification with very high accuracies (98%, 94%, and 90% for each model, respectively). ResNet-50, compared with the other models, performs better classifications for most classes.

**Keywords:** deep learning; 5G and beyond; child drowning prevention; network slicing architecture

#### **1. Introduction**

Drowning is a major health problem worldwide. According to the World Health Organization (WHO, Geneva, Switzerland), in 2015, around 360,000 people died from drowning [1]. More than half of these deaths are of people younger than 25.

The WHO Global report on drowning [2] states that the highest rates of drowning deaths occur among children aged 1–4, followed by children aged 5–9 years. In fact, in countries like Australia, drowning is the leading cause of unintentional injury death in children aged 1–3 years, and in the USA, drowning is responsible for more deaths among children aged 1–4 years than any other cause (except birth defects) [3]. Furthermore, drowning is the third leading cause of death worldwide for those aged from 5 to 14. In the Western Pacific Region, children aged 5–14 years die more frequently from drowning than from any other cause.

Drowning happens quickly and quietly and its signs often go unnoticed. Young children can drown silently in as little as 25 s, even in the shallow end or in a baby pool [4]. For all of these reasons, it is important for parents and caregivers to actively supervise their children around water, even if lifeguards are present.

The same report identifies the absence of or inadequate supervision as key risk factors for the drowning of children [1]. Another report [5] from the Royal Life Saving Society Australia (RSLA, Sydney, Australia) linked distracted parents to 77.8% of drownings in children aged 5–9 years in public and commercial pools between 1 July 2005 and 30 June 2015. In the cases of drowning without supervision, the parent or caregiver of the child was missing, or physically near the child but distracted (talking to another adult or attending to another child in his/her care). Furthermore, the German Lifeguard Association (DLRG, Bad Nenndorf, Germany) (the biggest organization of its kind in the world) reported that more than 300 people died in Germany during 2018 (from the beginning of the year through the summer) and associated the growing number of children drowning with their parents' obsession with mobile phones [6]. In addition, Royal Life Saving Australia reported that, between 2002 and 2017, 447 children under the age of four drowned. Roughly 5% of those deaths were a direct result of a failure to supervise owing to the use of electronic devices (smartphone, tablet, laptop, and so on) [7].

**Citation:** Cepeda-Pacheco, J.C.; Domingo, M.C. Deep Learning and 5G and Beyond for Child Drowning Prevention in Swimming Pools. *Sensors* **2022**, *22*, 7684. https://doi.org/10.3390/s22197684

Academic Editors: Andrei Velichko, Dmitry Korzun and Alexander Meigal

Received: 25 July 2022 Accepted: 6 October 2022 Published: 10 October 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

To solve the problem of inadequate child supervision, in this paper, we propose a novel 5G and beyond child drowning prevention system based on deep learning that detects and classifies distractions of inattentive parents or caregivers. It can be deployed in indoor swimming pools or outdoor locations such as beaches or aquatic recreation locations aided by unmanned aerial vehicles (UAVs, i.e., drones). The system detects distracted parents/caregivers in charge of a minor and alerts them to concentrate on the supervision task. A 5G network slicing architecture for child drowning prevention is also introduced. To the best of our knowledge, this is the first paper that aims to avoid child drowning by detecting and classifying distractions of parents in charge of a minor in aquatic recreational spaces; it is also the first paper to use digital technologies such as artificial intelligence and modern communication technologies (such as 5G and beyond) to detect and alert distracted parents or caregivers. The main contributions of this study are as follows:


The experimental results prove the feasibility of the child drowning prevention system. The proposed model can successfully perform a seven-class classification with very high accuracies (98%, 94%, and 90% for each model, respectively).

The paper is structured as follows. In Section 2, we introduce our proposed 5G-enabled child drowning prevention system. In Section 3, we identify the most relevant key performance indicators (KPIs). In Section 4, we explain the 5G-service-based architecture. In Section 5, we discuss the proposed 5G network slicing architecture for child drowning prevention from a technical perspective. In Section 6, we briefly describe the convolutional neural network architectures used in this research. The experiments and results are presented in Section 7. Finally, Section 8 concludes the paper and highlights some future research directions.

#### *Related Work*

Monitoring and supervision at swimming pools or aquatic recreation locations has drawn the attention of the research community [8], particularly for drowning prevention and early detection of possible drowning [9,10].

Some proposed drowning detection systems [11–13] employ underwater cameras to detect motionless drowned victims sunk at the bottom of the pool using techniques such as background extraction [13], which consists of detecting the moving objects by identifying the difference between the current frame and a reference frame, often called a 'background image' or 'background model'; however, these systems are limited to the victims that have sunk to the bottom of the pool, thus wasting precious time, as they are unable to detect the victims prior to them drowning.
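The frame-differencing idea behind background extraction can be illustrated with a minimal Python sketch. It is not the implementation used in [11–13]; the function name `foreground_mask`, the grayscale list-of-lists image format, and the threshold value are assumptions chosen only for illustration.

```python
def foreground_mask(frame, background, threshold=30):
    """Mark pixels whose absolute difference from the background exceeds a threshold.

    frame and background are equally sized 2-D lists of grayscale values (0-255).
    The threshold is a hypothetical tuning value, not taken from the cited systems.
    """
    return [
        [1 if abs(pixel - ref) > threshold else 0
         for pixel, ref in zip(frame_row, background_row)]
        for frame_row, background_row in zip(frame, background)
    ]


# Toy 2 x 3 example: only the middle column changes noticeably.
background = [[10, 10, 10], [10, 10, 10]]
frame = [[12, 200, 10], [10, 190, 11]]
mask = foreground_mask(frame, background)
```

Pixels marked with 1 are candidate moving objects; small differences such as sensor noise stay below the threshold and are treated as background.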

Other proposed methods consist of overhead cameras mounted around the pool (such as our proposed system) [14–16]; these systems consist of two main parts: a vision component that can detect and track swimmers and an event-inference (water crisis) module that analyzes swimmer observation sequences for possible drowning behavior signals. Several studies have been carried out regarding the detection of swimmers based on overhead cameras [17,18]. This task is still challenging owing to disturbances at the water's surface (e.g., water exhibits random homogeneous blob movements, which could be easily misidentified as foreground objects) [19,20]. In addition, lighting and color variations over time due to ambient brightness further complicate automated monitoring based on video surveillance. Several works apply background subtraction to solve the swimmer detection problem [13,19,20].

Currently, the development of wearable devices has become a very common practice. It has allowed researchers to develop sensor systems to monitor the physiological signals of high-performance swimming athletes [21,22], to detect pre-drowning symptoms and alert rescue staff [23], and to supervise children. Wearable sensor systems for infants can perceive external threats such as falls or drowning; the methods and techniques applied in wearable sensor systems are analyzed and discussed in [24].

In [20], a real-time detection method for constant monitoring of swimmers at an outdoor swimming pool is presented. A background subtraction scheme is introduced, where the background has been modeled as a composition of homogeneous region processes. Furthermore, to solve the foreground (swimmer) detection problem, a thresholding scheme has been devised to attain a good trade-off between maximizing target detection and minimizing background noise. In addition, to enhance the visibility of the foreground (swimmer), a pre-processing filtering scheme able to classify each pixel of a current frame into different pixel types has been proposed; this way, appropriate filtering actions such as color compensation can be applied when necessary. In [19], a background subtraction scheme based on motion and intensity information has been developed to identify swimmers in each video frame. Image pixels are classified according to motion as random/stationary, ripple, and swimming. A motion map is developed through the computation of dense optical flow that characterizes the motion contents of image pixels over a short sequence of video frames rather than a single image. Intensity information has been modeled using a block-based mixture of Gaussians (MoG). However, these systems ([19,20]) only specify how to detect a swimmer; they do not specify how to detect if he/she is drowning.

Current improvements in computing power have enabled the use of deep learning algorithms for human detection and other computer-vision-related problems. Most state-of-the-art object detectors use deep learning algorithms to extract features from input images (or videos) and perform classification and localization [25]. In [26], a method to detect swimmers in low-quality video using two convolutional neural networks (YOLOv2 and Tiny-YOLO) has been proposed. Our proposed 5G and beyond child drowning prevention system is also based on deep learning (convolutional neural networks), but focuses on the detection of distracted parents/caregivers, not swimmer detection (as in [26]).

In [27], a real-time vision system to detect drowning incidents using overhead cameras at an outdoor swimming pool is presented. The system uses a model comprising data fusion and hidden Markov modeling to detect drowning events early. The authors focus on (1) foreground swimmer silhouette extraction and (2) behavioral recognition. The foreground detection module has already been reported in [20]. The system has analyzed water crisis episodes consisting of victims that suffer distress incidents (victims exhibit involuntary movements such as active struggling or waving [28]) and drowning incidents understood as suffocation. The detection of distress and early drowning episodes is based on visual indicators (instinctive response with repetitive arm movements of extending out and pressing down, perpendicular body (vertical up) in water with small movements in horizontal and diagonal directions). The experiments try to differentiate between three events (water crisis, treading, and normal swimming); the best testing errors obtained are 15.15% and 15.57%, with support vector machine (SVM) and reduced model (RM) classifiers, respectively. Furthermore, the false alarm rate is about one to five cases per camera per day. In addition, one challenge of their proposed system is that a drowning incident may happen in a way that is different from the learned instinctive drowning response model. In this case, it must be determined how the system will react to an event for which it is not trained [27].

Furthermore, specialists emphasize that drowning happens quickly and quietly, and its signs often go unnoticed (see Section 1). For this reason, in our current paper, we propose a novel technique to detect child drowning episodes that focuses on the caregivers of the children. To improve swimming pool safety, we use deep learning to detect a distracted caregiver of a child in a swimming pool, similarly to the detection of drivers' distractions on the road. The behavior of a driver is essential for traffic safety. On-road distractions deteriorate the driver's performance and may lead to the loss of vehicular control and traffic accidents. A distraction is anything that diverts the driver's attention from the primary task of navigating the vehicle and responding to critical events. The authors in [29–31] use deep learning to detect distracted driver behaviors such as texting, operating the radio, drinking, fixing hair and makeup, talking on the phone, and so on.

#### **2. The Proposed 5G and Beyond Child Drowning Prevention System**

In the proposed scenario, families need to register when they arrive at the swimming pool. A facial image of each family member is acquired to recognize them. The swimming pool database registers the age of each child and links the photos of the children with their parents and/or other family member/s. The family decides who is going to be the primary caregiver that is going to watch the children and be responsible for their safety inside the swimming pool and a pager is given to him/her. This task can be shared between the parents (or other family members 18 years or older) simultaneously, which means that none of them should be distracted. It is also possible that there is only one primary caregiver during a certain time slot and another during the next time slot (e.g., the father is the primary caregiver from 15:00 to 17:00 and the mother from 17:00 to 19:00).

After all of these decisions are made using the swimming pool app, the family can access the swimming pool area. The proposed 5G and beyond child drowning prevention system is shown in Figure 1.

If the primary caregiver decides to supervise the children from outside the pool, a specific seat close to the swimming pool will be assigned to him/her. This guarantees that he/she will have a clear view of the swimming pool to supervise the children. In addition, a video camera will be directly facing him/her to detect distractions. The cameras are strategically located at an optimal distance so as not to obstruct people. In the case of multiple primary caregivers, the same or multiple video cameras can be facing them. Real-time video will be transmitted to the command center. Distractions of primary caregivers will be detected using a deep learning algorithm.

If the primary caregiver decides to supervise the child inside the pool, different video cameras mounted surrounding the pool will detect him/her using computer vision. For this purpose, a high-quality monitoring system is required that consists of video cameras with multiple high-end lenses that can zoom and steer around to detect critical details. The video cameras need to coordinate with each other to be able to track the primary caregivers at any time to detect possible distractions. The video cameras will identify the primary caregiver from different perspectives inside the pool. Automated analysis of the video footage will be carried out. A caregiver can be considered as 'distracted' if the convolutional neural network analyzes the images from all of the video cameras that are simultaneously capturing his/her behavior and he/she is characterized as being 'distracted' by most of them. That is, the images of the parents/caregivers are not combined, but the images from each camera are classified into a category. It is decided if the parent/caregiver is distracted or not by analyzing which category is repeated the most.
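The per-camera decision rule described above (each camera image is classified separately and the most repeated category wins) can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation; the function name `fuse_camera_labels` and the class labels used in the example are hypothetical.

```python
from collections import Counter


def fuse_camera_labels(labels):
    """Return the class label predicted by the most cameras, or None if no camera reported.

    labels holds one predicted category per camera for the same time instant.
    If several categories are equally frequent, the one seen first is returned.
    """
    if not labels:
        return None
    return Counter(labels).most_common(1)[0][0]


# Hypothetical labels from three cameras observing the same caregiver.
fused = fuse_camera_labels(["phone", "attentive", "phone"])
```

How ties between categories should be resolved is not specified in the text; the first-seen rule here is only one possible choice.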

**Figure 1.** Proposed 5G-enabled child drowning prevention system.

When a distraction event is detected, an alert will warn the primary caregiver so that he/she focuses on active child supervision. We assume that alerts will be sent immediately if the children being supervised are 5 years old or under. For children who can swim (usually older than 5 years), parents will be alerted if the convolutional neural network detects continuous distracted behavior for more than 10 s, because drowning accidents happen very quickly. Alert messages can be sent to a pager. The pager lights up or vibrates if the caregiver is distracted. Alert messages can also be played through the swimming pool speakers closest to the caregiver. Furthermore, lifeguards will also receive these notification messages and act accordingly. This information will be useful, for example, if certain caregivers are notified several times; in this case, lifeguards can supervise the associated children more closely and talk to the parents/caregivers or take other necessary steps if no change in their attitude is observed.
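The alert timing rule can be summarized in a short Python sketch. The two constants come from the text (5 years and 10 s); the function name `should_alert` and the choice to base the decision on the youngest supervised child are assumptions made for illustration.

```python
IMMEDIATE_ALERT_MAX_AGE = 5      # years; children this age or younger trigger immediate alerts
SUSTAINED_DISTRACTION_S = 10.0   # seconds of continuous distraction tolerated for older children


def should_alert(youngest_child_age, distracted_for_s):
    """Decide whether the primary caregiver should be alerted.

    youngest_child_age: age in years of the youngest supervised child (assumed criterion).
    distracted_for_s: duration in seconds of the current continuous distraction.
    """
    if distracted_for_s <= 0:
        return False  # no distraction currently detected
    if youngest_child_age <= IMMEDIATE_ALERT_MAX_AGE:
        return True   # alert as soon as a distraction is detected
    return distracted_for_s > SUSTAINED_DISTRACTION_S
```

For example, `should_alert(4, 0.5)` is true, whereas `should_alert(8, 5.0)` is false until the distraction has lasted longer than 10 s.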

#### **3. Related Key Performance Indicators**

The proposed 5G-enabled child drowning prevention system can be identified as a mission critical communications (MCC) service because it requires real-time and reliable communications for a large number of users, as well as strong security and pre-emption handling [32]. Table 1 summarizes the major key performance indicators (KPIs) for child drowning prevention. The end-to-end latency can be measured as the time interval required to send packets from a source to a destination, measured at the application level.
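As a minimal sketch of this latency definition, the following Python functions compute the application-level latency from a send timestamp and a receive timestamp. The function names are hypothetical, and the sketch assumes that both timestamps are taken from synchronized clocks, which a real deployment would have to guarantee.

```python
def end_to_end_latency_ms(sent_at_s, received_at_s):
    """Application-level latency in milliseconds from two timestamps given in seconds."""
    if received_at_s < sent_at_s:
        raise ValueError("receive timestamp precedes send timestamp")
    return (received_at_s - sent_at_s) * 1000.0


def worst_case_latency_ms(samples):
    """Largest latency over an iterable of (sent_at_s, received_at_s) pairs."""
    return max(end_to_end_latency_ms(sent, received) for sent, received in samples)


# Two hypothetical packets, timestamped at the sender and at the receiver.
worst = worst_case_latency_ms([(0.0, 0.25), (1.0, 1.5)])
```

The worst-case value, rather than the average, is the quantity to compare against a latency KPI for a mission critical service.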

**Table 1.** Main KPIs for child drowning prevention.


Mission critical: A quality or characteristic of a communication activity, application, service, or device that requires low setup and transfer latency, high availability and reliability, the ability to handle large numbers of users and devices, strong security, and priority and pre-emption handling.

It would be possible for our use case to connect to the nearest edge server via Wi-Fi 7 (802.11be), because this standard will support a maximum throughput of at least 30 Gbps. Features operating at both the MAC (medium access control) layer and the physical layer (PHY) such as multi-access point coordinated beamforming, time-sensitive networking, and multi-link operation will bring Wi-Fi 7 latency performance into the sub-10 ms realm. These characteristics would be enough to support our high-throughput low-latency child drowning prevention use case. However, the IEEE task group has so far announced only draft 2.0 of 802.11be, and the final version is expected to be released in 2024.

IEEE 802.11ax (Wi-Fi 6) received final approval from the IEEE Standards Board on 1 February 2021. This standard offers a theoretical speed of up to 9.6 Gbps and 10 ms latency. Wi-Fi 6 does not perform well in large-scale outdoor coverage scenarios and cannot meet the ultra-low latency requirements (<10 ms).

It has been shown in [33] that Wi-Fi 6 can achieve ultra-reliable low latency performance (i.e., <1 ms packet latency at 99.999% reliability) only when optimized and operating in a low load up to 0.16 bps/Hz that is not appropriate for our use case.

On the other hand, 5G can reach up to 10 Gbps (only slightly higher than Wi-Fi 6), but this technology has been designed to address the requirements of ultra-reliable and low-latency communications (URLLC). URLLC has stringent requirements for capabilities such as latency, reliability, and availability. Some use cases include wireless control of industrial manufacturing or production processes, remote medical surgery, and transportation safety. It has been demonstrated in [33] that 5G NR (new radio)-FDD (frequency division duplex) has superior URLLC performance and meets the sub-ms delay requirement at >5× higher load than Wi-Fi 6.

Therefore, 5G is the appropriate technology for our use case thanks to its better latencies. The proposed system requires that real-time video is backhauled from the video cameras to the command center for remote control and analysis. The number of video cameras will vary depending on the size of the swimming pool. Moreover, 5G can be deployed in indoor swimming pools or even in outdoor locations such as beaches or aquatic recreation locations that extend several kilometers. In these cases where so many video images need to be processed as quickly and efficiently as possible, a 5G network is required to provide sufficiently high uplink data throughput and transmission reliability as well as sufficiently low latency. The short end-to-end latency will enable alert messages to be sent as fast as possible if necessary as drowning happens quickly. Reliability is critical to detect incidents, which means that performance should not be compromised irrespective of the channel conditions.

#### **4. 5G Service-Based Architecture**

Next, the 5G system architecture of the non-roaming case is illustrated in Figure 2 [34]. The user plane (UP) and control plane (CP) are decoupled to obtain scalable and flexible deployments. Whereas the CP is used for network signaling, the UP carries only user traffic.

**Figure 2.** Service-based representation of the 5G non-roaming system architecture [34].

The user equipment (UE) in the user plane is connected to either the radio access network (RAN) or a non-3GPP access network (e.g., wireless local area network, WLAN) as well as to the access and mobility management function (AMF).

Next, we explain the network functions (NFs) of the 5G core network (see the upper part of the figure):


#### **5. A 5G Network Slicing Architecture for Child Drowning Prevention**

Network slicing refers to the division of a physical network into multiple logical networks (network slices), so that each logical network can provide specific network characteristics for a particular use case. Network slicing provides services across multiple network segments and different administrative domains. A 5G slice can combine resources that belong to different infrastructure providers [35]. Network slicing is the best way for network operators to build and manage a network that meets the requirements from a wide range of users. Network slicing provides service flexibility and the ability to deliver services faster with high security, isolation, and according to the quality of service (QoS) requirements of the different applications. This way, network operators can manage their network resources efficiently and provide differentiated and scalable services.

Slices are isolated from each other, which means that faults or errors in one slice do not affect the proper functioning of another slice.

Next, we introduce the main design elements of our proposed 5G network slicing architecture for child drowning prevention (see Figure 3).

**Figure 3.** Network slicing architecture for child drowning prevention.

It is divided into three layers plus an additional management and orchestration layer, whose basic functionalities are summarized as follows:

Infrastructure layer: It refers to all of the parts of the physical network, because slices should be end-to-end. This layer includes the IoT networks, telecommunication networks, satellites, edge computing technologies, and the cloud. It provides the allocation of virtual or physical resources such as computing, storage, network, or radio.

We assume that all network devices are software defined networking (SDN)-enabled switches managed by SDN controllers that are able to program their routing tables.

The 5G core is generally divided into 'core—user plane' in charge of bearer delivery and 'core—control plane (CP)' in charge of control functions. Core—control plane will stay in the central cloud (network function virtualization, (NFV)), but 'core—user plane (UP)' will be distributed to its tens of edge nodes nationwide and be installed in edge clouds (NFV). Security, reliability, and latency will be critical for a 5G slice supporting the child drowning prevention case. For such a slice, all of the necessary (and potentially dedicated) network functions should be instantiated at the edge node. We consider that all the 5G core functions/units (UP) should be in the edge cloud close to the users. Multi-access edge computing (MEC) drastically reduces the latency between network nodes and remote servers in the cloud [36] because video processing servers are placed right where the core functions/units are located. This way, we can minimize the transmission delay to match the requirements of our delay-critical slice for such an MCC application. Furthermore, machine learning is crucial in supporting MCC by enabling a local decision making process at the edge servers [37].

Network function layer: It encapsulates all of the operations related to the configuration and life cycle management of the network functions that offer an end-to-end service. Network function virtualization (NFV) [38] and SDN (software-defined networking) [39] are two fundamental technologies to configure the virtual network resources. NFV decouples specific network functions from dedicated and expensive hardware platforms. This technology can provide software building blocks named VNFs (virtualized network functions) for the data plane that can be connected and chained according to the service type. SDN technology enables the separation of the control plane from the data plane to offer a flexible resource management.

Service layer: This layer provides a unified vision of the service requirements. Each service is represented by a service instance, which embeds all of the network characteristics that satisfy the SLA (service level agreement) requirements such as throughput or latency. A network slice instance (NSI) is a managed entity created by an operator's network with a lifecycle independent of the lifecycle of the service instance(s) [40]. An NSI provides the network characteristics required by a service instance. It is also possible that an NSI is shared across multiple service instances of a network operator.

Based on the main KPIs (see Section 3) and functional requirements of our use case, child drowning prevention, we propose that the drowning prevention slice has ultra-reliable and low-latency communications (URLLC) requirements. URLLC use cases (such as mission-critical applications) have stringent latency, reliability, and availability requirements.

Management and Orchestration (MANO): It is the framework for the management and orchestration of all network resources (computing, networking, storage, and virtual machine) in the cloud. It comprises three functional blocks: NFV orchestrator (NFVO), VNF manager (VNFM), and virtualized infrastructure manager (VIM). NFVO performs onboarding of new network service and VNF packages, network service lifecycle management, and resource management. VNFM manages the lifecycle of VNF instances. VIM controls and manages the lifecycle of virtual resources as requested by the NFVO in an NFV infrastructure (NFVI) domain.
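To make the idea of a service instance carrying SLA requirements more concrete, the following Python sketch models the requirements of a URLLC-type slice and checks measured values against them. The class name, field names, and the numeric thresholds are illustrative placeholders, not the KPI values of Table 1 or any standardized descriptor format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliceRequirements:
    """SLA-style requirements attached to a network slice (illustrative only)."""
    name: str
    service_type: str        # e.g., "URLLC"
    max_latency_ms: float    # placeholder end-to-end latency bound
    min_reliability: float   # placeholder fraction of successfully delivered packets

    def is_met_by(self, latency_ms, reliability):
        """Check measured latency and reliability against the slice requirements."""
        return latency_ms <= self.max_latency_ms and reliability >= self.min_reliability


# Hypothetical requirements for the drowning prevention slice.
drowning_prevention = SliceRequirements(
    name="child-drowning-prevention",
    service_type="URLLC",
    max_latency_ms=10.0,
    min_reliability=0.99999,
)
```

An orchestrator could use such a check when deciding whether an existing network slice instance can be shared with a new service instance or a dedicated one is needed.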

#### **6. Convolutional Network Models**

Convolutional neural networks (CNNs): They were created out of the need to be able to process images effectively and efficiently; nowadays, they are also used for speech recognition. However, their strength is in image processing. Next, we describe the CNNs used in our research.

VGG model: This architecture was proposed by Karen Simonyan and Andrew Zisserman [41]; it was among the top performers in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14), taking first place in localization and second place in classification. It has 16 weight layers in the VGG-16 version and 19 in the VGG-19 version. The architecture processes input images of 224 × 224 pixels with three channels for color images (RGB). The image is passed through five convolutional blocks (Figure 4). In VGG-19, the first two blocks incorporate two convolutional layers each and the remaining three incorporate four convolutional layers each. Each convolutional layer uses 3 × 3 filters and a rectified linear unit (ReLU) activation function; each convolutional block also ends with a max-pooling layer to reduce the image size and help prevent overfitting. The upper layers are two fully connected layers with 4096 neurons each, followed at the top by one softmax output layer that classifies images into 1000 different categories.

**Figure 4.** VGG-16 and VGG-19 architecture.

ResNet model: This advanced convolutional neural network was proposed by Kaiming He et al. in 2016 [42]. The ResNet-50 version consists of 50 layers. The model is based on residual and identity blocks that use skip connections (shortcuts) (Figure 5), through which the input of a block is passed directly to a deeper layer. In other words, a plain deep convolutional neural network inspired by VGG, with 3 × 3 filters and ReLU activation functions, is turned into a residual network by adding skip connections that define residual blocks. At the top, the architecture contains a fully connected output layer with a softmax activation function for classification. Figure 6 shows the general configuration of the residual network architecture, including ResNet-50, ResNet-101, and ResNet-152.

**Figure 5.** (**a**) ResNet identity block and (**b**) ResNet convolutional block.

**Figure 6.** Configuration of residual network architecture, including ResNet-50, ResNet-101, and ResNet-152.
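The skip connection described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the implementation used in the experiments: `residual_block` and the stand-in transformation `f` are hypothetical names, and real blocks operate on image tensors with convolutions rather than on short lists.

```python
def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]


def residual_block(x, f):
    """Identity block: output = ReLU(F(x) + x).

    `f` stands in for the stacked convolutional layers F; it must return a
    list of the same length as `x` so that the two can be added.
    """
    fx = f(x)
    return relu([a + b for a, b in zip(fx, x)])


# If the learned transformation F contributes nothing (all zeros), the block
# simply passes its (rectified) input on to the deeper layer.
passthrough = residual_block([1.0, -2.0, 3.0], lambda v: [0.0] * len(v))

# With a non-trivial F, the block outputs the input plus a learned correction.
corrected = residual_block([1.0, -2.0, 3.0], lambda v: [0.5 * a for a in v])
```

In the convolutional block of Figure 5b, the shortcut itself contains a convolution so that the shapes of the two branches match before they are added; the sketch covers only the identity case.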

Inception-v3 model: This convolutional neural network was developed by Google. The first version of Inception, called "GoogLeNet", was presented in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14) [43]. This first version of the architecture is made up of 22 layers, including convolutional layers, pooling layers, and a characteristic layer called inception; the latter is a type of convolutional layer that applies 1 × 1, 3 × 3, and 5 × 5 filters simultaneously (Inception blocks) (Figure 7). The number of parameters to calculate is greatly reduced by what Google called bottlenecks, which are convolutional layers with 1 × 1 filters that reduce the complexity of the network. Google also included auxiliary classifiers, intended to facilitate the backward propagation of the gradients and to reduce the cost involved. Reducing the number of parameters and the complexity therefore resulted in a more powerful network.

**Figure 7.** (**a**) Inception-A block, (**b**) inception-B block, (**c**) inception-C block, (**d**) reduction-A block, and (**e**) reduction-B.
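The effect of a 1 × 1 bottleneck on the number of parameters can be checked with simple arithmetic. The sketch below compares a direct 5 × 5 convolution with the same convolution preceded by a 1 × 1 bottleneck; the channel counts are illustrative assumptions and are not taken from the Inception architecture or from our experiments.

```python
def conv_params(in_channels, out_channels, kernel):
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return in_channels * out_channels * kernel * kernel


# Hypothetical channel counts, chosen only to make the arithmetic concrete.
c_in, c_mid, c_out = 256, 64, 128

# Direct 5 x 5 convolution from c_in to c_out channels.
direct = conv_params(c_in, c_out, 5)

# 1 x 1 bottleneck down to c_mid channels, then the 5 x 5 convolution.
bottleneck = conv_params(c_in, c_mid, 1) + conv_params(c_mid, c_out, 5)

reduction_factor = direct / bottleneck
```

With these numbers, the direct convolution needs 819,200 weights and the bottleneck version 221,184, i.e., roughly 3.7 times fewer.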

Figure 8 shows the inception and reduction blocks that were set for the third version of this architecture.

**Figure 8.** Inception-v3 architecture.

#### **7. Experiments and Results**

#### *7.1. Dataset*

The dataset is a collection of 38,000 images generated by us in the summer of 2019. The location of the video recording was the facilities of the Fontsanta swimming pool, located at Carrer del Marquès de Monistrol, 30, 08970 in Sant Joan Despí, Barcelona—Spain. Five primary caregivers (people in charge of the children) were involved in the development of these experiments. They were recorded on video, doing different activities (one video for each action related to each of the different categories) both inside and outside the water. The images captured from each video correspond to a specific category so that the images have been identified and labelled manually for each category. The capture was made taking into account that only the participants appear in the video to protect the privacy and confidentiality of other people who are at the swimming pool. The videos were recorded with high-resolution smart mobile devices (1920 × 1880), although the images are preprocessed according to the input data requirements of each model (224 × 224). The images were finally collected and classified into seven (7) categories:


To achieve a great performance during the training process with our own dataset, the videos were not shot from a single angle. Instead, they were shot from different angles, covering all potential perspectives of a caregiver. Furthermore, because the swimming pool is located outdoors, the varying lighting conditions throughout the day provide a richer dataset.

#### *7.2. Experimental Settings*

The dataset consists of approximately 38,000 images; it was split into two parts, keeping a ratio of 8:2, i.e., around 30,000 images for training and 8000 for testing. In addition, data augmentation was used to expand the training set and obtain better generalization. Data augmentation is a technique that expands our original training dataset virtually, through a random series of transformations from the original image, resulting in new plausible-looking images, in order to obtain a larger number of images for training. In computer vision, this technique became a standard for regularization, as well as to improve accuracy, generalization, and control of overfitting in CNNs. For this research, the techniques chosen are as follows: rescale = 1./255, rotation\_range = 2, shear\_range = 0.2, zoom\_range = 0.2, and horizontal\_flip = True.
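The augmentation settings listed above have the same names as the arguments of the Keras `ImageDataGenerator` class, so a minimal sketch of the configuration could look as follows. It assumes that the images are stored in one sub-folder per category; the function name `make_generators` and the directory arguments are hypothetical, and the exact loading code of the experiments may differ.

```python
# Augmentation settings exactly as listed in the text.
AUGMENTATION_PARAMS = {
    "rescale": 1.0 / 255,
    "rotation_range": 2,
    "shear_range": 0.2,
    "zoom_range": 0.2,
    "horizontal_flip": True,
}


def make_generators(train_dir, test_dir, image_size=(224, 224), batch_size=64):
    """Return (train, test) iterators; the directory layout is an assumption."""
    # Imported lazily so the settings above can be inspected without TensorFlow.
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    train_gen = ImageDataGenerator(**AUGMENTATION_PARAMS)
    # Test images are only rescaled, never augmented.
    test_gen = ImageDataGenerator(rescale=AUGMENTATION_PARAMS["rescale"])

    train = train_gen.flow_from_directory(
        train_dir, target_size=image_size, batch_size=batch_size,
        class_mode="categorical")
    test = test_gen.flow_from_directory(
        test_dir, target_size=image_size, batch_size=batch_size,
        class_mode="categorical", shuffle=False)
    return train, test
```

Applying the random transformations only to the training images keeps the testing set representative of the unmodified camera input.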

We have selected the images from a different subject for testing purposes in order not to contaminate the testing set. Figures 9 and 10 show a set of images of each category with their training and testing labels.

**Figure 9.** Image set of each category with their respective training labels.

**Figure 10.** Image set of each category with their respective testing labels.

The algorithms were implemented in Python in several Jupyter notebooks (Jupyter Notebook version 6.0.3, installed with the Anaconda suite). The experiments were carried out on a Lenovo computer with a 2.9 GHz Intel Xeon processor and 72 GB of RAM, without a GPU.

We implemented three different algorithms using the preset models from the Python Keras library; each one was specifically adapted to obtain optimal results after each training. The transfer learning technique was used (further details are provided in Section 7.4) to take advantage of the pre-trained weights. Early stopping and dropout were implemented to avoid overfitting and thus improve the generalization capacity. Accuracy was selected as the metric to evaluate the performance of each algorithm during the training process.

The setup of each model to be used is detailed below.

#### *7.3. Convolutional Neural Network Architectures*

In this paper, experiments were performed to evaluate the proposed approach with three different CNN architectures: VGG-19, ResNet-50, and Inception-v3. Table 2 presents a summary of the configuration for each model. For all experiments, we used an image size of 224 × 224 × 3 and a batch size of 64.


**Table 2.** Architectures of the three CNN models.

#### 7.3.1. VGG-19

We implemented the VGG-19 version because it has a greater number of layers (deeper network) compared with the VGG-16 version mentioned above. It is made up of a 224 × 224 × 3 input layer, five convolutional blocks with kernel 3 × 3, ReLU activation function, without padding, and a maxpooling layer after each block followed by a flattened layer and two additional blocks; each additional block consists of a fully connected dense layer with 4092 neurons, a BatchNormalization layer, and a dropout layer. The last layer is a dense layer with a softmax activation function that contains seven neurons to classify our categories.

#### 7.3.2. ResNet-50

This model contains an input layer of 224 × 224 × 3 and a 50-layer stack of convolutional and identity blocks with their respective skip connections, followed by a global average pooling layer. At the top of the model, we have added two additional blocks; each block consists of a fully connected dense layer with 2048 neurons, a BatchNormalization layer, and a dropout layer. The last layer is a dense layer with a softmax activation function that contains seven neurons for our classification.

#### 7.3.3. Inception-v3

This model is composed of a 224 × 224 × 3 input layer and two convolutional blocks of three and two layers, each followed by a maxpooling layer. The central part consists of several types of inception and reduction blocks, followed by a global average pooling layer. At the top of the model, we added two additional blocks; each block consists of a fully connected dense layer with 2048 neurons, a BatchNormalization layer, and a dropout layer. The last layer is a dense layer with a softmax activation function that contains seven neurons for our classification.

#### *7.4. Training*

The dataset consists of approximately 38,000 images (N records); it was split into two parts in a ratio of 8:2, i.e., around 30,000 images for the training set (n records) and 8000 for the testing set (N−n records). For the training, we applied cross-validation, a technique commonly used to validate machine learning models and to estimate the performance of a trained model on unseen data. The most robust and widely used variant is K-fold cross-validation, which splits the training dataset into K subsets (see Figure 11). In each iteration, one subset is used as validation data (validation fold) and the remaining K−1 subsets are used as training data (training folds). The process is repeated for K iterations, so that each subset serves once as validation data, and the arithmetic mean of the K results gives a single score. This method is effective because the model is evaluated on K combinations of training and validation data, but it has the disadvantage of being computationally slow. The choice of K depends on the size of the dataset; values from 5 to 10 are most common. If the model (estimator) is a classifier and the target variable (y) is binary or multiclass (as in this research), the StratifiedKFold technique is used by default. This approach builds stratified folds, i.e., it keeps the proportion of samples from each class in all folds, so that the classes are distributed equally across the training and validation folds. It is useful when unbalanced datasets are used.

**Figure 11.** Use of each fold in the cross-validation process (fivefold representation).
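To make the stratified splitting concrete, the following sketch builds K folds in plain Python so that each fold keeps the class proportions of the whole training set. It is a simplified illustration (no shuffling, hypothetical function name `stratified_kfold`, toy labels), not the routine used in the experiments; libraries such as scikit-learn offer an equivalent `StratifiedKFold` class.

```python
from collections import defaultdict


def stratified_kfold(labels, k=5):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal the samples of each class round-robin so every fold gets its share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    for i in range(k):
        validation = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, validation


# Toy unbalanced label list: 10 "watching" and 20 "distracted" samples.
labels = ["watching"] * 10 + ["distracted"] * 20
splits = list(stratified_kfold(labels, k=5))
```

Every validation fold in this toy example contains two "watching" and four "distracted" samples, which is the 1:2 ratio of the full set.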

#### 7.4.1. Loss or Cost Function

A loss function is employed to optimize a machine learning algorithm. Several different cost functions can be used, and each of them penalizes errors differently. The loss function most commonly used in deep neural networks for classification problems is cross-entropy. In this research, we employed categorical cross-entropy, a loss function used in multi-class classification tasks in which each sample belongs to exactly one category (with a probability of 1) and to no other category (with a probability of 0), and the model must decide which category each sample belongs to.
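For a single sample with a one-hot target, categorical cross-entropy reduces to the negative logarithm of the probability that the model assigns to the true category. The following minimal sketch shows this; the function name and the two prediction vectors are illustrative and are not taken from our experiments.

```python
import math


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Loss for one sample: -sum(t * log(p)) over all categories."""
    # eps guards against log(0) when a predicted probability is exactly zero.
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


# One-hot target over seven categories: the third category is the true one.
y_true = [0, 0, 1, 0, 0, 0, 0]

# Two hypothetical softmax outputs, both summing to 1.
confident = [0.01, 0.01, 0.94, 0.01, 0.01, 0.01, 0.01]
unsure = [0.10, 0.10, 0.40, 0.10, 0.10, 0.10, 0.10]

loss_confident = categorical_cross_entropy(y_true, confident)
loss_unsure = categorical_cross_entropy(y_true, unsure)
```

The confident prediction yields a loss of about 0.06 and the unsure one about 0.92, which shows how the function penalizes low probability on the true category.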

#### 7.4.2. Transfer Learning and Early Stopping

A model can be trained from scratch when it is not very large or when the necessary computational capacity for its execution is available. Alternatively, it is possible to take advantage of pre-established models and reuse them in new models. This technique is known as transfer learning: it allows us to transfer what a pre-trained model such as VGG-19, ResNet-50, or Inception-v3 has learned (these models are pre-trained to classify 1000 objects) and apply it to new classification algorithms. Furthermore, it is possible to unfreeze some pre-trained layers (fine-tuning) and re-train them along with the new fully connected layers; this increases the training time, but it helps to avoid overfitting problems and to obtain optimal performance from the algorithm.
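A minimal sketch of this idea with the Keras API is given below for ResNet-50, using the head described in Section 7.3.2 (two blocks of dense layer, BatchNormalization, and dropout, followed by a seven-neuron softmax layer). The dropout rate, the ReLU activation of the dense layers, the Adam optimizer, and the function name are assumptions made for illustration; they are not stated in the text.

```python
# Head configuration; dense_units, num_blocks, and num_classes follow the text.
HEAD = {
    "dense_units": 2048,
    "num_blocks": 2,
    "num_classes": 7,
    "dropout_rate": 0.5,  # assumption: the rate is not given in the text
}


def build_resnet50_classifier(weights="imagenet", trainable_base=False):
    """Pre-trained ResNet-50 base with a new classification head on top."""
    # Imported lazily so the configuration can be inspected without TensorFlow.
    import tensorflow as tf

    base = tf.keras.applications.ResNet50(
        include_top=False, weights=weights,
        input_shape=(224, 224, 3), pooling="avg")
    # Frozen base = pure transfer learning; set True to fine-tune all layers.
    base.trainable = trainable_base

    layers = [base]
    for _ in range(HEAD["num_blocks"]):
        layers += [
            tf.keras.layers.Dense(HEAD["dense_units"], activation="relu"),
            tf.keras.layers.BatchNormalization(),
            tf.keras.layers.Dropout(HEAD["dropout_rate"]),
        ]
    layers.append(tf.keras.layers.Dense(HEAD["num_classes"], activation="softmax"))

    model = tf.keras.Sequential(layers)
    # Optimizer choice is an assumption; loss and metric follow the text.
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

With the base frozen, only the new head is trained; setting `trainable_base=True` corresponds to fine-tuning and makes every epoch considerably more expensive, which matters on a machine without a GPU.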

A popular technique to overcome overfitting is early stopping. For this purpose, at each iteration, the training set is divided into training and validation folds. The training folds are used to train the model, and the validation folds are used to verify the accuracy of the model at the end of each epoch. As soon as the validation error starts to increase, the training is stopped.
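The stopping rule can be sketched independently of any framework. The function below scans a list of per-epoch validation losses and stops once the loss has not improved for a given number of epochs (the patience); the function name, the patience value, and the loss history are illustrative assumptions. Keras provides the same behavior through its `EarlyStopping` callback.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember it and reset the counter.
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    # Training ran to the end without triggering the stopping rule.
    return len(val_losses) - 1, best_epoch


# Hypothetical validation losses: improvement until epoch 3, then a plateau.
history = [0.90, 0.60, 0.45, 0.40, 0.42, 0.41, 0.44, 0.50]
stop_epoch, best_epoch = early_stopping_epoch(history, patience=3)
```

In this example, training stops at epoch 6 and the weights of epoch 3 would be the ones to keep, which is why early stopping is usually combined with a checkpoint of the best model.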

#### *7.5. Evaluation Metrics*

To evaluate the results, we used several metrics that are very common in machine learning applications for classification problems.

#### 7.5.1. Accuracy

It is defined as the number of predictions made correctly by the model divided by the total number of records.

$$accuracy = \frac{TP + TN}{TP + FP + FN + TN} \tag{1}$$

where *TP* represents true positives, *TN* represents true negatives, *FP* represents false positives, and *FN* represents false negatives.

#### 7.5.2. Precision

It measures the quality of the "positive" predictions: it is the number of correct positive predictions divided by the total number of positive predictions.

$$precision = \frac{TP}{TP + FP} \tag{2}$$

#### 7.5.3. Recall (Sensitivity) (True Positive Rate)

It is calculated as the number of correct positive predictions divided by the total number of positives.

$$recall = \frac{TP}{TP + FN} \tag{3}$$

#### 7.5.4. Specificity (True Negative Rate)

It is calculated as the number of correct negative predictions divided by the total number of negatives.

$$Specificity = \frac{TN}{TN + FP} \tag{4}$$

#### 7.5.5. F1 Score

It is the harmonic mean of precision and recall (sensitivity). Therefore, this score takes into account both false positives and false negatives.

$$F1\ score = 2 \times \frac{(precision \times recall)}{(precision + recall)}\tag{5}$$
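Equations (1)–(5) can be computed directly from the four counts of a one-vs-rest confusion matrix. The sketch below does so; the function names and the example counts are hypothetical, while the final check uses the precision and recall reported for the O\_distracted category of the Inception-v3 model (0.75 and 0.95).

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, Equation (5)."""
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp, fp, fn, tn):
    """Equations (1)-(5) from the counts of one category versus the rest."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": f1_score(precision, recall),
    }


# Hypothetical counts for one category, chosen only for illustration.
example = classification_metrics(tp=90, fp=10, fn=30, tn=870)

# Precision 0.75 and recall 0.95 give an F1-score of about 0.84.
f1_o_distracted = round(f1_score(0.75, 0.95), 2)
```

Because the harmonic mean is dominated by the smaller of its two inputs, a category with high recall but modest precision still ends up with a moderate F1-score.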

#### 7.5.6. Loss

Loss is the value that reflects the sum of errors in our model. It indicates whether the model is performing well (low value) or not (high value); accuracy, on the other hand, is the number of correct predictions divided by the total number of predictions.

Therefore, if we analyze these two metrics together (loss and accuracy) (see Table 3), we can deduce more information about the model performance. If loss and accuracy are low, it implies that the model makes small errors in most of the data. However, if both are high, it makes large errors in some of the data. Low accuracy but high loss would mean that the model makes large errors in most of the data. However, if the accuracy is high and the loss is low, then the model makes small errors in only some of the data, which would be the ideal case.

**Table 3.** Analysis of both loss and accuracy metrics together.


#### *7.6. Experimental Results*

After training with different configurations in the upper layers of each model, the following results were obtained.

#### 7.6.1. Loss and Accuracy

For training, cross-validation was performed, and the early stopping technique was used to avoid overfitting (as mentioned above); thus, training is stopped once the accuracy has reached its maximum value. Furthermore, a checkpoint was used to save the weights of the trained model whenever a new maximum value arises, so that they can be loaded in the future. Table 4 shows a summary of the accuracy and loss for the training and testing of each model. For training, all models achieve an accuracy above 99%, and ResNet-50 shows a higher loss value than the other two models. For testing, ResNet-50 achieves the highest accuracy (98%) but also the largest loss (0.3203). VGG-19 achieves an accuracy of 94% and the lowest loss of 0.0039, and Inception-v3 achieves an accuracy of 90% and a loss of 0.0364. Based on the accuracy, ResNet-50 performs considerably better than the other trained models.


**Table 4.** Accuracy and loss for VGG-19, ResNet-50, and Inception-v3 model.

Table 5 shows the accuracy achieved by each model with each of the classification categories (seven), evidencing the performance in more detail. VGG-19 achieves an accuracy of 100% for I\_watching and O\_reading categories, an average accuracy of 97.42% for the remaining categories, and a lower value of 72.73% for the O\_chatting category. Similarly, ResNet-50 achieves an accuracy of 100% for the I\_watching and O\_talk\_cell categories and the worst result for the O\_distracted category, with an accuracy of 95.4%. On the other hand, Inception-v3 achieves a high accuracy of 98.68% for the I\_distracted category and a lower accuracy of 66.6% for the O\_talk\_cell category.


**Table 5.** Accuracy of each model with each category.

As this research work focuses on parental distraction detection for child drowning prevention, the "In the water watching the children" (I\_watching) and "Out of the water watching the children" (O\_watching) categories are the most relevant ones to detect whether parents/caregivers are really supervising their children. All of the other categories indicate that the caregivers are distracted and should be warned. For I\_watching, the VGG-19 and ResNet-50 models achieve an accuracy of 100% and Inception-v3 achieves an accuracy of 96.83%. Likewise, for O\_watching, the VGG-19 and ResNet-50 models achieve an accuracy of 99.61% and Inception-v3 achieves an accuracy of 84.42% (Table 5).

#### 7.6.2. Precision, Recall, and F1-Score

Accuracy should not be used as the only metric for measuring model performance on an unbalanced dataset, as it counts the number of correct predictions regardless of the category and therefore leans towards the majority categories. For example, in a dataset of 100 cases where 95 belong to category "a" and five to category "b", if only the cases in the first category are correctly predicted, an accuracy of 95% is obtained. This value is misleading because the 95% reflects the correctly predicted cases of only one of the two categories, while no case of category "b" is detected.
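The example with 95 cases of category "a" and five of category "b" can be reproduced in a few lines. The sketch assumes a degenerate classifier that always predicts the majority category; the helper `recall_for` is a hypothetical name.

```python
# 95 cases of category "a" and 5 of category "b", as in the example above.
y_true = ["a"] * 95 + ["b"] * 5
# A degenerate classifier that always predicts the majority category.
y_pred = ["a"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall_for(label):
    """Fraction of the true cases of `label` that were predicted as `label`."""
    hits = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    total = sum(t == label for t in y_true)
    return hits / total


recall_a = recall_for("a")
recall_b = recall_for("b")
```

The accuracy is 0.95 even though the recall for category "b" is 0, which is why per-category metrics are reported alongside accuracy.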

Because our data are unbalanced, we also consider other metrics such as recall, precision, specificity, and F1-score to evaluate our results. Table 6 shows the values obtained in every category based on the above-mentioned metrics for VGG-19. F1-score is the harmonic mean of precision and recall and it takes into account both false positives and false negatives. The VGG-19 model performs well because it achieves an accuracy between 96% and 99% for most categories and a smaller accuracy of 84% for the *O\_reading* category. We can also observe that, for the most relevant categories (I\_watching and O\_watching), this model reaches an F1-score of 98%, demonstrating good performance in training.


**Table 6.** Evaluation metrics of the VGG-19 model.

Table 7 shows a summary of the already mentioned metrics in every category for the ResNet-50 model. It achieves an F1-score between 97% and 99% for all categories. It should be pointed out that this model reaches an F1-score of 98% and 99% for the most relevant categories (I\_watching and O\_watching), which is the best performance of the three models.


**Table 7.** Evaluation metrics of the ResNet-50 model.

Finally, Table 8 shows a summary of the already mentioned metrics in every category for the Inception-v3 model. This model achieves an F1-score between 91% and 98% for most categories, and a minimum F1-score of 79% for the O\_talk\_cell category. For the most relevant categories, the Inception-v3 model achieves an F1-score of 98% for I\_watching, but only 84% for O\_watching.

| **Category** | **Precision** | **Recall** | **F1-Score** | **Total Samples** |
|---|---|---|---|---|
| I\_distracted | 0.98 | 0.99 | 0.98 | 1291 |
| I\_watching | 0.98 | 0.97 | 0.98 | 883 |
| O\_distracted | 0.75 | 0.95 | 0.84 | 1458 |
| O\_talk\_cell | 0.98 | 0.67 | 0.79 | 1036 |
| O\_reading | 0.98 | 0.94 | 0.96 | 1069 |
| O\_chatting | 0.92 | 0.91 | 0.91 | 935 |
| O\_watching | 0.84 | 0.84 | 0.84 | 507 |

**Table 8.** Evaluation metrics of the Inception-v3 model.

According to this, we conclude that the ResNet-50 model shows excellent performance for this classification problem, reaching F1-scores of 98% and 99% in the I\_watching and O\_watching categories, respectively (see Table 7). However, the VGG-19 model with a value of 98% in the mentioned categories shows a solid performance as well (see Table 6).

#### 7.6.3. Confusion Matrix, False Positive Rate, and False Negative Rate

Figures 12–14 show the confusion matrices for each model. The main diagonal shows the number of matches found for each category between the true labels (columns) and the predicted labels (rows).

All categories are well predicted. Nevertheless, for the most relevant categories mentioned above, "In the water watching the children" (I\_watching) and "Out of the water watching the children" (O\_watching), some predictions are wrong, which means that, in some cases, certain distractions have not been detected. The three models sometimes classify distracted behaviors of caregivers as 'watching the children' (false positives). These cases represent a risk for children's safety but, fortunately, do not occur often compared with the true positive values for these categories. Inception-v3 obtains fewer false positives for I\_watching, with 14 versus 27 and 29 cases for VGG-19 and ResNet-50, respectively. ResNet-50 obtains fewer false positives for O\_watching, with 8 versus 21 and 79 cases for VGG-19 and Inception-v3, respectively. We define the false-positive rate as one minus the specificity or, equivalently, as the number of false positives divided by the sum of false positives and true negatives. The false-positive rate for I\_watching is 0.43%, 0.46%, and 0.22% for VGG-19, ResNet-50, and Inception-v3, respectively. For O\_watching, it is 0.31%, 0.12%, and 1.18%, respectively. The false-positive rates obtained are always very small; VGG-19 and ResNet-50 perform slightly worse than Inception-v3 for I\_watching, and ResNet-50 clearly shows the best results for O\_watching.

**Figure 12.** Confusion matrix VGG-19.

**Figure 13.** Confusion matrix ResNet-50.

**Figure 14.** Confusion matrix Inception-v3.

Furthermore, the three models sometimes classify "watching the children" as distracted behaviors (false negatives). These cases do not pose any risk, but could be annoying for caregivers who are warned to supervise the children when they actually were doing so. ResNet-50 and VGG-19 do not obtain any false negatives for I\_watching, versus 28 cases for Inception-v3. ResNet-50 and VGG-19 also obtain fewer false negatives for O\_watching, with 2 cases each, versus 79 cases for Inception-v3. We define the false-negative rate as one minus the recall. For I\_watching, the false-negative rate is 0% for VGG-19 and ResNet-50 and 3.17% for Inception-v3. For O\_watching, it is 0.39% for VGG-19 and ResNet-50 and 15.58% for Inception-v3. The false-negative rates obtained are very small (with the exception of the O\_watching category for Inception-v3). These results show that, with VGG-19 and ResNet-50, the child drowning prevention system works correctly with a minimal error rate compared with Inception-v3.
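The reported false-positive and false-negative rates can be reproduced from the error counts given in the text and the per-category sample counts of Table 8. The sketch below assumes that the three models were evaluated on the same testing set, so that the sample counts listed for Inception-v3 apply to all of them; the function and variable names are illustrative.

```python
# Testing samples per category (Table 8); assumed identical for all models.
TEST_SUPPORT = {
    "I_distracted": 1291, "I_watching": 883, "O_distracted": 1458,
    "O_talk_cell": 1036, "O_reading": 1069, "O_chatting": 935,
    "O_watching": 507,
}
TOTAL = sum(TEST_SUPPORT.values())


def false_positive_rate(fp, category):
    """FP / (FP + TN): the denominator is every sample outside the category."""
    return fp / (TOTAL - TEST_SUPPORT[category])


def false_negative_rate(fn, category):
    """FN / (FN + TP) = 1 - recall: the denominator is the category size."""
    return fn / TEST_SUPPORT[category]


# False positives reported in the text, per model.
FP = {"I_watching": {"VGG-19": 27, "ResNet-50": 29, "Inception-v3": 14},
      "O_watching": {"VGG-19": 21, "ResNet-50": 8, "Inception-v3": 79}}
# False negatives reported in the text, per model.
FN = {"I_watching": {"VGG-19": 0, "ResNet-50": 0, "Inception-v3": 28},
      "O_watching": {"VGG-19": 2, "ResNet-50": 2, "Inception-v3": 79}}

fpr_percent = {cat: {m: round(100 * false_positive_rate(n, cat), 2)
                     for m, n in models.items()} for cat, models in FP.items()}
fnr_percent = {cat: {m: round(100 * false_negative_rate(n, cat), 2)
                     for m, n in models.items()} for cat, models in FN.items()}
```

Under this assumption, the computed percentages match the values quoted above, e.g., 0.22% (false-positive rate) and 3.17% (false-negative rate) for Inception-v3 on I\_watching.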

#### **8. Conclusions and Future Work**

In this paper, a novel 5G and beyond child drowning prevention system that detects distracted parents or caregivers and alerts them to focus on active child supervision in swimming pools was developed. For this purpose, we evaluated and implemented three well-known CNN models: ResNet-50, VGG-19, and Inception-v3, to process and classify images. The proposed deep CNN models have revealed that they can be used to automatically detect (based on images) possible distractions of a caregiver who is supervising a child and generate alerts to warn them.

The proposed child drowning prevention system can successfully perform a seven-class classification with very high accuracies of 98% for ResNet-50, 94% for VGG-19, and 90% for Inception-v3. VGG-19 and ResNet-50 achieve the same high performance in the most relevant categories, I\_watching and O\_watching, with accuracies of 100% and 99.61%, respectively. For I\_watching, the three models achieve an F1-score of 98%. For O\_watching, they reach an F1-score of 98%, 99%, and 84% for VGG-19, ResNet-50, and Inception-v3, respectively. In terms of false-positive rate, the obtained values are always very small; VGG-19 and ResNet-50 perform slightly worse than Inception-v3 for I\_watching, and ResNet-50 shows the best results for O\_watching. The false-negative rates obtained are also very small (with the exception of the O\_watching category for Inception-v3). Compared with Inception-v3, the VGG-19 and ResNet-50 models perform quite well, with minimal false-negative rates of 0% for I\_watching and 0.39% for O\_watching. ResNet-50 performs a better classification than the other models for most categories. The proposed system was tested in a swimming pool but, in view of the results reached in this research, we think it could also be deployed at swimming lakes or beaches to help prevent child drowning.

On the other hand, special attention must be paid to security and privacy. Although there is no doubt that distracted parent detection can save lives, the associated privacy and security issues need to be analyzed to make our child drowning prevention system socially acceptable. These issues include access rights to data (video images), storage of data, security of data transfer, data analysis rights, and the governing policies. The proposed child drowning prevention system may be vulnerable to a variety of active and passive security attacks (such as eavesdropping) with disastrous consequences (especially if unauthorized parties access images of minors). For this reason, security and privacy risks should be minimized by applying existing technical solutions: encryption, authentication mechanisms, and cryptographic access control during data collection and transmission, and encryption, message digests, and hashing to assure the integrity of data during storage and processing. In addition, further work is required to maintain the security and confidentiality of data by introducing advanced encryption-based techniques. All of these security and privacy challenges must be addressed so that the proposed child drowning prevention system becomes a promising way to increase swimming pool safety.

We define the total reaction time as the time that elapses between an observation (image) and the delivery of an alert based on that observation (if necessary); it covers the transmission of the image to the edge server, the image processing for activity recognition, and the transmission of the alert ($D = D_{UE} + D_{Uplink} + D_{processing} + D_{Downlink}$). As future work, we would like to run the entire system (processing of the images with the neural network and transmission using 5G) in real time. The expected response time for our child drowning prevention system would be around twenty milliseconds (see Table 1). Neural networks have a very short response time once the weights and the topology have been defined [44]. Further, 5G has been designed to address the requirements of ultra-reliable and low-latency communications (URLLC), which are stringent in terms of latency, reliability, and availability. Some URLLC use cases include wireless control of industrial manufacturing or production processes, remote medical surgery, and transportation safety. Therefore, 5G is the appropriate technology for our use case.

**Author Contributions:** Conceptualization, J.C.C.-P. and M.C.D.; formal analysis, J.C.C.-P. and M.C.D.; investigation, J.C.C.-P.; methodology, J.C.C.-P.; software, J.C.C.-P.; supervision, M.C.D.; validation, J.C.C.-P.; writing—original draft, J.C.C.-P.; writing—review & editing, J.C.C.-P. and M.C.D. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the Agencia Estatal de Investigación of Ministerio de Ciencia e Innovación of Spain under project PID2019-108713RB-C51 MCIN/AEI/10.13039/501100011033.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare that they have no known competing financial interest or personal relationships that could have appeared to influence the work reported in this paper.

#### **References**


## *Article* **Machine Learning Sensors for Diagnosis of COVID-19 Disease Using Routine Blood Values for Internet of Things Application**

**Andrei Velichko <sup>1,</sup>\*, Mehmet Tahir Huyut <sup>2,</sup>\*, Maksim Belyaev <sup>1</sup>, Yuriy Izotov <sup>1</sup> and Dmitry Korzun <sup>3</sup>**


**Abstract:** Healthcare digitalization requires effective applications of human sensors, in which various parameters of the human body are instantly monitored in everyday life by means of the Internet of Things (IoT). In particular, machine learning (ML) sensors for the prompt diagnosis of COVID-19 are an important option for IoT application in healthcare and ambient assisted living (AAL). Determining a COVID-19 infected status with various diagnostic tests and imaging results is costly and time-consuming. This study provides a fast, reliable and cost-effective alternative tool for the diagnosis of COVID-19 based on the routine blood values (RBVs) measured at admission. The dataset of the study consists of a total of 5296 patients with the same number of negative and positive COVID-19 test results and 51 routine blood values. In this study, 13 popular classifier machine learning models and the LogNNet neural network model were examined. The most successful classifier model in terms of time and accuracy in the detection of the disease was the histogram-based gradient boosting (HGB) (accuracy: 100%, time: 6.39 s). The HGB classifier identified the 11 most important features (LDL, cholesterol, HDL-C, MCHC, triglyceride, amylase, UA, LDH, CK-MB, ALP and MCH) to detect the disease with 100% accuracy. In addition, the importance of single, double and triple combinations of these features in the diagnosis of the disease was discussed. We propose to use these 11 features and their binary combinations as important biomarkers for ML sensors in the diagnosis of the disease, supporting edge computing on Arduino and cloud IoT service.

**Keywords:** COVID-19; biochemical and hematological biomarkers; routine blood values; feature selection method; LogNNet neural network; machine learning sensors; Internet of Medical Things; IoT

#### **1. Introduction**

Identified in 2019, COVID-19 is an infectious disease caused by the novel severe acute respiratory syndrome coronavirus (SARS-CoV-2) [1,2]. Since the World Health Organization (WHO) declared the SARS-CoV-2 infection as a pandemic, the epidemic still maintains its severity to this day [3,4]. The early diagnosis of patients is extremely important to manage this unprecedented emergency [5,6]. The preferred gold standard method for detecting SARS-CoV-2 infections is the reverse polymerase chain reaction (PCR) or reverse transcriptase-PCR (RT-PCR) technique [7]. However, the execution of the test is time consuming (not less than 4–5 h under optimum conditions) and many favorable conditions must be met, such as the use of special equipment and reagents, the collection of samples and the necessity of trained personnel [8]. Machine learning (ML) and artificial intelligence (AI) models provide a powerful motivation to uncover insights from patients' data in tragic events such as the COVID-19 pandemic or in situations wherein guidelines have not yet been created [9]. ML and AI methods select the relevant biomarkers, revealing their predictive importance and consistently detecting their interactions with

**Citation:** Velichko, A.; Huyut, M.T.; Belyaev, M.; Izotov, Y.; Korzun, D. Machine Learning Sensors for Diagnosis of COVID-19 Disease Using Routine Blood Values for Internet of Things Application. *Sensors* **2022**, *22*, 7886. https://doi.org/10.3390/s22207886

Academic Editor: Joel J. P. C. Rodrigues

Received: 15 September 2022 Accepted: 14 October 2022 Published: 17 October 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

each other. Moreover, the diagnostic performance of these methods can be improved further [9–11]. AI studies for the early detection, diagnosis and prognosis of COVID-19 have relied on computed tomography (CT) and RBVs. However, imaging-based solutions are costly and require specialized equipment. Machine learning (ML) and AI studies based on RBVs features are a more economical and rapid alternative method for the early detection, diagnosis and prognosis of COVID-19 [7,11,12]. Previous studies have indicated that this disease can accompany multi-organ dysfunction and cause a variety of symptoms [3,13–15]. COVID-19 can cause severe pneumonia and severe acute respiratory distress syndrome (ARDS) due to inflammatory cytokine storms [5,14]. The excessive and uncontrolled release of proinflammatory cytokines was considered the most important primary cause of death, similar to other infections caused by pathogenic coronaviruses [16].

The pathogen may require special attention in intensive care units (ICUs) and cause a serious respiratory disorder, in some cases leading to death [14,16]. Moreover, it is difficult to distinguish symptoms of COVID-19 from known infections in the majority of patients [14,17,18]. This predictive analytics is especially required in medical information systems (MISs) to support clinical or managerial decisions.

COVID-19 may be part of a broader spectrum of hyperinflammatory syndromes characterized by the cytokine release syndrome (CRS), such as secondary hemophagocytic lymphohistiocytosis (sHLH) [19–21]. The activation of the monocyte–macrophage system just before the disease leads to pneumonia [22,23]. During this period, changes in many routine laboratory parameters such as D-dimer and fibrinogen have been reported in COVID-19 patients [1,2,4,5,14,22,24]. High ferritin, D-dimer, lactate dehydrogenase and IL-6 levels are indicators of poor prognosis and risk of death in patients [25–27]. In addition, Winata and Kurniawan [28] reported increased D-dimer and fibrinogen degradation product (FDP) in all patients in the late stage of COVID-19. This indicates that D-Dimer and FDP levels are elevated due to increased hypoxia in severe COVID-19 conditions and are significantly associated with coagulation. Kurniawan et al. [29] reported that hyperinflammation, coagulation cascade, multi-organ failure, which play a role in the etiopathogenesis of COVID-19, and biomarkers associated with these conditions, such as CRP, D-Dimer, LDH and albumin, may be useful in predicting the outcome of COVID-19.

The previous studies detected the clinical significance of changes in the routine blood values (RBVs) in the diagnosis and prognosis of infectious diseases [1,2,4,5,30,31]. However, Jiang et al. [32], Zheng et al. [33] and Huyut [11] noted that information on early predictive RBVs should be supplemented with large samples, especially for severe and fatal cases of COVID-19.

The uncontrolled spread of the disease in pandemics distresses health systems. The early detection of patients in pandemics is an important but clinically difficult process in terms of morbidity and mortality [14,24]. The diagnosis and prognosis of COVID-19 with the use of advanced devices can provide support in improving patient comfort, health system and tackling economic inadequacies [6,11,12]. In this context, studies are carried out to diagnose and determine the severity of the disease in the early period by using ML and AI-based methods as well as RBVs data [7,11,12]. The basic element in ML approaches is to determine the feature vector with a linear classifier [30]. Since ML algorithms require a sufficiently large number of samples, the problem of dimensionality in these methods is inevitable. To minimize this problem, the dataset should be reduced by finding a less dimensional attribute matrix. The dimensionality problem can be minimized by discarding irrelevant features with the feature selection procedure [30,31].

Feature selection methods can be summarized under three main headings: embedded methods, filters and wrappers (backward elimination, forward selection, recursive feature elimination) [30,31]. Feature selection in embedded methods is part of the training process and, therefore, this method lies between filters and wrappers. In the embedded methods, the determination of the best subset of features is performed during the training of the classifier (for example, when optimizing weights in a neural network). In terms of computational cost, embedded methods are more economical than wrappers [30].

Although we can find many case studies for all three feature selection methods, most feature selection methods are filters [30]. The existence of a large number of available feature selection methods complicates the selection of the best method for a particular problem [31]. The popular feature selection methods include correlation-based feature selection (CFS) [34], consistency-based filtering [35], INTERACT [36], information gain [37], ReliefF [38], recursive feature elimination for support vector machines (SVM-RFE) [39], Lasso editing [40] and minimum redundancy maximum relevance (mRMR) algorithm (developed specifically for dealing with microarray data) [30].

We examined the SARS-CoV-2-RBV1 database using the LogNNet neural network [12]. LogNNet can be defined as a feed forward network that increases the classification accuracy by chaotic mapping that fills a reservoir matrix. It is important to optimize the chaotic map parameters in data analysis by applying the LogNNet neural network. In addition, by taking advantage of chaotic mapping, it is possible to significantly reduce the RAM usage by a neural network. These results show that LogNNet can be used effectively in Internet of Things (IoT) mobile devices.

The main point for many digital health solutions during the pandemic is the production of effective, fast and inexpensive alternative methods for the early diagnosis and treatment of COVID-19 patients. However, even the most knowledgeable and experienced physicians can interpret little of the information contained in routine blood laboratory results, and it is extremely difficult to determine the severity of COVID-19 patients based on RBVs findings alone [41]. In this context, ML classification models run with RBV-based data to determine the preliminary diagnosis of COVID-19 can be an effective tool in clinical decision support systems with an accuracy of over 95%. In this study, 13 ML models and the LogNNet neural network were applied to the diagnosis of suspected cases using RBVs only, with an alternative device based on LogNNet and Arduino solutions, and the most important features were determined. We made a clinical interpretation of the relationship between these features and their various combinations with the disease. We evaluated the performance of all models in diagnosing the disease and reached up to 99.8% accuracy. ML sensors (Sensors 1.0 type) for the diagnosis of the COVID-19 disease have been successfully tested in the IoT environment, and the diagnosis of the disease has been implemented in offline and online modes. In offline mode, ML sensors were run on an Arduino board with a LogNNet neural network with a total RAM consumption of ~4 kB. Obtaining the findings in this study over a large sample is an important advantage in terms of the validity of the study. We believe that this study will help to identify suspected patients with a high probability of being infected with COVID-19 at the time of admission to the hospital with a fast and economical method, which will make important contributions to the detection of the disease before it progresses.

The paper has the following structure. Section 2 presents the related studies. Section 3 describes the data collection procedure, the correlation analysis of features, the machine learning methods and the implementation of LogNNet on an Arduino board. Section 4 presents the results of the correlation analysis of the dataset, the classification results, the use of one-, two-, three- and 11-feature combinations in the detection of sick and healthy individuals, and the ML sensor concept for IoT. Section 5 discusses the results and compares them with known developments. Section 6 presents the limitations of the study. In conclusion, a general description of the study and its scientific significance are given.

#### **2. Related Studies**

The prompt diagnosis of COVID-19 seems to be a promising advancement for applying at-home health care and AAL [42]. The digitalization of healthcare calls for effective applications of human body sensors [43] and human sensing [44], including ML sensors, to continuously monitor various parameters of the human body in everyday life with the help of the IoT [45]. Everyday human body sensors testify to the growing number of applications in IoT-enabled ambient intelligence (AmI) systems [46]. The paradigms of ML sensors [47] and artificial intelligence (AI) sensors [48,49] are similar in meaning. The ML sensor paradigm was further developed by Warden et al. [47] and Matthew Stewart [50], wherein the authors introduced the terms Sensors 1.0 and Sensors 2.0 devices. Sensors 2.0 devices are a combination of both a physical sensor and a machine learning module in one package. Sensors 2.0 devices process data internally, ensuring data security, while in Sensors 1.0 devices, these modules are physically separated. In addition, the authors proposed the concept of creating a datasheet of ML sensors. Therefore, the development of technology for creating ML sensors for the diagnosis of the COVID-19 disease is an urgent problem.

In previous studies, the RBVs of people who survived and died from COVID-19, or of patients with COVID-19 and healthy individuals, were statistically compared [1,3,14,22–24,26,51]. In addition, differences in many RBVs characteristics are known between mild and severe COVID-19 patient groups according to statistical methods. However, this study demonstrates that ML models using only one or two features can detect COVID-19 patients from a large group of patients with high accuracy. Therefore, this study will be an alternative approach with extremely high sensitivity for the diagnosis of COVID-19. ML algorithms allow for an easy interpretation of complex association structures in data by simultaneously evaluating the cumulative effects of numerous biomarkers to discover higher-order interactions [4,6,9]. With this benefit, the strengths of using ML in clinical medicine are considered as an opportunity. Although various clinical studies [7,52,53] have highlighted how blood test-based diagnosis can provide an effective and low-cost alternative for the early detection of COVID-19 cases, relatively few ML models have been applied to blood parameters [7,54].

An evaluation of lung CT images to predict lung cancer using deep learning with an improved abundant clustering technique and instant trained neural networks approach was performed in [55]. The authors achieved an accuracy of up to 98.42% in cancer diagnosis with a minimum classification error of 0.038. Cui et al. [56] examined the distribution of pixels in the images with the fuzzy Markov random field segmentation approach using positron emission tomography (PET) and computed tomography (CT) images of the affected area associated with lung tumor. The developed method provided an accuracy of 0.85 in recognizing the lung tumor region. Tomita et al. [57] ran a logistic regression (LR) analysis, support vector machine (SVM) and deep neural network (DNN) models with biochemical findings, lung function tests and bronchial challenge test features to predict the initial diagnosis of adult asthma. In the pre-diagnosis of adult asthma, the DNN model showed 0.98%-ACC, the SVM model 0.82%-ACC and the LR model 0.94%-ACC. Ryu et al. [58] used various ML models and a deep neural network model for the pre-diagnosis of diabetes mellitus using various physical and routine blood values features. The deep neural network has been the most successful model with a value of 0.80- AUC in diabetes mellitus. Kolachalama et al. [59] used a six-convolutional deep learning architecture (CNN) with histological images, biopsy results and some clinical phenotypes to classify kidney disease severity. The CNN model was found to be more successful with AUC values of 0.878, 0.875 and 0.904, respectively, than the pathologist-predicted fibrosis score (0.811, 0.800 and 0.786 AUC, respectively) for assessing 1-, 3- and 5-year renal survival. In a study conducted to identify patients at risk of early diagnosis of fatty liver disease, Wu et al. [60] used an artificial neural network model with three ML models. 
The most successful model in the diagnosis of risky patients was the random forest with 87.48-ACC and 0.92-AUC values. Oguntimilehin et al. [61] used an ML technique on a set of labeled typhoid fever contingent variables for the diagnosis of typhoid fever and to establish explicable rules. The labeled database is divided into five different levels of typhoid fever severity, with classification accuracies on both the training set and the test set of 95% and 96%, respectively. The application was implemented using Visual Basic as the front-end and MySQL as the back-end. Kouchaki et al. [62] used various ML methods to predict the resistance to MTB in Mycobacterium tuberculosis (MTB) patients given a specific drug in a timely manner and to identify resistance markers. Compared to the traditional molecular diagnostic test, the AUC values of the best ML classifiers were higher for all drugs. Logistic regression and gradient tree boosting methods performed better than other techniques. Taylor et al. [63] ran six machine learning algorithms with 10 features consisting of patient demographics, RBVs results and drug information for the diagnosis of and treatment decisions for urinary tract infection. The best performing model, XGBoost, diagnosed the presence of a urinary tract infection with a high AUC value (0.826–0.904 confidence interval).

Yang et al. [64] ran four ML models on 3356 patients (42% COVID-19 positive) using 27 features covering both blood count and biochemical parameters. A gradient boosted decision tree model was the most successful model in the diagnosis of the disease with a value of 0.85-AUC. Booth et al. [9] operated 26 RBVs data elements with a support vector machine to detect COVID-19 patients at high mortality risk and determined prognostic biomarkers with a value of 0.93-AUC. Huyut [11] classified severely and mildly infected patients from a large population of COVID-19 patients using 12 supervised ML models and 28 routine blood values. The models with the highest AUC for identifying mildly infected patients were found to be local weighted learning at 0.95%, Kstar at 0.91%, Naive bayes at 0.85% and K nearest neighbor at 0.75%. Brinati et al. [65] ran 13 RBVs with various ML methods to detect COVID-19 patients (102 negative and 177 COVID-19 positive). They noted that the models with the highest accuracy in the diagnosis of the disease were random forest (82%) and logistic regression (78%). Huyut and Velichko [12] determined the diagnosis and prognosis of the COVID-19 disease by running the LogNNet neural network model on 51 RBVs features. The model achieved an accuracy of 99% in the diagnosis of the disease and 84% in its prognosis. Zhang et al. [66] used a variety of demographic indicators and 21 RBVs using a Lasso-based neural network model to detect predictors of mortality from COVID-19. The success of the model in determining the clinical status of the patients was 98%-AUC. Alle et al. [67] applied the XGboost and logistic regression model on a dataset of various clinical and laboratory tests to predict COVID-19 mortality and found accuracy rates of 83% and 92%, respectively. Gao et al. 
[68] applied an ensemble model derived from support vector machine (SVM), gradient boosted decision tree (GBDT) and neural network (NN) algorithms using 28 immune/inflammatory features to detect COVID-19. The developed model reached 0.99 AUC in detecting infected patients. Vaishnav et al. [69] used various machine learning models to predict mortality from COVID-19, and the decision tree regression model produced a 70% accuracy and the random forest regression model a 76% accuracy. Huyut and İlkbahar [5] used various biomarkers with the CHAID decision tree to detect the diagnosis and prognosis of COVID-19. The model produced an 81.6% accuracy in recognizing the disease and a 93.5% accuracy in determining the prognosis of the disease. Huyut and Üstündağ [6] used 23 blood gas parameters with the CHAID decision tree to predict the diagnosis and prognosis of COVID-19. The model produced a 68.2% accuracy in recognizing the disease and a 65.0% accuracy in determining the prognosis of the disease. Kukar et al. [70] constructed a machine learning model based on 35 RBVs to diagnose 5333 negative and 160 positive COVID-19 patients with various bacterial and viral infections. The model showed an 81.9% sensitivity and a 97.9% specificity in detecting patients. Mei et al. [71] developed a model combining a CNN and a multilayer perceptron to detect COVID-19 using computed tomography (CT), various clinical information elements and some RBVs data. The model reached an 84% sensitivity and an 83% specificity in recognizing the disease.

AI studies on the risk of poor outcome for COVID-19 patients need further validation with larger samples [11,25,72]. Furthermore, previous AI studies using RBVs for COVID-19 diagnosis and prognosis, which covered the early stages of the outbreak, included fewer blood values and reported poorer performance. Therefore, to detect the disease in the later stages of the epidemic, it is necessary to study ML models on a larger sample, which can achieve higher accuracy and use most RBVs.

#### **3. Data and Methods**

The data used in this study were collected retrospectively from the information system of Erzincan Binali Yıldırım University Mengücek Gazi Training and Research Hospital (EBYU-MG) between April and December 2021. The data used in this study are shared as open access under the name of "SARS-CoV-2-RBV1" in [12].

During the dates covered by this study, a diagnosis of SARS-CoV-2 was made by real-time reverse transcriptase polymerase chain reaction (RT-PCR) on nasopharyngeal or oropharyngeal swabs at the EBYU-MG hospital. RBVs results at first admission were recorded to prevent various complications.

#### *3.1. Characteristic of Participants, Workflow and Datasets*

Between the specified dates, the digital system of our hospital was scanned and patients diagnosed with COVID-19 (n = 2648) were selected from a large patient population (a dataset of approximately 80 thousand patients was scanned). The routine laboratory information of these patients was examined. The parameters (features) that were measured from at least 80% of the patients were used. Missing data were completed with the mean of the distribution and normalized. A total of 51 routine blood values calibrated from approximately 70 parameters were recorded. In addition, a group (control group) with the same number of negative COVID-19 tests (n = 2648) was identified and 51 characteristics of these individuals were recorded. Our control group arrived at the hospital only with the suspicion of COVID-19. Chronic disease information of the patients could not be reached. Only data of individuals over the age of 18 were recorded.

These two datasets were combined and named "SARS-CoV-2-RBV1" dataset. The SARS-CoV-2-RBV1 dataset includes immunological, hematological and biochemical RBVs parameters and consists of 51 features (Table 1). In the SARS-CoV-2-RBV1 dataset, positive COVID-19 test results were coded as 1 and negative test results as 0 (COVID-19 = 1, non-COVID-19 = 0).


**Table 1.** Feature numbering for SARS-CoV-2-RBV1 dataset [12].

CRP: C-reactive protein; INR: international normalized ratio; PT: prothrombin time; PCT: procalcitonin; ESR: erythrocyte sedimentation rate; aPTT: activated partial thromboplastin time; LYM: lymphocyte count; NEU: neutrophil count; PLT: platelet count; WBC: white blood cell count; BASO: basophil count; EOS: eosinophil count; HCT: hematocrit; HGB: hemoglobin; MCH: mean corpuscular hemoglobin; MCHC: mean corpuscular hemoglobin concentration; MCV: mean corpuscular volume; MONO: monocyte count; MPV: mean platelet volume; PDW: platelet distribution width; RBC: red blood cells; RDW: red cell distribution width; ALT: alanine aminotransferase; AST: aspartate aminotransferase; ALP: alkaline phosphatase; CK-MB: creatine kinase myocardial band; D-Bil: direct bilirubin; GGT: gamma-glutamyl transferase; HDL-C: high-density lipoprotein cholesterol; CK: creatine kinase; LDH: lactate dehydrogenase; LDL: low-density lipoprotein; T-Bil: total bilirubin; TP: total protein; eGFR: estimated glomerular filtration rate; UA: uric acid.

The features in this dataset are calibrated and contain almost all of the routine blood values that are the subject of studies on COVID-19 in the literature. Therefore, we believe that the bias of our study using this dataset was minimized in comparison to the literature. In addition, the use of our dataset, which we share as open access, is important in terms of showing the reproducibility and auditability of the results.

#### *3.2. Correlation Analysis of Features*

To determine the level of correlation between diagnosis and biochemical blood parameters, the original dataset was analyzed using the point-biserial correlation test [73]. Pearson correlation coefficient was calculated for each feature–feature pair, and a correlation matrix was compiled. The correlation matrix makes it possible to judge the strength and structure (positive or negative) of the linear relationship between diagnosis–feature and feature–feature pairs. The correlation matrix was created using the pandas software package [74].
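The analysis described above can be sketched in Python roughly as follows. This is a minimal illustration, not the authors' code: the synthetic data, the column names `diagnosis`, `feat_a` and `feat_b`, and the use of `scipy.stats.pointbiserialr` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from scipy.stats import pointbiserialr

# Synthetic stand-in for the dataset (hypothetical values and column names).
rng = np.random.default_rng(0)
n = 200
diagnosis = rng.integers(0, 2, size=n)  # 1 = COVID-19, 0 = non-COVID-19
df = pd.DataFrame({
    "diagnosis": diagnosis,
    "feat_a": rng.normal(size=n) + 1.5 * diagnosis,  # shifted upward in the positive class
    "feat_b": rng.normal(size=n),                    # unrelated to the diagnosis
})
features = ["feat_a", "feat_b"]

# Point-biserial correlation between the binary diagnosis and each feature.
r_pb = {f: pointbiserialr(df["diagnosis"], df[f])[0] for f in features}

# Pearson correlation matrix for the feature-feature pairs.
corr_matrix = df[features].corr(method="pearson")
```

The point-biserial coefficient is numerically identical to the Pearson coefficient computed between a 0/1 variable and a continuous one, so both parts of the heatmap can be produced with the same tooling.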

#### *3.3. Machine Learning Methods, Hyperparameters, Accuracy Estimation*

Machine learning algorithms can be applied to a wide range of problems such as classification, clustering, regression analysis, time series forecasting, etc. [75]. The SARS-CoV-2-RBV1 dataset under study has an output parameter divided into two classes (positive or negative diagnosis for COVID-19), so the task of the machine learning algorithm is reduced to binary classification based on 51 features. This study compared the accuracy of the most popular binary classification algorithms: multinomial naive Bayes (MNB), Gaussian naive Bayes (GNB), Bernoulli naive Bayes (BNB), linear discriminant analysis (LDA), K-nearest neighbors (KNN), support vector machine classifier with linear kernel (LSVM), support vector machine classifier with non-linear kernel (NLSVM), passive-aggressive (PA), multilayer perceptron (MLP), decision tree (DT), extra trees (ETs) classifier, random forest (RF), histogram-based gradient boosting (HGB).

Each classifier model has hyperparameters, for which optimization is necessary to obtain the most accurate models. For optimization, the software package "auto-sklearn" [76] was used.

Before training the models, the initial data were subjected to preprocessing, which makes it possible to speed up the training of the models and improve the accuracy of the classification. Preprocessing includes two stages: (1) normalization of numerical values of the input data, (2) generation of additional features. Normalization is a procedure consisting of bringing numerical data to a single format, which has the following options: quantile transformer (QT)—transforms feature values so that they correspond to a uniform or normal distribution; robust scaler (RS)—subtracts the median values for each feature and scales according to the interquartile range; MinMax (MM)—scales feature values so that they are all in the range from the minimum to the maximum value. The procedure for generating additional features transforms the original set of features into a set of features with a different dimension. This helps to select the most important features, compose additional features from them or present the input data in a special format for the ML algorithm. The following methods for generating additional features were used: polynomial (PN) creates features that are polynomial combinations of the original features; random trees embedding (RTE)—creates a multidimensional sparse feature representation, in which the data in each new feature are represented by binary values; extra trees preprocessor (ETP)—selects a part of the most important features that are evaluated using the extra trees algorithm; linear SVM preprocessor (LSVMP)—selects some of the most important features that are evaluated using the support vector machine algorithm; independent component analysis (ICA)—selects a set of statistically independent features from the entire original set; Nystroem sampler (NS)—transforms a set of initial features using a low-rank matrix approximation by the Nystroem method.
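As a rough illustration of the normalization options and of polynomial feature generation listed above, the sketch below uses the corresponding scikit-learn transformers on synthetic data. The data, the parameter values (`n_quantiles=50`, `degree=2`) and the chaining of MinMax scaling with polynomial features are assumptions for the example; the combinations actually selected by the optimizer are not reproduced here.

```python
import numpy as np
from sklearn.preprocessing import (
    MinMaxScaler,
    PolynomialFeatures,
    QuantileTransformer,
    RobustScaler,
)

rng = np.random.default_rng(0)
X = rng.lognormal(size=(100, 3))  # synthetic, skewed stand-in for blood values

# QT: map each feature onto a uniform distribution on [0, 1].
X_qt = QuantileTransformer(n_quantiles=50, output_distribution="uniform").fit_transform(X)

# RS: subtract the median and divide by the interquartile range.
X_rs = RobustScaler().fit_transform(X)

# MM: rescale each feature linearly so that its minimum is 0 and its maximum is 1.
X_mm = MinMaxScaler().fit_transform(X)

# PN: add squares and pairwise products of the (scaled) features.
X_pn = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X_mm)
```

With three input features and degree 2, the polynomial step yields nine columns (three original, three squares, three pairwise products), which shows how quickly the feature dimension grows.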

The accuracy of the models, *A*<sub>NF</sub>, was assessed by the K-fold cross-validation method (*K* = 5) encapsulated in the software packages, where *A*<sub>NF</sub> denotes the classification accuracy when using *NF* features. The K-fold cross-validation method splits the original dataset into *K* parts and trains the model *K* times: each time, one of the *K* parts is used as a test sample and the remaining parts as a training sample. The classification accuracies obtained on the test samples are then averaged. The dataset is divided into parts using stratification. Such an approach makes it possible to estimate the accuracy of the models reliably.

In this study, we used a less common ML algorithm based on the LogNNet neural network. The LogNNet 51:50:20:2 configuration was used, and a detailed configuration description is given in [75]. The LogNNet architecture is IoT-oriented and can run on devices with low computing resources (Section 3.4).

Each algorithm was given the same amount of time (1 h) to optimize the hyperparameters. A computer with an AMD Ryzen 9 3950X processor and 64 GB DDR-4 RAM was used to train the models.

#### *3.4. Implementing LogNNet on an Arduino Board*

The Arduino Nano 33 IoT board was chosen as a prototype IoT edge device with limited computing resources. It is based on a 32-bit Microchip ATSAMD21G18 microcontroller with an ARM Cortex-M0+ computing core, a clock frequency of 48 MHz, 256 KB of flash memory and 32 KB of RAM. The neural network LogNNet 51:50:20:2 from [12] was programmed on the Arduino board and tested. Arduino Nano 33 IoT test circuit, LogNNet architecture and board are presented in Figure 1.

**Figure 1.** Arduino Nano 33 IoT test circuit and LogNNet architecture.

3.4.1. LogNNet Program for Arduino Board

LogNNet transforms the input feature vector *d* into a normalized vector *Y*, which is multiplied with the reservoir matrix *W* filled with a chaotic mapping. We used the mapping congruent generator (1) with the parameters indicated in Table 2 and the data [12]. Then, the transformed vector passes the output classifier (two-layer feed-forward neural network with two hidden layers).



Let us denote the matrix of weight coefficients between the layers *S*<sub>h</sub> and *S*<sub>h2</sub> as *W*<sub>1</sub>, and the matrix between the layers *S*<sub>h2</sub> and *S*<sub>out</sub> as *W*<sub>2</sub>. At the output, there are two neurons for two classes (COVID-19 and non-COVID-19). Matrices of weight coefficients and values of normalization coefficients were calculated on a computer with high performance and saved in a separate library file. In addition, the library file (supplementary materials) contains the values of *K*, *D*, *L* and *C* required for calculating the *W* matrix, as well as data on the configuration of the LogNNet 51:50:20:2 neural network.

When LogNNet is running, the elements of the matrix *W* (2550 values) are calculated sequentially using the congruent generator (1) each time a feature vector is input. With this approach, the matrix *W* is never stored in the RAM of the controller, which saves memory but slows down the calculations of the neural network.
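The memory-saving idea can be illustrated with the Python sketch below, which regenerates the reservoir weights element by element while computing the matrix-vector product. Everything specific in it is an assumption: the recurrence `x = (D - K * x) % L` started from `C` is a generic congruential form chosen for the example, the parameter values are hypothetical, the weights are mapped to [0, 1), and a 50 × 51 shape is assumed only so that the count matches the 2550 values mentioned above. The actual generator (1) and its parameters should be taken from [12].

```python
def reservoir_stream(K, D, L, C, count):
    """Yield `count` pseudo-random weights in [0, 1) from a congruential recurrence."""
    x = C
    for _ in range(count):
        yield x / L
        x = (D - K * x) % L  # assumed recurrence; Python's % keeps x in [0, L)


def reservoir_product(y, rows, K, D, L, C):
    """Compute W @ y without storing W: weights are regenerated one at a time."""
    cols = len(y)
    weights = reservoir_stream(K, D, L, C, rows * cols)
    out = []
    for _ in range(rows):
        acc = 0.0
        for j in range(cols):
            acc += next(weights) * y[j]
        out.append(acc)
    return out


# Hypothetical parameters and a toy input vector (50 x 51 = 2550 weights).
K, D, L, C = 93, 68, 9276, 73
y = [1.0] * 51
s_h = reservoir_product(y, rows=50, K=K, D=D, L=L, C=C)
```

Only the running generator state and one accumulator are held in memory at any time, at the cost of recomputing all 2550 weights for every input vector.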

The Arduino IDE development environment was used to implement the algorithm. The library file with the matrices *W*<sub>1</sub> and *W*<sub>2</sub> and the other coefficients necessary for the operation of the neural network is loaded at the beginning of the program. The complete code of the program is presented in Appendix A, Algorithm A1. The algorithm is divided into functions and procedures:


The scaling factor "scale\_factor = 1000" makes it possible to convert data from a floating-point type to an integer (and vice versa) by multiplying (dividing) by the factor and rounding. In the Arduino, a float variable takes 4 bytes of RAM, and an integer variable takes 2 bytes of RAM. Therefore, storing the matrices *W*<sub>1</sub>, *W*<sub>2</sub> and other data in integer format is more efficient, and during library initialization, the data take half as much RAM.
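The conversion can be sketched as follows. The helper names `to_fixed` and `to_float` and the sample weights are hypothetical; only the factor of 1000 comes from the text.

```python
SCALE_FACTOR = 1000  # matches "scale_factor = 1000" in the text


def to_fixed(value, scale=SCALE_FACTOR):
    """Convert a float to its scaled integer representation (multiply, then round)."""
    return int(round(value * scale))


def to_float(stored, scale=SCALE_FACTOR):
    """Recover the approximate float from the stored integer (divide by the factor)."""
    return stored / scale


weights = [0.12345, -1.5, 0.0004, 2.71828]    # hypothetical weight values
stored = [to_fixed(w) for w in weights]       # [123, -1500, 0, 2718]
restored = [to_float(s) for s in stored]
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The rounding error is at most half of one step, that is 0.0005 for a factor of 1000. If the stored values are 2-byte signed integers, the representable range after scaling is roughly ±32.767, so larger coefficients would need a smaller factor or a wider type.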

#### 3.4.2. Test Scheme

Neural network testing consists of serially sending SARS-CoV-2-RBV1 data to the Arduino board and counting the correct network responses. The data are generated on an external computer (Figure 1). For sending data, a protocol was implemented that separates the elements of the feature vector *Y* using the symbol "T" to prevent adjacent values from running together. At the end of the vector *Y*, the special characters "FN" are placed, indicating the end of the data transfer. On the Arduino side, a protocol is implemented that recognizes the input data. In the "void loop" block, a loop checks the availability of data in the serial port buffer using the Serial.available() function. This function returns the number of bytes available for reading, which becomes non-zero as soon as the Arduino receives data.
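The framing scheme can be mimicked in Python as shown below. This is a sketch under assumptions: elements are taken to be integers (for example, values already multiplied by a scaling factor), "T" is placed between elements, and the names `encode_vector` and `FrameParser` are invented for the example; the actual implementation is the one given in Appendix A.

```python
SEPARATOR = "T"    # placed between elements of the feature vector
TERMINATOR = "FN"  # marks the end of one vector


def encode_vector(values):
    """Frame a feature vector: elements joined by 'T', closed by 'FN'."""
    return SEPARATOR.join(str(v) for v in values) + TERMINATOR


class FrameParser:
    """Accumulate incoming serial chunks and return the complete vectors found so far."""

    def __init__(self):
        self.buffer = ""

    def feed(self, chunk):
        self.buffer += chunk
        frames = []
        # A vector is complete only once its terminator has arrived.
        while TERMINATOR in self.buffer:
            frame, self.buffer = self.buffer.split(TERMINATOR, 1)
            frames.append([int(tok) for tok in frame.split(SEPARATOR) if tok])
        return frames


message = encode_vector([120, -45, 987])  # "120T-45T987FN"
parser = FrameParser()
first = parser.feed(message[:6])   # incomplete frame, nothing returned yet
second = parser.feed(message[6:])  # terminator received, vector is parsed
```

Buffering until the terminator arrives mirrors what the receiver has to do on the board, since a serial read may deliver a vector in several pieces.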

#### **4. Results**

#### *4.1. Correlation Analysis of Dataset SARS-CoV-2-RBV1*

Figure 2 presents the results of the correlation analysis of the diagnosis–feature and feature–feature pairs in the form of a heatmap over the entire SARS-CoV-2-RBV1 database. The first column of Table 3 shows the features most strongly associated with the COVID-19 diagnosis, with a point-biserial correlation coefficient (*r*<sub>pb</sub>) exceeding 0.5. Here, the sign of the point-biserial correlation coefficient indicates the direction of the relationship between the diagnosis of the disease and the quantitative characteristics. As seen in the first column of Table 3, the features most associated with disease diagnosis are MCHC, HDL-C, cholesterol and LDH. The second column of Table 3 shows the accuracy of the cut-off values calculated by the threshold approach [12] for each feature in classifying COVID-19 patients. The features presented in the second column are the predictors that classify patients with the highest accuracy. For comparison, the results of the threshold classification *A*<sub>th</sub> from [12] are presented, and features with *A*<sub>th</sub> ≥ 70% are shown. The threshold classification method and the point-biserial correlation method give overlapping sets of features, but the threshold classification provides more diagnosis-related features. The point-biserial correlation coefficient measures the linear association between a feature and the diagnosis, which takes only two values (1 and 0). The threshold method, in contrast, considers all the data of the relevant feature when separating the two classes, and it has high sensitivity.

**Table 3.** Features most strongly correlated with the diagnosis according to the point-biserial correlation coefficient and the threshold correlation method.


**Figure 2.** Correlation of the SARS-CoV-2-RBV1 dataset for diagnosis–feature (point-biserial correlation) and feature–feature pairs (Pearson coefficient).

An analysis of the correlation of features among themselves (Figure 2) reveals several features that are linearly dependent on each other. The most strongly correlated pairs, with a Pearson coefficient exceeding 0.6 in absolute value, are shown in Table 4. The same table presents Pearson's coefficients separately for the COVID-19-positive and COVID-19-negative participants. Full heatmaps by class (COVID-19, non-COVID-19) are shown in Figures 3 and 4.


**Table 4.** The features most strongly correlated with each other by the Pearson coefficient for the entire database and separately for classes (positive or negative COVID-19 status).

Three main types of pair correlations can be distinguished. The High–High type comprises pairs of features whose correlation is high regardless of the presence or absence of COVID-19. The High–Low type comprises pairs of features that are highly correlated only in sick patients. The Low–High type comprises pairs of features that are highly correlated only in healthy patients. In general, the features are more correlated in patients with COVID-19 (Figure 3). The pair correlations are reviewed from a medical point of view in the Discussion section.
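The three pair types can be expressed as a small labelling rule. The sketch below is illustrative only: the function name `pair_type` is hypothetical, and reusing the 0.6 cut-off (which the text applies to the whole database) for the per-class coefficients is an assumption.

```python
def pair_type(r_covid, r_non_covid, threshold=0.6):
    """Label a feature pair by its per-class Pearson coefficients.

    The first word refers to COVID-19 patients, the second to
    non-COVID-19 patients. The 0.6 threshold is an assumed value.
    """
    first = "High" if abs(r_covid) > threshold else "Low"
    second = "High" if abs(r_non_covid) > threshold else "Low"
    return f"{first}-{second}"
```

For example, `pair_type(0.8, 0.2)` labels a pair that is strongly correlated only in sick patients as `"High-Low"`.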

**Figure 3.** Pearson correlation analysis results for positive diagnoses for COVID-19 from the SARS-CoV-2-RBV1 dataset.

**Figure 4.** Pearson correlation analysis results for negative diagnoses for COVID-19 from the SARS-CoV-2-RBV1 dataset.

#### *4.2. Classification Results for Dataset SARS-CoV-2-RBV1*

Table 5 presents the results of the machine learning algorithms optimized to obtain the maximum classification values using 51 features. The results are sorted in descending order of algorithm efficiency. For each algorithm, the average training and inference time of the model and methods for preprocessing of the input data are given.


**Table 5.** Results of assessing the classification accuracy of machine learning models for the SARS-CoV-2-RBV1 dataset.

The accuracy of the algorithms *A*<sup>51</sup> ranged from 98.56% to 100%, indicating that all models were good at identifying the association of features with the diagnosis of COVID-19. The most efficient model is based on the histogram-based gradient boosting classifier with a 100% accuracy.

Figure 5 presents the learning curves for the HGB model using all the features from the dataset. The red curve (training accuracy) shows the training ability of the model, and the green curve (cross-validation accuracy) shows the generalization ability of the model depending on the number of training examples. Each point on the graph was obtained using five different splits into a test (20%) and training (80%) subsets.

**Figure 5.** Learning curves for the histogram-based gradient boosting classifier model using 51 features from the SARS-CoV-2-RBV1 dataset.

The red curve represents the accuracy of the model on the training samples. The model has sufficient complexity to recognize all training samples with a 100% accuracy. The green curve represents the accuracy of the model on the test subset, the samples of which were not involved in model training. With an increase in training samples, the cross-validation accuracy of the model grows. The curves converge with each other and completely coincide when the number of training samples is more than 2500. The dots on the graph represent the average accuracy using five different splits, and the shaded areas represent the standard deviation.
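One point of such a learning curve can be sketched as follows. This is a schematic outline rather than the authors' pipeline: `fit_and_score` is a hypothetical callable that trains a model on the given training indices and returns its accuracy on the test indices, and the use of the population standard deviation for the shaded band is an assumption.

```python
import random
import statistics


def split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into (train, test)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]


def learning_curve_point(fit_and_score, n, n_train, n_splits=5):
    """Mean and spread of the accuracy over several 80/20 splits.

    Only the first n_train training indices of each split are used,
    which moves the point along the horizontal axis of the curve.
    """
    scores = []
    for seed in range(n_splits):
        train, test = split_indices(n, seed=seed)
        scores.append(fit_and_score(train[:n_train], test))
    return statistics.mean(scores), statistics.pstdev(scores)
```

Calling `learning_curve_point` for increasing values of `n_train` yields the sequence of dots and the corresponding band widths.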

Unlike other models, HGB does not require data preprocessing. The training time of the HGB model is about 6 s, which makes it possible to effectively use it to enumerate input features when searching for optimal combinations. The LogNNet model was used to implement the classification on the Arduino board because its algorithm has a compact form suitable for IoT devices.

The HGB model was used to study the most significant combinations of one, two and three features.
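A search over feature combinations of this kind can be outlined with the Python standard library. The sketch assumes an exhaustive enumeration, which the text suggests but does not state explicitly; `score` is a hypothetical callable that would train and evaluate a classifier on the given feature subset.

```python
from itertools import combinations
from math import comb

N_FEATURES = 51  # number of routine blood values in the dataset

# Number of candidate models if every subset of size k is tried.
CANDIDATES = {k: comb(N_FEATURES, k) for k in (1, 2, 3)}


def rank_combinations(feature_ids, k, score):
    """Score every k-feature subset and sort by descending score."""
    results = [(score(c), c) for c in combinations(feature_ids, k)]
    return sorted(results, reverse=True)
```

With 51 features there are 51 single features, 1275 pairs and 20,825 triples, so the cost of one training run largely determines whether such a search is practical.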

#### 4.2.1. Investigation of the Effectiveness of the HGB Model Operating on One Feature

Table 6 presents the classification results for the SARS-CoV-2-RBV1 dataset obtained with the HGB model using a single input feature. The features are sorted in descending order of *A*<sup>1</sup> classification accuracy. The six most effective features are LDL (№ 43), cholesterol (№ 39), HDL-C (№ 36), MCHC (№ 20), triglyceride (№ 48) and amylase (№ 31). The same features dominate the correlation between the feature and the diagnosis in Table 3.

**Table 6.** Classification efficiency of the SARS-CoV-2-RBV1 dataset using a single feature for the Histogram-based Gradient Boosting classifier.


#### 4.2.2. Investigation of the Effectiveness of the HGB Model Operating on Two Features

Table 7 presents the classification result of the SARS-CoV-2-RBV1 dataset for the HGB model using two input features. The pairs of features are sorted in descending order of classification accuracy *A*2.


**Table 7.** Classification efficiency of SARS-CoV-2-RBV1 dataset using 2 features for the Histogram-based Gradient Boosting classifier.

The resulting pairs contain the most effective features: LDL (№ 43), cholesterol (№ 39), HDL-C (№ 36), MCHC (№ 20), triglyceride (№ 48) and amylase (№ 31), which have the best *A*<sup>1</sup> score (Table 7). The best result (*A*<sup>2</sup> = 99.81) was obtained for the MCHC–MCH feature pair. At the same time, the pair contains the MCH (№ 19) feature with low efficiency (*A*<sup>1</sup> = 52.13) and Pearson correlation ~0.041. Such a combination of features with high and low correlation is observed very often, and this combination results in a high classification efficiency. Among the features from Table 7, the following have a low linear correlation with the diagnosis: MCH (0.041), UA (0.066), amylase (0.03) and LDH (0.071). Pearson's coefficient from the distribution in Figure 3 is indicated in brackets.

There are pairs consisting entirely of effective features, for example, LDL–MCHC (№ 43-№ 20), HDL-C–MCHC (№ 36-№ 20), etc. Figure 6 shows the relationship between the feature pairs for the top 50 results. The main six features are in the center. Asterisks indicate the features that most often form a pair with the main features: UA (№ 51), LDH (№ 42), CK-MB (№ 32) and ALP (№ 30). The main feature LDL (№ 43) forms the largest number of effective pairs for classification.

**Figure 6.** Pairs of features with high classification efficiency on the SARS-CoV-2-RBV1 dataset for the Histogram-based Gradient Boosting classifier.

To find the reasons for the effectiveness of the pairs of features from Table 7, two-dimensional distributions of the diagnosis (attractors) were constructed for the first six pairs (Figure 7). For the healthy patients (non-COVID-19), there are clear linear and cruciform attractors, while for people diagnosed with COVID-19, these attractors shift and become chaotic. This difference in the shape of the attractors allows classifiers to effectively distinguish between the two classes. The best separation of attractor shapes is observed for the MCHC–MCH pair (Figure 7a), which explains its highest classification ability. For the pairs in Figure 7b–e, shifted cruciform attractors are observed, which also contribute to their effective separation by classifiers. In Figure 7f, the two attractors are blurred, but because they overlap only weakly, the classification efficiency is high.

**Figure 7.** Two-dimensional distributions (attractors) of a COVID-19 and non-COVID-19 diagnosis in the coordinates of feature pairs MCHC–MCH (**a**), LDL–CK-MB (**b**), HDL-C–CK-MB (**c**), Triglyceride– CK-MB (**d**), LDL–Cholesterol (**e**), LDL–MCHC (**f**).

When using two features, the maximum accuracy is *A*<sup>2</sup> = 99.81% and this value is lower than when using the 51 features *A*<sup>51</sup> = 100%. However, feature reduction is important to simplify the classification of patients in practical terms. More accurate models can be obtained using three features.

#### 4.2.3. The Study of the Most Significant Combination of Three Features of the HGB Model

Table 8 presents the classification result of the SARS-CoV-2-RBV1 dataset for the HGB model using three input features.

**Table 8.** Classification efficiency of SARS-CoV-2-RBV1 dataset using 3 features for the Histogram-based Gradient Boosting classifier.


An analysis of Table 8 reveals that no new features have been added in the first twenty most accurate models compared to Table 7. The MCH and MCHC features are found only in pairs. With the addition of the third feature, the maximum classification efficiency increased from *A*<sup>2</sup> = 99.81% to *A*<sup>3</sup> = 99.91%.

#### 4.2.4. The Study of the Most Significant Combination of 11 Features of the HGB Model

Tables 7 and 8 include only 11 features: LDL (№ 43), cholesterol (№ 39), HDL-C (№ 36), MCHC (№ 20), triglyceride (№ 48), amylase (№ 31), UA (№ 51), LDH (№ 42), CK-MB (№ 32), ALP (№ 30) and MCH (№ 19). The classification accuracy of the HGB model using 11 features was *A*<sup>11</sup> = 100%. Therefore, 11 features are sufficient to determine the presence of COVID-19 using machine learning methods based on the histogram-based gradient boosting classifier.

#### *4.3. LogNNet Implementation on Arduino for Edge Computing*

A compact 77-line LogNNet algorithm was created for diagnosing and predicting COVID-19 disease using routine blood values on an Arduino controller.

LogNNet testing on Arduino revealed an accuracy of *A*<sup>51</sup> = 99.7%, which coincides with the accuracy on the model computer program [46]. The classification time for the input vector is about 0.1 s. An estimate of the RAM used by the Arduino controller is shown in Figure 8.

**Figure 8.** Estimation of the RAM used by the Arduino controller when working with the neural network LogNNet 51:50:20:2.

Global variables (arrays *S*h, *S*h2 and variables) occupy 294 bytes of RAM, and incoming data are written to array *Y*, which occupies 208 bytes. The Arduino uses the Serial system library to operate the serial port. It is loaded at initialization in the "void setup" block and takes 310 bytes of RAM. The data stored in the LogNNet.h library are also loaded into RAM during program initialization and take 2526 bytes, of which the matrix *W*1 makes the largest contribution (2142 bytes). For local computations within functions and procedures, at least 1012 bytes must be reserved. The total RAM consumption is 4350 bytes.
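The RAM budget above can be checked with a few lines of Python. The byte counts are taken from the text; the helper names and the SRAM capacities used in the usage note are illustrative assumptions and do not refer to a specific board named in this section.

```python
# Byte counts quoted in the text for LogNNet 51:50:20:2 on the Arduino.
RAM_USAGE = {
    "global variables (Sh, Sh2, scalars)": 294,
    "input array Y": 208,
    "Serial library": 310,
    "LogNNet.h data (includes W1)": 2526,
    "reserve for local computations": 1012,
}
W1_BYTES = 2142  # part of the LogNNet.h data


def total_ram(usage=RAM_USAGE):
    """Total RAM requirement in bytes."""
    return sum(usage.values())


def fits(sram_bytes, usage=RAM_USAGE):
    """True if the budget fits into a controller with the given SRAM."""
    return total_ram(usage) <= sram_bytes
```

Here `total_ram()` reproduces the 4350 bytes stated above, so under these figures a hypothetical controller with 8192 bytes of SRAM would be sufficient, while one with 2048 bytes would not. The matrix *W*1 alone accounts for roughly 85% of the LogNNet.h data.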

#### *4.4. Machine Learning COVID-19 Sensor for IoT*

The LogNNet network can be easily imported to various microcontrollers and used to predict a diagnosis based on blood biochemical parameters. However, our experimental results in Sections 4.2 and 4.3 show that its accuracy (*A*<sup>51</sup> = 99.7%) is lower than that of the best resource-intensive machine learning algorithm (HGB, *A*<sup>51</sup> = 100%). Therefore, we proposed two architectures of the IoT system (Figure 9), which include an IoT device with LogNNet implementation (edge computing) and a cloud service containing a trained HGB model (AI computing). These configurations implement the prognosis of the disease in offline and online modes with ML sensors for diagnosis of the COVID-19 disease (Sensor 1.0 type).

**Figure 9.** Two architectures of the IoT system, which includes an IoT device with LogNNet implementation (edge computing) and a cloud service containing a trained HGB model (AI computing).

In the IoT device, the results of a biochemical blood test are entered manually or transmitted directly from the laboratory equipment. If the cloud service is unavailable or if the blood tests are performed on site using a mobile laboratory in remote areas, the diagnosis is made by the LogNNet network. If the IoT device has access to the network, it sends a network request to the cloud service, wherein the diagnosis is determined using the HGB model. The cloud service sends a response with a diagnosis, which is displayed on the IoT device using an LED indication or on an LCD display.
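The online/offline decision described above amounts to a simple fallback rule, sketched here in Python. Both `cloud_predict` and `edge_predict` are hypothetical callables standing in for the cloud HGB service and the on-device LogNNet network; treating any `OSError` (which includes connection and timeout errors) as "cloud unavailable" is an assumption of this sketch.

```python
def diagnose(features, cloud_predict, edge_predict):
    """Return (diagnosis, source) using the cloud model when reachable.

    cloud_predict: callable that sends features to the cloud service
                   and may raise OSError when the network is down.
    edge_predict:  callable that runs the on-device network.
    """
    try:
        return cloud_predict(features), "cloud"
    except OSError:
        # Offline mode: fall back to the local edge model.
        return edge_predict(features), "edge"
```

Returning the source alongside the diagnosis lets the device indicate, for example on its display, whether the result came from the more accurate cloud model or from the local network.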

#### **5. Discussion**

#### *5.1. Analysis of Results from a Medical Perspective*

COVID-19 has a higher mortality and infectivity than influenza [3,13]. The disease still causes death and continues to spread [1,6,15]. The use of vaccines did not stop the spread of the disease, and important mutations were detected in the structure of the virus during the epidemic [1]. Most of the infected patients had mild symptoms and a good prognosis. However, some patients developed severe symptoms, such as severe pneumonia, acute respiratory distress syndrome (ARDS) and multiple organ dysfunction syndromes (MODSs) [2,5,24]. A need for studies to determine the prognosis and immune conditions of the COVID-19 disease remains [3,74]. Therefore, the early evaluation of patients who need intensive care and have high mortality expectations as well as the effective identification of relevant biomarkers are important to reduce the mortality of the disease [5,6,25].

Various complications may be encountered during the treatment process of COVID-19 and, therefore, the course of the disease should be predicted earlier [64,77]. It is important to diagnose and predict the prognosis of the disease at an early stage so that the first response to severely infected COVID-19 patients can be conducted properly [2,5].

Although many studies on COVID-19 have been published, the relationships between the pathological aspects of the disease and routine blood values have not been fully determined [77]. Previous studies have reported that changes in many RBVs and hematological abnormalities are observed during the course of the disease [14,77].

In this study, according to the *A*th threshold classification result based on [12], the most effective features in the diagnosis of the disease were found to be LDL with 96.47%, HDL-C with 94.73%, cholesterol with 94.47% and MCHC with 94.35% (Table 3). Indeed, in previous studies, large changes in these features were reported in severe and fatal COVID-19 patients, and these features may be important biomarkers for the prognosis of the disease [1,2,5,14].

Considering the linear dependency structures of the features among each other, the most effective combinations of dual features in the diagnosis of the disease were obtained and the Pearson correlation values were calculated (Table 4). The highly positive linear correlation structure of some trait pairs with positive and negative individuals was remarkable. The highest positive and negative linearly correlated trait pairs (High–High) were HCT–HGB, MPV–PDW and HCT–RBC (96%, 93% and 87%, respectively). These features vary greatly in severe COVID-19 patients and may be associated with the prognosis and mortality of the disease [1,2,14]. The high positive association of trait pairs expressed as the High–High type with both positive and negative COVID-19 individuals led us to believe that various comorbidities such as hypertension, obesity and diabetes may exist in our negative COVID-19 population. Considering hospital admissions of negative COVID-19 patients, these trait pairs are highly associated not only with COVID-19 but also with various inflammatory syndromes and infections [1,2,78]. Djakpo et al. [79] stated that the abnormalities of HGB, HCT and RBC or anemia observed in patients with comorbidities are due to the inability of the bone marrow to produce enough RBCs to carry oxygen and lung damage caused by COVID-19 which complicates gas exchange.

Considering the relationship of the patients with these features, the presence of possible comorbid conditions prevents erythrocyte production due to existing inflammation. Since the variation in these trait pairs is hypersensitive to the immune response in individuals, these trait pairs were highly correlated with sick and healthy individuals. The MCH–MCHC trait pair was found to be highly positively correlated, especially with healthy individuals, and this pair may be used as an important marker to distinguish healthy individuals in the diagnosis of the disease. Changes in these characteristics may indicate the suppression of lymphocytic and erythrocyte series or platelet and erythrocyte deformities [1]. In addition, in this study, a highly positive association of the MCH–MCV trait pair with COVID-19 was found. Mertoğlu et al. [2], Huyut et al. [14] and Karakike et al. [21] stated that this was due to the decrease in the size of erythrocytes and anisocytosis in patients. The high positive association of the HGB–RBC trait pair with sick individuals may be related to impaired erythropoiesis in the later stages of the disease. The High–High type feature set provides important clues in the isolation of both sick and healthy individuals.

In Table 4, a high (77%) negative relationship between the fibrinogen–LYM feature pair and COVID-19 patients is seen, and we believe that this level of relationship is due to the fibrinogen feature. Indeed, Winata and Kurniawan [28] noted that the degradation product of fibrinogen (FBU) was increased in all patients in the late stage of COVID-19 and that this was significantly associated with coagulation. In addition, the high correlation of the cholesterol–LDL, cholesterol–HDL-C and chlorine–sodium trait pairs (High–Low type in Table 4) with sick individuals showed that these trait pairs were important markers in identifying sick individuals. Fang et al. [80] and Mertoğlu et al. [1] stated that this feature set may be associated with multi-organ involvement in COVID-19 and the widespread distribution of angiotensin-converting enzyme receptors in the body.

The fact that the Low–High trait pair MCHC–MCV was found to be highly positively correlated with COVID-19 negative individuals in Table 4 suggested the importance of the size of erythrocyte and anisocytosis in healthy individuals [80]. In addition, the functional properties of the ALT–AST, eGFR–Urea and D-Bil–T-Bil pairs were found to be important markers in the isolation of COVID-19 negative individuals. Mertoğlu et al. [1], Huyut et al. [14] and Zhou et al. [27] stated that the decrease in ALT, AST, GGT, total bilirubin and eGFR indicated that the patients had serious damage to organs such as pancreas and kidney. In another study, Bertolini et al. [81] stated that AST, GGT, ALP and bilirubin may be frequently elevated in COVID-19 and that the main underlying causes of this condition may be hyperinflammation and thrombotic microangiopathy. In addition, the high positive correlation of the INR–PT trait pair with negative COVID-19 individuals suggested that it is important to monitor these individuals for the development of disseminated intravascular coagulopathy and acute respiratory distress [80,82].

In this study, 13 popular classifier machine learning models and the LogNNet neural network model were run on 51 routine blood values to detect patients infected with COVID-19. Histogram-based gradient boosting (HGB) was the fastest and most accurate model in determining the diagnosis of the disease (accuracy: 100%, time: 6.39 s).

For the HGB model using a single input feature (*A*1), the most effective features in the diagnosis of the disease were LDL 96.87% (№ 43), cholesterol 95.07% (№ 39), HDL-C 94.99% (№ 36), MCHC 94.35% (№ 20), triglyceride (№ 48) and amylase (№ 31) (Table 6). For the HGB model using two input features (*A*2), the most effective feature pair in the diagnosis of the disease was MCHC–MCH (*A*<sup>2</sup> = 99.81%) (Table 7). The success of MCH as a single input feature in the diagnosis of the disease is low (*A*<sup>1</sup> = 52.13%). Huyut and Velichko [12] found an accuracy rate of 99.1% in the diagnosis of the disease by running the MCHC–MCH features with LogNNet. The HGB model operated with MCHC–MCH was found to be more successful than the LogNNet model in the diagnosis of the disease.

Since low values of MCH and high values of MCHC were associated with COVID-19 [83], it was expected that the use of these two features together in the diagnosis of the disease would produce higher classification success. The most effective dual trait pairs (Table 7) were similar to the most effective single traits (Table 6) for the HGB model for the diagnosis of the disease. This provides important information about the functional properties of the binary trait pairs obtained with the HGB model in the diagnosis of the disease. Six basic features, that is LDL (№ 43), cholesterol (№ 39), HDL-C (№ 36), MCHC (№ 20), triglyceride (№ 48) and amylase (№ 31), among the combinations of binary features used in the diagnosis of the disease and four features, that is UA (№ 51), LDH (№ 42), CK-MB (№ 32) and ALP (№ 30), that most frequently pair with these features are given in Figure 6. The main feature LDL (№ 43) generated the largest number of effective pairs for classification. The effectiveness of these feature pairs (Table 7) in detecting patients is visualized in two-dimensional space (Figure 7). Classification is most clearly visible in the MCHC–MCH pair (Figure 7a), which explains the higher classification ability.

In the two-feature combinations used by HGB in the diagnosis of the disease, the maximum accuracy was *A*<sup>2</sup> = 99.81%, which is slightly lower than that obtained with 51 features (*A*<sup>51</sup> = 100%). However, feature reduction provides more cost-effective and rapid results in interpreting the classification of patients from a practical point of view and in identifying the most effective features. The highest classification success obtained for the HGB model using three-feature combinations was *A*<sup>3</sup> = 99.91% (Table 8).

Analysis of Table 8 showed that no new features were added to the top twenty models with the highest accuracy compared to Table 7. The binary combinations in Table 7 were sufficient for the diagnosis of the disease. In addition, the co-existence of MCH and MCHC features in all combinations reveals hidden association structures between these features and contains important clues in the diagnosis of the disease.

In this study, the most important 11 biomarkers were found with the HGB model used to determine the diagnosis of the disease, and with these features, all patients and healthy individuals were correctly identified (*A*<sup>11</sup> = 100%). In addition, the importance of various combinations of these features in the diagnosis of the disease was recognized. The performance of the combinations of these 11 features, namely LDL (№ 43), cholesterol (№ 39), HDL-C (№ 36), MCHC (№ 20), triglyceride (№ 48), amylase (№ 31), UA (№ 51), LDH (№ 42), CK-MB (№ 32), ALP (№ 30) and MCH (№ 19), in the diagnosis of the disease was higher than the individual performances, suggesting that there is a high level of hidden information between these feature combinations and COVID-19.

Kocar et al. [84] and Zinellu et al. [85] presented evidence of significant changes in the lipid profile of severe COVID-19 patients, particularly in total cholesterol, LDL and HDL-C concentrations. They also reported that increased cholesterol concentrations in the cell membrane increased the binding activity of SARS-CoV-2, facilitated membrane fusion and enabled the successful entry of the virus into the host. Therefore, Kocar et al. [84] and Wei et al. [86] indicated that total cholesterol, LDL and HDL-C characteristics may aid in early risk stratification and clinical decisions. However, conflicting results have been reported for changes in triglyceride levels of severe COVID-19 patients [85]. Stephens et al. [87] stated that in severe COVID-19 patients, the elevated serum amylase value is often not attributable to acute pancreatitis or a clinically significant pancreatic injury, but is more likely to be a nonspecific manifestation of shock/critical illness. Mao et al. [83] stated that changes in leukocytes, neutrophils, lymphocytes, platelets, hemoglobin levels, MCV and MCHC are generally associated with lung involvement, oxygen demand and disease activity. They also noted that high MCV and low MCHC are associated with advanced anemia and are independent predictors of disease worsening [83].

Wu et al. [88] stated that an increase or decrease in LDH is indicative of radiographic progression or improvement. They also demonstrated the potential usefulness of serum LDH as a marker for assessing clinical severity, monitoring treatment response and thus aiding risk stratification and early intervention in COVID-19 pneumonia. Hu et al. [89] stated that SARS-CoV-2 infection is associated with low serum uric acid (SUA) levels, and this feature may be an independent risk factor for the disease. They also noted that male patients with COVID-19 accompanied by low SUA levels are at higher risk of developing severe symptoms than those with high SUA levels at admission. Zinellu et al. [90] found that high CK-MB concentrations were significantly associated with severe morbidity and mortality in COVID-19 patients. They stated that this biomarker of myocardial damage may be useful for the classification of patients with severe COVID-19, and that high CK-MB values may reflect excessive inflammation status. They also stated that the evaluation of CK-MB in COVID-19 patients provides specific clinical information for early risk stratification, independent of myocardial necrosis and cardiac complications. Afra et al. [91] showed the incidence of abnormal liver tests in severe COVID-19 patients and reported the association of elevated AST, ALT and total bilirubin levels with liver injury in severe COVID-19 patients [13,91,92]. However, conflicting results have been reported regarding the ALP levels of mild and severe COVID-19 patients [91]. In addition, Afra et al. [91] showed that elevated liver enzymes can effectively predict hospital-critical COVID-19 cases.

The accuracy of ML algorithms is difficult to determine when used without any physician input [93,94]. A major limitation of ML is that it is difficult to explain how these algorithms arrive at their conclusions [95]. An ML algorithm can be likened to a black box that takes inputs and produces outputs without any explanation as to how it produces the outputs [94].

Additionally, if an algorithm misdiagnoses a malignant lesion, the algorithm cannot explain why it chose a particular diagnosis [94,95]. While the printouts can aid interpretation, it can be a potential danger and problem to the patient if the model fails to explain why it has diagnosed a lesion as benign or malignant, or how it has chosen a particular treatment [94].

Physician interpretation is necessary for choosing a diagnosis or treatment. In addition to the black box nature of these algorithms, machine learning is also prone to the "garbage in, garbage out" motto [94]. This maxim indicates that the quality of the dataset input determines the quality of the output. Therefore, if the data inputs are badly labeled, the outputs of the algorithm will reflect these inaccuracies [93–95].

In addition, all the devices should be evaluated in prospective clinical trials and made publicly available in the peer-reviewed literature.

#### *5.2. Analysis of Results from IoT Perspective*

The aim of this study is the feasibility analysis of a fast, reliable and cost-effective digital tool for the diagnosis of COVID-19 based on the RBV values measured at admission. The proposed solution is based on the concept of ML sensors for diagnosis of the COVID-19 disease (Sensor 1.0 type). The concept makes a step towards "smart sensorics of human" with promising opportunities for AI applications in healthcare.

In our study, we are not targeting IoT systems for telemedicine, wherein any procedure is performed by a physician using telecommunication means of transmitting medical data. Solutions of clinical telemedicine are subject to strict certification. Telemedicine is prescribed by a doctor and is administered via a medical device (never via a smartphone). Instead, we focus on promising IoT systems for telehealth/telecare and mobile health (m-Health) [45]. First, data from ML sensors support the prognosis of the disease in offline and online modes. Second, ML sensors can be used in AAL and other IoT environments to support a person in his/her everyday life. Importantly, COVID-19 is not the only disease to apply ML sensors in IoT systems.

AI methods become effective for the prognosis of various diseases. COVID-19 has opened the new era for AI methods to mitigate future pandemics. The rapidly growing number of publications confirms the potential of ML sensors for collecting datasets for further analysis with AI methods. Predictive analytics uses available retrospective data and various predictive models (including ML-constructed) to aid in answering the question "What could happen?". Prognosis from the sensed data is required not only for clinical medicine (to support clinical medical decisions). Managerial predictive analytics supports managers in healthcare at various levels to assess possible scenarios for the development of diseases, the budgets of medical organizations, the need for medicines, etc.

In AAL, ML sensors are useful in personal use as digital assistance (recommendations, including prognosis). In fact, the five natural human sensors (vision, hearing, touch, smell, taste) are extended by ML sensors. A person can develop health insight from their own RBVs in real time or collect the data for retrospective analysis. Humans themselves can act as complicated sensors [44]. A human traditionally finds a way to enhance her/his function, e.g., glasses (optical tool) to advance the vision or thermometer (physical tool) to regularly sense body temperature. Now the era of digital tools for personal health assistance is coming.

The implemented prototype of the diagnosis tool demonstrates that the LogNNet network can be imported to various microcontrollers. Many IoT devices can be made smarter, opening a way to develop advanced AmI healthcare with essential parts of IoT and edge intelligence [96]. A LogNNet-equipped ML sensor can be effectively employed in future IoT applications for healthcare and for other problem domains that require active digitalization and emerging AmI methods [46].

The LogNNet network can be used to predict a diagnosis based on blood biochemical parameters. This result is an important step in smart human sensors for IoT application, as the COVID-19 status and other blood-related health parameters are difficult to analyze on the IoT edges (in contrast to more widespread parameters, such as temperature or heartbeat) [97–99]. Our approach is applicable to the development of personalized bionic systems (smart suit for a person or AmI environment with people), wherein disease status recognition is a regular digital service for healthcare or well-being applications in everyday life [45].

Although the small IoT devices cannot provide such high accuracy as resource-intensive ML algorithms on powerful computer systems, AAL systems are intended for everyday life settings (e.g., at home, workspace, outdoor). Where strict medical decisions and critical medical support are not mandatory, the digital services may provide attention points and optional recommendations for personal use. We believe this type of smart human sensors will soon diffuse from the restricted medical lab setting toward the wide market of smart consumer electronics and digital services [100].

#### **6. Limitations of the Study**

The data primarily represent a single institution (EBYU-MG) and the Turkish population. Secondly, our dataset does not include comorbidities of patients and other diagnostic information of patient subgroups. In practice, the data in retrospective studies collected in a certain period cannot meet all data sample requirements. We suggest the findings in this study be supported by a retrospective cohort study setup.

#### **7. Conclusions and Future Studies**

Determining a COVID-19 infected status with diagnostic tests and imaging results is costly and time-consuming. If this process is prolonged, the patient's health may be at greater risk by being exposed to different complications. This study provides a fast, reliable and cost-effective alternative mobile tool for the diagnosis of COVID-19 based on the RBVs measured at the time of admission.

In this study, 13 popular classifier machine learning models and the LogNNet neural network model were run on 51 routine blood values to detect patients infected with COVID-19. The histogram-based gradient boosting (HGB) model was the most successful classification model in terms of accuracy and time in detecting the diagnosis of the disease (accuracy: 100%, time: 6.39 s). In addition, the absence of any normalization method and additional feature selection procedure for the HGB model contributes to the speed and efficiency of the model.

The eleven most important biomarkers in the diagnosis of the disease were found with the HGB classifier: LDL (№ 43), cholesterol (№ 39), HDL-C (№ 36), MCHC (№ 20), triglyceride (№ 48), amylase (№ 31), UA (№ 51), LDH (№ 42), CK-MB (№ 32), ALP (№ 30) and MCH (№ 19). Using only these 11 RBVs features, the HGB model accurately detected all COVID-19 patients (*A*<sup>11</sup> = 100%).

The high accuracy of the single, double and triple combinations of these 11 features selected by the HGB model in the diagnosis of the disease showed the importance of these features in the diagnosis of the disease. In addition, the performance of double and triple combinations of these features in the detection of sick and healthy individuals was higher than the individual performances, suggesting that there is a high level of hidden information between these blood feature combinations and COVID-19.

The results show that 11 features are sufficient for diagnosing COVID-19 with the HGB classifier. These features and their binary combinations are an important source of variation in the diagnosis of COVID-19. We propose using these features and their binary combinations with HGB as important biomarkers in the diagnosis of the disease.

The study results can be effectively used in IoT medical edge devices with low RAM resources, ML sensors, portable point-of-care blood testing devices [101], decision support systems, telecare and m-Health. This opportunity empowers the development of many innovative applications for predictive analytics in clinical MIS or everyday AAL systems.

Artificial intelligence models for the early prediction of the diagnosis and progression of COVID-19 and other diseases produce satisfactory results. Future artificial intelligence studies for the early diagnosis and prognosis of fatal, costly and severe diseases will ease the burden on healthcare professionals and increase patient comfort. In addition, the use of the physiological, comorbidity and demographic features of the patients together with the RBVs data may provide interesting insights. Testing the results of this study on multi-racial, multi-center and larger patient groups may improve the generalizability of the findings. In this context, this study may pave the way for many exciting subsequent investigations.

**Author Contributions:** Conceptualization, M.T.H. and A.V.; methodology, A.V., M.T.H., M.B., Y.I. and D.K.; software, A.V., M.T.H., M.B. and Y.I.; validation, M.T.H. and A.V.; formal analysis, A.V., M.T.H., M.B, Y.I. and D.K.; investigation, A.V.; resources, M.T.H.; data curation, M.T.H.; writing—original draft preparation, A.V., M.T.H., M.B., Y.I. and D.K.; writing—review and editing, A.V., M.T.H., M.B., Y.I. and D.K.; visualization, A.V. and M.B.; supervision, D.K.; project administration, D.K.; funding acquisition, D.K. All authors have read and agreed to the published version of the manuscript.

**Funding:** The research is implemented with financial support by Russian Science Foundation, project no. 22-11-20040 (https://rscf.ru/en/project/22-11-20040/ (accessed on 14 September 2022)) jointly with Republic of Karelia and funding from Venture Investment Fund of Republic of Karelia (VIF RK).

**Institutional Review Board Statement:** The dataset used in this study was collected in order to be used in various studies in the estimation of the diagnosis, prognosis and mortality of COVID-19. The necessary permissions for the collected dataset were given by the Ministry of Health of the Republic of Turkey and the Ethics Committee of Erzincan Binali Yıldırım University. This study was conducted in accordance with the 1989 Declaration of Helsinki. Erzincan Binali Yıldırım University Human Research Health and Sports Sciences Ethics Committee Decision Number: 2021/02-07.

**Informed Consent Statement:** In this study, a dataset including only routine blood values, RT-PCR results (positive or negative) and treatment units of the patients was downloaded retrospectively from the information system of our hospital in a digital environment. No new samples were taken from the patients. The dataset contains no information that identifies individuals. It was stated that routine blood values would only be used in academic studies, and written consent was obtained from the institutions for this. Therefore, written informed consent was not obtained from each individual patient.

**Data Availability Statement:** The data used in this study can be shared with the parties, provided that the article is cited.

**Acknowledgments:** We thank the management of Erzincan Mengücek Gazi Training and Research Hospital for their support in obtaining the material used in this study. Special thanks to the editors of the journal and to the anonymous reviewers for their constructive criticism and improvement suggestions.

**Conflicts of Interest:** The authors declare no conflict of interest.




## *Article* **Gait Characteristics Analyzed with Smartphone IMU Sensors in Subjects with Parkinsonism under the Conditions of "Dry" Immersion**

**Alexander Y. Meigal <sup>1,</sup>\*, Liudmila I. Gerasimova-Meigal <sup>1</sup>, Sergey A. Reginya <sup>2</sup>, Alexey V. Soloviev <sup>2</sup> and Alex P. Moschevikin <sup>2</sup>**


**\*** Correspondence: meigal@petrsu.ru; Tel.: +7-911-402-9908

**Abstract:** Parkinson's disease (PD) is increasingly being studied using science-intensive methods due to economic, medical, rehabilitation and social reasons. Wearable sensors and Internet of Things-enabled technologies look promising for monitoring motor activity and gait in PD patients. In this study, we sought to evaluate gait characteristics by analyzing the accelerometer signal received from a smartphone attached to the head during an extended TUG test, before and after single and repeated sessions of terrestrial microgravity modeled with the condition of "dry" immersion (DI) in five subjects with PD. The accelerometer signal from the IMU during the walking phases of the TUG test allowed for the recognition and characterization of up to 35 steps. In some patients with PD, unusually long steps were identified, which could potentially have diagnostic value. It was found that after one DI session, stepping did not change, though in one subject it significantly improved (cadence, heel strike and step length). After a course of DI sessions, some characteristics of the TUG test improved significantly. In conclusion, the use of accelerometer signals received from a smartphone IMU looks promising for the creation of an IoT-enabled system to monitor gait in subjects with PD.

**Keywords:** inertial measurement unit; smartphone; accelerometry; TUG test; gait; Parkinson's disease; "dry" immersion

#### **1. Introduction**

Parkinson's disease (PD) is very suitable for the application of science-intensive instrumental research methods. PD is gradually becoming a kind of "model disease" for the testing of new technologies for PD diagnostics and the follow-up of PD subjects [1]. For several reasons, PD is one of the most studied neural pathologies in humans. One of the reasons is that PD is a widespread neurodegenerative disease worldwide, and its prevalence is increasing [2]. Furthermore, PD exerts a high economic burden on society [3] and worsens the quality of life of patients with PD [4]. Next, PD is characterized by gradual progression over decades [5], and PD symptoms are reliably quantified using clinical scales, which allow for the mathematical modeling of PD evolution [6]. In addition, PD seems to be an extremely informative research object, since it allows for the development of insights into such phenomena as muscle tone and tremor, motor commands, postural reactions, orientation in space and gait.

Earlier, we showed that in subjects with PD, both a single session of Earth-based microgravity modeled with "dry" immersion (DI) conditions [7] and a program of repeated DI sessions [8] attenuate muscle rigidity and tremor and improve some aspects of activities of daily living. Additionally, some motor-cognition tests [9] and characteristics of hemodynamics and heart rate variability improved after a program of DI sessions [10]. On the other hand, the function of a patient's spatial orientation in a vertical stance and the function of postural transition proved non-responsive to the condition of DI [7]. Muscle rigidity is often associated with bradykinesia (slowness of movements) and akinesia (difficulties with starting motion), which is seen in the akinetic-rigid form of PD. This allows for the presumption that a decrease in muscle rigidity, provoked by DI conditions, may result in an improvement of gait characteristics, e.g., gait speed, cadence and length of steps.

**Citation:** Meigal, A.Y.; Gerasimova-Meigal, L.I.; Reginya, S.A.; Soloviev, A.V.; Moschevikin, A.P. Gait Characteristics Analyzed with Smartphone IMU Sensors in Subjects with Parkinsonism under the Conditions of "Dry" Immersion. *Sensors* **2022**, *22*, 7915. https://doi.org/10.3390/s22207915

Academic Editors: Raffaele Gravina and Leopoldo Angrisani

Received: 19 August 2022 Accepted: 14 October 2022 Published: 18 October 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

The Timed Up-and-Go (TUG) test has proven to be reliable in many domains of neuromuscular and orthopedic pathology for assessing gait, basic mobility skill, strength, agility and balance [11]. It consists of five sequential phases: (1) standing up from a chair (Sit-to-Stand transition phase), (2) walking straight forward (Gait-Go phase, including stand-to-walk transition), (3) turning by 180° (U-turn phase), (4) walking back (Gait-Come phase), and (5) sitting down (Stand-to-Sit transition with a turn). In its classic 3 m form, the TUG test provides an immediate score, requires no training and only needs one tester [11]; however, the classic TUG test supplies little data on gait as it requires the patient to take only 4 to 6 steps in each direction. In addition, the first step, the steps right before and right after the U-turn, and the last step are clearly specific in their biomechanical function (transition to locomotion, and decelerating when approaching the U-turn point and the end-point of the test, respectively). To overcome this problem, longer (expanded) versions of the TUG test were developed. For example, Haas et al. [12] presented the so-called L-test, which includes longer walks and turning in both directions, and Galán-Mercant et al. [13] presented a 10 m version of the TUG test. Earlier, we proposed an even longer (extended) version of the TUG test (13 m long, which returns around 20 steps in one direction) to provide a more precise view of a self-paced walk at a comfortable speed in the middle of both the Gait-Go and Gait-Come phases [14].

Throughout the last decade, instrumented versions of the TUG test (iTUG) have increasingly been developed. In most of these versions, varied numbers and positions of inertial measurement unit (IMU)-based wearable sensors (accelerometers) were used to discriminate between the phases of the TUG test [15,16]. The IMU is often fixed on a foot to obtain the exact position of a limb in real time. The quality of the sensor's trajectory restoration is often controlled by video capture, and its position accuracy is in the millimeter range [17,18]. One of the problems with such a system is the time synchronization of the inertial sensor data and the video stream [19]. Some researchers use multi-IMU networks. For example, in the study by Qiu et al. [20], a system of 100 Hz IMUs connected via WiFi was applied for monitoring complex gait parameters, including knee angle dynamics. Bogaarts et al. [21] explored the impact of noise on gait features extracted from smartphone sensor data. They created a model of a moving body, generated acceleration signals for many points on the body, added noise to the simulated signals and then tried to extract the gait features. As a result, they showed that sensors in off-the-shelf smartphones are sufficient for registering acceleration signals, given that the sensor noise has a negligible impact on the computation of step power and other similar parameters [21].

There is a multitude of technological approaches for studying PD. Among them are optical motion trackers, biopotential devices, audio and video recording, and, especially, wearable sensors, such as smart glasses, hats, insole sensors of ground-reaction force and smartphones [22]. Previously, we evaluated the effect of DI on PD subjects with conventional laboratory tools (EMG, reaction time, tapping test, posturography) [7–10].

According to the review by Deb et al. [22], wearable sensors are currently the most used (40% of articles in the field), while smartphones are the least used; however, since 2020, the number of articles that have used smartphones for research has been growing [22]. Among the application areas, the diagnosis and monitoring/prognosis of PD were the most studied, and among symptoms, the gait, tremor and speech of a subject were the most studied [22]. Smartphone applications have good to excellent ability for predicting and discriminating gait and postural instability between PD subjects and healthy controls, as well as the leg dexterity and gait cycle breakdown between PD subjects with different severities of the disease [23]. Thus, there is strong evidence regarding the potential use of smartphone applications to assess gait and balance among individuals with PD in the home or laboratory [22,23].

Smartphones are equipped with IMUs that consist of a 3-axis accelerometer, a 3-axis gyroscope and a digital magnetometer, which are comparable in sensitivity to research-grade biomechanical instrumentation [24]. In the study by Manor et al. [24], smartphones were placed in the front pocket, which is relevant for non-laboratory settings. Typically, smartphones are secured to the trunk or lower extremities. In our earlier study, we suggested a method of reconstructing the head trajectory in 3D space using the IMU-based accelerometer and gyroscope of a smartphone fixed on a subject's head [14]. We assumed that, in accordance with the concept of the "inverted pendulum model", the head produces the biggest displacement in the vertical axis [25]. Similarly, Hwang et al. [26] conducted research with a single 60 Hz IMU fixed on the head, which is similar to the method used in our study. They used a FIR low-pass filter to reduce the noise and applied a threshold to capture the exact phases of a stride. Since a FIR filter introduces a certain latency in the processed data series, this fact should be considered in data analysis. The authors also presented clear figures demonstrating that a single sensor fixed on the head picks up acceleration signals from both legs, and they clarified how the obtained signal might be "decoded" and understood. Thus, a smartphone fixed on a subject's head can return meaningful information about gait. Still, the 100 Hz IMU of a smartphone does not allow for sufficiently precise tracking of the trajectory of the head.

The major hypothesis of the present study was that gait characteristics in patients with PD are responsive to the conditions of either one session of DI or a course of repeated DI sessions. To address this, we obtained up-sampled acceleration signals from smartphone-based 100 Hz IMU sensors attached to the subject's head during a 13 m TUG test before and after a single session of DI and a program of seven DI sessions.

#### **2. Materials and Methods**

#### *2.1. Subjects*

Altogether, data from six subjects with PD were collected in the study. Their anthropometric and clinical data and the medication they used are presented in Table 1. All of them are from the same cohort of subjects who participated in our earlier studies [7–10]. The data on gait characteristics presented in this article were obtained from these studies. All subjects signed their informed consent, and the protocol of the study was approved by the Local Ethical Committee (joint ethics committee of the Ministry of Healthcare of the Republic of Karelia and Petrozavodsk State University (Statement of approval No. 31, 18 December 2014)).


**Table 1.** The anthropometric and clinical data of the subjects with PD.

T—tremulous form, AR—akinetic-rigid form of PD. Subject 1 participated in 2 courses of DI sessions. Subject 6 did not participate in DI sessions.

#### *2.2. Procedures*

#### 2.2.1. On-Earth Model of Microgravity

The on-Earth microgravity was modeled using the conditions of "dry" immersion (DI). This method of DI has already been presented in detail in our earlier papers [7–10]. In brief, the condition of DI was created with the help of MEDSIM (Medical simulator of weightlessness, Center for Aerospace Medicine and Technology, IMBP, Moscow, Russia), which is housed in the Laboratory of Novel Methods in Physiology (Petrozavodsk State University). The MEDSIM facility uses a bathtub filled with 2 m<sup>3</sup> of fresh, thermally comfortable water stabilized at T = 32 °C. The water in the tub was periodically filtered and aerated to prevent bacterial contamination. The water surface was covered with a large, square waterproof film (3 × 4 m<sup>2</sup>), which was wrapped around the subject's body. The DI session was conducted at 9:30 AM, in the "on-medication" condition, in order to synchronize the effects of DI and the anti-PD therapy. The subjects usually took their medicines 2 h before the study, at 7:30 AM. Before DI, subjects were instructed to drink 200 mL of water and urinate due to the strong diuretic effect of DI [27]. Before immersion, subjects lay supine for approximately 10 min on a cotton sheet on a solid, movable, motor-driven platform in the MEDSIM facility in order to attach electrocardiogram electrodes, measure brachial blood pressure (BP) and familiarize themselves with the setting (altogether around 5 min), and to record an ECG in standard lead II (5 min). If after 10 min of lying supine the subject's BP was higher than 140/80 mm Hg, he/she was not allowed to enter DI and the study was postponed to another day. After that, the platform was driven to its bottom position, and subjects found themselves immersed in water without direct contact with the water; the head and upper chest were left above the water's surface. One DI session lasted for 45 min. BP and ECG were monitored at the 15th, 30th and 45th min.
After the DI session, subjects lay motionless on the platform in its upper position for a further 5–7 min for re-adaptation to the pre-DI conditions and for ECG monitoring. Altogether, 22 measurements were successfully conducted: 10 measurements before/after a single session of DI (5 paired sets of data), and 12 measurements before and after a program of DI sessions (6 paired sets of data).

The program of DI consisted of seven 45 min DI sessions that were conducted twice a week for 25–30 days. The total DI dose during the course was 5¼ h (7 × 45 min).

#### 2.2.2. Test Protocol: 13 m TUG Test

The TUG test was performed in its extended form (13 m instead of the conventional 3 m long test). Still, its phases were all the same: (1) standing up from a 46 cm high chair (Sit-to-Stand phase), (2) walking straight (13 m, Gait-Go phase, including Stand-to-Walk transition), (3) turning by 180° (U-turn), (4) walking back (13 m, Gait-Come phase), and (5) sitting down (Stand-to-Sit transition with a turn). In addition, unlike the classic 3 m version, the 13 m version of the TUG test allowed for an analysis of the subject's steps (gait), because subjects performed up to 20 steps in one direction, which is sufficient for analyzing gait [28]. The TUG test was performed 15 min prior to and then 8–10 min after the DI session. A baseline gait analysis throughout the day was not conducted, nor was any further gait analysis conducted before or after the DI session.

#### *2.3. Data Processing*

During the TUG test, the acceleration and rotation rate were measured with the sensor module in the smartphone Xiaomi Mi4 (Xiaomi Tech, Beijing, China; 68.5 mm × 139.2 mm × 8.9 mm, 149 g). The obtained signals were further processed offline.

The subject was instructed to sit still and look forward before and after the test. In a motionless state, the shape of the accelerometer signal is formed by the current projections of the gravity vector, measurement noise and the existing zero-G offsets, as well as the head tremor. The beginning and end of the movement are characterized by a change in the *x*-, *y*- and *z*- components of the acceleration vector due to the inclination as well as the presence of linear accelerations while standing up and sitting down. For the gyroscope in a motionless state, the signal includes the sensor noise as well as the head tremor; however, it is characterized by the constancy of the mean value (measurement offset). The start and end points of the TUG test were selected manually by analyzing the change in the mean value due to the rotation of the head and body during inclination while standing up and sitting down.

The internal phases of the TUG test and step moments were determined automatically. For steps in a straight-line walk (15–17 more or less uniform steps in the middle of both the Gait-Go and Gait-Come phases), a set of gait features was calculated. Altogether, subjects performed 35–40 steps in both directions, of which 30–35 steps that were in the middle of the walk were analyzed.

#### *2.4. Inertial Data Acquisition and Pre-Processing*

The sampling rate of the inertial sensors—both the accelerometer and gyroscope—of the Xiaomi Mi4 smartphone that was used in the present study was 100 Hz (the period between data samples was Δt = 10 ms), which can be regarded as neither reliably accurate nor fast. The smartphone was fixed on the back of the head of a subject with an elastic band and, additionally, a tight-knitted hat; subjects felt comfortable with this kind of fixation and the smartphone never fell out of its position. Values for the acceleration and angular velocity were collected as a time-stamped data stream. Thus, the measurements were accompanied by timestamps from the smartphone's operating system timer. For further analysis, the accelerometer and gyroscope measurements with a time-stamp difference of less than 5 ms (half of the measurement period) were considered synchronous. In order to increase the time resolution and achieve a smoother distribution, the time series data were up-sampled to a 10-fold-higher frequency of 1 kHz (Δt' = 1 ms) (Figure 1). In addition, an increase in the time resolution allowed for the application of high-order digital filters to the obtained time series.

**Figure 1.** Up-sampling of a periodic signal obtained during walking. The black open circles on the bottom panel correspond to the real data sampled at a frequency of 100 Hz. The red points are new data points reconstructed with up-sampling. The red curve is a continuous signal passing through all the circles.

Since the analyzed signal tended to be periodic, the up-sampling, which used Fast Fourier Transform, could be applied. Furthermore, as long as the measurement signals are real-valued, the real (single-sided) FFT is suitable for conversion into the frequency domain. In the frequency domain, up-sampling means there is zero-padding at the end of the high frequency components of the signal. The up-sampling procedure included the following steps:
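As a rough illustration of FFT-based up-sampling (a minimal sketch under stated assumptions, not necessarily the exact procedure used in this study), the following Python/NumPy function zero-pads the single-sided spectrum of a real-valued signal and transforms it back to the time domain. The function name `fft_upsample` and the handling of the Nyquist bin are illustrative assumptions.

```python
import numpy as np

def fft_upsample(x, factor):
    """Up-sample a real-valued, approximately periodic signal by an integer factor.

    Hypothetical helper: zero-pads the single-sided (real) FFT spectrum.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    m = n * factor                          # length of the up-sampled series
    spectrum = np.fft.rfft(x)               # single-sided spectrum, n//2 + 1 bins
    padded = np.zeros(m // 2 + 1, dtype=complex)
    padded[:spectrum.size] = spectrum       # zeros fill the new high-frequency bins
    if factor > 1 and n % 2 == 0:
        # Assumption: the old Nyquist bin is split between positive and
        # negative frequencies so that the original samples are preserved.
        padded[n // 2] *= 0.5
    # irfft divides by m, whereas the original data were scaled by n
    return np.fft.irfft(padded, n=m) * factor

# Example: 1 s of a 2 Hz sine sampled at 100 Hz, up-sampled to 1 kHz
t = np.arange(100) / 100.0
x = np.sin(2 * np.pi * 2 * t)
y = fft_upsample(x, 10)                     # 1000 samples, dt' = 1 ms
```

For a band-limited signal that is periodic within the analysis window, the original samples reappear at every tenth point of the up-sampled series; for real gait data, edge effects should be expected where the periodicity assumption does not hold.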


A simple calibration of the zero offsets of the sensors was performed before the start of the test. To do this, we used the measurements obtained from a smartphone placed on a horizontal surface. It was noted that the sensor offsets were probably pre-calibrated by the Android OS. The bias instability and velocity/angle random walk for smartphone sensors were previously analyzed by us using the Allan variance [14]. The bias instabilities are (7.3, 8.2, 8) × 10<sup>−4</sup> m/s<sup>2</sup> for the *x*-, *y*- and *z*-axes of the accelerometer, and (1.7, 5, 7) × 10<sup>−5</sup> deg/s for the gyroscope. Since the test duration is less than 1 min, the bias drifts can be neglected.

The orientation of the smartphone was calculated using the well-known complementary filter proposed by Mahony et al. [29] and was expressed in the form of a quaternion Q. Using Q, the acceleration and angular velocity measurement vectors were converted to a global coordinate system (global frame):

$$\mathbf{GV} = \mathbf{Q} \otimes \mathbf{SV} \otimes \mathbf{Q}^{*}, \tag{1}$$

where GV and SV are "pure" quaternions associated with the 3-dimensional measurement vector in the global frame and sensor frame, respectively; Q\* is the conjugate of Q; and the ⊗ symbol represents the Hamilton product.
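Equation (1) can be illustrated with a short Python sketch. The function names `hamilton_product` and `sensor_to_global` are hypothetical, the quaternion is assumed to be stored in (w, x, y, z) order, and the orientation Q is assumed to be already available from the orientation filter.

```python
import numpy as np

def hamilton_product(p, q):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    w1, x1, y1, z1 = p
    w2, x2, y2, z2 = q
    return np.array([
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
    ])

def sensor_to_global(q, v_sensor):
    """Rotate a 3D sensor-frame vector into the global frame: GV = Q (x) SV (x) Q*."""
    q = np.asarray(q, dtype=float)
    q = q / np.linalg.norm(q)                       # enforce a unit quaternion
    q_conj = q * np.array([1.0, -1.0, -1.0, -1.0])  # conjugate Q*
    sv = np.concatenate(([0.0], v_sensor))          # "pure" quaternion (0, v)
    gv = hamilton_product(hamilton_product(q, sv), q_conj)
    return gv[1:]                                   # vector part in the global frame

# Example: a 90-degree rotation about the vertical (z) axis
q_z90 = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])
v_global = sensor_to_global(q_z90, np.array([1.0, 0.0, 0.0]))
```

In this example, the sensor-frame *x* unit vector is mapped onto the global *y* axis, while a vector along *z* (e.g., gravity) is left unchanged.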

#### *2.5. Turns (Rotation) Detection*

Automatic detection of a turn was conducted by analyzing the projection of the angular velocity on the vertical axis. No additional filtering of measurements was performed. If the values of the amplitude and duration of the rotation rate exceeded certain threshold values, a rotation was considered to be detected (recorded). At the first stage, a comparison was made with the threshold value of the rotation rate (10 degrees per second). At the second stage, the rotation duration was estimated. Rotations lasting less than 1 s were discarded. If three or more turns were detected in the TUG test, the two longest turns were considered the 1st (at the U-turn phase) and 2nd (prior to sitting down on a chair) turn. According to the available experimental data, this algorithm was successful in 100% of cases for both turn events.
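The two-stage thresholding described above can be sketched as follows. This is an illustrative reconstruction only: the function name `detect_turns` is hypothetical, and the use of the absolute value of the rotation rate (so that turns in either direction are detected) is an assumption.

```python
import numpy as np

def detect_turns(yaw_rate_dps, fs, rate_threshold=10.0, min_duration=1.0):
    """Return up to two (start, end) sample-index pairs of the longest turns.

    yaw_rate_dps: angular velocity projected on the vertical axis, deg/s
    fs: sampling rate, Hz
    """
    active = np.abs(np.asarray(yaw_rate_dps, dtype=float)) > rate_threshold
    # Stage 1: contiguous runs above the rate threshold
    padded = np.concatenate(([False], active, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    starts, ends = edges[0::2], edges[1::2]          # ends are exclusive
    # Stage 2: discard rotations shorter than the minimum duration
    segments = [(int(s), int(e)) for s, e in zip(starts, ends)
                if (e - s) / fs >= min_duration]
    # Keep the two longest rotations and return them in chronological order
    segments.sort(key=lambda se: se[1] - se[0], reverse=True)
    return sorted(segments[:2])
```

With a TUG recording, the first returned segment would correspond to the U-turn and the second to the turn before sitting down, provided both are the longest rotations in the record.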

#### *2.6. Step Detection*

Step detection was automatically performed by analyzing the time series of the acceleration vector. Since the typical cadence of stepping is about two steps per second, the measurements were filtered with a forward–backward zero-phase low-pass filter (Butterworth, 10th order) with a cut-off frequency of 3 Hz, which allowed us to obtain the LPF time series data. After that, peak values of the filtered signal were detected. Moments where the acceleration magnitude reached 11 m/s<sup>2</sup> were taken as the approximate time-stamp of the initial contact of the foot with the ground (T' point).
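A minimal sketch of this filtering and peak-picking stage, using SciPy, is given below. The function name `detect_heel_strikes` is hypothetical, and two points are assumptions rather than details confirmed by the text: the 10th order is taken as the order of a single filter pass, and the 11 m/s<sup>2</sup> criterion is applied as a minimum height of the local maxima of the filtered acceleration magnitude.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, find_peaks

def detect_heel_strikes(acc_magnitude, fs, cutoff_hz=3.0, order=10, threshold=11.0):
    """Return approximate heel-strike times (s) and the low-pass filtered signal."""
    # Second-order sections keep a high-order, low-cutoff filter numerically stable
    sos = butter(order, cutoff_hz, btype="low", fs=fs, output="sos")
    # Forward-backward filtering gives zero phase shift
    lpf = sosfiltfilt(sos, np.asarray(acc_magnitude, dtype=float))
    # Local maxima whose height reaches the threshold (m/s^2)
    peaks, _ = find_peaks(lpf, height=threshold)
    return peaks / fs, lpf

# Synthetic example: gravity plus a 2 Hz "stepping" component and 50 Hz interference
fs = 1000
t = np.arange(10 * fs) / fs
acc = 9.81 + 3.0 * np.sin(2 * np.pi * 2 * t) + 0.5 * np.sin(2 * np.pi * 50 * t)
t_prime, lpf = detect_heel_strikes(acc, fs)
```

Because the filter is applied in both directions, the detected peaks are not delayed relative to the raw signal, although samples near the start and end of the record may be affected by edge transients.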

For each step, the revised time-stamp T\_step of the heel strike and the corresponding maximum acceleration along the vertical axis were determined by searching for the maximum value in the ±40 ms window near the T' point. Not all steps taken during the TUG test were taken into account for gait analysis. The following local maxima that were obtained during the step detection procedure were discarded:


#### *2.7. Gait Features*

2.7.1. Duration of the TUG Test Phases (D-Parameters)

To estimate the duration of the entire TUG20 test and its phases, the following parameters were determined (Figure 2):

**Figure 2.** A representative plot of the 13 m TUG test with the phases (D1–D4) and parameters P1–P3 determined with an accelerometer (**left panel**) and the moments of step duration distribution (S1–S2, **right panel**) along progression of time.

D1 (The entire TUG test duration): the time from the very beginning of motion (the Sit-to-Stand movement) until the end of the test (sitting down on a chair).

D2 (Corresponds to the Sit-to-Stand phase plus the Stand-to-Walk period): the time from the start of the lifting to the moment of the heel strike on the second step.

D3 (U-turn phase, the 1st turn duration): the time to perform a 180° turn at the far turning point.

D4 (Walk-to-Sit phase, or the 2nd turn duration): the time from the beginning of the second turn until the end of the test.

2.7.2. Characteristics of the Temporal Stability of Stepping (S-Parameters)

For the analysis of gait stability, only straight-line, uniform steps were taken into account (see Section 2.6). The following parameters were computed:

S1 (Mean\_step\_duration, s): the duration of the step (dt) was determined as the difference between consecutive time-stamps of successive steps (Tstep moments). Before calculating the average value, two points with the largest deviation from the median value of step duration were discarded (red crosses, see Figure 2).

S2 (Step\_duration\_std, s) (see Figure 2).

To calculate the cadence mean and standard deviation, the "instantaneous walking pace" was first estimated for each step (cadence = 60/dt); then two outliers were discarded. Usually, these outliers were characteristic of the "transitional" moments during the TUG test (at the beginning and end of the Gait-Go and Gait-Come phases, and before the U-turn).

S3 (Cadence\_mean, steps per min).

S4 (Cadence\_std, steps per min).

S5 The ratio of the average deviation of the two largest outliers of the step duration to the standard deviation of the step duration without taking into account the two largest outliers (red double-sided arrow, see Figure 2). S5 reflects a tendency to take unusually long steps (LS).
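The definitions of S1–S5 given above can be illustrated with the following Python sketch. It is an interpretation under stated assumptions, not the authors' code: the function name `s_parameters` is hypothetical, deviations are measured from the median step duration, the same two steps are discarded for the cadence statistics, and the sample standard deviation (`ddof=1`) is used.

```python
import numpy as np

def s_parameters(t_step):
    """Compute S1-S5 from heel-strike time-stamps (in seconds)."""
    dt = np.diff(np.asarray(t_step, dtype=float))   # step durations
    dev = np.abs(dt - np.median(dt))                # deviation from the median
    order = np.argsort(dev)
    inliers, outliers = order[:-2], order[-2:]      # two largest deviations dropped
    dt_in = dt[inliers]
    cadence = 60.0 / dt_in                          # instantaneous pace, steps/min
    s2 = dt_in.std(ddof=1)
    return {
        "S1": dt_in.mean(),                         # mean step duration, s
        "S2": s2,                                   # step duration std, s
        "S3": cadence.mean(),                       # cadence mean, steps/min
        "S4": cadence.std(ddof=1),                  # cadence std, steps/min
        "S5": dev[outliers].mean() / s2,            # tendency to take long steps
    }

# Example: mostly 0.5 s steps with two unusually long steps (0.80 s and 0.75 s)
durations = [0.5, 0.52, 0.48, 0.5, 0.51, 0.49, 0.8, 0.5, 0.75, 0.5]
params = s_parameters(np.concatenate(([0.0], np.cumsum(durations))))
```

In this example, the two long steps are excluded from S1–S4 but dominate S5, which becomes much larger than 1.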

S5 was calculated according to the following algorithm:


The estimates of the probability density functions of the step duration and the acceleration upper/lower peaks were obtained using kernel density estimation (KDE). KDE was computed using the Python scipy.stats.gaussian\_kde function (written by Robert Kern, 2004, Enthought, Inc., Austin, TX, USA). As P1, P2 and S2 values are related to the width of the target variables' distributions, they are shown as the full width at half maximum. On the left panel, the red dots denote minimal acceleration when both feet were touching the floor, and the green dots denote heel strike. The open red and green circles represent these dots. Two outlier values are denoted with black crosses. On the right panel, the open black circles represent the individual step duration along the time course. The outlier values, denoted by red crosses (>0.7 s), represent unexpectedly longer steps right prior to U-turn. Furthermore, note that during the Gait-Come phase (upper group of open black circles), the length of the steps decreased roughly from 0.68 m to 0.6 m. For more information, see the text below; these data were obtained from Subject 6.
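As an illustration of how a distribution width could be read off a kernel density estimate, the sketch below evaluates `scipy.stats.gaussian_kde` on a grid and measures the full width at half maximum (FWHM). The helper name `kde_fwhm`, the grid extent, the default bandwidth and the assumption of a single-peaked density are illustrative choices, not details taken from the study.

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_fwhm(samples, grid_points=2000):
    """Return (grid, density, fwhm) for a Gaussian KDE of 1D samples."""
    samples = np.asarray(samples, dtype=float)
    kde = gaussian_kde(samples)                     # default (Scott) bandwidth
    span = samples.max() - samples.min()
    grid = np.linspace(samples.min() - span, samples.max() + span, grid_points)
    density = kde(grid)
    # Outermost grid points where the density is at least half of its maximum
    above = np.flatnonzero(density >= density.max() / 2.0)
    fwhm = grid[above[-1]] - grid[above[0]]
    return grid, density, fwhm

# Example with synthetic, normally distributed data
rng = np.random.default_rng(0)
grid, density, fwhm = kde_fwhm(rng.normal(0.0, 1.0, 2000))
```

For a Gaussian distribution, the FWHM is approximately 2.355 standard deviations; the KDE bandwidth broadens this estimate slightly.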

2.7.3. Characteristics of the Power Stability of Stepping (P-Parameters)

To analyze the power (amplitude) characteristics of each step, the following parameters were estimated:

P1 (Heel\_strike\_accel\_std, m/s²): the standard deviation of the vertical acceleration in Tstep moments. Two outliers were discarded. P1 characterizes the stability (uniformity) of the heel strike during stepping in the Gait-Go and Gait-Come phases (see Figure 2).

P2 (Swing\_accel\_std, m/s²): standard deviation of the vertical acceleration minima that corresponds to the weight transfer phase. Two outliers were discarded. P2 characterizes the stability (uniformity) of the minima values when both feet made contact with the floor during the swing phase of stepping during the Gait-Go and Gait-Come phases (see Figure 2).

P3 (Peak-to-peak\_vertical\_acceleration\_mean, m/s²): the average difference between the minimum and maximum of the vertical accelerations in a series of straight steps.

All D- and P-parameters, and some of the S-parameters, can be identified from Figure 2. The durations of the entire TUG test (D1) and of its phases (D2, D3 and D4), shown on the left panel in Figure 2, were recognized automatically by analysis of the acceleration and rotation rate time series data. The upper peaks (green dots) correspond to heel strike moments and form a cloud of green open circles to the right. Their distribution is characterized by the P1 parameter. Similarly, P2 describes the width of the distribution of the acceleration minima during the swing phase. The horizontal green and red dotted lines denote the medians for these sets. P3 is the difference between the medians. The right panel in Figure 2 describes the distribution of the performed steps over the step duration (histogram and kernel density estimation). S1 stands for the average step duration and S2 stands for its standard deviation. The black circles, from the lowest to the highest, correspond to the recognized steps from the first step to the last one in time. The two longest steps were excluded from the averaging statistics; however, they probably indicate a mild form of "freeze of gait".

#### *2.8. Statistical Analysis*

The analysis was executed with IBM SPSS Statistics 21.0 (SPSS, IBM Company, Chicago, IL, USA). The values of D1–4, S1–5 and P1–3 were compared in the pairs of conditions "before-after a single DI session" and "before-after a course of DI sessions" with the nonparametric paired Wilcoxon signed-rank test.
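For readers who wish to reproduce this kind of paired comparison outside SPSS, a minimal sketch of an exact two-sided Wilcoxon signed-rank test is given below. It is not the procedure run in the study: the function name, the handling of zeros and ties, and the sample values are assumptions, and a library routine such as scipy.stats.wilcoxon would normally be preferred.

```python
from itertools import product


def wilcoxon_signed_rank(before, after):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (W, p), where W is the smaller of the positive and negative
    rank sums. Assumptions: zero differences are dropped, tied absolute
    differences get average ranks, and the sample is small enough to
    enumerate all 2**n sign patterns.
    """
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)

    # Rank absolute differences, averaging ranks within ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)

    # Under the null hypothesis every sign pattern is equally likely; count
    # the patterns whose statistic is at least as extreme as the observed one.
    total = sum(ranks)
    extreme = 0
    for signs in product((0, 1), repeat=n):
        pos = sum(r for r, s in zip(ranks, signs) if s)
        if min(pos, total - pos) <= w:
            extreme += 1
    return w, extreme / 2 ** n


# Hypothetical durations (s) for six paired "before" and "after" measurements.
before = [6.1, 5.4, 7.0, 6.6, 5.9, 6.3]
after = [5.2, 5.0, 6.2, 6.4, 5.3, 5.8]
w, p = wilcoxon_signed_rank(before, after)
print(f"W = {w}, p = {p:.4f}")
```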

#### **3. Results**

None of the D, P and S gait characteristics responded to the conditions of a single ("acute") DI session (Table 2); however, in each of the five examined subjects with PD, at least several parameters, usually different ones, were positively modified after a 45 min session of DI. Furthermore, in at least 2–4 of 5 measurements, the gait parameters changed to better values. For example, in Subject 1, after a single session of DI, D1 (the duration of the entire TUG test) decreased by 5 s (from 30 to 25 s), and the subject's cadence increased from 88 to 100 steps/min. After another DI session with this subject, the values of the P1 parameter (heel strike) increased from 0.55 to 0.77 m/s², P2 (swing phase) increased from 0.50 to 0.82 m/s², and P3 increased from 8–9 to 10–11 m/s² (Figure 3), which provides insight into the increased variability of step length after DI and the stronger strike of the heel on the floor. In Subject 2, only the value of D4 decreased; similarly, in Subject 3, only the value of S5 decreased (Figure 3). In addition, in all five cases of DI, the distribution of P2 changed from a unimodal type to a more bimodal one (see Figure 3). The individual data for all measurements are presented in Supplementary Material Table S1.


**Figure 3.** Individual plots for five separate DI sessions in three subjects. For details, see Figure 2: (**a**) Subject 1 before and (**b**) after the 1st DI; (**c**) Subject 1 before and (**d**) after the 7th DI; (**e**) Subject 2 before and (**f**) after the 5th DI; (**g**) Subject 2 before and (**h**) after the 4th DI; (**i**) Subject 3 before and (**j**) after the 7th DI.

Unlike a single DI session, a course of DI sessions exerted a significant influence on a few gait parameters, namely, D4 and S5 (see Table 2), which means that subjects with PD performed sitting down on a chair with turning (D4 phase) faster, and there were unusually long steps after a course of DI sessions. The individual plots of stepping are presented in Figure 4.


**Table 2.** Gait characteristics before and after a single DI session and after a course of DI sessions.

*p*—probability of difference in Wilcoxon test between "before" and "after" conditions. The meaning of D-, S- and P-parameters can be found in the text.


**Figure 4.** Individual plots for six courses of DI sessions in five subjects. For details, see Figure 2: (**a**) Subject 1 before and (**b**) after the 1st course of DI; (**c**) Subject 1 before and (**d**) after the 2nd course of DI; (**e**) Subject 2 before and (**f**) after the course of DI; (**g**) Subject 3 before and (**h**) after the course of DI; (**i**) Subject 4 before and (**j**) after the course of DI; (**k**) Subject 5 before and (**l**) after the course of DI.

#### **4. Discussion**

The purpose of this study was (1) to test the reliability of an assessment of stepping characteristics with an up-sampled IMU-based accelerometer signal and gyroscope of a smartphone when placed on the subject's head, and (2) with the help of the acceleration signal, to study the effect of a single DI session and a program of DI sessions on stepping in subjects with PD during a long version of the TUG test.

Many studies have demonstrated sufficient reliability of iTUG test technologies based on a smartphone's IMU to recognize the phases of the TUG test [24,30–32]. It has been concluded that the iTUG test is suitable for a self-administered TUG test [31,33], has good agreement with 3D motion video capture analysis [34], and is superior to stopwatch measurement [35]. As such, the recognition of sub-phases during the instrumented TUG test, either in its classic or extended versions, with smartphone accelerometers is not necessarily novel; however, most of these studies were conducted with the classic 3 m TUG test, with a smartphone attached to a belt and with a 100 Hz sampling rate.

Instead, in the present study, we focused on (1) the gait analysis during the Gait-Go and Gait-Come phases with the help of (2) the extended 13 m TUG test, and (3) with a smartphone fixed to the head. It has been found that the 13 m TUG test returns information about 16–21 steps in each direction (28–36 steps, altogether), of which 15–17 steps in the middle of the Gait-Go and Gait-Come phases of the TUG test were considered to be functionally uniform (straight walk). This number of steps is reliable, as data collected from 10–20 strides (20 steps) were reported to be sufficient for the reliable characterization of the gait speed and cadence of stepping [28]; however, the reliable evaluation of the variability of stepping requires much more data (hundreds of strides) [28]. In addition, we considered that the pre-processing of the time series with an up-sampling procedure (from 100 up to 1000 Hz) allowed us to increase the accuracy of the capture of stepping events (heel strike and swing phase). Moreover, we paid much attention to the step variability and evolution over time. For example, we introduced a special new parameter that measures the tendency to "freeze of gait"—i.e., unusually long steps, which can provide insights into the difficulties of performing a step.

From a technical point of view, the obtained parameters and graphical presentation of the TUG test can be regarded as reliable and demonstrative, as it allows for the tracing of individual features of the subject's gait and the recognition of graphical patterns of the subjects by eye. Furthermore, the position of the smartphone on the head can be regarded as a reliable site for the collection of information about a human's gait. This allows for a reduction in the number of IMUs to one that is placed on the head.

We found that a single ("acute") 45 min DI session exerted no effect on the studied parameters of gait across the entire group of subjects with PD, which is contrary to our original hypothesis. On the other hand, in each subject, an individual set of gait characteristics still changed. For example, in Subject 1 (see Figure 3), the entire duration of the TUG test (D1) decreased by 5 s (or by 15%), and the time to perform the U-turn (D3 phase) decreased by 0.2–0.3 s, while the cadence (S3) increased by 3–12 steps/min. Furthermore, P1 (standard deviation of heel strike acceleration) and P2 (standard deviation of acceleration minima during the swing phase) increased by 0.2–0.3 m/s², and P3 (vertical acceleration range) increased by 2–3 m/s². As a whole, these changes suggest that after a single DI session, Subject 1 walked faster, performed faster turns and stepped more firmly on the floor. All these modifications can be regarded as positive. In Subjects 2 and 3, the effect of DI conditions was negligible, probably due to the relatively good initial values of their gait parameters; for example, the cadence of Subject 3 was 145 steps/min, compared with 90–112 steps/min in Subjects 1 and 2. In addition, Subject 1 did not take anti-PD medicine, which means that before the DI session he was in the "off-medication" condition. As a result, the effect of DI was not interfered with by anti-PD therapy.

The effect of a program of seven DI sessions was somewhat more pronounced. At a minimum, the D4 and S5 parameters became significantly larger after a course of DI, and P2 values also increased after a program of DI. The reaction to DI conditions was individually significant. Again, Subject 1 presented the most notable improvement in D1 (by 5–6 s, or by 20%) and in almost all other parameters. Subjects 2, 4 and 5 demonstrated moderate improvement of only some parameters, and Subject 3 demonstrated a notable improvement of gait.

The Internet of Things (IoT) comprises interconnected devices, machines, and servers with data storage that function through a network [36,37]. A smartphone can be considered an ideal measuring device for further instrumentation and incorporation into IoT-enabled systems since it is already part of the Internet.

A smartphone is always "at hand" (in the pocket), it is lightweight and cost-effective, and it is already pre-set for data transfer to cloud-based storage [24]. Smartphones are already efficiently used to detect and monitor PD symptoms, e.g., with reaction time tests, tapping tests, and voice (speech), posture and gait tests [38], and there are smartphone applications that are available for self-testing [39]. In a sense, smartphones have undergone a kind of "instrumentalization" compared to regular consumer devices, for example, a treadmill [40]. Altogether, this makes smartphones a relevant candidate for the implementation of diagnostic and monitoring applications in PD. Smartphones are very suitable because, unlike wearable sensors, they do not need additional resources, as they are already a part of the Internet. In addition, smartphones are capable of performing online calculations.

Data collected on the gait of PD subjects with wearable accelerometers are suitable for Artificial Intelligence (AI) or IoT decisional support [36,41]. AI-based wearable gait monitoring is already used for optimization of Parkinson's disease management [41]. We suggest that smartphone applications based on AI can be applied to monitor gait characteristics in PD subjects. Among the varied learning methods, deep learning may provide higher accuracy in PD assessment than machine learning [42].

The TUG20 test accelerometer signals have a repetitive structure and contain gait features. There are two ways in which AI methods could be applied. The first is to collect a database of signals and split it into two parts: the training and test sets. To increase the adequacy of the model, this approach might be applied after investigation of more than 100 PD cases, which is difficult in real life. The second way is to investigate the gait features and to understand which features are significant, i.e., to exclude insignificant features and thus decrease the dimension of the model, and then apply these data to clustering. This approach requires fewer studied cases, and we would prefer to follow it in the future.

The major limitation of this study was the insufficient number of study subjects and measurements, which did not allow for a more precise analysis of data to be conducted. Furthermore, control groups (young and older healthy subjects) were not formed. In future studies, we propose that more measurements should be conducted in control groups and subjects with PD, both under "dry" immersion conditions and non-DI conditions.

#### **5. Conclusions**

In conclusion, the data from smartphone-based IMU accelerometers allowed us to compute gait characteristics that are conventionally used in the field of locomotion physiology, such as step duration and cadence. Like other IMU-based analyzing systems, the presented method allowed for the recognition of the phases of the TUG test. The application of an extended version of the TUG test provided a sufficient number of steps to characterize gait, and it allowed for the visualization of the duration of individual steps during the process of locomotion. Furthermore, the presented method appears to be suitable for a fast visual evaluation of stepping patterns in PD subjects. Of note, some of the specific characteristics of Parkinsonism events were recognized with the IMU sensors—for example, unusually long steps, which were produced while walking.

For the entire group, the conditions of a single 45 min "dry" immersion affected none of the studied gait parameters derived with the help of smartphone-based IMU sensors; however, in one subject there was a clear increase in cadence, gait and turning speed. After a course of repeated DI sessions, some characteristics of the TUG test were improved; however, gait speed did not significantly change.

The presented method of gait analysis appears to be suitable for further instrumentation because a smartphone is perfectly suited for association in IoT-based networks.

**Supplementary Materials:** The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/10.3390/s22207915/s1, Table S1: Individual values of gait characteristics at different study conditions.

**Author Contributions:** Conceptualization, A.Y.M. and L.I.G.-M.; methodology, A.Y.M. and A.P.M.; software, S.A.R., A.V.S. and A.P.M.; investigation, A.Y.M. and L.I.G.-M.; writing—original draft preparation, A.Y.M. and A.P.M.; writing, review and editing, A.Y.M., L.I.G.-M., A.P.M., S.A.R., A.V.S.; visualization, S.A.R.; supervision, A.Y.M. and A.P.M.; project administration, A.Y.M.; funding acquisition, A.Y.M. All authors have read and agreed to the published version of the manuscript.

**Funding:** The research was financially supported by the Ministry of Science and Higher Education of the Russian Federation (theme No. 0752-2020-0007, to AM).

**Institutional Review Board Statement:** The study was conducted in accordance with the Declaration of Helsinki, and approved by the Institutional Review Board (or Ethics Committee) of Ministry of health care of the Republic of Karelia and Petrozavodsk State University (Statement of approval No. 31, 18 December 2014).

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Data Availability Statement:** The datasets generated for this study are available on request to the corresponding author.

**Acknowledgments:** The authors thank the subjects for their participation and engineer Kirill Prokchorov for assisting with conducting measurements.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Predicting Chemical Carcinogens Using a Hybrid Neural Network Deep Learning Method**

**Sarita Limbu and Sivanesan Dakshanamurthy \***

Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, Washington, DC 20057, USA **\*** Correspondence: sd233@georgetown.edu

**Abstract:** Determining environmental chemical carcinogenicity is urgently needed as humans are increasingly exposed to these chemicals. In this study, we developed a hybrid neural network (HNN) method called HNN-Cancer to predict potential carcinogens among real-life chemicals. HNN-Cancer included a new SMILES feature representation method, obtained by modifying our previous 3D array representation of 1D SMILES simulated by the convolutional neural network (CNN). We developed binary classification, multiclass classification, and regression models based on diverse non-congeneric chemicals. Along with the HNN-Cancer model, we developed models based on the random forest (RF), bootstrap aggregating (Bagging), and adaptive boosting (AdaBoost) methods for binary and multiclass classification. We developed regression models using HNN-Cancer, RF, support vector regressor (SVR), gradient boosting (GB), kernel ridge (KR), decision tree with AdaBoost (DT), KNeighbors (KN), and a consensus method. The performance of the models for all classifications was assessed using various statistical metrics. The accuracy of the HNN-Cancer, RF, and Bagging models was 74%, and their AUC was ~0.81 for binary classification models developed with 7994 chemicals. The sensitivity was 79.5% and the specificity was 67.3% for HNN-Cancer, which outperformed the other methods. In the case of multiclass classification models with 1618 chemicals, we obtained an optimal accuracy of 70% with an AUC of 0.7 for HNN-Cancer, RF, Bagging, and AdaBoost. In the case of regression models, the correlation coefficient (R) was around 0.62 for HNN-Cancer and RF, higher than for the SVM, GB, KR, DTBoost, and NN machine learning methods. Overall, HNN-Cancer performed better for the majority of the known carcinogen experimental datasets. Further, the predictive performance of HNN-Cancer on diverse chemicals is comparable to the literature-reported models that included similar and less diverse molecules. Our HNN-Cancer could be used in identifying potentially carcinogenic chemicals for a wide variety of chemical classes.

**Keywords:** chemical carcinogens; machine learning; deep learning neural network; hybrid neural network; convolution neural network; fast forward neural network

#### **1. Introduction**

Substances capable of causing cancer are known as carcinogens. Carcinogenicity is a primary concern among all the toxicological endpoints due to the severity of its outcome. Carcinogens may be genotoxic, which induces DNA damage and cancer, or non-genotoxic, which uses other modes of action, such as tumor promotion, to exhibit their carcinogenic potential in humans [1]. Some of the genotoxic carcinogens are mutagens too. Many environmental chemicals have been identified as carcinogenic to humans [2,3]. The onset of cancer in humans depends on various factors, including the dose and duration of exposure to carcinogens. Identifying carcinogenic compounds is also an integral step during the drug development process. The two-year rodent carcinogenicity assay has been established as the standard to determine chemical carcinogenicity [4]. However, such animal testing is time-consuming, costly, and unethical. The experimentalists need to replace, reduce, and refine (3Rs) the use of animals as this 3Rs policy encourages alternative methods to minimize the unprincipled use of animals [5].

**Citation:** Limbu, S.; Dakshanamurthy, S. Predicting Chemical Carcinogens Using a Hybrid Neural Network Deep Learning Method. *Sensors* **2022**, *22*, 8185. https://doi.org/10.3390/s22218185

Academic Editors: Dmitry Korzun, Andrei Velichko and Alexander Meigal

Received: 13 September 2022 Accepted: 23 October 2022 Published: 26 October 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Computational methods for the prediction of various toxicological endpoints have now become a popular alternative to traditional animal testing. Numerous computational models using machine learning (ML) methods have been developed to predict carcinogenicity based on the properties of chemicals. Computational models can be classification models (qualitative) that predict whether a chemical is carcinogenic or non-carcinogenic (binary classification models) or that predict the degree of carcinogenicity (multiclass classification), and regression models (quantitative) that predict the dose of a chemical required for carcinogenesis. Computational models based on structurally related congeneric chemicals are reported to have high predictive performance. Luan et al. reported an accuracy of 95.2% while predicting the carcinogenicity of N-nitroso compounds based on the support vector machine (SVM) method [6]. Ovidiu et al. presented an SVM-based model to predict the carcinogenicity of polycyclic aromatic hydrocarbons (PAH) with 87% accuracy [7]. Computational models based on non-congeneric chemicals are of interest due to their predictive ability for diverse chemicals. Fjodorova et al. predicted the carcinogenicity of non-congeneric chemicals with 68% accuracy using a counter propagation artificial neural network (CP ANN) [8]. Tanabe et al. reported an accuracy of 70% for non-congeneric chemicals based on SVM and improved the accuracy to 80% by developing models on the chemical subgroups based on their structure [9]. Zhang et al. presented binary classification models based on an ensemble of the extreme gradient boosting (XGBoost) method that predicted the carcinogenicity of chemicals with 70% accuracy [10]. Li et al. used six different ML methods to generate the binary classification model with 83.91% accuracy and ternary (multiclass) classification models with 80.46% accuracy for the external validation set for the best model [11]. Toma et al. developed binary classification models with an accuracy of 76% and 74% and regression models with r² of 0.57 and 0.65 on oral and inhalation slope factors to predict carcinogenicity for the external validation set [12]. Fjodorova et al. reported a correlation coefficient of 0.46 for the test set for their regression models using a counter propagation artificial neural network (CP ANN) [8]. Wang et al. constructed a deep learning model that requires fewer data and achieved 85% accuracy on the external validation set for carcinogenicity prediction [13].

Taken together, numerous carcinogenicity predictive models on congeneric and non-congeneric chemicals for binary classification and a few multiclass and regression models have been reported [6–17]. However, there is a need for more non-congeneric computational models with a broad applicability domain for carcinogenicity prediction. In this study, to predict potential carcinogens, we developed a hybrid neural network method called HNN-Cancer. Based on diverse non-congeneric chemicals, we developed binary classification, multiclass classification, and regression models using HNN-Cancer and other machine learning methods. We used the binary classification model to predict whether a chemical is carcinogenic or non-carcinogenic, the multiclass classification model to predict the severity of the chemical carcinogenicity, and the regression model to predict the median toxic dose.

#### **2. Materials & Methods**

#### *2.1. Datasets*

We have collected carcinogens from several different data sources detailed below.


listed in the Technical Guide 230 (TG230): "Environmental Health Risk Assessment and Chemical Exposure Guidelines for Deployed Military Personnel" [19], which provides military exposure guidelines (MEGs).

	- a. CPDB\_CPE (CPDB CarcinoPred-EL) data: CPDB data for rat carcinogenicity were collected from the CarcinoPred-EL developed by Zhang et al. [10]. The list contains 494 carcinogenic and 509 non-carcinogenic chemicals.
	- b. CPDB data: CPDB [24] data were collected and processed to obtain the median toxicity dose (TD50) for rat carcinogenicity. TD50 is the dose-rate in mg/kg body wt/day administered throughout life that induces cancer in half of the test animals. A total of 561 carcinogenic chemicals was obtained with TD50 values for rat carcinogenicity. A total of 605 noncarcinogenic chemicals was obtained for rat carcinogenicity. For 543 carcinogenic chemicals out of 561, the TD50 values in mmol/kg body wt/day were also obtained from the DSSTox database (https://www.epa.gov/chemical-research/distributedstructure-searchable-toxicity-dsstox-database; accessed on 30 September 2017).

#### 2.1.1. Dataset I: Binary Classification Data

The two classes considered in the binary classification models were class 0 (noncarcinogen) and class 1 (carcinogen). Datasets used to train the models are listed below:

i. For binary classification of chemicals to predict the carcinogenic or non-carcinogenic category, 448 carcinogenic chemicals were obtained from data sources 1 to 6 above. Data 1 (MEG): The chemicals classified into Groups A, B, and C were considered as carcinogens. Data 2 (TG230): The chemicals listed as carcinogens were considered as carcinogens. Data 3 (NTP): The chemicals classified as either "reasonably anticipated to be a human carcinogen" or "known to be human carcinogens" were considered as carcinogens. Data 4 (IARC): The chemicals classified into Groups 1, 2A, and 2B were considered as carcinogens. Data 5 (JSOH): The chemicals classified into Groups 1, 2A, and 2B were considered as carcinogens. Data 6 (NIOSH): The carcinogenic chemicals listed were considered as carcinogens.


For the binary classification model dataset, we used 7994 chemicals with 4636 carcinogenic and 3358 non-carcinogenic chemicals.

#### 2.1.2. Dataset II: Multiclass Classification Data

The classes considered in the multiclass classification models were class 0 (noncarcinogen), 1 (possibly carcinogen and not classifiable chemicals), and 2 (carcinogen and probably carcinogen). Datasets used to train the models are listed below:


For Dataset II, used for the multiclass classification models, we used a total of 459 chemicals in class 0, 604 chemicals in class 1, and 555 chemicals in class 2.

#### 2.1.3. Dataset III: Regression Data

Regression models were developed to predict the quantitative carcinogenicity or the median toxic dose (TD50) of the chemicals in the form of pTD50 (logarithm of the inverse of TD50). Dataset III for the regression models consisted of 561 TD50 data in mg/kg body wt/day converted to pTD50 from data source 7b. Independently, the regression models were also developed on 543 TD50 data in mmol/kg body wt/day converted to pTD50.
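The conversion from TD50 to pTD50 described above can be sketched as follows. The text specifies only "logarithm of the inverse of TD50"; the base-10 logarithm, the function name `ptd50`, and the sample values are assumptions made for illustration.

```python
import math


def ptd50(td50):
    """Convert a TD50 value to pTD50 = log10(1 / TD50) = -log10(TD50).

    Assumption: the logarithm is base 10. The unit (mg or mmol per kg
    body wt per day) is whatever unit the input TD50 uses.
    """
    if td50 <= 0:
        raise ValueError("TD50 must be positive")
    return -math.log10(td50)


# Hypothetical TD50 values in mg/kg body wt/day.
for td50 in (0.01, 1.0, 250.0):
    print(f"TD50 = {td50:>7} -> pTD50 = {ptd50(td50):.3f}")
```

With this definition a more potent carcinogen (smaller TD50) receives a larger pTD50, so the regression target increases with potency.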

#### *2.2. Descriptors*

The Mordred descriptor calculator [25], which calculates 1613 2D molecular descriptors from SMILES, was used for descriptor calculation. This calculator supports Python 3, which we used to run Mordred locally. The final set of 653 descriptors comprised those with no missing calculated values across all datasets for which descriptors were calculated. These 653 descriptors were used as the final set of input features for the training and test sets of the machine learning models.
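The selection of descriptors with no missing values can be sketched as below. This is a plain-Python illustration that does not call Mordred; the function name, the descriptor names `desc_a` to `desc_c`, the table layout, and the rule for what counts as a missing value are all hypothetical.

```python
import math


def complete_descriptors(table):
    """Return names of descriptors that have a finite value for every row.

    table: list of dicts mapping descriptor name -> value, one dict per
    chemical. Assumption: values that are None, non-numeric, NaN, or
    infinite count as missing.
    """
    def is_valid(value):
        return (isinstance(value, (int, float))
                and not isinstance(value, bool)
                and math.isfinite(value))

    names = list(table[0])
    return [n for n in names if all(is_valid(row.get(n)) for row in table)]


# Hypothetical descriptor table for three chemicals.
rows = [
    {"desc_a": 1.2, "desc_b": None, "desc_c": 3},
    {"desc_a": 0.7, "desc_b": 2.5, "desc_c": 4},
    {"desc_a": 2.0, "desc_b": float("nan"), "desc_c": 5},
]
kept = complete_descriptors(rows)
print(kept)
```

Filtering on the union of all datasets, rather than on each dataset separately, keeps the feature set identical for training and test data.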

#### *2.3. SMILES Preprocessing*

The simplified molecular-input line-entry system (SMILES) uses ASCII strings for the 1D chemical structure representation of a compound and can be converted to its 2-D or 3-D representation. It is one of the key chemical attributes and is used in our deep learning model. Raw text cannot be used directly as input for deep learning models but must be encoded as numbers. The Tokenizer class in Python is used to encode the SMILES string. The SMILES preprocessing method that we used while predicting toxicity [17] created the index for the set of unique characters of SMILES from the training set only. If the training set consists of only two compounds, "C=CC=C" and "O=CC", a dictionary would be created for only three distinct characters in the SMILES of the training set that would map C to 1, = to 2, and O to 3. Then, the vector output for the SMILES characters was one-hot encoded, where the categorical value of each character in the SMILES is converted to a binary vector with only the index set to 1. Thus, C, =, and O are represented by the vectors [1 0 0], [0 1 0], and [0 0 1], respectively. If a new character, such as 'N', which does not exist in the training set, appears in the SMILES of the test set, the character would be skipped. For the string C=CC#N, the SMILES vectorization method would output the following matrix of dimension L × M, where L = 325 is the allowed maximum length of the SMILES string and M is the number of the unique characters in the SMILES of the training set:


Here, in the modified vectorization method, we have created a unique index for 94 characters in the ASCII table. Hence, no character that can appear in a SMILES string, in any format, is left without an index. A total of 94 characters in the ASCII table !, ", #, ... , =, >, ?, @, A, B, C, ... , |, }, ~ represented by decimal numbers 33, 34, 35, ... , 61, 62, 63, 64, 65, 66, 67, ... 124, 125, 126, respectively, made the vocabulary of the possible characters in the SMILES. Each of these 94 ASCII characters was obtained by looping through the numbers 33 through 126 and converting the number to the corresponding character using the Python function chr(). Then, the characters were mapped to indices 1, 2, 3, ... , 29, 30, 31, 32, 33, 34, 35, ... , 92, 93, 94 using the fit\_on\_texts() function of the Tokenizer module to create a dictionary.

Each character in the SMILES is converted to its corresponding index in the dictionary, and a vector is created for the SMILES of each compound. As an example, acrylonitrile-d3 with SMILES string C=CC#N is encoded as [35, 29, 35, 35, 3, 46]. As the SMILES length varies depending on the compound's length and properties, the length of the encoding results also varies. The resulting vector for the SMILES of every input compound is thus padded with 0s or truncated so that they are of uniform length, L. The SMILES for the input compounds are converted to a 2-D matrix of size K x L, where K is the number of input SMILES, and L = 325 is the allowed maximum length of the SMILES string used in the model. Thus, for the string C=CC#N, the current SMILES vectorization method would output the following vector of length 325:

[35, 29, 35, 35, 3, 46, 0, 0, . . . , 0]

Our previous method [17] mapped the SMILES for the K number of chemicals to a one-hot encoded matrix of size KxLxM, where M is the number of the possible characters in the SMILES.
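The index-based vectorization described above can be sketched in plain Python as follows. The sketch builds the character dictionary directly instead of using the Tokenizer module, so it is an equivalent illustration under stated assumptions and not the authors' code: the function names are hypothetical, strings longer than L are truncated at the end, and characters outside the 94-character vocabulary raise an error.

```python
MAX_LEN = 325  # allowed maximum SMILES length L used in the text

# Printable ASCII characters 33..126 mapped to indices 1..94; 0 is padding.
CHAR_TO_INDEX = {chr(code): i for i, code in enumerate(range(33, 127), start=1)}


def vectorize_smiles(smiles, max_len=MAX_LEN):
    """Encode one SMILES string as a fixed-length list of indices.

    Assumptions: strings longer than max_len are truncated at the end,
    and a character outside the vocabulary raises KeyError.
    """
    encoded = [CHAR_TO_INDEX[ch] for ch in smiles[:max_len]]
    return encoded + [0] * (max_len - len(encoded))  # pad with 0s


def vectorize_all(smiles_list, max_len=MAX_LEN):
    """Encode K SMILES strings as a K x L nested list."""
    return [vectorize_smiles(s, max_len) for s in smiles_list]


vec = vectorize_smiles("C=CC#N")
print(vec[:8], len(vec))

matrix = vectorize_all(["C=CC#N", "O=CC"])
print(len(matrix), len(matrix[0]))
```

Because the index of each character is simply its ASCII code minus 32, the encoding of C=CC#N starts with 35, 29, 35, 35, 3, 46, in agreement with the example in the text.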

#### **3. Machine Learning Models**

#### *3.1. Hybrid Neural Network Model*

The hybrid neural network (HNN) model [17] that we developed for chemical toxicity prediction was used here, with modifications to the SMILES vectorization method and to the way in which the vectorized SMILES input is processed by the convolutional neural network (CNN) of the model. The model was developed in Python using the Keras API with TensorFlow as the backend. The model consists of a CNN for deep learning based on the structure attribute (SMILES) and a multilayer perceptron (MLP)-type feed-forward neural network (FFNN) for learning based on descriptors of the chemicals. To vectorize SMILES, each character in the SMILES string is converted to its positional index in the dictionary, as explained in the SMILES preprocessing section. The 2D array of vectorized SMILES strings is the input for the CNN. The embedding layer of Keras is used to convert the index of each character in the SMILES string into a dense vector. The embedding layer takes three arguments as input: input\_dim is the vocabulary size of the characters in the SMILES string, output\_dim is the size of the embedded output for each character, and input\_length is the length of the SMILES string. In the model, we embedded the index of each character in the SMILES into a vector of size 100 by setting output\_dim to 100. The embedding layer converts the input 2D array of size K × L, where K is the number of SMILES and L is the maximum length of SMILES, to a 3D array of size K × L × 100.
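The shape change performed by the embedding layer can be illustrated with a plain-Python lookup table. This sketch does not use Keras: the function names are hypothetical, the table is filled with fixed random numbers where a real layer would hold trainable weights, and one extra row is reserved for the padding index 0.

```python
import random


def make_embedding(vocab_size, output_dim, seed=0):
    """Create a lookup table with one row per index, including padding index 0."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(output_dim)]
            for _ in range(vocab_size + 1)]


def embed(batch, table):
    """Map a K x L batch of indices to a K x L x output_dim nested list."""
    return [[table[idx] for idx in row] for row in batch]


table = make_embedding(vocab_size=94, output_dim=100)
batch = [
    [35, 29, 35, 35, 3, 46] + [0] * 319,  # C=CC#N, padded to L = 325
    [47, 29, 35, 35] + [0] * 321,         # O=CC, padded to L = 325
]
out = embed(batch, table)
print(len(out), len(out[0]), len(out[0][0]))
```

Every occurrence of the same character index receives the same 100-dimensional vector, which is what allows the subsequent convolution layer to learn patterns over character sequences.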

The 1D convolution layer uses the ReLU activation function, represented mathematically as max(0, x), which replaces all negative values with zeros. The derivative of ReLU is always 1 for positive input, which counteracts the vanishing gradient problem during backpropagation. The output of the pooling layer of the CNN, together with the output of the FFNN, is connected to the final fully connected layer to perform the classification task.
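As a brief illustration of the two properties of ReLU mentioned above, the following sketch evaluates the function and its derivative on a few sample values. Setting the derivative to 0 at x = 0 is a common convention and is assumed here.

```python
def relu(x):
    """ReLU activation: max(0, x)."""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive input, 0 otherwise (0 at x = 0 by convention)."""
    return 1.0 if x > 0 else 0.0


values = [-2.0, -0.5, 0.0, 1.5, 3.0]
activated = [relu(v) for v in values]
grads = [relu_grad(v) for v in values]
print(activated)  # [0.0, 0.0, 0.0, 1.5, 3.0]
print(grads)      # [0.0, 0.0, 0.0, 1.0, 1.0]
```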

#### *3.2. Other Machine Learning Algorithms*

To benchmark the performance of HNN-Cancer for binary and multiclass classification, three other machine learning algorithms were used: random forest (RF), bootstrap aggregating (Bagging) using bagged decision trees, and adaptive boosting (AdaBoost).

Random forest (RF): A bootstrap aggregating (bagging) model that uses ensemble decision trees to make final decisions. This algorithm uses only a subset of features to find the best feature to separate classes at each node of the tree. The regression model fits every feature, and the data are split at several points. The feature with the least error is selected as the node.

Bagged decision tree (Bagging): Bagging uses a bootstrap method to reduce variance and overfitting. It uses the ensemble method for the final decision. The bagging method uses all features to find the best feature for the splitting node of the tree.

Adaptive boosting (AdaBoost): AdaBoost is an ensemble machine learning method that uses weak classifiers to make stronger classifiers.

Support vector regressor (SVR): SVR depends on a subset of the training data. SVR performs non-linear regression using the kernel trick, which transforms inputs into an m-dimensional feature space.

Gradient boosting (GB): GB produces an ensemble of weak prediction models or regression trees in a stage-wise fashion. Each stage optimizes a loss function by choosing the function that points in the negative gradient direction.

Kernel ridge (KR): Ridge regression uses L2 regularization to limit the size of the coefficients of the model and eliminates the problem in the least square regression. The ridge method adds a penalty to the coefficients equal to the square of the magnitude of coefficients. Regularization parameter λ controls the penalty term. Kernel ridge uses kernel tricks to make the model non-linear.

Decision tree with AdaBoost (DT): The prediction of the decision tree was boosted with AdaBoost. The decision tree method predicts by learning decision rules from the training data. AdaBoost is a boosting algorithm introduced by Freund and Schapire [26]. AdaBoost makes final predictions from weighted voting of the individual predictions from weak learners. The implementation used here follows the AdaBoost.R2 algorithm [27].

KNeighbors (KN): Nearest neighbors find k number of training data closest to the test data for which prediction is made. Each closest neighbor contributes equally while making a prediction (default parameter).
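Of the methods listed above, the nearest-neighbor regressor is the simplest to write out. The sketch below shows uniform weighting, in which each of the k closest training points contributes equally to the prediction. It is a plain-Python illustration with made-up data, not the sklearn implementation used in the study.

```python
def knn_predict(train_x, train_y, query, k=3):
    """Predict by averaging the targets of the k nearest training points.

    Uses Euclidean distance and uniform weights (each neighbor counts equally).
    """
    distances = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)) ** 0.5, y)
        for x, y in zip(train_x, train_y)
    )
    nearest = distances[:k]
    return sum(y for _, y in nearest) / k


# Illustrative one-dimensional data (not from the study).
train_x = [[0.0], [1.0], [2.0], [10.0]]
train_y = [1.0, 2.0, 3.0, 20.0]
prediction = knn_predict(train_x, train_y, query=[1.2], k=3)
print(prediction)  # 2.0
```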

#### *3.3. Model Evaluation*

All statistical metrics presented for model evaluation are averages over 10 repeats (for the binary classification and regression models) or 30 repeats (for the multiclass classification models). In each iteration, approximately 20% of the data were randomly set aside as the test set, and the remaining data were used as the training set. This procedure is similar to five-fold cross-validation, except that the test sets were randomly selected in each iteration. For binary and multiclass classification, the performance of each model was evaluated based on accuracy and area under the receiver operating characteristic curve (AUC). The classification models were also assessed for sensitivity and specificity. The evaluation scores are calculated as:
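The repeated random 80:20 split described above can be sketched as follows. The function names and the `evaluate` callback are hypothetical placeholders for whatever model fitting and scoring routine is used; the random seed is arbitrary.

```python
import random


def repeated_holdout(n_samples, n_repeats, test_fraction=0.2, seed=0):
    """Yield (train_indices, test_indices) for each repeat.

    The data are reshuffled in every repeat, so test sets may overlap
    between repeats, unlike in strict k-fold cross-validation.
    """
    rng = random.Random(seed)
    n_test = round(n_samples * test_fraction)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield indices[n_test:], indices[:n_test]


def average_metric(n_samples, n_repeats, evaluate):
    """Average a score returned by evaluate(train, test) over all repeats."""
    scores = [evaluate(train, test)
              for train, test in repeated_holdout(n_samples, n_repeats)]
    return sum(scores) / len(scores)


# Dataset I contains 7994 chemicals; binary models use 10 repeats.
splits = list(repeated_holdout(n_samples=7994, n_repeats=10))
train_idx, test_idx = splits[0]
print(len(train_idx), len(test_idx))  # 6395 1599
```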

$$Accuracy = \frac{TP + TN}{TP + TN + FN + FP} \times 100$$

$$Sensitivity \left(TPR\right) = \frac{TP}{TP + FN} \times 100$$

$$Specificity \left(TNR\right) = \frac{TN}{TN + FP} \times 100$$
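The three formulas translate directly into code. In this sketch, TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives; the counts used are made up for illustration and are not results from the study.

```python
def binary_metrics(tp, tn, fp, fn):
    """Return accuracy, sensitivity (TPR) and specificity (TNR) in percent."""
    return {
        "accuracy": 100 * (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": 100 * tp / (tp + fn),
        "specificity": 100 * tn / (tn + fp),
    }


# Illustrative counts only.
metrics = binary_metrics(tp=80, tn=60, fp=40, fn=20)
print(metrics)
# {'accuracy': 70.0, 'sensitivity': 80.0, 'specificity': 60.0}
```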

For this five-fold-style validation, we used an 80:20 training-to-test ratio, which is appropriate for the sizeable datasets used in this study. Further, the data were shuffled in each iteration before separating the training and test sets to avoid ending up with biased training and test sets. The average performance metrics were calculated from the outcome of 10 simulations for the binary classification and regression models and from the outcome of 30 simulations for the multiclass classification models. Training on 80% of the data gives more room for better performance (compared to 10-fold cross-validation with 90% of the data in the training set) when predicting for an external dataset using a model trained on 100% of the data.

In the multiclass classification, micro averaging is used to obtain the average of the metrics of all the classes. Micro averaging involves calculating the average by converting the data in multiple classes to binary classes and giving equal weight to each observation. In multiclass classification with the imbalanced dataset, micro averaging of any metric is preferred when compared to macro averaging, which involves calculating the metrics separately for each class and then averaging them by giving equal weight to each class. In the case of multiclass classification with *n* number of classes,

$$Acc\_{micro} = \frac{(TP\_1 + TP\_2 + \cdots + TP\_n) + (TN\_1 + TN\_2 + \cdots + TN\_n)}{(TP\_1 + \cdots + TP\_n) + (TN\_1 + \cdots + TN\_n) + (FN\_1 + \cdots + FN\_n) + (FP\_1 + \cdots + FP\_n)} \times 100$$

$$Sensitivity\_{micro} = \frac{TP\_1 + TP\_2 + \cdots + TP\_n}{(TP\_1 + TP\_2 + \cdots + TP\_n) + (FN\_1 + FN\_2 + \cdots + FN\_n)} \times 100$$

$$Specificity\_{micro} = \frac{TN\_1 + TN\_2 + \cdots + TN\_n}{(TN\_1 + TN\_2 + \cdots + TN\_n) + (FP\_1 + FP\_2 + \cdots + FP\_n)} \times 100$$

where *TP* = true positive, *TN* = true negative, *FP* = false positive, *FN* = false negative, *TPR* = true positive rate, *TNR* = true negative rate.
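Micro averaging can be sketched by converting a multiclass confusion matrix into one-vs-rest counts for each class and pooling them. The confusion matrix below is made up for illustration, and the function names are hypothetical.

```python
def one_vs_rest_counts(confusion):
    """Return (TP, TN, FP, FN) for each class of a square confusion matrix.

    Rows are true classes, columns are predicted classes.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    counts = []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp
        fp = sum(confusion[r][k] for r in range(n)) - tp
        tn = total - tp - fn - fp
        counts.append((tp, tn, fp, fn))
    return counts


def micro_metrics(confusion):
    """Pool the one-vs-rest counts of all classes and compute the metrics."""
    counts = one_vs_rest_counts(confusion)
    tp = sum(c[0] for c in counts)
    tn = sum(c[1] for c in counts)
    fp = sum(c[2] for c in counts)
    fn = sum(c[3] for c in counts)
    return {
        "accuracy": 100 * (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": 100 * tp / (tp + fn),
        "specificity": 100 * tn / (tn + fp),
    }


# Illustrative 3-class confusion matrix (150 samples, 90 classified correctly).
confusion = [[30, 10, 10],
             [5, 40, 15],
             [10, 10, 20]]
micro = micro_metrics(confusion)
print(micro)  # accuracy ~73.33, sensitivity 60.0, specificity 80.0
```

When every sample has exactly one true and one predicted class, these pooled counts imply that the micro sensitivity equals the overall accuracy a, and that the micro accuracy equals (n − 2 + 2a)/n for n classes. For n = 3, an overall accuracy of 50.58% corresponds to a micro accuracy of (1 + 2 × 0.5058)/3 ≈ 67.05%.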

The performance of each regression model was evaluated based on the coefficient of determination (R2). The coefficient of determination gives the percentage of variation in the dependent variable that is predictable from the independent variable, or that is explained by the independent variable.

$$R^2 = \frac{ESS}{TSS} = \frac{\sum\_{i=1}^n \left(\hat{y}\_i - \overline{y}\right)^2}{\sum\_{i=1}^n \left(y\_i - \overline{y}\right)^2} \tag{1}$$

where *ESS* is the explained sum of squares, and *TSS* is the total sum of squares; *ŷ<sub>i</sub>* is the predicted value of the *i*th dependent variable; *y<sub>i</sub>* is the *i*th observed dependent variable; and *ȳ* is the mean of the observed data.
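Equation (1) can be evaluated as shown below. The data are made up for illustration. Note that many libraries compute R<sup>2</sup> as 1 − RSS/TSS, where RSS is the residual sum of squares; the two definitions coincide for least-squares linear fits with an intercept but can differ for other models.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination as ESS / TSS, following Equation (1)."""
    mean_y = sum(y_true) / len(y_true)
    ess = sum((p - mean_y) ** 2 for p in y_pred)   # explained sum of squares
    tss = sum((y - mean_y) ** 2 for y in y_true)   # total sum of squares
    return ess / tss


# Illustrative values only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 3.0, 3.5]
r2 = r_squared(y_true, y_pred)
print(r2)  # 0.5
```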

#### **4. Results and Discussion**

There is an urgent need to efficiently evaluate potential carcinogenic compounds to which humans are exposed in order to prevent cancer incidence, progression, and high mortality. Several computational and machine learning models have been developed for the prediction of carcinogenic compounds [6–16,28–40]. However, most or all of these models were developed as binary or regression models, not as categorical multiclass classification models or comprehensive classification models. Further, these models are limited to congeneric computational models with a limited applicability domain and small datasets; they lack chemical diversity and were applied to targeted organ systems for carcinogenicity prediction. To fill this gap, we developed HNN-Cancer, a deep learning-based hybrid neural network model, and predicted carcinogenicity on a large scale with a variety of datasets. The HNN-Cancer model combines two neural networks: a CNN for deep learning based on the structure attribute (SMILES) and a multilayer perceptron (MLP)-type feed-forward neural network (FFNN) for learning based on descriptors of the chemicals. We developed binary classification, multiclass classification, and regression models based on diverse non-congeneric chemicals.

The HNN carcinogenicity prediction models were developed based on the hybrid neural network (HNN) architecture we reported previously for toxicity prediction [17]. To compare the HNN prediction performance, we also developed other machine learning models, namely random forest (RF), bootstrap aggregating (Bagging), and adaptive boosting (AdaBoost), for binary classification and multiclass classification. Several regression models were developed based on random forest (RF), support vector regressor (SVR), gradient boosting (GB), kernel ridge (KR), decision tree with AdaBoost (DT), and KNeighbors (KN) using the sklearn package in Python to make the final consensus prediction of the median toxic dose (TD50). The consensus prediction was calculated as the average of all seven predicted values. In the convolutional neural network (CNN) of the HNN models, we used a modified version of the 3D array representation of 1D SMILES from our previous model [17]. The SMILES processing method included a vocabulary of 94 characters in the ASCII table so as not to miss any possible characters of SMILES in any format. Additionally, instead of using one-hot encoding to vectorize the characters in the 1-D SMILES, the embedding layer of the CNN was used.

#### *4.1. Carcinogen Prediction Using Binary Classification*

The binary classification models were developed for Dataset I, comprising 7994 chemicals (4636 carcinogenic and 3358 noncarcinogenic) from 9 different sources. Out of 1613 descriptors calculated by the Mordred descriptor calculator, 653 descriptors with no missing values were used to develop the models. We used the SMILES string in addition to the 653 descriptors in the HNN model. The accuracy, AUC, sensitivity, and specificity of the HNN-Cancer, RF, and Bagging models were comparable, whereas the AdaBoost statistical metrics were significantly lower (Figure 1). The accuracy of the three models was 74%, and their AUC was ~0.81. The sensitivity and specificity of the HNN model were 79.47% and 67.3%, respectively.

**Figure 1.** (**A**) accuracy, (**B**) AUC, (**C**) sensitivity, and (**D**) specificity for the dataset I as given by the binary classification models developed based on the HNN, RF, Bagging, and AdaBoost methods.

Zhang et al. [10] built several machine learning models on the CPDB's 1003 carcinogenic data on rats. The best results they reported for five-fold cross-validation were an accuracy of 70.1% and an AUC of 0.765. Wang et al. [13] developed a deep learning tool, CapsCarcino, on the 1003 rat data from CPDB used by Zhang et al. For five-fold cross-validation, they reported an accuracy of 74.5%, a sensitivity of 75%, and a specificity of 74.2%. Li et al. developed 30 models on only 829 rat data from CPDB, with the highest accuracy of 89.29% on their test set. Tanabe et al. [9] developed an SVM model with an accuracy of 68.8% and an AUC of 0.683 for non-congeneric chemicals from six sources using dual cross-validation. They improved the accuracy by developing models on congeneric subgroups. Notably, these studies indicate that models developed on more diverse chemicals tend to have reduced accuracy. In contrast, the predictive performance of our HNN-Cancer models, which are based on a highly diverse set of chemicals, remains good compared to the previously reported models and is accompanied by a high AUC. Hence, we expect that HNN-Cancer will rapidly make optimal carcinogen predictions for a wider variety of chemicals.

#### *4.2. Carcinogen Prediction Using Multiclass Classification*

The multiclass classification models were developed for Dataset II, containing 1618 chemicals with 459 chemicals in class 0, 604 in class 1, and 555 in class 2. Here, class 0 comprises non-carcinogens, class 1 comprises possible carcinogens and not classifiable chemicals, and class 2 comprises carcinogens and probable carcinogens. The overall accuracy is 50.58%, 54.73%, 55.52%, and 46.50%, the micro accuracy is 67.05%, 69.82%, 70.34%, and 64.33%, and the average micro AUC is 0.68, 0.724, 0.725, and 0.653 for HNN-Cancer, RF, Bagging, and AdaBoost, respectively (Figure 2). As observed by Limbu et al. [17], the HNN-Cancer model does not perform better than the RF and Bagging methods in the multiclass case. This is because the deep learning method performs best with a large dataset, and the datasets used in these two studies are not sufficiently large.

**Figure 2.** (**A**) Overall accuracy, (**B**) micro accuracy, (**C**) micro AUC, (**D**) micro sensitivity, and (**E**) micro specificity for the dataset II as given by the multiclass classification models developed based on HNN, RF, Bagging, and AdaBoost methods.

Li et al. developed 30 multiclass (ternary) classification models that categorized compounds into carcinogenic I (strongly carcinogenic), carcinogenic II (weakly carcinogenic), and non-carcinogens [11]. Their kNN model based on the MACCS fingerprint, which had the best predictive performance, achieved a micro accuracy of 81.89%. The ternary classification of their data was based on the TD50 values, where compounds with TD50 ≤ 10 mg/kg/day were carcinogenic I and those with TD50 > 10 mg/kg/day were carcinogenic II. In contrast, the classification of data in our models is based on their category: chemicals are class 2 if they are carcinogenic or probably carcinogenic, class 1 if they are possibly carcinogenic or not classifiable, and class 0 if they are non-carcinogenic. All the data from CPDB with TD50 values were classified as class 2, and non-carcinogens were classified as class 0; none of them were classified as class 1. Our models therefore provide complete classification range coverage when predicting chemical carcinogenicity.

#### *4.3. Carcinogenicity Prediction Using Regression*

Regression models were developed for Dataset III, comprising 561 chemicals with TD50 data. The models predicted carcinogenicity in the form of pTD50 (logarithm of the inverse of TD50), and the average of all seven predicted values was calculated as the final consensus prediction of the pTD50 value. The R<sup>2</sup> is 0.35, 0.36, 0.04, 0.33, 0.36, 0.39, and 0.21 for the HNN-Cancer, RF, SVM, GB, KR, DTBoost, and NN methods, respectively (Figure 3). The overall R<sup>2</sup> was slightly increased to 0.40 by the consensus prediction. The correlation coefficient (R) is 0.628, 0.611, 0.322, 0.588, 0.614, 0.636, 0.527, and 0.649 for the HNN, RF, SVM, GB, KR, DTBoost, NN, and consensus methods, respectively (Figure 4). The models were also developed for 543 TD50 data in mmol/kg body wt/day. The correlation coefficient (R) is 0.604, 0.601, 0.287, 0.577, 0.545, 0.617, 0.497, and 0.629 for the HNN-Cancer, RF, SVM, GB, KR, DTBoost, NN, and consensus methods, respectively (Figure 5).
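The pTD50 transformation and the consensus step can be sketched as follows. The base of the logarithm is not stated in the text, so base 10 is assumed here, and the seven per-model predictions are placeholder values rather than results from the study.

```python
import math


def ptd50(td50):
    """Logarithm of the inverse of TD50 (base 10 is an assumption)."""
    return math.log10(1.0 / td50)


def consensus(predictions):
    """Consensus prediction: the plain average of the per-model predictions."""
    values = list(predictions.values())
    return sum(values) / len(values)


# Placeholder pTD50 predictions for one chemical from the seven models.
predictions = {
    "HNN-Cancer": 1.2, "RF": 1.0, "SVR": 0.4, "GB": 1.1,
    "KR": 0.9, "DT": 1.3, "KN": 0.8,
}
consensus_value = consensus(predictions)
print(round(consensus_value, 3))  # 0.957
```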

**Figure 3.** R<sup>2</sup> of regression models developed based on HNN-Cancer, RF, SVM, GB, KR, DTBoost, NN, and consensus methods.

**Figure 4.** Correlation coefficient (R) of regression models developed based on HNN-Cancer, RF, SVM, GB, KR, DTBoost, NN, and consensus methods.

**Figure 5.** Correlation coefficient (R) of regression models developed based on HNN-Cancer, RF, SVM, GB, KR, DTBoost, NN, and consensus methods that predict the carcinogenicity in mmol/kg body wt/day.

Fjodorova et al. [8] developed quantitative models for carcinogenicity prediction on 805 rat data from CPDB using a counter propagation artificial neural network (CP ANN). The correlation coefficient of the models was 0.46 for the test set. Toma et al. [12] developed regression models to predict carcinogenicity for an external validation set, with r<sup>2</sup> of 0.57 and 0.65 for the models using the oral and inhalation slope factors, respectively. In the Toma et al. [12] study, only 315 out of 1110 oral and 263 out of 990 inhalation compounds were included in the final dataset after selecting compounds based on various criteria. The external validation set was randomly chosen from this final dataset of highly similar compounds. This may be the reason for the slightly better coefficient of determination reported by Toma et al. [12] compared to our models. Singh et al. [41] developed regression models based on a generalized regression neural network (GRNN) to predict carcinogenicity in mmol/kg body wt/day for 457 CPDB compounds and reported a correlation coefficient of 0.896. The high value of the correlation coefficient in comparison to our models could be attributed to the nine molecular descriptors selected for the regression models and to the use of the GRNN method. Taken together, our work included multiclass classification models with full classification range coverage for a diverse class of chemicals and provided optimal carcinogen predictive performance over the other methods.

#### **5. Conclusions**

Determining environmental chemical carcinogenicity is an urgent need. Though several machine learning models have been reported, there is a need for more non-congeneric computational models with a vast applicability domain for carcinogenicity prediction. In this study, we determined the carcinogenicity of thousands of real-life exposure chemicals from a wide variety of classes. We developed carcinogen prediction models, HNN-Cancer, based on our hybrid neural network (HNN) architecture. In HNN-Cancer, we included a new SMILES feature representation method. Using HNN-Cancer and other machine learning methods, we predicted carcinogenicity with binary classification, multiclass classification, and regression models for very diverse non-congeneric chemicals. Notably, the binary and multiclass classification models were developed for a larger set of diverse chemicals from diverse sources, most of which are human exposure-relevant chemicals.

The models based on the HNN-Cancer, RF, and Bagging methods predicted the carcinogens with an accuracy of 74% and an AUC of 0.81, which shows that the carcinogen predictions made by these models can be considered optimal. Multiclass classification models were developed to categorize the carcinogenicity of chemicals into one of three classes: non-carcinogens, possible carcinogens/not classifiable chemicals, or carcinogens/probable carcinogens. HNN-Cancer exhibited an accuracy of 50.58%, a micro accuracy of 67.05%, and a micro AUC of 0.68. Further, we developed regression models to predict the median toxic dose of chemicals in the form of pTD50. The consensus prediction, calculated as the average of all the methods, achieved an overall R<sup>2</sup> of 0.40. Although our models included very diverse chemical categories and a larger number of chemicals from different data sources, they were still able to predict binary, categorical (multiclass), and quantitative (regression) carcinogenicity with performance comparable to that of other models reported in the literature, which included fewer and more similar chemicals. Therefore, HNN-Cancer can be used to identify potential carcinogens for any chemical.

Several studies have described the design of IoT-enabled environmental pollution and toxicology monitoring that uses artificial intelligence techniques to improve human health [42–47]. For example, Aisha et al. [42] proposed a neural network model that includes an IoT-based sensor to sense eight pollutants, report the status of air quality in real time through a cloud server, and signal the presence of hazardous pollutant levels in the air. Shukla et al. [46] and Memon et al. [47] applied an IoT-enabled big data pipeline based on artificial neural networks to the identification of breast cancer. Similarly, HNN-Cancer could be integrated into IoT-enabled sensors to signal the presence of carcinogens.

#### **6. Limitations**

The developed hybrid neural network method HNN-Cancer is first in class in providing binary classification, multiclass classification, and regression models based on diverse non-congeneric chemicals. These models would enable the scientific community to classify chemical carcinogenicity at specific doses or dose ranges. However, there are some potential limitations in the prediction of carcinogens. Firstly, there is a lack of a large dose-dependent chronic in vitro and in vivo carcinogen dataset to train the model. Secondly, the HNN-Cancer method needs several rounds of optimization with further refinement. We will further improve the carcinogen predictions of the HNN-Cancer method by including more experimentally determined carcinogenic dose data (in vitro and in vivo) that we obtained recently from the National Toxicology Program (NTP), bioinformatics and toxicology group.

**Author Contributions:** Conceptualization, S.D.; Data curation, S.L.; Formal analysis, S.L.; Funding acquisition, S.D.; Investigation, S.L. and S.D.; Methodology, S.L. and S.D.; Project administration, S.D.; Resources, S.D.; Software, S.L. and S.D.; Supervision, S.D.; Validation, S.L.; Writing—original draft, S.L. and S.D.; Writing—review & editing, S.D. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded in part by the United States Department of Defense (DOD) grant CA140882.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** We acknowledge the support in part by the United States Department of Defense (DOD) grant CA140882, the GUMC Lombardi Comprehensive Cancer Center, and the GUMC Computational Chemistry Shared Resources (CCSR).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Development of an Artificial Neural Network Algorithm Embedded in an On-Site Sensor for Water Level Forecasting**

**Cheng-Han Liu 1, Tsun-Hua Yang 1,\* and Obaja Triputera Wijaya 1,2**


**Abstract:** Extreme weather events cause stream overflow and lead to urban inundation. In this study, a decentralized flood monitoring system is proposed to provide water level predictions in streams three hours ahead. The customized sensor in the system measures the water levels and implements edge computing to produce future water levels. This approach differs markedly from traditional centralized monitoring systems and is considered an innovation in the field. In edge computing, traditional physics-based algorithms are not computationally efficient if microprocessors are used in sensors. A correlation analysis was performed to identify key factors that influence the variations in the water level forecasts. For example, the second-order difference in the water level is considered to represent the acceleration or deceleration of a water level rise. According to different input factors, three artificial neural network (ANN) models were developed. Four streams or canals were selected to test and evaluate the performance of the models. One case was used for model training and testing, and the others were used for model validation. The results demonstrated that the ANN model with the second-order water level difference as an input factor outperformed the other ANN models in terms of RMSE. The customized microprocessor-based sensor with an embedded ANN algorithm can be adopted to improve edge computing capabilities and support emergency response and decision making.

**Keywords:** edge computing; ANN; microprocessor; water level prediction; decentralized

#### **1. Introduction**

The Emergency Event Database (EM-DAT) includes records for 432 disastrous events related to natural hazards worldwide in 2021. Floods dominated these events, with 223 occurrences, with an average of 163 annual flood occurrences recorded in the 2001–2020 period [1]. The losses of life and property caused by floods are tremendous [2]. Structural and nonstructural measures have been devised to prevent or mitigate loss of life and property [3]. The development of early warning systems, which are nonstructural measures, was cited as a critical defense against floods [4]. These systems involve on-site facilities such as bubble gauges, float gauges and pressure sensors, which are installed to observe water level changes; then, these observations are used as indicators to assess the flood potential [5]. Accurate and cost-efficient water level monitoring sensors are required and must be deployed with very high intensity for detailed flood records [6–9]. Nevertheless, these sensors are only used for water level monitoring, and this kind of application provides limited lead time for decision makers to take response measures to mitigate the impact of disasters [7–11].

Edge computing is a distributed computing paradigm in which computations are largely or completely performed at distributed device nodes known as smart devices or edge devices, as opposed to computations in a centralized cloud environment [12]. By implementing edge computing on sensors, issues prevalent in centralized cloud systems, such as latency and network connection dependency, can be avoided [13]. When system failure or other misinformation issues occur during the transmission process, which is common during disasters, simulations cannot be performed, resulting in delays in the emergency response and increasing the possibility of damage. Many studies in fields such as medical care, manufacturing, and fault detection have developed sensors not only for monitoring but also for edge computing in applications that are close in proximity to the data sources, e.g., [13–15]. In comparison with other applications, only a few studies have investigated edge computing in landslide- and flood-related applications [16–18]. The hesitation to apply edge computing in flood warning systems may have the following reasons. First, most of the systems are still built based on traditional frameworks. If a prediction is performed in real time for operational purposes, the observed data from local monitoring devices are transmitted to the remote server through the internet. The server performs simulations and provides results to responders for decision making. This is called a centralized simulation approach. The monitoring devices are only capable of observing water levels without forecast functions. Therefore, there is no opportunity to implement edge computing until new sensors are developed and deployed. Second, to increase the lead time of the response, physically based models are used in the traditional framework for predictions. These systems consider different hydrological and inundation modeling components based on the specific target region, size of basins, available data and resources, and system development approach [19]. These physically based models, which consider detailed hydraulic processes (e.g., solving the Saint-Venant equations), are complex and computationally intensive, and their practical application is limited by the availability of input data for parameterization and the detailed requirements of simulations [20–22]. None of these models are designed to be installed in sensors with limited computing power. In this regard, data-driven machine learning models that focus on the relationship between input factors (e.g., historical flow discharge and precipitation) and outputs (e.g., water level) are recommended as an alternative [23,24].

**Citation:** Liu, C.-H.; Yang, T.-H.; Wijaya, O.T. Development of an Artificial Neural Network Algorithm Embedded in an On-Site Sensor for Water Level Forecasting. *Sensors* **2022**, *22*, 8532. https://doi.org/10.3390/s22218532

Academic Editors: Andrei Velichko, Dmitry Korzun and Alexander Meigal

Received: 9 October 2022 Accepted: 3 November 2022 Published: 5 November 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Related studies on the integration of customized sensors and data-driven models were examined. Customized ultrasonic sensors have been developed [5,10]. In these studies, warnings are issued when the monitored water levels are above a specific threshold, and no extra calculation is performed on the sensors. Other studies used microcontroller-based sensors to collect environmental information and performed calculations in cloud-based neural networks to predict flood disaster conditions [25–27]. These studies confirmed the functionality of ANN models for water level predictions. However, the models were executed on a cloud-based server. Samikwa et al. [28] utilized edge computing for flood prediction and carried it out on a low-power device within an IoT wireless sensor network. Long short-term memory (LSTM), a type of ANN, was applied in that study to produce one-hour-ahead forecasts of water level. Similar to Bande and Shete [25], water level and rainfall were utilized as inputs to train the ANN models to produce forecasts. However, the details of the input selection were not discussed in the study. Al Qundus et al. [29] deployed sensors to collect data such as water levels, temperature and wind speed. A support vector machine (SVM) algorithm was implemented at the nodes (sensors). Only four features (temperature, humidity, wind speed, and water level) were selected to develop the SVM models. However, the details of feature selection were not discussed in the study.

This study proposes a novel decentralized flood warning system with edge computing. It consists of edge computing-enabled wireless water level monitoring sensors and a suitable forecasting algorithm embedded in these sensors. The newly developed sensor shifts applications, data, and computing power (services) away from centralized points to the logical extremes of a network. In other words, the sensor involves applications or general functionalities that are close in proximity to the sources of other processes, thus involving interactions between distributed systems technology and the physical world; this implies that the sensor can perform simulations and predictions directly at flood-prone locations with localized information. Therefore, situational and customized awareness is maintained during flooding, even if an internet connection is unavailable. As a result, the efficiency of the emergency response is increased. In addition, new flood prediction models that can run on sensors with limited computing power are urgently needed. ANN models were developed to determine the water level on the sensor. A detailed analysis of the correlation between input factors and output water levels was conducted to maximize the efficiency of edge computing. Furthermore, special attention was given to extreme events such as typhoons during the development of the proposed system.

#### **2. Study Areas**

This study focused on three rivers and one artificial canal to develop the ANN model and evaluate the performance of the proposed system. The geographical locations of these study areas are shown in Figure 1.

**Figure 1.** The four study areas are the Shimen canal in Taoyuan County (**upper left**), the Yilan River in Yilan County (**upper right**), the Beinan River in Taitung County (**bottom right**), and the Toucian River in Hsinchu County (**bottom left**), Taiwan.

The Yilan River Basin is located in northeastern Taiwan. Its main stream is approximately 24.4 kilometers (km) long, and the basin covers an area of approximately 149.06 square km. The Yilan River Basin was selected first to train and test the ANN model for water level forecasting. This is because the precipitation, river stage, and flow velocity data have been carefully measured by the Water Resources Agency (WRA) and National Center for High-performance Computing (NCHC) since 2012 [30] for the Yilan River basin. In addition, there is no human interference, such as a reservoir upstream of the Yilan River, and there are no human operation-related factors considered among the ANN input factors in this study. Figure 1 (in a clockwise direction) shows the remaining two river basins, the Beinan River and Toucian River Basins, in eastern and western Taiwan and one artificial canal, Shimen Canal, in western Taiwan. The Beinan River is approximately 84 km long and flows through Taitung County to the Pacific Ocean. The Toucian River flows through Hsinchu County for 63 km to the west. Unlike the Yilan River and Beinan River, the Toucian River has an off-stream reservoir upstream, although there are likewise no human-made hydraulic structures upstream along the river itself. These two rivers were selected because there was no human interference, such as reservoirs or gates, found in the rivers. For reference, the proposed system (integration of Raspberry Pi (RPi) sensors and an ANN model) was only implemented on-site in the Shimen Canal because of the limitations of devices and the need for permission to install equipment. Tests performed in the Beinan River and Toucian River were conducted offline.

#### **3. System Development**

In this study, a water level forecasting model is embedded in an RPi-based monitoring device that can provide real-time water level observations and support local calculations. The system structure and data processing flowchart are illustrated in Figure 2. The monitoring devices use ultrasonic waves to measure the water levels, and the observed data and other related information are preprocessed in the RPi platform. Thereafter, the water level predictions at specific locations are performed using the proposed ANN-based water level forecasting models. The details of each component in the structure are described in the following subsections.

**Figure 2.** System data processing flowchart.

#### *3.1. Raspberry Pi Water Level Sensor*

Automatic water level sensors with wireless functions are usually costly. Therefore, low-cost, open-source, and low-energy-consumption sensors are always of interest for environmental monitoring [31]. The proposed sensor is shown in Figure 3.

**Figure 3.** Details of the RPi-based ultrasonic water level sensor (**left**) and the installation of the sensor (**right**).

The RPi is a reliable, low-cost microcomputer (MCU) that was developed in 2006 by the University of Cambridge's Computer Department and has been produced by the Raspberry Pi Foundation since 2012. The RPi 3 Model B+ module, which was released in February 2018 with a 1.4 GHz 64-bit quad-core processor, was used as a platform embedded with an ultrasonic sensor to measure water stages at a local site. The UNIX/Debian-based Raspbian operating system supports the implementation of a Python-based ANN module to forecast water levels with lead times. Many studies have successfully applied ultrasonic waves to measure water levels under severe environmental conditions [32,33]. The ultrasonic sensor used here is the high-performance ultrasonic distance sensor MB7386 HRXL-MaxSonar-WR from MaxBotix, Brainerd, MN, USA. The ultrasonic sensor emits sound waves at a frequency of 42 kHz with a 6 Hz sampling rate and detects the sound waves that bounce back. The distance is estimated from the elapsed time between the emitted and returning sound waves. Theoretically, the sensor is effective up to a maximum range of 10 m and provides temperature compensation and noise cancelation. However, the effective measuring distance varies with the size of the target and the power supply to the sensor (usually within 5 m).
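The time-of-flight principle described above can be illustrated with a short sketch. It is not the sensor's firmware (commercial sensors of this kind typically report a range value directly); the function names, the linear speed-of-sound approximation, and the sensor elevation are assumptions made for illustration only.

```python
def speed_of_sound(temp_c: float) -> float:
    """Approximate speed of sound in dry air (m/s) at temp_c degrees Celsius.

    Assumption: the common linear approximation 331.3 + 0.606 * T.
    """
    return 331.3 + 0.606 * temp_c


def distance_from_echo(elapsed_s: float, temp_c: float = 20.0) -> float:
    """One-way distance (m) to the target from the round-trip echo time (s)."""
    # The pulse travels to the water surface and back, hence the division by 2.
    return speed_of_sound(temp_c) * elapsed_s / 2.0


def water_level(sensor_elevation_m: float, elapsed_s: float, temp_c: float = 20.0) -> float:
    """Water surface elevation (m): sensor elevation minus the measured distance."""
    return sensor_elevation_m - distance_from_echo(elapsed_s, temp_c)


# Hypothetical example: sensor mounted 5 m above the datum, 10 ms round trip.
level = water_level(5.0, 0.01, temp_c=20.0)
```

In this hypothetical case the echo corresponds to a distance of about 1.72 m, so the water surface lies about 3.28 m above the datum.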

#### *3.2. Artificial Neural Network Water Level Forecasting Algorithm*

ANNs are inspired by the human central nervous system. ANNs usually consist of layers such as an input layer, one or more hidden layers, and an output layer. A three-layer ANN data processing flowchart is shown in Figure 4. The input layer comprises a number of nodes (*i* = 1, 2, 3 ... *n*). A node, also called an artificial neuron, connects to other nodes in the hidden layer and has an associated weight (*w*) and threshold (*bias*). When the incoming signals (*X1*, *X2*, *X3* ... *Xn*) are passed to the nodes (*j* = 1, 2, 3 ... *N*) in the next layer, they are multiplied by the weight of the connection. These weights describe the importance of any incoming signal, with larger weights contributing more significantly to the output compared to other signals. The effective signal (*Ej*) to node *j* shown in Equation (1) is the weighted sum of all incoming signals. In the first phase of training, the weights (i.e., *wji*) are set to random values.

$$E_j = \sum_{i=1}^{n} X_i w_{ji} + bias \tag{1}$$

**Figure 4.** A three-layer ANN model and its data processing flowchart.

An activation function is used to transform the effective signal (*Ej*) into an output value to be fed to the next layer or as an output. In this study, hyperbolic tangent (tanh) and exponential linear unit (ELU) functions are used as the activation functions to transfer input signals to hidden and output layers, respectively. The tanh function has an S-shape similar to that of the sigmoid activation function, with a difference in the output range of −1 to 1. The ELU function is also similar to the rectified linear unit (ReLU), with a difference in output value for negative values of input. A three-layer network structure with one input layer, one hidden layer, and one output layer is adopted because of the limited computing power of the RPi 3 Model B+ module in the sensor [27]. No general guidelines exist for specifying the optimal number of nodes required in the hidden layer [34]. The number of nodes in the hidden layers can be estimated using Equation (2), as recommended by Fletcher and Goss [35]. The formula was confirmed by Huang and Foo [36] to provide the optimal network size, resulting in minimum error and a high correlation in the validation data set. There are three output nodes representing the forecasted water levels with lead periods of 1, 2, and 3 h.

$$N = 2n\tag{2}$$

where *N* is the number of nodes in the hidden layer and *n* is the number of incoming signals. The output of the neural network is calculated by propagating an input signal through each layer until the output layer produces its values; this is a so-called feed-forward network. As mentioned above, the weights are initially set to random values and are then adjusted to reduce the difference between the network output and the known output. The procedure repeatedly optimizes the weights until the value of the objective function, the root mean square error (*RMSE*) shown in Equation (3), falls below a certain threshold. In this study, the threshold is 0.01.

$$RMSE = \sqrt{\frac{1}{K} \sum_{i=1}^{K=3} (P_i - O_i)^2} \tag{3}$$

where *Pi* is the predicted water level and *Oi* is the observed water level. *K* is the number of outputs and here refers to the forecasted water levels with three lead times. This learning procedure is called the backpropagation approach and was proposed by Rumelhart et al. [37]. It is a method for training the weights in a multilayer feed-forward neural network structure. A Python module Scikit-learn was applied for the ANN model computation [38].
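Equations (1) to (3) can be illustrated with a minimal feed-forward pass written in plain Python. This is a sketch under stated assumptions, not the Scikit-learn model used in the study: the four input values, the weight range, the random seed, and the ELU parameter (alpha = 1) are hypothetical, and the backpropagation weight update is omitted.

```python
import math
import random


def elu(x: float, alpha: float = 1.0) -> float:
    """Exponential linear unit; alpha = 1 is an assumed default."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def layer(inputs, weights, biases, activation):
    """Apply Equation (1) to every node j, then the activation function."""
    outputs = []
    for w_j, b_j in zip(weights, biases):
        e_j = sum(x_i * w_ji for x_i, w_ji in zip(inputs, w_j)) + b_j
        outputs.append(activation(e_j))
    return outputs


def init_layer(n_out, n_in, rng):
    """Random initial weights (assumed range) and zero biases."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(x, hidden, output):
    """tanh in the hidden layer, ELU in the output layer."""
    h = layer(x, hidden[0], hidden[1], math.tanh)
    return layer(h, output[0], output[1], elu)


def rmse(predicted, observed):
    """Equation (3) over the K outputs."""
    k = len(observed)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / k)


rng = random.Random(0)
n = 4                            # hypothetical number of input factors
N = 2 * n                        # Equation (2): hidden-layer size
hidden = init_layer(N, n, rng)
output = init_layer(3, N, rng)   # three outputs: lead times of 1, 2, and 3 h
forecast = forward([0.1, 0.2, 0.3, 0.4], hidden, output)
error = rmse(forecast, [0.2, 0.3, 0.4])   # hypothetical observed levels
```

In a full training loop, the weights would be updated by backpropagation until `error` falls below the chosen threshold.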

Since the output of the developed ANN model is the water level, the selection of the input factors must be based on the characteristics and shifts of the outputs at a given location. The change in the time series water level is not only dependent on rainfall records in upstream watersheds and river inflows but also related to previous water levels based on river discharge conditions. Moreover, the length of the lag phase of each input factor is influenced by the distance between input and output locations, and it determines the length of the input sequence. In fact, the selection of input factors has a large impact on the accuracy and efficiency of ANN models. There is no universal way to select the input factors for an ANN model [39]. Thus, parameter selection for edge computing is important to maximize the computing efficiency of ANNs. The efficiency of calculations must be optimized for the appropriate number of input factors. In this study, a cross-correlation analysis (*cc*<sup>2</sup>), shown in Equation (4), one of the most widely used methods for factor selection, as discussed by Babel and Shinde [39], was carried out between the outputs (forecasted water levels at a given location) and input factors with a lag phase length.

$$cc^2 = \frac{\sum_{i=1}^{n-k} \left( X_i - \overline{X} \right) \left( Y_{i+k} - \overline{Y} \right)}{\sqrt{\sum_{i=1}^{n} \left( X_i - \overline{X} \right)^2} \sqrt{\sum_{i=1}^{n} \left( Y_i - \overline{Y} \right)^2}} \tag{4}$$

where *n* is the total number of time sequences in hours and *k* represents the time lag value. *X* and *Y* are the water level and a possible input factor, respectively, and $\overline{X}$ and $\overline{Y}$ denote their average values. Only the factors with a relatively high correlation value with the output were adopted in the proposed ANN models. In addition to the original inputs, two extra input factors were included in the cross-correlation analysis: the water level variation (*Wr*) and the frequency of water level change (*Wf*) for consecutive records in a time interval (e.g., 1 h). The definitions of *Wr* and *Wf* are shown below:

$$W_{r,i} = X_{\text{obs},i} - X_{\text{obs},i-1} \tag{5}$$

$$W_{f,i} = W_{r,i} - W_{r,i-1} \tag{6}$$

Mathematically, *Wr* and *Wf* represent the first- and second-order differences of the water level sequence at a target location, respectively. Physically, *Wr* and *Wf* are the velocity and acceleration of the water level change, respectively. Zhong et al. [40] found that considering the first- and second-order differences in the water level sequence can improve the forecasting accuracy of ANN models. However, they used this information to identify the level of data fluctuations and then applied the Kalman filter algorithm for local optimization. Details of the selected factors were not discussed in their study. In contrast, this study is the first attempt to apply these variables as input factors for the proposed ANN models to perform water level forecasting. Details of the analysis are described in the Results and Discussion section.
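The lagged correlation of Equation (4) and the differences of Equations (5) and (6) can be sketched as follows. The function names and the toy rainfall and water level series are hypothetical; the code follows the index convention of Equation (4) as written, pairing the first series at time *i* with the second at time *i* + *k*.

```python
import math


def first_difference(x):
    """W_r: difference between consecutive records, Equation (5)."""
    return [x[i] - x[i - 1] for i in range(1, len(x))]


def second_difference(x):
    """W_f: difference of consecutive W_r values, Equation (6)."""
    return first_difference(first_difference(x))


def lagged_correlation(x, y, k):
    """Equation (4): correlation between x[i] and y[i + k] for lag k >= 0."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    num = sum((x[i] - mean_x) * (y[i + k] - mean_y) for i in range(n - k))
    den = (math.sqrt(sum((v - mean_x) ** 2 for v in x))
           * math.sqrt(sum((v - mean_y) ** 2 for v in y)))
    return num / den


def best_lag(x, y, max_lag):
    """Lag (0..max_lag) with the highest correlation."""
    return max(range(max_lag + 1), key=lambda k: lagged_correlation(x, y, k))


# Toy data: the second series repeats the first with a 2-step delay.
rain = [0, 0, 1, 3, 1, 0, 0, 0, 0, 0]
level = [0, 0, 0, 0, 1, 3, 1, 0, 0, 0]
lag = best_lag(rain, level, max_lag=4)            # → 2
w_r = first_difference([1.0, 1.5, 2.5, 2.0])      # → [0.5, 1.0, -0.5]
w_f = second_difference([1.0, 1.5, 2.5, 2.0])     # → [0.5, -1.5]
```

For the toy series, the correlation peaks at a lag of two steps, which is the delay built into the data.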

#### **4. Results and Discussion**

The ANN models were integrated into sensors, and their performance was evaluated in real cases. The discussion of the results is divided into three parts: (1) data preparation, (2) development, and (3) application. The research flowchart of this process is shown in Figure 5.

**Figure 5.** Research flowchart of the proposed ANN models.

#### *4.1. Data Preparation*

4.1.1. Generation of Synthetic Rainfall Data for Different Return Periods

Data quality and quantity are important for the accuracy of data-driven models (e.g., ANN models). The well-calibrated data from 2012 to 2017, including rainfall, flow discharge, and water stage, from the experimental watershed in the Yilan River Basin (Figure 1), were adopted. Following the research flowchart shown in Figure 5, a frequency analysis using rainfall data from five rainfall stations was conducted, and the magnitude of extreme events was related to the corresponding frequency of occurrence through the use of a probability distribution. To find the magnitude associated with a certain return period, the standard frequency factor method [41] is used, as follows:

$$X_T = \overline{X} + Ks \tag{7}$$

where *XT* is the calculated rainfall value for a certain return period, $\overline{X}$ is the mean rainfall value from historical data, *K* is the frequency factor, and *s* is the standard deviation of the historical data. *K* was selected for 2-, 5-, 10-, 25-, 50-, and 100-year return period events based on the Pearson III distribution [42]. The calculated cumulative rainfall for a 24-h rainfall duration is shown in Table 1. In terms of topography, the upper parts of the Yilan River Basin have mountain topography and steep slopes. There are wide flood plains from the lower reaches to the Pacific Ocean. The stations YR\_R2 and YR\_R4 are located in mountain and floodplain areas, respectively. As a result, YR\_R2 and YR\_R4 have the largest and smallest values among all stations. The rainfall values are consistent with the variations in topography.
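Equation (7) can be evaluated with a few lines of Python. This is a minimal sketch: the annual maxima and the frequency factor below are placeholder values, not data from the study, and the use of the sample standard deviation is an assumption. In practice, *K* depends on the return period and on the skewness of the fitted Pearson III distribution and is usually read from tables.

```python
import statistics


def design_rainfall(annual_maxima, k):
    """X_T = mean + K * s, Equation (7).

    Assumption: s is the sample standard deviation of the record.
    """
    mean = statistics.mean(annual_maxima)
    s = statistics.stdev(annual_maxima)
    return mean + k * s


# Placeholder record (mm) and a hypothetical frequency factor.
x_t = design_rainfall([200.0, 300.0, 400.0], k=1.5)   # → 450.0
```

Here the mean is 300 mm and the sample standard deviation is 100 mm, so a frequency factor of 1.5 gives 450 mm.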


**Table 1.** Calculated 24-h rainfall values at five stations for different return periods.

Hyetographs specify the precipitation depth in 24 successive time intervals of 1 h duration over a total of 24 h. Such hyetographs are necessary inputs to hydrological and hydraulic models to generate flow discharge and water level data in the Yilan River Basin for different return periods. The details associated with hydrological and hydraulic models are discussed in the next section. The annual 24-h maximum rainfall value at each station from 2012 to 2017 was selected, and the average contribution (%) of each hour to the 24-h duration was calculated. These contributions were reordered in a time sequence with the maximum contribution occurring at the center of the 24-h duration and the remaining contributions arranged in descending order alternately to the right and to the left of the center value to form a hyetograph. The results are shown in Figure 6.
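The reordering step described above (peak at the center, remaining values placed alternately to the right and to the left in descending order) can be sketched as follows. The function name and the choice of the center slot for an even number of intervals are assumptions; the hourly percentages are placeholder values.

```python
def alternating_block(contributions):
    """Arrange contributions with the peak at the center and the rest placed
    alternately to the right and to the left in descending order."""
    ordered = sorted(contributions, reverse=True)
    n = len(ordered)
    result = [None] * n
    center = n // 2   # assumption: upper-middle slot when n is even
    result[center] = ordered[0]
    right, left = center + 1, center - 1
    go_right = True
    for value in ordered[1:]:
        # Fall back to the other side when one side is already full.
        if (go_right and right < n) or left < 0:
            result[right] = value
            right += 1
        else:
            result[left] = value
            left -= 1
        go_right = not go_right
    return result


# Placeholder hourly contributions (%).
pattern = alternating_block([5, 10, 50, 20, 15])   # → [5, 15, 50, 20, 10]
```

Multiplying each contribution by the design rainfall depth for a given return period would then give the hourly depths of the synthetic hyetograph.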

**Figure 6.** (**a**–**e**) The hourly rainfall distribution for different return periods in the Yilan River basin at YR\_R1-YR\_R5.

4.1.2. Generation of Synthetic Water Level and Flow Discharge Data

The abovementioned synthetic rainfall data were used with the hydrological model Hydrologic Modeling System (HEC-HMS) and the hydraulic model River Analysis System (HEC-RAS) to generate discharge and water level data for the six return periods. HEC-HMS simulates rainfall runoff processes at the watershed level and includes different components, such as the runoff volume, baseflow, and channel flow. For more details, please refer to the Technical Reference Manual [43] and User's Manual [44]. In this study, an initial loss and a constant loss rate were subtracted from the precipitation depth, and the remaining depth was referred to as precipitation excess. Thereafter, precipitation excess was transformed to direct surface runoff through the SCS unit hydrograph (SCS-UH) method.

The total flow at YR\_Q1 in the Yilan River Basin (Figure 1) was the sum of the direct runoff plus the base flow, and the base flow was obtained from an initial value multiplied by an exponential decay constant. The calculated total flow from the HEC-HMS model was then used as the upstream boundary condition for the downstream HEC-RAS model to calculate the variation in the water level downstream. To validate the performance of the HEC-HMS model, Typhoons Soulik (2012), Dujuan (2015), and Megi (2016) were considered. Table 2 shows the comparisons between the observations and simulations. Differences in peak discharge and time to peak discharge were below 15% and less than 2 h, respectively. Since the results met the relevant performance requirements, the calibrated model was then used to generate synthetic discharge values at YR\_Q1 for further analysis. The inflow results associated with different return periods are shown in Figure 7.
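The loss and base-flow steps described in the two paragraphs above can be sketched in simplified form. This is an illustration only and not the HEC-HMS implementation: the bookkeeping of the initial loss within a time step is simplified, the SCS unit hydrograph transform is omitted, and all names and numerical values are hypothetical.

```python
def precipitation_excess(precip_mm, initial_loss_mm, constant_rate_mm_per_h, dt_h=1.0):
    """Excess depth per interval after an initial loss and a constant loss rate."""
    remaining_initial = initial_loss_mm
    excess = []
    for p in precip_mm:
        absorbed = min(p, remaining_initial)   # the initial loss is satisfied first
        remaining_initial -= absorbed
        after_initial = p - absorbed
        excess.append(max(0.0, after_initial - constant_rate_mm_per_h * dt_h))
    return excess


def baseflow(q0, decay, n_steps):
    """Base flow as an initial value multiplied by an exponential decay term."""
    return [q0 * decay ** t for t in range(n_steps)]


def total_flow(direct_runoff, base):
    """Total flow is the sum of the direct runoff and the base flow."""
    return [d + b for d, b in zip(direct_runoff, base)]


# Hypothetical values: hourly rainfall (mm), 8 mm initial loss, 3 mm/h constant rate.
excess = precipitation_excess([5, 10, 20, 4], 8.0, 3.0)   # → [0.0, 4.0, 17.0, 1.0]
base = baseflow(10.0, 0.5, 3)                             # → [10.0, 5.0, 2.5]
```

The direct runoff passed to `total_flow` would come from routing the precipitation excess through a unit hydrograph, which is outside the scope of this sketch.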

**Table 2.** The validated performance of HEC-HMS for three typhoons.


**Figure 7.** Synthetic inflow hydrograph at YR\_Q1 in the Yilan River basin for different return periods.

There are four water level stations downstream of YR\_Q1 (Figure 1). Among them, data from YR\_S1, YR\_S2, and YR\_S3 were used as input data to train the ANN model. The HEC-RAS model was used to estimate water level variations at the abovementioned locations, and YR\_S4 was used as the downstream boundary. HEC-RAS is a 1D river hydraulic model based on the Saint-Venant equations. These equations are approximated with the implicit Preissmann scheme and solved numerically using the Newton–Raphson iteration approach [45]. The downstream boundary condition was assumed to be the observed water levels during Typhoon Dujuan (2015). In comparison with observations, the performance was evaluated at YR\_S2 in terms of temporal variations in water level, and the results are shown in Figure 8. The results demonstrate that differences in peak water level and time to peak water level were below 10% and zero hours, respectively. In conclusion, both the hydrological and hydraulic results confirm that the model parameters were well-tuned to generate the data needed for the development of the ANN model in the next section.

#### *4.2. ANN Model Development*

#### 4.2.1. Correlation Analysis and Input Factor Selection

As described in Section 3.2, precipitation from YR\_R1 to YR\_R5, flow discharge at YR\_Q1, and water levels from YR\_S1 to YR\_S3 were assumed to be correlated with the water level output at YR\_S2. A correlation analysis between the targeted water level at the present time and the abovementioned variables from previous periods was performed using Equation (4). The resulting model was denoted the ANN\_0 model. To include physical features such as the velocity and acceleration of water level variations, a correlation analysis between the first-order and second-order differences of the output and its values from previous periods was also conducted. In this way, two more models were developed, named the ANN\_1 and ANN\_2 models. All of the correlation results are shown in Table 3. For each input factor, the lag with the highest correlation determined which previous values were selected as model inputs. For example, for the ANN\_0 model, the highest correlation results are 0.795 and 0.917 for YR\_R1 at t−4 h and YR\_S1 at t−1 h, respectively. As a result, four variables of YR\_R1 and one variable of YR\_S1 were selected for the ANN\_0 model. However, in some cases, such as YR\_R1 in the ANN\_1 model, the correlation results 7 and 8 h earlier were 0.601 and 0.608, respectively, which differ only in the third decimal place. For the other rainfall-related input factors (YR\_R2, YR\_R3, YR\_R4, and YR\_R5), the highest correlations occurred 7 h earlier. To be consistent with the other rainfall input factors and to increase computational efficiency, the variables 7 h earlier were selected for YR\_R1. Including fewer variables also increased the efficiency, given the limited computing resources of the sensors. As a result, there were a total of 26, 46, and 46 input factors for the ANN\_0, ANN\_1, and ANN\_2 models, respectively, and details of the input factor selection are listed in Table 4.

**Figure 8.** Comparison of the simulated and observed water levels at YR\_S2 for (**a**) Typhoon Soulik; (**b**) Typhoon Dujuan; and (**c**) Typhoon Megi.


**Table 3.** Correlation matrix for output and input variables.

**Table 4.** ANN models and their input factors.


4.2.2. Model Training and Testing

Random sampling was employed to split the input data from Section 4.1 into training and testing sets at an 80–20% ratio. Each dataset (sample) comprised the input factors listed in Table 4. For example, 46 input values were needed in one dataset to train the model and produce water levels at t + 1, t + 2, and t + 3. As shown in Table 4, all data in a dataset were in sequential order. The number of datasets depended on the data available during the training process. For example, if a 24-h record was available, 15 datasets were available for the training process. In each training run, the observed or synthetic water levels at t + 1, t + 2, and t + 3 were used for performance evaluation. Two extra experiments were conducted. One involved using data from 2-, 5-, 10-, 25-, and 50-yr return periods for training and data from the maximum 100-yr return period for testing. Another involved using data from 5-, 10-, 25-, 50-, and 100-yr return periods for training and data from the minimum 2-yr return period for testing. Because the training data were fully prepared in advance and their number was fixed, only one epoch was performed during this training process. The purpose of these experiments was to assess the forecasting capability of the ANN models beyond the training data range. The RMSE index (Equation (3)) was used to evaluate the model performance. The forecasting results with 1-, 2-, and 3-h lead times at location YR\_S2 are shown in Table 5.
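The construction of sequential datasets and the random 80–20% split can be sketched as follows. This is a simplified, single-series illustration: the window of 7 lagged hours plus 3 lead hours is an assumption (it is one reading consistent with the 15 datasets quoted for a 24-h record, since 24 − 10 + 1 = 15), and the function names and seed are hypothetical.

```python
import random


def make_windows(series, n_lags, n_leads):
    """Build (inputs, targets) pairs from consecutive hourly values."""
    width = n_lags + n_leads
    samples = []
    for start in range(len(series) - width + 1):
        inputs = series[start:start + n_lags]
        targets = series[start + n_lags:start + width]   # t + 1 ... t + n_leads
        samples.append((inputs, targets))
    return samples


def train_test_split(samples, test_fraction=0.2, seed=0):
    """Random split; the seed is fixed only to make the sketch repeatable."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


record = list(range(24))                          # placeholder 24-h record
samples = make_windows(record, n_lags=7, n_leads=3)
train, test = train_test_split(samples)
```

With these assumed window sizes, the 24-h record yields 15 samples, of which 12 go to training and 3 to testing.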


**Table 5.** RMSEs of different ANN models for the training and testing processes.

The results demonstrated that the ANN\_0 model performed worst among the three models for all three lead times; its RMSEs were 0.3421 m and 0.7743 m for the 1- and 3-h lead times, respectively. The performance of all three models in the other two tests was comparable to that in the cross-validation test. This finding confirmed that the proposed ANN models are capable of forecasting beyond the training data range. In comparison, the ANN\_1 and ANN\_2 models yielded RMSEs that were all below 0.2 m. The ANN\_2 model performed better than the ANN\_1 model for the 2- and 3-h lead times. However, the performance of these two models deteriorated as the forecasting lead time increased. Based on these results, the subsequent tests were carried out using the ANN\_1 and ANN\_2 models.

To test the performance of the proposed ANN models for operational purposes, three historical typhoon events, namely, Yutu (2018), Mangkhut (2018), and Maria (2018), were considered. In each test, two typhoons were used for training and the remaining typhoon was used for testing. The results are shown in Table 6, where each case is named after the typhoon used in the test case. The performance of both models was fairly consistent. All RMSEs were below or close to 0.1 m regardless of the lead time. According to the individual results, the ANN\_2 model performed slightly better than the ANN\_1 model. A temporal comparison between observations and the simulations of both models is shown in Figure 9. The results demonstrated that both models agree fairly well with the observations for all three typhoons. It was interesting to find that performance did not deteriorate when the lead time was increased. In fact, when the peak water level occurred during Typhoon Yutu, both models produced worse results with a 1-h lead time than with 2- and 3-h lead times.


**Table 6.** Comparison of RMSEs for different ANN models and historical typhoon events.

**Figure 9.** Comparison of simulated and observed water levels at YR\_S2 for the ANN\_1 model (**left**) and ANN\_2 model (**right**) for Typhoons Yutu (**a**), Mangkhut (**b**), and Maria (**c**).

Additionally, the calculation times of the models were compared for different hardware configurations: a PC with an AMD Ryzen 9 4900 central processing unit and an NVIDIA GeForce RTX 2060, among other components, and an RPi 3B+ (the sensor described in Section 3.1). Using Typhoon Yutu with the ANN\_2 model as an example, the calculation time was 5 min on the PC and 30 min on the sensor. The results from the PC and the sensor were consistent, but the calculation on the sensor took 6 times as long as on the PC. Therefore, calculation time is an issue that must be addressed in future investigations if the sensor is installed on site for real operation. In conclusion, all the results above confirmed the capability of the proposed ANN models to forecast water levels with up to a 3-h lead time. Based on the comparison of model performance, the ANN\_2 model was adopted for further applications, which are discussed in the next section.

#### *4.3. Applications*

The proposed ANN\_2 model and integrated sensor were then applied to three other study areas (one canal and two rivers) for real-world tests. The geographic locations of these three study areas are shown in Figure 1. The same input factors as in Table 4 were selected, but their number varied according to the number of gages installed in each study area. For example, there is only one rainfall station near the Shimen Canal; therefore, the number of input factors decreased from 46 to 17. The list of the input factors for these three study areas is shown in Table 7. In addition to the RMSE shown in Equation (3), the coefficient of determination (*R*<sup>2</sup>) described below was used to evaluate system performance in these real-world cases.

$$\begin{array}{l} R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}\\ SS_{\text{res}} = \sum \left(P_i - O_i\right)^2\\ SS_{\text{tot}} = \sum \left(O_i - \overline{O}\right)^2 \end{array} \tag{8}$$

where *Oi* and $\overline{O}$ are the hourly observations and the mean of the observations, respectively, and *Pi* is the prediction. If the predictions exactly match the observations, *SSres* = 0 and *R*<sup>2</sup> = 1. A detailed discussion for each case is given below. In the applications, the sensor received data continually. Whenever the sensor collected new data, the model was retrained with the newly collected data, which completed a new epoch. This process was repeated until the end of the experiment.
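The coefficient of determination can be computed as in the following sketch, which uses the standard definition in which the total sum of squares is taken over the observations about their mean. The function name and the sample values are hypothetical.

```python
def r_squared(predicted, observed):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


perfect = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])     # → 1.0
mean_only = r_squared([2.0, 2.0, 2.0], [1.0, 2.0, 3.0])   # → 0.0
```

A forecast that always returns the mean of the observations gives a value of 0, and a forecast worse than that gives a negative value, which is why a low *R*<sup>2</sup> can coexist with a small RMSE when the observed variation is itself small.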

**Table 7.** ANN models and their input factors for various applications.


#### 4.3.1. Shimen Canal

The distance between SC\_R1 and SC\_R3 is approximately 650 m, as shown in Figure 1. Three sensors were installed at SC\_R1, SC\_R2, and SC\_R3. The experiment was performed on site on 24 and 25 June 2021. Each sensor collected a total of 200 observations per hour, and the average value was used as the model input. Other data, such as discharge and precipitation data, were retrieved from local stations. The system started to produce water levels at SC\_R2 with 1-, 2-, and 3-h lead times after the fifth hour of installation. The overall testing time was 35 h. The comparison between observations and simulations is shown in Figure 10. The errors, which were defined as the difference between observation and prediction, were within 0.075 m. The largest error of 0.075 m occurred with a 1-h lead time (Figure 10a). One possible reason for the error was human interference: the canal is operated at a certain water level for the purpose of irrigation. This study did not include factors related to human operations, and the performance deteriorated once human interference occurred. The RMSEs were between 0.02 and 0.03 m. According to this performance, the system was able to reproduce the variation in the water levels. However, because of manual operations such as gate control upstream of the canal, *R*<sup>2</sup> was only in the range of 0.3 to 0.45. The total computation time needed to train the proposed ANN\_2 model in the sensor and produce water level forecasts using the observations was 8 min.

#### 4.3.2. Toucian and Beinan Rivers

To avoid the impact of manual operations on system performance, the Toucian and Beinan Rivers were selected to test system performance. Unfortunately, the distance between the sensor installation and the water surface exceeded the maximum measurement range of the sensor. Therefore, the following experiments were conducted using data retrieved directly from the Water Resources Agency. The evaluation period for the Toucian River was from 3 June 2021 to 31 December 2021. Typhoon In-fa on 24 July and Typhoon Lupit on 6 August influenced this study area during this period. Figure 11 shows that the largest difference between observations and predictions at TR\_S2 (see the location in Figure 1) was 1.3 m among all forecasts. This occurred when Typhoon In-fa had an impact on this area (solid circle in Figure 11). Similar to previous cases, the system started to produce forecasts in the fifth hour after the experiment started. It was confirmed that having more data in the training process increased the accuracy of the ANN\_2 model. The errors between observations and simulations decreased to below 0.5 m for the second typhoon event (dashed circle in Figure 11) and thereafter. Finally, the overall R<sup>2</sup> and RMSE were approximately 0.98 and 0.045 m, respectively.

**Figure 10.** (**a**–**c**) Comparison of simulated and observed water levels at SC\_S2 of the Shimen Canal for lead times of t + 1 h, t + 2 h, and t + 3 h, respectively.

For the Beinan River, the evaluation period was from 1 July 2021 to 31 December 2021. Typhoon Lupit on 6 August and Typhoon Kompasu on 14 October influenced this study area during the period. Figure 12 shows that the greatest difference between the observations and predictions at BR\_S2 (see the location in Figure 1) was below 1 m. The largest errors were found when the highest and lowest water levels were observed (solid and dashed circles in Figure 12), and extreme values such as these were not included in the training data. The overall R<sup>2</sup> and RMSE were approximately 0.98 and 0.08 m, respectively. In conclusion, based on the above experiments, the proposed ANN\_2 model and integrated sensor show excellent potential to perform edge computing locally and generate water level forecasts for real-time operation. The forecasting accuracy decreased when the water levels were beyond the range of values in the training data set. Furthermore, the performance of the proposed system decreased if human interference occurred in the study area.

**Figure 11.** (**a**–**c**) Comparison of simulated and observed water levels at TR\_S2 of the Toucian River for lead times of t + 1 h, t + 2 h, and t + 3 h, respectively.

**Figure 12.** (**a**–**c**) Comparison of simulated and observed water levels at BR\_S2 of the Beinan River for lead times of t + 1 h, t + 2 h, and t + 3 h, respectively.

#### **5. Conclusions**

In this study, an ANN-based model was integrated into a Raspberry Pi-based sensor to implement edge computing for hourly river water level forecasting. The ANN model is capable of forecasting the river level with a high level of precision by only using previously observed water level, rainfall information, and flow discharge as inputs without the need for other hydrological and meteorological parameters. Edge computing is a form of computing that is conducted on site or near a targeted location, thus minimizing the need for data to be processed at a remote data center and increasing the efficiency of the emergency response. This study is a first attempt to combine real-time customized sensors and ANN algorithms in practice. Based on historical measured data from the Yilan River in Yilan County, synthetic upstream rainfall and discharge data were generated for six different return periods using the Pearson III probabilistic model and the HEC-HMS hydrological model with synthetic inflow data. The downstream water level data were obtained with the aid of the HEC-RAS hydraulic model. Different input combinations, including first-order difference and second-order difference water levels, were investigated to enhance the precision of the ANN model. The results demonstrated that the proposed ANN model has the capability to precisely forecast the future fluctuations in the water level of rivers in a short time and with a small number of inputs. The model was then embedded into the customized sensor to forecast the water level over different time horizons up to 3 h in advance. Finally, a comprehensive comparison between forecasts and observations was performed for three other rivers and canals. The findings revealed that the proposed water level sensor with the ANN model exhibited a high level of performance when it was applied to real events. 
Therefore, the integration proposed in this study is very promising and could be incorporated into a new generation of flood warning systems to prevent and mitigate the impacts of floods in downstream areas. However, the time required for the system to train the ANN model and produce results is over 30 min. Because of the limited computing power of the sensor, the amount of data needed to train the model during real-time operation, such that computation is accelerated while high forecasting accuracy is maintained, needs to be investigated further. In addition, different microcontroller units with more computing power will be considered in future studies. Finally, in the current study, only one on-site experiment was performed because of device limitations. More on-site experiments must be performed, and input factors of the ANN model regarding human interference should be included in future studies to extend the system application scope.

**Author Contributions:** C.-H.L. and T.-H.Y. developed the theoretical formalism, performed the analytic calculations and performed the numerical simulations. O.T.W. analyzed the data and produced figures and tables. C.-H.L., T.-H.Y. and O.T.W. contributed to the final version of the manuscript. All authors have read and agreed to the published version of the manuscript.

**Funding:** This study was supported by the National Science and Technology Council, Taiwan under Research Grant MOST 111-2625-M-A49-006.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Data Availability Statement:** The experimental watershed data can be accessed in [23], and the data that support the findings of this study are available from the corresponding author, T.-H.Y., upon reasonable request.

**Acknowledgments:** The authors would like to thank the National Center for High-performance Computing (NCHC) for the experimental watershed data sets.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Systematic Review* **E-Cardiac Care: A Comprehensive Systematic Literature Review**

**Umara Umar <sup>1,\*</sup>, Sanam Nayab <sup>1</sup>, Rabia Irfan <sup>1</sup>, Muazzam A. Khan <sup>2</sup> and Amna Umer <sup>3</sup>**


**Abstract:** The Internet of Things (IoT) is a complete ecosystem encompassing various communication technologies, sensors, hardware, and software. Cutting-edge IoT technologies and Artificial Intelligence (AI) have enhanced the traditional healthcare system considerably. The conventional healthcare system faces many challenges, including avoidable long wait times, high costs, a conventional method of payment, unnecessarily long travel to medical centers, and mandatory periodic doctor visits. Smart healthcare systems, IoT, and AI are arguably the best-suited solutions for the flaws of traditional healthcare systems. The primary goal of this study is to determine the impact of IoT, AI, various communication technologies, sensor networks, and disease detection/diagnosis in Cardiac healthcare through a systematic analysis of scholarly articles. Hence, a total of 104 fundamental studies are analyzed for the research questions purposefully defined for this systematic study. The review results show that deep learning, in combination with IoT, emerges as a promising technology in the domain of E-Cardiac care, offering enhanced accuracy and real-time clinical monitoring. This study also pins down the key benefits and significant challenges for E-Cardiology in the domains of IoT and AI. It further identifies the gaps and future research directions related to E-Cardiology, monitoring various Cardiac parameters, and diagnosis patterns.

**Keywords:** arrhythmia; artificial intelligence (AI); cardiac; communication technologies; Electrocardiogram (ECG); systematic literature review (SLR)

#### **1. Introduction**

According to the World Health Organization (WHO) [1], cardiovascular disease claims a large number of lives across the globe [2] and is responsible for approximately 80% of sudden deaths. Moreover, cardiac arrhythmia is considered the chief cause in more than 15% of the deaths. Thus, promoting cardiovascular health is vital and requires an overhaul of healthcare systems [3].

The rapidly expanding Internet of Things (IoT) [4] technology has the capability to monitor and control critical human functions, irrespective of where the individual is located or what they are doing. Medical IoT (MIoT) is a cutting-edge technology that exploits the advantages of the Internet at a very affordable cost and with minimum effort. The MIoT-based cardiac system is intended to monitor the physical symptoms [5] of cardiac patients, such as temperature, Blood Pressure (BP), Oxygen Saturation (SpO2), Electrocardiogram (ECG), Heart Rate (HR) [6], and linked environmental parameters, effectively and without failure. The MIoT cardiac care framework is a customized paradigm that meets the requisite medical and safety standards of pervasive cardiac healthcare, including serious heart-related issues.

Various cardiac (heart) abnormalities can be detected through an Electrocardiogram (ECG), a medical test that keeps track of the electrical activity the heart generates as it contracts. An electrocardiograph is the device that records a patient's ECG. An ECG is a valuable tool for identifying problems associated with heart rate or heart rhythm.

**Citation:** Umar, U.; Nayab, S.; Irfan, R.; Khan, M.A.; Umer, A. E-Cardiac Care: A Comprehensive Systematic Literature Review. *Sensors* **2022**, *22*, 8073. https://doi.org/10.3390/s22208073

Academic Editors: Andrei Velichko, Dmitry Korzun and Alexander Meigal

Received: 30 August 2022 Accepted: 4 October 2022 Published: 21 October 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

An ECG assists the physician in determining whether a patient is having a heart attack or has had one in the past. It is usually the first option for a cardiac test because of its proven dependability. An ECG helps determine whether the heart is beating too slowly (bradycardia) or too fast (tachycardia), including when the pulse is difficult to feel or to count accurately. An ECG can also show heart rhythm irregularities, i.e., arrhythmia. The main types of arrhythmia are mentioned in Table 1.

**Table 1.** Various Types of Arrhythmias.


Similarly, atrial fibrillation, atrial flutter, and premature or extra beats are the other types of cardiac issues. Figure 1 shows waveforms for different arrhythmia types. Atrial fibrillation refers to a rapid, disorganized, and irregular heart rhythm, while atrial flutter is an atrial arrhythmia generated by a fast circuit in the atrium. Compared to atrial fibrillation, atrial flutter is typically more organized and regular.

**Figure 1.** Types of arrhythmia.

A comprehensive review of E-Cardiology, which encompasses the Internet of Things, artificial intelligence, and cardiology could help understand the essential building blocks of an IoT-based cardiac care system and intelligent diagnosis of various cardiovascular diseases. It can also help to develop a complete picture of various hardware devices (sensors), AI techniques, and communication technologies adopted by the existing studies in the field of intelligent cardiac healthcare.

Following the introduction, Section 2 of this paper briefly discusses related works; Section 3 highlights the contributions made by this paper. Section 4 elaborates on the review methodology adopted for conducting the survey. Section 5 gives the outcomes of the selected studies with a detailed analysis of the research questions (RQs). This section is further divided into four subsections. Section 6 contains a discussion. Section 7 summarises the conclusions.

#### **2. Related Works**

This section presents a brief explanation of the related surveys in the field of IoT-based cardiac healthcare.

The primary goal of the study [7] was to collect the latest facts, figures, and evidence on the use of preprocessing techniques for heart disease classification. The review also summarised the impact of the most frequently used preprocessing tasks and techniques and their performance in the field of cardiology. This review paper covered the literature from 2000 to June 2019.

A survey on IoT and AI in healthcare was presented by [8] for 2007 to February 2018. The paper highlighted the top application classifications, which included wearables, sensor networks, connectivity options, and disease detection and treatment. This review identified gaps and provided future research directions related to technology and design. However, this survey analysed only three online databases.

A review article on data mining techniques frequently used in the field of cardiology until 2015 was presented in [9]. A performance comparison of various data mining models in cardiology was also discussed in this review paper.

The authors in [10] presented a survey on the Internet of Things (IoT) for healthcare using mobile computing. This systematic study investigated how mobile computing assisted IoT in a healthcare environment. Moreover, the intention of this paper was to analyse the impact of mobile computing on IoT technology in Smart hospitals and the field of healthcare. This study covered the literature between 2011 and 2019.

Another study [11] proposed a substantial review of various IoT applications in a life-saving environment, as well as various other fields in Smart cities. It also contrasted IoT with M2M and highlighted some drawbacks of IoT technology. This review article covered 2013 to 2018 through the Scopus database.

Another study [12] presented literature on IoT technologies and several projects for healthcare in 2018. This paper provided a review of primary medical IoT sensors and an overview of state-of-the-art IoT infrastructure essential for healthcare. It focused on the latest IoT technologies for healthcare services, such as cloud computing, big data, RFID, WSN, Bluetooth, Wi-Fi, and other vital medical sensors. However, this study does not follow a systematic review methodology.

The study [13] highlighted various IoT applications and was presented in 2022. The study focused on IoT adoption in Pakistan and France in 2020. This systematic study highlighted the barriers and possibilities for the implementation of IoT applications. It also indicated the influence of COVID-19 on IoT adoption in the healthcare domain.

The systematic review [14] discussed telemedicine and healthcare IoT (HIoT). It covered 146 articles between 2015 and 2020. The articles were divided into five categories after a technical analysis. In addition to the benefits and limitations of the selected methods, a comprehensive comparison of evaluation techniques, tools, and metrics was also included. This study presented a summary of healthcare applications of IoT (HIoT).

The surveys discussed so far are each limited to a particular aspect of Smart healthcare/E-Cardiology and do not attempt to cover the domain holistically. By the entire domain, we mean AI-based IoT, which encompasses preprocessing techniques and also various communication technologies. Given the deficiencies of the existing review papers, we provide a comprehensive systematic literature review for the following reasons:


The following section highlights the contributions made by this review study, thus bringing novelty to this systematic review study.

#### **3. Contributions**


The next section discusses the research methodology adopted for our SLR.

#### **4. Review Methodology**

A systematic literature review (SLR) paradigm is followed in this paper for reviewing papers from the most reliable resources, as shown in Figure 2. Principally, the research work, applications, and monitoring/detection techniques provided by AI-aided MIoT in cardiac care are considered. The primary studies were then passed through a quality assessment process so that the analysis is based on the best-fitting results.

**Figure 2.** SLR protocol outline.

The following subsections briefly describe the detail of each step involved in our review protocol.

#### *4.1. Defining Review Strategy*

The application of medical IoT in cardiac care is a compelling field of study for the researchers, so the primary focus of this SLR was to formulate the research questions exploring how medical IoT is affecting cardiac care and the significance of artificial intelligence in the diagnosis and detection of various heart diseases.

The review questions in Table 2 indicate how MIoT and AI are contributing to cardiac healthcare systems in Smart hospitals.


**Table 2.** Review questions and their motivation.

#### *4.2. Defining Search Strategy*

Once the research questions were designed, the next step was to state precisely the search strategy to be followed. The primary literature mentioned in Appendix A (Table A1) was identified using three search strings applied to five digital databases, namely IEEE Xplore, ACM Digital Library, SpringerLink, ScienceDirect, and Google Scholar. These databases were chosen for two reasons. First, they are the most popular online data resources in the domain of computer science and information technology. Second, they were used as sources for previous systematic literature reviews related to computer science and E-Cardiology [15].

Our search covered the period from 2016 to 2021. The criteria used for the selection of search terms or keywords are listed below [16]:


After the critical analysis of the key terms, three search strings were formed in order to extract the relevant information. These search strings were checked on each of the aforementioned databases by changing their patterns to retrieve the best relevant results. The three search strings are given in Table 3.

**Table 3.** Search strings used for data retrieval.




#### *4.3. Inclusion and Exclusion Criteria*

To identify and include the studies most relevant to the RQs described in the section "Defining Review Strategy", we defined the inclusion and exclusion criteria listed in Table 4.

**Table 4.** Inclusion and exclusion criteria.


The authors evaluated each candidate paper to decide whether it should be included or excluded. The selection of papers was accomplished by following the three steps mentioned below.

The first step was the removal of duplicated and redundant papers. The second step was perusing the keywords, abstracts, and titles of the remaining research articles. The last step was reading the full-length research papers. The inclusion and exclusion criteria were applied throughout. Articles on which the authors disagreed were discussed and reviewed again, using either the full text or the partial text, until a consensus was reached.
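The three selection steps above can be sketched in code. This is only an illustration under stated assumptions: the `Record` fields, the sample records, and the keyword list are hypothetical, and simple keyword matching stands in for the manual judgement the authors applied against the criteria in Table 4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical bibliographic record; the fields are illustrative."""
    title: str
    abstract: str
    full_text: str


def remove_duplicates(records):
    """Step 1: keep the first record for each normalised title."""
    seen = set()
    unique = []
    for record in records:
        key = " ".join(record.title.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def mentions_any(text, keywords):
    """Keyword matching stands in for the reviewers' manual judgement."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


def select_papers(records, keywords):
    """Run the three steps and return the surviving records after each one."""
    step1 = remove_duplicates(records)
    # Step 2: screen titles and abstracts.
    step2 = [r for r in step1 if mentions_any(r.title + " " + r.abstract, keywords)]
    # Step 3: screen the full text.
    step3 = [r for r in step2 if mentions_any(r.full_text, keywords)]
    return step1, step2, step3


# Hypothetical sample data, not taken from the reviewed literature.
records = [
    Record("IoT ECG Monitor", "Cardiac monitoring with sensors.", "We classify ECG arrhythmia."),
    Record("iot  ecg monitor", "Cardiac monitoring with sensors.", "We classify ECG arrhythmia."),
    Record("Smart Farming", "Soil moisture sensing.", "No medical content."),
    Record("Cardiac Wearables", "ECG patches for cardiac care.", "Focus on battery chemistry only."),
]
step1, step2, step3 = select_papers(records, keywords=("ecg", "cardiac"))
print(len(step1), len(step2), len(step3))
```

Returning the survivors of every step, rather than only the final list, makes it possible to report how many papers were dropped at each stage.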

#### *4.4. Quality Assessment Criteria*

In this step, based upon the coherence and relevancy, we analyzed all the collected studies to address the defined research questions. A deep analysis of each paper was made, and based on our research questions, 134 papers were selected. Out of those 134 research papers, the papers having considerable citation count, appearing in good impact factor journals, and being delivered at highly ranked conferences were finally selected, thus leaving a total of 104 papers for the review, shown in Appendix A (Table A1).

#### *4.5. Quantitative Analysis*

The last step of our review protocol design was conducted to execute necessary statistical analysis on quantitative data. In this step, we quantitatively summarised and analyzed the results extracted from various sources such as conferences, journals, and book sections. Then, we carried out some quantitative statistical analysis of the findings to explore more about our research questions (RQs) and trends.

Figure 3 gives a thorough overview of our screening and assessment method for the statistical analysis of our literature. Five databases were chosen for the review, as illustrated in this figure. A total of 502 documents were retrieved for review and analysis. Many of these were found to be duplicates, so 203 records were eliminated before screening. Papers were removed for a few different reasons. Articles were chosen in the screening process based on a planned inclusion and exclusion strategy. Following the screening, 104 papers were chosen based on the inclusion and exclusion standards.
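The counts reported in Sections 4.4 and 4.5 can be reconciled with simple arithmetic. The sketch below assumes that the 134 papers shortlisted in Section 4.4 are an intermediate stage of the same flow as the 502 records identified and the 203 removed before screening; the text does not state this explicitly, and the function and key names are illustrative.

```python
def flow_counts(identified, removed_before_screening, shortlisted, included):
    """Derive the number of records at each stage of the screening flow."""
    screened = identified - removed_before_screening
    return {
        "screened": screened,
        "excluded_at_screening": screened - shortlisted,
        "excluded_at_quality_assessment": shortlisted - included,
        "included": included,
        "share_removed_before_screening": removed_before_screening / identified,
    }


# Figures quoted in the text; treating 134 as an intermediate stage is an assumption.
counts = flow_counts(identified=502, removed_before_screening=203,
                     shortlisted=134, included=104)
print(counts)
```

Under that assumption, 299 records entered screening, 165 were excluded during screening, and 30 more were excluded at the quality assessment stage, leaving the 104 included papers.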

**Figure 3.** Flow diagram showing the screening process for the systematic review.

The next section highlights the "Outcomes" of this systematic review.

#### **5. Outcomes**

*5.1. RQ 1: What Are the Vital Hardware Components/Sensors Used in E-Cardiac Architecture for Different CCU Parameters?*

The cardiac healthcare monitoring system in an IoT sphere encompasses the various IoT sensory modules and technologies attached to the patient, which receive sensory data and send it to the cloud for further monitoring, processing, and decision making. In an IoT-based cardiac healthcare monitoring system, sensors such as the heart rate/pulse sensor, temperature sensor, blood pressure sensor, blood oxygen sensor, and ECG sensor obtain parametric values from the patient and transmit them through specific communication technologies to the cloud. There, machine learning is applied to the collected values, and alerts are generated for the specialist, suggesting timely action when warranted.

#### **(i) Heart Rate Sensory Unit**

Heart rate monitoring plays a crucial role in the diagnosis, detection, and classification of cardiac abnormalities. Several cardiac ailments and disorders occur due to a patient's high or low heart rate. The normal heart rate is 60–100 beats per minute (bpm). Less than 60 bpm is considered low, and greater than 100 bpm is considered high. We found comprehensive studies that used several types of heart beat sensors for bpm monitoring. The studies [17,18] used heart beat pulse sensors to measure patient heart rates in a real-time environment. In [19], the KY 093 module was used to obtain heart rate values. Using a MAX30100 pulse oximeter, the authors in [20] collected heart beat information. The publications [21–23] utilized an ECG module AD8232 to obtain the patients' heart rate data in real time for monitoring purposes. Table 5 mentions recent studies on heart rate sensors. Each heart rate sensor has its own set of properties. This table shows some important characteristics of several heart beat sensor variants, such as pins, type, operating voltage, low current supply, accuracy, and so on.
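The heart rate ranges quoted above translate directly into a small classification rule. The following is a minimal sketch; the function name and the returned labels are illustrative and are not taken from any of the reviewed systems.

```python
def classify_heart_rate(bpm):
    """Label a heart rate using the 60-100 bpm normal range quoted in the text."""
    if bpm <= 0:
        raise ValueError("heart rate must be positive")
    if bpm < 60:
        return "low"
    if bpm > 100:
        return "high"
    return "normal"


for reading in (48, 72, 100, 130):
    print(reading, classify_heart_rate(reading))
```

The boundary values 60 and 100 are treated as normal, which matches the inclusive range given in the text.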

#### **(ii) Temperature Sensory Unit**

Body temperature is an essential parameter for the development of cardiac healthcare monitoring. Various analog and digital temperature sensors are available for determining body temperature. Temperature can be measured in Celsius or Fahrenheit. Temperatures above 37.5 °C or 38.3 °C, depending on the threshold adopted, are considered high. The temperature sensor LM35 is referenced in [24–27] for health monitoring. The authors in [18,28] used an 18DS20 sensor for temperature monitoring in a real-time environment. Table 6 shows several of the temperature sensors, along with descriptions. The LM35 and 18DS20 sensors are the most widely employed temperature sensors in the research studies that we analysed.
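Because readings may be reported in either unit, a conversion step is usually needed before a threshold is applied. The sketch below uses the standard Celsius-to-Fahrenheit formula and the two high-temperature thresholds quoted above; the function names and the choice of 37.5 °C as the default threshold are assumptions made for illustration.

```python
def celsius_to_fahrenheit(celsius):
    """Standard conversion: F = C * 9/5 + 32."""
    return celsius * 9.0 / 5.0 + 32.0


def is_high_temperature(celsius, threshold_c=37.5):
    """True when the reading exceeds the chosen threshold (37.5 or 38.3 in the text)."""
    return celsius > threshold_c


reading_c = 38.0  # hypothetical sensor reading
print(celsius_to_fahrenheit(reading_c))
print(is_high_temperature(reading_c))
print(is_high_temperature(reading_c, threshold_c=38.3))
```

The same reading can be flagged under one threshold and not under the other, so a monitoring system should state which threshold it applies.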

#### **(iii) Blood Pressure (BP) Sensory Unit**

BP monitoring is a fundamental biological measure for the detection and diagnosis of cardiac incongruities and anomalies. BP values can be obtained using various sensory units and devices. Systolic and diastolic values are captured by BP sensors or devices to be examined by a physician. Normal BP is less than 120/80 mmHg, while low BP, called "hypotension", is below 90/60 mmHg, and high BP, called "hypertension", is above 140/90 mmHg. Our review found publications on E-Cardiology that dealt with cardiac patients' BP. In 2017 [29], a digital BP monitor (OMRONHBP1300) was used to monitor and automatically detect cardiac arrhythmia. The paper published in 2019 [30] examined predicting cardiac ailments in E-Cardiology using ECG, cholesterol, and BP. The MPX10 BP sensor was utilised in the 2020 publication [27] for patient health monitoring. Table 7 lists the sensors and devices utilized in the past few years for BP monitoring of cardiac patients, along with their comparable attributes.
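The BP ranges quoted above can be expressed as a rule over the systolic and diastolic values. This is a sketch under explicit assumptions: the text does not say whether a category requires both values or either value to cross its limit, so the code treats either value crossing a limit as sufficient for hypotension or hypertension, checks hypertension first, and labels readings that fall between the quoted ranges as "unclassified". The function name and labels are illustrative.

```python
def classify_blood_pressure(systolic, diastolic):
    """Label a reading in mmHg using the ranges quoted in the text."""
    if systolic <= 0 or diastolic <= 0:
        raise ValueError("pressures must be positive")
    # Assumption: either value above its limit counts as hypertension.
    if systolic > 140 or diastolic > 90:
        return "hypertension"
    # Assumption: either value below its limit counts as hypotension.
    if systolic < 90 or diastolic < 60:
        return "hypotension"
    if systolic < 120 and diastolic < 80:
        return "normal"
    # The text does not name the range between 120/80 and 140/90.
    return "unclassified"


for reading in ((115, 75), (85, 55), (150, 95), (130, 85)):
    print(reading, classify_blood_pressure(*reading))
```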

The multiple modules of BP sensors and devices used in past studies, as well as different sensors of other cardiac parameters, are shown in Table 7.


**Table 5.** Features of heart rate sensors/devices.


**Table 5.** *Cont.*

Acc—accuracy; BPM—beats per minute; GND—ground; INT—integrated; IR LED—infrared light-emitting diode; IRD—IR LED to driver; LO—leads off; N/A—not applicable; N—no; RD—red LED to driver; SCL—serial clock; SDA—serial data; SDN—shutdown control input; V—voltage; VCC—voltage common collector; VIN—voltage input; Y—yes.

#### **(iv) Oxygen Sensory Unit**

Blood oxygen can be monitored using several IoT-based blood oxygen sensory units, such as pulse oximetry sensors, which obtain oxygen saturation levels along with a patient's heart rate. Ready-made wearable devices are also available to measure blood oxygen saturation levels. Blood oxygen saturation is expressed as a percentage, and the normal range is 90 to 100%. The study [37] describes a pervasive healthcare monitoring service system that uses an SpO2 device to measure oxygen saturation. The MAX30100 pulse oximeter has proven to be useful in measuring blood oxygen levels in cardiac patients [20]. The studies we reviewed on IoT-based cardiovascular healthcare monitoring used the sensors mentioned in Table 8 to measure oxygen saturation. This table lists several important and common features of oxygen saturation sensors and devices, such as addressed parameters, voltage, type, accuracy, pins, range, and so on. Table 8 shows that blood oxygen sensors/devices are used for cardiac patients in very few studies.

#### **(v) ECG Unit**

ECG is the most crucial biological parameter for monitoring, detecting, predicting, and classifying cardiac irregularities and variations in the human heart. The ECG AD8232 module was used in the studies [21,22,38–40] to monitor ECG and detect cardiac anomalies in cardiovascular patients. Heart abnormalities were detected with the use of the ECG AD8233 module [41]. In Table 9, recent papers published on multiple ECG sensors and devices are mentioned along with their necessary and comparable features. Low supply current, electrodes, the sampling rate, right leg drive shut down, single supply operation, high pass filter, output, operating temperature, pins, and other features of various ECG modules are considered as the most notable attributes.

Table 10 shows detailed and comprehensive literature analysed to find IoT-based cardiovascular sensors and devices used in previous studies from 2016 to 2021. This table demonstrates that the majority of research on ECG has been conducted using the ECG AD8232 module to detect anomalies in cardiac patients.


**Table 6.** Features of temperature sensors/devices.

Acc—accuracy; ADD—address select; °C/°F—centigrade/Fahrenheit; DQ—data in/out; GND—ground; N—no; NC—no connection; SDA—serial data; SDN—shutdown control; V—voltage; VCC—voltage common collector; VDD—power supply voltage; VOUT—output voltage; Y—yes.

**Table 7.** Features of blood pressure sensors/devices.


Acc—accuracy; BP—blood pressure; IR LED—infrared light-emitting diode; kPa—kilopascals; mmHg—millimeters of mercury; ms—millisecond; N/A—not applicable; RT—response time; V—voltage; Vout—voltage output; Vs—power supply.


Acc—accuracy; GND—ground; INT—integrated; IRD—IR LED to driver; HR—heart rate; N/A—not applicable; PR—pulse rate; RD—red LED to driver; SDA—serial data; SDN—shutdown control; V—voltage; VIN—voltage input; Y—yes.

**Table 9.** Features of ECG sensor/devices.


AC/DC—alternating current/direct current; EMG—electromyography; FR—fast restore; GND—ground; HPF—high pass filter; HPSENSE—high pass sense; HPDRIVE—high-pass driver; HZ—hertz; INT—integrated; LA—left arm; LO—leads off; MHZ—megahertz; NA—not applicable; N—no; OPER TEMP—operation temperature; PC—personal computer; RA—right arm; REFIN—reference buffer input; RL—right leg; SDN—shutdown control input; SQL—structure query language sampling rate; V—voltage; Vs—power supply terminal; Y—yes.



**Table 10.** Sensors used in previous studies.




BP/bp—blood pressure; CL—cholesterol; DNNS—device name not specified; ECG—electrocardiogram; HB—heartbeat; HR—heart rate; INT—integrated; MNNS—module number not specified; PCG—phonocardiograph; PR—pulse rate; PPG—photoplethysmography; UB-DNNS—used but device name not specified; UB-MNNS—used but module number not specified; ✗—The specified parameter is not addressed.

#### *5.2. RQ 2: What Are the Most Important Communication Technologies Used in E-Cardiac Care?*

Communication technologies and protocols can be defined as a set of rules, technologies, semantics, equipment, and programs used to transfer, process, communicate, and receive information. They vary depending upon the technology and network type devised, developed, or utilized. Some of the protocols and communication technologies are discussed in this section. The publications mentioned in Table 11 address the communication technologies and protocols used in the selected studies for E-Cardiology monitoring, detection, and classification.

Bluetooth (BL) is a wireless technology for short-range communication and data exchange between mobile and fixed devices. BL has a transmission power of 1 mW–100 mW and a 1 Mbps data rate. Its data transmission range is 30 feet. Wearable healthcare monitoring devices (wearable fitness watches and pulse oximeters) may have BL integrated. BL technology was employed in previous research [19,39,41,58,66,73] for data transmission in E-Cardiology. In the prior literature on E-Cardiology monitoring, BL was found to be the most commonly used technology.

Ethernet is a wired networking protocol that can be used in local area networks (LANs), metropolitan area networks (MANs), and wide area networks (WANs). Ethernet allows communication through data cables. The publications [58,83] used wired Ethernet for connectivity between the various hardware modules implemented for cardiovascular disease diagnosis. One existing research study found that Ethernet-wired communication is rarely used in E-Cardiology.

GSM is a cell-based mobile communication system that is accessed through a GSM modem. GSM technology is used in E-Cardiology to send SMS messages or dial calls. GPS, which helps people find their position on Earth, consists of networks of satellites and receivers or devices that determine location.
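The Bluetooth figures quoted above (1 mW–100 mW, 1 Mbps, 30 feet) can be put into more familiar engineering units with a few standard conversions. The sketch below is illustrative: the function names are our own, the payload in the example (10 s of a single-lead ECG sampled at 250 Hz with 16 bits per sample) is hypothetical, and the transfer time is an idealised lower bound that ignores protocol overhead and retransmissions.

```python
import math


def milliwatts_to_dbm(power_mw):
    """Convert transmit power from milliwatts to dBm: 10 * log10(P / 1 mW)."""
    if power_mw <= 0:
        raise ValueError("power must be positive")
    return 10.0 * math.log10(power_mw)


def feet_to_metres(feet):
    """One foot is exactly 0.3048 m."""
    return feet * 0.3048


def ideal_transfer_time_s(n_bits, rate_bps=1_000_000):
    """Lower bound on transfer time at the quoted 1 Mbps data rate."""
    return n_bits / rate_bps


# Hypothetical payload: 10 s of ECG at 250 Hz, 16 bits per sample.
payload_bits = 10 * 250 * 16

print(milliwatts_to_dbm(1), milliwatts_to_dbm(100))
print(feet_to_metres(30))
print(ideal_transfer_time_s(payload_bits))
```

The quoted power range corresponds to 0 dBm to 20 dBm, and the quoted range of 30 feet is roughly 9 m.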

The communication technologies and protocols used in E-Cardiology in previous research studies and findings are detailed in Table 11.


**Table 11.** Communication technologies used in previous studies.


**Table 11.** *Cont.*

BT—Bluetooth; GSM—global system for mobile communications; GPS—global positioning system; GPRS—general packet radio service; MQTT—message queuing telemetry transport; N—No; Sc.P—security protocols; SMS—short message service; SMPP—short message peer-to-peer; SP/PC—Smart phone/personal computer; TCP/IP—transmission control protocol/Internet protocol; WIFI—wireless fidelity; Y—yes.

#### *5.3. RQ 3: Which Pre-Processing Techniques Are Used in E-Cardiology, along with the Most Widely Used AI Classifiers/Models?*

RQ 3 is divided into two subsections. The first subsection investigates and compares various AI Models for the classification and prediction of CVD. This part explores various studies that use different machine learning and deep learning models for CVD prediction. Our study also provides a comprehensive explanation about the algorithms and methodologies used for prediction and classification techniques and the different datasets and performance metrics that we used to evaluate the models. Furthermore, the data preprocessing techniques used with different classifiers are also indicated in the second subsection below.
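To make the notion of performance metrics concrete, the sketch below computes three measures that are commonly reported for binary disease classifiers: accuracy, sensitivity, and specificity. The choice of these three metrics, the function names, and the example labels are assumptions for illustration; the reviewed studies may report other metrics.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth == positive:
            fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity (recall on diseased), specificity (recall on healthy)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical labels: 1 = disease present, 0 = disease absent.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
print(binary_metrics(y_true, y_pred))
```

Reporting sensitivity and specificity alongside accuracy matters when the classes are imbalanced, since a classifier that always predicts the majority class can still reach a high accuracy.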

#### 5.3.1. AI Classifiers/Models and E-Cardiology

The prediction of CVD is a much discussed topic of research in the realm of healthcare. AI-based prediction systems can be of great help in detecting disease at an earlier stage which can reduce risk associated with disease progression. The concept of AI is not new in cardiac electrophysiology with automated ECG interpretation. It has existed in some form or other since the 1970s [87].

Artificial Intelligence (AI) is the reproduction of human cognitive functions, acquired from the surroundings, by applying algorithms, pattern matching, cognitive computing, and deep learning to achieve specific objectives [88]. The ongoing progress in AI, primarily in the sub-domains of machine learning (ML) and deep learning (DL), has caught the attention of physicians hoping to develop newly integrated, dependable, and potent methods for ensuring standard healthcare in the critical field of cardiology.

Machine learning (ML) is a subset of AI that "teaches" computers to analyze huge datasets in a quick, accurate, and efficient manner by using complex computing and statistical algorithms [89]. Supervised ML is more successful in predicting survival compared to the traditional clinical risk scores [90].

The study [91] proved that the accuracy of disease prediction can be increased by using an unsupervised type of ML for obstructive coronary artery disease in nuclear cardiology.

Deep learning (DL) is a supervised ML methodology that relies on neural networks and is known for the automated algorithms required to extract meaningful patterns from data collections [92]. In the medical context, the most widespread deep learning algorithms are artificial neural networks (ANN), multilayer perceptron (MLP), convolutional neural networks (CNN/ConvNet), recurrent neural networks (RNN), radial basis function network (RBFN), deep belief networks, and deep neural networks (DNN) [88]. Compared with traditional supervised ML, the real strength of DL is that it is an effective, powerful, and flexible approach to representing complicated raw input data that does not demand manual feature engineering. For instance, while addressing the issue of automated ECG interpretation, early conventional supervised ML techniques depended on human-defined ECG features. In contrast, the modern DL model extracts patterns within raw ECGs to detect sinus rhythm and various other arrhythmias with a performance that equals that of a cardiologist [93].
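To illustrate what a human-defined ECG feature looks like, the sketch below derives a few simple features from the times of successive R peaks: the RR intervals, the mean heart rate, and the spread of the intervals. It assumes that the R-peak times have already been detected by some upstream step; the function names, the feature set, and the example timestamps are illustrative and are not taken from the reviewed studies. A DL model, by contrast, would take the raw ECG samples as input and learn its own representation.

```python
from statistics import mean, pstdev


def rr_intervals(r_peak_times_s):
    """Differences between consecutive R-peak times, in seconds."""
    return [later - earlier
            for earlier, later in zip(r_peak_times_s, r_peak_times_s[1:])]


def rr_features(r_peak_times_s):
    """Hand-crafted features: mean RR, mean heart rate, and RR variability."""
    rr = rr_intervals(r_peak_times_s)
    if not rr:
        raise ValueError("at least two R peaks are required")
    if min(rr) <= 0:
        raise ValueError("R-peak times must be strictly increasing")
    mean_rr = mean(rr)
    return {
        "mean_rr_s": mean_rr,
        "heart_rate_bpm": 60.0 / mean_rr,
        "rr_std_s": pstdev(rr),  # population standard deviation of RR intervals
    }


# Hypothetical R-peak times (seconds) for a perfectly regular rhythm.
regular = [0.0, 0.5, 1.0, 1.5, 2.0]
print(rr_features(regular))
```

A regular rhythm gives an RR spread of zero, whereas an irregular rhythm such as atrial fibrillation would be expected to give a larger spread, which is why interval-based features were a natural choice for conventional classifiers.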

The significant areas of cardiac healthcare that can benefit from ML/DL techniques are prognosis, diagnosis, classification, treatment, and clinical workflow. Table 12 presents an overview of different AI algorithms extracted from the literature review on heart disease diagnosis/classification.

**Table 12.** Overview of common AI techniques used in E-Cardiology.


A comparative analysis of different AI techniques frequently used in Smart cardiology for the prognosis/diagnosis of various CVDs is given in Table 13.

According to WHO projections, by 2030 almost 23.6 million individuals will die from heart-related causes [94]. CVDs are the main cause, but they can be cured and prevented. To reduce the risk involved, analysis is fundamental. The difficult part is accurate diagnosis [95].

Table 14 summarises the most recent work performed in the field of artificial intelligence related to CVDs.



PCA-KNN—principal component analysis with K-nearest neighbor; NB—naïve Bayes; RF—random forest; SVM—support vector machine; LR—logistic regression; DL—deep learning; ANN—artificial neural networks; BP—backpropagation; Y—yes; S—slow; F—fast; L—low; H—high; N—no.

**Table 14.** Summary of AI-methodologies and data preprocessing techniques identified for E-Cardiology, from different studies.



**Table 14.** *Cont.*

H—high; M—medium; N/A—not available; DL—deep learning; MPNN-BP—multilayer perceptron neural network-backpropagation; RF—random forest; WPE—wavelet packet entropy; CNN—convolutional neural network; CL—K means clustering; ZAS—Z Alizadeh; PTB—Physikalisch-Technische Bundesanstalt; NB—naive Bayes; FFT—fast Fourier transform; DNN—deep neural network; SVM—support vector machine; NN—neural networks; BBNN—block-based neural network; MC—Mayo Clinic; HWT—Haar wavelet transform; EMD—empirical mode decomposition; RBNN—radial basis function neural network; OCAD—obstructive coronary artery disease; DBN—deep belief networks; ROC—receiver operating characteristic curve; DHCAF—dynamic heartbeat classification with adjusted features; SFS—sequential forward transform; DOST—discrete orthogonal Stockwell transform; DOM—difference operation method; CCTA—cardiac CT angiography; PSO—particle swarm optimization; CUDB—Creighton University database; VFDB—ventricular fibrillation database; LS-SVM—least square SVM; OR—online repository; DWT—discrete wavelet transform; MCHCNN—multichannel heartbeat convolutional neural network.

#### 5.3.2. Data Preprocessing Techniques in E-Cardiology

This section identifies and evaluates studies that applied data preprocessing techniques in cardiac disease classification. Data preprocessing (DP) is a critical stage that improves data quality so that meaningful insights can be drawn, and it is the first step in developing an AI model. Real-world data are rarely in an appropriate format and often contain errors or outliers. They may also lack specific attribute values or trends, which results in an inadequate AI model. Data preprocessing addresses this problem by cleaning and organizing raw data to suit the needs of building and training AI models. Hence, data preprocessing is a data mining approach that reshapes raw data into a readable format that an AI model can use to meet high standards of performance [120]. Consequently, the algorithm can easily interpret the data's features. There are four primary types of data preprocessing: (1) data cleaning, (2) data integration and formatting, (3) data transformation, and (4) data reduction.

The preprocessing techniques used in past studies for diagnosing heart disease and various types of arrhythmia are also listed in Table 14, along with the AI models. The table also lists the task performed by each preprocessing technique.
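The four preprocessing steps named in this section can be illustrated with a minimal sketch. The records, field names, and helper functions below are hypothetical and chosen only for illustration; they are not taken from any of the reviewed studies, and real pipelines would use more careful strategies (for example, imputation instead of dropping rows).

```python
# Hypothetical toy records; field names and values are invented for illustration.
vitals = [
    {"id": 1, "heart_rate": 72.0, "spo2": 98.0},
    {"id": 2, "heart_rate": None, "spo2": 97.0},   # missing attribute value
    {"id": 3, "heart_rate": 110.0, "spo2": 91.0},
]
admissions = {1: {"ward": "CCU"}, 2: {"ward": "CCU"}, 3: {"ward": "CCU"}}


def clean(rows):
    """(1) Data cleaning: drop rows with a missing attribute value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def integrate(rows, extra):
    """(2) Data integration: merge a second source keyed by record id."""
    return [{**r, **extra.get(r["id"], {})} for r in rows]


def transform(rows, key):
    """(3) Data transformation: min-max scale one numeric attribute to [0, 1]."""
    vals = [r[key] for r in rows]
    lo, hi = min(vals), max(vals)
    span = (hi - lo) or 1.0  # avoid division by zero for constant columns
    return [{**r, key: (r[key] - lo) / span} for r in rows]


def reduce_(rows):
    """(4) Data reduction: drop attributes that are constant across all rows."""
    keys = [k for k in rows[0] if len({r[k] for r in rows}) > 1]
    return [{k: r[k] for k in keys} for r in rows]


rows = reduce_(transform(integrate(clean(vitals), admissions), "heart_rate"))
```

In this toy case the record with the missing heart rate is removed, the heart rate is rescaled, and the `ward` attribute is dropped because it carries no discriminating information.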

#### *5.4. RQ 4: What Are the Major Issues and Challenges in Current E-Cardiology?*

After conducting comprehensive research, we identified some significant benefits and major challenges in the field of MIoT to answer our RQ 4. These challenges and benefits have been emphasized on the basis of studies conducted by different researchers in the domain of MIoT and E-Cardiology. Based upon selection and rejection criteria, only valid and reliable papers were selected, as mentioned earlier. We incorporated only the latest benefits and challenges that were found to be unique in the domain of IoT and AI regarding healthcare and cardiology. A pictorial representation of these benefits and challenges is shown in Figure 4.

**Figure 4.** Benefits and challenges of E-Cardiology.

#### 5.4.1. Benefits of E-Cardiology

*Internet of Things (IoT)* links "things" such as devices, gadgets, vehicles, and sensors. Likewise, the medical Internet of Things (MIoT)-based cardiovascular healthcare system monitors the physical symptoms [5] of cardiac patients at a very reasonable cost. These physical symptoms include temperature, blood pressure (BP), SpO2, and heartbeat, along with ECG [6] and associated numerical measurements. Significant benefits of IoT-based cardiology from various studies are noted in Table 15.




#### **Table 15.** *Cont.*

A comparison of some key factors in Smart cardiac care is shown in Table 16. *Artificial intelligence (AI)* is another significant aspect of E-Cardiology. Until now, AI in cardiac electrophysiology has exhibited promising results. The primary advantages of AI-based cardiology are discussed in Table 17.

**Table 16.** Comparison of the existing evaluation factors in Smart cardiac care.




✓—parameter mentioned; ✗—parameter not mentioned.


**Table 17.** Key challenges and primary benefits of AI-based cardiology.

#### 5.4.2. Challenges of E-Cardiology

*Internet of Things (IoT)* comes with various challenges, as shown in Table 15. In addition to these challenges, ref. [129] proposed some other vital challenges, such as fixation of sensors, body impact on signal propagation, and synchronization, that may affect critical health services such as cardiac care. MIoT-based healthcare systems are expected to produce a vast amount of data. Moreover, their sensors and devices are linked through networks to enable real-time transmission of data, which makes them a potential target for hackers. In addition, the timely availability of medical data affects the patient's life. Consequently, it is crucial to have real-time information with low latency over the network [124].

*Artificial intelligence (AI)* has brought a revolution in the field of healthcare. It has especially made a great contribution in the domain of cardiac care, such as timely prediction and diagnosis of cardiovascular diseases (CVDs), ECG analysis, and arrhythmia classification. However, despite all these milestones, it carries some challenges. Table 17 describes critical issues and challenges of AI-based cardiology from different past studies.

It is hoped that these challenges can be met and that through MIoT and AI we can achieve new levels of technical and medical standards in the field of cardiac healthcare.

#### **6. Discussion**

Figure 5 shows the distribution of the selected studies (104 papers) according to our four research questions (RQs). The pie chart shows that 47 studies overlap RQ1 and RQ2, whereas 33 studies address RQ3 and 24 address RQ4.

**Figure 5.** Statistical analysis of reviewed papers in terms of formulated RQs.

The outcomes of this systematic literature review suggest a noticeable increase in the research conducted in IoT using artificial intelligence in E-Cardiology. The research activities also incorporate monitoring CCU parameters, ECG analysis, and diagnosis/classification of various heart diseases. This extensive review study reveals that IoT sensors utilized for E-Cardiology are based upon analog sensors, digital sensors, and different wearable sensors or device modules. For fitness tracking and health monitoring activities, a variety of ready-made and wearable watches are also available. E-Cardiology patients can use a variety of wearable gadgets to track various cardiac healthcare characteristics. To establish expert E-Cardiology systems, several communication technologies and protocols for transmissions over a defined range need to be instituted and maintained. Our findings on communication technologies and protocols for the implementation of IoT-based cardiovascular healthcare monitoring and detection systems show that the following technologies provide viable tools to use in MIoT systems: Bluetooth (BL/BLE), Ethernet, global system for mobile communications (GSM), global positioning system (GPS), general packet radio service (GPRS), message queuing telemetry transport (MQTT), short message service (SMS), email, Zigbee, transmission control protocol/Internet protocol (TCP/IP), wireless fidelity (Wi-Fi), security protocols, broadband/Internet, cloud, Smart cars, Smart phones/computers, etc.

Our findings show that deep learning is often used in cardiac imaging procedures, particularly in echocardiography [88]. Furthermore, CNNs have been evaluated and found useful in calculating coronary artery calcium in cardiac CT angiography [101]. Though deep learning techniques are garnering attention in the field of Smart cardiology, we may infer that instead of depending on a particular AI model, hybrid techniques are expected to produce better results. From the results of our study, it may also be inferred that SVM was more frequently used in cardiac care than was deep learning. However, in recent years, deep learning has emerged as a more powerful and reliable tool for the detection and diagnosis of various heart diseases. It was also noted that data reduction appears to be a major concern of researchers when applying data mining/data preprocessing approaches to predict CVD.

This literature review includes 24 studies devoted to major issues and challenges in E-Cardiology. These studies conclude that a major benefit of MIoT in cardiac healthcare is that it generates timely and accurate data, which results in better healthcare outcomes. One vital advantage of using ML techniques is the ability to fuse various types of data [143]. We also assessed the results and noted that MIoT devices do not possess the requisite data protocols and standards [131]. Therefore, many issues must be addressed to ensure IoT privacy and security [144,145], which is one of the major challenges of the IoT era. Moreover, the continuous monitoring of critical indicators requires reduced energy consumption and a longer battery life [146] to prevent a break in communication. This is also one of the significant challenges of MIoT. The medical data gathered using MIoT are seldom standardized and often fragmented. The data in legacy IT systems are usually generated in incompatible formats. Thus, the great challenge of interoperability needs to be addressed as well [124,147]. However, to move forward in the field of MIoT, fearing AI is not an option.

Instead, we should work toward the smooth digitization of healthcare infrastructure [148]. Obviously, various benefits of AI cannot be implemented and utilized correctly without integrating AI into clinical decision making effectively and responsibly [149].

Figure 6 shows the number of selected papers per year, that is, the year-wise trend in publications in the field of E-Cardiology. It also indicates that the largest number of papers selected for this survey was from 2019.

#### *6.1. Gaps, Future Recommendations*

In future work, all the vital cardiac parameters can be combined in an intelligent cardiac care unit (CCU) to develop a complete picture of E-Cardiology. These parameters/indicators may comprise temperature, blood pressure, oxygen saturation, heart rate, and ECG analysis. In addition to differentiating between normal and abnormal heartbeats, a Smart CCU can also be used to detect QRS complexes in electrocardiographic (ECG) signals to determine the presence of a cardiac malady and different arrhythmias. Integrated and wearable IoT solutions, which address all the necessary cardiac parameters of a heart patient, need to be implemented. The results and accuracy of the devices/sensors used in the development of an IoT-based cardiac system cannot be compromised. The implemented system must be tested, evaluated, and approved under the supervision of cardiologists. Advanced communication technologies, including secure network protocols, must be implemented. Data accessibility features, such as widespread data access, must be possible in a secure environment so that data confidentiality and integrity are maintained. Ubiquitous access is also an important factor and can be achieved by storing the digital data on a cloud server.

#### *6.2. Limitations of the Review Study*

This review of literature has some limitations. First, many papers were conference proceedings; therefore some parameters remained inaccessible since their authors did not mention them in detail. Second, some of the studies on AI-based IoT architecture in cardiac healthcare could not be located even after following a comprehensive search protocol, such as gray literature and reports that were not published in the databases which we selected for review. Therefore, we suggest that an additional systematic study be conducted to cover the related literature from other important databases.

**Figure 6.** Year-wise trends in publications in the field of E-Cardiology.

#### **7. Conclusions**

This review study outlines a total of 104 primary studies from 2016 to 2021 based on our filtering process for supporting the proposed research. Quality assessment of the selected studies was conducted for the formulated research questions after a rigorous analysis.

This work mentions different sensors and communication technologies being used in cardiology. Moreover, this review study also describes various preprocessing techniques and AI algorithms used in the existing studies to diagnose and classify CVD and ECG analysis. This systematic review also provides comparative analysis of various existing techniques in the field of AI, medical sensors, and communication technologies. Finally, this study targets various advantages and issues indicated in the existing literature in the field of E-Cardiology. The interaction of MIoT and artificial intelligence makes cardiac healthcare more manageable by making various applications, services, communication protocols, third party APIs, and IoT sensors available. E-Cardiology guarantees more privacy and security to the IoT devices which are prone to hackers. Furthermore, AI-based diagnosis of various cardiovascular diseases in E-Cardiology helps save time, enabling cardiologists to focus more on treatment. This systematic work presents a review protocol to analyse how IoT applications assist cardiac healthcare and how various artificial intelligence (AI) models contribute to present and prospective research work of IoT in E-Cardiology. This study also indicates how different communication technologies bring privacy and security to IoT devices related to cardiac healthcare. The purpose of this paper is to highlight the influence of IoT, communication technologies, and AI techniques in cardiac healthcare in light of our systematic literature review protocol. Therefore, one can say that this systematic review covers the complete and latest infrastructure of E-Cardiology, along with its benefits and challenges which have not been examined before in such a comprehensive way.

**Author Contributions:** Conceptualization, R.I. and M.A.K.; methodology, R.I. and U.U.; validation, U.U. and S.N.; formal analysis, U.U. and S.N.; investigation, U.U. and S.N.; resources, R.I., U.U. and S.N.; writing—original draft preparation, U.U., S.N. and A.U.; writing—review and editing, R.I., U.U. and S.N.; visualization, A.U. and S.N.; supervision, R.I. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**

*Selected Literature*

A summary of the selected literature, consisting of the 104 papers used in this systematic literature review, is given in Table A1.

**Table A1.** Selected Literature.




#### **References**


## *Review* **Deep Learning for LiDAR Point Cloud Classification in Remote Sensing**

**Ahmed Diab 1, Rasha Kashef 2,\* and Ahmed Shaker 1,\***


**Abstract:** Point clouds are one of the most widely used data formats produced by depth sensors. There is a lot of research into feature extraction from unordered and irregular point cloud data. Deep learning in computer vision achieves great performance for data classification and segmentation of 3D data points as point clouds. Various research has been conducted on point clouds and remote sensing tasks using deep learning (DL) methods. However, there is a research gap in providing a road map of existing work, including limitations and challenges. This paper focuses on introducing the state-of-the-art DL models, categorized by the structure of the data they consume. The models' performance is collected, and results are provided for benchmarking on the most used datasets. Additionally, we summarize the current benchmark 3D datasets publicly available for DL training and testing. Our comparative study concludes that convolutional neural networks (CNNs), namely Dynamic Graph CNN (DGCNN) and ConvPoint, achieve the best performance in various remote-sensing applications while being lightweight models.

**Keywords:** point clouds; deep learning; remote sensing

#### **1. Introduction**

Light detection and ranging (LiDAR) mapping generates precise spatial information about the shape and surface components of the Earth. Advancements in LiDAR mapping systems and their technologies have made it possible to examine natural and manmade environments across various scales with higher accuracy, precision, and flexibility [1]. LiDAR remote sensing provides an accurate 3D representation of scanned areas with many features that provide great performance for various applications. Such applications include Digital Elevation Model (DEM), Digital Surface Model (DSM), and Digital Terrain Model (DTM) generation, which, combined with intensity data, achieve excellent performance in urban land cover classification [2]. Some other urban applications include pavement crack detection [3], collapsed building detection [4], road markings and fixtures extraction and classification [5], cultural heritage classification [6], and change detection [7]. Because LiDAR is sensitive to variations in vertical vegetation structure, it is very effective for natural resources [8] and forest applications [7], such as tree species classification [9]. Additionally, full-waveform LiDAR adds more advantages to using LiDAR in forestry applications [10].

Various deep learning models have been developed with outstanding performance for data classification on point cloud datasets in multiple applications. Existing deep learning methods for point cloud classification include architectures based on the traditional neural network, the Multi-Layer Perceptron (MLP). These models are called PointNet-based, as they build on the pioneering work of PointNet [11]. PointNet performs well and is very lightweight but suffers from local information loss. Global features describe a scene, object, or image as a whole, whereas local features are extracted at different points and represent patches of the scene or image [12]. PointNet++ [13] mitigates the loss by building a feature aggregation pyramid to learn hierarchically, similar to how a traditional convolutional network learns. One of the biggest challenges of using LiDAR point clouds in deep learning is the unstructured shape of point cloud data; a convolutional kernel that works on uniform grid-structured data cannot be directly applied to the raw point cloud. A convolutional neural network (CNN) can better capture spatial features and therefore performs better than a traditional neural network while being more lightweight than most handcrafted models. The convolutional neural network is structured as a convolution layer, a non-linearity, e.g., the rectified linear unit (ReLU), and pooling layers to distil features from low-level to high-level [14]. One way of applying CNNs to point clouds is to project the point cloud to 2D to obtain images that can then be fed into traditional convolution layers. Another approach is to resample or restructure the point cloud into uniform volumetric grids using occupancy functions and to build the CNN from 3D convolutional layers. A third is to design novel convolutional layers that operate directly on point sets and to build the CNN from this custom convolution operation.

**Citation:** Diab, A.; Kashef, R.; Shaker, A. Deep Learning for LiDAR Point Cloud Classification in Remote Sensing. *Sensors* **2022**, *22*, 7868. https://doi.org/10.3390/s22207868

Academic Editors: Andrei Velichko, Dmitry Korzun and Alexander Meigal

Received: 12 September 2022 Accepted: 9 October 2022 Published: 16 October 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

This paper provides a roadmap for current deep learning (DL) models for LiDAR point cloud classification in remote sensing. Existing deep learning methods can be classified as projection-based and point-based models. Each category has specific strengths as well as limitations. Thus, this paper summarizes the significant subcategories: 2D projection, multiview projection, voxelization, convolution-based networks, and graph convolutional networks. Additionally, we cover some examples that encompass most of the fundamentals within each subcategory. Remote sensing applications require different datasets or workflows; thus, we cover some examples from remote sensing that employ or build upon computer vision models. Our comparative analysis shows that DGCNN and ConvPoint achieve the best performance in various remote-sensing applications while being lightweight models. The rest of this paper is organized as follows: Section 2 gives an overview of LiDAR point cloud data and processing; Section 3 presents point cloud computing tasks that are common in remote sensing applications; Section 4 introduces the primary computer vision deep learning models that are often used to classify 3D data; Section 5 introduces the benchmark 3D datasets used in training and testing of deep learning models, grouped as object, indoor, aerial scanned, mobile scanned, and terrestrial scanned datasets; Section 6 presents the evaluation metrics commonly used to measure and benchmark model performance; and Section 7 provides a comparative analysis of existing models on different datasets for different classification tasks. Finally, Section 8 concludes the paper.

#### **2. LiDAR Point Clouds**

A typical LiDAR system in remote sensing uses a laser, a Global Positioning System (GPS) receiver, and an Inertial Measurement Unit (IMU) to approximate the heights of objects on the ground. Discrete LiDAR data are generated; each point represents a high-energy peak in the rebounded energy. Discrete LiDAR points contain each point's x, y, and z values, and the z value is used to obtain height. The LiDAR data can be used to estimate surface structures with various methods [15]. The raw LiDAR data are delivered as points, known as point clouds, that can be further processed to create Digital Elevation Models (DEMs) or Triangulated Irregular Networks (TINs) [1]. Point data are commonly stored in LAS (LASer) format, regarded as an industry standard that contains information in a binary file specific to the LiDAR nature of data without being complex [15]. The LiDAR data can also contain other information such as the intensity of the rebounds, the point classification (if applicable), number of returns, time, and source of each point [1,15]. LiDAR scanners use a laser pulse to measure the distance from the sensor, either from the time the pulse takes to return, in the case of time-of-flight sensors (Figure 1a) [16], or from the triangulation angle on the optical sensor, in the case of triangulation-based scanners (Figure 1b) [17]. The LiDAR scanners then generate an [x, y, z] position relative to the sensor's location based on the distance from the sensor and the degrees of rotation of the sensor, such as pitch, roll, and yaw [18]. Most LiDAR sensors also measure the intensity of the return signal, which can be used to differentiate between surface types with different reflectivity [1]. Additionally, the sensor is often paired with a GPS and an IMU to capture data required for georeferencing and mapping of the point cloud.

**Figure 1.** Time of Flight LiDAR sensor calculation (**a**) [16] and triangulation-based LiDAR calculation (**b**) [17].
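As a rough illustration of the time-of-flight case, the range follows from the round-trip time of the pulse, and a sensor-frame position follows from the range and the beam angles. The sketch below is a simplification under stated assumptions: the azimuth/elevation angle convention and the function names are hypothetical, and platform attitude (pitch, roll, yaw) and georeferencing are ignored.

```python
import math

C = 299_792_458.0  # speed of light in vacuum, m/s


def tof_range(round_trip_s):
    """Range from a time-of-flight measurement: the pulse travels out and back."""
    return C * round_trip_s / 2.0


def polar_to_xyz(rng, azimuth_rad, elevation_rad):
    """Convert range and beam angles to a sensor-frame (x, y, z).

    Assumed convention: azimuth is measured in the x-y plane from the x axis,
    elevation is measured upwards from the x-y plane.
    """
    horiz = rng * math.cos(elevation_rad)
    return (horiz * math.cos(azimuth_rad),
            horiz * math.sin(azimuth_rad),
            rng * math.sin(elevation_rad))


r = tof_range(1e-6)  # a 1 microsecond round trip, roughly 150 m
x, y, z = polar_to_xyz(r, math.radians(90.0), math.radians(-30.0))
```

With the beam pointing 30 degrees below the horizontal, the z coordinate is half the range and negative, which is the kind of value a downward-looking scanner would report before georeferencing.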

For supervised classification, a significant challenge when working on LiDAR point clouds is the variation in density inherent in the nature of the data. The density of similar objects also varies, as it depends on the speed of the vehicle carrying the sensor. Some areas will be too dense and expensive to process, requiring some form of downsampling. Other regions of a point cloud will have few or no points present. Additionally, for LiDAR point clouds that include intensity values, the intensity of the same object can be affected by different conditions, so that the same object has slightly different intensities [18].
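One common way to handle overly dense regions is voxel-grid downsampling, in which all points that fall into the same cubic cell are replaced by their centroid. The following is a minimal sketch of that idea; it is one possible downsampling scheme and is not attributed to any specific study cited here, and the function name and voxel size are illustrative.

```python
from collections import defaultdict


def voxel_downsample(points, voxel_size):
    """Keep one centroid per occupied voxel of edge length voxel_size."""
    buckets = defaultdict(list)
    for p in points:
        # Integer voxel index along each axis (floor division handles negatives).
        key = tuple(int(c // voxel_size) for c in p)
        buckets[key].append(p)
    # Average the coordinates of the points collected in each voxel.
    return [tuple(sum(c) / len(ps) for c in zip(*ps)) for ps in buckets.values()]


dense = [(0.1, 0.1, 0.1), (0.3, 0.1, 0.1), (2.5, 0.0, 0.0)]
sparse = voxel_downsample(dense, voxel_size=1.0)
```

Dense areas collapse to at most one point per voxel, while sparse areas are left essentially unchanged, which evens out the density without inventing points where none were measured.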

#### **3. Point Cloud Computing**

Remote sensing data go through multiple processing steps to generate information that can be consumed for production. Over the past few years, deep learning has been applied to almost all aspects of remote sensing data processing, most notably classification and segmentation tasks. For remote sensing 3D LiDAR point clouds, there is limited interest in whole-scene classification and more in semantic classification or segmentation tasks. Other tasks tackled by deep learning include change detection, registration, fusion, and completion.

Traditionally, deep learning classification describes classifying an entire scene or an object as belonging to a specific class as a whole. One example of classification tasks that use 3D point clouds in remote sensing is the classification of previously segmented tree species or roof types. However, remote sensing classification tasks mostly involve semantic classification and segmentation rather than assigning an entire scene or object to a single class. A significant example of semantic classification is land use/land cover classification of Terrestrial and Aerial Laser Scanned (TLS/ALS) data. Segmentation divides and assigns the data into different target classes and is split into three types: semantic, instance, and panoptic segmentation [19]. Semantic segmentation assigns every point/pixel from the input data to one of the target classes without distinguishing different objects; for example, all tree points will be labelled trees. Instance segmentation involves identifying and labelling objects belonging to target classes while distinguishing them from each other, such as tree1, tree2, etc. Panoptic segmentation classifies every point/pixel in the input as part of a class while distinguishing separate objects of a class from each other [19].
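The difference between the three segmentation types can be made concrete with per-point labels. The toy labels below are hypothetical; the convention of giving no instance id to uncountable classes such as ground is an assumption made for this sketch.

```python
# Hypothetical labels for six points; class names and ids are invented.

# Semantic segmentation: one class per point, objects are not distinguished.
semantic = ["tree", "tree", "tree", "ground", "ground", "tree"]

# Instance segmentation: an object id per point for countable classes only
# (None marks points of uncountable classes such as ground).
instance = [1, 1, 2, None, None, 2]

# Panoptic segmentation: every point carries a class, and points of countable
# classes also carry the id of the object they belong to.
panoptic = list(zip(semantic, instance))

# Distinct tree objects found in the scene (tree1, tree2 in the text above).
tree_ids = {obj for cls, obj in panoptic if cls == "tree"}
```

Here the semantic labels alone cannot tell how many trees there are, whereas the panoptic labels identify two separate trees and still label every ground point.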

The most common application of fusion in LiDAR remote sensing is the fusion of 3D point clouds and RGB images to train a deep learning model for classification and segmentation tasks [20–22]. The features extracted from both types of data are used to improve the per-class performance in the target application. Registration is the process of matching and aligning two or more images, or point clouds in the case of LiDAR data, obtained from different viewpoints and/or using different sensors; one example is illustrated in [23], which achieves state-of-the-art performance. Completion is the process of filling in missing information from datasets that could result from the limitations of the sensors, conditions at the time of data capture, or the method of capture. At long range, the spatial resolution of a LiDAR sensor is lower, sometimes resulting in finer details, such as road markings, signs, poles, etc., showing up incomplete. One example of completion can be found in [5]. Most completion tasks on LiDAR point clouds are done before training a classification model to improve performance and robustness.

#### **4. Deep Learning Models**

Advances have been made to produce DL models that are lightweight and efficient. Feature learning models on 3D point clouds can be categorized as projection-based and point-based models. This section briefly discusses models used as backbones or improved for newer networks.

#### *4.1. Projection-Based Methods*

Some projection-based models create 2D projections from 3D point clouds and use traditional 2D feature learning. This process primarily depends on projection direction (X, Y or Z—default: Z) and other aspects such as the grid (size, scale, shape). Other projection models create volumetric grids or voxels through 3D feature extraction layers.
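A top-down projection along the Z direction can be sketched as rasterizing the points onto a regular grid. In this minimal, illustrative version each cell stores the maximum z value of the points falling into it; the cell size, grid shape, and the choice of the maximum as the cell feature are assumptions of the sketch, not parameters taken from any cited model.

```python
def project_to_grid(points, cell, width, height):
    """Top-down (Z-direction) projection: each cell keeps the max z of its points."""
    grid = [[None] * width for _ in range(height)]  # None marks empty cells
    for x, y, z in points:
        col, row = int(x // cell), int(y // cell)
        if 0 <= col < width and 0 <= row < height:  # ignore points off the grid
            if grid[row][col] is None or z > grid[row][col]:
                grid[row][col] = z
    return grid


cloud = [(0.2, 0.2, 1.0), (0.4, 0.1, 3.0), (1.5, 0.5, 2.0)]
height_map = project_to_grid(cloud, cell=1.0, width=2, height=1)
```

The resulting 2D array can be treated like a single-channel image and passed to a standard 2D network; changing the cell size trades spatial detail against the number of empty cells.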

• 2D Convolutional Neural Networks

U-Net [24]: builds on a fully convolutional model and extends it to work with few training data while providing better performance. The contracting path of the U-Net architecture consists of repeated blocks of two unpadded 3 × 3 convolutions, each followed by a ReLU, and a 2 × 2 max pooling with stride 2 for downsampling. At each downsampling step, the number of feature channels is doubled. In the expansive path, the features are upsampled and followed by a 2 × 2 convolution that halves the number of channels. The resulting feature map is concatenated with the correspondingly cropped feature map from the contracting path and passed through two 3 × 3 convolutions, each followed by a ReLU. The cropping is necessary because of the border pixels lost in every convolution. Finally, a 1 × 1 convolution is applied to label pixels and generate segmentation results.
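Because the convolutions are unpadded, the output map is smaller than the input, which is why cropping is needed. The size arithmetic can be traced with a short sketch. The depth of four pooling steps and the 572-pixel input tile used below are assumptions taken to match the commonly cited U-Net configuration; they are not stated in the text above.

```python
def unet_output_size(size, depth=4):
    """Spatial size after a U-Net with `depth` pooling steps and unpadded 3x3 convs."""
    for _ in range(depth):
        size -= 4                 # two unpadded 3x3 convolutions lose 2 pixels each
        if size % 2:
            raise ValueError("feature map must have even size before 2x2 pooling")
        size //= 2                # 2x2 max pooling with stride 2
    size -= 4                     # two convolutions at the bottom of the "U"
    for _ in range(depth):
        size = size * 2 - 4       # 2x upsampling, then two unpadded 3x3 convs
    return size


out = unet_output_size(572)
```

Under these assumptions a 572 × 572 input tile yields a 388 × 388 segmentation map, so each skip connection has to crop its feature map to the size of the upsampled one before concatenation.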

DeepLab [25]: employs atrous convolution [25,26] to change the field of view of the convolution and extract global features while also allowing larger networks without extra parameters. DeepLab proposes Atrous Spatial Pyramid Pooling (ASPP) to segment at different scales by applying the same filters at different sampling rates and fields of view; the outputs are then added together. To overcome the toll that downsampling and max pooling operations take in deep convolutional neural networks (DCNNs), DeepLab implements the fully connected Conditional Random Field (CRF) from [27], which is trained separately from the rest of the network. The later iterations DeepLabV3 [28] and DeepLabV3+ [29] improve on the performance of DeepLab. Unlike [25], DeepLabV3 [28] performs batch normalization within ASPP. Additionally, global average pooling is applied to the last feature map. The resulting image-level features are fed into a 1 × 1 convolution with 256 filters and then upsampled to the desired spatial dimension. DeepLabV3 abandons the CRF and instead concatenates the resulting features and passes them through another 1 × 1 convolution with 256 filters before computing the final logits. DeepLabV3+ [29] uses a decoder module to refine segmentation results, especially around object boundaries. Depth-wise separable convolutions are applied to both the ASPP and decoder modules, resulting in a faster and more robust network.
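The effect of the sampling rate in atrous convolution can be shown in one dimension. In the sketch below, which is a simplified illustration and not DeepLab's implementation, the same three weights cover a field of view of (k − 1) · rate + 1 samples; the signal and kernel values are made up.

```python
def atrous_conv1d(signal, kernel, rate):
    """'Valid' 1D atrous convolution: kernel taps are spaced `rate` samples apart.

    As in most deep learning frameworks, the kernel is not flipped
    (the operation is technically a cross-correlation).
    """
    span = (len(kernel) - 1) * rate + 1          # effective field of view
    return [sum(w * signal[i + j * rate] for j, w in enumerate(kernel))
            for i in range(len(signal) - span + 1)]


signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]
dense = atrous_conv1d(signal, kernel, rate=1)    # field of view of 3 samples
dilated = atrous_conv1d(signal, kernel, rate=2)  # field of view of 5, same 3 weights
```

Raising the rate widens the context each output sees without adding parameters, which is the property ASPP exploits when it runs several rates in parallel.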

VGGNet [30] evaluates the effect of increasing the depth of a convolutional network using very small 3 × 3 convolution filters. It improves the classification performance compared to previous state-of-the-art models by pushing the depth to 16–19 weight layers. ResNet [31] applies residual learning to every few stacked layers in the convolutional network. The shortcut connections are added without increasing the number of parameters or the computational complexity. Residual learning allows deep networks to gain performance over shallower networks.
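The residual idea can be written as y = ReLU(F(x) + x), where F is the mapping learned by the stacked layers and x is carried over by the shortcut. The sketch below uses plain lists and a made-up residual function in place of real convolution layers, purely to show that the identity shortcut adds no parameters.

```python
def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def residual_block(x, residual_fn):
    """y = ReLU(F(x) + x): the identity shortcut adds no parameters."""
    return relu([fx + xi for fx, xi in zip(residual_fn(x), x)])


def halve_and_negate(values):
    """Hypothetical residual function F; a real block would use stacked conv layers."""
    return [-0.5 * v for v in values]


def zero_residual(values):
    """If F outputs zeros, the block reduces to the identity for non-negative inputs."""
    return [0.0 for _ in values]


y = residual_block([2.0, 4.0], halve_and_negate)
same = residual_block([2.0, 4.0], zero_residual)
```

The second call illustrates why very deep residual networks are easier to optimize: a block only has to learn a zero residual to pass its input through unchanged.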

• Multiview representation

MVCNN [32] tackles 3D feature learning using traditional image-focused networks by making 2D renders of the 3D object from different angles and passing them through a standard CNN. MVCNN generates 80 views of the 3D object by placing 20 virtual "cameras" pointed at the object's centroid and then generating 4 renders per camera at 0-, 90-, 180-, and 270-degree rotation about the axis through the camera and object center. After each image is passed through the first CNN, the outputs are aggregated at a view-pooling layer, which performs an element-wise maximum operation across the different input views before the result passes through the remaining section of the network, i.e., the second CNN.
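The view-pooling step described above amounts to an element-wise maximum over the per-view feature vectors. A minimal sketch follows; the three 4-dimensional feature vectors are invented and stand in for the outputs of the first CNN.

```python
def view_pool(view_features):
    """Element-wise maximum across per-view feature vectors."""
    return [max(vals) for vals in zip(*view_features)]


# Hypothetical 4-dimensional features from three rendered views.
views = [
    [0.1, 0.9, 0.0, 0.3],
    [0.4, 0.2, 0.0, 0.8],
    [0.2, 0.5, 0.7, 0.1],
]
pooled = view_pool(views)
```

Because the maximum does not depend on the order of its arguments, the pooled descriptor is the same regardless of the order in which the views are presented.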

• Volumetric grid representation

VoxNet [33] uses occupancy grids to efficiently estimate occupied, free, and unknown space from ranging measurements. Small, dense grids (32 × 32 × 32 voxels) are used to optimize GPU usage. VoxNet uses a fairly basic 3D CNN to extract and learn features, consisting of five layers: two convolution layers, a pooling layer, and two fully connected layers. The model can perform object classification in real time while achieving state-of-the-art performance. VoxelNet [34] introduces a multi-layer voxel feature encoding (VFE) that enables inter-point interaction within a voxel. The point cloud is divided into equally spaced voxels encoded using the stacked VFE layers, allowing complex local 3D information learning. VoxelNet works on object detection using a Region Proposal Network (RPN) at the final stage to create bounding boxes.
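The conversion of a point cloud into a small occupancy grid can be sketched as follows. This version is a simplification: it produces only a binary occupied/unoccupied label per voxel and does not model free versus unknown space; the normalization to the bounding cube and the sparse set representation are choices made for this sketch.

```python
def occupancy_grid(points, n=32):
    """Binary occupancy: indices of the n x n x n voxels that contain a point."""
    mins = [min(p[a] for p in points) for a in range(3)]
    maxs = [max(p[a] for p in points) for a in range(3)]
    # Scale by the largest extent so the aspect ratio of the object is preserved.
    extent = max(hi - lo for lo, hi in zip(mins, maxs)) or 1.0
    occupied = set()
    for p in points:
        idx = tuple(min(int((p[a] - mins[a]) / extent * n), n - 1) for a in range(3))
        occupied.add(idx)
    return occupied  # sparse set of (i, j, k) voxel indices


cloud = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (1.0, 1.0, 1.0), (0.5, 0.0, 0.0)]
voxels = occupancy_grid(cloud)
```

A dense 32 × 32 × 32 array filled from this set is what a 3D convolution layer would consume; duplicate or near-duplicate points simply map to the same voxel.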

#### *4.2. Point-Based Methods*

Point-based methods consume unstructured and unordered point clouds. Some of the models covered in this section are used as backbones or parts of a larger architecture, while others are adapted for remote sensing tasks with minimal modifications.

• PointNets

PointNet [11] directly consumes point cloud data for feature extraction. The network provides a unified approach to 3D recognition that can be applied to various tasks such as object classification, instance segmentation, and semantic segmentation. PointNet uses Multi-Layer Perceptrons (MLPs) combined with a joint alignment network. To maintain invariance under geometric transformations, the input is passed through a T-Net module [11], where it is multiplied by an affine transformation matrix. PointNet achieves strong performance while remaining lightweight and computationally efficient. However, PointNet cannot capture local features of neighbouring points; to address this, PointNet++ [13] introduces a pyramid feature aggregation scheme. The scheme comprises three stacked layers: the sampling layer, the grouping layer, and the PointNet layer. This allows PointNet++ to extract features in a hierarchical fashion similar to traditional image learning, reducing local information loss. PointASNL [35] is an end-to-end network that effectively deals with noisy point clouds. The two primary components of the model are the adaptive sampling (AS) and the local-nonlocal (L-NL) modules. The AS module first reweighs the neighbour points surrounding the points initially selected by farthest point sampling and then adaptively adjusts the sampled points beyond the point cloud. The L-NL module captures the neighbour and long-range dependencies of the sampled points. Self-Organizing Network (SO-Net) [36] generates a Self-Organizing Map (SOM) to model the spatial distribution of the point cloud. The SOM retrieves hierarchical features from individual points and SOM nodes. A point-to-node search is performed on the output of the SOM for each point. Each point is normalized, and features are learned through a series of fully connected layers. Node features are extracted by channel-wise max-pooling of the point features. The final learned features are extracted using a batch of fully connected layers referred to as a small PointNet.
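The sampling, grouping, and PointNet layers described above can be sketched as follows. This is a simplified illustration, not the implementation in [13]: grouping uses a plain radius query, and the learned PointNet layer is replaced by a channel-wise maximum over local offsets. All function names are hypothetical.

```python
import math


def farthest_point_sampling(points, m, start=0):
    """Pick m indices so that each new point is farthest from those chosen."""
    if not 0 < m <= len(points):
        raise ValueError("m must be between 1 and len(points)")
    chosen = [start]
    # Distance from every point to its nearest chosen point so far.
    nearest = [math.dist(p, points[start]) for p in points]
    while len(chosen) < m:
        nxt = max(range(len(points)), key=nearest.__getitem__)
        chosen.append(nxt)
        nearest = [min(d, math.dist(p, points[nxt]))
                   for d, p in zip(nearest, points)]
    return chosen


def group_by_radius(points, centre_ids, radius):
    """For each sampled centre, list the indices of points within radius."""
    return [[i for i, p in enumerate(points)
             if math.dist(p, points[c]) <= radius]
            for c in centre_ids]


def max_pool_offsets(points, centre_ids, groups):
    """Stand-in for the PointNet layer: channel-wise max of local offsets."""
    pooled = []
    for c, group in zip(centre_ids, groups):
        offsets = [[a - b for a, b in zip(points[i], points[c])]
                   for i in group]
        pooled.append([max(col) for col in zip(*offsets)])
    return pooled


pts = [(0, 0, 0), (1, 0, 0), (10, 0, 0), (11, 0, 0)]
centres = farthest_point_sampling(pts, m=2)
groups = group_by_radius(pts, centres, radius=1.5)
print(centres, groups)
print(max_pool_offsets(pts, centres, groups))
```

Expressing neighbours as offsets from their centre keeps the local descriptor independent of where the neighbourhood sits in the scene.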

• (Graph) Convolutional Point Networks

ConvPoint [37] proposes continuous convolution kernels to allow arbitrary point cloud sizes. Points {q} are selected iteratively from the input point cloud {p} through a score-based process until the target number of points is reached. Using a kd-tree built on the input point cloud, a K-nearest neighbour search in {p} is performed for the points in {q}. A convolution operation is performed for each subset, generating the output features. The operations detailed by ConvPoint are successfully adapted for classification, part segmentation, and semantic segmentation tasks. ConvPoint achieves strong performance while remaining time- and cost-efficient. Dynamic Graph CNN (DGCNN) [38] generates local neighbourhood graphs and applies convolution on the edges connecting pairs of neighbouring points. Unlike traditional graph CNNs, DGCNN uses a dynamic graph in which the set of k-nearest neighbours of a point changes between layers of the network and is calculated from the sequence of embeddings. The EdgeConv block introduced by DGCNN computes edge features for each input point and applies an MLP followed by channel-wise symmetric aggregation. The Taylor Gaussian mixture model (GMM) network (TGNet) [39] is composed of units named TGConv that perform convolution operations parametrized by a family of filters on irregular point sets. The filters are products of geometric features expressed by Gaussian weighted Taylor kernels and local point features extracted from local coordinates. TGConv features are aggregated using parametric pooling to generate feature vectors for each point. TGNet uses a CRF at the output layer to improve segmentation results.
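A rough sketch of an EdgeConv-style layer is given below, assuming the edge input (x_i, x_j − x_i) and a channel-wise maximum as the symmetric aggregation. The function `toy_edge_fn` is a hypothetical stand-in for the learned MLP, and the brute-force neighbour search is used only to keep the example short.

```python
import math


def knn(features, i, k):
    """Indices of the k nearest neighbours of point i (excluding i)."""
    order = sorted((j for j in range(len(features)) if j != i),
                   key=lambda j: math.dist(features[i], features[j]))
    return order[:k]


def edge_conv(features, k, edge_fn):
    """One EdgeConv-style layer.

    For every point x_i, build edge inputs (x_i, x_j - x_i) over its k
    nearest neighbours, apply edge_fn, and aggregate with a channel-wise max.
    """
    out = []
    for i, xi in enumerate(features):
        edges = []
        for j in knn(features, i, k):
            diff = [b - a for a, b in zip(xi, features[j])]
            edges.append(edge_fn(list(xi) + diff))
        out.append([max(channel) for channel in zip(*edges)])
    return out


def toy_edge_fn(edge_input):
    """Hypothetical stand-in for the learned MLP: a fixed ReLU."""
    return [max(0.0, v) for v in edge_input]


points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
layer1 = edge_conv(points, k=1, edge_fn=toy_edge_fn)
# The graph is "dynamic": neighbours are recomputed from layer1's output.
layer2 = edge_conv(layer1, k=1, edge_fn=toy_edge_fn)
print(layer1)
```

Calling `edge_conv` on the output of the previous layer is what makes the graph dynamic: neighbourhoods are found in feature space rather than fixed in the input coordinates.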

#### **5. Benchmark Datasets**

Deep learning on point clouds has attracted growing attention, especially in the last few years. Several publicly available datasets have also been released, which has further supported research on DL development. An increasing number of methods have been introduced to deal with various challenges related to point cloud processing, including 3D shape classification, 3D object detection and tracking, 3D point cloud segmentation, 3D point cloud registration, 6-DOF pose estimation, and 3D reconstruction [18]. Table 1 briefly overviews some of the most commonly used publicly available point cloud datasets. Outdoor datasets are classified by acquisition technique: aerial, mobile, or terrestrial laser scanning (ALS, MLS, and TLS, respectively). The remaining datasets in this paper are indoor laser-scanned datasets and datasets of object scans. While ModelNet40 and S3DIS are not LiDAR-scanned datasets, they are included because we found them to be the most commonly tested datasets for their respective tasks in remote sensing classification. The ModelNet40 dataset consists of CAD files; most point cloud networks are tested on a point cloud sampled from the 3D object files. The models outlined later in the paper that used ModelNet40 were tested by sampling the objects into point clouds and then applying the model. Similarly, S3DIS, while not LiDAR data, is a point cloud, and the models tested on it are suitable for point clouds obtained from LiDAR scans.
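The text does not state how the CAD objects are converted into point clouds. One common approach, assumed here for illustration, is to pick triangles with probability proportional to their area and then draw a uniform point inside each chosen triangle. The function names are hypothetical.

```python
import math
import random


def triangle_area(a, b, c):
    """Area of a 3D triangle from the cross product of two edges."""
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    cross = (u[1] * v[2] - u[2] * v[1],
             u[2] * v[0] - u[0] * v[2],
             u[0] * v[1] - u[1] * v[0])
    return 0.5 * math.sqrt(sum(x * x for x in cross))


def sample_mesh(vertices, faces, n_points, seed=0):
    """Draw n_points uniformly from the surface of a triangle mesh."""
    rng = random.Random(seed)
    tris = [tuple(vertices[i] for i in face) for face in faces]
    areas = [triangle_area(*t) for t in tris]
    samples = []
    for a, b, c in rng.choices(tris, weights=areas, k=n_points):
        # Uniform barycentric coordinates via the square-root trick.
        r1, r2 = math.sqrt(rng.random()), rng.random()
        w = (1 - r1, r1 * (1 - r2), r1 * r2)
        samples.append(tuple(w[0] * a[i] + w[1] * b[i] + w[2] * c[i]
                             for i in range(3)))
    return samples


# A unit square in the z = 0 plane, split into two triangles.
square_vertices = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
square_faces = [(0, 1, 2), (0, 2, 3)]
cloud = sample_mesh(square_vertices, square_faces, n_points=1024)
print(len(cloud), cloud[0])
```

Weighting by area prevents finely tessellated regions of a model from being oversampled relative to large flat faces.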


**Table 1.** Benchmark datasets for training and testing deep learning on 3D point clouds.

#### **6. Performance Metrics**

Various evaluation metrics have been used for segmentation, detection, and classification; they are summarized in Table 2 [53]. The main metrics are the intersection over union (IoU), mean IoU (mIoU), and overall accuracy (OA) [53]. Detection and classification results are mainly analyzed using precision, recall, and F1-score, which are computed from the numbers of true positives (TP), false positives (FP), and false negatives (FN).
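For reference, these metrics can be computed as in the sketch below. It uses the standard textbook definitions, which are assumed to match those summarized in Table 2; returning zero for an undefined ratio is a convention chosen here, not one stated in the text.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1-score from TP, FP, and FN counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def segmentation_scores(y_true, y_pred, num_classes):
    """Return per-class IoU, mean IoU, and overall accuracy."""
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    ious = []
    for c in range(num_classes):
        union = tp[c] + fp[c] + fn[c]
        ious.append(tp[c] / union if union else 0.0)
    miou = sum(ious) / num_classes
    oa = sum(tp) / len(y_true) if y_true else 0.0
    return ious, miou, oa


labels = [0, 0, 1, 1, 2, 2]
preds = [0, 1, 1, 1, 2, 0]
ious, miou, oa = segmentation_scores(labels, preds, num_classes=3)
print(ious, miou, oa)
print(precision_recall_f1(tp=2, fp=1, fn=0))
```

In the toy example, four of the six points are labelled correctly, so OA is 4/6, while the per-class IoU values are 1/3, 2/3, and 1/2.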


**Table 2.** Performance Evaluation Metrics.

#### **7. Comparative Analysis**

The datasets ModelNet40, S3DIS, and Toronto3D provide an overview of benchmarks used for different classification tasks: object classification, indoor scene classification, and urban outdoor classification. Table 3 shows the performance comparison for current 3D object classification, indoor scene segmentation, and outdoor urban semantic segmentation models using various evaluation metrics. The best-performing configuration of each model was selected. For example, using a more densely sampled point cloud in ModelNet40 tests can produce better performance; therefore, if the authors tested a model using different point counts, the best set of results is used. The results outlined in the table are taken from the testing reported by each model's respective author(s), except for the ConvPoint results on Toronto3D, which we obtained for this paper. From Table 3, we can see that DGCNN and ConvPoint achieve the best performance on most datasets while being lightweight relative to models with similar performance. Additionally, these two models have been tested on multiple tasks and different types of datasets. The major limitation of ConvPoint is that the convolutional layer it introduces is scale-agnostic, even though an object's size is important in scans and provides valuable information. DGCNN could be further improved by adjusting its implementation details to increase the computational efficiency of the model.

Most remote sensing papers use one of the previously outlined computer vision models. The model is either deployed directly on the application dataset or modified and attached to pre- and/or post-processing pipelines. To further test the performance of the ConvPoint model, we also trained ConvPoint experimentally on Toronto3D, using the sections labelled L001, L003, and L004 for training and L002 for testing. The training was run with a batch size of 8, a block size of 8, and 8192 points for 100 epochs. The testing results are marked with a (\*) in Table 4. Table 4 includes some applications categorized according to their dataset, performance, and remote sensing deployment. We can conclude that both DGCNN and ConvPoint have shown promising results across the different remote sensing applications.
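The reported experimental setup can be recorded as a small configuration object, sketched below. The values are those stated in the text, while the class and field names are hypothetical and do not correspond to the ConvPoint code base. The unit of the block size is not given in the text and is left uninterpreted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Toronto3DRun:
    """Settings reported in the text; all field names are hypothetical."""
    train_sections: tuple = ("L001", "L003", "L004")
    test_sections: tuple = ("L002",)
    batch_size: int = 8
    block_size: int = 8    # unit not stated in the text
    num_points: int = 8192
    epochs: int = 100

    def points_per_step(self):
        """Points per training step, assuming num_points per batch element."""
        return self.batch_size * self.num_points


run = Toronto3DRun()
# Training and testing sections must not overlap.
assert not set(run.train_sections) & set(run.test_sections)
print(run.points_per_step())  # 65536
```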


**Table 3.** Comparative Analysis of Deployed Models.

**Table 4.** Overview of some deep learning contributions focused on remote sensing data.




<sup>1</sup> F1-score is denoted by \*, mIoU is denoted by ^, and OA is denoted by ~.

#### **8. Conclusions and Future Directions**

Recent work on advances in deep learning for LiDAR 3D point cloud processing was analyzed and summarized. An overview of the different model types and the state-of-the-art and/or fundamental models of each type was provided. Additionally, the performance of the models was reported on datasets for different classification tasks. The strongest-performing models were trending towards 3D Graph CNNs and 3D CNNs [69,70] that work directly on the raw point cloud data. These models can provide state-of-the-art performance while remaining computationally lightweight. Finally, different remote sensing applications that deploy deep learning models were overviewed. One major challenge when comparing the remote sensing models was the lack of standardized test datasets and the frequent use of proprietary datasets. Notable available test datasets are Toronto3D, Paris-Lille 3D, ISPRS 3D, and S3DIS. Future directions would involve expanding the application of the state-of-the-art methods in autonomous driving [71,72].

**Author Contributions:** Conceptualization, A.D., R.K. and A.S.; methodology, A.D., R.K. and A.S.; software, A.D., R.K. and A.S.; validation, A.D., R.K. and A.S.; formal analysis, A.D., R.K. and A.S.; investigation, A.D., R.K. and A.S.; resources, A.D., R.K. and A.S.; data curation, A.D., R.K. and A.S.; writing—original draft preparation, A.D., R.K. and A.S.; writing—review and editing, A.D., R.K. and A.S.; visualization, A.D., R.K. and A.S.; supervision, R.K. and A.S.; project administration, R.K. and A.S.; funding acquisition, R.K. and A.S.; All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Natural Sciences and Engineering Research Council of Canada (NSERC), grant number [RGPIN-2020-05857], and Smart Campus Integrated Platform Development Alliance project with FuseForward. The APC was funded by Toronto Metropolitan University.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), [funding reference number RGPIN-2020-05857], and Smart Campus Integrated Platform Development Alliance project with FuseForward.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**



Disclaimer/Publisher's Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
