**Statistical Machine Learning for Human Behaviour Analysis**

Special Issue Editors

**Thomas Moeslund Sergio Escalera Gholamreza Anbarjafari Kamal Nasrollahi Jun Wan**

MDPI • Basel • Beijing • Wuhan • Barcelona • Belgrade • Manchester • Tokyo • Cluj • Tianjin


*Editorial Office* MDPI St. Alban-Anlage 66 4052 Basel, Switzerland

This is a reprint of articles from the Special Issue published online in the open access journal *Entropy* (ISSN 1099-4300) (available at: https://www.mdpi.com/journal/entropy/special_issues/Statistical_Machine_Learning).

For citation purposes, cite each article independently as indicated on the article page online and as indicated below:

LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. *Journal Name* **Year**, *Article Number*, Page Range.

**ISBN 978-3-03936-228-8 (Pbk) ISBN 978-3-03936-229-5 (PDF)**

© 2020 by the authors. Articles in this book are Open Access and distributed under the Creative Commons Attribution (CC BY) license, which allows users to download, copy and build upon published articles, as long as the author and publisher are properly credited, which ensures maximum dissemination and a wider impact of our publications.

The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons license CC BY-NC-ND.


## **About the Special Issue Editors**

**Thomas B. Moeslund** received his PhD from Aalborg University in 2003 and is currently Head of the Visual Analysis of People lab at Aalborg University (www.vap.aau.dk). His research covers all aspects of software systems for automatic analysis of people. He has been involved in 14 national and international research projects as coordinator, WP leader and researcher. He has published more than 300 peer-reviewed journal and conference papers. His awards include the Most Cited Paper in 2009, Best IEEE Paper in 2010, Teacher of the Year in 2010, and the Most Suitable for Commercial Application award in 2012. He serves as Associate Editor and editorial board member for four international journals. He has co-edited two Special Issues and acted as PC member/reviewer for numerous conferences. Professor Moeslund has co-chaired the following eight international conferences/workshops/tutorials: ARTEMIS'12 (ECCV'12), AMDO'12, Looking at People'12 (CVPR12), Looking at People'11 (ICCV'11), Artemis'11 (ICCV'11), Artemis'10 (MM'10), THEMIS'08 (ICCV'09), and THEMIS'08 (BMVC'08).

**Sergio Escalera** obtained his PhD degree on multiclass visual categorization systems for his work at the Computer Vision Center, UAB. He obtained the 2008 Best Thesis award for Computer Science at Universitat Autònoma de Barcelona. He is ICREA Academia. He leads the Human Pose Recovery and Behavior Analysis Group at UB, CVC, and the Barcelona Graduate School of Mathematics. He is Full Professor at the Department of Mathematics and Informatics, Universitat de Barcelona. He is also a member of the Computer Vision Center at UAB. He is Series Editor of The Springer Series on Challenges in Machine Learning. He is Vice-President of ChaLearn Challenges in Machine Learning, leading ChaLearn Looking at People events. He is co-creator of the Codalab open source platform for organization of challenges. He is also a member of the European Laboratory for Learning and Intelligent Systems ELLIS, the AERFAI Spanish Association on Pattern Recognition, ACIA Catalan Association of Artificial Intelligence, INNS, and Chair of IAPR TC-12: Multimedia and Visual Information Systems. He holds numerous patents and registered models. He has published more than 300 research papers and participated in the organization of scientific events. His research interests include automatic analysis of humans from visual and multimodal data, with special interest in inclusive, transparent, and fair affective computing and characterization of people: personality and psychological profile computing.

**Gholamreza Anbarjafari** (Shahab) is Head of the intelligent computer vision (iCV) lab at the Institute of Technology at the University of Tartu. He was also Deputy Scientific Coordinator of the European Network on Integrating Vision and Language (iV&L Net) ICT COST Action IC1307. He is Associate Editor and Guest Lead Editor of numerous journals, Special Issues, and book projects. He is an IEEE Senior Member and Chair of Signal Processing/Circuits and Systems/Solid-State Circuits Joint Societies Chapter of IEEE Estonian section. He has been the recipient of the Estonian Research Council Grant and has been involved in many international industrial projects. He is an expert in computer vision, machine learning, human–robot interaction, graphical models, and artificial intelligence. He has supervised 17 MSc students and 7 PhD students. He has published over 130 scientific works. He has been in the organizing and technical committees of the IEEE Signal Processing and Communications Applications Conference in 2013, 2014, and 2016 and in the TPC of conferences such as ICOSST, ICGIP, SampTA, and SIU. He has been organizing challenges and workshops in FG17, CVPR17, ICCV17, ECML19, and FG20.

**Kamal Nasrollahi** is Head of Machine Learning at Milestone Systems A/S and Professor of Computer Vision and Machine Learning at Visual Analysis of People (VAP) Laboratory at Aalborg University in Denmark. He has been involved in several national and international research projects. He obtained his MSc and PhD degrees from Amirkabir University of Technology and Aalborg University, in 2007 and 2010, respectively. His main research interest is on facial analysis systems, for which he has published more than 100 peer-reviewed papers on different aspects of such systems in several international conferences and journals. He has won three best conference paper awards.

**Jun Wan** (http://www.cbsr.ia.ac.cn/users/jwan/research.html) received his BS degree from the China University of Geosciences, Beijing, China, in 2008, and his PhD degree from the Institute of Information Science, Beijing Jiaotong University, Beijing, China, in 2015. Since January 2015, he has worked at the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA). He received the 2012 ChaLearn One-Shot-Learning Gesture Challenge Award, sponsored by Microsoft, ICPR 2012. He also received the 2013 and 2014 Best Paper Awards from the Institute of Information Science, Beijing Jiaotong University. His main research interests include computer vision and machine learning, especially for gesture and action recognition and facial attribute analysis (i.e., age estimation, facial expression, gender and race classification). He has published papers in top journals as the first author or corresponding author, such as JMLR, TPAMI, TIP, TCYB and TOMM. He has served as a reviewer for several top journals and conferences, such as JMLR, TPAMI, TIP, TMM, TSMC, PR, CVPR, ICCV, ECCV, ICRA, ICME, ICPR, and FG.

## *Editorial* **Statistical Machine Learning for Human Behaviour Analysis**

#### **Thomas B. Moeslund <sup>1</sup>, Sergio Escalera <sup>2,3</sup>, Gholamreza Anbarjafari <sup>4,5,6</sup>, Kamal Nasrollahi <sup>1,7,</sup>\* and Jun Wan <sup>8</sup>**


Received: 22 April 2020; Accepted: 6 May 2020; Published: 7 May 2020

**Keywords:** action recognition; emotion recognition; privacy-aware

Human behaviour analysis has introduced several challenges in various fields, such as applied information theory, affective computing, robotics, biometrics and pattern recognition. This Special Issue focused on novel vision-based approaches, mainly related to computer vision and machine learning, for the automatic analysis of human behaviour. We solicited submissions on the following topics: information theory-based pattern classification, biometric recognition, multimodal human analysis, low resolution human activity analysis, face analysis, abnormal behaviour analysis, unsupervised human analysis scenarios, 3D/4D human pose and shape estimation, human analysis in virtual/augmented reality, affective computing, social signal processing, personality computing, activity recognition, human tracking in the wild, and application of information-theoretic concepts for human behaviour analysis. In the end, 15 papers were accepted for this special issue [1–15]. These papers, which are reviewed in this editorial, analyse human behaviour from the aforementioned perspectives and, in most cases, define the state of the art in their corresponding fields.

Most of the included papers are application-based systems, while [15] focuses on the understanding and interpretation of a classification model, which is an important factor for the classifier's credibility. Given a set of categorical data, [15] utilizes multi-objective optimization algorithms, like ENORA and NSGA-II, to produce rule-based classification models that are easy to interpret. Performance of the classifier and its number of rules are optimized during the learning, where the first one is obviously expected to be maximized while the second one is expected to be minimized. Testing on public databases, using 10-fold cross-validation, shows the superiority of the proposed method against classifiers that are generated using other previously published methods like PART, JRip, OneR and ZeroR.
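The trade-off between classifier performance and rule count can be made concrete with a small sketch. It is not the method of [15]: ENORA and NSGA-II are evolutionary search procedures that are not reproduced here. The code only illustrates the Pareto-dominance comparison that multi-objective methods of this kind rely on, and the candidate `(accuracy, rule count)` pairs are hypothetical values chosen for illustration.

```python
from typing import List, Tuple

# Each candidate rule-based classifier is summarised by two objectives:
# (accuracy, number_of_rules). Accuracy is maximised, rule count is minimised.
Candidate = Tuple[float, int]


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    acc_a, rules_a = a
    acc_b, rules_b = b
    no_worse = acc_a >= acc_b and rules_a <= rules_b
    strictly_better = acc_a > acc_b or rules_a < rules_b
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Hypothetical (accuracy, rule count) pairs, for illustration only.
models = [(0.90, 12), (0.88, 5), (0.85, 5), (0.91, 20), (0.80, 3)]
front = pareto_front(models)
```

Here `(0.85, 5)` is dropped because `(0.88, 5)` is more accurate with the same number of rules; the remaining candidates represent different accuracy/interpretability trade-offs.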

Two published papers ([1,9]) have privacy as their main concern, while they develop their respective systems for biometrics recognition and action recognition. Reference [1] has considered a privacy-aware biometrics system. The idea is that the identity of the users should not be readily revealed from their biometrics, like facial images. Therefore, they have collected a database of foot and hand traits of users while opening a door to grant or deny access. Reference [9] develops a privacy-aware method for action recognition using recurrent neural networks. The system accumulates reflections of light pulses emitted by a laser, using a single-pixel hybrid photodetector. This includes information about the distance of the objects to the capturing device and their shapes.

Multimodality (RGB-depth) is covered in [14] for sign language recognition, while in [11], multiple domains (spatial and frequency) are used for saliency detection. Reference [14] has applied restricted Boltzmann machines (RBMs) to develop a system for sign language recognition from a given single image, in two modalities of RGB and depth. Two RBMs are designed to process the images coming from the two deployed modalities, while a third RBM fuses the results of the first two RBMs. The inputs to the first two RBMs are hand images that are detected by a convolutional neural network (CNN). The experimental results reported in [14] on two public databases show the state-of-the-art performance of the proposed system. Reference [11] proposes a multi-domain (spatial and frequency)-based system for salient object detection in foggy images. The frequency domain saliency map is extracted using the amplitude spectrum, while the spatial domain saliency map is calculated using the contrast of the local and global super-pixels. These different domain maps are fused using a discrete stationary wavelet transform (DSWT) and are then refined using an encoder-decoder model to accentuate the salient objects. Experimental results on public databases and comparison with similar state-of-the-art methods show the better performance of this system.

Four papers in this special issue have covered action recognition [6,9,12,13]. Reference [12] has proposed a system for toe-off detection using a regular camera. The system extracts the differences between consecutive frames to build silhouette difference maps, which are then fed into a CNN for feature extraction and classification. Different types of maps are developed and tested in this paper. The experimental results reported in [12] on public databases show state-of-the-art performance. Reference [6] proposes a system for monitoring and predicting the condition of individuals and, subsequently, of crowds. Individuals participating in this study are grouped into crowds based on their physical locations extracted using GPS on their smartphones. Then, an enhanced context-aware framework using an algorithm for feature selection is used to extract statistical-based time-frequency domain features. Reference [13] focuses on utilizing recurring concepts using adaptive random forests to develop a system that can cope with drastically changing behaviours in dynamic environments, like financial markets. The proposed system is an ensemble-based classifier comprised of trees that are either active or inactive. The inactive ones keep a history of market operators' reactions in previously recorded similar situations, while either an inactive tree or a background tree that has recently been trained replaces the active ones, as a reaction to drift.

In terms of face analysis, in [10] a system is proposed for detecting fuzziness tendencies and utilizing these to design human-machine interfaces. This is motivated by the fact that humans tend to pay more attention to sections of information with fuzziness, which are sections with greater mental entropy. The work of [4] proposes a conditional random field-based system for segmentation of facial images into six facial parts. These are then converted into probability maps, which are used as feature maps for a random decision forest that estimates head-pose, age, and gender.

The method introduced in [3] uses singular value decomposition to remove the background of fingerprint images. Then, it finds the fingerprints' boundaries and applies an adaptive algorithm based on wavelet extrema and the Henry system to detect singular points, which are widely used in fingerprint-related applications such as registration, orientation detection, fingerprint classification, and identification systems.

Three papers have covered emotion recognition, one from body movements [5], and two from speech signals [2,7]. In [2] a committee of classifiers has been applied to a pool of descriptors extracting features from speech signals. A voting scheme is then applied to the classifiers' outputs to reach a conclusion about the emotional state conveyed by the speech signals. The paper in [2] shows that the committee of classifiers outperforms the single individual classifiers in the committee. The system proposed in [7] builds 3D tensors of spectrogram frames that are obtained by extracting 88-dimensional feature vectors from speech signals. These tensors are then used for building a 3D convolutional neural network that is employed for emotion recognition. The system has produced state-of-the-art results on three public databases. The emotion recognition system of [5] does not use facial images or speech signals, but body movements, which are captured by Microsoft Kinect v2 under eight different emotional states. The affective movements are represented by extracting and tracking location and orientation of body joints over time. Experimental results, using different deep learning-based methods, show the state-of-the-art performance of this system.
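As a rough illustration of how a committee's outputs can be combined, the following sketch applies plain majority voting per sample. This is an assumption made for illustration: the exact voting scheme of [2] is not described in this editorial, and the classifier outputs and emotion labels below are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[str]) -> str:
    """Return the most frequent label; ties go to the label that appears first."""
    if not predictions:
        raise ValueError("no predictions given")
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label
    raise AssertionError("unreachable: some label always has the top count")


def committee_predict(per_classifier: List[List[str]]) -> List[str]:
    """per_classifier[k][i] is the label classifier k assigns to sample i."""
    return [majority_vote(votes) for votes in zip(*per_classifier)]


# Hypothetical outputs of three classifiers on three speech samples.
outputs = [
    ["happy", "sad", "angry"],
    ["happy", "angry", "angry"],
    ["sad", "sad", "neutral"],
]
decisions = committee_predict(outputs)
```

The tie-breaking rule (first label seen wins) is an arbitrary choice; a weighted or confidence-based vote would be an equally plausible alternative.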

Finally, two databases have been introduced in this special issue, one for biometric recognition [1] and one for detecting sleeping issues and fatigue [8], the latter containing data from patients suffering from Fibromyalgia, a condition resulting in muscle pain and tenderness, accompanied by a few other symptoms including sleep, memory, and mood disorders. The work in [8] uses similarity functions with configurable convexity or concavity to build a classifier on this collected database in order to predict extreme cases of sleeping issues and fatigue.

**Acknowledgments:** We express our thanks to the authors of the above contributions and to the journal Entropy and MDPI for their support during this work. Kamal Nasrollahi's contribution to this work is partially supported by the EU H2020-funded SafeCare project, grant agreement no. 787002. This work is partially supported by ICREA under the ICREA Academia programme.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Privacy-Constrained Biometric System for Non-Cooperative Users**

#### **Mohammad N. S. Jahromi <sup>1,</sup>\*, Pau Buch-Cardona <sup>2</sup>, Egils Avots <sup>3</sup>, Kamal Nasrollahi <sup>1</sup>, Sergio Escalera <sup>2,4</sup>, Thomas B. Moeslund <sup>1</sup> and Gholamreza Anbarjafari <sup>3,5</sup>**


Received: 21 September 2019; Accepted: 23 October 2019; Published: 24 October 2019

**Abstract:** With the consolidation of the new data protection regulation paradigm for each individual within the European Union (EU), major biometric technologies are now confronted with many concerns related to user privacy in biometric deployments. When an individual's biometrics are disclosed, sensitive information about his/her personal data, such as financial or health data, is at high risk of being misused or compromised. This issue can escalate considerably in scenarios with non-cooperative users, such as elderly people residing in care homes, who are unable to interact conveniently and securely with the biometric system. The primary goal of this study is to design a novel database to investigate the problem of automatic people recognition under privacy constraints. To do so, the collected data-set contains the subject's hand and foot traits and excludes the face biometrics of individuals in order to protect their privacy. We carried out extensive simulations using different baseline methods, including deep learning. Simulation results show that, with the spatial features extracted from the subject sequence in both individual hand or foot videos, state-of-the-art deep models provide promising recognition performance.

**Keywords:** biometric recognition; multimodal-based human identification; privacy; deep learning

#### **1. Introduction**

Biometric recognition is the science of identification of individuals based on their biological and behavioral traits [1,2]. In the design of a biometrics-based recognition or authentication system, different issues, heavily related to the specific application, must be taken into account. According to the literature, ideally biometrics should be universal, unique, permanent, collectable, and acceptable. In addition, besides the choice of the biometrics to employ, many other issues must be considered in the design stage. The system accuracy, the computational speed, and cost are important design parameters, especially for those systems intended for large populations [3]. Recently, biometric recognition systems have posed new challenges related to personal data protection (e.g., GDPR), which is not often considered by conventional recognition methods [4]. If biometric data are captured or stolen, they may be replicated and misused. In addition, the use of biometrics data may reveal sensitive information about a person's personality and health, which can be stored, processed, and distributed without the user's consent [5]. In fact, GDPR has a distinct category of personal data protection that defines 'biometric data', its privacy, and legal grounds of its processing. According to GDPR, 'biometric data' is defined as 'personal data resulting from specific technical processing relating to the physical, physiological or behavioural characteristics of a natural person, which allow or confirm the unique identification of that natural person such as facial images' [6]. Furthermore, GDPR attempts to address privacy matters by preventing the processing of any 'sensitive' data revealing information such as health or sexual orientation of individuals. In other words, processing of such sensitive data can only be allowed if it falls under one of the ten exceptions laid down in GDPR [6]. Apart from this privacy concern, in some scenarios, designing and deploying a typical biometric system where any subject has to cooperate and interact with the mechanism may not be practical. In care homes with elderly patients, for example, interaction of the user with typical device-dependent hardware or following specific instructions during a biometric scan (e.g., direct contact with a camera, placing a biometric into a specific position, etc.) may not be feasible [7,8]. In other words, the nature of such uncontrolled environments suggests that the biometric designer should consider strictly natural and transparent systems that mitigate the users' non-cooperative behavior while providing an enhanced performance.

This possibility was explored in our earlier work [9] by considering identification of persons when they grab the door handle to open a door, which is an unchanged routine that requires no further user training. In our previous work, we designed a bimodal dataset (hand's dorsal, hereafter referred to as hand, and face) by placing two cameras above the door handle and frame, respectively. This was done in order to capture the dorsal hand image of each user while opening the door multiple times (10 times per user) in a nearly voluntary manner. In addition, face images of users approaching the physical door were collected as a complementary biometric feature. In [9], we concluded that facial images are not always clearly visible due to the non-cooperative nature of the environment, but, when visible, they provide complementary features to hand-based identification.

The study in [9], however, disregards the user privacy concerns discussed above, as all of its methods employ the visible face of each subject in the recognition task, which is considered sensitive information in the new data protection paradigm.

In this paper, we deal with the problem of automatic people recognition under privacy constraints. Due to this constraint, it is crucial to follow a careful data-collection protocol that excludes any sensitive biometric information that may compromise the user's privacy. For instance, to protect the users, acquiring facial or full-body gait information of candidates is not possible. Consequently, we have collected a new data-set containing only the hands and feet of each subject using both RGB and near-infrared cameras. We verified the usefulness of the designed setup for user privacy-constrained classification by performing extensive experiments with both conventional handcrafted methods and recent deep learning models.

The remainder of this paper is organized as follows: Section 2 discusses related work in the field. In Section 3, the database is presented. In Section 4, the dataset is evaluated with classical and deep learning strategies. Finally, conclusions are drawn in Section 5.

#### **2. Related Work**

This section reviews the existing methods on hand and footprint recognition, focusing mostly on the use of geometric spatial information. A few detailed studies review different hand-based biometric recognition systems [10,11]. Visual specifications of hands constitute a paramount criterion for biometric-based identification of persons, owing to the associated relatively low computational requirements and mild memory usage [12]. In addition, they provide superior distinctive representations of persons, which lead to unparalleled recognition success rates. Furthermore, the related procedures can be well adapted into the existing biometric authentication systems, which makes them favorable for the foregoing purpose [13–17]. These systems, depending on the type of the features they extract from the hand, can be categorized as follows:

• Group 1: in which the geometric features of the hand are used for the identification. Examples of such features include the length and the width of the hand palm. Conventional methods such as General Regression Neural Network (GRN) [18], graph theory [18], or later methods like sparse learning [19] are examples of this group.


#### *Footprint*

Contrary to many well-established biometric techniques used in the context of automatic human recognition, human foot features are rarely used in those solutions. Although the uniqueness of the human foot is extensively addressed in forensic studies [29], commercial solutions are considered mostly complicated due to the complexity of data acquisition in the environment [30]. The very early attempt at employing a human foot as a means of identification emerged in the forensic study carried out by Kennedy [29], which examines the uniqueness of barefoot impressions. In [31], the first notion of utilizing the Euclidean distance between a pair of human feet was presented. In [32], the authors propose a static and dynamic footprint-based recognition based on a hidden Markov model. The latter implemented a footprint-based biometric system, similar to a hand, which involves exploiting the following foot features:


Minutiae-based ballprint [30] in the foot as well as different distance techniques such as city-block, cosine, and correlation [37] are further examples of the features that are employed in this context. It is also important to mention that gait biometrics [38] are also a potential approach that studies the characteristic of human foot strike.
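For reference, the distance measures named above (Euclidean, city-block, cosine, and correlation) can be written down in a few lines. This is a generic sketch using their standard definitions; the feature vectors `u` and `v` are hypothetical stand-ins, since the exact foot features compared in [31,37] are not listed here.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def city_block(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(abs(a - b) for a, b in zip(u, v))


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """One minus the cosine of the angle between u and v."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (norm_u * norm_v)


def correlation_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine distance between the mean-centred versions of u and v."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return cosine_distance([a - mean_u for a in u], [b - mean_v for b in v])


# Hypothetical feature vectors, for illustration only.
u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]
d_euclid = euclidean(u, v)
d_city = city_block(u, v)
```

Because `v` is a scaled copy of `u`, the cosine and correlation distances are (numerically) zero while the Euclidean and city-block distances are not, which shows how the choice of measure changes what counts as "similar".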

#### **3. Acquisition Setup**

In this paper, in order to have a realistic testing environment, an acquisition setup has been designed by employing a standard-size building door with three camera sensors, one mounted above its handle, and two installed at the frame side, respectively.

During data collection, it is important to capture each modality in a clearly visible form so that all unique meaningful features can be extracted. In other words, each modality has to be collected by a proper sensor. In this work, for example, each subject approaches a door and grabs its handle to open it. Therefore, each subject's hand should be recorded by a sensor while placed on the door handle. Based on several tests conducted with different available sensors, we chose to employ a near-infrared (NIR) camera (AV MAKO G-223B NIR POE) equipped with a band-pass filter to cut off visible light. In this way, for hands, good feature candidates such as veins can be properly extracted. In addition, to guarantee that the hands on the door handles are visible in the captured frames, a near-infrared light source (SVL BRICK LIGHT S75-850) was also mounted on the door frame. To capture the foot modality, a regular RGB camera (GoPro Hero 3 Black) is installed on the door frame to record the subject's feet as they approach the door. The third camera in this setup has been used to acquire the face modality of each corresponding subject, although it is not used to perform automatic classification. The face recordings are collected to conduct alternative studies beyond the scope of this paper and are hence excluded. The overall door model together with the installed cameras and the light source is shown in Figure 1.

**Figure 1.** Illustration of the main set-up and the sensors.

A total of 77 persons of mixed gender and varying ages from 20 to 55 years participated in the data collection procedure at *Aalborg University*. There exist three paths that each subject can take to approach the door setup. Each person is requested to approach the door from any desired path at random. These paths cover both linear and curved trajectories, making the scenario natural. The participant then walks toward the door, grabs its handle and then passes through the door. This procedure is repeated two times. During data acquisition, no further instructions were given to the participants. This is done to have the participants grab the door handle as they would naturally do in any context. As a result, all data are captured in a totally natural scenario where a variety of realistic situations such as occlusion, different poses and partially visible feet may occur. Furthermore, the lighting condition is not controlled and the data were collected at different times of day over two months. Figure 2 shows samples acquired by the different cameras.

**Figure 2.** Sample of captured frames of both hand and foot modalities for four subjects. Recoverable panel labels: (**e**) Female (1); (**f**) Male (1); (**g**) Female (2); (**h**) Male (2).

Each video sequence of the subject's hand/foot is post-processed to enhance the quality of the captured frames and remove any camera distortion. This is performed by using the well-known chessboard camera calibration tool in the vision library of MATLAB (2019, MathWorks) [39].

Privacy disclaimer: While our proposal moves in the direction of privacy-constrained scenarios, we are aware that some soft biometric features used in this work could still, in some situations, allow specific external observers to identify the user. Without loss of generality, we use privacy-constrained to refer to the scenario where sensitive user information is avoided, making the biometric identification harder in case data are leaked.

#### **4. Experimental Results and Discussion**

In this section, we first discuss the evaluation protocol of the experiments. Then, we briefly explain the methods used and finally the obtained results and discussions.

#### *4.1. Evaluation Protocol*

In order to carry out experiments using different methods, we divided the database into mutually disjoint subsets of training, validation, and testing. As there are two cycles of complete action per modality (i.e., each user approaches the door twice), each video sample is divided into two sequences per modality. Next, we use all first sequences from all subjects for both hands and feet to train while utilizing the second sequence of the subjects for validation and testing, respectively. In this manner, we have 77 sample sequences per modality for training, 37 sequences for validation and 40 sequences for testing. Each test sample is then associated with a label during simulations.
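The bookkeeping of this protocol can be sketched as follows. The text does not say which subjects' second sequences go to validation and which to testing, so assigning the first 37 subject indices to validation is purely an assumption for illustration; the names `split_sequences`, `NUM_SUBJECTS`, and `NUM_VALIDATION` are likewise hypothetical.

```python
from typing import Dict, List, Tuple

NUM_SUBJECTS = 77     # participants in the database
NUM_VALIDATION = 37   # second sequences used for validation; the other 40 are for testing

# A sample is identified by (subject_id, sequence_index), per modality.
Sample = Tuple[int, int]


def split_sequences(num_subjects: int = NUM_SUBJECTS,
                    num_validation: int = NUM_VALIDATION) -> Dict[str, List[Sample]]:
    """First sequence of every subject trains; second sequences are held out."""
    subjects = list(range(num_subjects))
    first = [(s, 0) for s in subjects]
    second = [(s, 1) for s in subjects]
    return {
        "train": first,
        # Assumption: the validation/test assignment by subject index is arbitrary.
        "validation": second[:num_validation],
        "test": second[num_validation:],
    }


splits = split_sequences()
sizes = {name: len(items) for name, items in splits.items()}
```

With the defaults, `sizes` reproduces the 77/37/40 counts stated above, and every subject appears in the training set, so each held-out sequence belongs to an identity seen during training.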

In this paper, the main focus of all the experiments is around general spatial appearance models. In other words, for all the simulations, the spatial features are extracted through the analysis of each independent frame (uncorrelated frames per same subject). For the evaluated deep learning model, we have further analyzed the contribution of the motion as an input modality. Finally, we have also performed late-fusion on both modalities for all of the experiments.

To summarize, the performed experiments are divided into the following three categories:


#### *4.2. Conventional Techniques Evaluation*

#### 4.2.1. Local Binary Patterns and Support Vector Machine

Even though deep neural networks dominate state-of-the-art solutions in image processing, it is still worthwhile to test conventional methods to create baseline results, in particular in scenarios where a limited amount of annotated data is available. Local binary patterns (LBP) [40–42] are among the most powerful handcrafted texture descriptors. The core implementation and its variants are extensively used in facial image analysis, including tasks as diverse as face detection, face recognition and facial expression analysis. Benzaoui et al. [43] showed that classification tasks which use LBP for feature extraction can improve various statistical procedures, such as principal component analysis (PCA) and discrete wavelet transform (DWT). For example, by using a combination of DWT, LBP and support vector machine (SVM) [44–46] for classification, it is possible to create a hybrid method for face recognition. Similarly, the same approach can be used for hand and foot classification. The performance of an LBP-based feature extractor can be greatly improved by making the input data robust against certain image transformations. For example, in the case of face images, this relates to aligned and cropped faces. When considering recordings of human gait, the size and orientation of the foot are constantly changing, thus adding additional challenges to the description and classification problems. The DWT method is widely used in feature extraction, compression and denoising applications. The process of recognition using DWT is as follows: the wavelet transform of a particular level is applied to the test image and the output is an approximation coefficient matrix, which we consider as a sub-image. Then, we extract rotation-invariant LBP feature vectors from the sub-images for SVM training and classification [47]. The regions of interest from the video sequences are extracted using the frame difference method across multiple frames. This approach was robust enough to successfully pre-process all the videos in the database. The system was developed in the MATLAB environment, where we used built-in functions for single-level 2D wavelet decomposition (dwt2) to obtain the approximation coefficients matrix, rotation-invariant local binary patterns (extractLBPFeatures) with 10 × 10 cells [48], and the linear multi-class support vector machine (fitcecoc).

• *Setup:* To acquire the region of interest (ROI) for the moving object (hand or foot) in each frame, we applied a simple and fast frame difference. If the difference is greater than 80 pixels for the foot videos and 30 pixels for the hand videos (these values were found empirically), the resulting difference frame is recorded as a binary mask. Figure 3a shows an example of all masks within one sequence, where the color transition from dark gray to white represents the transition from the start to the end of the video. For the foot sequences, the bounding box for a particular frame is created by taking that frame and superimposing the binary masks of the previous 10 frames and the next 10 frames (similar to a sliding window). In other words, a binary image is formed by repeating the logical OR operation over 21 consecutive frames, after which a bounding box is found for the detected region, as shown in Figure 3b. For the hand sequences, on the other hand, a fixed bounding box was created by applying the OR operation to all binary frames. After drawing the bounding box, the images were cropped and then resized to 200 × 200 pixels.

• *Experiments:* For the same subject, the total number of frames depends on the video length and therefore cannot be fixed to a specific amount. The image features were extracted from an approximation coefficients matrix, which is one of the outputs of the single-level two-dimensional wavelet decomposition. In particular, we used the Symlet wavelet to obtain the output of the decomposition low-pass filter. Afterwards, rotation-invariant local binary patterns were extracted from the output of the wavelet decomposition and used as feature vectors to train a linear one-versus-one multi-class support vector machine. This process is shown in Figure 4. Finally, the fusion results were obtained via majority voting, where a video label was determined by independent foot and hand frames. Results for single-frame recognition and fusion can be found in Table 1. Note that a random classifier in our problem of 77 labels would be 1.3% accurate; taking this and the limited amount of data into account, the results can still be considered reasonably good performance for the baseline method.
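The majority-voting fusion and the chance level mentioned above can be illustrated with a minimal sketch. It assumes that per-frame predictions are available as subject IDs and that foot and hand frames are simply pooled before a single vote; the pooling rule, the function names and the example predictions are illustrative assumptions, not details given in the text.

```python
from collections import Counter


def majority_vote(frame_labels):
    """Return the most frequent label among per-frame predictions."""
    if not frame_labels:
        raise ValueError("at least one frame prediction is required")
    # most_common orders equal counts by first occurrence (Python 3.7+).
    return Counter(frame_labels).most_common(1)[0][0]


def fuse_video_label(foot_frame_labels, hand_frame_labels):
    """Pool foot and hand frame predictions and vote once (assumed rule)."""
    return majority_vote(list(foot_frame_labels) + list(hand_frame_labels))


# Hypothetical per-frame predictions (subject IDs) for one subject's videos.
foot_predictions = [12, 12, 40, 12, 7]
hand_predictions = [12, 40, 12, 12]
video_label = fuse_video_label(foot_predictions, hand_predictions)

# Chance level for 77 equally likely labels, in percent (about 1.3%).
chance_level = 100.0 / 77
```

With these toy predictions, label 12 receives six of the nine votes, so a few wrong frames do not change the video-level decision.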

(**a**) All frame masks within one sequence obtained by frame difference

(**b**) Region of interest created by several frame masks

**Figure 3.** Movement detection and bounding box extraction.
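The bounding box extraction summarised in Figure 3 can be sketched as follows. This is a minimal pure-Python illustration that assumes the binary masks have already been computed by frame differencing and are stored as lists of rows; clamping the window at the start and end of a sequence is an assumption, since the text does not say how border frames are handled. All function names are hypothetical.

```python
def or_masks(masks):
    """Element-wise logical OR of equally sized binary masks (lists of rows)."""
    height, width = len(masks[0]), len(masks[0][0])
    return [[any(m[r][c] for m in masks) for c in range(width)]
            for r in range(height)]


def bounding_box(mask):
    """Return (top, left, bottom, right), inclusive, or None for an empty mask."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        return None
    return rows[0], cols[0], rows[-1], cols[-1]


def sliding_window_box(masks, index, half_window=10):
    """Box for frame `index` from the OR of masks in the surrounding window.

    half_window=10 gives the 21-frame window used for the foot sequences;
    the window is clamped at the sequence borders (an assumption).
    """
    start = max(0, index - half_window)
    stop = min(len(masks), index + half_window + 1)
    return bounding_box(or_masks(masks[start:stop]))


def fixed_box(masks):
    """Single box from the OR of all masks, as for the hand sequences."""
    return bounding_box(or_masks(masks))


# Toy sequence: 5 frames of size 4 x 6 with one active pixel moving right.
masks = [[[1 if (r == 1 and c == t) else 0 for c in range(6)]
          for r in range(4)] for t in range(5)]
window_box = sliding_window_box(masks, 2, half_window=1)
hand_style_box = fixed_box(masks)
```

Cropping to the box and resizing to 200 × 200 pixels would follow this step and is not shown.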

**Figure 4.** Feature extraction flow chart.


**Table 1.** Recognition rate in (%) of the DWT-LBP-SVM approach.

#### 4.2.2. Dictionary Learning

Sparsity-based signal processing is a well-established method in the field of Computer Vision. This success is mainly due to the fact that important classes of signals, such as audio and images, have naturally sparse representations with respect to fixed bases (e.g., Fourier, wavelet) or concatenations of such bases [49]. It has been applied to many Computer Vision tasks, such as face recognition [50], image classification [51] and denoising [52]. In particular, the robust face recognition via sparse representation (SRC) algorithm proposed in [49] uses sparse representation for face recognition. The basic idea of this method is to form an over-complete dictionary from the training faces and then classify a new face by searching for the sparsest vector in this dictionary. Hence, this technique is called dictionary learning. Unlike conventional methods such as Eigenface and Fisherface, dictionary learning can achieve superior results without any explicit feature extraction [53]. This superiority makes the SRC method convenient to employ in recognition tasks.
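For reference, sparse representation classification is commonly written as an l1-minimisation followed by a class-wise residual test. The notation below is an assumption introduced for illustration and is not taken from the text: $y$ is the test sample, $D$ is the dictionary whose columns are training samples, $x$ is the coefficient vector, $\varepsilon$ is a noise tolerance, and $\delta_i(\cdot)$ keeps only the coefficients associated with class $i$.

```latex
\hat{x} = \arg\min_{x} \|x\|_1
\quad \text{subject to} \quad \|y - D x\|_2 \le \varepsilon,
\qquad
\hat{c} = \arg\min_{i} \, \|y - D\,\delta_i(\hat{x})\|_2
```

The predicted class $\hat{c}$ is the one whose training samples reconstruct $y$ with the smallest residual.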


(**a**) Original frame (**b**) Extracted Frame

**Figure 5.** Sample of the extracted frame using a Kalman tracker.


**Table 2.** Average recognition rate in (%) of the network for sparse representation classifier.

#### *4.3. Deep Learning*

Deep Neural Networks, and especially Convolutional Neural Networks (CNN), have gained a lot of attention due to their state-of-the-art classification performance in many Computer Vision tasks since the breakthrough of the AlexNet architecture [55] in the 2012 ILSVRC (ImageNet Large-Scale Visual Recognition Challenge).

In the context of this work, we opted to process the video frames as individual RGB images from both hand and feet datasets given the limited amount of data. However, for completeness, we have also considered motion maps as network inputs since some motion features may be unique to each subject regardless of their clothing. To that end, we have extracted the Optical Flow (OF) values (u,v) of each pair of consecutive video frames for each subject. The resulting OF values can then be used to generate a heat map that may potentially describe the motion features. Figure 6 shows a sample heat map generated by the Optical Flow vector of consecutive video frames of both hand and foot modality per subject. The rest of the simulations are arranged as follows:

**Figure 6.** The heat map generated from an optical flow vector of consecutive video frames per each subject's modality (**a**) and the corresponding heat map (**b**).


• *Setup:* Since the dataset under study can be clearly linked to a classification problem, we found it convenient to conduct our experiments on a standard ResNet-50 neural network architecture, as shown in Figure 7. ResNet-50 has been shown to be faster and to have a lower computational cost than standard classification architectures such as VGG-16, due to its skip connection configuration [56]. For this purpose, we constrained the input data (frames) to an image size of [224 × 224], the batch size to 32 and the number of output classes to 77 (the number of eligible subjects) during the training phase. We left the number of input channels as a degree of freedom that is set according to the different experiments we conducted.

**Figure 7.** ResNet-50 neural network architecture [56].

1. *Appearance*: In this setting, the extracted frames, as before, are fed to the network for both the hand and foot datasets. Hence, the input channel dimension is set to 3 due to the RGB nature of the frames. The recognition accuracy of this network is reported in Table 3.


**Table 3.** Average recognition rate in (%) of the ResNet network for the appearance model.

2. *Optical Flow (OF)*: OF values (u,v) are extracted for each consecutive pair of frames for both the hand and foot datasets. In this case, we set the input channel parameter to 2 due to the OF dimensionality. Table 4 summarizes the results of this setting.

**Table 4.** Average recognition rate in (%) of the ResNet network for the OF model.


3. *Appearance + OF*: We apply early fusion to the extracted frames and the calculated OF values for both the hand and foot datasets. That is, the input is set to 5 channels in order to match the new RGB (3) + OF (2) dimension. The results of this simulation are tabulated in Table 5.


**Table 5.** Average recognition rate in (%) of the ResNet network for both the appearance and OF model.

4. *Appearance + Optical Flow (late fusion)*: Finally, we bring together the appearance analysis and the OF analysis, which are computed separately. We apply the late fusion principle to the outputs of the two 'branches' and study its performance. The results of this setting can be seen in Table 6.

**Table 6.** Average recognition rate in (%) of the ResNet network for the late fusion of the appearance and OF models.
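The difference between the early and late fusion settings listed above can be made concrete with a small sketch. Early fusion concatenates the RGB and optical flow values of a pixel into a 5-channel input before the network. For late fusion, the text does not specify the combination rule, so averaging the per-class scores of the two branches is an assumption made here for illustration. Function names and values are hypothetical.

```python
def early_fusion(rgb_pixel, flow_pixel):
    """Concatenate RGB (3) and optical flow (u, v) values into 5 channels."""
    return tuple(rgb_pixel) + tuple(flow_pixel)


def late_fusion(appearance_scores, flow_scores):
    """Average two per-class score lists (assumed rule) and pick the best class."""
    if len(appearance_scores) != len(flow_scores):
        raise ValueError("both branches must score the same classes")
    fused = [(a + f) / 2.0 for a, f in zip(appearance_scores, flow_scores)]
    best_class = max(range(len(fused)), key=fused.__getitem__)
    return best_class, fused


# One pixel of a fused input: three colour values followed by (u, v).
fused_pixel = early_fusion((120, 64, 32), (0.5, -1.25))

# Toy scores over three classes from the two separately trained branches.
best_class, fused_scores = late_fusion([0.2, 0.5, 0.3], [0.1, 0.3, 0.6])
```

In this toy case the appearance branch alone would choose class 1, while the averaged scores favour class 2, which shows how the second branch can change the final decision.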


As can be seen from the results in all of the tables, the conventional techniques for both modalities and their late fusion cannot effectively utilize the spatial information of the modalities, and hence may not be good candidate methods for a real solution, for several reasons. On the one hand, handcrafted features cannot properly model the fine-grained information present in the data. Those small details are precisely the ones that may properly identify the subject in this challenging scenario. On the other hand, even for handcrafted methods, the limited amount of data per subject in this dataset reduces the generalization capability of handcrafted strategies. Still, note that a random guesser in this scenario would achieve 1.3% accuracy. Thus, an accuracy over 50% and a better result for the combined hand–foot model show that handcrafted methods are, to some degree, able to learn some discriminative features and exploit their complementary nature.

In Deep Learning, however, we find the best classification results (84.4% accuracy) when analyzing appearance in the per-subject sequence setting, that is, when we use the whole sequence of frames per subject to determine the resulting class. This makes sense because a few uncertain frame predictions do not normally contribute much to the subject's final estimation. We can think of these misclassifications as noisy samples, which are mostly cancelled out when averaging over multiple frames. The hand-only model achieves better performance than the foot-only one. This was somewhat expected because of the more controlled recording of hands and the freedom of the subject in terms of walking, i.e., different walking paths, different points of view, and different scales because of the distance to the camera. Interestingly, the late fusion combination improves the hand results by around four points, suggesting that complementary and discriminative features are captured by the deep approaches. Some misclassified examples are shown in Figure 8, where each frame on the left was misclassified as belonging to the subject on the right. Some explanation can be found by visual inspection alone (resemblance between the subjects' appearance). On the other hand, we find the worst performance when analyzing OF per independent frame (32.6% and 35.1% accuracy). We believe motion can produce features that are complementary to the appearance ones and benefit from being an appearance-invariant descriptor. However, in order to obtain an increase in performance from the use of motion, additional data and further strategies to mitigate the overfitting effect (e.g., data augmentation) should be considered.


**Figure 8.** Appearance of misclassified examples. Left frames are being misclassified as belonging to the subject on the right. Misclassified hand modality of a subject (**a**). Misclassified foot modality of a subject (**b**).

#### **5. Conclusions**

In this paper, we presented a dataset containing hand and foot sequences for 77 subjects with the goal of performing automatic people recognition under privacy constraints. The dataset was collected using both RGB and near-infrared cameras. We carried out extensive simulations using: (1) conventional handcrafted techniques such as LBP, DWT, SRC, and SVM, and (2) deep learning. The results show that poor recognition performance is achieved when applying handcrafted techniques, independently of whether the hand or foot modality is used. On the other hand, the evaluated ResNet-50 deep model achieves a recognition rate of over 70% for feet and 80% for hands, which is further improved when the two are fused, showing their complementary nature and obtaining a final score of 84.4%. Interestingly, the inclusion of optical flow maps to enrich the appearance network channels did not show any improvement. This could be due to the limited amount of training data available per participant in the dataset. All in all, spatial appearance deep learning showed a high generalization performance in recognizing users from the combination of hand and foot data.

**Author Contributions:** Formal analysis—K.N., S.E., T.B.M., and G.A.; Methodology—M.N.S.J., P.B.-C., and E.A.; Writing—original draft, M.N.S.J.; Writing—review and editing, P.B.-C., E.A., K.N., S.E., T.B.M., and G.A.

**Funding:** This work has been partially supported by the Spanish project TIN2016-74946-P (MINECO/FEDER, UE), CERCA Programme/Generalitat de Catalunya, and the Estonian Centre of Excellence in IT (EXCITE).

**Acknowledgments:** We gratefully acknowledge the support of NVIDIA with the donation of the GPU used for this research. This work is partially supported by ICREA under the ICREA Academia program.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Emotional Speech Recognition Based on the Committee of Classifiers**

#### **Dorota Kamińska**

Lodz University of Technology, Institute of Mechatronics and Information Systems, Stefanowskiego 18/22 Str. 90-924 Lodz, Poland; dorota.kaminska@p.lodz.pl

Received: 12 August 2019; Accepted: 19 September 2019; Published: 21 September 2019

**Abstract:** This article presents a novel method for emotion recognition from speech based on a committee of classifiers. Different classification methods were juxtaposed in order to compare several alternative approaches for final voting. The research is conducted on three different types of Polish emotional speech: acted out with the same content, acted out with different content, and spontaneous. A pool of descriptors commonly utilized for emotional speech recognition, expanded with sets of various perceptual coefficients, is used as input features. This research shows that the presented approach improves the performance with respect to a single classifier.

**Keywords:** emotion recognition; speech; committee of classifiers

#### **1. Introduction**

During a conversation, people are constantly sending and receiving different nonverbal cues, communicated through the speech signal (paralanguage), body movements, facial expressions, and physiological changes. The discrepancy between the words spoken and the interpretation of their actual content relies on nonverbal communication. Emotions are a medium of information regarding the feelings of an individual and one's expected feedback. The ability to recognize attitude and thoughts from one's behaviour was the original system of communication prior to spoken language. Understanding the emotional state enhances interaction. Although computers are now a part of human life, the relation between human and machine is far from natural [1]. Proper identification of the emotional state can significantly improve the quality of human-computer interfaces. It can be applied to the monitoring of psycho-physiological states of individuals, e.g., to assess the level of stress or fatigue, as well as to forensic data analysis [2], advertising [3], social robotics [4], video conferencing [5], violence detection [6], animation or synthesis of life-like agents [xue2018voice], and many others. Automatic emotion recognition methods utilize various input types, i.e., facial expressions [7–9], speech [10–12], gesture and body language [13,14], and physical signals such as electrocardiogram (ECG), electromyography (EMG), electrodermal activity, skin temperature, galvanic resistance, blood volume pulse (BVP), and respiration [15]. Facial expressions have been studied most extensively, and about 95% of the literature dedicated to this topic focuses on faces as a source, at the expense of other modalities [16]. Speech is one of the most accessible of the above-mentioned signals, and it has recently become an increasingly significant research direction in emotion recognition. Despite an enormous amount of research, the issue is still far from a satisfactory solution. 
Analysis of the emotional content embedded in speech presents multiple difficulties. The main problem is gathering and compiling a database of viable and relevant experimental material. Most available corpora comprise speech samples uttered by professional actors, which are not guaranteed to reflect the real environment with its background noise or overlapping voices. Additionally, individual features of the speaker, such as gender, age, origin and social influence, can greatly affect universal consistency in emotional speech.

The first major work studying emotions, published before the 20th century, was *The Expression of the Emotions in Man and Animals* by Charles Darwin [17]. Darwin made the first description of the paralanguage conveying emotional states of the speaker. Based on the study of people and different species of animals, he came to the conclusion that there is a direct connection between the modulation of the speech signal and the internal state of the individual. He also observed that acoustic signals could trigger emotional reactions in the listener. Both theoretical and practical approaches suggest that specific paralinguistic cues such as loudness, rate, pitch, pitch contour and formant frequencies contribute to the emotive quality of an utterance. Emotions may cause changes in breathing, phonation or articulation, which are reflected in speech. For example, states like anger or fear are characterized by a fast pace, high pitch values, a wide range of intonation, sudden acceleration of the heart rate, increased blood pressure and, in some cases, dry mouth and muscle tremor. The opposite phenomena occur in the case of sadness and boredom. Speech becomes slow and monotonous, and pitch is reduced without any major changes in intonation. This is partially caused by activation of the parasympathetic system, a slower cardiac rhythm, a drop in blood pressure and increased secretion of saliva. Consequently, paralinguistic cues relating to emotion have a huge effect on the ultimate meaning of the message [18]. This paper refers to my previous research [19], where a novel method for emotional speech recognition based on a committee of classifiers was presented. This method is based on a set of classifiers (nodes) whose individual predictions are combined to make the final decision. The current paper is an extension of the previous approach. 
I investigated three different types of Polish corpora: acted out, in which the actors repeat the same sentence while expressing different emotional states [20]; acted out, in which the actors repeat several different sentences while expressing different emotional states [2]; and spontaneous speech samples collected from live shows and programs such as reality shows [21]. I combined different classification methods as nodes (k-NN, MLP, SL, SMO, Bagging, RC, j48, LMT, NBTree, RF) and juxtaposed several alternative approaches to final voting. This research shows that some of the presented approaches improve the performance with respect to a single classifier. A pool of descriptors commonly utilized for emotional speech recognition, expanded with sets of various perceptual coefficients, is used as input feature vectors. The following list summarises the contributions of this work:


The paper is structured as follows. The next section presents a brief review of works related to speech emotion recognition (SER). Section 3 describes the proposed research methodology: relevant corpora of emotional voice, speech signal descriptors and an outline of the adopted strategy for emotion recognition. Section 4 presents the obtained results, followed by their discussion. Finally, Section 5 gives the conclusion and future directions of this research.

#### **2. Related Works**

Since emotion recognition from the speech signal is a pattern recognition problem, a standard approach consisting of three processes (feature extraction, feature selection, and classification) is used to solve the task. The main research issue is the selection of an optimal feature set that efficiently characterizes the emotional content of the utterance. The number of acoustic parameters proven to contain emotional information is still increasing. Generally, the most commonly used features can be divided into three groups: prosodic features (e.g., fundamental frequency, energy, speed of speech) [22], quality characteristics (e.g., formants, brightness) [23] and spectrum characteristics (e.g., mel-frequency cepstral coefficients) [24,25]. The final feature vector is based on their statistics, such as mean, maximum, minimum, change rate, kurtosis, skewness, zero-crossing rate, variance, etc. [26,27]. However, a vector with too many features may give rise to high dimensionality and redundancy, making the learning process complicated and increasing the likelihood of overfitting [28]. Therefore, prior to classification, methods for balancing a large feature vector, such as feature selection or extraction, are studied to speed up the learning process and minimize the curse of dimensionality problem [29,30]. Emotion classification is generally performed using standard techniques such as SVM [31–33], various types of artificial neural networks (NN) [34–37], different types of the k-NN classifier [19,38] or the Hidden Markov Model (HMM) and its variations [39]. However, it is a complex task with many unresolved issues. Therefore, hybrids and multilevel classifiers [40,41] or ensemble models [42] have been widely used to enhance the performance of single classifiers. Classifying committees (Ensemble, Committee, Multiple Classifier Systems) are based on the principle of *divide and conquer*: they consist of a set of classifiers (nodes) whose individual predictions are combined. 
A necessary condition for this approach is that member classifiers should have a substantial level of disagreement, i.e., mistakes made by one node should be independent of those made by the others. The most commonly used and most intuitive technique consists of several models *C* (Figure 1a) working separately on the same or a similar feature set, with their results merged on the decision *D* level.

**Figure 1.** (**a**) Combining the results via simple voting, weighted or highest confidence voting, or other methods. (**b**) Multilevel classification.
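The decision-level combinations named in Figure 1a can be sketched as follows. This is a minimal illustration that assumes each node returns a label together with a confidence value; the function names, the node weights and the example outputs are hypothetical and do not come from the cited works.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Sum the weight of every node behind each label; return the top label."""
    totals = defaultdict(float)
    for (label, _confidence), weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


def simple_vote(predictions):
    """Majority voting: every node contributes one equally weighted vote."""
    return weighted_vote(predictions, [1.0] * len(predictions))


def highest_confidence_vote(predictions):
    """Trust the single node that reports the highest confidence."""
    return max(predictions, key=lambda p: p[1])[0]


# Hypothetical outputs of three nodes: (predicted emotion, confidence).
node_outputs = [("anger", 0.55), ("joy", 0.95), ("anger", 0.60)]
# Hypothetical node weights, e.g. validation accuracies of the nodes.
node_weights = [0.2, 0.9, 0.3]

by_majority = simple_vote(node_outputs)
by_weight = weighted_vote(node_outputs, node_weights)
by_confidence = highest_confidence_vote(node_outputs)
```

Here the majority picks "anger", whereas the weighted and highest-confidence rules both pick "joy", so the choice of combination rule can matter even for a small committee.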

This kind of approach was used in [43], where the authors present a multiple classifier system for 5 emotional states (anger, happiness, sadness, boredom and neutral), with the task performed on Mandarin speech. They investigated several classifiers, such as k-NN, weighted k-NN, Weighted Average Patterns of Categorical k-NN, Weighted Discrete k-NN and SVM. To combine results, majority voting, minimum misclassification and maximum accuracy methods were compared. The experimental results have shown that classifier combination schemes perform better than the single classifiers, with the improvement ranging from 0.9% to 6.5%. The improvement of the automatic perception of vocal emotion using ensemble methods over traditional classification is shown in [44]. The authors compared two emotional speech data sources, natural, spontaneous emotional speech and acted or portrayed emotional speech, to demonstrate the advantages and disadvantages of both. Based on prosodic features (namely fundamental frequency, energy, rhythm, and formant frequencies), two ensemble methods (stacked generalisation and unweighted vote) were applied. These techniques showed a modest improvement in prediction accuracy. In [45], the authors analysed the effectiveness of employing five ensemble models (Bagging, Adaboost, Logitboost, Random Subspace and Random Committee) for estimating emotional Arabic speech. The system recognizes happy, angry, and surprised emotions from natural speech samples. The highest improvement in accuracy in relation to the classical approach (19.09%) was obtained by the Boosting technique with the Naïve Bayes Multinomial as the node.

The multilevel approach (see Figure 1b) is predicated on splitting the classification process into several consecutive stages. For example, in [46] the authors propose a hierarchical classification, which achieves greater SER accuracy than the corresponding classical methods. In the first stage of this algorithm, the feature vector is used to separate anger and neutral (group 1) from happiness and sadness (group 2). Finally, group 1 is classified into anger and neutrality, and group 2 into happiness and sadness. A similar approach is presented in [47]. First, the emotional states are categorized according to the dimensional model into positive or negative valence and high or low arousal using a Gaussian Mixture Model and Support Vector Machines. Final decisions are made inside subsets with fewer categories using spectral representation. Studies were performed using the Berlin Emotional database [48] and the Surrey Audio-Visual Expressed Emotion corpus. In [49], the authors studied the effect of the age and gender of the speaker on the effectiveness of an emotion recognition system. They proposed a hierarchical classification model to investigate the importance of identifying those features before identifying the emotional label. They compared the performance of four different models and presented the relationship between age, gender and emotion recognition accuracy. The results showed that using a separate emotion model for each gender and age category gives a higher accuracy than using one classifier for all the data. Similarly, in [50], gender is identified on the first level. Next, dimensionality reduction using PCA, LDA and a mixed algorithm is performed according to the particular gender set. In [51], the authors underline the fuzzy nature of particular emotional states (e.g., sadness and boredom) and suggest that a global classifier cannot obtain effective results. 
Thus, they proposed a hierarchical approach, which divides the set of utterances into *active* and *passive* on the first level, in order to classify them into emotional categories on the second one. The experiments were conducted on two different corpora: the Berlin and DES [52] databases. The obtained results outperform those obtained via a single classifier.
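The two-level scheme described above can be sketched as a dispatcher that first picks a group and then delegates to a group-specific classifier. The sketch below is only an illustration of the control flow: the feature names and thresholds in the toy classifiers are arbitrary placeholders, not rules or results from the cited works, and in practice each stage would be a trained model.

```python
def hierarchical_classify(features, stage_one, stage_two):
    """Run a two-level classification.

    stage_one maps features to a group name; stage_two maps each group
    name to a classifier that separates the emotions inside that group.
    """
    group = stage_one(features)
    return group, stage_two[group](features)


# Toy stand-ins with arbitrary thresholds on hypothetical normalised features.
def toy_activity(features):
    return "active" if features["energy"] > 0.5 else "passive"


toy_stage_two = {
    "active": lambda f: "anger" if f["pitch_range"] > 0.5 else "joy",
    "passive": lambda f: "sadness" if f["pitch_mean"] < 0.3 else "boredom",
}

loud_sample = {"energy": 0.8, "pitch_range": 0.7, "pitch_mean": 0.6}
quiet_sample = {"energy": 0.2, "pitch_range": 0.1, "pitch_mean": 0.2}

loud_result = hierarchical_classify(loud_sample, toy_activity, toy_stage_two)
quiet_result = hierarchical_classify(quiet_sample, toy_activity, toy_stage_two)
```

Because the second-level classifiers only see utterances from their own group, each of them has to separate fewer and less easily confused categories.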

#### **3. Methods**

#### *3.1. Database*

As mentioned in Section 1, for the purpose of this project three different types of Polish datasets were investigated. They will be briefly described below and summarised in Table 1.


**Table 1.** Main characteristics of databases investigated in this research.

#### 3.1.1. MERIP Database

The MERIP emotional speech database is a subset of the Multimodal Emotion Recognition in Polish project [20]. The database consists of 560 samples recorded in the rehearsal room of *Teatr Nowy im. Kazimierza Dejmka w Łodzi*. Samples were collected from separate utterances of 16 professional actors/actresses (8 male and 8 female) aged from 25 to 64. The subjects were asked to utter the sentence *Każdy z nas odczuwa emocje na swój sposób* (English translation: *Each of us perceives emotions in a different manner*) while expressing different emotional states in the following order: neutral, sadness, surprise, fear, disgust, anger, and happiness (this set of discrete emotions was based on the examination conducted by Ekman in [53]). All emotions were acted out 5 times, without any guidelines or prompts from the researchers. This yielded 80 samples per emotional state. Audio files were captured using a Roland R-26 dictaphone in the form of wav audio files (44.1 kHz, 16 bit, stereo). The samples were evaluated by 12 subjects (6 male and 6 female), who were allowed to listen to each sample only once and determine the emotional state. The average emotion recognition rate was 90% (ranging from 84% to 96% for different emotional states).

#### 3.1.2. Polish Emotional Speech Database

The Polish Emotional Speech Database (PESD) [2] was prepared and shared by the Medical Electronics Division, Lodz University of Technology. The database consists of 240 samples recorded in the aula of the Polish National Film Television and Theater School in Lodz. Samples were collected from separate utterances of 8 professional actors/actresses (4 male and 4 female). Each speaker was asked to utter five different sentences (*They have bought a new car today*, *His girlfriend is coming here by plane*, *Johnny was today at the hairdresser's*, *This lamp is on the desk today* and *I stop to shave from today on*) with six types of emotional load: joy, boredom, fear, anger, sadness, and neutral (no emotion). Audio data was collected in the form of wav audio files (44.1 kHz, 16 bit). The samples were evaluated by 50 subjects through a procedure of classification of 60 randomly generated samples (10 samples per emotion). Listeners were asked to classify each utterance into emotional categories. The average emotion recognition rate was 72% (ranging from 60% to 84% for different subjects).

#### 3.1.3. Polish Spontaneous Speech Database

The spontaneous Polish Speech Database (PSSD) [21] consists of 748 samples carrying the emotional content of seven basic states from Plutchik's wheel of emotions [54] (joy, sadness, anger, fear, disgust, surprise and anticipation), plus neutral. Speech samples were collected from discussions in TV programs, live shows or reality shows, and the proportion of speakers' gender and age was maintained. Each utterance was unique and varied from one-word articulations such as *Yes* or *No*, through single words and phrases, to short sentences. Occasionally, additional sounds such as screaming, squealing, laughing or crying are featured in the corpus. The data was collected in the form of wav audio files of varied quality. The samples were evaluated by 15 male and female volunteers aged from 21 to 58. All listeners were presented with random samples that consisted of at least half of the prequalified recordings of each basic emotion. The evaluators listened to the audio samples one by one, and each assessment was recorded in the database. Every sample could be played any number of times before the final decision, but after the classification it was not possible to return to the recording. The average emotion recognition rate was 82.66% (ranging from 63% to 93% for different subjects).

To juxtapose these three different databases for the purpose of this project, an equal number of emotional sets was selected, which means that utterances expressing surprise and anticipation were omitted. Additionally, in the case of PSSD, the number of samples per emotion was unified to 80.

#### *3.2. Extracted Features*

The representation of a speech signal in the time or frequency domain is too complex to analyze, thus high-level statistical features (HLS) are usually sought to determine its properties. In most cases, a large number of HLS features are extracted at the utterance level, which is followed by dimension reduction techniques to obtain a robust representation of the problem. Feature extraction comprises two stages. First, a number of low-level (LL) features are extracted from short frames. Next, HLS functionals such as the mean, maximum, minimum, variance and standard deviation are applied to each of the LLs over the whole utterance, and the results are concatenated into a final feature vector. The role of the HLS is to describe temporal variations and contours of the different LLs during a particular speech chunk [55]. The LLs most commonly used for emotional speech recognition can be divided into two groups, prosodies and spectrum characteristics, both of which are described below.
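The two-stage extraction described above can be sketched in a few lines. The example uses frame energy as a stand-in LL feature and the five functionals named in the text; the use of the population variance and standard deviation, as well as all function names and the toy signal, are assumptions made for illustration.

```python
import statistics


def frame_signal(samples, frame_length, hop_length):
    """Split a signal into fixed-length frames taken every hop_length samples."""
    return [samples[i:i + frame_length]
            for i in range(0, len(samples) - frame_length + 1, hop_length)]


def frame_energy(frame):
    """Example LL feature: mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def hls_vector(ll_contours):
    """Apply mean, max, min, variance and std to each LL contour and concatenate."""
    vector = []
    for contour in ll_contours:
        vector += [statistics.mean(contour), max(contour), min(contour),
                   statistics.pvariance(contour), statistics.pstdev(contour)]
    return vector


# Toy utterance whose amplitude grows in three steps.
samples = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
energy_contour = [frame_energy(f) for f in frame_signal(samples, 4, 4)]

# A second, made-up LL contour to show the concatenation.
other_contour = [0.2, 0.4, 0.6]
features = hls_vector([energy_contour, other_contour])
```

Each LL contour contributes five values, so the utterance-level vector has a fixed length regardless of how many frames the utterance contains.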

#### 3.2.1. Prosodies

Speech prosodic features are associated with larger units such as syllables, words, phrases, and sentences, and are thus considered supra-segmental information. They represent the perceptual properties of speech, which are commonly used by humans to carry various information [56]. As has been repeatedly emphasised in the literature, prosodic features such as energy, duration, intonation (*F0* contour) and their derivatives are commonly used as important information sources for describing emotional states.

*F0*, the vibration frequency of the vocal folds, is inextricably linked with the scale of the human voice, accent and intonation, all of which have a considerable impact on the nature of speech. *F0* changes during an utterance, and the rate of those changes depends on the speaker's intended intonation [22]. For the purpose of this research, *F0* was extracted using the autocorrelation technique. The analysis window was set to 20 ms with 50% overlap.
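A rough sketch of autocorrelation-based *F0* estimation with a 20 ms window and 50% overlap is given below. The search range of 75 to 400 Hz, the function names and the synthetic test tone are assumptions made for illustration; the sketch also omits any voicing decision, so it is not a reproduction of the extractor used in the study.

```python
import numpy as np

def frame_signal(x, sr, win_s=0.020, overlap=0.5):
    """Split a 1-D signal into overlapping frames (20 ms, 50% overlap by default)."""
    win = int(round(win_s * sr))
    hop = int(round(win * (1.0 - overlap)))
    if len(x) < win:
        raise ValueError("signal shorter than one analysis window")
    n_frames = 1 + (len(x) - win) // hop
    return np.stack([x[k * hop : k * hop + win] for k in range(n_frames)])

def f0_autocorr(frame, sr, fmin=75.0, fmax=400.0):
    """Estimate F0 of one frame from the highest autocorrelation peak
    inside the lag range that corresponds to [fmin, fmax] (assumed range)."""
    frame = frame - frame.mean()
    # Keep non-negative lags only.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sr / fmax)
    lag_max = min(int(sr / fmin), len(ac) - 1)
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
    return sr / lag

# Synthetic 200 Hz tone, 1 s long, sampled at 16 kHz (illustrative input).
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 200.0 * t)

frames = frame_signal(x, sr)
f0_track = [f0_autocorr(fr, sr) for fr in frames]
print(frames.shape, f0_track[0])
```

Note that a 20 ms window holds only a few pitch periods for low voices, so the lower bound of the search range is limited by the window length.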

Another feature that provides information useful in distinguishing emotions is signal energy, which describes the volume or intensity of speech. For example, some emotional states, like joy or anger, have increased energy levels in comparison to other emotional states.

#### 3.2.2. Spectrum Characteristics

Nowadays, perceptual features are standard in voice recognition, and they are also used in emotional speech analysis. The perceptual approach is based on a frequency conversion that corresponds to the subjective reception of the human auditory system. For this purpose, perceptual scales such as Mel or Bark are used. In this paper, Mel Frequency Cepstral Coefficients (*MFCC*) [57], Human Factor Cepstral Coefficients (*HFCC*) [58], Bark Frequency Cepstral Coefficients (*BFCC*) [59], Perceptual Linear Prediction (*PLP*) [60] and Revised Perceptual Linear Prediction (*RPLP*) [59] coefficients are employed. Additionally, Linear Prediction Coefficients (LPC) [61] were taken into consideration, as they are the most frequently used features for speech recognition. Initially, the number of coefficients for each perceptual feature set was set to 12. For all of the above-mentioned LL sets, HLS such as maximum, minimum, range, mean and standard deviation were determined.

Another important feature type, describing properties of the vocal tract, is the set of formant frequencies, at which local maxima of the speech signal spectrum envelope occur. They can be utilized to determine the speaker's identity and the form and content of their utterance [62]. Usually 3 to 5 formants are applied in practice; this paper therefore estimates 3 of them and, on their basis, determines HLS such as mean, median, standard deviation, maximum and minimum, giving a total of 15 features.

#### 3.2.3. Features Selection

Initially, the number of extracted HLS features amounted to 407. Correlation-based Feature Selection (CFS) algorithm [63] has been applied on the whole set of features as well as on all subsets separately in order to remove redundancy and select descriptors most relevant for analysis.

This procedure resulted in a significant reduction of the feature vector dimension: after CFS, the final vector length was 93 for MERIP, 88 for PESD and 91 for PSSD. The distribution of features before and after the selection process applied to each subset is presented in Figure 2. The selected features are presented in Appendix A, Tables A1 and A2.

**Figure 2.** Distribution of features count for particular sets before and after selection process for each database. BS—before selection, AS—after selection.

#### *3.3. Classification Model*

The proposed algorithm, presented in Figure 3, starts by dividing the HLS feature vector describing a speech sample into separate sub-vectors, one per feature group (e.g., a sub-vector with MFCC coefficients). Each sub-vector is subjected to the selection process, followed by classification using different models *M* (e.g., M1: k-NN, M2: MLP, etc.). Subsequently, among the models operating on a particular sub-vector, the one with the lowest error rate is selected for further analysis. The error rate is calculated according to Equation (1). Final voting is done among the highest-scoring models for the particular sub-vectors.

$$err = 1 - Accuracy = 1 - \frac{\#classified\\_correct}{\#classification\\_total} = \frac{\#classified\\_incorrect}{\#classification\\_total} \tag{1}$$

**Figure 3.** Proposed algorithm for emotion recognition using committee of classifiers.

In the basic algorithm, the final decision is made using equal voting. This method does not require additional calculations, only votes of individual models, rendering this process simple and effective. A decision is made collectively, using the following equations:

$$r\_i = \sum\_{j=1}^{m} d\_{ji} \tag{2}$$

$$Z = \arg\max\_{i=1}^{l} [r\_i] \tag{3}$$

where *m* is the number of classifiers (models), *l* is the number of different classes, and *d*<sub>*ji*</sub> is the decision of the *j*-th classifier for the *i*-th class.

The unequal impact of particular descriptors on recognition provides the basis for replacing equal voting with weighted voting. For each model, different weights *w*<sub>1</sub>, *w*<sub>2</sub>, ..., *w*<sub>*m*</sub> are determined, which allows more precise models to be prioritized. In this case, Equation (2) is replaced by the following:

$$r\_i = \sum\_{j=1}^{m} d\_{ji} w\_j \tag{4}$$

This approach requires the assessment (or at least comparison) of all models. In this study, weights were selected experimentally, based on the error rate of individual classifiers. The weight *w*<sub>*i*</sub> for an individual model was calculated from its error rate *err*<sub>*i*</sub> according to one of the following three equations:

$$w\_i = 1 - err\_i \tag{5}$$

$$w\_i = \frac{1}{err\_i} \tag{6}$$

$$w\_i = (\frac{1}{err\_i})^2\tag{7}$$
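The voting rules in Equations (2) to (7) can be sketched as follows. The sketch assumes that each model casts a hard vote, so that the decision for a class is 1 for the chosen class and 0 otherwise; the function names, the three-model committee and its error rates are hypothetical and chosen only to show how the weighting schemes can change the outcome.

```python
import numpy as np

def vote(decisions, n_classes, weights=None):
    """Return the winning class Z for one sample.

    decisions[j] is the class index chosen by model j (hard vote, assumed).
    With weights=None every model counts equally (Equations (2) and (3));
    otherwise the weighted sum of Equation (4) is used.
    """
    decisions = np.asarray(decisions)
    m = len(decisions)
    if weights is None:
        weights = np.ones(m)
    d = np.zeros((m, n_classes))          # d[j, i] = 1 if model j votes for class i
    d[np.arange(m), decisions] = 1.0
    r = np.asarray(weights) @ d           # r_i = sum_j d_ji * w_j
    return int(np.argmax(r))

def weights_from_error(err, scheme):
    """Weights from error rates following Equation (5), (6) or (7).
    Schemes 2 and 3 are undefined for a model with zero error."""
    err = np.asarray(err, dtype=float)
    if scheme == 1:
        return 1.0 - err
    if scheme == 2:
        return 1.0 / err
    if scheme == 3:
        return (1.0 / err) ** 2
    raise ValueError("scheme must be 1, 2 or 3")

# Hypothetical committee: two weak models vote for class 0, one strong model for class 1.
decisions = [0, 0, 1]
errs = [0.45, 0.40, 0.10]

print(vote(decisions, 2))                               # equal voting
for scheme in (1, 2, 3):
    print(scheme, vote(decisions, 2, weights_from_error(errs, scheme)))
```

In this toy case equal voting and the linear weights of Equation (5) follow the majority, whereas the inverse-error weights of Equations (6) and (7) let the single most accurate model override it.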

#### **4. Results and Discussion**

#### *4.1. Efficiency of Features Subsets*

The efficiency of the feature subsets is verified using several types of classifiers, namely k-NN, Multilayer Perceptron (MLP), Simple Logistic (SL), SMO, Bagging, Random Committee (RC), J48, LMT, NBTree and Random Forest (RF), using Weka [64] with 10-fold cross-validation. This approach makes it possible to evaluate the efficacy of each feature set and determine the most efficient ones. Tables 2–4 present the efficiency of the above-mentioned feature subsets obtained for the three independent speech corpora. In the course of the research, the parameters of each classifier were tuned to achieve the highest recognition results.

**Table 2.** Average recognition results [%] of features subsets for MERIP database.



**Table 3.** Average recognition results [%] of features subsets for PESD database.

**Table 4.** Average recognition results [%] of features subsets for PSSD database.


It is clearly visible that the best results are achieved for the subsets containing perceptual coefficients (MERIP: 59.33% using MFCC, PESD: 61.5% using BFCC, PSSD: 80.68% using MFCC). In each case, these results are obtained using a different classification algorithm: k-NN, MLP and RF for MERIP, PESD and PSSD, respectively. The lowest results are obtained for F0, formants and energy, and this is noticeable for all datasets.

Analyzing the results retrieved from different models, a significant recognition rate improvement can be observed in most cases when using the RF classifier. This is especially evident for the MERIP and PSSD corpora, where RF gave the best results for 6 out of 10 models in the case of MERIP and 5 out of 10 in the case of PSSD. When it comes to PESD, MLP gives the best recognition results for 4 out of 10 models. Other classifiers (k-NN, SMO, Bagging or LMT) give the best results in individual cases, but without any repeatable pattern. The SL, RC, NBTree and J48 algorithms did not take the lead in any model and thus will be omitted in further analysis.

There is a discrepancy between different types of databases (acted out: MERIP and PESD, and spontaneous: PSSD) as well as between databases of the same type (MERIP and PESD). Thus, it can be assumed that recognition is affected not only by the type of database, but also by its size and by the type of samples, such as the uttered sentences and the individual features of the speaker. Such varied results and the lack of repeatability indicate the necessity of conducting efficiency tests and selecting appropriate methods every time the corpus is modified.

#### *4.2. Efficiency of Proposed Algorithm*

Based on the results presented in the previous section, the classifiers providing the highest results on specific feature sets are selected to be part of the proposed algorithm. Thus, for example, in the case of MERIP, the final algorithm consists of: RF for F0, LPC, HFCC, PLP and RPLP; LMT for energy and formants; k-NN for MFCC; and MLP for BFCC. Next, the error rate of each model is taken into account to calculate the weights for weighted voting (see Figure 4). To assess the proposed method, the results are compared with those obtained using the classical approach: applying common classifiers to the whole feature set (see Table A1).

According to Tables 5 and 6, an improvement in overall accuracy using the proposed algorithm can be observed in comparison to commonly used classifiers for all datasets. The smallest increase is observed for MERIP: MLP gives 66.9% and the third method of weighted voting 69.38%. It is important to note that equal voting among the best models gives lower recognition than MLP. Significantly improved recognition quality can be observed for PESD and PSSD, where the proposed method boosts the overall accuracy from 66.83% (k-NN) and 83.52% to 76.25% and 86.14%, respectively, with weighted voting. In the case of the PSSD dataset, equal voting gives the same results as MLP. The average accuracy on the MERIP, PESD and PSSD databases is illustrated as confusion matrices in Figure 5.

**Figure 4.** The discrete distribution of the error rate obtained by selected models for each features sets for (**a**) MERIP (**b**) PESD (**c**) PSSD.

**Table 5.** Average recognition results [%] of all features subsets for MERIP, PESD and PSSD database using commonly known classifiers.


**Table 6.** Average recognition results [%] of all features subsets for MERIP, PESD and PSSD database using proposed algorithm with equal voting (EV) juxtaposed with three different approaches for weighted voting.


**Figure 5.** Confusion matrices presenting the best results obtained for (**a**) MERIP (**b**) PESD (**c**) PSSD. Emotional states: An—anger, Di—disgust, Fe—fear, Ha—happiness, Ne—neutral, Sa—sadness.

Analysis of the confusion matrices shows that the mistakes are different for each database. For example, in the case of MERIP, anger and happiness are most often confused, and the same issue occurs for PESD. However, in the case of PSSD, misrecognition of anger and sadness is more clearly visible. Additionally, it can be observed that confusion among boredom, sadness and neutral is a common mistake for all datasets.

Table 7 presents the accuracy achieved in state-of-the-art research on PESD and PSSD datasets, which has been improved using the algorithm proposed in this paper. It is impossible to compare results for MERIP, since the database has been released recently and, up to now, there has been no research carried out on it.


**Table 7.** Comparison with similar works.

In order to verify whether an acted-out database can be used as a training set for an application operating in a real environment, selected classifiers were tested using mixed sets. In the first experiment, the training set consists of one of the acted-out databases (MERIP or PESD). In the second experiment, both sets are combined, creating a larger training set. PSSD is the testing set in both cases. The obtained results are presented in Table 8.

**Table 8.** The average emotion recognition rates for mixed database. Columns named *EV*, *Verr*1, *Verr*2, *Verr*<sup>3</sup> represent the voting methods proposed in this paper.


As expected, the effectiveness of classifiers whose testing and training sets comprise different datasets is much lower than that of classifiers operating on one particular database. When a single acted-out database is used for training, the average emotion recognition rate barely exceeds 30%. Increasing the number of samples in the training set by combining both acted-out datasets increased the quality of the classification. However, even in this case, the results do not exceed 50%. It should be noted that, as in previous cases, the proposed algorithm gives better results.

#### **5. Conclusions**

In this paper, the performance of a committee of classifiers working on small subsets of features was studied, and competitive performance in speech-based emotion recognition was shown. The proposed algorithm was tested on three different types of databases, and in every case it achieved performance equal to or better than current state-of-the-art methods. The obtained results look promising when working within one particular database; however, when it comes to mixed-database classification, the results are much lower and require further study. The research indicates that using an acted-out database as the training set of a model that is supposed to operate in real conditions is not the perfect approach. To achieve higher results, it is recommended either to use a training set with a larger number of samples than the test set or to train the model using spontaneous speech samples. This is crucial for creating a system that operates in a real-world environment. Future work may include adding a gender recognition module right before the classification of emotional states, since a huge impact of gender on SER has been noted in many papers. It is also worth exploring and examining robust features that help differentiate between emotional states with similar resonance, such as anger and happiness, as well as neutral, sadness and boredom. Additionally, replacing the classic models with deep learning, e.g., CNN or LSTM-RNN, can be considered on the grounds that neural networks provide good results in SER. At the same time, it must be emphasized that deep learning requires a large number of training samples, whereas widely used and accessible databases still have their limitations.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The author declares no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **Appendix A**

This section provides the details about selected features for each set.

**Table A1.** Feature selected applied on the whole sets for MERIP, PESD and PSSD corpora.


**Table A2.** Feature sets after subsets selection obtained for MERIP, PESD and PSSD corpora.



#### **References**


© 2019 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Entropy-Based Clustering Algorithm for Fingerprint Singular Point Detection**

**Ngoc Tuyen Le <sup>1</sup>, Duc Huy Le <sup>2</sup>, Jing-Wein Wang <sup>1,</sup>\* and Chih-Chiang Wang <sup>3</sup>**


Received: 19 July 2019; Accepted: 9 August 2019; Published: 12 August 2019

**Abstract:** Fingerprints have long been used in automated fingerprint identification or verification systems. Singular points (SPs), namely the core and delta point, are the basic features widely used for fingerprint registration, orientation field estimation, and fingerprint classification. In this study, we propose an adaptive method to detect SPs in a fingerprint image. The algorithm consists of three stages. First, an innovative enhancement method based on singular value decomposition is applied to remove the background of the fingerprint image. Second, a blurring detection and boundary segmentation algorithm based on the innovative image enhancement is proposed to detect the region of impression. Finally, an adaptive method based on wavelet extrema and the Henry system for core point detection is proposed. Experiments conducted using the FVC2002 DB1 and DB2 databases prove that our method can detect SPs reliably.

**Keywords:** singular point detection; boundary segmentation; blurring detection; fingerprint image enhancement; fingerprint quality

#### **1. Introduction**

Fingerprint biometrics is increasingly being used in the commercial, civilian, physiological, and financial domains based on two important characteristics of fingerprints: (1) fingerprints do not change with time and (2) every individual's fingerprints are unique [1–5]. Owing to these characteristics, fingerprints have long been used in automated fingerprint identification or verification systems. These systems rely on accurate recognition of fingerprint features. At the global level, fingerprints have ridge flows assembled in a specific formation, resulting in different ridge topology patterns such as core and delta (singular points (SPs)), as shown in Figure 1a. These SPs are the basic features required for fingerprint classification and indexing. Local fingerprint features are carried by local ridge details such as ridge endings and bifurcations (minutiae), as shown in Figure 1b. Fingerprint minutiae are often used to conduct matching tasks because they are generally stable and highly distinctive [6].

Most previous SP extraction algorithms were performed directly over fingerprint orientation images. The most popular method is based on the Poincaré index [7], which typically computes the accumulated rotation of the vector field along a closed curve surrounding a local point. Wang et al. [8] proposed a fingerprint orientation model based on 2D Fourier expansions to extract SPs independently. Nilsson and Bigun [9] as well as Liu [10] used the symmetry properties of SPs to extract them by applying a complex filter to the orientation field at multiple resolution scales and detecting the parabolic and triangular symmetry associated with core and delta points. Zhou et al. [11] proposed a feature of differences of the orientation values along a circle (DORIC) in addition to the Poincaré index to effectively remove spurious detections, take the topological relations of SPs as a global constraint for fingerprints, and use the global orientation field for SP detection. Chen et al. [12] obtained candidate SPs by the multiscale analysis of orientation entropy and then applied some post-processing steps to filter the spurious core and delta points.

**Figure 1.** The global and local features in the fingerprint. (**a**) Singular points (SPs) (square: core; triangle: delta) and (**b**) minutiae (red circle: ridge ending; blue circle: bifurcation).

However, SP detection is sensitive to noise, and extracting SPs reliably is a very challenging problem. When input fingerprint images have poor quality, the performance of these methods degrades rapidly. Noise in fingerprint images makes SP extraction unreliable and may result in a missed or wrong detection. Therefore, fingerprint image enhancement is a key step before extracting SPs.

Fingerprint image enhancement remains an active area of research. Researchers have attempted to reduce noise and improve the contrast between ridges and valleys in fingerprint images. Most fingerprint image enhancement algorithms are based on the estimation of an orientation field [13–15]. Some methods use variations of Gabor filters to enhance fingerprint images [16,17]. These methods are based on the estimation of a single orientation and a single frequency; they can remove undesired noise and preserve and improve the clarity of ridge and valley structures in images. However, they are not suitable for enhancing ridges in regions with high curvature. Wang and Wang [18] first detected the SP area and then improved it by applying a bandpass filter in the Fourier domain. However, detecting the SP region when the fingerprint image has extremely poor quality is highly difficult. Yang et al. [19] first enhanced fingerprint images in the spatial domain with a spatial ridge-compensation filter by learning from the images and then used a frequency bandpass filter that is separable in the radial- and angular-frequency domains. Yun and Cho [20] analyzed fingerprint images, divided them into oily, neutral, and dry according to their properties, and then applied a specific enhancement strategy for each type. To enhance fingerprint images, Fronthaler et al. [21] used a Laplacian-like image pyramid to decompose the original fingerprint into subbands corresponding to different spatial scales and then performed contextual smoothing on these pyramid levels, where the corresponding filtering directions stem from the frequency-adapted structure tensor. Bennet and Perumal [22] transformed fingerprint images into the wavelet domain and then used singular value decomposition (SVD) to decompose the low subband coefficient matrix. 
Fingerprint images were enhanced by multiplying the singular value matrix of the low-low (LL) subband by the ratio of the largest singular value of a generated normalized matrix (mean 0, variance 1) to the largest singular value of the LL subband. However, the resulting images were sometimes uneven, because SVD was applied only to the low subband and a generated normalized matrix was used. To overcome this problem, Wang et al. [23] introduced a novel lighting compensation scheme involving the use of adaptive SVD on wavelet coefficients. First, they decomposed the input fingerprint image into four subbands by 2D discrete wavelet transform (DWT). Subsequently, they compensated fingerprint images by adaptively obtaining the compensation coefficients for each subband based on the referred Gaussian template.

The aforementioned methods for enhancing fingerprint images can reduce noise and improve the contrast between ridges and valleys in the images. However, they are not very effective on fingerprint images of very poor quality, particularly blurred ones. To overcome this problem, we need to segment the fingerprint foreground, with its interleaved ridge and valley structure, from the complex background with non-fingerprint patterns for more accurate and efficient feature extraction and identification. Many studies have investigated segmentation on rolled and plain fingerprint images. Mehtre et al. [24] partitioned a fingerprint image into blocks and then performed block classification based on gradient and variance information to segment fingerprint images. This method was further extended to a composite method [25] that takes advantage of both the directional and the variance approaches. Zhang et al. [26] proposed an adaptive total variation decomposition model by incorporating the orientation field and local orientation coherence for latent fingerprint segmentation. Based on a ridge quality measure that was defined as the structural similarity between the fingerprint patch and its dictionary-based reconstruction, Cao et al. [27] proposed a learning-based method for latent fingerprint image segmentation.

This study proposes an efficient approach by combining the novel adaptive image enhancement, compact boundary segmentation, and a novel clustering algorithm by integrating wavelet frame entropy with region growing to evaluate the fingerprint image quality so as to validate the SPs. Experiments were conducted on FVC2002 DB1 and FVC2002 DB2 databases [28]. The experimental results indicate the excellent performance of the proposed method.

The rest of this paper is organized as follows. Section 2 introduces the proposed image enhancement, precise boundary segmentation, and blurring detection based on wavelet entropy clustering algorithm. Section 3 describes the proposed algorithm for SP detection. Section 4 presents experimental results to verify the proposed approach. Finally, Section 5 presents the conclusions of this study.

#### **2. Blurring Detection for Fingerprint Impression**

#### *2.1. Fingerprint Background Removal*

SVD has been widely used in digital image processing [29–31]. Without loss of generality, we suppose that *f* is a fingerprint image with a resolution of *M* × *N* (*M* ≥ *N*). The SVD of a fingerprint image *f* can be written as follows:

$$f = U\Sigma V^T,\tag{1}$$

where *U* = [*u*<sub>1</sub>, *u*<sub>2</sub>, ... , *u*<sub>*N*</sub>] and *V* = [*v*<sub>1</sub>, *v*<sub>2</sub>, ... , *v*<sub>*N*</sub>] are orthogonal matrices containing singular vectors and Σ = [D, O] contains the sorted singular values on its main diagonal. D = diag (λ<sub>1</sub>, λ<sub>2</sub>, ... , λ<sub>*k*</sub>) with singular values λ<sub>*i*</sub>, *i* = 1, 2, ... , *k* in a non-increasing order, O is an *M* × (*M* − *k*) zero matrix, and *k* is the rank of *f*. We can also expand the fingerprint image as follows:

$$f = \lambda\_1 u\_1 v\_1^T + \lambda\_2 u\_2 v\_2^T + \dots + \lambda\_k u\_k v\_k^T, \quad \lambda\_1 \ge \lambda\_2 \ge \dots \ge \lambda\_k. \tag{2}$$

The terms λ<sub>*i*</sub>*u*<sub>*i*</sub>*v*<sub>*i*</sub><sup>*T*</sup>, which contain the vector outer product in Equation (2), are the principal images. The Frobenius norm of the fingerprint image is preserved in the SVD transformation:

$$\left\|f\right\|\_{F}^{2} = \sum\_{i=1}^{k} \lambda\_{i}^{2}.\tag{3}$$

Equation (3) shows how the signal energy of f can be partitioned by the singular values in the sense of the Frobenius norm. It is common to discard the small singular values in SVD to obtain matrix approximations whose rank equals the number of remaining singular values. Good matrix approximations can always be obtained with a small fraction of the singular values. The highly concentrated property of SVD helps remove background noise from the foreground ridges.
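Equations (1) to (3) can be checked numerically with a short NumPy sketch. A small random matrix stands in for the fingerprint image, and the retained rank `r = 2` is an arbitrary illustrative choice; neither comes from the original experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.random((8, 6))                    # stand-in for an M x N image, M >= N

# Equation (1): f = U Sigma V^T (thin SVD; s holds the singular values in
# non-increasing order).
U, s, Vt = np.linalg.svd(f, full_matrices=False)

# Equation (2): the image is the sum of its principal images lambda_i u_i v_i^T.
recon = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(len(s)))

# Equation (3): the squared Frobenius norm equals the sum of squared singular values.
energy = np.linalg.norm(f, "fro") ** 2

# Rank-r approximation: keep the r largest singular values, discard the rest.
r = 2
f_r = (U[:, :r] * s[:r]) @ Vt[:r]
residual = np.linalg.norm(f - f_r, "fro") ** 2

print(np.allclose(recon, f), np.isclose(energy, np.sum(s ** 2)))
print(residual, np.sum(s[r:] ** 2))       # discarded energy matches the dropped terms
```

The last line illustrates why truncation is attractive: the energy lost by the approximation is exactly the sum of the squared singular values that were discarded.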

We performed some experiments to observe the effects of singular values on a fingerprint image. Figure 2a shows a fingerprint image in the FVC 2002 DB2 database. First, all singular values of the fingerprint image, as shown in Figure 2a, were set to 1 and the fingerprint image was then reconstructed. Figure 2b shows the reconstructed fingerprint image without the effect of singular values, implying that the singular vectors represent the background information of the given fingerprint image. Next, all singular values of the fingerprint image shown in Figure 2a were multiplied by 2 and the fingerprint image was then reconstructed. As shown in Figure 2c, the fingerprint image looks clearer and the background of the fingerprint image was removed. It suggests that the singular values represent the foreground ridges of the given fingerprint image. Thus, SVD can be used for enhancing the ridge structure and removing noise from the background of the fingerprint image. In addition, if the fingerprint image is a low-contrast image, this problem can be corrected by replacing Σ with an equalized singular matrix obtained from a normalized image, which is considered to be that with a probability density function involving a Gaussian distribution with a mean and variance calculated using the available dataset. This normalized image is called a Gaussian template image.

**Figure 2.** Effects of singular values on a fingerprint image. (**a**) Fingerprint image in FVC 2002 DB2 database; (**b**) reconstructed fingerprint image when all singular values of Figure 2a are set to 1; (**c**) reconstructed fingerprint image when all singular values of Figure 2a are multiplied by 2; (**d**) equalized fingerprint images of Figure 2a.

Based on observations of the effects of SVD on a fingerprint image, and to effectively remove the background, we examined the singular values of the fingerprint image, which contains most of the foreground information. We automatically adjusted the illumination of an image to obtain an equalized image that has a normal distribution. If the fingerprint image had low contrast, the singular values were multiplied with a scalar larger than 1. A normalized intensity image with no illumination problem can be considered an image that has a Gaussian distribution and that can easily be obtained by generating random pixel values with Gaussian distribution. Moreover, the first singular value contributes 99.72% of energy to the original image and the first two singular values contribute 99.88% of the total energy [31]. The larger singular value represents the energy of the fingerprint pattern and the smaller one, the energy of the background and noise. To effectively remove the background, we set a compensation weight, α, that enhanced the image contrast. It is easy to remove the ridge of images when the compensation weight is larger than 1, and the image contrast is reduced when the compensation weight is smaller than 1. Therefore, we compared the maximum singular value of the Gaussian template with the maximum singular value of the original fingerprint image to compute the compensation weight as follows:

$$\alpha = \begin{cases} \max\left(\frac{\max(\Sigma\_G)}{\max(\Sigma)}, \frac{\max(\Sigma)}{\max(\Sigma\_G)}\right) & \text{, } \max(\Sigma) < \eta \\ 1 & \text{, } \max(\Sigma) \ge \eta \end{cases} \tag{4}$$

where the threshold value η is experimentally set as 90,000, and Σ*<sup>G</sup>* is the singular value matrix of the Gaussian template image with mean and variance calculated from the adopted database as shown in Table 1. The equalized image, *feq*, having the same size as the original fingerprint image can be generated by the following:

$$f\_{eq} = U(\alpha\Sigma)V^T.\tag{5}$$

This step, which equalizes the fingerprint image, can eliminate the undesired background noise. As shown in Figure 2d, the background of the fingerprint image has been removed, thereby providing an image with a nearly normal distribution. It also improves the clarity and continuity of ridge structures in the fingerprint image.
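A minimal sketch of the equalization in Equations (4) and (5) is shown below. The template mean and standard deviation passed in the example are placeholders (the per-database values belong to Table 1), the low-contrast input is synthetic, and the function name `equalize` is hypothetical.

```python
import numpy as np

def equalize(f, template_mean, template_std, eta=90_000.0, seed=0):
    """Scale the singular values of f by the compensation weight alpha.

    The Gaussian template is generated from random pixel values with the
    given mean and standard deviation (placeholder values in this sketch).
    """
    rng = np.random.default_rng(seed)
    template = rng.normal(template_mean, template_std, size=f.shape)

    U, s, Vt = np.linalg.svd(f, full_matrices=False)
    s_g = np.linalg.svd(template, compute_uv=False)

    # Equation (4): compare the largest singular values of template and image.
    if s.max() < eta:
        alpha = max(s_g.max() / s.max(), s.max() / s_g.max())
    else:
        alpha = 1.0

    # Equation (5): f_eq = U (alpha * Sigma) V^T.
    f_eq = (U * (alpha * s)) @ Vt
    return f_eq, alpha

# Synthetic low-contrast image and placeholder template statistics.
rng = np.random.default_rng(1)
f = rng.uniform(0.0, 50.0, size=(64, 64))
f_eq, alpha = equalize(f, template_mean=128.0, template_std=40.0)
print(alpha)
```

Two properties follow directly from the formulas: the weight is never below 1, because it is the larger of a ratio and its reciprocal, and scaling every singular value by the same factor is equivalent to multiplying the image by that factor.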

**Table 1.** Mean and standard deviation of Gaussian distribution function in each database.


#### *2.2. Impression Region Detection and Boundary Segmentation*

The fingerprint texture should be distinguished from the background by a suitable binary threshold obtained from the energy analysis as a very useful and distinctive preprocessing for boundary segmentation. An analysis of the energy distribution of fingerprint images from the public fingerprint image database indicates a prominent distinction between the fingerprint object and the undesired background owing to the construction of ridges and valleys. In this section, we propose an impression region detection approach based on the energy difference between the impression contour and the background scene. The most obvious feature of the fingerprint ridge is the texture; it exhibits variances in the energy roughness of the impression region. Roughness corresponds to the perception that our sense of touch can feel with an object, and it can be characterized in two-dimensional scans by depth (energy strength) and width (separation between ridges). Before ridge object extraction, a smoothing filter is used to smooth the image and enhance the desired local ridge. The local standard average μ and energy ε of the 7 × 7 pixels defined by the mask are given by the following expressions:

$$\mu(\mathbf{x}, \ y) = \frac{1}{N} \sum\_{i=-3}^{3} \sum\_{j=-3}^{3} f\_{eq}(\mathbf{x} + i, \ y + j),\tag{6}$$

$$\varepsilon(\mathbf{x}, \ y) = \frac{1}{N} \sum\_{i=-3}^{3} \sum\_{j=-3}^{3} \left( f\_{\text{eq}}(\mathbf{x} + i, \ y + j) - \mu(\mathbf{x}, \ y) \right)^2,\tag{7}$$

where *feq*(*x*, *y*) is the equalized image, as discussed in Section 2.1, and *N* = 49 is a normalizing constant. For transforming the grayscale intensity image in Figure 3a into a logical map, a binarized image of the equalized image, *fb*(*x*, *y*), is obtained by extracting the interesting object from the background as follows:

$$f\_b(\mathbf{x}, y) = \begin{cases} 255, & \text{if } \varepsilon(\mathbf{x}, y) \ge 255 \\ 0, & \text{if } \varepsilon(\mathbf{x}, y) < 255 \end{cases} \tag{8}$$

where *fb*(*x*, *y*) is a binarized image; pixel values labeled 255 are objects of interest, whereas pixel values labeled 0 are undesired ones.
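Equations (6) to (8) amount to a local mean, a local variance and a fixed threshold, which can be sketched with NumPy as follows. Reflective padding at the image border is an assumption (border handling is not specified in the text), and the synthetic test image, with a flat half and a striped half, is purely illustrative.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def local_energy_map(f_eq, half=3):
    """Local mean (Eq. 6) and local energy (Eq. 7) over a (2*half+1)^2 mask.
    Border pixels are handled by reflective padding (an assumption)."""
    size = 2 * half + 1
    padded = np.pad(f_eq.astype(float), half, mode="reflect")
    windows = sliding_window_view(padded, (size, size))
    mu = windows.mean(axis=(-2, -1))
    eps = windows.var(axis=(-2, -1))      # population variance, N = size * size
    return mu, eps

def binarize(eps, threshold=255.0):
    """Equation (8): 255 where the local energy reaches the threshold, else 0."""
    return np.where(eps >= threshold, 255, 0).astype(np.uint8)

# Synthetic image: flat background on the left, ridge-like stripes on the right.
img = np.full((32, 32), 100.0)
cols = np.arange(16, 32)
img[:, 16:] = ((cols // 2) % 2) * 255.0

mu, eps = local_energy_map(img)
f_b = binarize(eps)
roi = np.argwhere(eps >= 255.0)           # pixel coordinates with high local energy
print(f_b[16, 4], f_b[16, 28], len(roi))
```

Flat regions have zero local energy and fall below the threshold, whereas the alternating stripes produce a large local variance and are marked as foreground.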

Figure 3b shows the binarized image obtained by applying Equation (8) to the equalized image. Based on the binary image shown in Figure 3b, we can detect the region of impression (ROI), which is very useful as a distinctive preprocessing step for boundary segmentation. Figure 3b shows that the proposed algorithm performs very well in discriminating the blur region. A pixel (*x*, *y*) with energy ε(*x*, *y*) ≥ 255 belongs to the ROI; therefore, we can detect the ROI, *fROI*(*x*, *y*), as follows:

$$f\_{ROI} = \{(\mathbf{x}, y) \Big| \varepsilon(\mathbf{x}, y) \ge 255\}.\tag{9}$$

To define the fingerprint contour, we determine the boundary location of the fingerprint. Most human fingerprint contours have elliptical shapes. Thus, the left, right, and horizontal projections of an elliptical fingerprint contour are searched for landmarks, starting from the two sides at intervals of 15 pixels from top to bottom. Based on the located landmarks, the contour of the fingerprint is acquired as a polygon. As illustrated in Figure 3c, the blue, green, and red lines represent the contours obtained by using the left, right, and horizontal projections, respectively. This method is advantageous because it is simple and is less influenced by finger pressure.

**Figure 3.** (**a**) Original fingerprint image in the FVC 2002 DB2 database; (**b**) binary image obtained by using energy transformation and blur detection obtained with 2D non-separable wavelet entropy filtering for Figure 3a; (**c**) segmented image of Figure 3a.

#### *2.3. Blurring Detection*

Our proposed method improves the fingerprint image quality, as discussed in Section 2.1, and the ROI is defined, as discussed in Section 2.2. However, the fingerprint image still contains a blur region within the ROI, leading to the false detection of SPs. In this section, we propose a method for detecting the blur region in a fingerprint image and then ignoring it during detection to reduce the time and improve the accuracy of SP detection.

To locate the blur region, we perform region segmentation by finding a meaningful boundary based on a point aggregation procedure. Choosing the center pixel of the region is a natural starting point. Grouping points to form the region of interest, while focusing on 4-connectivity, would yield a clustering result when there are no more pixels for inclusion in the region. After region growing, the region is measured to determine the size of the blur region. Entropy filtering for blur detection of pixels in the 11 × 11 (*N* = 11) neighborhood defined by the mask is given by the following:

$$\varepsilon\_{\rm NSDWF} = \frac{-1}{N^2} \sum\_{\mathbf{x}, \mathbf{y} = \mathbf{0}}^{N-1} |d^{HH}(\mathbf{x}, \mathbf{y})| \log |d^{HH}(\mathbf{x}, \mathbf{y})| \,\tag{10}$$

where *d*<sup>*HH*</sup> is the coefficient of a non-subsampled version of the 2D non-separable discrete wavelet transform (NSDWT) [32,33] in the high-frequency subband decomposed at the first level (*d*<sup>*HH*</sup><sub>*j*+1</sub>, *j* = 0), as shown in Figure 4. Figure 3b shows that the proposed algorithm can perform very well for discriminating the blur region.
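The entropy filter of Equation (10) can be sketched as follows, assuming that the first-level HH coefficients are already available as a 2D array. The function name `wavelet_entropy_map`, the natural logarithm, the convention 0 · log 0 = 0, and the edge-replicating border are all assumptions made for this illustration; the NSDWT itself is not implemented here.

```python
import numpy as np


def wavelet_entropy_map(d_hh, n=11):
    """Eq. (10) evaluated over every n x n neighbourhood of the HH coefficients."""
    mag = np.abs(np.asarray(d_hh, dtype=np.float64))
    # Assumption: 0 * log(0) is taken as 0, so zeros are replaced by 1 inside the log.
    safe = np.where(mag > 0, mag, 1.0)
    term = mag * np.log(safe)  # natural logarithm assumed
    half = n // 2
    # Assumption: replicate edge values so the map keeps the input size.
    padded = np.pad(term, half, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, (n, n))
    return -windows.sum(axis=(-2, -1)) / (n * n)
```

The resulting map has one entropy value per pixel; how it is thresholded and combined with region growing to delimit the blur region follows the procedure described in the text.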

**Figure 4.** Filter bank implementation of 2D non-separable discrete wavelet transform (NSDWT), *j*: level.

#### **3. SP Detection**

In general, SPs of a fingerprint are detected by a Poincaré index-based algorithm. However, the Poincaré index method usually results in considerable spurious detection, particularly for low-quality fingerprint images. This is because the conventional Poincaré index along the boundary of a given region equals the sum of the Poincaré indices of the core points within this region, and it contains no information about the characteristics and cannot describe the core point completely. To overcome the shortcoming of the Poincaré index method, we propose an adaptive method based on wavelet extrema for core point detection. Wavelet extrema contain information on both the transform modulus maxima and minima in the image, considered to be among the most meaningful features for signal characterization.

First, we align the ROI based on the Poincaré's core points and the local orientation field. The Poincaré index at pixel (*x*,*y*), which is enclosed by 12 direction fields taken in a counterclockwise direction, is calculated as follows:

$$\text{Poincaré}(x, y) = \frac{1}{2\pi} \sum_{k=0}^{M-1} \Delta(k), \tag{11}$$

where

$$\Delta(k) = \begin{cases} \delta(k), & \text{if } |\delta(k)| < \pi/2 \\ \pi + \delta(k), & \text{if } \delta(k) \le -\pi/2 \\ \pi - \delta(k), & \text{otherwise} \end{cases} \tag{12}$$

and

$$\delta(k) = \theta(\mathbf{x}(k'), y(k')) - \theta(\mathbf{x}(k), y(k)); \; k' = (k+1) \bmod M; \; M = 12,\tag{13}$$

where (*x*(*k*′), *y*(*k*′)) and (*x*(*k*), *y*(*k*)) are the paired neighboring coordinates of the direction fields. A core point has a Poincaré index of +1/2, whereas a delta point has a Poincaré index of −1/2. The core points detected in this step are called rough core points.
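The Poincaré index computation can be illustrated with a short Python sketch. It takes the *M* orientations sampled counterclockwise around a pixel, in radians and reduced modulo π, and sums the wrapped differences. The function name `poincare_index` is hypothetical. Note one deliberate assumption: for differences of +π/2 or more, the sketch uses δ(*k*) − π, i.e., it wraps every difference into [−π/2, π/2), because this choice returns exactly −1/2 for a delta-like field; it should be read as an illustration rather than as the exact rule used in the experiments.

```python
import math


def poincare_index(thetas):
    """Poincare index from orientations (radians, modulo pi) sampled
    counterclockwise on a closed path around a pixel."""
    m = len(thetas)
    total = 0.0
    for k in range(m):
        delta = thetas[(k + 1) % m] - thetas[k]
        if abs(delta) < math.pi / 2:
            total += delta
        elif delta <= -math.pi / 2:
            total += math.pi + delta
        else:
            # Assumption: wrap large positive jumps back by pi.
            total += delta - math.pi
    return total / (2 * math.pi)
```

With *M* = 12, a field whose orientation increases by π/12 per step yields an index of +1/2 (core), a field whose orientation decreases by π/12 per step yields −1/2 (delta), and a constant field yields 0.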

Next, we align the fingerprint image under the rectangular coordinate system based on the number and location of the preliminary core points. Because fingerprints may have different numbers of cores, the first step in alignment is to adopt the preliminary Poincaré indexed positions as a reference. If the number of preliminary cores is 2, the image is rotated along the orientation calculated from the midpoint between the two cores. If the number of cores is 1, the image is rotated along the direction calculated from the neighboring orientation of the core. If the number of cores is zero, the fingerprint is kept intact. The rotation angle is calculated as follows:

$$\underset{j < y_c}{\phi} = \frac{1}{2} \tan^{-1} \frac{\sum_{i \in \zeta} \sin 2O_{i,j}}{\sum_{i \in \zeta} \cos 2O_{i,j}}, \tag{14}$$

where *O*<sub>*i*,*j*</sub> is the local orientation around a pixel and ζ is the core subregion of interest (COI) centered at the Poincaré index core point (*x<sub>c</sub>*, *y<sub>c</sub>*) with a size of 60 × 60 pixels, which was chosen to avoid possible variability near the boundary when the finger is placed on the reader. Fingerprint alignment is performed to make the pattern rotation-invariant and to reduce the false rejection rate. The rotation is given by the following equation:

$$\begin{cases} y' = x \sin \phi + y \cos \phi \\ x' = x \cos \phi - y \sin \phi \end{cases} \tag{15}$$

and point (*x*, *y*) with orientation angle φ is mapped to point (*x*′, *y*′). Figure 5 shows examples of fingerprint alignment by our method for different numbers of cores.
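Equations (14) and (15) can be sketched in Python as follows. The function names are hypothetical, the caller is assumed to pass only the local orientations selected inside the COI, and `arctan2` is used in place of a plain arctangent of the ratio so that a zero denominator is handled; that substitution is an assumption of this sketch.

```python
import numpy as np


def dominant_orientation(orientations):
    """Eq. (14): rotation angle from local orientations (radians) in the COI."""
    o = np.asarray(orientations, dtype=np.float64)
    # Assumption: arctan2 replaces tan^-1 of the ratio to resolve the quadrant.
    return 0.5 * np.arctan2(np.sin(2 * o).sum(), np.cos(2 * o).sum())


def rotate_point(x, y, phi):
    """Eq. (15): map (x, y) to (x', y') by the rotation angle phi."""
    x_new = x * np.cos(phi) - y * np.sin(phi)
    y_new = x * np.sin(phi) + y * np.cos(phi)
    return x_new, y_new
```

Doubling the angles before averaging makes orientations that differ by π equivalent, which is why the factor 1/2 appears in front of the arctangent.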

**Figure 5.** Fingerprint alignment: (**a**) number of cores = 0; (**b**) number of cores = 1; (**c**,**d**) number of cores = 2.

After alignment, the COI subregion with size of 60 × 60 pixels centered at the Poincaré's detected point is further segmented from the aligned image. The COI then goes through a skeletonization process to peel off as many ridge pixels as possible without affecting the general shape of the ridge, as shown in Figure 6a, and is then transformed to a skeletonized ridge image, as shown in Figure 6b. The skeletonized ridge image is used to compute the wavelet extrema, as shown in Figure 6c.

**Figure 6.** (**a**) COI subregion; (**b**) skeletonized ridges; (**c**) 2D wavelet extrema.

Wavelet modulus maxima representations for two-dimensional signals were proposed by Mallat [33] as a tool for extracting information on singularities, which were considered to be among the most meaningful features for signal characterization. Most wavelet transform local extrema are actually modulus maxima (there are examples of signals for which the wavelet extrema and modulus representations are the same). The set of indices and the local maximum, denoted as *M*(*f*), and local minimum, denoted as *m*(*f*), of skeletonized ridge image *f* are defined as follows:

$$M(f) = \{(z, f(z)) : f(z - 1) \le f(z) \text{ and } f(z + 1) \le f(z)\}, \tag{16}$$

$$m(f) = \{(z, f(z)) : f(z - 1) \ge f(z) \text{ and } f(z + 1) \ge f(z)\}, \tag{17}$$

where *z* ∈ *Z*. Similarly, the indices and values of the wavelet transform extrema for an image *f* are defined as follows:

$$E(f) = \left[ \left\{ M(w_j(f)) \right\} \cup \left\{ m(w_j(f)) \right\};\ j = 1, 2, \dots, J \right], \tag{18}$$

where *wj*(*f*) is the 2D non-separable wavelet transform of image *f* at scale *j*, *j* = 1, 2, ... , *J*. The SP of a fingerprint image can be found by extracting curvature primitives and discovering the location of these primitives in the subregion, as shown in Figure 6c.
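The definitions in Equations (16) and (17) can be illustrated on a one-dimensional sequence, for example, values sampled along a traced ridge curve. The sketch below is a minimal illustration with a hypothetical function name; skipping the two end samples, which have only one neighbor, is an assumption.

```python
def local_extrema(f):
    """Return (maxima, minima) as lists of (z, f[z]) following Eqs. (16)-(17)."""
    maxima, minima = [], []
    # Assumption: the first and last samples are skipped (one neighbour only).
    for z in range(1, len(f) - 1):
        if f[z - 1] <= f[z] and f[z + 1] <= f[z]:
            maxima.append((z, f[z]))
        if f[z - 1] >= f[z] and f[z + 1] >= f[z]:
            minima.append((z, f[z]))
    return maxima, minima
```

Because both definitions use non-strict inequalities, samples on a plateau can be reported as extrema; applying the function to the transform at each scale *j* and merging the two lists gives the set in Equation (18).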

We find the exact location of the core point defined by the Henry system and trace the skeletonized ridge curves with 8-adjacency to explore wavelet extrema in 1-pixel increments by starting at 10 pixels apart from two sides. The highest extrema in the ridge curve correspond to core point candidates. We devise two 8-adjacency grids to locate the wavelet extrema (Figure 7a,b). Beginning from two opposite ends and moving toward the center of the subregion, the black-colored pixel of each grid is designated as the central point to trace. Based on this central point, the moving guideline is as follows: if the gray-level of the adjacent pixel is 0, then move toward that pixel, where the number shown in the grid indicates the moving sequence. This method enables one to follow the real track of the ridge curve. Whenever a singularity is detected, its location is noted. Figure 7c shows that it is common to find multiple core point candidates with small vertical displacements, and the area below the lowest ridge curve is circumscribed for locating the core point. In the Henry system, exact core point location can be performed as follows: (a) locate the topmost extrema in the innermost ridge curve if there is no rod; (b) otherwise, locate the top of the rods. The following equation summarizes this process:

$$s = \begin{cases} \omega_{e,0}, & i = 0 \\ \omega_{e,(i/2)+(i \bmod 2)}, & i \ge 1 \end{cases} \tag{19}$$

where *s* is the determined core point, *i* is the number of rods below the innermost ridge curve, ω<sub>*e*,0</sub> is the topmost extremum in the innermost ridge curve, and ω<sub>*e*,(*i*/2)+(*i* mod 2)</sub> is the located rod extremum below the innermost ridge curve. Figure 7d presents an example marked with the blue cross.
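The index arithmetic in Equation (19) can be made concrete with a small sketch. It assumes that *i*/2 denotes integer division and that the rod extrema are numbered from 1 in the order in which they were located; both are assumptions, as is the function name.

```python
def select_core_point(topmost_extremum, rod_extrema):
    """Eq. (19): pick the core point from the innermost ridge or its rods."""
    i = len(rod_extrema)
    if i == 0:
        return topmost_extremum
    # Assumption: integer division and 1-based rod numbering.
    index = i // 2 + i % 2
    return rod_extrema[index - 1]
```

Under these assumptions, the rule selects the single rod when *i* = 1, the first rod when *i* = 2, and the middle rod when *i* = 3.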

**Figure 7.** Core point detection based on wavelet extrema and Henry system. (**a**) Two 8-adjacency grids moving toward each other along the ridge curve indicated in yellow; (**b**) traced path of the ridge curve (green line: from left to right); (**c**) SP located at the lowest ridge curve (red square) and the area beneath (blue line: searching extrema from right to left); (**d**) SP detection in accordance with the Henry system (blue cross).

#### **4. Experimental Results and Discussion**

In this section, to illustrate the effectiveness of our proposed method, we present experiments performed on both the FVC2002 DB1 and DB2 fingerprint databases. FVC2002 includes four databases, namely, DB1, DB2, DB3, and DB4, collected using different sensors or technologies that are widely used in practice. Each database is 110 fingers wide (w) and 8 impressions per finger deep (d) (880 fingerprints in all). Fingers 101 to 110 (set B) were made available to the participants to allow for parameter tuning before the submission of the algorithms. The benchmark is then constituted by fingers numbered from 1 to 100 (set A). Volunteers were randomly partitioned into three groups (30 persons each); each group was associated with a database and therefore with a different fingerprint scanner. Each volunteer was invited to present themselves at the collection place in three distinct sessions, with at least two weeks between sessions. The forefinger and middle finger of both hands (in total, four fingers) of each volunteer were acquired by interleaving the acquisition of the different fingers to maximize differences in finger placement. No efforts were made to control image quality, and the sensor platens were not systematically cleaned. In each session, four impressions were acquired of each of the four fingers of each volunteer. During the second session, individuals were requested to exaggerate the displacement (impressions 1 and 2) and rotation (impressions 3 and 4) of the finger without exceeding 35°. During the third session, fingers were alternately dried (impressions 1 and 2) and moistened (impressions 3 and 4). The SPs of all fingerprints in the testing database were manually labeled beforehand to obtain the ground truth. For a ground-truth *SP*(*x*<sub>0</sub>, *y*<sub>0</sub>), a detected *SP*(*x*, *y*) is said to be truly detected if it satisfies √((*x* − *x*<sub>0</sub>)<sup>2</sup> + (*y* − *y*<sub>0</sub>)<sup>2</sup>) < 10 pixels; otherwise, it is called a miss.

The singular point detection rate (SDR) is defined as the ratio of truly detected SPs to all ground-truth SPs:

$$\text{SDR} = \frac{\text{Num(truly detected SPs)}}{\text{Num(ground-truth SPs)}} \times 100\%. \tag{20}$$

The singular point miss rate (SMR) is defined as the ratio of the number of missed SPs to the number of all ground-truth SPs. The sum of the detection rate and miss rate is 100%:

$$\text{SMR} = \frac{\text{Num(missed SPs)}}{\text{Num(ground-truth SPs)}} \times 100\% = 100\% - \text{SDR}. \tag{21}$$

The singular point false alarm rate (SFR) is defined as the ratio of the number of falsely detected SPs to the total number of ground-truth SPs:

$$\text{SFR} = \frac{\text{Num(falsely detected SPs)}}{\text{Num(ground-truth SPs)}} \times 100\%. \tag{22}$$

The singular point correctly detected rate (SCR) is defined as the ratio of all truly detected SPs to all detected SPs over all fingerprint images:

$$\text{SCR} = \frac{\text{Num(truly detected SPs)}}{\text{Num(detected SPs)}} \times 100\%. \tag{23}$$
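The four rates in Equations (20)–(23) can be computed with a short Python sketch. It assumes that a detection counts as true when its Euclidean distance to a not-yet-matched ground-truth SP is below 10 pixels and that each ground-truth SP can be matched at most once (a greedy one-to-one matching); the matching strategy and the function name are assumptions of this sketch, not a description of the evaluation code used here.

```python
import math


def sp_detection_rates(ground_truth, detected, tol=10.0):
    """Return SDR, SMR, SFR and SCR (in percent) for lists of (x, y) points."""
    n_gt, n_det = len(ground_truth), len(detected)
    if n_gt == 0 or n_det == 0:
        raise ValueError("need at least one ground-truth and one detected SP")
    unmatched = list(ground_truth)
    true_hits = 0
    for x, y in detected:
        for gt in unmatched:
            # Assumption: greedy one-to-one matching within `tol` pixels.
            if math.hypot(x - gt[0], y - gt[1]) < tol:
                unmatched.remove(gt)
                true_hits += 1
                break
    return {
        "SDR": 100.0 * true_hits / n_gt,
        "SMR": 100.0 * (n_gt - true_hits) / n_gt,
        "SFR": 100.0 * (n_det - true_hits) / n_gt,
        "SCR": 100.0 * true_hits / n_det,
    }
```

Note that SDR and SMR always sum to 100%, whereas SFR is normalized by the number of ground-truth SPs and SCR by the number of detections, so the latter two do not sum to 100% in general.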

First, the compensation weight coefficients are calculated by using Equation (4) and the equalized image, *feq*, having the same size as the original fingerprint image can be generated by Equation (5). Figures 8 and 9 show example image results of the proposed method for FVC2002 DB1 and DB2, respectively. As shown in Figures 8c and 9b, the background of the fingerprint image has been removed, thereby providing an image with nearly normal distribution. It also improves the clarity and continuity of ridge structures in the fingerprint image.

**Figure 8.** Results of our proposed method for the FVC2002 DB1 database. (**a**) Original fingerprint images; (**b**) histogram of Figure 8a; (**c**) equalized fingerprint images of Figure 8a; (**d**) histogram of Figure 8c.

**Figure 9.** Results of our proposed method for the FVC2002 DB2 database. (**a**) Original fingerprint images and (**b**) equalized fingerprint images of Figure 9a.

Then, we show the effectiveness by comparing the amount of information in our method and in the original fingerprint images by using the entropy of an image. The entropy of information *H* was introduced by Shannon [34] in 1948, and it can be calculated by the following equation:

$$H = -\sum_{i=0}^{255} p_i \log_2 p_i, \tag{24}$$

where *p<sub>i</sub>* denotes the probability mass function of gray level *i*, and it is calculated as follows:

$$p_i = \frac{\text{number of occurrences of intensity level } i}{\text{total number of pixels}}. \tag{25}$$
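For illustration, the entropy of an 8-bit grayscale image can be computed with NumPy as sketched below. The function name is hypothetical; the sketch normalizes the histogram by the total number of pixels so that the *p<sub>i</sub>* sum to one, and it skips empty gray levels, i.e., it assumes the convention 0 · log 0 = 0.

```python
import numpy as np


def image_entropy(image):
    """Shannon entropy (bits) of an 8-bit grayscale image, as in Eq. (24)."""
    img = np.asarray(image, dtype=np.uint8)
    counts = np.bincount(img.ravel(), minlength=256)
    p = counts / counts.sum()
    p = p[p > 0]  # assumption: empty gray levels contribute nothing
    return float(-(p * np.log2(p)).sum())
```

A constant image has an entropy of 0 bits, and an image in which all 256 gray levels occur equally often reaches the maximum of 8 bits.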

In digital image processing, entropy is a measure of an image's information content, which is interpreted as the average uncertainty of the information source. The entropy of an image can be used for measuring image visual aspects [35] or for gathering information to be used as parameters in some systems [36]. Entropy is widely used for measuring the amount of information within an image. Higher entropy implies that an image contains more information.

Entropy is measured to quantify the information produced from the enhanced image. For good enhancement, the entropy of the enhanced image should be close to that of the original image. This small difference between entropies of the original and the enhanced images indicates that the image details are preserved. It also shows that the histogram shape is maintained; thus, the saturation case can be avoided. Table 2 shows the entropy of equalized images compared with original images for each image shown in Figures 8 and 9. The result shows that the equalized fingerprint images have smaller entropy while they are still close to the entropy of the original image. It means that our method can remove noise from the original image while retaining the structure of the fingerprint image.


**Table 2.** Entropy of equalized images compared with original images for each database.

Next, the equalized fingerprint image was used to determine the contour and detect the blur region of the fingerprint, as discussed in Sections 2.2 and 2.3. Figure 10 shows the binarized image obtained by applying Equation (8) to the equalized image. Based on the binary images, as shown in Figure 10b, we can detect the region of impression (ROI), and the contour of the fingerprint is acquired as a polygon, as shown in Figure 10c. Figure 11b presents the blur detection result obtained by 2D non-separable wavelet entropy filtering for low-quality images, as discussed in Section 2.3. In what follows, an ROI with a 30% blur region is considered to have bad quality, and its SP detection is not good enough.

Our experiments were conducted on the FVC2002 DB1\_A and FVC2002 DB2\_A databases. We compared the results of our proposed SP detection with results obtained using other methods, including a rule-based algorithm [5], Zhou's algorithm [11], Tico's algorithm [37], Ramo's algorithm [38], and Chikkerur and Ratha's algorithm [39]. In these methods, the singular points were evaluated using the Euclidean distance. Although no standard criterion exists to define a correct detection, we focused in this research on detecting singular points precisely and followed the convention of allowing a 10-pixel deviation between the expected and the detected singular points to validate the performance of the proposed method. In addition, singular point detection based on the Poincaré index method is sensitive to low-quality fingerprints. In this paper, we show that by combining a novel adaptive image enhancement, compact boundary segmentation, and NSDWT for localization, the detection of singular points becomes more robust. Moreover, a novel clustering algorithm that integrates wavelet frame entropy with region growing is introduced to evaluate the fingerprint image quality and validate the detected singular points. Tables 3 and 4 show the correctly detected rate, detection rate, miss rate, and false alarm rate. The results in the tables indicate that our method not only has a higher correctly detected rate than the other methods but also has a low false alarm rate. Figure 12 presents the truly detected SPs on the FVC2002 database; the detected core points and delta points are close to the ground-truth SPs. Figure 13 presents some comparison results of SP detection for the FVC2002 database using our proposed method and the Poincaré index method. In this figure, blue and green crosses indicate the core and delta points, respectively, detected by our proposed method, and the red cross indicates the core point detected by the Poincaré index method. The results show that the locations of the SPs detected using our method are more accurate than those of the SPs detected using the Poincaré index method.

**Figure 10.** Binary images by using energy transformation for the FVC 2002 DB1 and DB2 databases. (**a**) Equalized images of five fingerprint images in the FVC 2002 database; (**b**) binary images of Figure 10a; (**c**) segmented images of Figure 10a.

**Table 3.** Comparison results of various detection algorithms for the FVC2002 DB1-A fingerprint database.


**Figure 11.** Blur detection result obtained by 2D non-separable wavelet entropy filtering for low-quality images: (**a**) original images and (**b**) blur detection results.

**Table 4.** Comparison results of different detection algorithms for the FVC2002 DB2-A fingerprint database.


**Figure 12.** Truly detected SPs for the FVC2002 database (blue: core point; green: delta point) by our proposed method: (**a**) FVC2002 DB1 and (**b**) FVC2002 DB2 databases.

**Figure 13.** Some comparison results of SP detection for the FVC2002 database. The blue and green crosses indicate the core and delta points, respectively, detected by our proposed method, and the red cross indicates the core point detected by the Poincaré index method.

#### **5. Conclusions**

Because the conventional Poincaré index along the boundary of a given region equals the sum of the Poincaré indices of the core points within this region, it contains no information about the characteristics and cannot describe the core point completely. To solve this problem, we proposed an adaptive method to detect SPs in a fingerprint image. First, a novel fingerprint enhancement algorithm was proposed to considerably eliminate the background, thereby improving the clarity and continuity of ridge structures. Second, we demonstrated that the proposed algorithm could effectively detect low-quality regions with a high correct rate. Third, based on the threshold value, the proposed algorithm inspected and made a True/False decision about whether a detected SP was accepted. Experimental results demonstrate that the proposed algorithm effectively detects SPs and the results are better than those obtained by rule-based [5], Zhou [11], Tico [37], Ramo [38], and Chikkerur [39].

**Author Contributions:** N.T.L. developed the fingerprint hardware and software coding, and wrote the original draft. J.-W.W. guided the research direction and edited the paper. D.H.L. designed the experiments. C.-C.W. contributed to editing the paper. All authors discussed the results and contributed to the final manuscript.

**Funding:** This research was funded in part by MOST 107-2218-E-992-310 and 108-2221-E-992-076 from the Ministry of Science and Technology, Taiwan.

**Acknowledgments:** The authors appreciate the support from National Kaohsiung University of Science and Technology in Taiwan.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **A Unified Framework for Head Pose, Age and Gender Classification through End-to-End Face Segmentation**

**Khalil Khan <sup>1,\*</sup>, Muhammad Attique <sup>2,\*</sup>, Ikram Syed <sup>3</sup>, Ghulam Sarwar <sup>3</sup>, Muhammad Abeer Irfan <sup>4</sup> and Rehan Ullah Khan <sup>5</sup>**


Received: 2 June 2019; Accepted: 24 June 2019; Published: 30 June 2019

**Abstract:** Accurate face segmentation strongly benefits human face image analysis. In this paper, we propose a unified framework for face image analysis through end-to-end semantic face segmentation. The proposed framework contains a set of stacked components for face understanding, which includes head pose estimation, age classification, and gender recognition. A manually labeled face data-set is used for training the Conditional Random Fields (CRFs) based segmentation model. A multi-class face segmentation framework developed through CRFs segments a facial image into six parts. A probabilistic classification strategy is used, and probability maps are generated for each class. The probability maps are used as feature descriptors, and a Random Decision Forest (RDF) classifier is modeled for each task (head pose, age, and gender). We assess the performance of the proposed framework on several data-sets and report better results than those previously reported.

**Keywords:** face analysis; face segmentation; head pose estimation; age classification; gender classification

#### **1. Introduction**

Human face image analysis is a fundamental and challenging task in computer vision. It plays a key role in various real-world applications such as surveillance, animation, and human-computer interaction. However, it remains challenging due to changes in facial appearance, visual angle, complicated facial expressions, and the background. The task becomes considerably more complicated in un-constrained conditions.

Each of these face analysis tasks (head pose, age, and gender recognition) is usually approached as an individual research problem through various sets of techniques [1–8]. We argue that all these tasks are *very closely* related and can essentially help each other if an efficiently pre-segmented face image is given as input. The psychology literature also confirms that face parts such as the nose, hair, and mouth help the human visual system in face identity recognition [9,10]. Therefore, the performance of all related applications can be improved if a well-segmented face image is provided as input to the framework.

Facial attribute information such as head pose, age, and gender is already being predicted using facial landmark information [4,11]. However, the performance of head pose estimation and other applications in such cases depends heavily on the accurate localization of these landmarks [5,7,12]. Locating these face landmarks is itself a *big challenge*. Landmark localization is greatly affected by occlusion, face rotation, and very low image quality. Similarly, in far-field imagery conditions, landmark extraction is not only difficult but sometimes impossible. Lighting conditions and complicated facial expressions also make localization challenging. Because of all the problems mentioned above, we approach the face analysis task in a completely different way.

In this paper, we introduce a unified framework that addresses all three face analysis tasks (head pose, age, and gender recognition) through a prior multi-class face segmentation model developed through CRFs. We name the newly proposed multitask framework HAG-MSF-CRFs. It treats the three tasks as a joint probability estimation problem, which is tackled using a powerful random forest algorithm. Specifically, the proposed framework can be formulated as:

$$p(h, a, g) = \underset{h, a, g}{\arg\max}\ p(h, a, g \mid \mathbf{I}, \mathbf{B}) \tag{1}$$

where head pose, age, and gender recognition are represented by *h*, *a* and *g* respectively. Similarly, in Equation (1), *I* is the input face image and *B* is the bounding box which is provided by the face detector.
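The arg max in Equation (1) can be illustrated with a toy example. The sketch below assumes, purely for illustration, that the joint posterior over discretized head pose, age group, and gender classes is available as a 3D array; the array layout and the function name `joint_argmax` are hypothetical and do not describe how the framework stores its probabilities.

```python
import numpy as np


def joint_argmax(joint):
    """Return the (h, a, g) class indices that maximize a joint probability table."""
    joint = np.asarray(joint, dtype=np.float64)
    h, a, g = np.unravel_index(np.argmax(joint), joint.shape)
    return int(h), int(a), int(g)
```

For a table with three pose classes, four age groups, and two genders, the function returns the index triple of the largest entry, i.e., the jointly most probable labels for the given image and bounding box.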

In our previous work, we already tackled the problem of multi-class semantic face segmentation (MSF) [13] and its application to head pose estimation [14,15] (MSF-HPE) and gender classification [16]. In most previous works, face segmentation is treated as a three-class or sometimes four-class task. In the MSF, face segmentation is extended to six classes (eyes, nose, mouth, skin, back and hair). However, we faced some major problems with the previously proposed MSF. Firstly, the computational cost of MSF is quite high, as MSF assigns a class label to each and every pixel in an image, which takes a long time. A super-pixel based model is used instead, which reduces the processing cost. Secondly, the MSF does not consider any conditional hierarchy between different face parts. For example, it is not possible for the eye region to be near to the mouth region and vice versa. A CRFs based model is introduced in this paper, which couples all labels in a face image in a scaled hierarchy. Going from MSF to the newly proposed MSF-CRFs improves the performance of the segmentation part.

Our proposed multi-task framework is comparable to another approach known as the influence model (IM). This model was first introduced by researchers in the MIT Media Laboratory [17,18]. The IM estimates how the state of one actor affects another in the system. Our proposed model is somewhat similar to the model proposed in [17,18]. In such cases, an outcome in one entity in a system causes an outcome in another entity in the same system. In simple words, if one domino is flipped, the next domino will fall automatically and vice versa. In the IM, it is necessary to know how certain dominoes interact with each other and how one is influenced by another. If the initial state of the dominoes is known together with their locations relative to one another, then the outcome of the system is predicted with more accuracy. When the system network structure is already known, the IM enables researchers to infer interaction; however, information about signals from different observations is needed.

To summarize, contributions of the paper are three fold:


The structure of the remaining paper is as follows: Section 2 describes related works for all three cases, i.e., head pose, age, and gender recognition. Several data-sets are used to evaluate the framework; details about these databases are given in Section 3. The segmentation model MSF-CRFs is presented in Section 4, whereas the proposed algorithm for face analysis (HAG-MSF-CRFs) is discussed in Section 5. All obtained results are discussed and compared with the state of the art (SOA) in Section 6. The paper is summarized with some future directions in Section 7.

#### **2. Related Work**

Our newly proposed model is closely related to IM-based systems. The IM framework has already been used for the automatic recognition of social and task-oriented functional roles in group meetings [17,18]. Through the IM, the classification of social functional roles has been improved compared to Hidden Markov Models (HMM) and support vector machines (SVM) [18]. The two versions proposed in [18] outperform both HMM and SVM based results in the social functional role problems. The IM methods showed excellent performance, particularly in less populated classes. Media segmentation is performed with the IM, particularly in cases having rich information [19–21]. Keyword information is exploited in [22] to identify journalists, anchors, and guest speakers, if any, in a radio program. A maximum entropy algorithm is used for classification. IM-based algorithms have been applied to many audio and visual recognition tasks; for details, see [23–28].

Before describing the proposed framework, we briefly review related methods for head pose, age, and gender classification. A rich literature and history is already present about all these three topics. However, in this section of the paper we provide a cursory overview of how these tasks were previously approached by researchers.

#### *2.1. Head Pose Estimation*

The pose of a face image can be classified into three broad categories: yaw, pitch, and roll. The yaw angle represents the horizontal orientation and the pitch angle the vertical orientation of a face image. The roll angle represents the rotation in the image plane. We evaluated our proposed algorithm for head pose estimation on four data-sets, which included Pointing'04 [29], Annotated Facial Landmarks in the Wild (AFLW) [30], Boston University (BU) [31], and ICT-3DHPE [32] data-sets.

Two types of information were previously used to approach head pose estimation, i.e., facial landmarks and face image appearance. In the former case, a POSIT algorithm [9] is used to find correspondences between points in 2D shapes and points in 3D models. In the latter case, various image appearance features such as SIFT, LBP, and HOG are exploited for head pose estimation. Discriminative learning models such as Random Forest and Support Vector Machine (SVM) are trained and tested using the extracted features [4,10]. A more detailed survey on head pose estimation can be found in [5].

#### *2.2. Age Classification*

Age classification is a well-researched topic in the computer vision community. Previously, age estimation was studied as a classification or regression problem. In the first case, age is associated with a specific range or age group. In the second case, the exact age of a face image is estimated. A recent survey on age estimation was reported in [33]. All data-sets used for age estimation were discussed, and a detailed overview of the algorithms proposed thus far was presented. A detailed investigation of age classification between specific ranges or age groups was presented in [34]. Similarly, another algorithm was introduced to classify age from facial images in [35]. Initially, the appearance of face wrinkles is detected, and then age categorization is performed based on the extracted wrinkles. The idea of [35] was further extended in [36] by first localizing the facial features. The modeling of craniofacial growth was performed through psychophysical and anthropometric evidence in [36]. The main drawback of this approach is that accurate localization of facial features is always needed.

A subspace method called AGing pattErn Subspace (AGES) is introduced in [37,38]. In these algorithms, aging features were extracted from face images and an adjusted robust regressor was trained to categorize face ages. These methods showed excellent performance compared to SOA methods. However, they have two serious weaknesses: the input images must be frontal, and the face images must be well-aligned. The approaches proposed in these algorithms are suited to databases collected in indoor environmental conditions. Practical application of these methods in un-constrained conditions is almost impossible.

A cost-sensitive hyper-plane ranking method is introduced in [39]. The algorithm proposed in [39] is a multi-stage learning method, also known in the literature as the 'grouping estimation fusion' (GEF) method. Similarly, a novel feature selection method was proposed in [40]. In a nutshell, all the previously mentioned methods performed well in indoor lab conditions but failed when exposed to real-world conditions.

Recently introduced deep Convolutional Neural Networks (CNNs) have shown excellent performance on different visual recognition problems. A hybrid system for age and gender classification is proposed in [41]. CNNs are used to extract features from the face images, whereas an extreme learning machine (ELM) is used as the classifier. The authors named their method CNNs-ELM. The system is evaluated on two data-sets, MORPH-II [42] and Adience [43]. To the best of our knowledge, this is the best-performing algorithm on the joint problem of gender and age recognition thus far. A weakness reported by the authors is that misclassification occurs when the system is exposed to younger faces.

#### *2.3. Gender Classification*

A detailed investigation of gender recognition was conducted by Makinen and Raisamo [44]. Early researchers working on gender recognition used neural networks [45]. An SVM classifier was used by Moghaddam and Yang [46]. Similarly, an AdaBoost classifier was adopted by Baluja and Rowley [47]. In all these methods, the image was treated as a one-dimensional feature vector from which certain features were extracted. A joint framework for age and gender recognition was proposed by Toews and Arbel [48]. The model proposed by the authors is a viewpoint-invariant appearance model which is robust to local scale rotations.

Gender classification based on human gait and linear discriminant algorithms was analyzed by Yu et al. [49]. A new benchmark to study age and gender classification was suggested in [43], whose authors also present a classification pipeline built on the available data. Khan et al. [50] proposed a semantic pyramid dealing with both gender and action recognition; annotation of the face and upper body was not needed in the proposed method. In the method proposed in [51], the first name was used as a feature, and the relationship between the name and the face images was modeled in the next stage. Higher accuracy was reported with this method as compared to SOA. Recently, a generic algorithm to estimate gender, race, and age in a single framework was proposed in [52].

All the above-mentioned approaches contributed considerable progress towards gender recognition. However, most of these methods either were not fully automated or only worked well in very constrained imaging environments.

#### **3. Databases**

In this paper we use six different face databases to perform the three tasks i.e., head pose, age and gender classification. For head pose estimation we use Pointing'04, AFLW, BU, and ICT-3DHPE data-sets. For age classification we use Adience and FERET [53] data-sets. For gender recognition we perform tests with Adience database only.

#### *3.1. Head Pose Estimation*

• **Pointing'04 database:** The Pointing'04 database is a manually annotated face database. Even though it is a comparatively old head pose data-set, it is still used for research purposes [54–56] due to its challenging nature and its large variety of consecutive poses. All the images in the Pointing'04 database are low-resolution images captured in low lighting conditions. Pointing'04 contains 15 sets of face images. Each set is further divided into 2 sets of 93 images of the candidate at various orientations. The subjects in the database are between 20 and 40 years old. To add more complexity to the database images, five subjects have facial hair and seven wear glasses. The pan and tilt angles determine the head pose of a subject. During acquisition, each subject was asked to look at 93 markers on the wall, each marker representing a specific pose. The face localization provided with Pointing'04 may not be accurate due to manual labeling. A sample of the images of a single candidate in the 93 poses is shown in Figure 1. For yaw, the head orientation varies from −90° to +90° with a step size of 15° between two adjacent poses. For pitch, positive values correspond to the top poses and negative values to the bottom poses. The difference between two consecutive poses in pitch is 30°.


**Figure 1.** Pointing'04 database images of a single subject in all 93 poses.


included variations such as pose, lighting, appearance, noise, and more, meaning the data-set covers all the conditions of an un-constrained image database. The total number of images in Adience is 26,580, whereas the total number of participants is 2284. The exact age of each candidate is not specified; instead, each subject is assigned to one of 8 age groups, i.e., [0,2], [4,6], [8,13], [15,20], [25,32], [38,43], [48,53], [60,+]. The data-set can be obtained from the Open University of Israel (computer vision lab).


#### **4. Proposed MSF-CRFs**

The overview of the MSF-CRFs model for semantic face segmentation is shown in Figure 2. The labeling problem is modeled efficiently with the proposed MSF-CRFs, which combines the output of the trained classifier with image location information. This modeling helps in maximizing the a posteriori probability of the labeling. The unary potential models the likelihood of each pixel belonging to each class, and the pairwise potential models the relationship between two pixels.

**Figure 2.** The MSF-CRFs graphical model. Each grid cell of the input face image represents a random variable. The unary potentials are represented by the white circles and the pairwise potentials by solid white lines.

As the face is not localized in most of the images, a face localization algorithm is applied first. Many good face detection methods are available in the literature; we use a CNN-based face detector [57]. After localizing the face, all face images are re-scaled to a fixed height of 256 pixels, and the width is adjusted accordingly to keep the original aspect ratio.

The proposed MSF-CRFs model encodes the segmentation probability using features of an image. Initially, an image is segmented into super-pixels. The segmentation is represented by *S* = {*s*<sub>1</sub>, *s*<sub>2</sub>, ..., *s<sub>n</sub>*}, where *n* is the total number of super-pixels in the input image. Each *s<sub>i</sub>* can take the value of any of the six classes (nose, eyes, mouth, hair, background, and skin). For super-pixel segmentation we use the SEEDS [58] algorithm.


We also need some conventions for node and edge features. We represent the node features by *Z<sup>m</sup>* and the edge features by *Z<sup>e</sup>*. We develop a log-linear CRFs model, which can be written as:

$$\psi(s_i = q, z_i^{m}) = \sum_{f=1}^{F_{m}} (X_q^{m})_f \, (z_i^{m})_f \tag{2}$$

$$\psi(s_i = q_1, s_j = q_2, z_{i,j}^{e}) = \sum_{f=1}^{F_{e}} (X_{q_1, q_2}^{e})_f \, (z_{i,j}^{e})_f \tag{3}$$

In Equations (2) and (3), *F<sub>m</sub>* is the number of super-pixel (node) features and *z<sub>i</sub><sup>m</sup>* is a vector of length *F<sub>m</sub>*. The number of features for neighboring super-pixels (edges) is *F<sub>e</sub>*, and the resulting edge feature vector is *z<sub>i,j</sub><sup>e</sup>*. The node and edge weights are *X<sup>m</sup>* and *X<sup>e</sup>*, respectively. A pair of classification labels in the above equations is represented by *q*<sub>1</sub>, *q*<sub>2</sub>. In the proposed MSF-CRFs model we use a symmetric edge potential.

The probability of segmentation conditional on *Z* can be represented as:

$$P(s \mid z) = \frac{\exp\left(-\sum_{i=1}^{n} \psi(s_i, z_i^{m}) - \sum_{i,j} \psi(s_i, s_j, z_{i,j}^{e})\right)}{N(Z)} \tag{4}$$

*N*(*Z*) in Equation (4) is the partition function, which acts as a normalization factor for the distribution. We use the Bethe approximation [55] for the partition function in the MSF-CRFs model. Similarly, for marginal approximation we use the loopy belief propagation algorithm. For CRFs optimization we use the L-BFGS algorithm [59]. For weight regularization we also add a Gaussian prior to the model.
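As a hedged illustration of Equations (2)–(4), the following minimal sketch evaluates the conditional probability by brute-force enumeration on a toy graph. The two-label set, the feature values, the weights, and all function names are hypothetical. Exact enumeration is only feasible for tiny graphs; it stands in here for the Bethe approximation and loopy belief propagation used in the paper.

```python
import itertools
import math

def node_potential(weights, features):
    """Equation (2): dot product of label-specific weights and node features."""
    return sum(w * z for w, z in zip(weights, features))

def edge_potential(weights, features):
    """Equation (3): dot product of label-pair weights and edge features."""
    return sum(w * z for w, z in zip(weights, features))

def energy(labels, node_feats, edge_feats, x_node, x_edge):
    """Sum of all node and edge potentials for one labeling."""
    total = 0.0
    for i, q in enumerate(labels):
        total += node_potential(x_node[q], node_feats[i])
    for (i, j), z_e in edge_feats.items():
        # Sorting the label pair makes the edge potential symmetric.
        pair = tuple(sorted((labels[i], labels[j])))
        total += edge_potential(x_edge[pair], z_e)
    return total

def crf_probabilities(node_feats, edge_feats, x_node, x_edge, num_labels):
    """Equation (4) by exhaustive enumeration; the partition function is
    the sum of exp(-energy) over every possible labeling."""
    labelings = list(itertools.product(range(num_labels), repeat=len(node_feats)))
    scores = [math.exp(-energy(s, node_feats, edge_feats, x_node, x_edge))
              for s in labelings]
    partition = sum(scores)
    return {s: v / partition for s, v in zip(labelings, scores)}

# Toy example (all numbers are made up): 2 super-pixels, 2 labels.
node_feats = [[1.0, 0.0], [0.0, 1.0]]
edge_feats = {(0, 1): [1.0]}
x_node = {0: [-1.0, 0.5], 1: [0.5, -1.0]}
x_edge = {(0, 0): [-0.2], (0, 1): [0.3], (1, 1): [-0.2]}
probs = crf_probabilities(node_feats, edge_feats, x_node, x_edge, num_labels=2)
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

With the sign convention of Equation (4), labelings with lower total potential receive higher probability, so the most probable labeling in this toy case is the one with the smallest energy.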

To assess the accuracy of the segmentation estimates, we apply an L1 error to each segmentation estimate. We also penalize each super-pixel as per the difference between the correct label prediction probability and a value 1.0. For example, if a super-pixel has a probability value of 0.7 for being skin (and skin is also the ground truth label of the super-pixel), a penalty value of 0.3 will be incurred as a result.
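The per-super-pixel penalty described above can be sketched as follows. This is a minimal illustration with hypothetical function names and made-up probabilities; averaging the penalties over an image is an assumption, since the text does not state how they are aggregated.

```python
def superpixel_penalty(class_probs, true_label):
    """Penalty = 1.0 minus the probability assigned to the ground-truth label."""
    return 1.0 - class_probs[true_label]

def mean_penalty(predictions, truths):
    """Average penalty over all super-pixels (aggregation by mean is assumed)."""
    penalties = [superpixel_penalty(p, t) for p, t in zip(predictions, truths)]
    return sum(penalties) / len(penalties)

# Example from the text: a skin super-pixel predicted as skin with probability 0.7.
example = {"skin": 0.7, "hair": 0.2, "background": 0.1}
print(round(superpixel_penalty(example, "skin"), 2))  # 0.3
```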

We compute three types of node features: position, HSV color, and shape-related information (HOG).

For spatial information an 8 × 8 grid is considered, and then the relative location of the central pixel is extracted. This location is defined as:

$$f_{\mathrm{loc}} = [x/W,\; y/H] \in \mathbb{R}^2 \tag{5}$$

where *W* represents the width and *H* the height of the input face image.
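Equation (5) amounts to normalizing the pixel coordinates by the image size. A minimal sketch, with a hypothetical function name and made-up coordinates:

```python
def location_feature(x, y, width, height):
    """Equation (5): pixel position normalized by the image width and height."""
    return (x / width, y / height)

# Hypothetical central pixel in a face image 240 pixels wide and 256 pixels high.
f_loc = location_feature(120, 64, width=240, height=256)
print(f_loc)  # (0.5, 0.25)
```

Normalizing in this way keeps the feature in the range [0, 1] regardless of the image width, which varies after re-scaling to a fixed height.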

For color features, the information from an HSV histogram is extracted. The three values (hue, saturation, and value) are encoded in a single vector constituting a unique feature vector for color information. The dimension of each patch for HSV is kept as *D<sub>HSV</sub>* = 16 × 16, whereas the number of bins is set to 32. The resulting feature vector for the color information with these values is *F*<sup>16</sup><sub>*HSV*</sub> ∈ ℝ<sup>48</sup>.

For shape information we use HOG. We keep the dimension of the patch for HOG as *D<sub>HOG</sub>* = 64 × 64, which results in a feature vector *F*<sup>64×64</sup><sub>*HOG*</sub> ∈ ℝ<sup>1764</sup>.
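The HOG dimension of 1764 can be checked with a short calculation. The sketch below assumes a common HOG configuration (8 × 8 pixel cells, 2 × 2 cell blocks, 9 orientation bins, and a block stride of one cell); these parameters are not stated in the text and are only an assumption that happens to reproduce the reported length.

```python
def hog_length(patch_size, cell_size=8, block_size=2, num_bins=9, block_stride=1):
    """Length of a HOG descriptor for a square patch.

    patch_size and cell_size are in pixels; block_size and block_stride
    are in cells. All default values are assumed, not taken from the paper.
    """
    cells_per_side = patch_size // cell_size
    blocks_per_side = (cells_per_side - block_size) // block_stride + 1
    values_per_block = block_size * block_size * num_bins
    return blocks_per_side * blocks_per_side * values_per_block

print(hog_length(64))  # 1764
```

Under these assumptions a 64 × 64 patch has 8 cells per side and therefore 7 × 7 overlapping blocks, each contributing 36 values.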

All the three features are concatenated with each other to form a single vector.

#### **5. Proposed HAG-MSF-CRFs**

Our proposed algorithm is summarized in Algorithm 1. Initially, a segmentation model is developed through the CRFs. For face segmentation, the trained MSF-CRFs model outputs the most likely class for each super-pixel. The same label is then assigned to each pixel within the super-pixel. For the classification of head pose, age, and gender we use the probability maps of each class created during segmentation. The probability maps generated for the six classes are denoted *P<sub>nose</sub>*, *P<sub>back</sub>*, *P<sub>eyes</sub>*, *P<sub>skin</sub>*, *P<sub>mouth</sub>*, and *P<sub>hair</sub>*. Figure 3 shows some images from the Pointing'04 data-set and their probability maps. In the gray-scale images in Figure 3, higher intensity represents a higher predicted probability for a particular class and *vice versa*. For each task (head pose, age, and gender) we train an RDF classifier with a feature vector built from the corresponding probability maps, which thus serve as the feature descriptor.

#### **Algorithm 1** proposed HAG-MSF-CRFs algorithm

**Input:** *M<sub>train</sub>* = {(*I<sub>n</sub>*, *T<sub>n</sub>*)}<sub>*n*=1</sub><sup>*m*</sup>, *M<sub>test</sub>*.

where *M<sub>train</sub>* is the data used for training model A, *M<sub>test</sub>* is the testing data, *I* is the input training image and *T*(*i*,*j*) ∈ {1,2,3,4,5,6} is the ground truth data.

#### **a: Face segmentation part:**

Step a.1: Training a segmentation model A through training data (training images and labels)

Step a.2: Finding the center of each super-pixel, extracting patches and passing to the model A

Step a.3: Using the probabilistic classification method and creating probability maps for each class, represented as:

*pskin*, *pmouth*, *peyes*, *pnose*, *phair*, and *pback*

#### **b. Head pose, age and gender classification part:**

**if** head pose estimation:

*f* = *pskin* + *pmouth* + *peyes* + *pnose* + *phair*

**Else if** age classification:

*f* = *pskin* + *pmouth* + *peyes* + *pnose* + *phair*

**Else if** gender recognition:

*f* = *pskin* + *peyes* + *pnose* + *phair*

where f is the feature vector.

c. Training an RDF classifier for each case (head pose, age and gender)

**Output:** estimated pose, age class and gender.
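Part b of Algorithm 1 can be sketched as follows, reading the "+" in the algorithm as concatenation of the flattened probability maps, as described in the text. The dictionary layout, the function name, and the toy 2 × 2 maps are hypothetical; the resulting vector would then be passed to an RDF classifier, which is not shown here.

```python
# Face parts used per task, following part b of Algorithm 1.
TASK_PARTS = {
    "head_pose": ["skin", "mouth", "eyes", "nose", "hair"],
    "age": ["skin", "mouth", "eyes", "nose", "hair"],
    "gender": ["skin", "eyes", "nose", "hair"],
}

def build_feature_vector(prob_maps, task):
    """Concatenate the flattened probability maps of the parts used for a task."""
    feature = []
    for part in TASK_PARTS[task]:
        for row in prob_maps[part]:
            feature.extend(row)
    return feature

# Toy 2 x 2 probability maps (made-up values) for the six classes.
toy_maps = {name: [[0.1, 0.2], [0.3, 0.4]]
            for name in ["skin", "mouth", "eyes", "nose", "hair", "back"]}
print(len(build_feature_vector(toy_maps, "head_pose")))  # 20
print(len(build_feature_vector(toy_maps, "gender")))     # 16
```

The background map is never included, and the gender task additionally omits the mouth map, so its feature vector is shorter.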

#### *5.1. Head Pose Estimation*

We manually labeled 10 images from each pose of each data-set. The manually labeled images are used to build an MSF-CRFs model as discussed previously. Probability maps are then generated for all images of every data-set: when a test image is given as input, the MSF-CRFs model creates a probability map for each class.

To understand which facial parts help in head pose estimation, we conducted a large number of experiments. We use probability maps for the eyes, nose, mouth, skin, and hair. The probability maps are concatenated into a feature descriptor that is used to train and test an RDF classifier. We use 10-fold cross validation in our work. The 10 images that were previously used to create the MSF-CRFs model are not included in the 10-fold cross validation experiments. The probability maps of a single subject from the Pointing'04 data-set are shown in Figure 3. From Figure 3, it is clear that the maps vary as the pose changes from one position to another. Taking the skin class (third row) as an example, the forehead is more exposed to the camera in frontal images; as a result, the bright region of the probability map is concentrated towards the center. Similarly, in extreme left and right profile images, high intensity values occupy a smaller area. We encoded this information for all classes in the form of feature descriptors and developed a new head pose estimation algorithm.
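The evaluation protocol described above, 10-fold cross validation with the manually labeled images held out, might be organized as in the following sketch. The function name, the integer image identifiers, and the random shuffling are assumptions for illustration only.

```python
import random

def ten_fold_splits(image_ids, labeled_ids, num_folds=10, seed=0):
    """Yield (train, test) id lists, skipping images used to build the CRF model."""
    excluded = set(labeled_ids)
    remaining = [i for i in image_ids if i not in excluded]
    random.Random(seed).shuffle(remaining)
    # Deal the remaining images into num_folds roughly equal folds.
    folds = [remaining[k::num_folds] for k in range(num_folds)]
    for k in range(num_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

# Hypothetical data-set of 100 images, the first 10 of which were manually labeled.
splits = list(ten_fold_splits(range(100), labeled_ids=range(10)))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 81 9
```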

**Figure 3.** Probability maps of a single subject from Pointing'04. Poses vary from −90° to +90° with a step of 15° in the horizontal orientation. The rows show, in order: (1) original images, (2) ground truth images, (3) probability maps for skin, (4) probability maps for hair, (5) probability maps for mouth, (6) probability maps for nose, and (7) probability maps for eyes.

#### *5.2. Age Classification*

In age classification, a face image is assigned to one of several specific age ranges. From each age group of each data-set, 10 images are manually labeled. The manually labeled images are used to build an MSF-CRFs model. The test face images are passed to the MSF-CRFs model to produce segmentation results and probability maps.

We noted during the experiments that each face part contributes to age classification. Probability maps for each face part differ from one age group to another. Therefore, for age classification we use information about all five face classes, i.e., skin, mouth, nose, hair, and eyes. The probability maps generated are used to train and test an RDF classifier. As in the case of head pose, 10-fold cross validation experiments are performed here as well. Manually labeled images which were previously used to create the MSF-CRFs model are not included in the 10-fold cross validation experiments.

#### *5.3. Gender Recognition*

For gender classification, we manually label 30 images for each gender in each data-set. These 60 images in total are used to build an MSF-CRFs model for the gender test. A number of qualitative and quantitative experiments are conducted to determine which face parts help in gender recognition. After these experiments we train an RDF classifier on the probability maps of four classes, namely nose, hair, eyes, and skin.

We perform a detailed study from computer vision and human anatomy literature to know which face parts make a face more feminine or masculine. In the following paragraphs we summarize why we use four classes (skin, nose, hair, and eyes) for gender recognition.


pixel labeling accuracy noted was 79%, resulting in better segmentation with a brighter probability map. For females, the labeling accuracy dropped to 69%, which results in a comparatively dimmer probability map.


**Figure 4.** Face segmentation results with MSF-CRFs for frontal images on Pointing'04. The rows show, in order: (1) original images, (2) manually labeled images, (3) segmentation results produced by MSF-CRFs.

**Figure 5.** Face segmentation results with MSF-CRFs for profile images (+60°) on Pointing'04. The rows show, in order: (1) original images, (2) manually labeled images, (3) segmentation results produced by MSF-CRFs.

Thus, probability maps for skin, nose, hair, and eyes are concatenated to form a single feature vector. We perform 10-fold cross validation experiments here as well, excluding from each database the 60 images that were previously used for the training part.

#### **6. Results and Discussion**

#### *6.1. Face Segmentation Results*

To the best of our knowledge, the previously proposed MSF is the first work that considered all six face classes in face segmentation. The main problem with MSF is its computational cost. To remove this deficiency, we use super-pixel based segmentation in the current model (MSF-CRFs). Segmentation with MSF-CRFs is four times faster than with MSF. For example, an image of size 256 × 240 pixels took 1.2 min to segment with the MSF model, whereas the same image was segmented with MSF-CRFs in just 18 s.

An image is initially segmented into super-pixels. Super-pixel segmentation reduces the processing time because the number of elements to be labeled is reduced immensely. In the proposed method we use the SEEDS [58] algorithm for super-pixel segmentation. We prefer SEEDS over SLIC and other methods because it is much faster than the other methods used in SOA [58]. Moreover, SEEDS produces better super-pixel segmentation according to standard error metrics.

Face segmentation results for frontal images are much better than for profile images. We performed experiments with different super-pixel parameter settings and noticed better segmentation results with 900 super-pixels. The exact number of super-pixels was less than 900 due to certain segmentation restrictions: the number of super-pixels obtained depends on the number of block levels used and on the image size. The super-pixel segmentation was better when the number of block levels was higher. We used 3 block levels and 5 histogram bins. For better accuracy, the number of iterations was doubled.

A few images from the Pointing'04 data-set are shown in Figures 4 and 5. In both figures, the first row shows the original images, row 2 the manually labeled images, and row 3 the images segmented with MSF-CRFs. Figure 4 shows some good segmentation results for frontal images, whereas the same images rotated by +60° are shown in Figure 5. From these figures it is clear that the pixel labeling accuracy for frontal images is much better than for profile images. As the pose moves to the left or right, the labeling accuracy drops, particularly for the smaller classes (eyes, nose, and mouth). For extreme profile poses (+90° and −90°), these smaller classes were completely missing in some images.

The performance of the segmentation part also depends strongly on the quality of the images. For example, the AFLW data-set was collected from the internet and includes very low quality images. Therefore, poor segmentation results were observed, which ultimately led to poor head pose and gender recognition performance.

#### *6.2. Head Pose Estimation*

We used two evaluation measures for head pose estimation. The first is a regression measure, the mean absolute error (MAE), i.e., the average absolute difference between the estimated and ground truth pose angles. The second is a classification measure, the pose estimation accuracy (PEA), which measures how often a particular pose is predicted correctly by a model.
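The two measures can be sketched as follows. The function names and the example angles are made up, and treating PEA as the fraction of exact matches on the discrete pose grid is an assumption about how the measure is computed.

```python
def mean_absolute_error(predicted, truth):
    """Mean absolute difference between predicted and ground-truth angles (degrees)."""
    return sum(abs(p - t) for p, t in zip(predicted, truth)) / len(truth)

def pose_estimation_accuracy(predicted, truth):
    """Fraction of images whose discrete pose is predicted exactly (assumed definition)."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)

# Made-up yaw angles in degrees on a 15-degree grid.
truth = [-30, -15, 0, 15, 30]
predicted = [-30, 0, 0, 15, 15]
print(mean_absolute_error(predicted, truth))       # 6.0
print(pose_estimation_accuracy(predicted, truth))  # 0.6
```

The two measures are complementary: a prediction that misses by one grid step counts as a full error for PEA but adds only 15° to the MAE sum.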

**Pointing'04 data-set:** The results obtained with HAG-MSF-CRFs on the Pointing'04 data-set and their comparison with SOA for both yaw and pitch angles are shown in Table 1. From Table 1, it is clear that we achieved better results than previously reported for both MAE and PEA. All possible combinations of the six face classes were tried in the experiments. The best results for yaw (average MAE = 2.32° and average PEA = 87%) and pitch (average MAE = 1.18° and average PEA = 95%) were obtained with five classes, i.e., 'nose', 'mouth', 'skin', 'hair', and 'eyes'. It must be noted that some of the previous methods listed in Table 1 may have used a different experimental setup. For example, 5-fold cross validation experiments were performed for MLD, whereas we performed our experiments with a 10-fold cross validation protocol. The corresponding papers can be consulted for the experimental setup and further details of each method.


**Table 1.** Head pose estimation results and its comparison with SOA on Pointing'04 database.

For a clearer comparison with SOA methods, we also report the MAE and PEA results for each pose. The MAE results are compared in Figures 6 and 7 for pitch and yaw angles, respectively. We obtained the best MAE for all yaw poses except 0° and +30°. Similarly, Figures 8 and 9 show the PEA results obtained with the proposed method and their comparison with SOA for each discrete pose. From Figure 8, we can see that better results are obtained than with SOA for pitch angles. However, the CNNs and KCovGA algorithms performed better at pose −30°.

**Figure 6.** MAE comparison with SOA on Pointing'04 (pitch)

**Figure 7.** MAE comparison with SOA on Pointing'04 (yaw)

**Figure 8.** PEA comparison with SOA on Pointing'04 (pitch)

**Figure 9.** PEA comparison with SOA on Pointing'04 (yaw)

For the remaining three data-sets (AFLW, BU, and ICT-3DHPE), only MAE values were previously reported in the literature. For a fair comparison, we therefore compare our results with SOA in terms of MAE only. The results for the three data-sets are summarized and compared with SOA in Tables 2–4, respectively. From the tables, it is clear that we obtained better results in two cases (BU and ICT-3DHPE) and competitive results for the AFLW database.

AFLW is a database collected from the internet. All the images in AFLW are real-world images obtained in un-constrained conditions. Importantly, the quality of the images is very poor in most cases. For this reason, our proposed MSF-CRFs model did not produce good segmentation results, and consequently our performance was poorer, as can be seen in Table 2.


**Table 2.** Head pose estimation results and its comparison with SOA on AFLW database.

The BU and ICT-3DHPE data-sets were also collected in real-world conditions. However, in these cases the quality of the images is much better. We obtained better results for both the BU and ICT-3DHPE data-sets, as can be seen in Tables 3 and 4.

**Table 3.** Head pose estimation results and its comparison with SOA on BU database.


**Table 4.** Head pose estimation results and its comparison with SOA on ICT-3DHPE database.


From the head pose estimation results, it is clear that we obtained better results in most cases, even compared with recently proposed CNN-based methods. Through this comparison we are not disparaging deep learning based methods; rather, we believe a better understanding of these methods and of their application to various tasks is needed.

#### *6.3. Age Classification*

We report our age and gender recognition results in terms of the classification rate (CR). We use the Adience data-set for age classification. The Adience data-set has eight age categories. We manually labeled 10 images from each age category, so a total of 80 images were used to build the MSF-CRFs model for the age test. The MSF-CRFs model was used to create segmented images and probability maps. After generating probability maps for all images and all classes, 10-fold cross validation experiments were performed on the remaining images (excluding the 80 images previously used to build the MSF-CRFs model).

For age classification we tried all combinations of facial features (excluding background), as in head pose estimation. We noticed that every face part contributed to age classification. The results obtained with HAG-MSF-CRFs and their comparison with SOA are shown in Table 5. From Table 5, it is clear that we obtained better results for the Adience data-set, outperforming previously reported results by a large margin.


**Table 5.** Comparative experiments on age classification using Adience database.

We created the ground truth masks with commercial image editing software, without any automatic segmentation tool. This kind of labeling has two main drawbacks. Firstly, the labeling depends highly on the subjective perception of the single person involved in the labeling process. Hence it is very difficult to provide an accurate label for all pixels in an image, particularly in the boundary regions between different face parts. For example, differentiating the nose region from the skin and drawing a boundary between the two is very difficult. Secondly, creating manually labeled images is very time-consuming and tedious work. For this reason, our age part is limited to age classification only. We did not perform tests on the regression part of the age task; for that, we would need a large number of manually labeled face images for each age value.

#### *6.4. Gender Recognition*

We performed gender recognition tests with three data-sets, which included Adience, LFW and FERET. The CR values for all three data-sets are shown in Table 6. We also compared our reported results with SOA methods in Table 6.

As in head pose estimation, all possible combinations of facial features were tried. We obtained the best results with skin, hair, eyes, and nose. After face localization, each image was re-scaled to a height of 256 pixels and the width was varied accordingly. We manually labeled 30 images of each gender in each data-set. A total of 60 images were used to train an MSF-CRFs (gender) model for each database individually. We performed no cross-database tests: images from one database were used to train an MSF-CRFs model, and other images from the same data-set were used to evaluate it.

A fair and exact comparison is very hard to achieve, as different authors use different image settings and different validation protocols. For the evaluation of gender recognition, we performed 10-fold cross validation experiments, excluding the 60 manually labeled images that were previously used to build the MSF-CRFs model for gender.

Gender classification results with the proposed HAG-MSF-CRFs and their comparison with SOA are reported in Table 6. In general, the classification accuracy was better than previously reported results. However, our results for the LFW data-set were again poorer than other reported results.

As a whole, performance of the newly proposed HAG-MSF-CRFs was very interesting. We introduced a new idea of face image analysis which is using pixel level labeling information for a face image. In a nutshell, we derived an important observation from the reported results *"a strong correlation exists between face parts segmentation and its pose, age and gender. An accurate face segmentation leads to exact head pose, age and gender recognition and vice versa."*


**Table 6.** Comparative experiments on gender recognition using Adience, LFW and FERET data-sets.

#### **7. Conclusions**

In this paper we propose an end-to-end semantic face segmentation algorithm (MSF-CRFs) which tries to solve the challenging problems of head pose, age, and gender recognition. The segmentation model is built using the idea of CRFs between various face parts. Three kinds of features are extracted to build the segmentation model. The MSF-CRFs model classifies each pixel in the face image into one of six classes (hair, eyes, skin, nose, mouth, and background). A probabilistic classification strategy is used to generate probability maps for each face class. A Random Decision Forest classifier is trained for each task (head pose, age, and gender) using different combinations of probability maps. A large number of experiments are conducted to determine which face parts help in head pose, age, and gender recognition. Experimental results are validated on six different face data-sets, obtaining better or competitive results compared to SOA.

The segmentation results provide sufficient information about different hidden variables in a face image, opening a route towards further classification problems on face images. For example, we are planning to add more tasks to the framework, such as complex facial expression recognition and ethnicity classification. We are also planning to improve the performance of the segmentation part, for example by using recently introduced CNN-based methods.

**Author Contributions:** Conceptualization, K.K. and I.S.; methodology, K.K.; software, R.U.K.; validation, K.K. and A.G.; formal analysis, A.I.; investigation, R.U.K.; resources, M.A.; data curation, K.K.; writing—original draft preparation, K.K.; writing—review and editing, K.K., A.I., I.S. and M.A.; visualization, K.K.; supervision, K.K.; project administration, K.K.

**Funding:** This research received no external funding.

**Acknowledgments:** We are immensely grateful to the anonymous reviewers and editor for their comments on an earlier version of the manuscript.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

1. Asthana, A.; Zafeiriou, S.; Cheng, S.; Pantic, M. Robust discriminative response map fitting with constrained local models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 23–28 June 2013; pp. 3444–3451.


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Emotion Recognition from Skeletal Movements**

#### **Tomasz Sapiński <sup>1,†</sup>, Dorota Kamińska <sup>1,\*,†</sup>, Adam Pelikant <sup>1</sup> and Gholamreza Anbarjafari <sup>2,3,4</sup>**


Received: 1 March 2019; Accepted: 26 June 2019; Published: 29 June 2019

**Abstract:** Automatic emotion recognition has become an important trend in many artificial intelligence (AI) based applications and has been widely explored in recent years. Most research in the area of automated emotion recognition is based on facial expressions or speech signals. Although the influence of the emotional state on body movements is undeniable, this source of expression is still underestimated in automatic analysis. In this paper, we propose a novel method to recognise seven basic emotional states (happy, sad, surprise, fear, anger, disgust and neutral) utilising body movement. We analyse motion capture data of these seven basic emotional states performed by professional actors/actresses and recorded using the Microsoft Kinect v2 sensor. We propose a new representation of affective movements, based on sequences of body joints. The proposed algorithm creates a sequential model of affective movement based on low level features inferred from the spatial location and the orientation of joints within the tracked skeleton. In the experiments, different deep neural networks were employed and compared to recognise the emotional state of the acquired motion sequences. The experimental results obtained in this work show the feasibility of automatic emotion recognition from sequences of body gestures, which can serve as an additional source of information in multimodal emotion recognition.

**Keywords:** emotion recognition; gestures; body movements; Kinect sensor; neural networks; deep learning

#### **1. Introduction**

People express their feelings through different modalities. There is evidence that the affective state of individuals is strongly correlated with facial expressions [1], body language [2], voice [3] and different types of physiological changes [4]. On the basis of external behaviour, one can easily determine the internal state of the interlocutor. For example, a burst of laughter generally signals amusement, frowning signals nervousness or irritation, and crying is closely related to sadness and weakness [5–7]. Mehrabian formulated the 7-38-55 principle, according to which the percentage distribution of the message is as follows: 7% verbal signals and words, 38% strength, pitch and rhythm of the voice, and 55% body movements and facial expressions [8]. This suggests that words serve mainly to convey information, while body language shapes the conversation or even substitutes for verbal communication. However, it has to be emphasised that this relation is applicable only if a communicator is talking about their feelings or attitudes [9].

Currently, human-computer interaction (HCI) is one of the most rapidly growing fields of research. The main goal of HCI is to facilitate the interaction using several parallel channels of communication between the user and the machine. Although computers are now a part of human life, the relation between human and computer is not natural. Knowledge of the emotional state of the user would allow the machine to boost the effectiveness of cooperation. That is why affect detection became an important trend in pattern recognition and has been widely explored, especially in the case of facial expressions and speech signals [10]. Body gestures and posture receive considerably less focus. With recent developments and the increasing reliability of motion capture technologies, the literature about automatic recognition of expressive movements has been increasing in quantity and quality. Despite the rising interest in this topic, affective body movements in automatic analysis are still underestimated [11].

The most natural and intuitive method for body movement projection is based on the skeleton, which represents hierarchically arranged joint kinematics along with body segments [12]. In the past, research on body tracking was based on video data, which made it extremely challenging and usually amounted to single frame analysis [13–15]. However, motion is by definition a change in position over time, so it should be described as a sequence of consecutive frames. Skeleton tracking has become much easier with the appearance of motion capture systems, which automatically generate the human skeleton represented by 3-dimensional (3D) coordinates. This has also led to an increase in research on body movement, such as unusual event detection and crime prevention [16–20].

Affective movement may be described by displacement, distance, velocity, acceleration, time and speed, by extracting dynamic features from the analysed model. For example, in Reference [21], the authors tracked trajectories of the head and hands from a frontal and a lateral view. They combined shape and dynamic expressive gesture features, creating a 4D model of emotion expression that effectively classified emotions according to their valence and arousal. Dynamic features were also considered in Reference [22], where the authors suggested that the timing of the motion is an accurate representation of the properties of emotional expressions.

Very promising results are presented in Reference [23]. The authors analysed Microsoft Kinect v2 recordings of body movements expressing five basic emotions, namely, anger, happiness, fear, sadness and surprise. They used a deep neural network consisting of stacked restricted Boltzmann machines (RBMs), which outperformed all other classifiers, achieving an overall recognition rate of 93%. However, it must be emphasised that the superior performance is associated with the type of analysed data. In Reference [23], emotions are represented as predetermined gestures (each emotion is assigned to a particular type of gesture, for example, power pose to happiness). The actors/actresses are instructed how to present a particular emotional state prior to recording. Such an approach narrows the research down to the posture recognition problem, and despite such promising results it may not be as effective with more complex gestures.

More viable research is presented by Kleinsmith et al. in Reference [24], where the Gypsy 5 motion capture system was used to record the spontaneous body gestures of Nintendo Wii sports games players. The authors used low-level posture configuration features to create affective movement models for states of concentration, defeat and triumph. An overall accuracy of 66.7% was obtained using a multilayer perceptron. The emotional behaviour of Nintendo Wii tennis players was also analysed in Reference [25]. The authors based their experiment on time-related features such as body segment rotation, angular velocity, angular frequency, orientation, angular acceleration, body directionality and amount of movement. The results obtained using a recurrent neural network (RNN), with an average recognition rate of 58.4%, are comparable to human observers' benchmarks.

More recent research [26] presents an analysis of human gait recordings performed by professional actors/actresses and captured by a Vicon system. The motion data is encoded with hidden Markov models (HMMs), which are subsequently used to derive a Fisher Score (FS) representation. SVM classification is performed in the HMM-based FS space. The authors obtained a total average recognition rate of 78% for the same subject and 69% for interpersonal recognition. Classification was performed for four emotional states: neutral, joy, anger and sadness. In Reference [27], Vicon was used to collect a full body dataset of emotion including anger, happiness, fear and sadness, expressed by 13 subjects. The authors proposed a stochastic model of the affective movement dynamics using HMMs, the performance of which was tested with an SVM classifier and resulted in a 74% recognition rate.

Despite much lower accuracy compared to affective speech or facial expressions, gesture analysis can serve as a complement to a multimodal system. For example, in Reference [28], the authors expanded their studies on emotional facial expressions by analysing sequences of images presenting the motion of arms and upper body. They used a deep neural network model to recognise dynamic gestures with minimal image pre-processing. By summing up the absolute differences of each pair of images in a particular sequence, they created a shape representation of the motion. The experiment demonstrated a significant increase in recognition accuracy achieved by using multimodal information. Their model improves the accuracy of state-of-the-art approaches from 82.5% reported in the literature to 91.3%, using the bi-modal face and body benchmark database (FABO) [29].

Considering all these works, one can observe that there is still a lack of comprehensive affective human analysis from body language [30] mainly because there is no clear consensus about the input and output space. The contributions of this paper are summarised as follows:


This paper adopts the following outline. First, in Section 2, we describe our pipeline for automatic recognition of emotional body gestures and discuss technical aspects of each component. In Section 3, we present and thoroughly discuss the results obtained using the proposed algorithm. Finally, the paper concludes with a summary, followed by suggestions for potential future studies in Section 4.

#### **2. The Proposed Method**

In this section, we present the main components of the proposed system, starting with data acquisition, followed by its pre-processing and ending with classification methods. The structure of the proposed emotional gesture recognition approach is presented in Figure 1.

**Figure 1.** The structure of proposed emotional gestures expression recognition approach.

#### *2.1. 3D Point Data—Emotional Gestures and Body Movements Corpora*

Motion capture data used for the purpose of this research is a subset of our recently gathered multimodal database of emotional speech, video and gestures [31]. This section is dedicated to the recordings of the human skeleton. The recordings were conducted in the rehearsal room of *Teatr Nowy im. Kazimierza Dejmka w Łodzi*. Each recorded person was a professional actor/actress from the aforementioned theatre. A total of 16 people were recorded: 8 male and 8 female, aged from 25 to 64. Each person was recorded separately. Before the recording, all actors/actresses were asked to perform the emotional states in the following order: neutral, sadness, surprise, fear, disgust, anger and happiness (this set of discrete emotions was based on the examination conducted by Ekman in Reference [32]). In addition, they were asked to utter a short sentence in Polish, with the same emotional state as their corresponding gesture. The sentence was *Każdy z nas odczuwa emocje na swój sposób* (English translation: *Each of us perceives emotions in a different manner*). No additional instructions were given on how a particular state should be expressed. All emotions were acted out 5 times, without any guidelines or prompts from the researchers. The total number of gathered samples amounted to 560, that is, 80 samples per emotional state. Recordings took place in a quiet environment with no lighting issues, against a green background. Point cloud and skeletal data feeds were captured using a Kinect v2 sensor. The full body was in frame, including the legs, as shown in Figure 2. The data were gathered in the form of XEF files.

We are fully aware that an acted emotional database has many disadvantages. However, in order to obtain three different modalities simultaneously and gather clean, high quality samples in a controlled, undisturbed environment, the decision was made to create a set of acted out emotions. This approach provides crucial fundamentals for creating a corpus with a reasonable number of recorded samples, diversity of gender and age of the actors/actresses and the same verbal content. What is more, the actor/actress had complete freedom during recording: movements were neither imposed nor previously defined, there were no additional restrictions, and every repetition is different and simulated by the actor/actress themselves. Thus, the presented database may be treated as a quasi-natural one. The database is available for research upon request.

**Figure 2.** Selected frames of actor/actress' poses in six basic emotions: fear, surprise, anger, sadness, happiness, disgust.

For the purpose of this research some of the samples were rejected due to technical reasons, for example, inaccurate position recognition of upper or lower extremities. The final database of affective recordings selected for this study contains 474 samples. The exact number of recordings as well as their average length for each emotional state is presented in Table 1.

**Table 1.** The amount of samples used in the research and the average length of recordings per emotion (in seconds).


Data acquired from the Kinect v2 describe the 3D position and orientation of 25 individual joints, as shown in Figure 3a. The position of each joint is defined by the vector [*x*, *y*, *z*], where the basic unit is 1 m and the origin of the coordinate system is the Kinect v2 sensor itself. The orientation is also determined by three values expressed in degrees. The device does not return orientation values for the head, hands, knees and feet.

**Figure 3.** (**a**) Skeleton mapping in relation to the human body [33]. (**b**) An example frame of Kinect recording showing the skeleton.

#### *2.2. Preprocessing*

Raw Kinect v2 data output needs to be subjected to several steps of processing before it can be used in classification; each step is described in the following sections. The assumption of this research was to reduce data preprocessing to a minimum in order to make the path between data acquisition and classification as short as possible, while maintaining effective emotion recognition.

#### 2.2.1. Normalisation—Frame of Reference

Kinect v2 provides the 3D position and orientation of joints in the space relative to the sensor itself, [*x*, *y*, *z*] (where *x* points left from the sensor, *y* points upwards and *z* is the forward axis of the sensor). This kind of data is influenced by the distance between the actor/actress and the sensor during recording. Thus, skeleton coordinates had to be projected from the sensor space [*x*, *y*, *z*] onto a local space of the body [*u*, *v*, *w*], centred on the *SpineBase* joint of the Kinect skeleton (presented in Figure 3a and called the main joint or root joint), where *u* points left, *v* points up and *w* points forward in relation to the *SpineBase* joint. All [*u*, *v*, *w*] coordinates were calculated with respect to the main joint rotation, as shown in Figure 3b. As a result, a vector containing the positions and orientations of all joints in relation to the main one was obtained. This operation is performed for each frame in every sample. The position and orientation of the main joint in the first frame are treated as the initial state, while the changes in the displacement or rotation of the main joint in subsequent frames are calculated in relation to the first frame.
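The change of reference frame described above can be sketched as a translation followed by a projection onto the body axes. This is only an illustration, not the authors' implementation: the function name `to_local` is hypothetical, and the sketch assumes that the orientation of the root joint is already available as three unit vectors (*u*, *v*, *w*) written in sensor coordinates.

```python
def to_local(joint_xyz, root_xyz, root_axes):
    """Express a joint position in the local frame of the root joint.

    Assumption of this sketch: root_axes holds the three body axes (u, v, w)
    as unit vectors written in sensor coordinates, one axis per row.
    """
    # Translate so that the root joint (SpineBase) becomes the origin.
    d = [j - r for j, r in zip(joint_xyz, root_xyz)]
    # Project the translated point onto each body axis (dot products).
    return [sum(a * c for a, c in zip(axis, d)) for axis in root_axes]
```

With identity axes the result is simply the offset from the root joint, so the coordinates no longer depend on how far the actor/actress stands from the sensor.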

#### 2.2.2. Key Frame Extraction

Gestures and body movements can be analysed as a set of key frames. A key frame should contain crucial information about a particular pose for a given motion sequence. For this purpose, body movement should be divided into separate frames, as can be seen in Figure 4.

**Figure 4.** Sequence of three key frames extracted from point cloud data representing happiness.

There are many methods for key frame extraction. Most of them fall into three categories, namely, curve simplification (CS), clustering and matrix factorisation [34]. For the purpose of this research, the CS method was used. In this method, the motion sequence is represented as a trajectory curve in the 3D space of features and CS algorithms are applied to these trajectory curves. CS utilises Lowe's algorithm [35] for curve simplification, applied to the curve that represents the values of a single joint over a sequence of motion. Starting with the line connecting the beginning and the end of the trajectory, the algorithm divides it into two sublines (intervals) if the maximum deviation of any point on the curve is greater than a certain level of error. The algorithm performs the same process recursively for each subline, until the error rate is small enough for each subline. In this study, we examined the following values of error rate: 1 cm, 2 cm, 3 cm, 5 cm, 10 cm and 15 cm. For the error rates of 1 cm and 2 cm, the obtained number of key frames is almost identical to the number of frames of the recording, even for the neutral state, in which the actor/actress stays almost still. Thus, this level of error rate is considered to be within the Kinect v2 measurement error (especially in the case of hand movement, which is described in Section 2.3). For the error rates of 10 cm and 15 cm, the obtained number of key frames is not sufficient to adequately describe emotional movement. The average number of key frames oscillates around 2, which means that only a few frames between the first and the last one were selected. Thus, error rate values of 1 cm, 2 cm, 10 cm and 15 cm were excluded from further analysis.
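The recursive splitting described above can be sketched for the trajectory of a single joint as follows. This is a minimal illustration under stated assumptions, not the exact procedure of Reference [35]: the deviation is taken to be the perpendicular distance from a point to the line through the two end points of the current interval, and the names `key_frames` and `max_error` are hypothetical.

```python
import math

def _deviation(p, a, b):
    """Distance from 3D point p to the line through a and b."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    ab_len2 = sum(c * c for c in ab)
    if ab_len2 == 0.0:
        # Degenerate interval: both end points coincide.
        return math.sqrt(sum(c * c for c in ap))
    t = sum(x * y for x, y in zip(ap, ab)) / ab_len2
    foot = [ai + t * c for ai, c in zip(a, ab)]
    return math.sqrt(sum((pi - fi) ** 2 for pi, fi in zip(p, foot)))

def key_frames(traj, max_error):
    """Return the sorted frame indices kept as key frames.

    traj is a list of (x, y, z) positions of one joint, in metres;
    max_error is the allowed deviation in metres (e.g. 0.03 for 3 cm).
    """
    def split(lo, hi):
        if hi - lo < 2:
            return []
        worst, worst_dev = lo, 0.0
        for k in range(lo + 1, hi):
            dev = _deviation(traj[k], traj[lo], traj[hi])
            if dev > worst_dev:
                worst, worst_dev = k, dev
        if worst_dev <= max_error:
            return []  # the straight subline is accurate enough
        # Keep the most deviating frame and refine both halves.
        return split(lo, worst) + [worst] + split(worst, hi)

    if len(traj) < 2:
        return list(range(len(traj)))
    return [0] + split(0, len(traj) - 1) + [len(traj) - 1]
```

A small threshold keeps almost every frame, while a large one keeps little more than the first and last frame, which mirrors the behaviour reported for the 1 cm to 2 cm and 10 cm to 15 cm settings.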

#### 2.2.3. Normalisation—Reduction of Individual Features

It is assumed that every human is built in proportion to his or her height, and that the length of legs and arms is proportional to the overall body structure [36]. To unify joint positions between taller and shorter individuals, we propose normalisation based on the distance between the two joints with the lowest positional noise across all recordings: *SpineBase* and *SpineShoulder*. The distance used for normalisation is measured for each frame of the actor/actress's neutral recordings. Normalisation of all joints within a given sequence of frames follows Equation (1), where the skeleton consists of 25 joints and *d<sub>i</sub>* is the distance vector between joints *J<sub>i</sub>* and *J*<sub>0</sub>, normalised to the median of the distances between joints *J*<sub>0</sub> and *J*<sub>20</sub> (*SpineBase* and *SpineShoulder*) over all neutral recordings of each individual:

$$d_i = \frac{J_i}{\operatorname{median}\left(\left\lVert J_{20} - J_0 \right\rVert\right)} \tag{1}$$

where *i* = 1, ... , 25 indexes the joints and *J<sub>i</sub>* is the position of joint *i* in the local frame centred on *J*<sub>0</sub>. This process is performed for all joints, relative to the skeleton in the neutral position of the particular individual. The neutral state is used in order to preserve information about special movements, such as a jump or squat, occurring in emotional recordings (e.g., a joyful hop). Considering the same degree of freedom of each body part for all recorded individuals, values of joint orientation did not require any additional processing.
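As an illustration of this normalisation, the sketch below computes the per-person scale as the median *SpineBase* to *SpineShoulder* distance over the neutral frames and divides every joint vector by it. It is a sketch under assumptions rather than the authors' code: the function names are hypothetical, the joint indices 0 and 20 follow the *J*<sub>0</sub> and *J*<sub>20</sub> notation of the text, and joint positions are assumed to be already expressed relative to *SpineBase*.

```python
import math
from statistics import median

SPINE_BASE, SPINE_SHOULDER = 0, 20  # J_0 and J_20 in the text

def spine_scale(neutral_frames):
    """Median SpineBase-SpineShoulder distance over one person's neutral frames."""
    def length(frame):
        a, b = frame[SPINE_SHOULDER], frame[SPINE_BASE]
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return median(length(frame) for frame in neutral_frames)

def normalise_frame(frame, scale):
    """Divide every joint vector (already relative to SpineBase) by the scale."""
    return [[c / scale for c in joint] for joint in frame]
```

Because the scale comes only from neutral recordings, a jump or squat in an emotional recording changes the normalised coordinates instead of being absorbed by the normalisation.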

The output of the key frame extraction is a set of sequences of varying lengths, which cannot be used as input for all types of classifiers, in our case the CNN. In order to unify the length of the sequences, we applied zero padding to prepare the data for the CNN.

Next, all sequences are subjected to z-score normalisation, which is a widely used step to accelerate the learning process of neural networks [37–39]. For the purpose of this research we apply *sequence-wise normalisation* [38] to each key frame sequence. In this method, the mean and standard deviation are calculated from the data of all sequences, excluding the zero frames added during the previous step.
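The two steps above (zero padding and z-score normalisation that ignores the padded frames) can be sketched together. This is one possible reading, stated as an assumption: statistics are computed per feature over the real frames of all sequences, and the padding is appended afterwards so that it never enters the mean or standard deviation. The function name `pad_and_standardise` is hypothetical.

```python
from statistics import fmean, pstdev

def pad_and_standardise(sequences):
    """Z-score real frames and zero-pad every sequence to the same length.

    sequences is a list of sequences; each sequence is a list of frames and
    each frame is a list of float features of equal size.
    """
    n_feat = len(sequences[0][0])
    max_len = max(len(s) for s in sequences)
    # Per-feature statistics, computed from real (non-padded) frames only.
    columns = [[frame[k] for s in sequences for frame in s] for k in range(n_feat)]
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # guard constant features
    out = []
    for s in sequences:
        scaled = [[(x - m) / sd for x, m, sd in zip(frame, means, stds)]
                  for frame in s]
        padding = [[0.0] * n_feat for _ in range(max_len - len(s))]
        out.append(scaled + padding)
    return out
```

Keeping the padding out of the statistics matters because long runs of zeros would otherwise pull the mean towards zero and inflate the variance of short sequences.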

#### *2.3. Datasets Division*

During data preparation, a relative average quantity of motion (the distance covered by a specific joint) was measured for each emotional state. Calculations were made according to Equation (2):

$$avg_{je} = \sum_{n_e=1}^{N_e} \sum_{f_{n_e}=2}^{F_{n_e}} \frac{\left|p_{je}(f_{n_e}) - p_{je}(f_{n_e} - 1)\right|}{F_{n_e} N_e} \tag{2}$$

where: *j* = 0, ... , 25 is the number of the joint; *e* is the emotional state (Ne: neutral, Sa: sadness, Su: surprise, Fe: fear, An: anger, Di: disgust, Ha: happiness); *N<sub>e</sub>* is the number of recordings per emotion *e*; *n<sub>e</sub>* = 1, ... , *N<sub>e</sub>* is the index of the recording of emotional state *e*; *F<sub>ne</sub>* is the number of frames in recording *n* of emotional state *e*; *f<sub>ne</sub>* = 2, ... , *F<sub>ne</sub>* is the frame index in recording *n* of emotional state *e* (excluding the first frame); *p<sub>je</sub>*(*f<sub>ne</sub>*) is the position of joint *j* in frame *f<sub>ne</sub>* in hierarchical local coordinates.

For each joint, the calculated values are relative, based on changes in the local coordinate system of the given joint, the centre of which is located in the superior joint in the hierarchical skeleton construction (e.g., for the *WristRight* joint, corresponding to the position of the right hand wrist, the origin of the local coordinate system is the *ElbowRight* joint, corresponding to the position of the right hand elbow). These calculations were made separately for each emotion. The results are shown in Figure 5.
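A minimal sketch of the quantity of motion for one joint and one emotion is given below. It reads Equation (2) as a sum of frame-to-frame displacements within each recording, divided by the number of frames and averaged over recordings; that reading, and the function name `average_motion`, are assumptions of this sketch.

```python
import math

def average_motion(recordings):
    """Relative average quantity of motion of one joint for one emotion.

    recordings is a list of recordings; each recording is a list of per-frame
    positions (x, y, z) of the joint in the local frame of its parent joint.
    """
    total = 0.0
    for frames in recordings:
        # Path length: sum of displacements between consecutive frames.
        path = sum(
            math.sqrt(sum((a - b) ** 2 for a, b in zip(frames[f], frames[f - 1])))
            for f in range(1, len(frames))
        )
        total += path / len(frames)
    return total / len(recordings)
```

Evaluating this function for every joint and every emotion would give the kind of joint-by-emotion grid that a heat-map can display.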

**Figure 5.** Heat-map presenting distribution of joints involvement for particular emotional state (**a**) for all joints (**b**) excluding hands.

One can observe in Figure 5a that the largest involvement in emotional expression is observed for hands and thumbs (*HandTipLeft*, *HandTipRight*, *ThumbLeft*, *ThumbRight*, *HandLeft*, *HandRight*). However, the intensity of movement of these particular joints is caused by the measurement error of Kinect v2. Thus, in further analysis it is assumed that the hand position is determined by the position of the wrists (*WristLeft* and *WristRight*), and all hand related joints were excluded from the datasets. According to Figure 5b, the largest involvement is observed for wrists and arm related joints, which is common for emotion expression. It is worth emphasising that the involvement of legs is visible, especially for the knees (*KneeLeft*, *KneeRight*) and ankles (*AnkleLeft*, *AnkleRight*).

Most state-of-the-art research focuses only on the upper body; thus, in this study, the influence of leg movement on affective gestures was examined. In addition, we investigated which type of data (joint orientation, position or a mixture of both) is best suited for the classification of emotional states from gestures. In order to conduct such research we examined the datasets presented in Table 2.


**Table 2.** Input datasets for classification

#### *2.4. Classification—Models of Neural Networks*

The final step of the proposed method is classification, which aims to assign input data to a specific category *k* (in this case: neutral state, sadness, surprise, fear, anger, disgust, happiness). In this work, we apply different deep learning Neural Networks (NN) to the proposed combination of datasets (Table 2) in order to compare their performances, based on the recognition rates. We use a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN) and a Recurrent Neural Network with Long Short-Term Memory (RNN-LSTM) with low level features (positions and orientation of joints within the skeleton) and compare their efficiency in recognising emotion from motion. The proposed approach of adapting the abovementioned neural networks to motion sequence analysis is presented in the following sections.

#### 2.4.1. Convolutional Neural Network

The scope of use of CNNs has expanded greatly to different application domains, including the classification of signals representing emotional states [40,41]. Due to its well configured structures consisting of multiple layers, this kind of network is able to determine the most distinctive features based on enormous collections of data. The possibility of reducing the number of parameters required for images over a regular network makes CNN the most commonly used classifier for image processing. CNN considers an image as a matrix and uses the convolution operation [42] to implement a filter, which is sliding through the input matrix. In a multi-layered CNN, the input of each convolution layer is comprised of the filtered output matrix of the previous layer. The convolutional filter values are adjusted during the training phase. The process of using a CNN for gestures-based emotion recognition from sequence of movement is presented in Figure 6a.
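The sliding filter mentioned above can be illustrated with a small, framework-free sketch. It computes the response map of one filter over an input matrix without padding and with a stride of 1; as in most deep learning libraries, the kernel is not flipped, so strictly speaking this is cross-correlation. The function name `conv2d_valid` is hypothetical, and in a trained CNN the kernel values would be learned rather than fixed.

```python
def conv2d_valid(matrix, kernel):
    """Slide a kernel over a matrix (no padding, stride 1).

    Returns the response map as a list of rows. The kernel is applied
    without flipping, as is common in deep learning libraries.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(matrix) - kh + 1
    out_w = len(matrix[0]) - kw + 1
    return [
        [
            sum(
                matrix[i + a][j + b] * kernel[a][b]
                for a in range(kh)
                for b in range(kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]
```

For a motion sequence, the input matrix could hold one row per key frame and one column per joint feature, so that a filter responds to local patterns across neighbouring frames and features.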

**Figure 6.** (**a**) The process of using a Convolutional Neural Network (CNN) for gestures-based emotion recognition, showing how matrices are created from a motion sequence. (**b**) The process of using a Recurrent Neural Network (RNN) for motion sequence analysis; each time step of the motion sequence is evaluated by an RNN.

#### 2.4.2. Recurrent Neural Network

RNNs allow operation directly on time sequences. They are successfully applied to tasks involving temporal data such as speech recognition, language modelling, translation, image captioning or gestures analysis. In RNN, the output of the previous sequence time step is taken into consideration when calculating the result of the next one. However, standard RNN does not handle long term dependencies well, due to the vanishing gradient problem [43].
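The recurrence described above, where the previous time step feeds into the next one, can be sketched with a single-layer Elman-style RNN. This is a generic textbook formulation with tanh units, not the architecture trained in this work; the function name `rnn_forward` and the weight layout are assumptions of the sketch.

```python
import math

def rnn_forward(inputs, w_in, w_rec, bias):
    """Run a single-layer tanh RNN over a sequence and return all hidden states.

    inputs: list of input vectors (one per time step)
    w_in:   input weights, shape [hidden][input]
    w_rec:  recurrent weights, shape [hidden][hidden]
    bias:   bias vector, shape [hidden]
    """
    hidden = [0.0] * len(bias)  # initial state
    states = []
    for x in inputs:
        # Each new state depends on the current input and the previous state.
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_in[h], x))
                + sum(w * hp for w, hp in zip(w_rec[h], hidden))
                + bias[h]
            )
            for h in range(len(bias))
        ]
        states.append(hidden)
    return states
```

Because every state is squashed by tanh and multiplied by the recurrent weights again at the next step, the influence of early frames shrinks quickly, which is the intuition behind the vanishing gradient problem.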

The Long Short-Term Memory network (RNN-LSTM) is an extension of the RNN that handles long-term dependencies much better than the standard version. In the RNN-LSTM architecture, the RNN uses gate units in addition to the common activation function, which extend its memory [44]. Such an architecture allows the network to learn and "remember" dependencies over more time steps, linking causes and effects remotely [45]. The process of using an RNN and RNN-LSTM for gestures-based emotion recognition from a sequence of movement is presented in Figure 6b.

#### **3. Results and Discussion**

#### *Selection of the Optimal Classification Model*

For each of the neural network types mentioned in Section 2.4, the following architectures were tested:


For all NN types, separate models were built, increasing the neuron count on each layer by 25 for each new model (e.g., for RNN, starting with a network of 2 layers of 50 recurrent neurons and finishing with 4 layers of 400 neurons). Table 3 shows the results obtained using the three types of neural network for the above-mentioned datasets. For CNN, the best results were obtained for a 4-layer network: 3 convolutional layers of 250, 250 and 100 neurons, respectively, and a dense layer of 100 neurons. For RNN, the best results were obtained for a model with 3 recurrent layers of 300, 150 and 100 neurons. RNN-LSTM achieved its best results for a 3-layer architecture of 250, 300 and 300 neurons. In addition, all NNs had a single dense layer of 7 neurons as the output layer. We used 10-fold leave-one-subject-out cross-validation and repeated the process for 10 iterations, averaging the final score. All NNs were trained using ADAM [46] for gradient descent optimisation and cross-entropy as the cost function. Cross-entropy was chosen because it is a robust method based on well known classical probabilistic and statistical principles and is therefore problem-independent; it is capable of efficiently solving difficult deterministic and stochastic optimisation problems [47]. Training was set to 500 epochs with an early stop condition if no loss decrease was detected for more than 30 epochs.
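The stopping rule stated above (at most 500 epochs, stop when the loss has not decreased for more than 30 epochs) can be sketched independently of any deep learning framework. The sketch makes two assumptions: "decrease" is measured against the best loss seen so far, and `loss_per_epoch` is a hypothetical callable standing in for one epoch of training that returns the monitored loss.

```python
def train_with_early_stopping(loss_per_epoch, max_epochs=500, patience=30):
    """Run up to max_epochs and stop once the loss stalls.

    loss_per_epoch: callable taking the epoch index and returning the loss
    (a stand-in for one real training epoch).
    Returns a tuple (epochs_run, best_loss).
    """
    best = float("inf")
    since_best = 0
    epochs_run = 0
    for epoch in range(max_epochs):
        loss = loss_per_epoch(epoch)
        epochs_run += 1
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best > patience:
                break  # no decrease for more than `patience` epochs
    return epochs_run, best
```

With a loss that never improves after the first epoch, the loop ends after 32 epochs: one epoch that sets the best value and 31 without a decrease.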

**Table 3.** Classification performances of different feature representations for the set of 7 basic emotions. Numbers in bold highlight the maximum classification rates achieved in each column. PO—Positions and orientation, upper and lower body, POU—Positions and orientation, upper body, P—Positions, upper and lower body, O—Orientation, upper and lower body, PU—Positions, upper body, OU—Orientation, upper body.


One can easily observe that the best results (69%) were obtained using RNN-LSTM on the *P* set, containing the positions of all skeletal joints (upper and lower body). In general, this set of features gives the best results for all types of networks (58.1% for CNN, 59.4% for RNN). This suggests that, of all the considered feature types, this kind of feature provides the best description of emotional expressions. In the case of the *PU* set, results for all networks are lower by less than 5%, which indicates the effect of the lower part of the body on recognition. Using orientation *O* as a feature set, even when complementing the position (*PO* or *POU*), results in much lower recognition. The results in Table 3 also indicate a slight impact of the error rate: better results were achieved using the 3 cm error rate for almost every dataset and NN, and in a few cases the results were equal. This may suggest that even a small movement or displacement can affect the recognition of emotions, and that the error rate of 5 cm might not be precise enough to represent all relevant movement data.

In addition, the experiment was conducted on sequences without the keyframing step in the pre-processing (containing all the recorded frames) for all NN models and all the datasets. The classification results were 5–10% lower (depending on the model and set) than those obtained from key frames with an error rate of 3 cm. Moreover, the NN training time rose significantly due to a large increase in the data volume. Lower recognition results for sets without keyframing might have been caused by the Kinect v2 sensor noise, as the device output is not very precise and produces small variations in the returned positions and orientations from frame to frame. This can be mitigated by filtering the signal; however, filtering is a time-consuming and computationally expensive process, which conflicts with the assumption of reducing data pre-processing to a minimum. In our approach, the keyframing process allowed us to avoid the issues related to sensor precision.

Performance of the proposed NN models was compared with the state-of-the-art NN architecture, ResNet. It has won several classification competitions, achieving promising results on tasks related to detection, localisation and segmentation [48]. The core idea of this model is to use a so-called identity shortcut connection to jump over one or more layers [49]. ResNets use convolutional layers with 3 × 3 filters, which are followed by batch normalisation and a rectified linear unit (ReLU). Many experiments have shown that the use of shortcut connections makes the model more accurate and faster compared to equivalent models without them. We recreated the exact process described in Reference [48], as the results obtained there for action recognition look very promising (accuracy over 99%) and, as initially assumed, the method might be applicable to the classification of emotional gestures. The 3D coordinates of the Kinect skeleton (from our *P* and *PU* datasets) were transformed into RGB images. The sets were also augmented according to the description in the source paper. For our experiment, we prepared the testing and training sets following the 10-fold leave-one-subject-out cross-validation method, meaning that the testing set contained neither the training samples nor samples obtained by augmenting the training samples. The accuracy achieved using ResNet is significantly lower than that of the other NN types. This might be caused by the size of the original dataset, which contains only 474 unique samples, so that the augmentation process presented in Reference [48] does not produce a set diverse enough to train such a deep NN.

For each type of neural network, the best results are presented in the form of a confusion matrix (see Figure 7). One can observe that the best results were obtained for the neutral state, as it differs greatly from the other expressions (the actor/actress stood still, while there was a relatively larger amount of movement when expressing other states).

Happiness, sadness and anger have a high rate of recognition and are only sporadically classified as other emotions, as gestures in those three states are highly distinctive and differ from other emotional states (in terms of dynamics, body and limb positions and movement), even when the gestures are not exaggerated. Disgust and fear were confused with one another most frequently. This might be caused by the way they were performed by the actors/actresses, as this confusion pattern is analogous for all three NN types. It is clearly visible in the recordings that those two emotions were acted out very similarly in terms of gestures (usually a backing out movement with hands placed near the head or neck for both states).
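For readers who want to reproduce this kind of analysis, the sketch below builds a confusion matrix from true and predicted labels and derives the per-class recognition rate from its rows. The function names are hypothetical and any labels passed to them in an example are made-up toy data, not results from this study.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Count predictions: rows are true classes, columns are predicted classes."""
    index = {c: k for k, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix

def per_class_recall(matrix):
    """Fraction of each true class that was recognised correctly."""
    return [
        row[k] / sum(row) if sum(row) else 0.0
        for k, row in enumerate(matrix)
    ]
```

Off-diagonal cells that are large in both directions for a pair of classes, as reported here for fear and disgust, indicate that the two classes are mutually confused rather than one being absorbed by the other.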


**Figure 7.** Confusion matrix for (**a**) CNN on *P* set with 3 cm error rate (**b**) RNN on *P* set with 3 cm error rate (**c**) RNN-LSTM on *P* set with 3 cm error rate. Seven emotional states: Ne—neutral, Sa—sadness, Su—surprise, Fe—fear, An—anger, Di—disgust, Ha—happiness.

The recognition accuracy of the neutral class far exceeds that of the other emotional states, because the samples for this state contain the least amount of motion and differ greatly from all the other states. Therefore, in the next step we analyse two sets without this class. From the first one we merely exclude the neutral state; thus, it consists of sadness, surprise, fear, anger, disgust and happiness. The second set contains the emotional states most commonly used in the literature: sadness, fear, anger and happiness. Experimental results for the above-mentioned datasets are presented in Table 4. As in the case of seven classes, the best results were obtained using the *P* set. Similarly, RNN-LSTM proved to be the most effective, providing 72% in the case of 6 classes and 82.7% in the case of 4. Confusion matrices for the above-mentioned sets are presented in Figures 8 and 9.

**Table 4.** Classification performances of different feature representations for the set of basic emotions. PO—Positions and orientation, upper and lower body, POU—Positions and orientation, upper body, P—Positions, upper and lower body, O—Orientation, upper and lower body, PU—Positions, upper body, OU—Orientation, upper body.


**Figure 8.** Confusion matrices for (**a**) CNN on *P* set with 3 cm error rate (**b**) RNN on *P* set with 3 cm error rate (**c**) RNN-LSTM on *P* set with 3 cm error rate. Six emotional states: Fe—fear, Ha—happiness, Sa—sadness, Su—surprise, An—anger, Di—disgust.


**Figure 9.** Confusion matrices for (**a**) CNN on *P* set with 3 cm error rate (**b**) RNN on *P* set with 3 cm error rate (**c**) RNN-LSTM on *P* set with 3 cm error rate. Four emotional states: Fe—fear, Ha—happiness, Sa—sadness, An—anger.

In order to compare the proposed method with other classification methods, we calculated the most commonly used features, such as kinematic features (velocity, acceleration, kinetic energy), spatial extent features (bounding box volume, contraction index, density), smoothness features, leaning features and distance features. During feature extraction we strictly followed the approach presented in Reference [23], since the authors obtained very promising results on a database derived from Kinect recordings. We juxtaposed several well-known classification methods to verify the above-mentioned features and their effectiveness in gesture-based emotion recognition. The obtained results are presented in Table 5.

**Table 5.** The performance of some well-known classifiers.


To determine the performance of the above-mentioned classifiers we used the WEKA [50] environment. All parameters of the classifiers were set empirically in order to achieve the highest efficiency. The best results were obtained with Random Forests. However, it should be emphasised that none of the methods listed above achieves better results than the proposed approach. This is a result of the generalisation of features from the whole recording, an approach which might be appropriate for simple gesture recognition but becomes inaccurate for more complex and non-repeatable expressions.

#### **4. Conclusions**

In this paper, we presented a sequential model of affective movement as well as how different sets of low level features (positions and orientation of joints) performed on CNN, RNN and RNN-LSTM. The training and testing data contained samples representing seven basic emotions. The database consisted of recordings of constant affective movements, in contrast with other research, which is mostly reduced to specific single gesture recognition. Thus, we did not analyse solely separated selected frames but the whole movement as a unit. This experiment highlighted how challenging the task of recognising an emotional state based merely on gestures might be. The performance was much lower than in the case of particular gesture recognition; however, it was still higher than a human's performance (63%) [31].

The obtained results showed that body movements can serve as an additional source of information in a more comprehensive study. Thus, for future work we plan to combine all three modalities, namely audio, facial expressions and gestures, which are signals perceived by a healthy human during a typical conversation. We believe that additional patterns extracted from affective movement may have a significant impact on the quality of recognition, especially in the case of emotion recognition in the wild [41]. In addition, we plan to extend our analysis using the DensePose [51] method and to fuse and juxtapose its output with the features provided by Kinect v2.

What is more, we will explore and compare methods used for action recognition, such as those presented in References [52–54], as they provide an interesting expansion of the models used in this paper. For example, in Reference [52] the authors use a similar RNN-LSTM network architecture, but instead of raw skeletal data, geometrical features extracted from the skeleton are fed to the NN. Another interesting approach for RNN-LSTM is presented in Reference [53], where spatial attention with joint-selection gates and temporal attention with frame-selection gates are added to RNN-LSTM. In Reference [54], the authors used an F2C CNN-based network architecture for action recognition, with superior results compared to other classification models. We plan to incorporate methods used in action recognition for the purpose of gesture-based emotion classification, as the problem poses similar challenges in both areas.

**Author Contributions:** conceptualisation, T.S. and D.K.; methodology, T.S. and D.K.; software, T.S. and D.K.; validation, T.S. and D.K.; formal analysis, T.S. and D.K.; investigation, T.S. and D.K.; resources, T.S. and D.K.; data collection, T.S. and D.K.; writing—original draft preparation, T.S. and D.K.; writing—review and editing, G.A.; visualisation, T.S. and D.K.; supervision, G.A. and A.P.; project administration, G.A.; funding acquisition, D.K.

**Funding:** This research received no external funding.

**Acknowledgments:** The Estonian Centre of Excellence in IT (EXCITE) funded by the European Regional Development Fund and the Scientific and Technological Research Council of Turkey (TÜBITAK) (Project 1001 - 116E097).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

*Article*

## **Enhanced Approach Using Reduced SBTFD Features and Modified Individual Behavior Estimation for Crowd Condition Prediction**

#### **Fatai Idowu Sadiq 1,2,\*, Ali Selamat 1,3,4,\*, Roliana Ibrahim <sup>1</sup> and Ondrej Krejcar <sup>3</sup>**


Received: 28 February 2019; Accepted: 7 May 2019; Published: 13 May 2019

**Abstract:** Sensor technology provides the real-time monitoring of data in several scenarios that contribute to the improved security of life and property. Crowd condition monitoring is an area that has benefited from this. The basic context-aware framework (BCF) uses activity recognition based on emerging intelligent technology and is among the best that has been proposed for this purpose. However, accuracy is low, and the false negative rate (FNR) remains high. Thus, the need for an enhanced framework that offers reduced FNR and higher accuracy becomes necessary. This article reports our work on the development of an enhanced context-aware framework (EHCAF) using smartphone participatory sensing for crowd monitoring, dimensionality reduction of statistical-based time-frequency domain (SBTFD) features, and enhanced individual behavior estimation (IBEenhcaf). The experimental results achieved 99.1% accuracy and an FNR of 2.8%, showing a clear improvement over the 92.0% accuracy, and an FNR of 31.3% of the BCF.

**Keywords:** context-aware framework; accuracy; false negative rate; individual behavior estimation; statistical-based time-frequency domain and crowd condition

#### **1. Introduction**

Crowd abnormality monitoring (CAM) is the process of determining individual behavior in a crowd to prevent accidents in crowd-prone areas. Crowd monitoring using activity recognition (AR) to analyze individual behavior is maturing rapidly due to current advances in sensor technologies [1]. Increased research focus on human activity recognition (HAR) in diverse application domains highlights the significance of human–computer interaction (HCI) [2]. Two conventional methods are employed in the analysis of abnormal behavior in crowds. According to Zhang et al. [3], the "object-based" method identifies a crowd as a collection of individuals, while segmentation methods are used for analyses of crowd behaviors. In crowd behavior analysis, the performance of segmentation or object detection is usually limited by the complexity of detecting objects [3]. Previous studies have demonstrated the object-based method with individual activity recognition. Issues in ongoing research have been extensively discussed, with initial solutions suggested in [4]. Context-aware approaches have been proposed previously, for example in [5]. However, only one study [6] focused on crowd abnormality monitoring and mitigation with the use of individual AR, and the threshold it used for crowd density in terms of the prediction of crowd condition is unclear [6]. An efficient approach should be able to accurately determine the number of persons within a square meter in order to prevent accidents during an emergency in a crowd scenario [7]. In the study by [6], the simulation was done inside a university building and conducted with a CAM system [6], thus reducing the practical applicability of the system. Therefore, an alternative with high accuracy and a low false negative rate (FNR), which measures the false alarm, is needed to promote the efficient and reliable prediction of crowd conditions based on individual behavior [6]. This will be based on an extension of the basic context-aware framework (BCF) proposed in [6]. A potential solution is to advance the previous BCF using the reduction of relevant statistical-based time-frequency domain (SBTFD) features with improved accuracy, reduced FNR, and IBEenhcaf for individual and crowd condition prediction.

This article proposes an enhanced context-aware framework using IBEenhcaf, motivated by the need to improve the safety of human lives in crowd-prone environments. The proposed approach utilizes reduced features with the high accuracy previously reported [4,8]. This study reports the result of an ongoing study on the validation of other sensor data, which includes the effect of low FNR and a clear definition of the crowd density threshold in individuals per square meter (m<sup>2</sup>) for crowd monitoring. The proposed approach employs the crowd density definition suggested in [7] and utilizes individual contexts from sensor signals in real time. In addition, the detection of five or more persons per m<sup>2</sup> is considered an extremely high density [9] to minimize the risk of accident in a moving crowd. The suggested solution promises accurate and reliable feedback to likely accident victims in an unforeseen situation. In this article, the context-aware framework is defined as a BCF that utilizes contexts such as individual user activities, location, and time [6]. The contexts are hidden information derived from smartphone sensor data [6]. The contributions of this article are:

(1) To present the validation result of other sensors used for individual behavior estimation (IBE) to extend the BCF.

(2) To suggest a clear crowd density threshold (CDT) per m<sup>2</sup> using a low FNR from reduced features to extend BCF.

(3) To propose an enhanced approach with reduced SBTFD features and modified IBE for crowd condition prediction with CDT to improve on BCF.

The proposed solution has the potential to minimize incessant death occurrences in social gatherings through a viable technology concept. The rest of the article is organized as follows: Section 2 discusses the current approaches to crowd monitoring, Section 3 presents the materials and methodology used in the study, Section 4 presents experimental results for the investigated issue to achieve the contributions in the article. The results are discussed in Section 5, while Section 6 addresses the conclusion and future work.

#### **2. Current Approaches in Crowd Monitoring System**

The crowd monitoring system (CMS) currently has three approaches, namely: (i) computer vision-based methods, (ii) sensor data analysis, and (iii) social media data analysis [10]. The most commonly used is sensor data analysis [11], which is also employed in this study for several reasons. These include (i) the provision of accurate and real-time information, (ii) the potential of new smartphone sensors to revolutionize how we manage information, (iii) improved safety and security in crowded places if well utilized, (iv) wider coverage, as smartphones are used by almost everyone, and (v) feedback to potential victims in case of accidents [12]. Besides, sensor data analysis is widely used in AR with promising results [1,2,5]. Several feature extraction methods (FEM) have been employed in recent studies [13,14]. Table 1 presents the strengths and limitations of existing feature extraction methods.

The following section presents an analysis of FEM, including time domain (TD), frequency domain (FD), and feature reduction, and highlights those that can potentially be used for individual and crowd condition monitoring. Then, feature reduction based on feature selection methods (FSM) is examined for CMS for the minimization of time, classification, and accurate prediction. Related studies in context-aware frameworks are also discussed.

#### *2.1. Time Domain (TD)*

TD features include mean, median, range, variance, maximum, minimum, skewness, and kurtosis, to name a few. These features are widely used in HAR [15–17]. According to [17], the integral method has been applied to extract energy expenditure information from raw sensor signal data, where the total integral of the modulus of acceleration (IMA) was employed. The method is referred to as the time integral of the modulus of accelerometer signals, and is expressed in Equation (1):

$$IMA_{tot} = \int_{t=0}^{N} |a_x| \, dt + \int_{t=0}^{N} |a_y| \, dt + \int_{t=0}^{N} |a_z| \, dt \tag{1}$$

where *ax*, *ay*, *az* represent the orthogonal components of acceleration, *t* denotes time, and *N* is the window length. Some feature extraction methods rely on the ability to transform input signals to and from different domains [14]. Feature computation on a smartphone requires care because of the computational complexity imposed by limited memory, processing time, and battery lifetime. According to [18], almost all TD features are suitable for mobile devices, because their correlation operations have higher computational cost. A feature extracted from the raw sensor signal data in individual activity recognition is a piece of information that can be used when classifying activities to determine the characteristics of the individual in a crowd scenario in this study. To create features from the raw AR sensor dataset, different methods and mathematical calculations are applied to it, and new features are extracted. Other time domain features such as zero crossing, signal vector magnitude, signal magnitude area, and angular velocity have also been used in AR [19,20].
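As an illustration of Equation (1), the integrals can be approximated numerically on a sampled window. The following is a minimal sketch, not the implementation used in [17]: the function name `ima_total`, the trapezoidal rule, and the 50 Hz sampling interval in the example are assumptions made here for illustration.

```python
from typing import Sequence


def ima_total(ax: Sequence[float], ay: Sequence[float], az: Sequence[float],
              dt: float) -> float:
    """Approximate IMA_tot as the sum of the time integrals of |a_x|, |a_y|, |a_z|."""
    if not (len(ax) == len(ay) == len(az)):
        raise ValueError("all three axes must have the same number of samples")

    def integral_abs(samples: Sequence[float]) -> float:
        mags = [abs(s) for s in samples]
        # Trapezoidal rule: average adjacent magnitudes and multiply by the time step.
        return sum((a + b) / 2.0 * dt for a, b in zip(mags, mags[1:]))

    return integral_abs(ax) + integral_abs(ay) + integral_abs(az)


# Hypothetical window: 5 samples per axis, sampled every 0.02 s (50 Hz, assumed).
window_x = [1.0, -1.0, 1.0, -1.0, 1.0]
window_y = [0.0, 0.0, 0.0, 0.0, 0.0]
window_z = [2.0, 2.0, 2.0, 2.0, 2.0]
ima = ima_total(window_x, window_y, window_z, dt=0.02)
print(round(ima, 6))
```

Taking the absolute value before integrating means that oscillating acceleration does not cancel out, which is why the alternating x-axis samples contribute as much as a constant signal of the same magnitude.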

#### *2.2. Frequency Domain (FD)*

Features in this domain are important because AR sensor data in the Fourier domain has a much greater range than in the spatial domain. To be sufficiently accurate, the values are usually computed and stored as floating-point numbers. The fast Fourier transform (FFT) also preserves information from the original raw signal, so important features are not lost in the transformation [21]. FD analysis splits the signal into sinusoidal waves with various frequencies using Equation (2):

$$X(f) = \int_{1}^{w} x(t) e^{-j2\pi ft} \, dt; \quad x(t) = \int_{1}^{w} X(f) e^{j2\pi ft} \, df \tag{2}$$

where *t* = time; *f* = frequency; *X(f)* is the Fourier transform of the signal; and *x(t)* is recovered from *X(f)* by the inverse Fourier transform [22].
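For sampled sensor data, the transform in Equation (2) is evaluated in its discrete form. The sketch below is an assumption-laden illustration rather than the authors' code: it uses a naive discrete Fourier transform from the Python standard library instead of an optimized FFT, and the helper name `dominant_frequency` and the 32 Hz sampling rate are hypothetical.

```python
import cmath
import math
from typing import List, Sequence


def dft(x: Sequence[float]) -> List[complex]:
    """Naive O(N^2) discrete Fourier transform: X[k] = sum_n x[n] * exp(-j*2*pi*k*n/N)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def dominant_frequency(x: Sequence[float], fs: float) -> float:
    """Frequency in Hz of the strongest non-DC component below or at the Nyquist limit."""
    spectrum = dft(x)
    half = len(x) // 2
    k_max = max(range(1, half + 1), key=lambda k: abs(spectrum[k]))
    return k_max * fs / len(x)


# Hypothetical one-second window sampled at 32 Hz containing a 4 Hz sinusoid.
fs = 32.0
signal = [math.sin(2 * math.pi * 4 * n / fs) for n in range(32)]
peak_hz = dominant_frequency(signal, fs)
print(peak_hz)
```

The sampling rate bounds what can be recovered: components above half the sampling rate cannot be distinguished, which is one reason the choice of sampling frequency matters for FD features.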

The proper selection of FD features and sampling frequency is a key factor in extracting the frequency components; failing to do so may result in a false prediction for an individual in a crowd [3]. Zheng [3] transforms *x(t)* to overcome the drawback of inaccurate detection by introducing a frequency domain component and obtaining relevant information for AR [3,23]. Other important domains include the wavelet domain (WD), which is better suited to the analysis of irregular data patterns, that is, when impulses exist at different time intervals [12], and which therefore requires the selection of a proper mother wavelet. The heuristic domain (HD) works by assigning the correct value to suggest the best corrective measure for sensor signals [16]. Therefore, HD requires input from multiple experts and aggregates the results. The time domain–frequency domain (TDFD) produces an efficient performance for the representation of individuals in the crowd [14]; however, the use of FFT\_RMS as the only FD feature may not ensure the performance of other TD features.


**Table 1.** Strength and limitations of existing feature extraction methods.

<sup>1</sup> Note: TD = Time domain feature; FD = Frequency domain feature; TDFD = Time domain–frequency domain feature; FFT\_RMS = Fast Fourier Transform of Root Mean Square.

Table 2 presents a synthesis of existing FEMs and their names in AR. It shows the features used in crowd conditions, the application domain, and the researcher, and it also indicates those that have not been used in crowd conditions. Table 2 shows that only conventional FEMs have been used in previous crowd-related research, with mean and standard deviation along x, y, and z [16,18,22], and variance along x, y, and z [14,18]. This could be responsible for the limited accuracy of 92% reported for CAM, which has also been noted by [24] to be generally low. It can also be noted that some salient TDFD features that are capable of accurate prediction were overlooked in the BCF, thus strengthening the need for further studies.

**Table 2.** Summary of feature extraction methods (FEM) used and those that have not been used in crowd-related studies.


#### *2.3. Related Works on Feature Reduction, Context-Aware Framework (CAF), and Activity Recognition (AR)*

Feature reduction methods are important approaches that help avoid the curse of dimensionality [30], that is, an excessive number of dimensions in a feature vector. They target a reduction in the number of previously used features on a mobile device in AR. The effect of high dimensionality on classification accuracy has been an important domain of research in HAR [31,32]. Feature reduction can facilitate the early detection of an emergency in an unforeseen circumstance [29]. Thus, the risk associated with individual activity recognition (IAR) in a crowd condition can be minimized by the reduction of FNR. The issue of high false alarm with FNR was not addressed in BCF. The solution proposed in our previous work as Phase 2 was reported in [4].

The review of AR works on individuals and crowds shows the potential of feature dimensionality reduction for accurate and efficient crowd condition prediction; however, a feature reduction-based feature selection method has never been applied for this purpose. The work of [33] on early recognition supports this objective; it predicts a one-shot learning-based pattern transition for early detection. A great benefit of the approach proposed by [34] is that it utilized a smaller number of features for the prediction of ovarian cancer survival and required very limited computational effort. The smart selection of fewer relevant features, compared with the number of features used with FEM in BCF, greatly diminishes the computational effort and reduces the false negative alarm. Moreover, an unclear definition of CDT has been noted by [7,9] as a major challenge in BCF. An inappropriate threshold of high density used for individual behavior estimation by [6], and a lack of feedback to victims resulting in a high false alarm rate in an emergency, led to an unreliable prediction of crowd conditions, such as crowd abnormality behavior. Chang et al. [35] introduced a context-aware mobile platform for an intellectual disaster alerts system (IDAS); it focused on how environmental changes can result in accidents and disasters. According to the authors, a quick and accurate alert delivered to victims is essential in a disaster situation. However, their work focuses on addressing disaster issues, rather than crowd monitoring for safety.

Context-aware computing, an application concept that can sense the physical environment and react accordingly, was proposed by [36]. It is aimed at facilitating the quick and efficient development of a framework that combines context-aware services and machine learning [36]. The study led to the development of the context-aware and pattern oriented machine-learning framework (CAPOMF). It focused on how commuters can avoid potholes to save vehicle repair costs. In previous context-awareness research, machine learning is rarely used [36–38] for the realization of a context-aware framework. The studies of [6,39] also emphasized that context-aware applications and their services remain open research issues. Prior to [6], no context-aware research with activity recognition had been applied or proposed for crowd abnormality mitigation in the literature. The outstanding problems that constitute a challenge in context-aware research regarding their effects on crowd disaster mitigation are itemized as follows:


As of June 2018, context-aware computing was worth US\$120 billion [40]. Its research finds application in many domains, with only a few in disaster management. The extant literature highlights three methods used in context-aware frameworks: (i) scenario-based evaluation with a hypothetical example using a developed application, (ii) comparative analysis using a side-by-side comparison of components [41], and (iii) metric evaluation with accuracy, precision, recall, and f-score in an experiment on related activities [35]. Table 3 presents related works and highlights gaps in previous research.


**Table 3.** Related context-aware frameworks and activity recognition methods with the research gaps for individual and crowd condition prediction.

<sup>3</sup> Note: ARAC = Activity recognition accuracy, AR = Activity recognition, FSM = Feature selection method adopted to reduce features and CCP = Crowd condition prediction. CFS = Correlation-based feature selection, CHI = Chi-square feature selection and MRMR = Minimum redundancy–maximum relevance feature selection.

#### **3. Materials and Methods**

This section presents the methodology employed in this study. It describes the development of the context-aware activity recognition application used for data collection, the data validation outcome, the implementation of the adopted and modified algorithms, and the approaches used to analyze the results.

We developed an Android application called Context Activity Data Collector (CADC), written in Java, as a client, and the crowd controller station (CCS) as a server to store the CADC data in real time for offline data analysis. The CADC runs on an Android 3.0.2 version of a Samsung Galaxy SM-G530H. Figure 1 shows the CADC data collection interface, together with an example of the sensor signals collected at a Malaysian public institution between March and April 2015. The eight (8) classes considered in the experiment are selected from multiple possible conditions of an individual in the considered scenario. The scenarios considered are: climb down (V1), climb up (V2), fall (V3), jogging (V4), peak shake while standing (V5), standing (V6), still (V7), and walking (V8).


**Figure 1.** Sensor signals dataset collection interface used by volunteers during the experiment.

Several instances were captured for each scenario performed by volunteers (node S), yielding 22,350 class instances. Here, S refers to the volunteers who used the interface in Figure 1 during the experiment. The class instances obtained from S during the experiment include V1: 1975, V2: 2410, V3: 3159, V4: 2952, V5: 2937, V6: 2757, V7: 3230, and V8: 3470 for dataset D1. The validated results of other sensor signals (captured as six additional classes, V12 to V18) for D1, which include a digital compass, longitude, latitude, and timestamps used for individual behavior estimation, were reported for dataset D1 based on IAR. Table 4 summarizes the D1 dataset used for this research.



#### *3.1. Methodology for the Proposed Enhanced Approach*

The methodology in this article focuses on Phase 4 of Figure 2, while Phases 1–3 were presented in previous work [4,8]. They are prerequisites for Phase 4, which addresses the objectives highlighted in Section 1, and are therefore included in Figure 2 to keep the flow of this article clear.

**Figure 2.** The process flow of the methodology used for the enhanced context-aware framework approach (EHCAF).

High accuracy and a reduced false negative alarm rate are highly desirable and central to crowd condition prediction; however, the approach cannot be adopted without adequate changes to the algorithm, using the same data collection and activity recognition method shown in Figure 1 and Table 4. This was done by adopting a suitable threshold, called the crowd density threshold (CDT) (Figure 2) in Equation (4), while modifying the algorithms presented in BCF with a clear threshold definition of crowd density estimation to accurately detect individuals per m<sup>2</sup> in the crowd scenarios tested. The crowd density in this study is defined as >2 persons/m<sup>2</sup>. In order to achieve the stated objectives, the following tasks were carried out as summarized in Figure 2:

Step 1: Design: experimental; data type: sensor-based real-time IAR; sample: 20 volunteers; provided: 22,350 instances for the D1 dataset.

Step 2: Procedure: development of the CADC application (Figure 1), with the algorithm implemented based on CDT using Java, installed on volunteers' phones; sensors (digital compass, longitude and latitude as Global Positioning System (GPS) data for location, etc.), as presented in Table 4.

Step 3: Functioning of CADC: internet-enabled with hotspots; 50 to 100 m2 coverage.

Step 4: Server setup: crowd controller station (CCS); volunteers (node S) launch the CADC app by pressing the start button; select an activity scenario; perform each for 10 min while maintaining a range of 1 m<sup>2</sup> to each other, which was done collectively until all activities were completed; the CCS stores the collected sensor signal data in text format; each volunteer stops the app as specified to end the data collection; the duration was 5 h for each round of data collection. The guideline used in previous AR datasets is employed [11,13,20]. The D1 collection became necessary because the sensors required were not available in the public domain [11,13,20] at the time of this study.

Step 5: Validation: The validation of raw sensor signals [44] was performed using an analysis of variance (ANOVA). This serves as a significance test for the dataset used in this study.

Step 6: Data analysis: Missing data was handled by employing moving average; noise removal from D1 was achieved using segmentation with 50% overlapping based on 256 sliding windows; for detail, see [4].

Step 7: Improved SBTFD features with newly suggested 39 features based on FEM (total 54 features) yields 7.1% accuracy improvement; this was implemented in Python; and reported in [4].

Step 8: Feature reduction using a feature selection method newly introduced to this domain produced seven (7) effective features; this again yields 99.1% accuracy, which is also an enhancement in AR and crowd monitoring studies; details are provided in [8].
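The preprocessing in Step 6 can be sketched as follows. This is a simplified illustration and not the pipeline reported in [4]: it assumes that missing samples are marked as `None`, that the moving average is taken over the preceding values, and that "256" refers to a window length of 256 samples. The function names and the `span` parameter are hypothetical.

```python
from typing import List, Optional, Sequence


def fill_missing_moving_average(samples: Sequence[Optional[float]],
                                span: int = 3) -> List[float]:
    """Replace each missing sample (None) with the mean of up to `span` preceding values."""
    filled: List[float] = []
    for value in samples:
        if value is None:
            recent = filled[-span:]
            # With no history yet, fall back to 0.0 (an arbitrary choice for this sketch).
            value = sum(recent) / len(recent) if recent else 0.0
        filled.append(value)
    return filled


def sliding_windows(samples: Sequence[float], size: int = 256,
                    overlap: float = 0.5) -> List[List[float]]:
    """Split a signal into fixed-size windows that share `overlap` of their samples."""
    step = max(1, int(size * (1.0 - overlap)))
    return [list(samples[i:i + size])
            for i in range(0, len(samples) - size + 1, step)]


# Tiny hypothetical signal; a window of 4 samples stands in for the 256 used in the study.
raw = [1.0, 2.0, None, 4.0, 5.0, None, 7.0, 8.0]
clean = fill_missing_moving_average(raw, span=2)
windows = sliding_windows(clean, size=4, overlap=0.5)
print(clean)
print(windows)
```

With 50% overlap, the second half of each window is repeated as the first half of the next one, so an event that straddles a window boundary still appears whole in one of the windows.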

This section describes the procedure for enhanced IBE. Following the AR in steps 7 and 8, it is necessary to obtain other features that can identify and estimate the behavior of an individual [6]. The procedure begins with the implementation of a modified algorithm for the identification and grouping of individual participants (smartphones) as node S by the crowd controller station (CCS) using GPS as sensor data [5]. This is followed by the implementation of adopted algorithms, which determine abnormal movement behavior among individuals using flow velocity Vsi estimation and flow direction Dsi identification [44]. The Vsi and Dsi were computed using the sensor fusion method based on the Kalman filter, as reported in [44].

The next stage takes the Vsi and Dsi and combines them with the seven best (reduced) features previously obtained in step 8 from each class of activity scenario, e.g., V2; for details, see [33]. Thereafter, the combined Vsi, Dsi, and reduced features were used as input to the modified pairwise behavior estimation algorithm (PBEA). The PBEA was implemented to identify and determine the behavior of the individual in a crowd with a disparity value computed using the disparity matrix. The final stage employs the IBE using the reduced features based on CDT to evaluate the individual crowd density determination (CDD) per m<sup>2</sup>. The CDD helps to appraise the inflow and outflow of moving individuals to ascertain crowd turbulence. This was realized using the CCS, which triggers a context-aware alert to predict the abnormal behavior of an individual and the crowd condition. It also determines the participation of the individual in a crowd scenario based on disparity values to develop the proposed approach, an enhanced context-aware framework (EHCAF), which is an improvement on the BCF.
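The role of the crowd density threshold can be illustrated with a short sketch. It is not the CDD algorithm of this study: it only applies the two density levels stated in this article, namely more than 2 persons/m<sup>2</sup> as the crowd density definition and five or more persons/m<sup>2</sup> as extremely high density [9]. The function names and the labels returned are hypothetical.

```python
def crowd_density(person_count: int, area_m2: float) -> float:
    """Persons per square meter for a monitored sub-area."""
    if area_m2 <= 0:
        raise ValueError("area must be positive")
    return person_count / area_m2


def classify_density(density: float,
                     crowd_threshold: float = 2.0,
                     extreme_threshold: float = 5.0) -> str:
    """Map a density value to a label; the label names are illustrative only."""
    if density >= extreme_threshold:
        return "extremely high"   # five or more persons per m^2
    if density > crowd_threshold:
        return "crowded"          # more than 2 persons per m^2
    return "normal"


# Hypothetical readings for three sub-areas: (detected nodes, area in m^2).
readings = [(2, 1.0), (3, 1.0), (12, 2.0)]
labels = [classify_density(crowd_density(n, a)) for n, a in readings]
print(labels)
```

In a deployment along the lines described above, a label other than "normal" would be the kind of condition for which the CCS raises a context-aware alert.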

The following sections present details of the steps in the research methodology after the IAR using the reduced features in Phase 3 to achieve an IAR flow pattern. The flow pattern differentiates the behavior of one node from the other nodes in the experiment [5]. In the following section, a brief description of these sensors' validation is presented.

#### *3.2. D1 Validation of Sensor Signals apart from Accelerometer Data*

The result of the accelerometer signals of D1 was earlier reported [4]. D1 validation was carried out to validate the processed raw sensor signals for other sensors used for IBEehcaf in this article. The validation task was carried out to ascertain the quality of the D1 dataset displayed in Figure 1. We have applied the statistical validation technique (SVT) commonly used in the literature [3,22] based on the parametric nature of the dataset. For the validation, two hypotheses were formulated and tested using IBM SPSS 22.0. The hypotheses are as follows:

(1) Null hypothesis H<sub>0</sub>: μ<sub>1</sub> = μ<sub>2</sub> = μ<sub>3</sub> = ... = μ<sub>11</sub>; there is no significant difference between the means of the variables V12, V13, ... , V18 used for the analysis of D1 for prediction in this study.

(2) Alternative hypothesis H<sub>A</sub>: μ<sub>1</sub> ≠ μ<sub>2</sub> ≠ μ<sub>3</sub> ≠ ... ; there is a significant difference in at least one of the means of the variables V12, V13, ... , V18 used for the analysis of D1 for prediction in this study.
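A test of the null hypothesis against the alternative rests on the one-way ANOVA F statistic, the ratio of between-group to within-group variance. The study used IBM SPSS for this; the sketch below is only a plain-Python illustration of the statistic itself, with a hypothetical function name and made-up sample groups, and it stops short of computing a p-value.

```python
from typing import Sequence, Tuple


def one_way_anova_f(groups: Sequence[Sequence[float]]) -> Tuple[float, int, int]:
    """Return (F, df_between, df_within) for a one-way ANOVA over the given groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up readings for three variables (not data from D1).
sample_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_value, df_b, df_w = one_way_anova_f(sample_groups)
print(f_value, df_b, df_w)
```

A large F value relative to the F distribution with (df_between, df_within) degrees of freedom is evidence against the null hypothesis that all means are equal.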

#### 3.2.1. Reduced Features from Improved Statistical-Based Time-Frequency Domain (SBTFD)

This section discusses the reduced features from SBTFD employed for an enhanced context-aware framework for individual activity recognition (IARehcaf) (Phase 2 of Figure 2), based on the improved SBTFD features reported in our previous work [4]. In this article, we focus on the individual behavior estimation enhancement (*IBEehcaf*) while utilizing the reduced features (Phase 3 of Figure 2) for crowd condition prediction using the feature selection method (*CCPFSM*) to enhance the proposed approach shown in Equation (5) in Phase 4 of Figure 2 using Equation (3). The *EHCAF* is discussed as follows:

$$EHCAF = IAR_{ehcaf} + IBE_{ehcaf} + CCPFSM \tag{3}$$

where *EHCAF* comprises the improved *SBTFD* and reduced features from the FSM in our previous work [8]. *IBEehcaf* represents the newly reduced features achieved using the employed FSM combined with Vsi and Dsi performed for IBE implementation with the modified and adopted algorithms (1) and (2). These serve as input to the modified Algorithm (3) in Figure 2 and are employed in this article. Note that the details of the improved SBTFD features and dimensionality reduction based on FSM (Phases 1–3) are outside the scope of this article.

*CCPFSM* denotes the prediction achieved by the reduced features and other parameters known as flow velocity Vsi and flow direction Dsi in Equation (2) (Phase 4), which were used to perform a task for the prediction of crowd condition in Equation (3). It employs an enhanced context-aware framework through the use of context-sensing from node S and crowd density determination (CDD) in Phase 4 for the inflow and outflow movement of individual behavior to evaluate the possible causes of abnormality in a crowd using the proposed approach as a solution. This helps to realize the development of *EHCAF* shown in Equation (3).

#### 3.2.2. Modified Algorithm for Region Identification and Grouping of Nodes S

Crowd behavior monitoring used sensor signals to identify each participant carrying a smartphone as a node S, followed by a grouping of the nodes (see Algorithm 1 in Appendix A). It relied on the individual sensor analyses in Step 4 (Section 3.1), with context recognition performed on the activity recognition of an individual in order to estimate participants' behavior. The mapping between the program sensors and the activities considered was used as input to the implementation of Algorithm 1 (Appendix A). In Algorithm 1, S is the participant node used as input in Step 4 (Section 3.1).

The crowd formation distribution is divided into sets of sub-regions using the crowd controller station (CCS). When a new participant node S is detected, the context-aware application notifies the CCS, which automatically adds the new node to the sub-region of its present location (line 19 of Algorithm 1 in Appendix A). Region identification uses the participant's smartphone as node S (line 1) together with the GPS data (lines 2–3) with respect to time (line 4 of Algorithm 1 in Appendix A), using the data displayed in Figure 1.

The grouping of participants into the sub-region list SA1, SA2, ..., SAn is achieved using line 20 of Algorithm 1 in Appendix A. It handles the movement of a participant from one place to another for the scenario used in the experiment. Node S was equipped with the context-aware mobile application prototype during the experiment; the grouping is updated whenever the distance moved by the participant is greater than a threshold value (line 18 of Algorithm 1 in Appendix A), as adopted in the work of [6]. The threshold value is about 20 m from the hotspot for effective monitoring via communication within the coverage area. Once a node is outside the hotspot range, it is excluded. The algorithm also determines the neighbouring nodes in a sub-area by estimating the distance between two participant nodes and other nodes monitored by the CCS. Based on the work of [6], if the distance between nodes is less than 10 m, the new participant node is added to the same area using line 19 of Algorithm 1 in Appendix A. The distance of 10 m was selected for the hotspot to allow for ease of assessment in case of an emergency. The distance estimation is based on Vincenty's formula, which is adopted for computing distances between latitude and longitude coordinate points [5,44].
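The neighbour test described above can be sketched in Python. The study computes distances with Vincenty's formula; the sketch below swaps in the simpler haversine (great-circle) formula on a spherical Earth, an assumption made for brevity that is slightly less accurate than Vincenty's ellipsoidal method. The function names, the mean Earth radius, and any coordinates passed in are illustrative and do not come from the original study; only the 10 m grouping distance is taken from the text.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius (assumption: spherical Earth)
SAME_AREA_DISTANCE_M = 10.0   # grouping distance quoted in the text


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def same_sub_area(node_a, node_b, limit_m=SAME_AREA_DISTANCE_M):
    """True when two nodes, given as (lat, lon), are closer than the limit."""
    return haversine_m(*node_a, *node_b) < limit_m
```

For two hypothetical nodes about 5.6 m apart in latitude, `same_sub_area((1.5586, 103.6380), (1.55865, 103.6380))` returns `True`, so the second node would join the first node's sub-area.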

#### 3.2.3. Flow Velocity Estimation and Flow Direction Identification Based on Activity Recognition

The implementation of this algorithm takes the contexts from sensor signals (specifically latitude and longitude from GPS data, accelerometer x, accelerometer y, accelerometer z, and timestamp) as input to Equation (3) of Figure 1. The input data were used to estimate the flow velocity and to determine the flow direction of individual movement behavior. The outputs of the algorithm are the flow velocity (*Vsi*) and the flow direction (*Dsi*) [44]. *Vsi* and *Dsi* are informative features used to obtain hidden context information from individual behaviors in a crowd scenario and to determine the flow patterns of individual movement.

#### 3.2.4. Implementation of Modified PBEA Algorithm

The disparity matrix holds the differences between a node and every other node used in Algorithm 2 (Appendix B), for example between nodes u and v, or *si* and *sj*. The diagonal elements of the disparity matrix are usually defined as zero, which implies that the disparity between an element and itself is zero [44,45].

Given two R-dimensional points $x_i = (x_i^1, x_i^2, \dots, x_i^R)$ and $x_j = (x_j^1, x_j^2, \dots, x_j^R)$, the Euclidean distance (EUD) $d_{i,j}$ as observed in [45] is expressed in Equation (4):

$$d_{i,j} = \sqrt{\left(x_i^1 - x_j^1\right)^2 + \left(x_i^2 - x_j^2\right)^2 + \dots + \left(x_i^R - x_j^R\right)^2} \tag{4}$$

where $d_{i,j}$ denotes the Euclidean distance in Equation (4).
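As a minimal sketch of how Equation (4) yields a disparity matrix with a zero diagonal, the Python below computes all pairwise Euclidean distances for a list of feature vectors. The function names and the three sample vectors are hypothetical; in the study, each vector would hold the reduced SBTFD features together with Vsi and Dsi for one node.

```python
import math


def euclidean(xi, xj):
    """Euclidean distance between two equal-length feature vectors (Equation (4))."""
    if len(xi) != len(xj):
        raise ValueError("feature vectors must have the same dimension R")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))


def disparity_matrix(nodes):
    """Symmetric n x n matrix of pairwise distances with a zero diagonal."""
    n = len(nodes)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = euclidean(nodes[i], nodes[j])
    return matrix


# Three hypothetical nodes with R = 2 features each.
nodes = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = disparity_matrix(nodes)
```

Only the upper triangle is computed and then mirrored, since the distance from node *i* to node *j* equals the distance from *j* to *i*.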

The computation was performed to calculate the distance between nodes for the input data from S1 to S20, in order to determine the disparity value for individual estimation in each region where node S is located. The variables $x_i^1$ and $x_j^1$ correspond to the features and their instances in pairs; based on SBTFD, the reduced feature set (fft\_corxz, y\_fft\_mean, z\_fft\_mean, z\_fft\_min, y\_fft\_min, z\_fft\_std, y\_fft\_std) is combined with the Vsi and Dsi contexts from the sensor signals of D1. These serve as input to the PBEA. Euclidean distance (EUD) is commonly used across different research domains to compute the distance between two points with reliable results; hence, it was chosen to generate the distance from each participant to every other participant based on nodes [45,46]. In addition, the investigation revealed that EUD is suitable for the modified PBEA adopted from the BCF implemented in this research.

The algorithm caters for n nodes, but the location used for the experiment does not vary across the activities performed, owing to the communication range stated in Algorithm 1 (Appendix A). As a result, the clustering results obtained were similar beyond three sub-areas, since the location considered is uniform for the experiment. This was noticed from the GPS longitude and latitude data obtained in the experiment with D1. A variation was nevertheless observed between nodes, each of which is a participant's monitoring device represented by S for identification. The clustering of nodes was performed using Equation (5):

$$EUD\left(d_{i,j}\right) = \sum_{i=1}^{n} \sum_{p \in k_i} dist(p, k_i)^2 \tag{5}$$

In Equation (5), EUD represents the sum of squared errors (SSE). The SSE is determined using the participant node that is nearest to each pair of participant nodes, which helps to identify S in the monitoring group and in subsequent groups. The advantages of K-means, which was adopted and used in Algorithm 1 in Appendix A, are discussed in [44,46]. Equation (6) was applied to perform the *IBEehcaf* in Equation (3) (Phase 4).
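The sum of squared errors in Equation (5) can be illustrated with a short Python sketch. It assumes the usual K-means reading of the equation, in which $k_i$ is the centroid of cluster *i* and *p* ranges over the points assigned to that cluster; this reading, the function name, and the input layout are assumptions rather than details confirmed by the article.

```python
def sse(clusters):
    """Sum of squared distances from each point to its cluster centroid.

    clusters: list of clusters, each a non-empty list of equal-length points.
    """
    total = 0.0
    for points in clusters:
        dim = len(points[0])
        # Centroid = coordinate-wise mean of the points in this cluster.
        centroid = [sum(p[d] for p in points) / len(points) for d in range(dim)]
        total += sum(
            sum((p[d] - centroid[d]) ** 2 for d in range(dim)) for p in points
        )
    return total
```

A lower value indicates tighter clusters; a cluster containing a single point contributes zero.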

For the *IBEehcaf* task, let δ be a matrix of pairwise disparities between *n* attributes, as in Equation (6) [26]:

$$
\delta_{i,j} = \begin{pmatrix}
\delta_{1,1} & \delta_{1,2} & \delta_{1,3} & \delta_{1,4} & \dots & \delta_{1,n} \\
\delta_{2,1} & \delta_{2,2} & \delta_{2,3} & \delta_{2,4} & \dots & \delta_{2,n} \\
\delta_{3,1} & \delta_{3,2} & \delta_{3,3} & \delta_{3,4} & \dots & \delta_{3,n} \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
\delta_{n,1} & \delta_{n,2} & \delta_{n,3} & \delta_{n,4} & \dots & \delta_{n,n}
\end{pmatrix} \tag{6}
$$

where $\delta_{i,j}$ represents the disparity between the aforementioned features *i* and *j*. Also, let $f(\delta_{i,j})$ be a monotonically increasing function that transforms differences into disparities using Equation (6).

The equation produces an R-dimensional (where R ≤ n) configuration of points $(x_1, x_2, \dots, x_i, \dots, x_j, \dots, x_n)$, where $x_i = (x_i^1, x_i^2, \dots, x_i^R)$ and $x_j = (x_j^1, x_j^2, \dots, x_j^R)$, for $1 \le i, j \le n$. The *EUD* between any two nodes S, $x_i$ and $x_j$, in this configuration approximately equals the disparity between features *i* and *j*, as expressed in Equation (7):

$$d_{i,j} \approx f(\delta_{i,j}) \tag{7}$$

The $d_{i,j}$ is defined by Equation (6). The measure has been applied to find the pairwise (Euclidean) distance between two cities with minimum possible distortion by [47], as reported in [46]. In this case, we represent the *n* nodes of the matrix D (N, A), where u = N and v = A for B(s), with the positive integers 1, 2, 3, ..., n. Then, a distance matrix, B(s+1), is set up with elements expressed using Equation (8) [46]:

$$d_0(i,j) = \begin{cases} l(i,j) & \text{if participant (node) } (i,j) \text{ exists} \\ d_{i,j} = 0 & \text{if } i = j \\ d_{i,j} > 0 & \text{if } i \neq j \end{cases} \tag{8}$$

The length, *d(i, j)*, of the path from node *i* to node *j* is given by element D (*u*, *v*) of the final matrix D(n) B(n), which makes it possible to trace back each of the node paths. An example of disparity matrix computation is given in Equation (9), as employed for the participant estimation algorithm noted in [5,24]:

$$D_{(u,v;T)} = g(Corr(f(B_{si}, T), f(B_{si+1}, T))) \tag{9}$$

where D is the disparity based on function *f*, and *g* provides the mapping to a disparity value. The disparity value is computed from the input data, specifically fft\_corxz, y\_fft\_mean, z\_fft\_mean, z\_fft\_min, y\_fft\_min, z\_fft\_std, y\_fft\_std, *Vsi* and *Dsi*. *Corr* denotes the correlation performed on a matrix containing the input data in pairs; *Bsi* is an individual participant node; *u* is the number of participant nodes along the columns of the matrix; *v* is the number of participant nodes along the rows of the matrix; and *T* denotes time. The functions *f*, *Corr*, and *g* depend on the specific crowd that is considered. Typically, *f* is a pre-processing function, *Corr* computes a measure of the differences between the input data for every (*i, j*) pair of nodes to determine an individual in a crowd scenario, and *g* maps the result to a disparity value. The disparity value is defined to be zero if the two participants are likely to be taking part in the same crowd. Conversely, the disparity tends to one or more if the node *s* is not likely to be taking part in the same crowd. The outcome is a disparity matrix $D_T = [D(u, v; T)]_{m \times n}$ at time *T*. The reduced feature set and the other parameters derived as features previously reported in [33], namely *Vsi* and *Dsi* [44], are fed into the PBEA, as shown in Equation (6) of Phase 4, as input to generate the output for individual and crowd condition prediction illustrated in the next section.

#### 3.2.5. Crowd Density Threshold Condition

This study adopted the conditions that trigger abnormality to set a threshold for crowd density determination within the coverage area, as established in previous studies [48] and employed in other studies [3,4,6,49]. The threshold adopted in this study was first suggested by [6], who defined a crowd as made up of three or more persons. This study employs two persons per m<sup>2</sup> for the experiment based on [6]. The monitoring of participants occurs within the coverage area and distance range of the hotspot, and can be assessed using the participant's smartphone, which is referred to as node S. It is generally acknowledged that five persons/m<sup>2</sup> is an extremely high density, four persons/m<sup>2</sup> is high density, three persons/m<sup>2</sup> is medium density, two persons/m<sup>2</sup> is low density, while one or no persons/m<sup>2</sup> is considered very low density [7]. In addition, six or more persons/m<sup>2</sup> is considered extremely dangerous, with the potential to cause abnormality [7]. Crowd density determination (CDD) was employed to compute the density of the monitored crowd of moving nodes based on the crowd density threshold (CDT) condition shown in Equations (10)–(12) of Phase 4. Node S is recognized by the crowd controller station (CCS) based on node count using Equations (10) and (11) [50].

$$\text{Density} = \text{LN} \times \text{area in } m^2 \times 5 \tag{10}$$

$$\text{CDD} = 1 + 4 \ast \left[ \frac{Density - \lambda}{\psi - \lambda} \right] \tag{11}$$

where *LN* represents the number of participants monitored, λ denotes the minimum density level, and ψ is the maximum density observed in the experiment at a particular time. It has also been proposed that the maximum capacity be calculated using the number of participants < area in m<sup>2</sup> × 10, where 10 is regarded as extreme crowd density, as noted in the work of [50]. More than two participants per m<sup>2</sup> exceeds the threshold. To explain the disparity matrix (low and high values) employed by [5], which is used to characterize the type of crowd observed in the analysis of the results for this article, Equation (12) shows the crowd density threshold (CDT) condition used for the CDD evaluation.

$$\begin{cases}
\text{1. If CDT for } d_{i,j} \text{ per m}^2 \le 2 \text{ then low crowd density occurs} \\
\text{2. else if CDT for } d_{i,j} \text{ per m}^2 = 3 \text{ then medium crowd density occurs} \\
\text{3. else if CDT for } d_{i,j} \text{ per m}^2 = 4 \text{ then high crowd density occurs} \\
\text{4. else extremely high crowd density occurs}
\end{cases} \tag{12}$$
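Equations (11) and (12) translate directly into two small Python functions, sketched below. The sketch reads the square brackets in Equation (11) as plain grouping (they may instead denote rounding in the original), and it assumes that the count passed to the threshold rule is a whole number of persons per m<sup>2</sup>. The function and parameter names are hypothetical.

```python
def crowd_density_level(density, min_density, max_density):
    """Equation (11): maps the minimum density to 1 and the maximum to 5."""
    if max_density == min_density:
        raise ValueError("max_density must differ from min_density")
    return 1 + 4 * (density - min_density) / (max_density - min_density)


def crowd_condition(persons_per_m2):
    """Equation (12): label an integer count of persons per square metre."""
    if persons_per_m2 <= 2:
        return "low"
    if persons_per_m2 == 3:
        return "medium"
    if persons_per_m2 == 4:
        return "high"
    return "extremely high"
```

For example, `crowd_condition(3)` returns `"medium"`, and any count of five or more falls through to `"extremely high"`.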

#### **4. Experimental Results**

This section presents results for the stated objectives, beginning with the validation of the raw sensor data. The descriptive analysis for the validation is summarized over all classes (N = 22,350), which consist of V1 to V8. V12 had a mean of 4.735, a standard deviation of 2.519, and a standard error of 0.2216. V13 had a mean of 47.762, a standard deviation of 47.501, and a standard error of 0.4179. V15 had a mean of 21.629, a standard deviation of 82.162, and a standard error of 0.7228. Meanwhile, V18 had a mean of 48.891, a standard deviation of 106.286, and a standard error of 2.255. Inferential statistics for the ANOVA test conducted at the 0.05 significance level show V12, V13, V15, and V18 having F-values of 46644.20, 4653.71, 196.41, and 967.01, respectively. The reported *p*-value of 0.000 is statistically significant. Hence, we reject H0, accept HA, and conclude that there is a significant difference in at least one of the means of the variables V12, V13, ... , V18 used for the analysis of D1. This conclusion implies that the D1 dataset is valid, consistent, and adequate for the analysis conducted in this study.
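To make the F-values above easier to interpret, the following Python sketch shows how a one-way ANOVA F statistic is formed as the ratio of between-group to within-group mean squares. It is a generic illustration on toy numbers, not a reproduction of the SPSS analysis of D1, and the function name is hypothetical.

```python
def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA for a list of numeric samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Toy data, not taken from D1.
f_value = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

A large F value means the group means differ by much more than the spread within each group would explain, which is the basis for rejecting H0.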

#### *4.1. Result on the Classification of Raw Dataset D1*

The results of classification after validation are as follows. In Table 5, out of the 22,350 instances (last row), 10,692 (in bold on the diagonal of the confusion matrix) were correctly predicted, while the remaining 11,658 instances were wrongly predicted. Figure 3 summarizes the classification results for the baseline, the raw dataset D1, the improved SBTFD with 54 features, and the seven reduced SBTFD features newly introduced to extend the BCF into the enhanced approach (EHCAF) presented in Equation (3). The best ARAC, FNR, and RMSE are achieved with the EHCAF using seven features: 99.1%, 2.8%, and 7.9%, respectively, against 92.0%, 31.3%, and 21.6%, respectively, for the baseline.

**Table 5.** Confusion matrix from the classification result of individual activity recognition (IAR) using the sensor signals of the D1 raw dataset.


**Figure 3.** Comparison of BCF—baseline classification results, raw dataset—D1, improved statistical-based time-frequency domain (SBTFD), and reduced features for the enhanced approach.

#### *4.2. Results of Region Identification and Grouping of Nodes Using Clusters*

Figure 4 shows a larger number of clustered nodes in subarea SA1, indicating that more participant nodes gathered in SA1 than in subareas SA2 and SA3 in the experiment. Thus, SA1 is more prone to risk than SA2 and SA3.

**Figure 4.** Results of clusters for identifying and grouping participant into subareas with GPS data.

#### *4.3. Results on the Algorithm Implemented for Flow Velocity and Flow Direction*

For details of the algorithm implemented for flow velocity and flow direction, please refer to [44]. This article focuses on the individual behavior estimation method combined with reduced features, which were not considered in the BCF.

#### *4.4. Modified PBEA Using Reduced Features and Enhanced Individual Behavior Estimation*

The output serves as input to the modified PBEA, as shown in Figure 2, to produce an enhanced context-aware framework for individual and crowd conditions. The analysis is based on pairs of nodes (for example, 1 and 2, 1 and 3, 1 and 4, and so on up to 20) for individual behavior estimation. A disparity matrix was computed for the estimation of an individual based on the 20 nodes, S1 to S20, used as input in the experiment. The experimental result revealed the interaction of participants (nodes) and their behavioral patterns in a crowd scenario based on the CDT employed and the crowd density estimate. It shows groups of two, three, three, and 12 nodes, corresponding to different numbers of individuals per m<sup>2</sup> (Appendix C).

#### 4.4.1. Crowd Condition Prediction Using Individual Behaviour Estimation

For crowd estimation, it is first necessary to estimate individual activity recognition and behavior. This was addressed in our earlier works [4,8]. The crowd condition prediction using seven reduced features with Vsi and Dsi is newly introduced. It achieved a higher accuracy of 99.1% against 92.0%. In addition, the false negative rate fell by 28.5 percentage points, from 31.3% to 2.8%, which is an improvement over the BCF [5] and yields the EHCAF (see Figure A2 of Appendix D). The individual behavior estimation with the suggested CDT and the crowd density determination computation for crowd count serve as a means to extend the BCF [5]. This could help identify danger early by using context sensing through a smartphone with a context-awareness alert, thus minimizing the level of abnormal behavior in a crowd-prone area.

#### 4.4.2. Implication of Low False Negative Alarm on the Enhanced Approach Based on PBEA Experiment

Figure 5 shows that the proposed approach, which uses reduced features and enhanced IBE for crowd condition prediction, has a low false negative rate (FNR): an FNR of 2.8% with an ARAC of 99.1%, compared with an FNR of 31.3% with an ARAC of 92% in the baseline. The results suggest that the higher the FNR of AR, the higher the number of participants that may be at risk. Figure 5 also shows the comparative risk situation for EHCAF (blue) and BCF (red): one participant (node) in 20 and 28 participants in 1000 for the EHCAF, against six in 20 and 313 in 1000 for the BCF. Each value was computed as FNR/100 × number of participants (NOPs), for example 2.8/100 × NOPs for the EHCAF; the size of the crowd considered will vary in a real-life scenario when the proposed approach is applied.

**Figure 5.** Effects of the false negative rate on the proposed approach when applying to human behavior monitoring in real life in a crowd condition.
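The at-risk figures quoted for Figure 5 follow from the stated formula, FNR/100 × NOPs. The Python sketch below reproduces them; rounding to the nearest whole participant is an assumption that happens to match all four quoted values (0.56 becomes 1, and 6.26 becomes 6), and the function name is hypothetical.

```python
def participants_at_risk(fnr_percent, participants):
    """Expected number of missed participants, rounded to the nearest node."""
    return round(fnr_percent / 100 * participants)


for label, fnr in (("EHCAF", 2.8), ("BCF", 31.3)):
    for crowd in (20, 1000):
        print(label, crowd, participants_at_risk(fnr, crowd))
```

The same function can be applied to other crowd sizes to see how the gap between the two frameworks grows with the number of monitored nodes.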

This section presents the details of benchmarking with related works in the literature [5,51,52]. To confirm that the higher results achieved by the proposed approach are significantly better on the evaluation measurements used, statistical t-tests were carried out using SPSS version 22.0 on dataset D1 and the BCF. The results for the seven reduced features based on FSM from method A, with *p*-values of 0.003 for the improved SBTFD and 0.021 against the BCF, indicate *p* < 0.05, implying that the performance of the proposed approach is statistically significant at the 0.05 alpha level.

This supports the objective presented in this article. Based on the analysis of results, the enhanced context-aware framework (EHCAF) depicted in Figure A2 (Appendix D) is an improvement on the basic context-aware framework (BCF) benchmark, as shown in Table 6. Table 6 also shows the components of EHCAF, together with the justification for the improved parameters, to establish the validity of our findings in the entire study.


**Table 6.** Comparison between BCF [6] and proposed approach (EHCAF).

#### **5. Discussion of Results**

The result achieved an accuracy improvement of 7.1 percentage points and a false negative rate reduction of 28.5 percentage points, with an error reduction of 13.7 percentage points in terms of root mean square error. This suggests improved safety for human lives in crowd-prone situations in real-life applications compared with the BCF of [5], as analysed in Table 7. In Figure 4, the susceptible area where crowd abnormality is likely to occur is sub-area *SA*1; this is evident from the plot, as more clustered nodes were observed in that area, indicating more participants interacting at very close range to one another, as shown in Figure A1 (Appendix C).

Based on the flow velocity Vsi and flow direction Dsi from the accelerometer sensor signals analyzed, the V3 fall scenario revealed that only 778 instances were correctly recognized as TP, out of the 3159 expected among the 22,350 instances. The rest consists of FP: 2383, FN: 2381, and TN: 16,808 in Table 5. The 2381 unrecognized instances of individual activity in Table 5, which account for the abnormal behavior of individuals, could be responsible for disaster manifestation. In a nutshell, the incorrect recognition demands effective features such as those suggested with the statistical-based time domain in [10–16] and the statistical-based frequency domain in [27,52], which informed the solution adopted in our previous work [4,33].
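The V3 counts quoted above can be cross-checked with a few lines of Python. The sketch derives the number of missed instances from the 3159 expected and the 778 recognized, confirms that the four cells sum to the dataset size, and computes per-class rates. These rates describe the V3 class on the raw dataset only and are distinct from the overall FNR values reported for the BCF and EHCAF; the variable names are illustrative.

```python
# Counts quoted in the text for the V3 fall class.
tp, fp, tn = 778, 2383, 16808
expected_positives = 3159
fn = expected_positives - tp  # V3 instances that were not recognized

total = tp + fp + fn + tn             # should equal the 22,350 instances
recall = tp / (tp + fn)               # share of V3 instances recognized
false_negative_rate = fn / (tp + fn)  # share of V3 instances missed
precision = tp / (tp + fp)            # share of V3 predictions that were right

print(total, round(recall, 3), round(false_negative_rate, 3), round(precision, 3))
```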


**Table 7.** Comparison of the proposed approach (EHCAF), activity recognition, and basic context-aware framework (BCF).

Note: SCI: Context-aware issues. ARAC: Activity recognition accuracy. FEM: Feature extraction method. FSM: Reduced features achieved using Feature Selection Method. CCP: Crowd Condition Prediction. RMSE: Root mean square error. N/A: Not applicable.

Figure A1 (Appendix C) showed four distinct groups, ordered from the highest to the lowest number of participants, with 12, three, three, and two nodes, respectively. It shows the interactions and the range at which those nodes interconnected for the scenario used as an example. Another plot from the data, using a different set of 20 nodes to compute a different set of disparity values based on the disparity matrix with the implemented Algorithm 3, gave a similar result. The 12 nodes suggested a dangerous situation in terms of crowd scenario according to [6,7]. This implies a high inflow and outflow, which could bring about high crowd turbulence, and thus requires immediate control if it happens in a crowded situation. The groups of three nodes in Figure A1 (Appendix C) signify a medium crowd density, and the two nodes indicate a very low crowd density, which is regarded as a normal situation. Therefore, it is found to be within the threshold suggested using Equation (11). Based on this, the pattern of 12 nodes in an undirected graph may result in crowd abnormality in real life. In such cases, with early recognition and sensitization using the proposed context-aware framework, such crowd density can easily be controlled before it reaches a critical state. Most importantly, as illustrated in Appendix D, with an FNR of 2.8%, one node in every 20 and 28 nodes in every 1000 participants (nodes) assumed to be monitored will be at risk using the proposed solution, versus six and 313 nodes, respectively, in the basic context-aware framework (BCF) [5]. The experimental results support activity recognition studies in the literature for both cross-validation and split [11,39]. They also identify RF and J48 as the best classifiers for the enhanced context-aware framework (EHCAF; Figure A2, Appendix D) for individual and crowd condition prediction, compared with the other classifiers investigated.

In view of our findings, the limitations of this work include an inability to develop a context-aware system that effectively implements the reduced features newly suggested in this research. Future work could investigate and integrate this methodology to realize safety for human lives through viable real-life application. In addition, it was not possible to handle the technicality, on the part of the monitoring device, of identifying non-functional sensors that could hinder the smooth acquisition of data for individual activity recognition and prediction.

#### **6. Conclusions**

This study has described the sensor signals of activity recognition that are adequate for the prediction of individual and crowd conditions. The approach demonstrated in this article fulfills its aim of complementing other research in human activity recognition and pervasive computing toward the mitigation of crowd abnormality in the 21st century. In this article, an enhanced context-aware framework (EHCAF) was developed. The potential of reduced features obtained with the feature selection method, based on the improved feature extraction method using SBTFD, was demonstrated. The relevant parameters were derived and applied to implement the modified algorithm for grouping participants using smartphones as nodes. Based on the findings, the enhanced approach for individual and crowd condition prediction is summarized as follows: the use of reduced features and enhanced individual behavior estimation (*IBEehcaf*) achieves high accuracy and a low FNR, and a clear definition of crowd density formulation for crowd condition prediction in a crowd scenario is presented. Above all, the FNR was 31.3% in the previous study and is 2.8% in this study; hence, an improvement of 28.5 percentage points is achieved based on the experiment. The limitations and gaps left by previous studies have also been addressed. The experimental results of this article show significant improvement over the previous studies of [5,11,24,39]. The methods applied to achieve the proposed enhanced approach support the objective of the article. In the future, the approach promises a dynamic solution that will explore the collection of a ground truth dataset for the purpose of mitigating disasters among individuals gathering in places such as Mecca and Medina during the pilgrimage in Saudi Arabia, by integrating cloud-based technology.

**Author Contributions:** Funding acquisition, A.S. and O.K.; Methodology, F.I.S.; Supervision, A.S., R.I. and O.K.; Validation, F.I.S.; Visualization, F.I.S.; Writing – original draft, F.I.S. This article was extracted from ongoing doctoral research at Universiti Teknologi Malaysia (UTM), 81310, Johor Bahru. The first author, F.I. Sadiq, has recently completed his PhD in Computer Science. This article reports one of the research contributions of his doctoral thesis and research work. The remaining authors are the supervisors of the candidate. The supervisors' comments and suggestions were valuable to the successful preparation of this manuscript.

**Funding:** This research was funded by Universiti Teknologi Malaysia (UTM) under Research University Grant Vot-20H04, Malaysia Research University Network (MRUN) Vot 4L876 and the Fundamental Research Grant Scheme (FRGS) Vot 5F073 supported under Ministry of Education Malaysia for the completion of the research. The "Smart Solutions in Ubiquitous Computing Environments", Grant Agency of Excellence 2019, projects No. 2204, University of Hradec Kralove, Faculty of Informatics and Management is acknowledged. The work is partially supported by the SPEV project, University of Hradec Kralove, FIM, Czech Republic (ID: 2102-2019).

**Acknowledgments:** The authors wish to thank Universiti Teknologi Malaysia (UTM) under Research University Grant Vot-20H04, Malaysia Research University Network (MRUN) Vot 4L876 and the Fundamental Research Grant Scheme (FRGS) Vot 5F073 supported under Ministry of Education Malaysia for the completion of the research. The work is partially supported by the SPEV project, University of Hradec Kralove, FIM, Czech Republic (ID: 2102-2019). We are also grateful for the support of Ph.D. student Sebastien Mambou in consultations regarding application aspects. The "Smart Solutions in Ubiquitous Computing Environments", Grant Agency of Excellence 2019, projects No. 2204, University of Hradec Kralove, Faculty of Informatics and Management is acknowledged. Likewise, the Authority of Ambrose Ali University, Ekpoma, under Tertiary Education Trust Fund (TETFUND), Nigeria, is also acknowledged for the opportunity given to the scholar to conduct his research leading to the Doctor of Philosophy (PhD) in Computer Science at UTM.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**

**Algorithm A1.** Modified algorithm for region identification and grouping of participants based on clusters using K-means with node S

1. **Set** S: node for participant's smartphone


5. **Set** SA: Sub-arealist = [SA1, SA2, SA3, ... , SAn]


9. **Start**


16. Set minT i.e., for location manager minimum power consumption with minT Milliseconds between location update to reserve power

17. Set minDist: as location transmission in case device moves using minDistance meters

18. TDifference = location.getT( )- currentbestlocation.getT( )

**If** TDifference > TWindow then participant (node) have moved and transmit the new location into a Crowd Controller Station (CCS) based on timestamp change

19. **If** (Lat, Long) in location context with Sub-arealist SAn are the same,

clusters set K using Dist between the nodes S

20. Group S into SA1, SA2, SA3, ... , SAn clusters

21. Crowdcount = S + 1


#### **Appendix B**

**Algorithm A2.** Enhanced approach for individual and crowd condition prediction proposed to extend BCF


#### **Appendix C**

**Figure A1.** Patterns of participant behavior estimation using a disparity matrix for 20 nodes, S1–S20, for the recognition of abnormality of individual behavior per m<sup>2</sup>.

*Entropy* **2019**, *21*, 487

#### **Appendix D**

Enhanced approach (EHCAF) for individual and crowd condition prediction

**Figure A2.** Patterns of participant behavior estimation using a disparity matrix for 20 nodes S1 to S20 for the recognition of abnormality of individual behavior per m<sup>2</sup>.

#### **References**


Recognition. In Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems, Seoul, Korea, 1–4 November 2015.


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **3D CNN-Based Speech Emotion Recognition Using K-Means Clustering and Spectrograms**

#### **Noushin Hajarolasvadi \* and Hasan Demirel**

Department of Electrical and Electronics Engineering, Eastern Mediterranean University, 99628 Gazimagusa, North Cyprus, via Mersin 10, Turkey; hasan.demirel@emu.edu.tr

**\*** Correspondence: noushin.hajarolasvadi@cc.emu.edu.tr

Received: 12 March 2019; Accepted: 4 May 2019; Published: 8 May 2019

**Abstract:** Detecting human intentions and emotions helps improve human–robot interactions. Emotion recognition has been a challenging research direction in the past decade. This paper proposes an emotion recognition system based on analysis of speech signals. Firstly, we split each speech signal into overlapping frames of the same length. Next, we extract an 88-dimensional vector of audio features including Mel Frequency Cepstral Coefficients (MFCC), pitch, and intensity for each of the respective frames. In parallel, the spectrogram of each frame is generated. In the final preprocessing step, by applying *k*-means clustering on the extracted features of all frames of each audio signal, we select *k* most discriminant frames, namely keyframes, to summarize the speech signal. Then, the sequence of the corresponding spectrograms of keyframes is encapsulated in a 3D tensor. These tensors are used to train and test a 3D Convolutional Neural network using a 10-fold cross-validation approach. The proposed 3D CNN has two convolutional layers and one fully connected layer. Experiments are conducted on the Surrey Audio-Visual Expressed Emotion (SAVEE), Ryerson Multimedia Laboratory (RML), and eNTERFACE'05 databases. The results are superior to the state-of-the-art methods reported in the literature.

**Keywords:** speech emotion recognition; 3D convolutional neural networks; deep learning; k-means clustering; spectrograms

#### **1. Introduction**

Designing an accurate automatic emotion recognition (ER) system is crucial and beneficial to the development of many applications such as human–computer interactive (HCI) applications [1], computer-aided diagnosis systems, or deceit-analyzing systems. Three main models are in use for this purpose, namely acoustic, visual, and gestural. While a considerable amount of research and progress is dedicated to the visual model [2–5], speech as one of the most natural ways of communication among human beings is neglected unintentionally. Speech emotion recognition (SER) is useful for addressing HCI problems provided that it can overcome challenges such as understanding the true emotional state behind spoken words. In this context, SER can be used to improve human–machine interaction by interpreting human speech.

SER refers to the field of extracting semantics from speech signals. Applications such as pain and lie detection, computer-based tutorial systems, and movie or music recommendation systems that rely on the emotional state of the user can benefit from such an automatic system. In fact, the main goal of SER is to detect discriminative features of a speaker's voice in different emotional situations.

Generally, a SER system extracts features of voice signal to predict the associated emotion using a classifier. A SER system needs to be robust to speaking rate and speaking style of the speaker. It means particular features such as age, gender, and culture differences should not affect the performance of the SER system. As a result, appropriate feature selection is the most important step of designing the SER system. Acoustic, linguistic, and context information are three main categories of features used in the SER research [6]. In addition to those features, hand-engineered features including pitch, Zero-Crossing Rate (ZCR), and MFCC are widely used in many research works [6–9]. More recently, convolutional neural network (CNN) has been in use at a dramatically increasing rate to address the SER problem [2,10–13].

Since the results from deep learning methods are more promising [8,14,15], we used a 3D CNN model to predict the emotion embedded in a speech signal. One challenge in SER using multidimensional CNNs is the dimension of speech signal. Since the purpose of this study is to learn spectra-temporal features using a 3D CNN, one must transform the one-dimensional audio signal to an appropriate representation to be able to use it with 3D CNN. A spectrogram is a 2D visual representation of short-time Fourier transform (STFT) where the horizontal axis is the time, and the vertical axis is the frequency of signal [16]. In the proposed framework, audio data is converted into consecutive 2D spectrograms in time. The 3D CNN is especially selected because it captures not only the spectral information but also the temporal information.

To train our 3D CNN using spectrograms, firstly, we divide each audio signal to shorter overlapping frames of equal length. Next, we extract an 88-dimensional vector of commonly known audio features for each of the corresponding frames. This means, at the end of this step, each speech signal is represented by a matrix of size *n* × 88 where *n* is the total number of frames for one audio signal and 88 is the number of features extracted for each frame. In parallel, the spectrogram of each frame is generated by applying STFT. In the next step, we apply k-means clustering on the extracted features of all frames of each audio signal to select *k* most discriminant frames, namely keyframes. This way, we summarize a speech signal with *k* keyframes. Then, the corresponding spectrograms of the keyframes are encapsulated in a tensor of size *k* × *P* × *Q* where *P* and *Q* are horizontal and vertical dimensions of the spectrograms. These tensors are used as the input samples to train and test a 3D CNN using 10-fold cross-validation approach. Each of 3D tensors is associated with the corresponding label of the original speech signal. The proposed 3D CNN model consists of two convolutional layers and a fully connected layer which extracts the discriminative spectra-temporal features of so-called tensors of spectrograms and outputs a class label for each speech signal. The experiments are performed on three different datasets, namely Ryerson Multimedia Laboratory (RML) [17] database, Surrey Audio-Visual Expressed Emotion (SAVEE) database [18] and eNTERFACE'05 Audio-Visual Emotion Database [19]. We achieved recognition rate of 81.05%, 77.00% and 72.33% for SAVEE, RML and eNTERFACE'05 databases, respectively. These results improved the state-of-the-art results in the literature up to 4%, 10% and 6% for these datasets, respectively. In addition, the 3D CNN is trained using all spectrograms of each audio file. 
As a second series of experiments, we used a pre-trained 2D CNN model, namely VGG-16 [20], and performed transfer learning on the top layers. The results obtained from our proposed method are superior to the ones achieved by training VGG-16. This is mainly due to the fewer parameters used in the freshly trained 3D CNN architecture. Also, VGG-16 is a 2D model and cannot capture the temporal information of the given spectrograms.

The main contributions of the current work are: (a) division of an audio signal into *n* frames of equal length and selection of the *k* most discriminant frames (keyframes) using the k-means clustering algorithm, where *k* << *n*; (b) representation of each audio signal by a 3D tensor of size *k* × *P* × *Q*, where *k* is the number of consecutive spectrograms corresponding to keyframes and *P* and *Q* are the horizontal and vertical dimensions of each spectrogram; (c) improvement of the ER rate for three benchmark datasets by learning spectra-temporal features of the audio signal using a 3D CNN and 3D tensor inputs.

The main motivation of the proposed work is to employ 3D CNNs, which are capable of learning spectra-temporal information of audio signals. We propose to use a subset of spectrogram frames which minimizes redundancy and maximizes the discrimination capability of the represented audio signal. The selection of such a subset provides computationally cheaper tensor processing with comparable or improved performance.

The rest of the paper is organized as follows: In Section 2, we review the related works and describe the steps of our proposed method. In Section 3, our experimental results are presented and compared with the state of the art in the literature. Finally, in Section 4, conclusions and future work are discussed.

#### **2. Materials and Methods**

Generally speaking, a SER system is composed of two parts: a preprocessing part that extracts suitable features and a classifier that employs those features to perform ER. This section overviews existing strategies in the SER research area [21,22].

#### *2.1. Related Works*

In a very recent work, ref. [23] proposed a robust technique of SER by embedding phoneme sequences and spectrograms. The authors represented each phoneme as an embedding numeric vector. They use two CNN models, a phoneme-based CNN model and a 2D CNN model for spectrograms. Both models have four parallel convolutional kernels. They used the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database [24] and they achieved an overall accuracy of 73.9% on this corpus. Considering the high computational cost of training CNNs, the drawback of this method is employing two separate CNN models. Also, comparison with other benchmark databases is ignored.

In another recent work, Zhang et al. [25] achieved 70.4% accuracy on the same corpus, IEMOCAP. They proposed an attention-based fully convolutional neural network (FCN). FCNs can handle spectrograms with variable sizes. In fact, they turn AlexNet [26] into an FCN by removing its fully connected layers and then using it as an encoder. Later, they attach an attention layer which is followed by a SoftMax layer. They compared their results with a fine-tuned version of AlexNet and VGG-16 [20]. They reported 67.9% and 66.8% accuracy on IEMOCAP database. Also, they reported recognition rate of 66.5% and 65.3% by direct training (without fine tuning) of these two deep networks. The advantage of this work is that the preprocessing step is limited to the generation of so-called spectrograms.

Avots et al. [9] conducted a cross-corpus evaluation. They analyzed a model on the audio-visual information of SAVEE, RML and eNTERFACE'05 databases and tested the same model on AFEW database to merely show how challenging the task of recognizing emotional states in real world environment might be. They represented the emotional speech in SAVEE, RML and eNTERFACE'05 databases by a 1 × 650, 1 × 1725 and 1 × 1570 feature vector, respectively. Mainly, they used spectral features such as energy entropy, ZCR, and harmonic product spectrum to represent each audio signal. Then, they applied SVM classifier and achieved 77.4%, 69.3% and 50.2% for SAVEE, RML and eNTERFACE'05 databases and only 27.1% for AFEW database. One disadvantage of this work is the different feature vector size that is used for each dataset which ignores the generalization aspect of machine learning methods and makes it highly susceptible to overfitting on a specific dataset.

Torfi et al. [8] proposed a 3D CNN for cross audio-visual matching recognition. Their audio-visual recognition system couples two non-identical 3D CNN architectures, which map a pair of speech and video inputs into a new representation space for evaluating the correspondence between them. The inputs they used were spectrograms, as well as the first and second order derivatives of the MFEC features. They applied feature-level fusion of audio and video features and reported an area under the curve of 95.4% for the Lip Reading in the Wild dataset.

Badshah et al. [22] used spectrograms of a speech signal as the input for a 2D CNN. They extracted the spectrogram of each speech signal and then split it into several smaller spectrograms. These smaller spectrograms are later resized and used as the input to a 2D CNN architecture. They reported that using rectangular kernels in the convolution layers helps to capture local features effectively. They trained and evaluated their model on the Berlin Emotional Database (EmoDB) [27] and obtained a weighted (overall) accuracy of 72.21%. Also, in [14], they reported that a freshly trained CNN performs better than transfer learning on AlexNet [26] for the SER purpose.

Ref. [28] evaluated two types of neural networks: CNNs and long short-term memory networks. They used the IEMOCAP corpus for training and evaluation. In the preprocessing step, they split each sentence longer than 3 s into shorter sub-sentences. The emotional label of the original sentence is assigned to the sub-sentences. Then, they calculate a spectrogram for each sub-sentence. They studied the effect of 10 Hz and 20 Hz grid resolutions and reported that using a lower resolution yields lower accuracy. They obtained a weighted accuracy of 68.8%. They also used harmonic modeling to remove noise from spectrograms. In contrast, we believe that k-means clustering selects the frames which are less redundant and that the corresponding spectrograms of the selected frames are therefore more informative.

Noroozi et al. [29] proposed an audio-visual ER system for video clips. They extracted 88 features including MFCC, pitch, intensity, mean, variance, etc. from the whole speech signal. No framing is performed. Then, they applied SVM and Random Forest on this feature space. They reported weighted accuracies of 56.07%, 65.28% and 47.11% for the SAVEE, RML and eNTERFACE'05 datasets using Random Forest. The results obtained by SVM were lower than those of Random Forest. In another work by the same author [6], random forests and decision trees were used to classify speech signals using a vector of 14 para-linguistic features. They obtained an overall accuracy of 66.28% on the SAVEE dataset.

Schluter and Grill [13] applied pitch-shifting and time-stretching as two significant methods for data augmentation of spectrograms. They used the augmented data as input to 2D CNN. One disadvantage of this work is that due to a huge number of spectrograms, they used a fixed number of weight updates which means the convergence of CNN optimizer is not guaranteed. Other researchers such as Palaz et al. [12] split a raw input signal to a sequence of frames, and report a class-base score for each frame by passing through several convolution filter stages and a multi-layer perceptron classifier.

A CNN is used to learn affect-salient features for SER in the previous work of [7]. In the first step of training, the unlabeled samples are used to learn Local Invariant Features (LIF) using a sparse auto-encoder. In the second step, LIF is used as the input to a feature extractor. The weighted accuracy on SAVEE and EmoDB was 71.8% and 57.2%, respectively.

Abdel-Hamid et al. [15] proposed a limited-weight-sharing scheme that models the speech features for speech recognition systems while [11] proposed a new method for modeling speech signals using Restricted Boltzmann Machine.

#### *2.2. Proposed Method*

#### 2.2.1. Preprocessing

In this study, the RML, SAVEE and eNTERFACE'05 datasets are used. The preprocessing pipeline is shown in Figure 1. First, the speech signals are extracted from video clips using the FFmpeg framework. Then, each speech signal is divided into shorter overlapping frames of equal length. Each frame has 50% overlap with the previous one. This step results in the division of each speech signal into *n* frames. Depending on the length of the speech signal, the length of the frames differs from one audio signal to another, but all frames of one audio signal have the same length. Then, for each frame, 88 commonly known audio features such as MFCC, pitch, variance, intensity, and filter-bank energies are extracted. We adopted the set of extracted features from [29]. The complete list of extracted features is shown in Table 1.
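The framing step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `split_into_frames` and the choice to fix the number of frames `n_frames` (so that the frame length scales with the signal length) are assumptions made for the example.

```python
import numpy as np

def split_into_frames(signal, n_frames):
    """Split a 1-D signal into n_frames equal-length frames with 50% overlap."""
    signal = np.asarray(signal, dtype=float)
    # With 50% overlap the hop is half a frame, so n frames span (n + 1) hops.
    hop = len(signal) // (n_frames + 1)
    if hop < 1:
        raise ValueError("signal is too short for the requested number of frames")
    frame_len = 2 * hop
    starts = np.arange(n_frames) * hop
    return np.stack([signal[s:s + frame_len] for s in starts])

# Hypothetical 16,000-sample signal split into 9 frames.
frames = split_into_frames(np.arange(16000), n_frames=9)
print(frames.shape)  # (9, 3200)
```

The second half of each frame coincides with the first half of the next one, which is what the 50% overlap means in practice.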

In parallel, the spectrogram of each frame is generated. A spectrogram is simply a representation of signal strength versus time at different frequencies and is generated by applying STFT. A sequence of overlapping Hamming windows is applied to each frame with a window size of 20 ms [30], a window shift of 10 ms and a hop size of 256. At the end of this step, each speech signal is represented by a matrix of size *n* × 88 and *n* spectrograms, as shown in Figure 1. Here, *n* is the number of frames, and for each frame there exists a matching spectrogram, i.e., each audio frame has one feature vector and a corresponding spectrogram.

In the next step, the k-means clustering algorithm is applied to all extracted feature vectors of one speech signal to select the *k* most discriminant frames, known as keyframes. As mentioned before, there exists a spectrogram corresponding to each of these keyframes. The sequence of *k* successive spectrograms of the keyframes for one speech signal forms a 3D tensor representing that speech signal. Such tensors are used as the input samples for training our 3D CNN architecture. The label of the original speech signal is assigned to the generated 3D tensor. To find the best representative *k*, we started with *k* equal to 9 and increased it in a heuristic fashion to 18 and 27. The best *k*, which maximized the accuracy over the validation set during training, is equal to 9.
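One possible way to turn cluster centroids into keyframes and a *k* × *P* × *Q* tensor is sketched below. The text does not state how a frame is picked from each cluster, so choosing the frame nearest to each centroid is an assumption of this example, as are the helper names `select_keyframes` and `build_tensor` and the toy data.

```python
import numpy as np

def select_keyframes(features, centroids):
    """Return the index of the frame nearest to each centroid, in temporal order."""
    # Pairwise Euclidean distances, shape (n_frames, k).
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    nearest = dists.argmin(axis=0)      # one frame index per centroid
    return np.unique(nearest)           # sorted, so the time order is preserved

def build_tensor(spectrograms, keyframe_idx):
    """Stack the keyframes' spectrograms into a k x P x Q tensor."""
    return spectrograms[keyframe_idx]

# Toy data: n = 6 frames with 2 features each (the paper uses 88), k = 3.
features = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0],
                     [5.0, 5.4], [9.0, 0.0], [9.0, 0.3]])
# Stand-in centroids; in the pipeline these would come from k-means.
centroids = np.array([[0.05, 0.0], [5.0, 5.1], [9.0, 0.2]])
spectrograms = np.random.default_rng(0).random((6, 96, 96))

idx = select_keyframes(features, centroids)
tensor = build_tensor(spectrograms, idx)
print(idx, tensor.shape)  # [0 2 5] (3, 96, 96)
```

Note that if two centroids happen to share the same nearest frame, this sketch returns fewer than *k* indices; a full implementation would need a rule for that case.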

**Table 1.** List of extracted features for each audio frame.

**Figure 1.** The proposed framework for preprocessing the data.

Training CNNs, and especially 3D CNNs, is an exhaustive and time-consuming process. As a result, summarizing the input samples (a speech signal represented by a selected sequence of spectrograms) without degrading the performance becomes highly important. For example, in [13], a huge number of spectrograms is produced using a hop size equal to 1. Due to the high redundancy of overlapping audio frames and memory limitations, training of the CNN is performed for a fixed number of 40,000 weight updates instead of training over the full dataset. This means that not only might the optimizer not converge, but also not all the spectrograms of one audio signal are observed during training. In addition, a 3D CNN can be trained as deep as possible subject to the machine memory limit and computation affordability [31]. Thus, it is desirable to handle the memory limitation and reduce the computational time by summarizing input samples while preserving the performance.

In our methodology, the k-means clustering algorithm addresses these problems because it detects the redundancy by clustering the feature vectors representing the frames of one audio signal and maximizing the distinctions between those frames. Figure 2 shows the generated clusters and their corresponding centroids. To visualize the discrimination of clusters, we applied the *t*-test to the 1 × 88 feature vectors of the selected and non-selected frames of a single audio file to find the two best representative features. The *t*-test examines the differences between two populations using the mean and standard deviation of each population. The first formant and the MFCC provided the maximum difference. The k-means clustering is visualized using the features selected by the *t*-test. In the following, we first explain feature extraction and spectrogram generation in more detail. Then, the proposed 3D CNN for SER is described.
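The feature ranking used for the visualization can be illustrated with a per-feature *t*-statistic. The sketch below uses the unequal-variance (Welch) form, which is an assumption since the text does not say which variant was applied; the data are synthetic and the function name `welch_t` is hypothetical.

```python
import numpy as np

def welch_t(a, b):
    """Per-feature Welch t-statistic between two groups of row vectors."""
    mean_diff = a.mean(axis=0) - b.mean(axis=0)
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return mean_diff / se

rng = np.random.default_rng(0)
selected = rng.normal(0.0, 1.0, size=(9, 88))        # stand-in for keyframes
non_selected = rng.normal(0.0, 1.0, size=(40, 88))   # stand-in for the other frames
non_selected[:, 3] += 8.0     # make two features clearly different
non_selected[:, 17] -= 8.0

t = welch_t(selected, non_selected)
# The two features with the largest absolute t-statistic.
top2 = np.sort(np.argsort(np.abs(t))[-2:])
print(top2.tolist())  # the two shifted features, 3 and 17
```

Plotting the frames over the two top-ranked features then gives a 2D view of the clusters, in the spirit of Figure 2.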

**Figure 2.** k-Means Clustering visualization for one audio sample in Angry category.

#### 1. Extracted Features:

Emotions can be represented using different features of speech. For example, a speaker who is angry has a faster speech rate as well as higher energy and pitch frequency. Some of the most effective features of speech for ER are duration, intonation, pitch and intensity, filter-bank energies, MFCCs, Δ*MFCCs*, and ZCR. In this paper, we extracted 88 features proposed by [29]. The complete list of features is shown in Table 1 and for a speech signal *s* with length *N*, they are explained in detail in Appendix A.

#### 2. Spectrograms:

As mentioned before, one challenge in SER using CNNs is the dimension of the speech signal. Since the purpose of this study is to learn spectra-temporal features using a 3D CNN, one must transform the one-dimensional audio signal into an appropriate representation for CNNs. One such representation is the spectrogram, which is the visual representation of signal strength over time at different frequencies [22]. A spectrogram is generated by applying STFT. STFT is a Fourier-based transform which determines the sinusoidal frequency and phase of local portions of a signal as it changes over time. In practice, to compute the STFT, a long time signal must first be divided into shorter frames or segments of equal length. Then, applying the Fourier transform to each shorter frame reveals the Fourier spectrum of that frame. Visualizing the changing spectra as a function of time results in the spectrogram [16].

In other words, the spectrogram is a visual representation of the STFT where the horizontal axis represents time and the vertical axis represents the frequency of the signal in that short frame. In a spectrogram, at a particular time point and a particular frequency, dark colors indicate a low magnitude of that frequency, whereas light colors indicate higher magnitudes. Spectrograms are well suited to a variety of speech analysis tasks, including SER [16]. In this work, we aim to represent each speech signal as a selected sequence of spectrograms generated by applying STFT on overlapping frames.
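A minimal sketch of the STFT-based spectrogram described above is given below, using 20 ms Hamming windows shifted by 10 ms. The 16 kHz sampling rate, the log-magnitude scaling and the function name `spectrogram` are assumptions of the example; resizing to a fixed image size is not shown.

```python
import numpy as np

def spectrogram(frame, sample_rate=16000, win_ms=20, shift_ms=10):
    """Log-magnitude STFT of one audio frame using overlapping Hamming windows."""
    win = int(sample_rate * win_ms / 1000)      # samples per window
    hop = int(sample_rate * shift_ms / 1000)    # samples between window starts
    window = np.hamming(win)
    n_windows = 1 + (len(frame) - win) // hop
    segments = np.stack([frame[i * hop:i * hop + win] * window
                         for i in range(n_windows)])
    magnitude = np.abs(np.fft.rfft(segments, axis=1))   # (time, frequency)
    # Transpose so that rows are frequency bins and columns are time steps.
    return 20 * np.log10(magnitude.T + 1e-10)

t = np.arange(16000) / 16000.0
tone = np.sin(2 * np.pi * 1000 * t)             # 1 kHz test tone, 1 s long
spec = spectrogram(tone)
print(spec.shape)                                # (161, 99)
print(spec[:, 0].argmax() * 16000 / 320)         # 1000.0
```

For the test tone, the strongest frequency bin in every column sits at 1 kHz, which is the horizontal line one would see in the corresponding image.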

#### 3. k-means clustering:

k-means is an iterative, data-partitioning algorithm that assigns each sample point to exactly one of the *k* clusters. First, *k* observations are selected randomly to be the centroids of the clusters. Then, the distance between each sample point and each cluster centroid is calculated. The sample point is assigned to the cluster with the closest centroid. When all sample points are assigned to exactly one of the clusters, the average of the sample points in each cluster is computed to obtain *k* new centroid locations. The distance calculation and centroid update steps are then repeated until the clusters stabilize or a maximum number of iterations is reached [32,33].
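The iteration described above can be written compactly as follows. This is a plain sketch of the standard procedure on toy 2D points, not the implementation used in the experiments; the function name `kmeans` and the handling of empty clusters are choices made for the example.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Basic k-means iteration: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: choose k distinct observations as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.full(len(points), -1)
    for _ in range(max_iter):
        # Step 2: assign every point to its closest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                      # clusters have stabilized
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = points[labels == j]
            if len(members):           # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, labels

points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans(points, k=2)
print(labels)
```

On this toy set, the first three points end up in one cluster and the last three in the other; which cluster receives label 0 depends on the random initialization.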

#### 2.2.2. 3D CNN Architecture

The proposed architecture is a 3D CNN trained using 3D tensors. Each of these tensors contains a sequence of spectrograms for one audio signal. The proposed 3D model consists of two convolutional layers, one fully connected layer, a dropout layer, and a SoftMax layer. In Table 2, the spatial size of the 3D kernels is reported as *T* × *H* × *W*, where *T* is the kernel size in the temporal dimension, and *H* and *W* are the kernel sizes in the height and width dimensions, respectively. By applying a 3D kernel, spectra-temporal features are extracted using a 3D convolutional operation. The complete block diagram of our proposed architecture is shown in Figure 3. We did not use any zero padding because it adds extra zero-energy coefficients, which are not meaningful in local feature extraction.

**Figure 3.** Block diagram of the proposed architecture for SER.

As mentioned before, the best *k* was found to be 9. As a result, each input sample of our proposed network consists of 9 consecutive spectrograms representing one emotional speech signal. All the spectrograms obtained from the pipeline explained in Section 2.2.1 are resized to 96 × 96 images. The first convolution layer, Conv1, has 128 kernels of size 3 × 5 × 1, which are applied at strides of 1 pixel. The 3D convolutional layers extract the correlation between high-level temporal features and the spatial features of spectrograms. Conv1 uses a Parametric Rectified Linear Unit (PReLU). Following this, a 3D max pooling layer (Pool1) with a kernel size of 2 × 2 × 2 and a stride of 1 × 2 × 1 is used. PReLU is an activation function that is used instead of regular sigmoid ones with the aim of improving the efficiency of the training process. Layer Conv2 has 256 kernels of size 3 × 7 × 1, again with a stride of 1. Conv2 also uses PReLU as its activation function. Pool2 is a 3D max pooling layer with the same kernel size and stride as Pool1. Pool2 is followed by a dropout layer with a dropout rate of 75% to avoid overfitting. Then, one fully connected (FC) layer with 64 units and a classification layer with 6 output classes are used. Also, batch normalization [34] has been used to improve the training convergence.
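Because no zero padding is used, the size of the feature maps shrinks at every layer. The arithmetic can be traced with the short sketch below. It assumes that kernel sizes and strides are both given in *T* × *H* × *W* order and that pooling is also unpadded; the resulting shapes are therefore illustrative and may differ from the values in Table 2.

```python
def valid_out(size, kernel, stride):
    """Output length along one axis for an unpadded sliding window."""
    return (size - kernel) // stride + 1

def apply_layer(shape, kernel, stride):
    """Apply the unpadded window arithmetic along each of the three axes."""
    return tuple(valid_out(s, k, st) for s, k, st in zip(shape, kernel, stride))

shape = (9, 96, 96)                      # T x H x W input: 9 spectrograms of 96 x 96
layers = [
    ("Conv1", (3, 5, 1), (1, 1, 1)),
    ("Pool1", (2, 2, 2), (1, 2, 1)),
    ("Conv2", (3, 7, 1), (1, 1, 1)),
    ("Pool2", (2, 2, 2), (1, 2, 1)),
]
shapes = {}
for name, kernel, stride in layers:
    shape = apply_layer(shape, kernel, stride)
    shapes[name] = shape
    print(name, shape)
# Conv1 (7, 92, 96)
# Pool1 (6, 46, 95)
# Conv2 (4, 40, 95)
# Pool2 (3, 20, 94)
```

Under these assumptions, the temporal axis is reduced from 9 to 3 steps before the fully connected layer.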

In the proposed 3D model, we followed the best experimental observations reported in [22,31,35]. In [14], it is reported that using rectangular kernels with large heights captures the local features effectively. As a result, we used rectangular kernels of size 3 × 5 × 1 and 3 × 7 × 1 in the convolution layers. Also, [35] reported that using shallow temporal and moderately deep spectral kernels is optimal for the SER purpose. Thus, we employed 128 and 256 filters for the convolutional layers, which resulted in the best performance on the validation set. Using more than 256 filters did not help to improve the performance on the validation set. For the initialization of the weight and bias parameters, two methods, namely variance scaling [8] and random uniform distribution, were tested. Initializing both parameters with a random uniform distribution resulted in better performance on the validation set. For regularization, we used *l2* weight regularization with the regularization factor set to 5 × 10<sup>−4</sup>.

**Table 2.** The resolution of the proposed 3D CNN.


#### **3. Results and Discussion**

Taking into account the acquisition source of the data, three general groups of emotional databases exist: spontaneous emotions, acted emotions based on invocation, and simulated emotions. Databases recorded in natural situations such as TV shows or movies are categorized under the first group. Usually, such databases suffer from low quality due to different sources of interference. For databases in the second group, an emotional state is induced using various methods such as watching emotional video clips or reading emotional content. Although psychologists prefer this type of database, the resulting reaction to the same stimulus may differ between subjects. Also, from an ethical point of view, provoking strong emotions might be harmful for the subject. eNTERFACE'05 and RML are examples of this group. The last group of databases contains simulated emotions with high quality recordings and a still emotional state. The SAVEE database is a good example of this group.

#### *3.1. Dataset*

Three benchmark datasets were used to conduct the experiments, namely RML, SAVEE and eNTERFACE'05. All three datasets support the audio-visual modalities. Several reasons were considered when choosing the datasets, and we selected databases of varying size to show the flexibility of our model. Firstly, all three datasets cover the same emotional states, which makes them highly comparable. It is known that distinguishing between two emotion categories with large inter-class differences (for example, disgust and happy) is easier than distinguishing between two emotions with a small inter-class discrepancy. In addition, having the same number of emotional states prevents misinterpretation of the experimental results, because the classification task becomes more challenging as the number of emotional states increases.

Second, since all three datasets were recorded for both the audio and the visual modalities, the quality of the recorded audio is almost the same (16-bit single channel format). For example, comparing databases recorded with high acoustic quality for the specific purpose of SER (such as EmoDB) with databases recorded in real environments is not preferable. Extraction of speech signals from videos for all three datasets is performed using the FFmpeg framework. Third, SAVEE, RML and eNTERFACE'05 can be categorized as small-size, mid-size, and large-size databases, respectively. Thus, the stability of the proposed model's performance is evaluated with respect to the number of input samples.

The data processing pipeline explained in Section 2.2.1 is applied on each audio sample. To avoid overfitting, in all experiments, we divided the data such that 90% is used for training and 10% for test. We performed 10-fold cross-validation on the train part which means 90% of the train data is used for training and 10% for validation. Finally, the cross validated model is evaluated on the test part. The experiments are all performed for speaker-independent scenarios.
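The evaluation protocol above (a 90%/10% train/test split followed by 10-fold cross-validation on the training part) can be sketched with plain index bookkeeping. The helper names and the random shuffling are assumptions of this example, and the sketch ignores speaker identity, so it does not by itself enforce the speaker-independent setting.

```python
import random

def train_test_split(n_samples, test_fraction=0.1, seed=0):
    """Shuffle sample indices and hold out a fraction for testing."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_test = round(n_samples * test_fraction)
    return idx[n_test:], idx[:n_test]

def k_fold(indices, n_folds=10):
    """Yield (train, validation) index lists; each fold is held out once."""
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]

train_idx, test_idx = train_test_split(720)     # e.g., the 720 RML samples
splits = list(k_fold(train_idx))
print(len(train_idx), len(test_idx))            # 648 72
print(len(splits[0][0]), len(splits[0][1]))     # 583 65
```

Each of the ten splits trains on roughly 90% of the training part and validates on the remaining 10%, while the held-out test indices are never touched during cross-validation.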

#### 3.1.1. SAVEE

The SAVEE database has 4 male subjects who acted emotional videos for six basic emotions, namely anger, disgust, fear, happiness, sadness, and surprise. A neutral category is recorded as well, but since the other two datasets do not include neutral, we discard it. This dataset consists of 60 videos per category. In total, 360 emotional audio samples were extracted from the videos of this dataset.

#### 3.1.2. RML

The RML database presented by the Ryerson Multimedia Laboratory [17] includes 120 videos in each of the six basic categories mentioned above, from 8 subjects who spoke various languages such as English, Mandarin, and Persian. A dataset of 720 emotional audio samples is obtained from this database.

#### 3.1.3. eNTERFACE'05

The third dataset is eNTERFACE'05 [19], recorded from 42 subjects. All the participants spoke English and 81% of them are female. Each subject was asked to perform all six basic emotional states. The emotional states are exactly the same as in SAVEE and RML. From this dataset, 210 audio samples per category are extracted.

#### *3.2. Experiments*

To assess the proposed method, four experiments are conducted on each dataset. In the first experiment, we trained the proposed 3D CNN model using the spectrograms of the selected keyframes by applying the 10-fold cross-validation method. In the second experiment, the 3D CNN model is trained using the spectrograms of all frames. In the third experiment, by means of transfer learning, we trained VGG-16 [20] using the spectrograms of the keyframes. Finally, in the last experiment, we trained VGG-16 using all spectrograms generated for each audio signal. Comparing the results obtained from the first and second experiments shows that k-means clustering discarded the audio frames which convey insignificant or redundant information. This can be seen from the results given in Tables 3–5, which do not differ notably. It is important to note that the overall accuracy results obtained from these four experiments are denoted by Proposed 3D CNN(1), Proposed 3D CNN(2), VGG-16(1) and VGG-16(2) in those tables.



†: A feature vector of commonly known audio features like Table 1. <sup>∗</sup>: All generated frames/spectrograms of one audio is used. : Only *k* (9) frames/spectrograms of one audio is used.

**Table 4.** Comparison of recognition rates among different methods for RML dataset.


†: A feature vector of commonly known audio features like Table 1. <sup>∗</sup>: All generated frames/spectrograms of one audio is used. : Only *k* (9) frames/spectrograms of one audio is used.



†: A feature vector of commonly known audio features like Table 1. <sup>∗</sup>: All generated frames/spectrograms of one audio is used. : Only *k* (9) frames/spectrograms of one audio is used.

#### 3.2.1. Training the Proposed 3D CNN

The CNN architecture illustrated in Figure 3 was trained on sequences of 9 consecutive spectrograms, each paired with the emotional label of the original speech sample. We trained the network for 400 epochs, ensuring that each input sample consists of a sequence of 9 successive spectrograms. Also, as a second experiment, the proposed 3D CNN was trained using all spectrograms of each audio signal.

Updates are performed using the Adam optimizer [37], categorical cross-entropy error, mini-batches of size 32 [13] and a triangular cyclical learning rate policy with the initial learning rate set to 1 × 10<sup>−4</sup>, the maximum learning rate to 6 × 10<sup>−4</sup>, the cycle length to 100 and the step size to 50. Cycle length is the number of iterations until the learning rate returns to the initial value [38]. Step size is set to half of the cycle length. Figure 4b shows the learning rate for 400 iterations on the RML dataset. As mentioned before, to fight overfitting, we used *l2* weight regularization with a factor of 5 × 10<sup>−4</sup>. In all experiments, 90% of the data is used for training and the rest for testing. This means that the model learned spectra-temporal features by applying 10-fold cross-validation on the training part of the data. Then, the trained model is evaluated using the test data.
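The triangular cyclical learning rate policy with the values quoted above (initial rate 1 × 10<sup>−4</sup>, maximum rate 6 × 10<sup>−4</sup>, step size 50) can be written as a small function of the iteration number. This is a sketch of the standard triangular schedule; the function name `triangular_lr` is hypothetical and the code is not taken from the authors' training script.

```python
import math

def triangular_lr(iteration, base_lr=1e-4, max_lr=6e-4, step_size=50):
    """Learning rate that rises linearly to max_lr and falls back every 2 * step_size."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    # x is 1 at the start and end of a cycle and 0 at its midpoint.
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

for it in (0, 25, 50, 75, 100):
    print(it, f"{triangular_lr(it):.2e}")
# 0 1.00e-04
# 25 3.50e-04
# 50 6.00e-04
# 75 3.50e-04
# 100 1.00e-04
```

The rate returns to its initial value every 100 iterations, which matches the stated cycle length of twice the step size.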

**Figure 4.** Results on RML database. (**a**) Training versus validation, accuracy improvement; (**b**) Cyclical learning rate decay.

The average accuracy on the test sets of the SAVEE, RML and eNTERFACE'05 databases is reported as confusion matrices in Tables 6–8, respectively. Clearly, the proposed method achieved results superior to the state of the art in the literature. Since the complexity of CNNs is extremely large, using discriminant input samples is of high importance, especially when it comes to real-time applications. To the best of our knowledge, this is the first paper representing a whole audio signal by means of the *k* most discriminant spectrograms. This means that a speech signal can be represented with fewer frames while preserving the accuracy. Figure 4a shows the training and validation accuracy improvement for the RML dataset over 400 iterations. Also, Figure 4b shows the cyclical learning rate over the same number of iterations on the same dataset.


**Table 6.** Confusion matrix for SAVEE.




**Table 8.** Confusion matrix for eNTERFACE'05.

#### 3.2.2. Transfer Learning of VGG-16

In the next two experiments, we selected one of the well-known 2D CNNs, VGG-16 [20]. We applied transfer learning on the top layers to make it more suitable for the SER purpose. We trained the network for 400 weight updates. The initial learning rate is set to 1 × 10<sup>−4</sup>.

In the first scenario, only the selected spectrograms of audio signals are given to VGG-16. In the second scenario, all generated spectrograms for each audio signal are used, without applying the k-means clustering algorithm. In both cases, majority voting is used to make a final decision for each audio signal and assign a label to it. That is, the label predicted most often for the spectrograms of an audio signal is taken as the final label for that signal. Both experiments under-performed the proposed 3D CNN.
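As a minimal sketch of the majority-voting step, assuming the per-spectrogram predictions for one audio signal are available as a list of labels, the final decision can be taken as the most frequent label. The function name `majority_vote` is hypothetical, and the tie-breaking rule (the label seen first wins) is an assumption, since the text does not specify one.

```python
from collections import Counter


def majority_vote(spectrogram_labels):
    """Return the most frequent label among per-spectrogram predictions.

    Ties are broken in favour of the label that appears first in the
    input; this rule is an assumption, not part of the described method.
    """
    counts = Counter(spectrogram_labels)
    label, _ = counts.most_common(1)[0]
    return label


# Example: five spectrograms of one utterance, three of them voted "happy".
final_label = majority_vote(["happy", "sad", "happy", "angry", "happy"])
```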

This is mainly because VGG-16 is pre-trained on the ImageNet dataset [39] for object detection and image classification purposes. It is also more complex, with more weights to adjust. As a result, transfer learning was not helpful. The same conclusion has been reported by [14] and [25] for applying transfer learning to AlexNet using spectrograms. The smaller number of parameters in the freshly trained 3D CNN is the main reason for its higher performance. The overall accuracy obtained by these experiments is compared with the state of the art in the literature in Tables 3–5 for the SAVEE, RML and eNTERFACE'05 datasets, respectively.

#### **4. Conclusions**

In this paper, we studied the performance of 3D Convolutional Neural Networks using spectrograms. Instead of using the whole set of spectrograms corresponding to the audio frames, we selected the *k* best frames for representing the whole speech signal. We compared the results of the proposed 3D CNN with the results obtained from 2D CNNs. The results show that the proposed method performs better than the pre-trained 2D networks. Future work may include comparison with pre-trained 3D architectures such as Res-3D and C3D, or applying different types of data augmentation to improve the results by reducing overfitting. Fusion with visual data is another direction, to study the multimodal performance of 3D architectures as well as the cross-correlation between different modalities.

**Author Contributions:** writing–original draft preparation, N.H.; writing–review and editing, H.D.

**Funding:** This research was funded by BAP-C project of Eastern Mediterranean University under grant number BAP-C-02-18-0001.

**Conflicts of Interest:** The authors declare no conflict of interest.

*Entropy* **2019**, *21*, 479

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **Appendix A**

In this section, we explain the extracted features given in Table 1 with more detail.

#### *Appendix A.1*

1. The loudness of the speech signal, or the syllable peak, is perceived as intensity. In other words, intensity is the power conveyed by the speech signal per unit area in a direction perpendicular to that area. It can be expressed as follows:

$$I(dB) = 10\log\_{10}[\frac{I}{I\_0}] \tag{A1}$$

where *I* is the intensity and *I*<sub>0</sub> is the standard threshold of hearing intensity at 1000 Hz for the human ear, which is represented in terms of sound intensity by a value equal to 10<sup>−12</sup> watts/m<sup>2</sup> [40].

2. Pitch is known as the fundamental frequency of the speech signal. It can be measured either using statistical methods or in the time-frequency domains. It can be calculated as follows:

$$\rho\_0(\mathbf{s}) = \mathcal{F}\{\log|\mathcal{F}(\mathbf{s}.w\_n^H||\mathbf{s}||)|\}\tag{A2}$$

where *w*<sup>*H*</sup><sub>*n*</sub> is the Hamming window, defined as follows:

$$w\_n^H = 0.54 - 0.46 \cos(\frac{2\pi n}{L}), \qquad 1 \le n \le N - 1 \tag{A3}$$

*L* is the order of the filter and it is equal to filter length −1 [29].

3. Mean of each frame is calculated as:

$$\mu = \frac{1}{N} \sum\_{i=1}^{N} s\_i \tag{A4}$$

4. Standard deviation is extracted by calculating the following formula:

$$std = \sqrt{\frac{1}{N-1} \sum\_{i=1}^{N} (s\_i - \mu)^2} \tag{A5}$$

where *μ* is the mean of audio frame and *si* shows the value of audio frame at *i*.

5. Zero-Crossing Rate (ZCR) of an audio frame is the number of times the signal passes zero or changes sign during the frame. The ZCR is expressed as below by [41]:

$$\hat{\mathbf{x}}(n) = \frac{1}{2} \sum\_{m=1}^{L} \left| \text{sgn}[\mathbf{x}(m+1)] - \text{sgn}[\mathbf{x}(m)] \right| \tag{A6}$$

where

$$\text{sgn}[\mathbf{x}(m)] = \begin{cases} +1 & \text{if } \mathbf{x}(m) \ge 0 \\ -1 & \text{if } \mathbf{x}(m) < 0 \end{cases} \tag{A7}$$

A high ZCR is indicative of a stationary series.

6. With an input signal starting at time zero and stopping at time *T*, the probability distribution satisfies [42]:

$$P(y < u) = \frac{2}{\pi} \arcsin \sqrt{\frac{u}{T}} \tag{A8}$$

where *y* is the last time that the signal passed zero. The density function is then:

$$P(u) = \frac{1}{\pi} \frac{1}{\sqrt{u(T - u)}}\tag{A9}$$

7. Harmonic mean is computed using the following formula:

$$\mathfrak{m} = \frac{N}{\sum\_{i=1}^{N} \frac{1}{s\_i}} \tag{A10}$$

8. Maximizing the inner product of the speech signal by its shifted version is another important feature that can be computed using the autocorrelation function *r*(*τ*) where *τ* is the time shift.

$$r(\tau) = \frac{1}{N} \sum\_{0}^{N-1} s(n)s(n+\tau) \tag{A11}$$

9. In calculation of MFCC, the formula proposed by Davis et al. [43] is used.

$$\text{MFCC}\_{i} = \sum\_{\theta=1}^{N} \cos\left[i(\theta - 1)\frac{\pi}{N}\right], \quad i = 1, \ldots, M \quad and \quad \theta = 1, \ldots, N \tag{A12}$$

*M* and *N* are the number of extracted cepstrum coefficients and number of band-pass filters, respectively. *θ* denotes the log energy of *θth* filter.

10. Calculation of the filter-bank energies (FBEs) and their derivatives is performed using a first-order Finite Impulse Response (FIR) filter. An array of band-pass filters that breaks up the input signal into multiple components is called a filter bank. Each separated component carries a single frequency sub-band of the original input signal. Let the unit-sample impulse response *h<sub>n</sub>* be the response of a discrete-time system to a unit-sample impulse *δ<sub>n</sub>*, where *δ<sub>n</sub>* = 1 for *n* = 0 and *δ<sub>n</sub>* = 0 for *n* ≠ 0. Then, for an arbitrary input signal *s<sub>n</sub>*, the output *y<sub>n</sub>* is given by:

$$y\_n = \sum\_{i=0}^{M} \alpha\_i s(n-i) + \sum\_{j=1}^{N} \beta\_j y(n-j) \tag{A13}$$

*α<sub>i</sub>* and *β<sub>j</sub>* are coefficients of the FIR filter and *M* is the order of the filter function. The FBEs are calculated as follows:

$$y(m) = \sum\_{\theta=0}^{L-1} h(\theta) s[(m-\theta) \mod (N)], \qquad m = 0, 1, \ldots, N \tag{A14}$$

where *L* is the length of the filter [29].

11. Also, Δ*MFCC* is obtained using the formula proposed by [44]:

$$C(n) = DCT \* \log(y(m))\tag{A15}$$

where *DCT* is the discrete cosine transform.
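Several of the simpler frame-level features listed above can be sketched in a few lines of plain Python. This is only an illustration of Equations (A4), (A5), (A10) and (A11) and of the zero-crossing count, not the feature extraction code used in the paper; all function names are hypothetical. For the autocorrelation, samples beyond the end of the frame are treated as zero, which is an assumption.

```python
import math


def frame_mean(s):
    """Mean of an audio frame, Equation (A4)."""
    return sum(s) / len(s)


def frame_std(s):
    """Sample standard deviation of a frame, Equation (A5)."""
    mu = frame_mean(s)
    return math.sqrt(sum((v - mu) ** 2 for v in s) / (len(s) - 1))


def sgn(v):
    """Sign function of Equation (A7): +1 for v >= 0, otherwise -1."""
    return 1 if v >= 0 else -1


def zero_crossing_count(s):
    """Number of sign changes between consecutive samples of a frame."""
    return sum(abs(sgn(b) - sgn(a)) for a, b in zip(s, s[1:])) // 2


def harmonic_mean(s):
    """Harmonic mean of a frame with non-zero samples, Equation (A10)."""
    return len(s) / sum(1.0 / v for v in s)


def autocorrelation(s, tau):
    """Autocorrelation at shift tau, Equation (A11), zero-padded at the end."""
    n = len(s)
    return sum(s[i] * s[i + tau] for i in range(n - tau)) / n


frame = [1.0, -2.0, 3.0, -4.0]
features = {
    "mean": frame_mean(frame),
    "std": frame_std(frame),
    "zcr": zero_crossing_count(frame),
}
```

Here the example frame changes sign between every pair of consecutive samples, so its zero-crossing count is 3.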

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

#### *Article*

## **Learning Using Concave and Convex Kernels: Applications in Predicting Quality of Sleep and Level of Fatigue in Fibromyalgia**

**Elyas Sabeti 1,2\*, Jonathan Gryak 1, Harm Derksen 3, Craig Biwer 1, Sardar Ansari 1,2, Howard Isenstein 4, Anna Kratz <sup>5</sup> and Kayvan Najarian 1,2,6,7**


Received: 8 February 2019; Accepted: 24 April 2019; Published: 28 April 2019

**Abstract:** Fibromyalgia is a medical condition characterized by widespread muscle pain and tenderness and is often accompanied by fatigue and alteration in sleep, mood, and memory. Poor sleep quality and fatigue, as prominent characteristics of fibromyalgia, have a direct impact on patient behavior and quality of life. As such, the detection of extreme cases of sleep quality and fatigue level is a prerequisite for any intervention that can improve sleep quality and reduce fatigue level for people with fibromyalgia and enhance their daytime functionality. In this study, we propose a new supervised machine learning method called Learning Using Concave and Convex Kernels (LUCCK). This method employs similarity functions whose convexity or concavity can be configured so as to determine a model for each feature separately, and then uses this information to reweight the importance of each feature proportionally during classification. The data used for this study was collected from patients with fibromyalgia and consisted of blood volume pulse (BVP), 3-axis accelerometer, temperature, and electrodermal activity (EDA), recorded by an Empatica E4 wristband over the courses of several days, as well as a self-reported survey. Experiments on this dataset demonstrate that the proposed machine learning method outperforms conventional machine learning approaches in detecting extreme cases of poor sleep and fatigue in people with fibromyalgia.

**Keywords:** fibromyalgia; Learning Using Concave and Convex Kernels; Empatica E4; self-reported survey

#### **1. Introduction**

Fibromyalgia is a medical condition characterized by widespread muscle pain and tenderness that is typically accompanied by a constellation of other symptoms, including fatigue and poor sleep [1–9]. Poor sleep, which is a cardinal characteristic of fibromyalgia, is strongly related to greater pain and fatigue, and lower quality of life [10–16]. As a result, any intervention that can improve sleep quality may enhance daytime functionality and reduce fatigue in people with fibromyalgia.

Studies of sleep in fibromyalgia often rely on self-reported measures of sleep or polysomnography. While easy to administer, self-reported measures of sleep demonstrate limited reliability and validity in terms of their correspondence with objective measures of sleep. In contrast, polysomnography is considered the gold standard of objective sleep measurement; however, it is expensive, difficult to administer, especially on a large scale, and may lack ecological validity. Autonomic nervous system (ANS) imbalance during sleep has been implicated as a mechanism underlying unrefreshed sleep in fibromyalgia. ANS activity can be assessed unobtrusively through ambulatory measures of heart rate variability (HRV) and electrodermal activity (EDA) [17,18]. Wearable devices such as the Empatica E4 are able to directly, continuously, and unobtrusively measure autonomic functioning such as EDA and HRV [19–22].

In the literature, there are few studies in which machine learning methods are used for classification or prediction of conditions related to fibromyalgia, none of which use physiological signals. A recent survey paper [23] summarizes various types of machine learning methods that have been used in pain research, including fibromyalgia. Previously, using data from 26 individuals (14 individuals with fibromyalgia and 12 healthy controls), the relative performance of machine learning methods for classification of individuals with and without pain using neuroimaging and self-reported data has been compared [24]. In another study using MRI images of 59 subjects, support vector machine (SVM) and decision tree models were used to first distinguish healthy control patients from those with fibromyalgia or chronic fatigue syndrome, and then differentiate fibromyalgia from chronic fatigue syndrome [25]. In [26], an SVM trained on fMRI images was used to distinguish fibromyalgia patients from healthy controls. The combination of fMRI with multivariate pattern analysis has also been investigated in classifying fibromyalgia patients, rheumatoid arthritis patients and healthy controls [27]. Psychopathologic features within an AdaBoost classifier have also been employed for classification of patients with fibromyalgia and arthritis [28]. In another recent work [29], secondary analysis of gene expression data from 28 patients with fibromyalgia and 19 healthy controls was used to distinguish between these two groups.

In this study, our immediate interest is to predict extreme cases of fatigue and poor sleep in people with fibromyalgia. For this analysis, we use self-reported quality of sleep and fatigue severity, along with data continuously collected by the Empatica E4 to measure autonomic nervous system activity during sleep (Section 2). These signals are preprocessed to remove noise and other artifacts as described in Section 3.1. After preprocessing, a number of mathematical features are extracted, including various statistics, signal characteristics, and HRV features (Section 3.2). Section 4 provides a detailed description of our novel Learning Using Concave and Convex Kernels (LUCCK) machine learning method. This model, along with other conventional machine learning methods, was trained on the extracted features and used to predict extreme cases of poor sleep and fatigue, with our method yielding the best results (Section 5).

We believe this analytical framework can be readily extended to outpatient monitoring of daytime activity, with applications to assessing extreme levels of fatigue and pain, such as those experienced by patients undergoing chemotherapy.

#### **2. Dataset**

The data used for this study was collected from a group of 20 adults with fibromyalgia and consists primarily of a set of signals recorded by an Empatica E4 wristband over the course of seven days (removing 1 h/day for charging/download). Most (80%) participants were female with mean age = 38.79 (min-max=18–70 years). Of a possible 140 nights of sleep data, the sample had data for 119 (85%) nights. In this dataset, 19.9% of heartbeats were missing due to noisy signals or failure of the Empatica E4 in detecting beats. Data were divided into 5-min windows for HRV analysis; windows with more than 15% missing peaks were eliminated. This led to the exclusion of 30.9% of the windows. The signals used in this analysis are each patient's blood volume pulse (BVP), 3-axis accelerometer, temperature, and EDA. In addition to these recordings, each subject self-reported his or her wake and sleep times, as well as self-assessed his or her level of fatigue and quality of sleep every morning. These data are labeled by self-reported quality of sleep (1 to 10, 1 being the worst) and level of fatigue (from 1 to 10, 10 indicating the highest level of fatigue).

#### **3. Signal Processing: Preprocessing, Filtering, and Feature Extraction**

The schematic diagram of Figure 1 represents our approach to analyzing the BVP and accelerometer signals in the fibromyalgia dataset. During preprocessing, we remove noise from the input signals and format them for future processing (via the Epsilon Tube filter). Once the BVP and accelerometer signals are fully processed, they along with the EDA and temperature signals can then be analyzed and features can be extracted, which in turn leads to the application of machine learning. The final output is a prediction model to which new data can be fed.

**Figure 1.** Schematic Diagram of the Proposed Processing System for BVP, accelerometer, EDA and temperature signals.

#### *3.1. Preprocessing*

To begin, the raw signals are extracted per patient according to his or her reported wake and sleep times. These are then split into two groups: awake and asleep. For each patient and day, the awake data is paired with the following night's data and ensuing morning's self-assessed level of fatigue and quality of sleep.

Our approach to preprocessing BVP signals consists of a bandpass filter (to remove both the low-frequency components and the high-frequency noise), a wavelet filter (to help reduce motion artifacts while maintaining the underlying rhythm), and Epsilon Tube filtering. In order to least perturb the true BVP signal, we chose the Daubechies mother wavelet of order 2 ('db2') as it closely resembles the periodic shape of the BVP signal. Other wavelets were also considered but ultimately discarded. Once we selected a mother wavelet, we performed an eight-level decomposition of the input BVP signal. By setting threshold values for each level of detail coefficients (Table 1) and using the results to reconstruct the original signal, we were able to significantly reduce the amount of noise present without compromising the measurement integrity of the underlying physiological values. Applying this filter to a number of test cases showed that the threshold values produced consistently useful results regardless of the input, meaning that the thresholds do not need to be tailored to each signal.


**Table 1.** Chosen coefficient thresholds for the 8-level wavelet decomposition.

The accelerometer data was upsampled from 32 Hz to 64 Hz via spline interpolation to match the sampling frequency of the BVP signal. The other signals (temperature and EDA) were left unfiltered. We then use these preprocessed signals as input into our main filtering approach (Epsilon Tube), the output of which is then used for feature extraction (Section 3.2).

After filtering of the BVP signal and interpolation of the accelerometer signal, the Epsilon Tube filter [30] is the final component of the preprocessing stage. As discussed in [30], since the BVP signal (and generally any impedance-plethysmography-based measurement) is very susceptible to motion artifacts, reduction of this noise is a crucial part of the filtering process. This method uses the synchronized accelerometer data to estimate the motion artifact of the BVP signal while leaving the periodic component intact. Let *b<sub>t</sub>* represent the BVP value at time *t*, *A* a matrix whose rows are the accelerometer signals, and **w** the vector of Epsilon Tube filter coefficients. Given the tube radius *ε*, the error of the estimate *y<sub>t</sub>*(*A*, **w**) of *b<sub>t</sub>* is zero if the point *b<sub>t</sub>* falls inside the tube:

$$|b\_t - y\_t(A, \mathbf{w})|\_\epsilon = \max\{0, |b\_t - y\_t(A, \mathbf{w})| - \epsilon\}.$$
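The tube-shaped error above is straightforward to evaluate numerically. The following sketch, with the hypothetical function name `epsilon_insensitive_error`, only illustrates this error term for a single sample; it does not implement the Epsilon Tube filter of [30].

```python
def epsilon_insensitive_error(b_t, y_t, epsilon):
    """Error that is zero inside a tube of radius epsilon around b_t.

    b_t is the observed BVP value, y_t its estimate, and epsilon the
    tube radius; outside the tube the error grows linearly.
    """
    return max(0.0, abs(b_t - y_t) - epsilon)


inside = epsilon_insensitive_error(1.0, 1.05, epsilon=0.1)   # within the tube
outside = epsilon_insensitive_error(1.0, 1.5, epsilon=0.1)   # outside the tube
```

Estimates that deviate from the observation by less than the tube radius incur no penalty, while larger deviations are penalized only by the amount they exceed it.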

The Epsilon Tube filter is formulated as a constrained optimization problem that can be expressed as

$$\min \sum\_{t=0}^{N-1} \zeta\_t + \sum\_{t=0}^{N-1} \zeta\_t^{'} - cR(s, A, \mathbf{w}); \tag{1}$$

subject to

$$\begin{aligned} b\_t - y\_t(A, \mathbf{w}) &\le \epsilon + \zeta\_t & t &= 0, \dots, N - 1; \\ y\_t(A, \mathbf{w}) - b\_t &\le \epsilon + \zeta\_t' & t &= 0, \dots, N - 1; \\ \zeta\_t &\ge 0, \qquad \zeta\_t' \ge 0 & t &= 0, \dots, N - 1; \end{aligned}$$

where *N* is the length of the BVP signal, *ζ<sub>t</sub>* and *ζ′<sub>t</sub>* are slack variables, *R*(*s*, *A*, **w**) is the regularization term and *c* is a designated parameter that adjusts the trade-off between the two objectives. More information about the Epsilon Tube filter can be found in [30]. Taking both the BVP and accelerometer signals as input, the method assumes periodicity in the BVP signal and looks for a period of inactivity at the beginning of the data to use as a template for the rest of the signal. To achieve this, the calmest section of the accelerometer signal (as determined by the longest stretch during which the values never exceed one standard deviation from the mean of the signal) is found. The signal is then shifted so this period of inactivity is at the beginning, and the BVP signal is also shifted to ensure the timestamps remain aligned. The shifted signals are then fed into the Epsilon Tube algorithm, and the resulting output is used for feature extraction.
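The search for the calmest section can be sketched as a single pass over the accelerometer samples. The code below is a hedged illustration rather than the implementation used in this study: the function names are hypothetical, a single accelerometer channel is assumed, the population standard deviation is used, and the shift is assumed to be circular, none of which is specified in the text.

```python
import statistics


def calmest_section(signal):
    """Return (start, length) of the longest run of samples that stay
    within one standard deviation of the signal mean."""
    mu = statistics.mean(signal)
    sd = statistics.pstdev(signal)
    best_start, best_len = 0, 0
    run_start, run_len = 0, 0
    for i, value in enumerate(signal):
        if abs(value - mu) <= sd:
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0
    return best_start, best_len


def shift_to_start(signal, start):
    """Circularly shift a signal so that index `start` becomes index 0."""
    return signal[start:] + signal[:start]


accel = [5.0, 0.1, 0.0, -0.1, 0.1, 0.0, -5.0, 0.0, 0.1]
bvp = [float(i) for i in range(len(accel))]
start, length = calmest_section(accel)
# Apply the same shift to both signals so that their samples stay aligned.
accel_shifted = shift_to_start(accel, start)
bvp_shifted = shift_to_start(bvp, start)
```

Applying the same offset to both signals mirrors the requirement that the timestamps of the BVP and accelerometer data remain aligned after the shift.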

#### *3.2. Feature Extraction*

Once the BVP and accelerometer signals are processed, the full signal set is used for feature extraction. There are 91 features extracted from each of the following signals:


The extracted features are listed in Table 2. These are extracted from both the awake and the sleep signals, resulting in a full feature set consisting of 182 features. When feature selection is performed using Weka's information gain algorithm [31] on the first four subjects, the only feature ranked consistently near the top is the average of the BVP signal after being run through a mid-band bandpass filter.


**Table 2.** The list of features extracted from all signals.

#### **4. Machine Learning: Learning Using Concave and Convex Kernels**

The final step in the analysis pipeline is the creation of a model that can be used to predict the extreme cases of quality of sleep or level of fatigue for people with fibromyalgia. As detailed in Section 5, in addition to testing a number of conventional machine learning methods, we tested a novel supervised machine learning method called Learning Using Concave and Convex Kernels (LUCCK). A key factor in the classification of complex data is the ability of the machine learning algorithm to use vital, feature-specific information to detect subtle and complex patterns of changes in the data. The LUCCK method does this by employing similarity functions (defined below) to capture and quantify a model for each of the features separately. The similarity functions are parametrized so that the concavity or convexity of the function within the feature space can be modified as desired. Once the similarity functions and attendant parameters are chosen, the model uses this information to reweight the importance of each feature proportionally during classification.

#### *4.1. Notation*

In this section, **x** ∈ ℝ<sup>*n*</sup> is a real-valued vector of features such that **x** = (*x*<sub>1</sub>, ..., *x<sub>n</sub>*), and *x<sub>i</sub>* is a real-valued (scalar) feature. Throughout this section, we consider *d* classes, *n* features and *m* (data) samples; the indexes *k* = 1, ..., *d*; *i* = 1, ..., *n*; and *j* = 1, ..., *m* are used for classes, features and samples, respectively. Additionally, *j* = 1, ..., *m<sub>k</sub>* refers to the *m<sub>k</sub>* < *m* samples in class *C<sub>k</sub>*.

#### *4.2. Classification Using a Similarity Function*

An instructive model for comparison to the Learning Using Concave and Convex Kernels method is the *k*-nearest neighbors algorithm [33–35] and weighted *k*-nearest neighbors algorithm [36]. In *k*-nearest neighbors, a test sample **x** is classified by comparing it to the *k* nearest training samples in each class. This can make the classification sensitive to a small subset of samples. Instead, LUCCK classifies test data by comparing it to *all* training data, properly weighted according to their distance to **x**, which is determined by a similarity function. One major difference between LUCCK and weighted *k*-nearest neighbors is that our approach is based on a similarity function that can be highly non-convex. A fat-tailed (relative to a Gaussian) distribution is more realistic for our data, given that there is a small but non-negligible chance that large errors may occur during measurement, resulting in a large deviation in the values of one or more of the features. The LUCCK method allows for large deviations in a few of the features with only a moderate penalty. Methods based on convex notions of similarity or distance (such as the Mahalanobis distance) are unable to deal adequately with such errors.

Suppose that the feature space is comprised of real-valued vectors **x** ∈ ℝ<sup>*n*</sup>. A *similarity function* is a function *Q* : ℝ<sup>*n*</sup> → ℝ that measures the closeness of **x** to the origin, and satisfies the following properties:

1. *Q*(**x**) > 0 for all **x** ∈ ℝ<sup>*n*</sup>;

2. *Q*(**x**) = *Q*(−**x**) for all **x** ∈ ℝ<sup>*n*</sup>;

3. *Q*(*λ***x**) > *Q*(**x**) if **x** ∈ ℝ<sup>*n*</sup> is non-zero and |*λ*| < 1.

The value *Q*(**x** − **y**) measures the closeness between the vectors **x** and **y**. Using the similarity function *Q*(**x**), a classification algorithm can be created as follows:

The set of training data *C* is a subset of ℝ<sup>*n*</sup> and is a disjoint union of *d* classes: *C* = *C*<sub>1</sub> ∪ *C*<sub>2</sub> ∪ ··· ∪ *C<sub>d</sub>*. Let *m* = |*C*| be the cardinality of *C* and define *m<sub>k</sub>* = |*C<sub>k</sub>*| for all *k* so that *m* = *m*<sub>1</sub> + ··· + *m<sub>d</sub>*. To measure the proximity of a feature vector **x** to a set *Y* of training samples, we simply add the contributions of each of the elements in *Y*:

$$R(\mathbf{x}, Y) = \sum\_{\mathbf{y} \in Y} Q(\mathbf{x} - \mathbf{y}). \tag{2}$$

A vector **x** is classified in class *C<sub>k</sub>*, where *k* is chosen such that *R*(**x**, *C<sub>k</sub>*) is maximal. This classification approach can also be interpreted as maximum a posteriori estimation (details can be found in Appendix A).
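The classification rule built on Equation (2) can be sketched as follows. The function names are hypothetical, and `example_similarity` is only an illustrative similarity function chosen because it satisfies the three properties listed above; it is an assumption and not necessarily the function used by the method.

```python
def proximity(x, samples, similarity):
    """R(x, Y): sum of Q(x - y) over all training samples y in Y."""
    return sum(
        similarity([xi - yi for xi, yi in zip(x, y)]) for y in samples
    )


def classify(x, classes, similarity):
    """Return the class label k for which R(x, C_k) is maximal.

    `classes` maps each class label to its list of training vectors.
    """
    return max(classes, key=lambda k: proximity(x, classes[k], similarity))


def example_similarity(v):
    """Illustrative Q: positive, symmetric, and decreasing away from 0."""
    return 1.0 / (1.0 + sum(c * c for c in v))


training = {
    "a": [[0.0, 0.0], [0.1, 0.0]],
    "b": [[5.0, 5.0], [5.2, 4.9]],
}
label_near_a = classify([0.2, 0.1], training, example_similarity)
label_near_b = classify([4.8, 5.1], training, example_similarity)
```

Unlike *k*-nearest neighbors, every training sample contributes to the decision, with its contribution determined by the similarity function.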

#### *4.3. Choosing the Similarity Function*

The function *Q*(**x**) has to be chosen carefully. Let *Q*(**x**) be defined as the product

$$Q(\mathbf{x}) = \prod\_{i=1}^{n} Q\_i(x\_i),\tag{3}$$

where **x** = (*x*<sub>1</sub>, ..., *x<sub>n</sub>*) ∈ ℝ<sup>*n*</sup> and *Q<sub>i</sub>*(*x<sub>i</sub>*) only depends on the *i*-th feature. The function *Q<sub>i</sub>*(*x<sub>i</sub>*) is again a similarity function satisfying the properties *Q<sub>i</sub>*(−*x*) = *Q<sub>i</sub>*(*x*) > 0 for all *x* ∈ ℝ, and *Q<sub>i</sub>*(*x*) > *Q<sub>i</sub>*(*y*) whenever |*x*| < |*y*|. After normalization, the *Q*, *Q*<sub>1</sub>, *Q*<sub>2</sub>, ..., *Q<sub>n</sub>* can be considered as probability density functions. As such, the product formula can be interpreted as instance-wise independence for the comparison of training and test data. In the naive Bayes method, features are assumed to be independent globally [37]. Summing over all instances in the training data allows for features to be independent in our model.

Next we need to choose the functions *Q*<sub>1</sub>, ..., *Q<sub>n</sub>*. One could choose *Q<sub>i</sub>*(*x<sub>i</sub>*) = *e*<sup>−*γ<sub>i</sub>x<sub>i</sub>*<sup>2</sup></sup>, so that

$$Q(\mathbf{x}) = e^{-(\gamma\_1 x\_1^2 + \dots + \gamma\_n x\_n^2)}$$

is a Gaussian kernel function (up to a scalar). However, this does not work well in practice:


Consequently, let

$$Q\_i(x\_i) = (1 + \lambda\_i x\_i^2)^{-\theta\_i},\tag{4}$$

for some parameters *λ<sub>i</sub>*, *θ<sub>i</sub>* > 0. The function *Q<sub>i</sub>*(*x<sub>i</sub>*) can behave similarly to the Cauchy distribution. This function has a "fat tail": as *x* → ∞, the rate at which *Q<sub>i</sub>*(*x<sub>i</sub>*) goes to 0 is much slower than the rate at which *e*<sup>−*γ<sub>i</sub>x*<sup>2</sup></sup> goes to 0. We have

$$Q(\mathbf{x}) = \prod\_{i=1}^{n} (1 + \lambda\_i \mathbf{x}\_i^2)^{-\theta\_i}. \tag{5}$$

The function *Q* has a finite integral if *θ<sub>i</sub>* > 1/2 for all *i*, though this is not required. Three examples of this function can be found in Appendix B.
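The practical effect of the fat tail can be seen by evaluating Equation (5) next to a Gaussian kernel for a feature with a large deviation. The sketch below uses hypothetical function names and arbitrary illustrative parameter values (all parameters set to 1); it is not taken from the experiments.

```python
import math


def lucck_kernel(x, lambdas, thetas):
    """Q(x) = prod_i (1 + lambda_i * x_i**2) ** (-theta_i), Equation (5)."""
    q = 1.0
    for xi, lam, theta in zip(x, lambdas, thetas):
        q *= (1.0 + lam * xi * xi) ** (-theta)
    return q


def gaussian_kernel(x, gammas):
    """Gaussian kernel exp(-(gamma_1 * x_1**2 + ... + gamma_n * x_n**2))."""
    return math.exp(-sum(g * xi * xi for xi, g in zip(x, gammas)))


# One feature deviates strongly (for example, due to a measurement error).
deviation = [10.0]
fat_tail_value = lucck_kernel(deviation, lambdas=[1.0], thetas=[1.0])
gaussian_value = gaussian_kernel(deviation, gammas=[1.0])
```

For this deviation the fat-tailed kernel evaluates to 1/101, whereas the Gaussian kernel evaluates to exp(−100), which is smaller by many orders of magnitude. A single corrupted feature therefore suppresses a sample's contribution far less under Equation (5) than under a Gaussian kernel.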

#### *4.4. Choosing the Parameters*

Values for the parameters *λ*<sub>1</sub>, *λ*<sub>2</sub>, ..., *λ<sub>n</sub>* and *θ*<sub>1</sub>, *θ*<sub>2</sub>, ..., *θ<sub>n</sub>* must be chosen to optimize classification performance. The value of log(*Q<sub>i</sub>*(*x<sub>i</sub>*)) = −*θ<sub>i</sub>* log(1 + *λ<sub>i</sub>x*<sup>2</sup>) is the most sensitive to changes in *x* when

$$\frac{\partial}{\partial x}\log(1+\lambda\_i x^2) = \frac{2\lambda\_i x}{1+\lambda\_i x^2}$$

is maximal. An easy calculation shows that this occurs when *x* = *λ<sub>i</sub>*<sup>−1/2</sup> (the derivative of this expression with respect to *x* is proportional to 1 − *λ<sub>i</sub>x*<sup>2</sup>, which vanishes at this point). Since the value *λ<sub>i</sub>* directly controls the width of the tail of *Q<sub>i</sub>*(*x<sub>i</sub>*), it is reasonable to choose a value for *λ<sub>i</sub>*<sup>−1/2</sup> that is close to the standard deviation of the *i*-th feature. Suppose that the set of training vectors is

$$C = \{ \mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \dots, \mathbf{x}^{(m)} \} \subseteq \mathbb{R}^n,$$

where **x**<sup>(*j*)</sup> = (*x*<sub>1</sub><sup>(*j*)</sup>, ..., *x<sub>n</sub>*<sup>(*j*)</sup>) for all *j*. Let *s* = (*s*<sub>1</sub>, ..., *s<sub>n</sub>*), where

$$s\_i = \text{std}(x\_i^{(1)}, x\_i^{(2)}, \dots, x\_i^{(m)})$$

be the standard deviation of the *i*-th feature. Let

$$
\lambda\_i = \frac{\Lambda}{s\_i^2}
$$

*Entropy* **2019**, *21*, 442

where Λ is some fixed parameter.
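The choice of the scale parameters can be sketched directly from the per-feature standard deviations. The function name `feature_lambdas` and the default value of Λ are hypothetical, the sample standard deviation is assumed, and features with zero variance would need separate handling, which is omitted here.

```python
import statistics


def feature_lambdas(training_vectors, big_lambda=1.0):
    """Compute lambda_i = Lambda / s_i**2 for every feature i.

    s_i is the standard deviation of the i-th feature over all
    training vectors (sample standard deviation is assumed).
    """
    n = len(training_vectors[0])
    lambdas = []
    for i in range(n):
        s_i = statistics.stdev([row[i] for row in training_vectors])
        lambdas.append(big_lambda / (s_i ** 2))
    return lambdas


vectors = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
lambdas = feature_lambdas(vectors)
```

In this example the second feature has ten times the spread of the first, so its scale parameter is smaller by a factor of one hundred, which puts both features on a comparable footing.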

Next we choose the parameters *θ*1, ... , *θn*. We fix a parameter Θ that will be the average value of *θ*1,..., *θn*. If we use only the *i*-th feature, then we define

$$R\_i(\mathbf{x}, Y) = \sum\_{\mathbf{y} \in Y} (1 + \lambda\_i (x\_i - y\_i)^2)^{-\Theta}$$

for any set *Y* of feature vectors. For **x** in the class *C<sub>k</sub>*, the quantity *R<sub>i</sub>*(**x**, *C<sub>k</sub>* \ {**x**})/(*m<sub>k</sub>* − 1) gives the average value of (1 + *λ<sub>i</sub>*(*x<sub>i</sub>* − *y<sub>i</sub>*)<sup>2</sup>)<sup>−Θ</sup> over **y** ∈ *C<sub>k</sub>* \ {**x**}. The quantity *R<sub>i</sub>*(**x**, *C<sub>k</sub>* \ {**x**})/(*m<sub>k</sub>* − 1) − *R<sub>i</sub>*(**x**, *C* \ {**x**})/(*m* − 1) measures how much closer *x<sub>i</sub>* is to samples in the class *C<sub>k</sub>* than to vectors in the set *C* of all feature vectors except **x** itself. This value measures how well the *i*-th feature can classify **x** as lying in *C<sub>k</sub>* as opposed to some other class. If we sum over all **x** ∈ *C* and ensure that the result is non-negative, we obtain

$$\alpha\_{i} = \max\left\{0, \sum\_{k=1}^{d} \sum\_{\mathbf{x} \in C\_{k}} \left( \frac{R\_{i}(\mathbf{x}, C\_{k} \setminus \{\mathbf{x}\})}{m\_{k} - 1} - \frac{R\_{i}(\mathbf{x}, C \setminus \{\mathbf{x}\})}{m - 1} \right) \right\}. \tag{6}$$

The *θ*<sub>1</sub>, ..., *θ<sub>n</sub>* can be chosen so that they have the same ratios as *α*<sub>1</sub>, ..., *α<sub>n</sub>* and sum up to *n*Θ:

$$\theta\_{i} = \frac{n\alpha\_{i}\Theta}{\sum\_{l=1}^{n}\alpha\_{l}}.\tag{7}$$

In terms of complexity, if *n* is the number of features and *m* is the number of training samples, then the complexity of the proposed method would be *O*(*n* × *m*<sup>2</sup>).
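The computation of the feature weights can be sketched as a direct, unoptimized loop over features and pairs of samples, which also makes the quadratic dependence on the number of samples visible. The function names and the toy data are hypothetical. The sketch assumes every class has at least two samples and that at least one weight *α<sub>i</sub>* is positive, and it rescales the exponents so that they sum to *n*Θ as stated in the text.

```python
def r_i(x, samples, i, lam, big_theta):
    """R_i(x, Y): similarity of x to the samples in Y using feature i only."""
    return sum(
        (1.0 + lam * (x[i] - y[i]) ** 2) ** (-big_theta) for y in samples
    )


def feature_alphas(classes, lambdas, big_theta):
    """alpha_i of Equation (6); `classes` is a list of lists of vectors."""
    all_samples = [x for c in classes for x in c]
    m = len(all_samples)
    alphas = []
    for i, lam in enumerate(lambdas):
        total = 0.0
        for c in classes:
            m_k = len(c)
            for x in c:
                # Leave-one-out: x contributes exactly 1 to each sum,
                # so subtracting 1 removes x itself.
                within = r_i(x, c, i, lam, big_theta) - 1.0
                overall = r_i(x, all_samples, i, lam, big_theta) - 1.0
                total += within / (m_k - 1) - overall / (m - 1)
        alphas.append(max(0.0, total))
    return alphas


def feature_thetas(alphas, big_theta):
    """Exponents with the same ratios as alphas, summing to n * Theta."""
    n = len(alphas)
    total = sum(alphas)
    return [n * a * big_theta / total for a in alphas]


# Feature 0 separates the two classes; feature 1 does not.
class_a = [[0.0, 0.0], [0.1, 5.0], [0.2, -5.0]]
class_b = [[10.0, 0.1], [10.1, 4.9], [9.9, -5.1]]
alphas = feature_alphas([class_a, class_b], lambdas=[1.0, 1.0], big_theta=1.0)
thetas = feature_thetas(alphas, big_theta=1.0)
```

In this toy example the uninformative second feature receives a weight of zero, so the whole budget of *n*Θ is assigned to the discriminative first feature.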

#### *4.5. Reweighting the Classes*

Sometimes a disproportionate number of test vectors are classified as belonging to a particular class. In such cases one might get better results after reweighting the classes. The weights *ω*1, *ω*2, ... , *ω<sup>d</sup>* can be chosen so that all are greater than or equal to 1. If *p* is a probability vector, then we can reweight it to a vector

$$W\_{\omega}(p) = (p'\_{1}, \dots, p'\_d)$$

where

$$p\_l' = \frac{\omega\_l p\_l}{\sum\_{k=1}^d \omega\_k p\_k}.$$
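The reweighting of a probability vector amounts to scaling each entry by its class weight and renormalizing. A minimal sketch, with the hypothetical function name `reweight` and illustrative weights:

```python
def reweight(p, weights):
    """Scale each class probability by its weight and renormalise to sum 1."""
    scaled = [w * pk for w, pk in zip(weights, p)]
    total = sum(scaled)
    return [s / total for s in scaled]


# Two equally likely classes; the second class is given three times the weight.
reweighted = reweight([0.5, 0.5], weights=[1.0, 3.0])
```

Only the ratios between the weights matter for the result, since a common factor cancels in the normalization.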

If the output of the algorithm consists of the probability vectors *p*(**x**<sup>(1)</sup>), ..., *p*(**x**<sup>(*m*)</sup>), the algorithm can be modified so that it yields the output *W<sub>ω</sub>*(*p*(**x**<sup>(1)</sup>)), ..., *W<sub>ω</sub>*(*p*(**x**<sup>(*m*)</sup>)). A good choice for the weights *ω*<sub>1</sub>, ..., *ω<sub>d</sub>* can be learned by using a portion of the training data. To determine how well a *training* vector **x** ∈ *C* can be classified using the remaining training vectors in *C* \ {**x**}, we define

$$\widetilde{p}\_k(\mathbf{x}) = \frac{R(\mathbf{x}, C\_k \setminus \{\mathbf{x}\})}{R(\mathbf{x}, C \setminus \{\mathbf{x}\})}.$$

The value *p̃<sub>k</sub>*(**x**) is an estimate for the probability that **x** lies in the class *C<sub>k</sub>*, based on all feature vectors in *C* except **x** itself. We consider the effect of reweighting the probabilities *p̃<sub>k</sub>*(**x**) by

$$\widetilde{p}\_k'(\mathbf{x}) = \frac{\omega\_k \widetilde{p}\_k(\mathbf{x})}{\sum\_{l=1}^d \omega\_l \widetilde{p}\_l(\mathbf{x})}.$$

If **x** lies in the class *Ck*, then the quantity

$$\max \{ \widetilde{p}\_1'(\mathbf{x}), \dots, \widetilde{p}\_d'(\mathbf{x}) \} - \widetilde{p}\_k'(\mathbf{x})$$

measures how badly **x** is misclassified if the reweighting is used. The total amount of misclassification is

$$\sum\_{k=1}^{d} \sum\_{\mathbf{x} \in C\_k} \left( \max \{ \widetilde{p}\_1'(\mathbf{x}), \dots, \widetilde{p}\_d'(\mathbf{x}) \} - \widetilde{p}\_k'(\mathbf{x}) \right) = \sum\_{k=1}^{d} \sum\_{\mathbf{x} \in C\_k} \left( \frac{\max \{ \omega\_1 \widetilde{p}\_1(\mathbf{x}), \dots, \omega\_d \widetilde{p}\_d(\mathbf{x}) \} - \omega\_k \widetilde{p}\_k(\mathbf{x})}{\sum\_{l=1}^{d} \omega\_l \widetilde{p}\_l(\mathbf{x})} \right).$$

We would like to minimize this over all choices of *ω*<sub>1</sub>, ..., *ω<sub>d</sub>* ≥ 1. As this is a highly nonlinear problem, making optimization difficult, we instead minimize

$$\sum\_{k=1}^{d} \sum\_{\mathbf{x} \in C\_k} \left( \max \{ \omega\_1 \widetilde{p}\_1(\mathbf{x}), \dots, \omega\_d \widetilde{p}\_d(\mathbf{x}) \} - \omega\_k \widetilde{p}\_k(\mathbf{x}) \right) = \sum\_{\mathbf{x} \in C} \max \{ \omega\_1 \widetilde{p}\_1(\mathbf{x}), \dots, \omega\_d \widetilde{p}\_d(\mathbf{x}) \} - \sum\_{k=1}^{d} \omega\_k \sum\_{\mathbf{x} \in C\_k} \widetilde{p}\_k(\mathbf{x}).$$

This minimization problem can be solved using linear programming, i.e., by minimizing the quantity

$$\sum\_{j=1}^{m} z^{(j)} - \sum\_{k=1}^{d} \omega\_k \sum\_{\mathbf{x} \in C\_k} \widetilde{p}\_k(\mathbf{x})$$

for the variables *ω*<sub>1</sub>, ..., *ω<sub>d</sub>* and new variables *z*<sup>(1)</sup>, ..., *z*<sup>(*m*)</sup> under the constraints that

$$z^{(j)} \ge \omega\_k \widetilde{p}\_k(\mathbf{x}^{(j)})$$

and

*ω<sub>k</sub>* ≥ 1

for all *k* and *j* with 1 ≤ *k* ≤ *d* and 1 ≤ *j* ≤ *m*.
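The relation between the surrogate objective and its linear-programming form can be checked numerically. The sketch below is only illustrative: the leave-one-out estimates in `p_tilde`, the labels, and the coarse grid search are hypothetical stand-ins (a real implementation would pass the same objective and constraints to an LP solver). It shows that, for fixed weights, choosing each *z*<sup>(*j*)</sup> as small as the constraints allow makes the LP objective equal to the surrogate loss.

```python
from itertools import product


def surrogate_loss(p_tilde, labels, omega):
    """Sum over samples of max_k(omega_k * p_k(x)) - omega_label * p_label(x)."""
    total = 0.0
    for probs, k in zip(p_tilde, labels):
        weighted = [w * q for w, q in zip(omega, probs)]
        total += max(weighted) - weighted[k]
    return total


def tightest_z(p_tilde, omega):
    """Smallest z_j satisfying z_j >= omega_k * p_k(x_j) for every class k."""
    return [max(w * q for w, q in zip(omega, probs)) for probs in p_tilde]


def lp_objective(p_tilde, labels, omega, z):
    """LP objective: sum_j z_j - sum_k omega_k * sum_{x in C_k} p_k(x)."""
    return sum(z) - sum(omega[k] * probs[k] for probs, k in zip(p_tilde, labels))


# Hypothetical leave-one-out estimates for four training vectors, two classes.
p_tilde = [[0.7, 0.3], [0.6, 0.4], [0.55, 0.45], [0.2, 0.8]]
labels = [0, 0, 1, 1]  # index of the true class of each vector

baseline = surrogate_loss(p_tilde, labels, [1.0, 1.0])

# Coarse grid search over weights >= 1 (assumption: stands in for an LP solver).
grid = [1.0, 1.25, 1.5, 1.75, 2.0]
best_omega = min(product(grid, repeat=2),
                 key=lambda om: surrogate_loss(p_tilde, labels, om))
best_loss = surrogate_loss(p_tilde, labels, best_omega)
z = tightest_z(p_tilde, best_omega)
```

In this toy example the third vector is misclassified with uniform weights, and a larger weight on the second class removes that error.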

#### **5. Experiments**

In this section, the performance of LUCCK is first compared with other common machine learning methods using four conventional datasets, after which its performance on the fibromyalgia dataset is evaluated.

#### *5.1. UCI Machine Learning Repository*

In this set of experiments, LUCCK is compared to some well-known classification methods on a number of datasets downloaded from the University of California, Irvine (UCI) Machine Learning Repository [38]. Each method was tested on each dataset using 10-fold cross-validation, with the average performance and execution time across all folds provided in Table 3. Table 4 contains the average values for accuracy and time across all four datasets.


**Table 3.** Comparison of our proposed method (LUCCK) with other machine learning methods in terms of accuracy and running time, averaged over 10 folds.

**Table 4.** Model accuracy with standard deviation and execution time for each model, averaged across the four UCI datasets.


#### *5.2. Fibromyalgia Dataset*

In this study, we have created a model that can be used to predict the quality of sleep or level of fatigue for people with fibromyalgia. The labels are self-assessed scores ranging from 1 to 10. Attempts to develop a regression model showed less promise than the results from a binary split. The most likely reason for this failure of the linear regression model is the nature of self-reported scores, especially those related to patient assessment of their level of pain. This is primarily due to the differences in individual levels of pain tolerance. In previous studies [39,40], proponents of neural "biomarkers" argued that self-reported scores are unreliable, making objective markers of pain imperative. In another study [24], self-reported scores were found to be reliable only for extreme cases of pain and fatigue. Consequently, in this study, binary classification of extreme cases of fatigue and poor sleep is investigated. In this situation, a cutoff value is selected: patients that chose a value less than the threshold are placed in one group, while those that chose a value above the threshold are placed in another. As such, the values >8 are chosen for extreme cases of fatigue, and the values <4 are chosen for extreme cases of poor sleep quality. In this way, binary classifications are possible (>8 vs. <8 for fatigue and >4 vs. <4 for sleep). Using the extracted feature set, machine learning algorithms are applied and tested using 10-fold cross-validation. This is done in such a way as to prevent the data from any one patient being in multiple folds: all of a given patient's data are included entirely in a single fold. In addition, in order to address possibly imbalanced data during fold creation, random undersampling is performed to ensure the ratio between the two classes is not less than 0.3 (this rate is chosen since the extreme cases are at most 30 percent of the [1,10] interval of self-reported scores). This prevents the methods from developing a bias towards the larger class.
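The evaluation protocol described above (binary labels from a cutoff, patient-wise folds, and random undersampling to a class ratio of at least 0.3) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the function names are hypothetical, and treating a score equal to the cutoff as the non-extreme class is an assumption, since the text does not say how ties are handled.

```python
import random
from collections import defaultdict


def binarize(score, cutoff):
    """Label 1 if the self-reported score is above the cutoff, else 0.

    Assumption: a score equal to the cutoff falls in class 0.
    """
    return 1 if score > cutoff else 0


def patient_folds(patient_ids, n_folds=10, seed=0):
    """Assign every sample to a fold so that no patient spans two folds."""
    patients = sorted(set(patient_ids))
    random.Random(seed).shuffle(patients)
    fold_of_patient = {pid: i % n_folds for i, pid in enumerate(patients)}
    return [fold_of_patient[pid] for pid in patient_ids]


def undersample(indices, labels, min_ratio=0.3, seed=0):
    """Drop majority-class samples at random until minority/majority >= min_ratio."""
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    if len(by_class) < 2:
        return sorted(indices)
    minority, majority = sorted(by_class.values(), key=len)
    max_majority = int(len(minority) / min_ratio)
    if len(majority) > max_majority:
        majority = random.Random(seed).sample(majority, max_majority)
    return sorted(minority + majority)


# Hypothetical example: eight samples from four patients, two folds.
ids = ["a", "a", "b", "b", "c", "c", "c", "d"]
folds = patient_folds(ids, n_folds=2)

# Hypothetical imbalanced labels: 2 extreme cases among 20 samples.
labels = [1, 1] + [0] * 18
kept = undersample(list(range(20)), labels)
```

Splitting by patient rather than by sample avoids leakage between training and test folds when several recordings come from the same person.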

#### 5.2.1. Results with Conventional Machine Learning Methods

A number of conventional machine learning models listed in Table 5 were applied to the extracted data in this study. As can be seen, many major machine learning methods were tested. For each of these methods, various configurations were tested, and the best sets of parameters were chosen using cross-validation (hyperparameter optimization). For instance, we used the combination of AdaBoost with different types of standard methods such as Decision Stump and Random Forest in order to explore the possibility of improving the performance of these methods via boosting. The *k*-nearest neighbor method with *k* = 7 was used in this experiment. For the weighted *k*-nearest neighbor method [36], the inversion kernel (inverse distance weights) with *k* = 7 resulted in the best performance. For the Neural Network algorithm, the Weka (Waikato Environment for Knowledge Analysis) [41] multilayer perceptron with two hidden layers was used. The results of using these machine learning approaches for prediction of extreme sleep quality (cutoff of 4) and fatigue level (cutoff of 8) are presented in Table 5. As shown in this table, the AdaBoost method based on random forest yielded the best results for quality of sleep (based on area under the receiver operating characteristic curve, or AUROC). For level of fatigue, the neural network was the best performing model.

#### 5.2.2. Results with Our Machine Learning Method: Machine Learning Using Concave and Convex Kernels

In addition to the aforementioned conventional methods, we also used our machine learning approach, which resulted in superior performance compared to the standard machine learning methods discussed above. Recall that in the Learning Using Concave and Convex Kernels algorithm, test data is classified by comparing it to all training data, properly weighted according to information extracted from each of the features (see Section 4 for further details). The results of applying our method to fibromyalgia are presented in Table 5, with cutoff values of 4 and 8 for quality of sleep and level of fatigue, respectively. As can be seen, LUCCK was able to vastly outperform other models on the fatigue outcome; however, the improvement on sleep outcome was not significant. This disparity is likely due to the different feature spaces for the sleep and fatigue outcomes. In general, the feature space for fatigue is significantly more dispersed, because there are more samples (during daytime) and because daytime activity negatively affects the signal quality, increasing dispersion. In contrast, signals (and their associated features) recorded during sleep are of better quality. This leads to the better prediction result for sleep in all methods used. Our proposed LUCCK algorithm can ameliorate the nature of the fatigue feature space, as it is specifically designed to reduce the effect of training data for which there is a large deviation from test data, which explains its advantage on the fatigue outcome. We should note that while the cohort size in this study seems to be limited, the continuous recording of physiological signals for seven days and nights created a comprehensive dataset. Additionally, similar to *k*-NN and its weighted version (and unlike SVM and neural network models), LUCCK can be trained even with few samples, which is one advantage of the proposed algorithm.


**Table 5.** Results of conventional machine learning methods.

#### **6. Conclusions and Discussion**

In this study we primarily focused on prediction of the extreme cases of fatigue and poor sleep. As such, we have created preprocessing/conditioning methods that have the ability to improve the quality of parts of the signals with low quality due to motion artifact and noise. In addition, we identified a set of mathematical features that are important in extracting patterns from physiological signals that can distinguish poor and good clinical outcomes for applications such as fibromyalgia. Additionally, we showed that our proposed machine learning method outperformed the standard methods in predicting the outcomes such as fatigue and sleep quality. Generally, our proposed framework (preprocessing, mathematical features, and proposed machine learning method) can be employed in any study that involves prediction using BVP, HRV and EDA signals.

**Author Contributions:** Conceptualization, H.D.; Data curation, C.B. and S.A.; Formal analysis, E.S.; Funding acquisition, H.I. and K.N.; Supervision, A.K. and K.N.; Writing—original draft, E.S.; Writing—review & editing, J.G. and K.N.

**Funding:** This research was funded by Care Progress LLC through National Science Foundation grant number 1562254.

**Conflicts of Interest:** The authors declare no conflict of interest.

**Patents:** The epsilon tube filter is covered by US Patent 10,034,638, for which Kayvan Najarian is a named inventor.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **Appendix A. Classification as Maximum a Posteriori Estimation**

The classification approach suggested in Section 4.2 can also be viewed in terms of probability density functions. Suppose that ∫<sub>ℝ<sup>*n*</sup></sub> *Q*(**x**) d**x** = *e* with 0 < *e* < ∞. The function

$$f\_{\mathbb{C}}(\mathbf{x}) = \frac{R(\mathbf{x}, \mathbb{C})}{m\mathfrak{e}} = (m\mathfrak{e})^{-1} \sum\_{\mathbf{y} \in \mathbb{C}} Q(\mathbf{x} - \mathbf{y}) \tag{A1}$$

is therefore a probability density function. This probability density function is an estimation for the probability distribution from which the training data were taken.

We have

$$f\_{\mathbb{C}} = p(\mathbb{C}\_1)f\_{\mathbb{C}\_1} + \dots + p(\mathbb{C}\_d)f\_{\mathbb{C}\_d}$$

where

$$f\_{\mathbb{C}\_k}(\mathbf{x}) = \frac{R(\mathbf{x}, \mathbb{C}\_k)}{m\_k \mathfrak{e}} = (m\_k \mathfrak{e})^{-1} \sum\_{\mathbf{y} \in \mathbb{C}\_k} Q(\mathbf{x} - \mathbf{y}) \tag{A2}$$

is a probability density function for the training data in class *C<sub>k</sub>* for *k* = 1, 2, ..., *d*, and *p*(*C<sub>k</sub>*) := *m<sub>k</sub>*/*m* is the probability that a randomly chosen training vector lies in *C<sub>k</sub>*. *f<sub>C</sub>* can be considered as a mixture of the probability density functions *f*<sub>*C*<sub>1</sub></sub>, ..., *f*<sub>*C<sub>d</sub>*</sub>. Suppose that **x** ∈ ℝ<sup>*n*</sup> is taken from the distribution *f*<sub>*C<sub>k</sub>*</sub> with probability *p*(*C<sub>k</sub>*); then the distribution for **x** is *f<sub>C</sub>*. Given the outcome **x**, the probability that it was taken from the distribution *f*<sub>*C<sub>k</sub>*</sub> is

$$p\_k(\mathbf{x}) := \frac{p(\mathbf{C}\_k) f\_{\mathbf{C}\_k}(\mathbf{x})}{f\_{\mathbf{C}}(\mathbf{x})} = \frac{R(\mathbf{x}, \mathbf{C}\_k)}{R(\mathbf{x}, \mathbf{C})}.$$

This shows that the classifying scheme is the maximum a posteriori estimation. Instead of classifying a feature vector, the probability vector

$$p(\mathbf{x}) = (p\_1(\mathbf{x}), p\_2(\mathbf{x}), \dots, p\_d(\mathbf{x})),$$

can be given as output. The formula

$$p\_k(\mathbf{x}) = \frac{\mathcal{R}(\mathbf{x}, \mathbf{C}\_k)}{\mathcal{R}(\mathbf{x}, \mathbf{C})}$$

is well-formed, even if *Q*(**x**) does not have a finite integral, which may be the case in some examples.
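The posterior formula above can be sketched in a few lines of Python. The product form of the kernel used here is an assumption modelled on the examples in Appendix B; the general definition of *Q* is given in Section 4. The function names and the toy training vectors are hypothetical.

```python
def kernel(x, lambdas, thetas):
    """Assumed product kernel Q(x) = prod_i (1 + lambda_i * x_i**2) ** (-theta_i)."""
    q = 1.0
    for xi, lam, theta in zip(x, lambdas, thetas):
        q *= (1.0 + lam * xi * xi) ** (-theta)
    return q


def closeness(x, vectors, lambdas, thetas):
    """R(x, C) = sum over y in C of Q(x - y), as in Equation (A1)."""
    return sum(
        kernel([a - b for a, b in zip(x, y)], lambdas, thetas) for y in vectors
    )


def posterior(x, classes, lambdas, thetas):
    """Return [p_1(x), ..., p_d(x)] with p_k(x) = R(x, C_k) / R(x, C)."""
    per_class = [closeness(x, c, lambdas, thetas) for c in classes]
    total = sum(per_class)
    return [r / total for r in per_class]


# Hypothetical two-class training set in the plane.
classes = [
    [(0.0, 0.0), (0.5, 0.2)],  # C_1
    [(3.0, 3.0), (3.5, 2.5)],  # C_2
]
probs = posterior((0.2, 0.1), classes, lambdas=(1.0, 1.0), thetas=(1.0, 1.0))
predicted = probs.index(max(probs))  # maximum a posteriori class index
```

Because only the ratio of the two sums is used, the normalizing constant of *Q* never has to be computed, which is why the formula remains usable when *Q* has no finite integral.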

#### **Appendix B. Examples**

**Example A1.** *Suppose that there is only one feature, i.e., n* = 1*, then Q*(*x*) *can be defined as*

$$Q(x) = (1 + \lambda\_1 x^2)^{-\theta\_1},$$

*whose graph at various values of θ and λ is depicted in Figure A1:*

**Figure A1.** *Q*(*x*) = (1 + *λ*<sub>1</sub>*x*<sup>2</sup>)<sup>−1/*λ*<sub>1</sub></sup> for *λ*<sub>1</sub> = 0.4, 0.8, ..., 4 (blue curves) and *λ*<sub>1</sub> = 0 (red curve).

*As λ*<sub>1</sub> *goes to zero, the function converges to the normal distribution e*<sup>−*x*<sup>2</sup></sup> *(the red curve in Figure A1).*
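The limit stated above follows from ln *Q*(*x*) = −(1/*λ*<sub>1</sub>) ln(1 + *λ*<sub>1</sub>*x*<sup>2</sup>) → −*x*<sup>2</sup> as *λ*<sub>1</sub> → 0. A short numerical check, with an arbitrarily chosen evaluation point and illustrative values of *λ*<sub>1</sub>, might look as follows.

```python
import math


def q(x, lam):
    """Q(x) = (1 + lam * x**2) ** (-1 / lam), the family plotted in Figure A1."""
    return (1.0 + lam * x * x) ** (-1.0 / lam)


x = 1.5  # arbitrary evaluation point
target = math.exp(-x * x)

# Absolute error against exp(-x**2) for decreasing lambda.
errors = [abs(q(x, lam) - target) for lam in (0.4, 0.04, 0.004, 0.0004)]
```

The errors shrink as *λ*<sub>1</sub> decreases, in line with the red limiting curve in Figure A1.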

**Example A2.** *Suppose that n* = 2*, then Q*(**x**) *is defined as*

$$Q(x\_1, x\_2) = (1 + x\_1^2)^{-1} (1 + x\_2^2)^{-1},$$

*with θ*<sup>1</sup> = *θ*<sup>2</sup> = *λ*<sup>1</sup> = *λ*<sup>2</sup> = 1*. Q*(**x**) *is depicted in Figure A2 at various level curves for Q*(**x**) = *α, with* 0 < *α* < 1 *.*

*The equation Q*(*x*<sub>1</sub>, *x*<sub>2</sub>) = *α defines a closed curve. Such a curve can be thought of as the set of all points that have a given distance to the origin. We observe that for α* ≥ 1/4*, the neighborhood*

$$\{\mathbf{x} \in \mathbb{R}^2 \mid Q(\mathbf{x}) > \alpha\}$$

*of the origin is convex, but for α* < 1/4 *it is not.*

**Figure A2.** *Q*(**x**) = (1 + *x*<sub>1</sub><sup>2</sup>)<sup>−1</sup>(1 + *x*<sub>2</sub><sup>2</sup>)<sup>−1</sup> = *α* with 0 < *α* < 1.

**Example A3.** *Consider the case when n* = 2 *and θ*<sub>1</sub> = 1*, θ*<sub>2</sub> = 2*, λ*<sub>1</sub> = 1 *and λ*<sub>2</sub> = 1/2*; then Q*(**x**) *is defined as*

$$Q(\mathbf{x}\_1, \mathbf{x}\_2) = (1 + 2\mathbf{x}\_1^2)^{-\frac{1}{2}} (1 + \mathbf{x}\_2^2)^{-1}.$$

*Q*(**x**) *is depicted in Figure A3 at various level curves for Q*(**x**) = *α, with* 0 < *α* < 1 *.*

**Figure A3.** *Q*(**x**) = (1 + 2*x*<sub>1</sub><sup>2</sup>)<sup>−1/2</sup>(1 + *x*<sub>2</sub><sup>2</sup>)<sup>−1</sup> = *α* with 0 < *α* < 1.

*For small values of* **x***, the function Q is equally sensitive to x*<sub>1</sub> *and x*<sub>2</sub>*. However, if* **x** *is large, then Q is more sensitive to x*<sub>2</sub>*.*

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Action Recognition Using Single-Pixel Time-of-Flight Detection**

**Ikechukwu Ofodile 1,†, Ahmed Helmi 1,†, Albert Clapés 2,†, Egils Avots 1,†, Kerttu Maria Peensoo 3,†, Sandhra-Mirella Valdma 3,†, Andreas Valdmann 3,†, Heli Valtna-Lukner 3,†, Sergey Omelkov 3,†, Sergio Escalera 2,4,†, Cagri Ozcinar 5,† and Gholamreza Anbarjafari 1,6,7,\*,†**


Received: 14 January 2019; Accepted: 15 April 2019; Published: 18 April 2019

**Abstract:** Action recognition is a challenging task that plays an important role in many robotic systems, which highly depend on visual input feeds. However, due to privacy concerns, it is important to find a method which can recognise actions without using a visual feed. In this paper, we propose a concept for detecting actions while preserving the test subject's privacy. Our proposed method relies only on recording the temporal evolution of light pulses scattered back from the scene. The data trace recorded for one action contains a sequence of one-dimensional arrays of voltage values acquired by a single-pixel detector at 1 GHz repetition rate. Information about both the distance to the object and its shape is embedded in the traces. We apply machine learning in the form of recurrent neural networks for data analysis and demonstrate successful action recognition. The experimental results show that our proposed method could achieve on average 96.47% accuracy on the actions walking forward, walking backwards, sitting down, standing up and waving hand, using a recurrent neural network.

**Keywords:** single pixel single photon image acquisition; time-of-flight; action recognition

#### **1. Introduction**

Action is a spatiotemporal sequence of patterns [1–6]. The ability to detect movement and recognise human actions and gestures would enable advanced human-machine interaction in a wide range of novel applications in the field of robotics, from autonomous vehicles and surveillance for security or care-taking to entertainment.

In the field of machine vision, the majority of effort has been put into recognising human action from video sequences [7–9], because imaging devices overwhelmingly mimic human-like perception of the surroundings and the video format is the most widely available. Videos are a sequence of two-dimensional intensity patterns, captured by using an imaging lens projecting the scene to a two-dimensional detector array (a charge-coupled device (CCD), for example). Unlike living creatures, the ever-growing field of robotics has run into major difficulties while trying to recognise objects, their actions, and their distances from two-dimensional images. Processing the data is computationally demanding, and depth information is not unambiguously retrievable.

Deep neural networks, due to their high accuracy, are widely used in many computer vision applications such as emotion recognition [10–16], biometric recognition [17–20], personality analysis [21,22], and activity analysis [5,23,24]. Depending on the nature of the data, different structures can be used [25,26]. In this work, we deal with time-series data, i.e., we handle temporal information. For this purpose, we mainly focus on recurrent neural networks (RNN) and long short-term memory (LSTM) algorithms.

In addition to colour and intensity, incident light can be characterised by its propagation direction(s), spectral content and temporal evolution in the case of pulsed illumination. Light also carries information about its source, and each medium, refraction, reflection and scattering event it has encountered or traversed. This enables various uncommon ways to characterise the scene. The rapid advancements in optoelectronics and availability of sufficient computational power enable innovative imaging and light capturing concepts, which serve as the ground for action detection. For example, a detector, capable of registering evolution of backscattered light with a high temporal resolution in a wide dynamic range, would be able to detect even objects hidden from the direct line of sight [27–29]. Along the same vein, several alternative light-based methods have been developed for resolving depth information or 3D map of the surroundings (some examples can be found in [30–34]) giving 3D information in voxel format about the scene, which is also suitable for action detection [35]. Combining the fundamental understanding of light propagation and computational neural networks for the data reconstruction, it appears that objects or even persons can be detected using a single pixel detector registering temporal evolution of the back-scattered light pulse [36].

In this work, which is a feasibility study of a novel setup and methodology for conducting action recognition, we propose and demonstrate an action recognition scheme based on single-pixel direct time-of-flight detection. We use NAO robots in a controlled environment as test subjects. We illuminate a scene with a diverging picosecond laser pulse (30 ps duration) and detect the temporal evolution of back-scattered light with a single-pixel multiphoton detector of 600 ps temporal resolution. Our data contains one-dimensional time sequences presenting the signal strength (proportional to the number of detected photons) versus arrival time. Information about both the distance to the object and its shape is embedded in the traces. We apply machine learning in the form of recurrent neural networks for data analysis and demonstrate successful action recognition.

The following list summarises the contributions of our work:


The rest of the paper is organised as follows: in Section 2, related works to single-pixel, single-photon acquisition and action recognition are reviewed. Section 3 describes the data collection and the details of the setup. In Section 4, the details of the proposed deep neural network algorithm used for action recognition are described. The experimental results and discussions are provided in Section 5. Finally, the work is concluded in Section 6.

#### **2. Related Work**

The field of motion analysis was first inspired by intensity images and has progressed towards depth images, which are more robust in comparison to intensity images. In the case of action recognition, the most useful sensors are those that provide a depth map. Nevertheless, data are processed to extract human silhouettes, body parts, skeleton and pose of the person, which in turn are used as features for machine learning methods to classify actions. These sensors have drawn much interest for human activity related research and software development.

#### *2.1. Depth Sensors*

Depth images provide the 3D structure of the scene, and can significantly simplify tasks such as background subtraction, segmentation, and motion estimation. With the recent advances in depth sensor hardware, such as time-of-flight (ToF) cameras, research based on depth imagery has appeared. Three main depth sensing technologies are applied in computer vision research: stereo cameras, time-of-flight (ToF) cameras and structured light.


Similar to ToF depth cameras, action can be encoded in a laser pulse, which is captured by single-pixel cameras. The contents of the scene are encoded in time-series data. When using single-pixel camera setups, processing steps such as pose estimation are not necessary. The acquired time-series data are usable for machine learning tasks without any modification or additional processing.

#### *2.2. Single-Pixel Single-Photon Acquisition*

Recent advances in photonics offer various innovative approaches for three-dimensional imaging [42]. Among those is time-of-flight imaging, which enables detection and tracking of objects. This involves illuminating the scene with diverging light pulses shorter than 100 ps. The light is scattered back from the scene and its flight time is detected with respective accuracy. Flight time *t* of light multiplied by the speed *c* of light directly gives the distance the light pulse has travelled from the source to the detector. Often, the laser source and detector are nearby and the value *ct* equals twice the distance of the object. Compared to time-of-flight ranging used in LiDARs (Light Detection And Ranging device), the principles introduced here utilise the knowledge of light propagation and are potentially capable of achieving higher spatial resolution.
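The relation between flight time and distance described above can be made concrete with a small calculation. The sketch below assumes a co-located source and detector, so that *ct* is twice the object distance; the 8 ns flight time is a hypothetical value chosen for illustration, and the function name is not from the original work.

```python
C_M_PER_S = 299_792_458.0  # speed of light in vacuum, in metres per second


def object_distance(flight_time_s):
    """Distance to the object for a co-located source and detector: c * t / 2."""
    return C_M_PER_S * flight_time_s / 2.0


d = object_distance(8e-9)             # hypothetical 8 ns round-trip time
pulse_extent = object_distance(100e-12)  # range spanned by a 100 ps pulse
```

An 8 ns round trip corresponds to an object roughly 1.2 m away, and a 100 ps pulse spans about 1.5 cm in range, which indicates why sub-100 ps pulses are of interest for resolving small scenes.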

In early experiments, a mode-locked Ti:Sapphire near-infrared (NIR) laser with 50 fs pulse duration and a streak camera of 15 ps temporal resolution with an array matrix were used to detect movement in an occluded environment or to recover the 3D shape of an object outside the direct line of sight [28,43]. (Using light pulses as short as 50 fs was not necessary; this is simply a widespread ultrashort-pulse laser source available in photonics labs.) The reconstruction of the object shape required data traces from various viewing angles and mathematical back-projection. In the scope of the current research, the non-line-of-sight illumination can be seen as a method of efficiently diverging the incident laser pulse on the scene. There have been several suggestions to use more widely accessible hardware by replacing the expensive and fragile streak camera with a single-photon avalanche diode (SPAD) [29], or to construct a setup based on modulated laser diodes and a single-pixel photonic mixer device [44].

In a proof-of-principle experiment [45], a single-pixel SPAD detector (the actual 32 × 32 pixels were used for statistics and to speed up the measurements; such a device was an early prototype at the time) with ca. 50 ps temporal resolution was used to demonstrate the ability to detect linear movement of a non-line-of-sight object. Again, an ultrashort Ti:Sapphire laser with 10 fs pulse duration and carrier wavelength in the NIR region was used. Instead of recording the shape of the object, the shape of its reflection on a screen (a floor) was recorded and the position of the object was derived from geometry. By replacing the detector array with three single-pixel SPAD detectors, real-time movement of an object was traced [46]. In this experiment, a pulsed NIR diode was used instead of the Ti:Sapphire laser. The integration time for the single-photon detector was reduced from approximately 3 s to 1 s. In subsequent papers, the table-top scenes were scaled up to detect a human [36,47]. The significance of the solution presented in [36] lies in its use of artificial neural network machine learning algorithms for data analysis instead of the deterministic tools used before. As a result, the team led by Daniele Faccio was able to distinguish between several standing positions of a human and between three different persons by analysing merely the one-dimensional trace of a SPAD detector.

#### *2.3. Action Recognition*

Most action recognition and monitoring systems use images with high enough quality where a person can be identified. When considering commercial applications, such systems invade human privacy. The identification factor can be removed by blurring or obscuring the images, downscaling, using encryption and IT solutions to keep the stored data safe. Nevertheless, at some point data is available in a format where people can be identified and can be mishandled due to breach of security, selling private data for commercial purposes or by request from governmental authorities.

One way of removing the privacy concerns is to use devices which by default use low-resolution images, hence eliminating privacy issues at the data acquisition step. For this purpose, researchers are developing methods for action recognition using single-pixel and low-resolution cameras. A privacy-preserving method was proposed by Jia and Radke [48] to track a person and estimate the pose of a person using a network of ceiling-mounted time-of-flight sensors. Tao et al. [49] based their solution on a network of ceiling-mounted binary passive infrared sensors to recognise a set of daily activities. Kawashima et al. [50] used extremely low-resolution (16 × 16 pixels) infrared sensors to monitor a person constantly day and night without privacy concerns. Ji Dai et al. [51] studied the privacy implications using virtual space for action recognition. They studied Kinect 2 resolutions from 100 × 100 pixels down to 1 × 1 and their effect on action recognition methods. To address privacy issues, Xu et al. [52] proposed a fully-coupled two-stream spatiotemporal architecture for reliable human action recognition on extremely low-resolution (e.g., 12 × 16 pixel) videos.

In this research work, we develop a new methodology for action recognition without using any data which can raise a privacy issue. Such a system can be especially useful in places such as nurseries and hospitals, where recognition of actions might be important without violating the privacy rights of people in the environment.

#### *2.4. Data Interpretability*

In devices such as Kinect, the depth map provides enough information about a person's body shape, height, and facial features to visually identify the person and his/her actions in the scene. In the proposed experimental setup, we recorded a kind of depth map, but it was recorded with a single-pixel detector. Hence, the trace has no spatial resolution that would enable identifying a person or an object directly by detecting the above-mentioned properties. The spatial properties of the scene are imprinted into the temporal evolution of the recorded trace. When an action takes place, a characteristic temporal evolution pattern is imprinted on the recorded trace. The recorded 1D time series containing the temporal evolution of back-scattered light (timestamped detected photon amplitudes) is enough to recognise human actions when interpreted using machine learning algorithms. In the case of a static scene, there is no change in the consecutive temporal traces, indicating that no actions are taking place. In addition, the data footprint of a 1D data trace is smaller than that of a depth map. This enables rapid processing times. In the case of using Kinect, the data processing pipeline contains human-interpretable data that could be used for unlawful purposes, but in the proposed setup no such possibility exists.

#### **3. Collected Data**

In this research, we created a special setup for data collection. Figure 1 shows the general data collection setup, including the placement of the laser and the detector sensor. The data were collected in a controlled environment where a NAO V4 humanoid robot was placed in a black box with dimensions of 800 × 800 × 1200 mm<sup>3</sup> (W × H × L) and was used to conduct some pre-defined actions. The scene was illuminated by a Fianium supercontinuum laser source (SC400-2-PP) working at 1 MHz rate. The scatterer ensured that the whole scene was illuminated at once, without any scanning or other moving parts required. The reflected light from the scene was collected by a Hamamatsu R10467U-06 hybrid photodetector (HPD) with a spectral sensitivity range of 220–650 nm. A neutral density filter (OD2) was used in front of the detector to prevent HPD damage due to overexposure. The signal from the HPD was boosted by a Hamamatsu C10778 preamplifier (37 dB, inverting) and then directly digitised by a LeCroy WaveRunner 6100a (1 GHz, 10 Gs/s) oscilloscope. The HPD was used in a pulse current (multiphoton) mode; therefore, the time resolution of the system was determined by its single-photon pulse response of 600 ps FWHM. The oscilloscope worked in a sequence acquisition mode, recording 200 traces from subsequent trigger events during one sequence, with an average frame rate of five sequences per second. The traces within one sequence were averaged to improve the signal-to-noise ratio, which is limited by both electronic noise and photon statistics. The usage of multiphoton detection mode allowed greatly reducing the acquisition time per frame, although with a lower time resolution, unlike the single-photon detection used in [46]. The oscilloscope traces, in the form of reflected light intensity versus time on a nanosecond scale, contained all the relevant information about the scene in a non-human-readable form, thus preserving privacy. The series of such traces recorded at 5 fps therefore contains the information about motion.
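The per-sequence averaging described above (200 traces averaged into one frame) can be sketched as follows. The clean signal, the noise level, and the function name are hypothetical and only illustrate the effect of averaging on uncorrelated noise; they do not reproduce the recorded data.

```python
import random


def average_traces(traces):
    """Average equal-length traces sample by sample."""
    if not traces:
        raise ValueError("need at least one trace")
    n = len(traces[0])
    if any(len(t) != n for t in traces):
        raise ValueError("all traces must have the same length")
    return [sum(t[i] for t in traces) / len(traces) for i in range(n)]


# Hypothetical clean trace and 200 noisy repetitions of it.
rng = random.Random(0)
signal = [0.0, 0.0, 1.0, 0.5, 0.0]
noisy = [[s + rng.gauss(0.0, 0.2) for s in signal] for _ in range(200)]

frame = average_traces(noisy)

# Root-mean-square deviation of the averaged frame from the clean trace.
rms_error = (sum((a - b) ** 2 for a, b in zip(frame, signal)) / len(signal)) ** 0.5
```

For uncorrelated noise, averaging *N* traces reduces the noise standard deviation by a factor of about √*N*, so 200 traces give roughly a 14-fold reduction.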

Various experiments were performed using one- and two-robot setups. A short summary can be seen in Table 1. A more detailed description of the tasks can be found in the following sections.


**Table 1.** Summary of the performed actions.

**Figure 1.** The data collection setup: the Fianium laser delivers light pulses of 30 ps duration. The collimated laser beam is directed to a scatterer, which creates a divergent speckle pattern (with a 40 degree apex angle) that is directed into the black box specially designed for the robot. Scattering illumination reduces potential interference effects at the detector, and a controlled speckle pattern could be used to increase the lateral resolution. The light scattered from the moving object (NAO V4) and the walls is detected using a single-pixel hybrid photodetector (HPD), which records the temporal evolution of the back-scattered light.

#### *3.1. One-Robot Setup*

Initially, for acquiring training data, only one robot was used. Experiments were divided into the following categories:


**Table 2.** Forward (F) movement.





**Figure 2.** Start and endpoints, showing paths of the robot during directional walk.

**Figure 3.** Positions of sitting down and standing up actions.

**Table 4.** Tasks performed in specific locations.




\* In Task 6, the robot did not go forward or reverse, but performed hand-wave, stand-up and sit-down actions, where each action was repeated 12 times.

**Figure 4.** Positions of object during various robot actions.

#### *3.2. Two-Robot Setup*

We also devised a new setup with two NAO V4 humanoid robots. Firstly, one robot stood still at Position 1 while the other robot, at Position 2, walked forward and in reverse to Positions 3 and 4, as shown in Figure 5. This action was repeated 10 times. In the next experiment, both robots performed actions simultaneously. The performed actions are listed in Table 6.

**Figure 5.** Position of two robots during actions.



In Figure 6, we illustrate a few examples of the preprocessed data used in training. Columns correspond to different actions and rows to different examples. Each sequence consisted of a time series of 500-dimensional vectors.

**Figure 6.** Visualisation of the traces throughout time (x-axis). Columns correspond to different actions (respectively, forward walking, reverse walking, sitting down, standing up, and waving), whereas rows correspond to different examples. The titles on the subplots correspond to the sequence files in the dataset.

#### **4. Method**

We chose a recurrent neural network (RNN) as our baseline. Recurrent nets are able to model multivariate time series (in our case, time-of-flight measurements) and output a class prediction by considering the whole temporal sequence. In particular, our choice was an RNN with Gated-Recurrent Unit (GRU) cells. These cells can retain long-term temporal information using internal gates and a set of optimisable parameters.

#### *4.1. Gated-Recurrent Unit*

We briefly introduce GRUs following the notation from [53]. Let $\mathbf{x} = (\mathbf{x}\_1, \dots, \mathbf{x}\_t, \dots, \mathbf{x}\_T)$, $\mathbf{x}\_t \in \mathbb{R}^n$, be a sequence of $T$ observations and $y \in C$ its ground truth class label. At each time step $t$, a GRU cell receives $\mathbf{x}\_t$ and outputs an activation response $\mathbf{h}\_t \in \mathbb{R}^m$

$$h\_t^j = (1 - z\_t^j)h\_{t-1}^j + z\_t^j \tilde{h}\_t^j \tag{1}$$

by combining the activation at the previous time step $h\_{t-1}^j$ and a candidate activation from the current time step $\tilde{h}\_t^j$.

The trade-off factor $z\_t^j$, namely the *update gate*, is calculated as

$$z\_t^j = \sigma(W\_z \mathbf{x}\_t + U\_z \mathbf{h}\_{t-1})^j,\tag{2}$$

where $W\_z \in \mathbb{R}^{m \times n}$ and $U\_z \in \mathbb{R}^{m \times m}$ are optimisable parameters shared across all $t$, and $\sigma$ is a sigmoid function that outputs values in the interval (0,1).

In turn, the *candidate activation* is calculated as

$$\tilde{h}\_t^j = \tanh(W\mathbf{x}\_t + U(\mathbf{r}\_t \odot \mathbf{h}\_{t-1}))^j, \tag{3}$$

where $\odot$ is the element-wise product of two vectors and $\mathbf{r}\_t$ is the *reset gate*. Note that $W \in \mathbb{R}^{m \times n}$ and $U \in \mathbb{R}^{m \times m}$ are different sets of parameters from $W\_z$ and $U\_z$.

Similar to the update gate $\mathbf{z}\_t$, the *reset gate* is

$$
r\_t^j = \sigma(W\_r \mathbf{x}\_t + U\_r \mathbf{h}\_{t-1})^j. \tag{4}
$$

Finally, the last GRU activation at time $T$ is input to a dense layer with a softmax activation function. From the dense layer, the logit value $z^i$ is computed by

$$z^i = \sum\_j w\_s^{ij} h\_T^j, \tag{5}$$

where $W\_s = (w\_s^{ij})$ are the softmax layer weights. Then, the softmax activation function can be applied to output the sequence classification label

$$
\hat{y}^i = \frac{e^{z^i}}{\sum\_k e^{z^k}}.\tag{6}
$$
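
To make Equations (1)–(6) concrete, the following is a minimal didactic sketch of one GRU cell and the softmax output layer in plain Python. It is not the implementation used in the experiments (which relied on Keras). The function names, the zero initial state, the random weights, and the toy dimensions are all assumptions made for illustration, and bias terms are omitted because the equations above do not include them.

```python
import math
import random


def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def gru_step(x_t, h_prev, p):
    """One GRU update following Eqs. (1)-(4); p holds the weight matrices."""
    # Update gate, Eq. (2)
    z = sigmoid([a + b for a, b in zip(matvec(p["Wz"], x_t), matvec(p["Uz"], h_prev))])
    # Reset gate, Eq. (4)
    r = sigmoid([a + b for a, b in zip(matvec(p["Wr"], x_t), matvec(p["Ur"], h_prev))])
    # Candidate activation, Eq. (3)
    gated = [rj * hj for rj, hj in zip(r, h_prev)]
    h_cand = [math.tanh(a + b) for a, b in zip(matvec(p["W"], x_t), matvec(p["U"], gated))]
    # Interpolation between previous and candidate activation, Eq. (1)
    return [(1.0 - zj) * hj + zj * cj for zj, hj, cj in zip(z, h_prev, h_cand)]


def classify(xs, p, Ws):
    """Run the GRU over a sequence, then apply the softmax layer, Eqs. (5)-(6)."""
    h = [0.0] * len(p["Uz"])           # initial state h_0 = 0 (assumption)
    for x_t in xs:
        h = gru_step(x_t, h, p)
    logits = matvec(Ws, h)             # Eq. (5)
    top = max(logits)                  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]   # Eq. (6)


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n, m, n_classes, T = 4, 3, 5, 6        # toy sizes, not those of the paper
params = {k: rand_matrix(m, n, rng) for k in ("Wz", "Wr", "W")}
params.update({k: rand_matrix(m, m, rng) for k in ("Uz", "Ur", "U")})
Ws = rand_matrix(n_classes, m, rng)
xs = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(T)]

probs = classify(xs, params, Ws)
print(probs)
```

The printed list contains one probability per class and sums to one; the predicted label is the index of its largest entry.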

#### *4.2. Bidirectional GRU and Stacked Layers*

Bidirectional recurrent networks consist of two independent networks that process the temporal information in the two temporal directions, forward and reverse, and whose activation outputs are concatenated. The input of the reverse recurrent network is simply the reversed input sequence. The logit value computation becomes

$$z^i = \sum\_j w\_s^{ij} [h^j\_{\text{fw},T}, h^j\_{\text{rv},0}], \tag{7}$$

where [·, ·] is the concatenation of the forward and reverse GRU activations.

In addition, GRU layers can be stacked to form a deeper GRU architecture. The first GRU layer receives as input the sequence of observations **x**, whereas each subsequent layer is fed with the activation outputs from the previous layer. We finally apply the softmax dense layer to the activations of the deepest stacked layer.
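
The data flow of a stacked bidirectional network can be sketched independently of the cell type. In the illustration below, `make_toy_step` is a hypothetical stand-in for a GRU cell (a leaky running average), chosen only so that the example stays short and self-contained; all names and sizes are assumptions. The point of the sketch is the wiring: the reverse network reads the reversed sequence, its outputs are re-aligned with the original time order, and the per-step concatenation feeds the next layer.

```python
import math


def run_layer(step, xs, h0):
    """Apply a recurrent step function along xs and return all activations."""
    hs, h = [], list(h0)
    for x_t in xs:
        h = step(x_t, h)
        hs.append(h)
    return hs


def bidirectional_layer(step_fw, step_rv, xs, h0):
    """Concatenate forward and reverse activations for every time step."""
    fw = run_layer(step_fw, xs, h0)
    # The reverse network processes the reversed sequence; reversing its
    # outputs again aligns them with the original time indices.
    rv = run_layer(step_rv, xs[::-1], h0)[::-1]
    return [f + r for f, r in zip(fw, rv)]   # list concatenation per step


def make_toy_step(alpha):
    """Hypothetical stand-in for a GRU cell: a leaky running average."""
    def step(x_t, h):
        drive = math.tanh(sum(x_t) / len(x_t))
        return [(1.0 - alpha) * hj + alpha * drive for hj in h]
    return step


m = 2                                   # toy hidden size per direction
h0 = [0.0] * m
xs = [[0.1 * t, 0.2 * t, -0.05 * t] for t in range(1, 6)]

# First layer reads the observations; the second reads the first layer's output.
layer1 = bidirectional_layer(make_toy_step(0.5), make_toy_step(0.3), xs, h0)
layer2 = bidirectional_layer(make_toy_step(0.4), make_toy_step(0.2), layer1, h0)

# Features for the softmax layer as in Eq. (7): last forward activation
# (time T) and last-computed reverse activation (original time index 0).
features = layer2[-1][:m] + layer2[0][m:]
print(features)
```

Each layer outputs vectors of size 2·m per time step, so the second layer's input width is twice the hidden size of the first.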

#### *4.3. Baseline*

Our architecture is a two-layer bidirectional GRU, each GRU with 512 neurons (experimentally chosen). The size of the softmax dense layer is the number of classes |*C*|. Figure 7 illustrates the architecture.

**Figure 7.** The two-layer bidirectional GRU baseline architecture. Arrows represent information flow, grey rectangles are bidirectional GRU layers, and circles represent the concatenation operation.

#### **5. Experimental Results and Discussion**

#### *5.1. Learning Model Details and Code Implementation*

Among different RNN cells, we chose Gated-Recurrent Units (GRU) for our baseline architecture. Compared to other recurrent cells, such as Long Short-Term Memory (LSTM) cells, these require a reduced number of parameters while still retaining long-term temporal information and providing highly competitive performance [54]. GRU is also often chosen over LSTM because hidden states are fully exposed and hence easier to interpret.

For the model computations, we entirely relied on GPU programming. In particular, our implementation is based on Keras [55], a GPU-capable deep-learning library written in Python. As for the GPU device itself, we utilised an NVIDIA Titan Xp with 12 GB of GDDR5X memory.

#### *5.2. Ablation Experiments on GRU Architectures*

To determine the best GRU architecture, we first performed a set of binary classification experiments on the following actions: forward (walking), reverse (walking), sit-down, stand-up, and hand-wave. We report the performance in terms of accuracy (averaged over a 10-fold cross validation). In Table 7, we present the ablation experiments on different multi-layer and bidirectional GRU architectures with the hidden layer size fixed to 64 neurons. For each architecture and target action, we trained a different GRU model for 25 epochs, which was enough to avoid under-fitting in the most complex model (two-layer biGRU).
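
The table captions specify class-weighted accuracies. One common reading of that term, assumed here, is the mean of the per-class recalls (often called balanced accuracy), which prevents the majority class from dominating the score in one-versus-rest binary problems. The sketch below uses hypothetical labels; the function name is ours.

```python
from collections import defaultdict


def class_weighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every class contributes equally."""
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Hypothetical imbalanced binary problem: 2 positives, 8 negatives.
y_true = [1, 1] + [0] * 8
y_pred = [1, 0] + [0] * 8

plain = sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)
weighted = class_weighted_accuracy(y_true, y_pred)
print(f"plain accuracy: {plain:.2f}, class-weighted accuracy: {weighted:.2f}")
```

In this toy case the plain accuracy is 0.90 while the class-weighted accuracy is 0.75, because half of the positive examples are missed.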

The most complex model, the two-layer biGRU, provided the best result. This shows that both multiple and bidirectional layers help to model single-pixel time-of-flight data sequences. In particular, adding a second stacked layer provided a +5.09% improvement over a single layer, whereas bidirectionality increased accuracy by 4.4%. The +8.37% gain from using both shows that these two architecture variations are highly complementary when dealing with our data.

**Table 7.** Comparison on GRU models with multiple layers and/or bidirectionality. In this ablation, we defined a set of five binary problems: forward, reverse, sit-down, stand-up, and hand-wave actions. The results reported are class-weighted accuracies averaged over a 10-fold cross validation. The "Average" column is the average of performances on binary problems.


Next, using a two-layer biGRU, we performed another set of ablation experiments on hidden layer sizes: {32, 64, 128, 256, 512}. Since the hidden layer size drastically affects the number of parameters to optimise during the training stage, each model was trained for a different number of epochs: {10, 25, 50, 100, 200, 400}, respectively. Results are shown in Table 8.

The largest model, i.e., 512 hidden layer neurons, performed the best. Its +5.62% gain with respect to the smallest two-layer biGRU model with 32 neurons demonstrated room for improvement from using more complex models despite the presumed simplicity of single-pixel time-of-flight time-series. However, we decided against further increasing the hidden size because of computational constraints: enlarging the hidden layer causes a rapid (roughly quadratic) growth in the number of parameters to train. In particular, a model with 32 hidden neurons consisted of 121 K parameters, whereas 512 hidden neurons increased the size up to 7.8 M (and 37.7 M in the case of 1024 neurons); this and the saturation of accuracy discouraged us from further enlarging the hidden layer size.

Before further experimentation with GRU recurrent nets, we compared the best performing model to its analogous LSTM variant (two-layer biLSTM with 512 hidden layer neurons). In Table 9, we show that GRU obtains performance competitive with LSTM. The marginal improvement of 0.56% obtained by LSTM requires a substantial increase in the number of parameters, especially when considering larger models. With a hidden size of 512, LSTM has 2.6 M additional parameters to optimise compared to the GRU version. For further experiments, we kept the biGRU (two-layer, 512 hidden neurons) architecture.
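
The parameter counts quoted above can be checked with a short calculation. The sketch assumes an input size of 500 (the dimensionality of the preprocessed vectors), one input matrix, one recurrent matrix and one bias vector per gate, and ignores the small softmax layer; these are our assumptions, and the exact count in a given library may differ slightly depending on its bias convention. A GRU has three such gate blocks (update, reset, candidate), whereas an LSTM has four.

```python
def recurrent_params(n_in, n_hidden, n_gates):
    """Parameters of one recurrent layer: per gate, an input matrix,
    a recurrent matrix and a bias vector."""
    return n_gates * (n_hidden * n_in + n_hidden * n_hidden + n_hidden)


def two_layer_bidirectional(n_in, n_hidden, n_gates):
    """Two stacked bidirectional layers; the second layer receives the
    concatenated (2 * n_hidden) output of the first."""
    layer1 = 2 * recurrent_params(n_in, n_hidden, n_gates)
    layer2 = 2 * recurrent_params(2 * n_hidden, n_hidden, n_gates)
    return layer1 + layer2


N_IN = 500                      # assumed input dimensionality
GRU_GATES, LSTM_GATES = 3, 4    # gate blocks per cell type

for hidden in (32, 512):
    gru = two_layer_bidirectional(N_IN, hidden, GRU_GATES)
    lstm = two_layer_bidirectional(N_IN, hidden, LSTM_GATES)
    print(f"hidden={hidden}: GRU {gru:,}, LSTM {lstm:,}, extra {lstm - gru:,}")
```

Under these assumptions the sketch gives 120,960 and 7,833,600 GRU parameters for 32 and 512 hidden neurons, in line with the 121 K and 7.8 M figures, and 2,611,200 additional LSTM parameters at 512 hidden neurons, in line with the 2.6 M difference. The dominant term is proportional to the square of the hidden size.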



**Table 9.** GRU versus LSTM on 5 binary problems (see Columns 2–6). The results reported are class-weighted accuracies averaged over a 10-fold cross validation. The "Average" column is the average of performances on binary problems.


#### *5.3. Final Experiments*

After having fixed the final GRU model architecture to two bidirectional stacked layers with 512 hidden neurons, we evaluated its performance in multiclass classification and ran further experiments to assess the generalisation capabilities of our approach.

#### 5.3.1. Multiclass Classification

To evaluate the misclassifications and potential confusion among classes from our previous binary problems, we first defined a multiclass problem with those same labels: {F, R, sd, su, hw}, where F is forward, R is reverse, sd is sit down, su is stand up, and hw is hand-wave. In this five-class problem, the model was able to correctly predict 92.67% of actions (see column "Actions" in Table 10). As shown in Figure 8a, the confusion is introduced by the semantically similar classes, either forward and reverse or sit-down and stand-up. The hand-wave classification was almost perfect, confused only once with a reverse instance in 50 hw examples.

The second and third experiments were intended to classify the walking path. The former did not distinguish walking direction, whereas the latter did. We hence defined two separate sets of labels, {A1, A2, B1, C1, C2} and {FA1, FA2, FB1, FC1, FC2, RA1, RA2, RB1, RC1, RC2}, respectively, where the letters F and R before the path label distinguish between the forward and reverse directions. As shown in Table 10, the model performed similarly in the two cases, with slightly worse performance when not considering the walking direction (86.23%) than when doing so (86.65%). Figure 8 shows the confusion matrices for these two experiments.

Finally, in the fourth and last experiment, we labelled the setup in which the action was occurring with labels {1, ... , 6}, which correspond to tasks listed in Table 5. In this experiment, the robot had to perform various tasks while an object was also present in the same environment. The object's location and the performed actions can be seen in Figure 4. The accuracies obtained are summarised in Table 10, while Figure 8 illustrates class confusions.
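
The confusion matrices referred to above are row-wise normalised, so each row shows how the examples of one true class are distributed over the predicted classes. A minimal sketch of that computation follows; the label sequences are hypothetical and only reuse the five action labels for illustration.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count matrix with true classes as rows and predictions as columns."""
    index = {lab: k for k, lab in enumerate(labels)}
    cm = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        cm[index[t]][index[p]] += 1
    return cm


def normalise_rows(cm):
    """Divide each row by its sum; empty rows are left as zeros."""
    out = []
    for row in cm:
        total = sum(row)
        out.append([c / total if total else 0.0 for c in row])
    return out


labels = ["F", "R", "sd", "su", "hw"]
# Hypothetical predictions, not results from the paper.
y_true = ["F", "F", "R", "R", "sd", "su", "hw", "hw"]
y_pred = ["F", "R", "R", "R", "sd", "sd", "hw", "hw"]

cm = normalise_rows(confusion_matrix(y_true, y_pred, labels))
for lab, row in zip(labels, cm):
    print(lab, [round(v, 2) for v in row])
```

With row-wise normalisation the diagonal entry of each row is the recall of that class, which makes confusions between classes of different sizes directly comparable.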

**Table 10.** Classification on four multiclass problems obtained by biGRU (two-layer, 512-hidden) baseline. The results reported are class-weighted accuracies averaged over a 10-fold cross validation.


**Figure 8.** Confusion matrices (row-wise normalised) from multiclass classification experiments from Table 10.

#### 5.3.2. Model Generalisation on Actions and Two Robots

In this section, we evaluate the generalisation capabilities of the models when learning from single-pixel time-of-flight patterns.

Each action was captured over a number of repetitions. Across these repetitions, the path in walking actions (forward and reverse) or the initial position (sit-down, stand-up, and hand-wave) was varied. In this experiment, we wanted to take this into account by excluding from training all the repetitions of one action, to ensure that the model is not overfitting because repetitions have very similar patterns. For that, we changed our validation procedure to leave-one-rep set-out, i.e., we predicted a repetition set all at once in the test set, and did not use repetitions from the same set during training. Results are presented in Table 11. If we compare those to the results from the same model, i.e., biGRU (two-layer, 512-hidden), in Table 10, we can observe that there was no drop in accuracy but a slight improvement, probably due to both the generalisation capabilities and the fact that we could use more data to train across folds.
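
The leave-one-rep set-out procedure can be sketched as a grouped split: every sequence carries an identifier of the repetition set it belongs to, and each fold holds out one whole set. The identifiers and the function name below are hypothetical; how sequences were actually assigned to repetition sets is not specified here.

```python
def leave_one_group_out(groups):
    """Yield (held_out, train_idx, test_idx) for each distinct group id."""
    for held_out in sorted(set(groups)):
        test_idx = [i for i, g in enumerate(groups) if g == held_out]
        train_idx = [i for i, g in enumerate(groups) if g != held_out]
        yield held_out, train_idx, test_idx


# Hypothetical repetition-set id for each of six sequences.
rep_sets = ["rep1", "rep1", "rep2", "rep2", "rep3", "rep3"]

splits = list(leave_one_group_out(rep_sets))
for held_out, train_idx, test_idx in splits:
    print(f"test on {held_out}: train={train_idx}, test={test_idx}")
```

Because no repetition set contributes to both the training and the test side of a fold, near-duplicate sequences from the same set cannot inflate the measured accuracy.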

**Table 11.** Leave-one-rep set-out cross-validation (LOROCV) experiment using biGRU (two-layer, 512-hidden). These are the same as those from last row in Table 8, but using LOROCV instead of 10-fold CV.


All the sequences used so far involved just one robot performing actions. A separate set of sequences was used to test action classification when two robots were present, as shown in Table 12. These sequences were only used in the test phase (only one-robot sequences were used for training). In particular, we analysed three different scenarios: (1) one robot acted, while the other one stood still; (2) the two robots performed the same action; and (3) each robot performed a different action.

From the results in Scenario (1), we observed that the standing-up robot did not interfere with the prediction of the other action category. In fact, the model failed to predict the stand-up action, since the other actions presented a more dominant motion pattern that interferes with the stand-up pattern learned from one-robot actions.

**Table 12.** Two-robot experiments in three different scenarios: one robot standing up while the other performs a particular action, the two robots performing the same action, and the two robots each performing a different action. Each scenario is a separate test set with a different number of examples. The number of positive examples for each class in each scenario is given in brackets. Since positive/negative classes are imbalanced, we report class-weighted accuracies (%).


#### *5.4. Discussion*

In this paper, we propose a concept for detecting actions while preserving the privacy of the test subject (the NAO V4 robot). Our concept relies on recording only the temporal evolution of light pulses scattered back from the scene. The data trace recording one action contains a sequence of one-dimensional arrays of voltage values acquired by the single-pixel detector after amplification and detection by the data acquisition system at 6 GHz repetition rate. The data trace is very compact and easy to process compared to videos, which contain sequences of 2D images.

The data volume reduction is achieved by controlled illumination and a single-pixel detector without any spatial resolution. The scene was illuminated with a diverging, speckled light pulse of 30 picosecond (30 × 10<sup>−12</sup> s) duration. The method would also work in different scenes, where most of the objects are static.
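
For orientation, the time scales involved can be converted to distances with the standard time-of-flight relation, in which light travels to the object and back, so a time interval Δt corresponds to a depth of c·Δt/2. The sketch below applies it to the 600 ps FWHM response quoted in the setup description and to the 1200 mm box length. It is a back-of-the-envelope estimate that ignores the actual positions of the scatterer and the detector, and the function names are ours.

```python
C = 299_792_458.0   # speed of light in vacuum, m/s


def round_trip_time(distance_m):
    """Time for light to reach an object at distance_m and return."""
    return 2.0 * distance_m / C


def range_resolution(time_resolution_s):
    """Depth interval corresponding to a given time resolution."""
    return C * time_resolution_s / 2.0


print(f"depth per 600 ps: {range_resolution(600e-12) * 100:.1f} cm")
print(f"depth per 30 ps:  {range_resolution(30e-12) * 1000:.1f} mm")
print(f"round trip over 1.2 m: {round_trip_time(1.2) * 1e9:.1f} ns")
```

These values (about 9 cm, 4.5 mm and 8 ns, respectively) indicate that the detector response, rather than the pulse duration, sets the depth scale that can be resolved in the traces.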

Compared to 2D images, hardly any information about the colours, objects, their shapes and positions can be retrieved from the data traces by classical methods. Nevertheless, much like the neural networks, a human can distinguish the actions and perhaps also differentiate moving directions from the data traces.

The research in hand clearly articulates a core property of movement: it imprints a temporal evolution on even the simplest data trace. Owing to the interdisciplinary approach combining the tools of photonics (modern, application-oriented optics and light detection) and computer science, one is capable of reducing the data rate. The result has high potential to provide cost-effective surveillance systems that help societies maintain public order and take care of young, elderly and injured members.

The photonics and data acquisition schemes used in this experiment are unlikely to become widespread owing to their high cost and other features. However, detectors and laser systems capable of providing suitable illumination and detection properties in an affordable price range are being developed and will enter the market in the near future.

#### **6. Conclusions**

This research work proposed a new methodology for action recognition while preserving the test subject's privacy. The proposed method uses only the temporal evolution of light pulses scattered back from the scene. Advanced machine learning algorithms, namely RNN and LSTM, were adopted for data analysis and demonstrated successful action recognition. The experimental results show that our proposed method could achieve a high recognition rate for five actions, namely walking forward, walking reverse, sitting down, standing up, and waving hand, with an average recognition rate of 96.47%. In this work, we additionally studied action recognition when multiple concurrent actors are present in the scene.

In future work, we will conduct further experiments, including more complex actions, such as running, jumping, and head movements. We are planning to record a higher number of samples to better assess the generalisation capabilities of our proposed approach.

**Author Contributions:** Conceptualization, H.V.-L., S.E., C.O. and G.A.; Data curation, I.O., A.H., K.M.P., S.-M.V. and S.O.; Funding acquisition, G.A.; Investigation, H.V.-L., S.O., S.E. and G.A.; Methodology, A.C., E.A., A.V., H.V.-L., S.E. and G.A.; Software, A.C.; Writing—original draft, A.H. and G.A.; Writing—review & editing, I.O., E.A., S.-M.V., H.V.-L., S.O., S.E., C.O. and G.A.

**Funding:** This work was partially supported by Estonian Research Council Grants (PUT638, PUT1075, PUT1081), The Scientific and Technological Research Council of Turkey (TÜBITAK) (Project 1001-116E097), the Estonian Centre of Excellence in IT (EXCITE) funded by the European Regional Development Fund, the Spanish Project TIN2016-74946-P (MINECO/FEDER, UE) and CERCA Programme/Generalitat de Catalunya. This project received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No. 665919. This work was partially supported by ICREA under the ICREA Academia programme.

**Acknowledgments:** We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp and V GPUs used for this research.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Supervisors' Visual Attention Allocation Modeling Using Hybrid Entropy**

#### **Haifeng Bao, Weining Fang \*, Beiyuan Guo and Peng Wang**

State Key Lab Rail Traff Control & Safety, School of Mechanical, Electronic and Control Engineering, Beijing Jiaotong University, 100044 Beijing, China; 12116341@bjtu.edu.cn (H.B.); byguo@bjtu.edu.cn (B.G.); c2t53y48@gmail.com (P.W.)

**\*** Correspondence: wnfang@bjtu.edu.cn; Tel.: +86-139-1108-5123

Received: 26 March 2019; Accepted: 10 April 2019; Published: 12 April 2019

**Abstract:** With the improvement in automation technology, humans have now become supervisors of the complicated control systems that monitor the informative human–machine interface. Analyzing the visual attention allocation behaviors of supervisors is essential for the design and evaluation of the interface. Supervisors tend to pay attention to visual sections whose information is fuzzier, which gives them a higher mental entropy. Supervisors also tend to focus on the important information in the interface. In this paper, the fuzziness tendency is described by the probability of correct evaluation of the visual sections using hybrid entropy. The importance tendency is defined by the proposed value priority function. The function is based on the definition of the amount of information using the membership degrees of the importance. By combining these two cognitive tendencies, the informative top-down visual attention allocation mechanism was revealed, and the supervisors' visual attention allocation model was built. The Building Automatic System (BAS) was used to monitor the environmental equipment in a subway, which is a typical informative human–machine interface. An experiment using the BAS simulator was conducted to verify the model. The results showed that the supervisor's attention behavior was in good agreement with the proposed model. The effectiveness and comparison with the current models were also discussed. The proposed attention allocation model is effective and reasonable, which is promising for use in behavior analysis, cognitive optimization, and industrial design.

**Keywords:** attention allocation; attention behavior; hybrid entropy; information entropy

#### **1. Introduction**

With the improvement in automation technology, the role of humans in complicated control systems is changing from that of operators to supervisors [1]. More and more information is being displayed on human–machine interfaces, but human attention ability is limited. Therefore, the limited attention resources of supervisors are precious and important. Most system failures and operational accidents are due to the lack of visual attention to relevant information [2]. Analyzing the visual attention behaviors and revealing the visual attention allocation mechanism are important for the design and evaluation of human–machine interfaces (HMIs). HMIs with an ergonomic design that align with the attention behaviors of the supervisors are useful for system safety, error evaluation, and accident prevention [3–5].

Attention behavior has many aspects, such as hearing, vision, and touch. Among them, vision is important to supervisors during the task of monitoring. Humans have a complex selective visual attention behavior that scans the scene both in a rapid, bottom-up, salience-driven manner as well as a slower, top-down, task-dependent manner [6]. The visual attention to bottom-up salient information is a rapid process that has limited effects on task-dependent attention allocation. Supervisory behavior is a long-term attention allocation mechanism for familiar scenes. The top-down task-driven factors occupy the majority of the attention strategy during supervisory tasks.

Many factors can affect attention behaviors. Salience-driven factors depend on visual features such as salience, blinking, shape, and colors [7–9]. The task-driven factors in the supervisory task depend on the task features such as urgency, expectation, effort, and value [10–13]. Supervisors always comprehensively consider the above task-factors during the task process, then establish the priority of the information. In the task, the importance of the displayed information is mainly considered by the supervisors. Matsuka proved that human learners do not always optimize attention; one reason they fail to do so is that, under certain conditions, the cost of information retrieval or use may affect the attention strategy adopted by the learners [14]. Therefore, in familiar procedural tasks, supervisors acquire system information based on their experience and previously acquired knowledge due to the top-down attention strategy.

The determination of information priorities is complicated and fuzzy in the cognitive process. The uncertainty of the information may produce significant anxiety in supervisors, who tend to pay attention to the information sections that can reduce that indeterminacy. The attention to information is a reduction of the entropy of the HMI. This complicated cognitive behavior was described as mental entropy processing by Wanyan [15]. Even though mental entropy theory has some limitations, it was used successfully in modeling the cognitive process for information processing in the human brain. Supervisors tend to pay attention to the visual section which has a higher information value. Therefore, the membership degrees of the importance of the information sections based on fuzzy theory could be feasibly used to quantify its value. These two selective cognitive mechanisms have been shown to synergistically affect attention behaviors [16,17].

Efficient HMIs help their users accomplish their tasks with minimal workload and fatal errors. The visual attention model is useful for the design and optimization of these interfaces [18,19]. The layout of a T-type HMI on aircraft was constructed by Fitts by analyzing the pilots' visual attention behavior [20]. The visual attention model predicted the users' selective attention behavior in supervisory tasks, which was beneficial in staff training [21]. One important aspect of on-the-job training of supervisors is to make them pay attention to the right section at the right time. Using the model, the researchers evaluated the mental workload and situation awareness of the user, which provided information about the conditions of the user's current mental status [22,23]. This model can also guide task analysis and contribute to task optimization [12]. Overall, the visual attention allocation model is useful and promising.

At present, evaluating visual attention is easily accomplished by tracking eye gaze in or after the supervisory task [24]; however, predicting the visual attention allocation behaviors before the task is challenging. We aimed to build an effective, accurate, and quantified model in visual attention allocation based on the related works.

#### **2. Related Works**

In previous studies, researchers proposed many valuable attention allocation models to predict the supervisory behavior of supervisors in informative HMIs. Based on saliency-based image recognition, the predictive attention model was built which considered the bottom-up attention mechanism of humans [6,9,25,26]. The observable information on the screen could be recognized using deep learning to predict the attention behavior [27–29]. These bottom-up models help us reveal the basic attention mechanism by which humans react to images. Wickens developed the SEEV model of scanning behavior considering the task-driven factors [10–12]. This model considers the salience, effort, expectancy, and value (SEEV) associated with each visual section. The model was improved to NT-SEEV, to predict the noticeability (NT for notice) of events that occurred in the context of routine task-driven scanning across large-scale visual environments [30]. Many researchers worked on the quantitation and computation of multiple factors in SEEV [31–33]. SEEV and its improved models consider both the bottom-up and top-down attention mechanisms of humans. However, due to the different chosen factors and computational methods, the results of the above models have varied significantly.

Some researchers computed attention allocation using gaze data based on fuzzy theory [34,35]. However, this involved a post analysis method that could not predict the attention allocation strategy. Senders considered the human operator as a monitor and controller in the system [36]. The model argues that humans are information processors and supervisory behavior is a data processing process. The model describes the strategy of humans when selecting their attention focus in an informative HMI. Sheridan distinguished the time interval of the supervisor when processing the information and the proposed model assumed that the operator controls the most valuable information with each sample [37]. Visual information processing is fuzzy in the human brain. Lin introduced a novel fractional-order chaotic phase synchronization model for visual selection and shifting [38]. The model uses two chaotic network layers to simulate the human cognitive system and solves the processing of the natural image in the brain, which was useful for the proposed model in this article. Junshan used multiscale entropy analysis of human operating behavior, which is a post-analysis method to determine the human dynamics [39]. Pan extended the influence model to incorporate dynamical parameters to a social system, which allowed us to uncover important shifts between actors. The model is instructive in attention shift behavior [40].

Based on the above work, Matsui researched attention allocation using fuzzy theory and quantified the selective attention mechanism of the information using hybrid entropy [41,42]. Wanyan et al. [15] and Wu et al. [16] applied detection efficiency factors and fatigue factors to Matsui's fuzzy model for pilots. Considering multiple factors in the SEEV model, Wu and Wanyan developed the attention model under multi-factor conditions [17]. This was an attempt to integrate the SEEV model and the fuzzy model. Based on subjective expected utility theory (SEU), a human is an optimal information processing processor [43]. The comprehensive consideration of the theory aimed to maximize the acquisition of the important information and minimize the fuzziness of the scene. The above attention allocation models based on fuzzy theory usually involved two main factors: information value and information fuzziness [15–17,41,42].

The above models used the membership degrees of the importance of the information (value: 0–1) expressing the information value. However, the drawback of the application of membership degrees without processing was that the attention allocation ratio did not increase when the information value increased. This means that a high information value might not lead to a high attention allocation ratio. In this aspect, the above models based on fuzzy theory need to be improved. In this study, we tried to solve this problem and demonstrate that our improvement is reasonable and effective.

The proposed attention allocation model was built based on Matsui's and Wanyan et al.'s models [15,16,42]. The information value is presented by the proposed value priority function using the membership degrees of the importance and information amounts. Using the theory of hybrid entropy, the proposed model expresses the supervisors' fuzzy cognition of the information processing in the human brain. Combining these two cognitive processes, an increasing attention allocation model was built along with the increasing information value. The BAS system is a typical interface used by supervisors to monitor the environmental equipment in subway systems. We conducted an experiment using a BAS simulator, which showed that the proposed model is effective. Compared with Matsui's and Wanyan's model, the proposed model has several advantages and reasonable improvements. We think that our proposed model has potential for applications in behavior analysis, cognitive optimization, and industrial ergonomic design.

#### **3. Methods**

#### *3.1. Value Priority Function*

The supervisory task involves monitoring and controlling a large amount of system information. The information on the monitors can be partitioned into several visual displays and independent meaningful sections, creating *Ii*:

$$I\_i = (I\_1, I\_2, \dots, I\_n) \tag{1}$$

The attention allocation model aims to predict the attention behavior of the supervisor. The attention allocation is the ratio *Ai* of the virtual attention time required to focus on the information *Ii* to the total virtual attention time for the whole task, as shown in Equation (2). The proposed attention allocation model aims to build the mapping relationship between *Ii* and *Ai* before the supervisory task:

$$A\_i = (A\_1, A\_2, \dots, A\_n) \tag{2}$$

Based on the research of Wickens, the attention behaviors of a skillful operator are rarely affected by the bottom-up channel unless the bottom-up factors have independent meaning [11]. Subsequent research supported this view [14]. Thus, the extension of this theory tried to consider multiple factors in particular scenes.

During a familiar procedural task, the supervisor of the system would have previously evaluated the information value based on their knowledge and training. However, the priority is fuzzy to recognize. Based on fuzzy theory, the membership degree of the information importance is considered the information value *Vi* to every information *Ii*, as shown in Equation (3). For a task, the membership degrees of the importance for the informative sections are certain values. Usually, the values are provided by experts in the field who are familiar with the task [15,16]:

$$V\_i = (V\_1, V\_2, \dots, V\_n) \tag{3}$$

Matsui and Wanyan et al. considered these membership degrees as the information value [15,16,42]. The possible values are 0–1. In this research, we wanted to build a visual attention allocation model that assigns a higher attention ratio to a higher information value. Therefore, the information value, *Vi*, needed to be converted into the value priority of the information, *Vi'*, which ranges from 0 to positive infinity.

Considering supervisors as information processors, the information value *Vi* of the sections should be converted in the same way as its information entropy. Usually, the information amount, *Hi*, in Equation (4), describes the information conveyed when event *i* occurs, and is related to the probability *Pri* that this information occurs:

$$H\_i = -\ln Pr\_i \tag{4}$$

The definition of information amounts shows:


The improvement in the information value *Vi* needs to consider the following cognitive behaviors:


Referring to similarities to the definition of information amounts and cognitive behaviors, we propose a value priority function F(*Vi*), to manage the information value *Vi*. The improved information value is value priority *Vi*', as shown in Equation (5), and represents the tendency where supervisors tend to pay more attention to the more important information:

$$V\_i' = \mathcal{F}(V\_i) = -\ln(1 - V\_i) \tag{5}$$
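As a minimal illustration of Equation (5), the following Python sketch evaluates the value priority for the four membership degrees later used in the experiment (0.1, 0.3, 0.7, and 0.9). The function name `value_priority` and the input check are assumptions made for this example only.

```python
import math


def value_priority(v: float) -> float:
    """Value priority V' = F(V) = -ln(1 - V) from Equation (5), for 0 <= V < 1."""
    if not 0.0 <= v < 1.0:
        raise ValueError("membership degree must lie in [0, 1)")
    return -math.log(1.0 - v)


# Membership degrees of importance reported for the four sections.
membership = [0.1, 0.3, 0.7, 0.9]
priorities = [value_priority(v) for v in membership]

for v, vp in zip(membership, priorities):
    print(f"V = {v:.1f} -> V' = {vp:.4f}")
```

The mapping is strictly increasing and grows without bound as *Vi* approaches 1, which is the property the model relies on.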

#### *3.2. Information Fuzziness Tendency*

The psychological and physiological states of the supervisor affect attention behavior. *Pi* represents the probability that the supervisor will correctly process the information (Equation (6)). When they have a higher probability of correctly evaluating the information, the supervisor pays more attention to this information [15,16]:

$$P\_i = (P\_1, P\_2, \dots, P\_n) \tag{6}$$

This uncertain evaluation of the information *Pi* is caused by the fuzzy information value *Vi*. Based on fuzzy theory, the ambiguities of information can be quantified by hybrid entropy. The hybrid entropy *S* represents the cognition fuzzy level, which involves the informative probabilistic entropy *Hprob* and the informative binary entropy *Hbin*:

$$\begin{array}{l} S = H\_{\text{prob}} + H\_{\text{bin}} = \sum\_{i=1}^{n} -P\_i \ln P\_i + \sum\_{i=1}^{n} P\_i h(V\_i) \\\ h(V\_i) = -V\_i \ln V\_i - (1 - V\_i) \ln(1 - V\_i) \end{array} \tag{7}$$

The supervisor is the optimal processor of the information when they have the highest attention cognition. That is, the best cognitive state occurs when the hybrid entropy *S* reaches its maximum. Based on SEU theory [43], the supervisor then processes as much information as they can, and *S* reaches *Smax*. Under this condition, *S* = *Smax*, we calculated the probability of the correct evaluation *Pi* from Equation (7) using the Lagrange multiplier method with constraints. The resulting critical points *Pi* are given by Equation (8). The calculation of the critical points can be found in the current research [15]:

$$P\_i = \frac{\exp h(V\_i)}{\sum\_{i=1}^n \exp h(V\_i)}, \quad \text{when } S \text{ reaches the maximum, } S = S\_{\max} \tag{8}$$

When the hybrid entropy *S* reaches the maximum, humans become the best processor of information based on the maximum entropy principle. This means that the human optimally processes the information to decrease the uncertainty of the HMI. *Smax* quantifies this ability, called mental entropy (ME).

The probability of the correct evaluation *Pi* represents the tendency of supervisors to pay more attention to fuzzier information [15,16].
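Equations (7) and (8) can be checked numerically. The Python sketch below, with helper names chosen only for this illustration, computes *h*(*Vi*), the probabilities *Pi* of Equation (8), and the hybrid entropy *S* of Equation (7) for the four membership degrees used in the experiment. It also compares *S* at the critical point with *S* under a uniform distribution. For the critical point, substituting Equation (8) into Equation (7) gives *Smax* = ln Σ exp *h*(*Vi*), which the sketch uses as a consistency check.

```python
import math


def binary_entropy(v: float) -> float:
    """h(V) = -V ln V - (1 - V) ln(1 - V), with 0 ln 0 taken as 0."""
    return -sum(x * math.log(x) for x in (v, 1.0 - v) if x > 0.0)


def correct_evaluation_probs(values):
    """Equation (8): P_i is proportional to exp(h(V_i))."""
    weights = [math.exp(binary_entropy(v)) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def hybrid_entropy(probs, values):
    """Equation (7): S = sum(-P_i ln P_i) + sum(P_i h(V_i))."""
    h_prob = -sum(p * math.log(p) for p in probs if p > 0.0)
    h_bin = sum(p * binary_entropy(v) for p, v in zip(probs, values))
    return h_prob + h_bin


values = [0.1, 0.3, 0.7, 0.9]
p_star = correct_evaluation_probs(values)
s_max = hybrid_entropy(p_star, values)
s_uniform = hybrid_entropy([0.25] * 4, values)  # any other distribution scores lower

print("P* =", [round(p, 4) for p in p_star])
print("S(P*) =", s_max, " S(uniform) =", s_uniform)
```

Because *h* is symmetric about 0.5, sections with membership degrees 0.3 and 0.7 receive the same *Pi*, as do 0.1 and 0.9.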

#### *3.3. Attention Allocation Model*

According to the above-mentioned analysis, the cognitive process of the information in the supervisory task involves two channels. The supervisors process the information value based on their previous cognition and knowledge, while they process the information fuzziness based on the psychological and physiological state of the supervisor. Combining these two channels, we can obtain the information cognitive evaluation *Ci* using Equation (9). Finally, the cognitive process is defined by the probability of the correct evaluation *Pi* and the information value *Vi*:

$$\mathbf{C}\_{i} = P\_{i}V\_{i}^{\prime} = P\_{i}\mathbf{F}(V\_{i}) = -P\_{i}\ln(1 - V\_{i}) \tag{9}$$

Kleinman defined the attention allocation *Ai* as the ability to process the information [44]. Based on information science, he considered humans the optimal multiple processors to process the information channel *Ii*. The subsequent research adopted this idea as the foundation of the attention allocation model and defined the attention allocation *Ai*, which showed that the information cognitive evaluation *Ci* determines the final attention allocation strategy. The final attention allocation model for the supervisors can be represented as:

$$A\_i = \frac{C\_i}{\sum\_{i=1}^{n} C\_i} = \frac{-P\_i \ln(1 - V\_i)}{\sum\_{i=1}^{n} -P\_i \ln(1 - V\_i)} \tag{10}$$
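The complete model of Equation (10) can be sketched in a few lines of Python. This is an illustrative implementation under the stated equations, not the authors' code; the function names are assumptions, and the input is the set of membership degrees used in the experiment.

```python
import math


def binary_entropy(v: float) -> float:
    """h(V) = -V ln V - (1 - V) ln(1 - V), with 0 ln 0 taken as 0."""
    return -sum(x * math.log(x) for x in (v, 1.0 - v) if x > 0.0)


def proposed_attention(values):
    """Equation (10): A_i = C_i / sum(C_i), with C_i = -P_i ln(1 - V_i)."""
    weights = [math.exp(binary_entropy(v)) for v in values]
    total = sum(weights)
    probs = [w / total for w in weights]                          # Equation (8)
    evals = [-p * math.log(1.0 - v) for p, v in zip(probs, values)]  # Equation (9)
    norm = sum(evals)
    return [c / norm for c in evals]


values = [0.1, 0.3, 0.7, 0.9]
a_proposed = proposed_attention(values)
for v, a in zip(values, a_proposed):
    print(f"V = {v:.1f} -> A = {100 * a:.1f}%")
```

For these inputs, the predicted attention ratio increases with the information value, and roughly half of the attention is assigned to the section with a membership degree of 0.9.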

Figure 1 shows the framework of the proposed visual attention allocation model for the supervisors and shows how to build the model and the dependent theories.

**Figure 1.** The framework of the proposed visual attention allocation model for the supervisors.

#### **4. Experiment**

#### *4.1. Apparatus*

The experiment interface was a simulator running the BAS system, showing the statuses of the main air exchange fans in the subway system (Figure 2a). The system information was shown on a 22-inch digital screen with a resolution of 1680 × 1050. The SMI RED500 (Silicon Microstructures Inc., California, CA, USA) tracked the participant's eye movements with a 60 Hz infrared-based camera by capturing the infrared light reflected from the eyes. We used it to record the participant's visual behaviors, including the gaze points on the screen and the fixation distribution. The experiment environment is shown in Figure 2b.

#### *4.2. Participants*

Fourteen students from the Beijing Jiaotong University, Beijing, China participated in the study (seven men, seven women, 25.3 ± 2.6 years old). All participants were familiar with the operation of a computer keyboard and had background knowledge of the subway operation. All participants were right-handed with normal vision.

**Figure 2.** (**a**) The human-machine interface of the Building Automatic System (BAS). (**b**) The experiment environment.

#### *4.3. Experimental Task*

The BAS interface showed four main sections for four air fans in a fire scene. During the task, the participants needed to monitor the four speed indicators of the air fans and allocate their attention resources based on the pre-given membership degrees of the importance of the four sections. The speeds of the air fans changed every second, and the changes were shown in the indicators. When the indicators showed an excess speed of the fans (>80% rated), the participants had to press the corresponding key (Insert, Delete, Home, or End for the four sections) on the keyboard to control the speed for overload protection. The abnormal excess speed remained for one second. If the participants missed it or entered the wrong response to the overloaded air fan, they were considered as not having paid attention to the corresponding section on the screen. The accuracy rates and eye behaviors were recorded during the whole task. We used the keys Insert, Delete, Home, and End because the layout of these four keys is similar to the HMI of the BAS simulator.

A correct response to an abnormal section earned a score equal to the section's membership degree of importance; e.g., a correct response to the area of interest (AOI) with a membership degree of 0.9 earned 0.9 points. Responding to sections with a higher information value, and responding to more abnormal sections, therefore yielded a higher total score. The goal of the participants was to achieve the highest total score.

#### *4.4. Experimental Procedure*

The operation of the BAS interface was explained to the participants. At first, the membership degrees of the importance of the four air fans were set based on their relative priorities in a fire scene. The participants were instructed to remember and understand the membership degrees given the possibility that the system would encounter a serious failure if the supervisor missed the overload control. Participants were asked to practice task operations twice to simulate the supervisor's experience and previously acquired knowledge. Through practice, the participants became familiar with the operation of the BAS and the functioning of the system. They would not need to look at the keyboard when they pressed the keys.

During the formal experiment, the participants were asked to complete the calibration process for the eye tracking devices first. Then, they were asked to freely allocate their attention to the four sections. They needed to try their best to respond to all the abnormal sections in the HMI. The test continued for five minutes, and eye behaviors were recorded during the whole test.

#### *4.5. Data Analysis*

The experimental results of the key-press response showed that the sections had different correct response ratios, *Oi*, calculated as the number of correct responses divided by the total number of overload occurrences in the section. The correct response to the overload section was considered as selective attention to the corresponding section. Therefore, the fractional attention, *Ak\_i* (key), was quantified by the experimental key-press data as:

$$A\_{k\\_i} = \frac{O\_i}{\sum\_{i=1}^n O\_i}, (i = 1, 2, 3, 4) \tag{11}$$

After the experiment, the participants' eye tracking data were analyzed using the eye behavior analysis software Begaze, which was developed by Silicon Microstructures Inc., California, CA, USA. In Begaze, the four sections were identified by the four areas of interest (AOIs). The fixation behaviors of the different AOIs were extracted from the original data, which meant that the participants paid attention to the corresponding sections. Based on the fixation times, *mi,* for a certain AOI, the fractional attention, *Ae\_i* (eye), was quantified by the experimental eye tracking data with:

$$A\_{e\\_i} = \frac{m\_i}{\sum\_{i=1}^{n} m\_i}, (i = 1, 2, 3, 4) \tag{12}$$
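Equations (11) and (12) apply the same normalization to two different measurements. The short Python sketch below illustrates the calculation; the numbers are made up for the example and are not the experimental data.

```python
def fractional_attention(measurements):
    """Normalize per-section measurements so that they sum to 1 (Equations (11) and (12))."""
    total = sum(measurements)
    if total <= 0:
        raise ValueError("measurements must have a positive sum")
    return [m / total for m in measurements]


# Hypothetical values for one participant, ordered AOI 0.1, 0.3, 0.7, 0.9.
correct_ratio = [0.20, 0.45, 0.80, 0.90]   # O_i, key-press data (illustrative only)
fixation_times = [12, 30, 95, 140]         # m_i, eye tracking data (illustrative only)

a_key = fractional_attention(correct_ratio)
a_eye = fractional_attention(fixation_times)
print([round(a, 3) for a in a_key])
print([round(a, 3) for a in a_eye])
```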

Using Equation (10), the theoretical results of the proposed supervisors' visual attention allocation model could be calculated as:

$$A\_{p\\_i} = \frac{-P\_i \ln(1 - V\_i)}{\sum\_{i=1}^{n} -P\_i \ln(1 - V\_i)}, (i = 1, 2, 3, 4) \tag{13}$$

Matsui's and Wanyan's model was used as a comparison model; their model was used for aircraft pilots [15,42]. The theoretical results of their model can be calculated using Equation (14). This model is referred to as the Matsui's Model, as he was the first to create the basic method:

$$A\_{m\\_i} = \frac{P\_i V\_i}{\sum\_{i=1}^n P\_i V\_i}, (i = 1, 2, 3, 4) \tag{14}$$
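For comparison, the following Python sketch evaluates Matsui's Model for the same four membership degrees, normalizing *PiVi* by its sum so that the ratios add up to one. The function names are assumptions for this illustration.

```python
import math


def binary_entropy(v: float) -> float:
    """h(V) = -V ln V - (1 - V) ln(1 - V), with 0 ln 0 taken as 0."""
    return -sum(x * math.log(x) for x in (v, 1.0 - v) if x > 0.0)


def matsui_attention(values):
    """Matsui's Model: A_i = P_i V_i / sum(P_i V_i), with P_i from Equation (8)."""
    weights = [math.exp(binary_entropy(v)) for v in values]
    total = sum(weights)
    evals = [(w / total) * v for w, v in zip(weights, values)]
    norm = sum(evals)
    return [c / norm for c in evals]


values = [0.1, 0.3, 0.7, 0.9]
a_matsui = matsui_attention(values)
for v, a in zip(values, a_matsui):
    print(f"V = {v:.1f} -> A = {100 * a:.1f}%")
```

With these inputs, the ratio predicted for the section with a membership degree of 0.9 is slightly lower than the ratio for 0.7, which is the non-monotonic behavior that the proposed model is designed to remove.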

The experiment aimed to compare *Ak\_i* (Key), *Ae\_i* (Eye), *Ap\_i* (Proposed), and *Am\_i* (Matsui's). We adopted the SPSS 25.0 statistics software (developed by IBM, California, CA, USA) to process the data. The results are expressed as the mean ± standard deviation (m ± s). Bivariate Pearson correlation analysis was used to analyze the relationship between the theoretical models and the experimental results. Because the main difference between Matsui's Model and the proposed model lies in the sections with a high membership degree of importance, the one-sample T-test was used to compare the two experimental results with the two theoretical results at those sections.
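The two statistics named above can be sketched with the Python standard library. This is only an illustration of the calculations, not a replacement for the SPSS analysis; all data values are hypothetical, and the p-values (which require the t-distribution) are not computed here.

```python
import math
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def one_sample_t(sample, mu):
    """t statistic for testing whether the sample mean differs from mu."""
    n = len(sample)
    return (mean(sample) - mu) / (stdev(sample) / math.sqrt(n))


# Hypothetical fractional attention over the four sections (illustrative only).
model = [0.02, 0.11, 0.36, 0.51]
measured = [0.05, 0.15, 0.35, 0.45]
print("r =", pearson_r(model, measured))

# Hypothetical per-participant fractions at one AOI, tested against a model value.
aoi_sample = [0.48, 0.52, 0.50, 0.47, 0.55]
print("t =", one_sample_t(aoi_sample, mu=0.51))
```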

#### **5. Results**

#### *5.1. Theoretical and Experimental Results*

Through the information value, *Vi*, pre-given by the experts for the four sections, in one scene the section of the air intake fan in the station hall (intake@hall) had 0.1 membership degrees of information importance, the section of the air outtake fan in the station hall (outtake@hall) had 0.3; and the section of the air outtake fan in the platform (outtake@platform) had 0.7. The section of the air intake fan in the platform (intake@platform) had 0.9 membership degrees of information importance.

The fractional attention, *Ai* (%), of each section can be predicted by both Matsui's Model, *Am\_i*, and the proposed model, *Ap\_i*. The theoretical values are shown in Table 1. There was a significant difference between the two models in the section that had a high membership degree of importance. The proposed model, *Ap\_i*, monotonically increased with the information value, *Vi*, while Matsui's Model, *Am\_i*, did not.


**Table 1.** Information value *Vi* and theoretical values of Matsui's model *Am\_i*, proposed model *Ap\_i*.

The experimental results of the key-press response are shown in Table 2. The key press results showed that a higher information value, *Vi*, led to a higher correct response ratio, *Oi*. This indicated that supervisors paid more attention to the information that had a higher information value, *Vi*, and obtained a higher ratio of correct responses, *Oi*.



The experimental results of the eye tracking are shown in Table 3. The results showed a similar attention tendency as the key-press results. A higher information value, *Vi*, led to more fixation points on the higher-value sections.


**Table 3.** Experimental values based on the eye tracking data.

The eye tracking results provided the most practical evidence of the supervisors' attention allocation strategy. Figure 3 shows the fixation points of one participant. The figure shows that the participant paid more attention to the section that had a higher information value, *Vi* (AOI 0.9 > AOI 0.7 > AOI 0.3 > AOI 0.1).

**Figure 3.** Fixation points of the eye tracking data on the screen.

#### *5.2. Comparison of Theoretical and Experimental Results*

The fractional attention values of the key-press response experiment and the eye movement tracking experiment as well as the two theoretical values are shown in Figure 4.

**Figure 4.** Comparison of the theoretical and experimental results.

As Figure 4 shows, the experimental results supported the proposed model better than Matsui's Model. A correlation analysis of the four results is shown in Table 4. The proposed model was significantly correlated with the participants' experimental behaviors in both Key Press and Eye Tracking (*P* < 0.01). The two experimental behaviors, Key Press and Eye Tracking, were also significantly correlated (*P* < 0.01); this agreement between the two measures confirms that the data analysis method is effective. We also found that the correlation between Matsui's Model and the proposed model was 0.939, which means that the two models are close but not identical. The proposed model was more effective.


**Table 4.** The correlation between models.

\* Correlation was significant at the 0.01 level (two-tailed).

Based on the method used in the proposed model, a significant difference between the two theoretical models was observed for AOI 0.7 and 0.9. The T-test was used to analyze the difference. The results of the statistics are shown in Table 5.


**Table 5.** The one-sample T-test between the models at areas of interest (AOI) 0.7 and 0.9.

\* Significance level is at the 0.05 level (two-tailed).

The statistics showed that the experimental Eye Tracking and Key Press results were not significantly different (*P* > 0.05) from the proposed model at AOI 0.7, but were different from Matsui's Model at AOI 0.7.

For AOI 0.9, the experimental Key Press result differed significantly from the proposed model. A likely reason is that the overload events for AOI 0.9 occurred at random, so the participants might fail to respond to the AOI 0.9 section even while focusing on it. However, the eye tracking results, which provide the more practical evidence, showed no significant difference (*P* > 0.05) from the proposed model.

For Matsui's Model, the experimental results showed a significant difference for AOI 0.7 and AOI 0.9.

#### **6. Discussion**

#### *6.1. Discussion of the Value Priority Function*

The experimental results showed that the proposed model predicts supervisors' visual attention allocation more accurately than Matsui's Model. The improvement in the results from the proposed model was in the high information value, *Vi*, which was due to the proposed value priority function, F(*Vi*), in Equation (5). The role of this function is discussed in depth below.

The proposed value priority function, F(*Vi*), processes the information value, *Vi,* and the processed value is *Vi'*. The proposed model used *Vi'* to present the value priority, whereas Matsui's Model uses the original information value, *Vi*. This finally affected the information cognitive evaluation, *Ci*, process. Therefore, the two theoretical models are based on a different information cognitive evaluation, *Ci*. The fractional cognitive evaluation in Matsui's Model, *Cm\_i*, and the proposed model, *Cp\_i*, can be calculated using Equations (15) and (16), respectively:

$$C\_{m\\_i} = P\_i V\_i \tag{15}$$

$$C\_{p\\_i} = P\_i V\_i' = P\_i \mathbf{F}(V\_i) = -P\_i \ln(1 - V\_i) \tag{16}$$

Assume that the number of the independent information sections, *i*, reaches infinity. Assuming that the corresponding information value, *Vi* (membership degree of the importance), ranges from 0 to 1, the probability of the correct evaluation, *Pi*, can be calculated using Equation (17) based on Equation (8):

$$P\_i = \frac{\exp h(V\_i)}{\int\_{V\_i = 0, i = 0}^{V\_i = 1, i = \infty} \exp h(V\_i)}\tag{17}$$

Along with the information value, *Vi*, the information cognitive evaluation, *Ci*, values based on Matsui's Model, *Cm\_i*, and the proposed model, *Cp\_i*, are shown in Figure 5.

**Figure 5.** The information cognitive evaluation *Ci* based on Matsui's Model, *Cm\_i*, and the proposed model, *Cp\_i*, along with information value, *Vi*.

As the figure shows, the proposed model is more reasonable than Matsui's Model in the following aspects:


$$\begin{array}{l} C\_{m\\_\infty} = \lim\_{V\_i \to 1, i \to \infty} C\_{m\\_i} = 0.006 \\ C\_{p\\_\infty} = \lim\_{V\_i \to 1, i \to \infty} C\_{p\\_i} = \infty \end{array} \tag{18}$$

(3) The proposed *Cp\_i* increases after 0.7822 along with the information value, *Vi*. However, Matsui's *Cm\_i* decreases after 0.7822, which means that a high information value above 0.7822 will not lead to a higher information cognitive evaluation status (Equation (19)), which is not realistic. Therefore, our proposed value priority function, F(*Vi*), is an improvement that corrects the unreasonable part of Matsui's Model:

$$\begin{array}{l} C\_{m\\_i}^{\prime} = \exp((V\_i - 1)\ln(1 - V\_i) - V\_i \ln V\_i) + V\_i \exp((V\_i - 1)\ln(1 - V\_i) - V\_i \ln V\_i)(\ln(1 - V\_i) - \ln V\_i) \\ C\_{m\\_i}^{\prime} = 0 \,\big|\, V\_i = 0.7822 \end{array} \tag{19}$$

(4) As the overall curve of the proposed *Cp\_i* becomes steeper, the attention allocation of the supervisor tends to be more concentrated, and the adjustment of the supervisors' attention allocation is more reasonable.
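The critical value 0.7822 can be reproduced numerically. Up to a constant factor, Matsui's cognitive evaluation is *Vi* exp *h*(*Vi*), and its derivative with respect to *Vi* is exp *h*(*Vi*) [1 + *Vi* (ln(1 − *Vi*) − ln *Vi*)]. Since the exponential factor is always positive, the turning point is the root of the bracketed term. The Python sketch below finds this root by bisection; the function names and the search interval are choices made for this illustration.

```python
import math


def derivative_factor(v: float) -> float:
    """Bracketed factor of d/dV [V exp(h(V))]: 1 + V (ln(1 - V) - ln V)."""
    return 1.0 + v * (math.log(1.0 - v) - math.log(v))


def bisect(f, lo: float, hi: float, tol: float = 1e-10) -> float:
    """Find a root of f in [lo, hi], assuming f changes sign on the interval."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid   # root lies in the upper half
        else:
            hi = mid                # root lies in the lower half
    return 0.5 * (lo + hi)


# The factor is positive at 0.5 and negative near 1, so a root lies in between.
v_crit = bisect(derivative_factor, 0.5, 0.99)
print(f"critical information value: {v_crit:.4f}")
```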

#### *6.2. Discussion of Attention Allocation Models*

The proposed value priority function, F(*Vi*), affects the information cognitive evaluation, *Ci*; *Ci* affects the whole visual attention allocation model, *Ai*. The difference between Matsui's and the proposed model in theory is discussed in depth below.

Based on the different information cognitive evaluation models, *Ci*, the fractional attention in Matsui's Model, *Am\_i*, and the proposed allocation model, *Ap\_i*, can be calculated using Equations (20) and (21) based on Equation (10), respectively:

$$A\_{m\\_i} = \frac{C\_{m\\_i}}{\int\_{V\_i=0, i=0}^{V\_i=1, i=\infty} C\_{m\\_i}} = \frac{P\_i V\_i}{\int\_{V\_i=0, i=0}^{V\_i=1, i=\infty} P\_i V\_i} \tag{20}$$

$$A\_{p\\_i} = \frac{C\_{p\\_i}}{\int\_{V\_i=0, i=0}^{V\_i=1, i=\infty} C\_{p\\_i}} = \frac{P\_i V\_i'}{\int\_{V\_i=0, i=0}^{V\_i=1, i=\infty} P\_i V\_i'} = \frac{P\_i \mathbf{F}(V\_i)}{\int\_{V\_i=0, i=0}^{V\_i=1, i=\infty} P\_i \mathbf{F}(V\_i)} = \frac{-P\_i \ln(1 - V\_i)}{\int\_{V\_i=0, i=0}^{V\_i=1, i=\infty} -P\_i \ln(1 - V\_i)}\tag{21}$$

Along with the information value, *Vi*, the attention allocation based on Matsui's Model, *Am\_i*, and the proposed model, *Ap\_i*, is shown in Figure 6. Based on Equation (17), we added the probability of the correct evaluation *Pi* into the figure. *Pi* is a factor of information fuzziness tendency, which affects the model.

**Figure 6.** The ratios of attention allocation based on Matsui's Model, *Am\_i*, and the proposed model, *Ap\_i*, and the probability of the correct evaluation, *Pi*, along with the information value, *Vi.*

As the figure shows, the proposed model, *Ap\_i*, and Matsui's Model, *Am\_i*, are significantly different:


$$\begin{array}{l} A\_{m\\_i}^{\prime} = \exp((V\_i - 1)\ln(1 - V\_i) - V\_i \ln V\_i) + V\_i \exp((V\_i - 1)\ln(1 - V\_i) - V\_i \ln V\_i)(\ln(1 - V\_i) - \ln V\_i) \\ A\_{m\\_i}^{\prime} = 0 \,\big|\, V\_i = 0.7822 \end{array} \tag{22}$$

(3) The probability of the correct evaluation, *Pi*, reaches the highest value when the information value, *Vi*, = 0.5 (Equation (23)), which means that the supervisor has a higher successful probability to process the information in the visual section that has medium information value, *Vi*:

$$\begin{array}{l} P\_0 = \lim\_{V\_i \to 0, i \to 0} P\_i = 0.006 \\ P\_{\text{mid}} = \lim\_{V\_i \to 0.5, i \to \text{mid}} P\_i = 0.012 \\ P\_{\infty} = \lim\_{V\_i \to 1, i \to \infty} P\_i = 0.006 \end{array} \tag{23}$$

(4) The proposed attention allocation model is not significantly different from Matsui's Model before the intersection near the critical point in Matsui's Model. After the intersection, the ratio of the attention allocation tended to be a steep curve. This means that the participants focused on the highest value information.
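The behavior of *Pi* described in item (3) can be illustrated by discretizing the information value. The Python sketch below assumes an evenly spaced grid of 101 values between 0 and 1; this grid size is an assumption made for the example (the text does not state the discretization), chosen because it gives probabilities of the same order as those quoted above.

```python
import math


def binary_entropy(v: float) -> float:
    """h(V) = -V ln V - (1 - V) ln(1 - V), with 0 ln 0 taken as 0."""
    return -sum(x * math.log(x) for x in (v, 1.0 - v) if x > 0.0)


n = 101  # assumed number of information sections on the grid
grid = [i / (n - 1) for i in range(n)]
weights = [math.exp(binary_entropy(v)) for v in grid]
total = sum(weights)
probs = [w / total for w in weights]  # discrete counterpart of Equation (17)

peak_index = max(range(n), key=probs.__getitem__)
print("P peaks at V =", grid[peak_index])
print("P at V = 0, 0.5, 1:", probs[0], probs[peak_index], probs[-1])
```

Because exp *h*(0.5) = 2 and exp *h*(0) = exp *h*(1) = 1, the probability at the midpoint is twice the probability at either end, regardless of the grid size.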

In summary, the proposed model is more reasonable and effective, as shown through the above analysis. The experimental results supported the above theoretical discussion. The proposed model is an accurate quantitative method that can be used to analyze the attention allocation strategy of supervisors.

The proposed model provides a basic quantification of attention allocation using hybrid entropy. Other current models built on Matsui's Model, which consider fatigue, effort, salience, and information detection efficiency [15–17], can replace the basic Matsui Model with the proposed model to improve their results. These factors were deliberately minimized in the experiment in this article so that they would not mask the effect of the proposed value priority function.

#### **7. Conclusions**

By referencing the definition of the information amounts, the value priority function was proposed in this paper. Considering supervisors as information processors, the information fuzziness was quantified based on hybrid entropy theory. Supervisors tend to pay more attention to important and fuzzy information. Combining these two aspects, a quantitative visual attention allocation model for supervisors was built. The experiment showed that the proposed model was more effective than the current model. The difference between the proposed theory and the current theory was further discussed, which showed that the proposed model has mathematical properties that agree better with practical applications and compensate for the deficiency in the current model.

Further Application: Using the proposed model, visual attention behavior can be predicted before the task. This will help researchers analyze supervisors' behaviors and evaluate the ergonomics of the HMI. The risk of cognitive deficits can be detected early, and targeted attention training can help supervisors schedule limited behavioral resources. Optimizing the HMI design with human behavior will make the system safer and more efficient.

**Author Contributions:** Conceptualization, W.F. and H.B.; Data Curation, P.W.; Funding Acquisition, W.F. and B.G.; Methodology, H.B.; Project Administration, W.F.; Software, P.W.; Validation, B.G. and P.W.; Writing–Original Draft, H.B.; Writing–Review & Editing, H.B.

**Funding:** This research was funded by the National Natural Science Foundation of China (grant number 51575037) and the Research Foundation of State Key Laboratory of Rail Traffic Control and Safety (grant number RCS2018ZT009).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Saliency Detection Based on the Combination of High-Level Knowledge and Low-Level Cues in Foggy Images**

#### **Xin Zhu <sup>1</sup>, Xin Xu <sup>1,2,3,</sup>\* and Nan Mu <sup>1</sup>**


Received: 28 February 2019; Accepted: 3 April 2019; Published: 6 April 2019

**Abstract:** A key issue in saliency detection of foggy images in the wild for human tracking is how to effectively define the less obvious salient objects. The leading cause is that the contrast and resolution are reduced by light scattering through fog particles. In this paper, to suppress the interference of the fog and acquire boundaries of salient objects more precisely, we present a novel saliency detection method for human tracking in the wild. In our method, a combination of object contour detection and salient object detection is introduced. The proposed model can not only maintain the object edge more precisely via object contour detection, but also ensure the integrity of salient objects, and finally obtain accurate saliency maps of objects. Firstly, the input image is transformed into HSV color space, and the amplitude spectrum (AS) of each color channel is adjusted to obtain the frequency domain (FD) saliency map. Then, the contrast of the local-global superpixel is calculated, and the saliency map of the spatial domain (SD) is obtained. We use Discrete Stationary Wavelet Transform (DSWT) to fuse the cues of the FD and SD. Finally, a fully convolutional encoder–decoder model is utilized to refine the contour of the salient objects. Experimental results demonstrate that the presented model can remove the influence of fog efficiently, and the performance is better than 16 state-of-the-art saliency models.

**Keywords:** saliency detection; foggy image; spatial domain; frequency domain; object contour detection; discrete stationary wavelet transform

#### **1. Introduction**

Foggy environments strongly reduce visibility for human tracking in the wild because dust particles are suspended in the air. As a result, foggy images typically have low contrast and faded color features, which makes the main objects difficult to recognize. Saliency detection is advantageous for this task. It is a cognitive process that simulates the attention mechanism of the human visual system (HVS) [1–3], which has an astonishing capability to rapidly judge the most attractive image region in a scene for further processing in the human brain.

In the past several years, the detection of visually salient objects has drawn much attention in many image processing applications. Saliency detection in foggy images plays a pivotal part in fields such as human tracking in the wild, object recognition, object segmentation, remote sensing, intelligent vehicles, and surveillance. So far, many kinds of defogging techniques [4–7] have been proposed, and they can reach comparatively good performance.

At present, image processing methods in foggy weather can be split into image enhancement and image restoration methods.

Image restoration methods include the dark channel prior algorithm [8], visual enhancement algorithms for uniform and non-uniform fog [9], and defogging algorithms based on deep learning [10]. Image restoration based on a physical model explores the physical mechanism by which fog degrades images and establishes a general foggy weather degradation model. The degradation model is then solved to compensate for the loss of image information caused by the degradation process, which improves the quality of foggy images. However, because the image restoration algorithm relies on a physical model of atmospheric scattering, it requires more prior knowledge. Image enhancement methods can be divided into contrast enhancement and color enhancement. Representative image enhancement algorithms include histogram equalization [11], Retinex [12], and wavelet-based approaches [13,14]. However, the main drawbacks of these algorithms include: (1) High complexity makes their execution time-consuming, which makes it difficult to guarantee the real-time performance of saliency detection. (2) During the process of dehazing, the visibility of the foreground and background is increased simultaneously, so the recognition of salient objects is disturbed to some extent. (3) Image color distortion means that visual features such as the edge and contour of the target cannot be accurately extracted.

Due to the low-resolution and low-contrast characteristics of foggy images, traditional spatial or frequency-based saliency models have a poor performance under fog environment. In view of this problem, this paper presents a frequency-spatial saliency model based on the atmospheric scattering distribution of foggy images, which can obtain effective information under foggy weather. Since the traditional machine learning method leads to the loss of boundary information, the object contour detection method of deep learning is added to enrich the edge information of the saliency map. As illustrated in Figure 1, the object contour detection method obviously improves the quality of the saliency map.

(a) input image (b) ground truth (c) traditional method (d) our method

**Figure 1.** Example of salient object detection in foggy images.

In this paper, traditional methods and deep learning methods are combined to effectively detect salient objects for human tracking in the wild. In step one, the frequency domain (FD) and the spatial information are fused by DSWT. In step two, we utilize the object contour detection method of deep learning to obtain the edge map of the object. Last, we obtain the final saliency map of the foggy image by fusing the two maps. Specifically, in step one, the foggy image is first transformed into HSV color space, and the amplitude information of the FD is utilized to obtain feature maps in each channel. Then, the image is segmented into superpixels, and the saliency of each superpixel is computed by the local-global spatial contrast. Finally, the DSWT is applied to fuse the FD and spatial domain (SD) saliency maps, and the Gaussian filter is employed to refine the results. The flow diagram of the presented method is shown in Figure 2. The experimental results show that the proposed method can effectively detect salient objects under fog conditions.

**Figure 2.** Flowchart of the proposed salient object detection model in single foggy image.

#### **2. Related Works**

Saliency detection is generally driven by low-level knowledge and high-level cues. Therefore, visual saliency computation under foggy environments for human tracking in the wild can typically be categorized into two classes: saliency computational models and object contour detection approaches. Traditional saliency computational models are data-driven and primarily use low-level image features, whereas top-down object contour detection models are task-driven and usually use cognitive visual features.

#### *2.1. Saliency Computational Models*

From the perspective of information processing, traditional saliency models can be divided into two categories: SD and FD based models.

The SD saliency models usually build their algorithms on contrast analysis. Itti et al. [15] presented a well-known saliency model that uses the center-surround differences of multiple features. Goferman et al. [16] introduced a context-aware saliency approach, which measures the similarity of image patches in a local-global manner. Xu et al. [17] proposed a superpixel-level saliency method that uses a support vector machine (SVM) to train unique features. Cheng et al. [18] considered histogram information and spatial relations, and then developed a global contrast-based saliency algorithm. Peng et al. [19] integrated tree-structured sparsity-inducing and Laplacian regularizations to construct a structured matrix decomposition model. However, most of the features used in these spatial models are not ideal for foggy images.

The FD saliency models build their algorithms by converting the image to a spectrum. Hou and Zhang [20] proposed a spectral residual saliency method, which uses the log spectra to represent images. Guo et al. [21] extended the FPT algorithm and represented four image features with a quaternion. They then used the Fourier transform of the quaternion to acquire the saliency map. Achanta et al. [22] built a frequency-tuned method, which estimates the contrast of several features. Bian and Zhang [23] used the color and brightness characteristics of each pixel to calculate the saliency map. Li et al. [24] explored saliency detection by analyzing the scale-space information of the amplitude spectrum (AS). Li et al. [25] studied image saliency in the FD to design their model. Arya et al. [26] integrated local and global features to propose a biologically feasible FD saliency algorithm. These existing FD saliency models do not work well on foggy images because the low-frequency information representing salient objects is greatly reduced in foggy weather.

#### *2.2. Object Contour Detection*

Object contour detection is a traditional computer vision problem with a long history. The traditional computer vision methods include the Roberts, Prewitt, Sobel, and Canny algorithms, among others.

In object contour detection, Roberts' algorithm does not smooth the image, so image noise is generally not well suppressed, and part of the edge can be lost during localization. However, Roberts' algorithm has higher positioning accuracy and works better on low-noise images with steep edges. The Prewitt algorithm can suppress noise. It does so by pixel averaging, which is equivalent to low-pass filtering the image; as a result, Prewitt's algorithm is inferior to Roberts' algorithm in edge positioning. The Sobel edge detection algorithm [27] is used in practice when efficiency requirements are high and fine texture is not of interest. Sobel is directional: it can detect vertical edges, horizontal edges, or both. The Sobel algorithm improves on the Prewitt algorithm and suppresses noise better through its smoothing. The Canny algorithm [28] pays more attention to the edge information reflected by pixel gradient changes and does not consider the actual object. As a result, it loses spatial information of the image, and for images in which the edge color is similar to the background color, the edge information may be lost. The Canny algorithm is one of the best-performing traditional first-order differential edge detectors and has stronger denoising capabilities than the Prewitt and Sobel algorithms. On the other hand, it tends to smooth away some edge information, and its detection procedure is more complicated. In general, traditional edge detection algorithms use the maximum gradient or the zero crossing of the second derivative to obtain the edges of the image. Although these algorithms have good real-time performance, they are sensitive to interference, cannot effectively overcome the influence of noise, and localize edges poorly.
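The difference between the Prewitt and Sobel operators discussed above can be illustrated with a small sketch. The kernel coefficients are the standard textbook definitions and are not given in this paper; `correlate_valid` is a hypothetical helper written only for this illustration, and the 4 × 4 step image is made-up data.

```python
# Horizontal-gradient kernels (standard textbook definitions, assumed here).
PREWITT_X = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]  # uniform averaging across rows
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]    # center row weighted twice


def correlate_valid(img, kernel):
    """Slide the kernel over the image (no padding) and sum the products."""
    k = len(kernel)
    rows, cols = len(img) - k + 1, len(img[0]) - k + 1
    return [
        [sum(kernel[i][j] * img[r + i][c + j] for i in range(k) for j in range(k))
         for c in range(cols)]
        for r in range(rows)
    ]


# A vertical step edge: dark columns on the left, bright columns on the right.
step = [[0, 0, 1, 1]] * 4
print("Prewitt:", correlate_valid(step, PREWITT_X))
print("Sobel:  ", correlate_valid(step, SOBEL_X))
```

Both kernels respond to the vertical edge; Sobel's heavier center weight gives a stronger response for the same edge, which is one way to read the claim that it builds on Prewitt's averaging.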

With the development of deep learning, the fast edge, HED, and RCF algorithms were introduced. The fast edge algorithm [29] uses random forests to generate edge information, with ground truth used to extract the edges of image patches. This reflects not only the actual object but also the spatial information of the picture. HED [30] uses a network modified from VGG and extracts feature information from the whole image through multi-scale fusion, multiple losses, and other methods. Similarly, it can reflect the feature information of the edge. Compared with HED, RCF [31] takes advantage of the features of all convolutional layers in each stage, and the use of more features improves the results. Inspired by, but different from, these deep learning models, we employ a fully convolutional encoder-decoder network to guide better salient object detection.

In our previous work, we trained a fully convolutional encoder-decoder network using Caffe to optimize the performance of saliency detection. The proposed fully convolutional encoder-decoder network can learn the object contour to better represent the saliency map in low-contrast foggy images. The key contributions of this paper are summarized below: (1) We compute the saliency map via a frequency-spatial fusion saliency model based on DSWT. (2) This framework is further refined by a fully convolutional encoder-decoder model based on fully convolutional networks [32] and deconvolutional networks [33]. (3) The presented saliency computational model performs better on foggy images than traditional models.

#### **3. Proposed Saliency Detection Method**

In this paper, we propose a salient object computational model that fuses two components: a traditional method that combines frequency and spatial cues through DSWT, and a deep learning-based edge detection method. Together they obtain the saliency map of foggy images effectively.

This section first analyses the features of foggy images, including the imaging model and the effect of fog distortion on images, in Section 3.1. We describe the FD based algorithm and some important computational formulas in Section 3.2. Then we give a detailed description of the SD based algorithm in Section 3.3. Section 3.4 provides the implementation of the discrete stationary wavelet transform based image fusion, which combines the two algorithms above to generate an elementary saliency map. Finally, Section 3.5 introduces the object contour detection method, which refines the contour of the saliency map and makes the position of the salient object more precise.

#### *3.1. Analysis of Foggy Image Features*

#### 3.1.1. Imaging Model of Foggy Image

Under fog conditions, the atmosphere contains many tiny water droplets and aerosols, which seriously affect the propagation of light and reduce image clarity and contrast. For color images in particular, fog also produces severe color distortion and misalignment. From the perspective of computer vision, there are many models [34,35] that are widely used to describe the information of foggy images. Narasimhan and Nayar [35] proposed the following imaging model of foggy images:

$$I\_x^c = J\_x^c t\_x + A^c (1 - t\_x) \tag{1}$$

where $c \in \{r, g, b\}$ denotes the color channel of the image and $I\_x^c$ denotes the foggy image captured by an imaging device. $J\_x^c$ and $t\_x$ denote the scene reflected light and scene transmissivity, respectively. $A^c$ is a constant and represents the ambient light.

In Equation (1), $J\_x^c t\_x$ and $A^c(1 - t\_x)$ denote the direct attenuation [10] and air light [36], respectively. Direct attenuation is defined as the radiance of the scene and its attenuation in the medium. Air light, on the other hand, is caused by previously scattered light and changes the color of the scene. When the atmosphere is homogeneous, the transmission *t* can be expressed as follows:

$$t\_x = e^{-\beta d\_x} \tag{2}$$

where β denotes the scattering coefficient of the atmosphere.

Equation (2) shows that the scene brightness decays exponentially with the scene depth *d*.
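As a minimal numerical sketch of Equations (1) and (2), the snippet below evaluates the model for a single color channel of a single pixel. The function names and all numeric values (scene radiance, ambient light, scattering coefficient, depths) are hypothetical and chosen only for illustration.

```python
import math


def transmission(depth, beta):
    """Equation (2): t = exp(-beta * d) for a homogeneous atmosphere."""
    return math.exp(-beta * depth)


def foggy_pixel(scene, depth, airlight, beta):
    """Equation (1) for one color channel: I = J * t + A * (1 - t)."""
    t = transmission(depth, beta)
    return scene * t + airlight * (1.0 - t)


# Hypothetical values: a dark object (J = 0.2) under bright ambient light (A = 0.9).
for d in (0.0, 10.0, 50.0, 200.0):
    print(d, round(foggy_pixel(0.2, d, 0.9, beta=0.02), 3))
```

As the depth grows, the direct attenuation term shrinks and the observed value drifts toward the ambient light, which is the loss of contrast the section describes.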

#### 3.1.2. Effect of Foggy Distortion on Images

The degrading effect of fog on an image [37] is called fog distortion. Fog distortion brings great challenges to the saliency computation of images. Its effect on image quality is mainly concentrated in three aspects:


#### *3.2. FD Based Algorithm*

Given a foggy image, it is first transformed into HSV color space, which has been shown to strongly stimulate the human visual cortex in foggy images [38]; thus, the hue, saturation, and value features of the H, S, and V channels can be considered important indicators for detecting saliency.

Then, the H, S, and V channels are converted into FD respectively by conducting the Fast Fourier Transform (FFT) as:

$$F(u, v) = \sum\_{x=0}^{M-1} \sum\_{y=0}^{N-1} f(x, y) e^{-j2\pi\left(\frac{ux}{M} + \frac{vy}{N}\right)},\tag{3}$$

where *M* and *N* denote the image's width and height. *f*(*x*, *y*) and *F*(*u*, *v*) denote image pixels in SD and FD, respectively.

*A*(*u*, *v*) and *P*(*u*, *v*) represent the AS and the phase spectrum (PS), respectively, and can be computed via:

$$P(u,v) = \text{angle}(F(u,v)),\tag{4}$$

$$A(u, v) = \text{abs}(F(u, v)),\tag{5}$$

where abs(·) and angle(·) denote the AS function and the PS function, respectively. angle(·) returns the phase angle (in radians) of each element of the complex array *F*(*u*, *v*); this angle lies between ±π. The amplitude spectrum *A*(*u*, *v*) = abs(*F*(*u*, *v*)) is the magnitude of each element of *F*(*u*, *v*) in the frequency domain.

For foggy images, low amplitude in the FD can be regarded as a cue of the object, and high amplitude can represent the fog background. Therefore, the high-amplitude information is restrained to highlight the object region. In other words, the salient object can be extracted by removing the peaks of the AS via:

$$A(u, v) = \text{medfilt2}(A(u, v)),\tag{6}$$

where the median filter function is represented as medfilt2(·), which can effectively eliminate the peaks of *A*(*u*, *v*). medfilt2(*I*) performs median filtering of the image I in two dimensions. Each output pixel contains the median value in a 3-by-3 neighborhood around the corresponding pixel in the input image.

Next, a new FD map is computed via:

$$F(u,v) = \left| A(u,v) \right| e^{-jP(u,v)},\tag{7}$$

where the absolute value is represented as |·|.

The FD map is then transformed back to SD by performing the Inverse Fast Fourier Transform (IFFT) via:

$$f(x,y) = \frac{1}{MN} \sum\_{u=0}^{M-1} \sum\_{v=0}^{N-1} F(u, v) e^{j2\pi\left(\frac{ux}{M} + \frac{vy}{N}\right)}.\tag{8}$$

The saliency maps (denoted as *Hmap*, *Smap*, and *Vmap*) of each channel in HSV color space can be acquired by (3)–(8).

Finally, we calculate the sum of *Hmap*, *Smap*, and *Vmap*, and obtain the map of FD saliency (represented as *S*1).
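The per-channel pipeline of Equations (3)–(8) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: a naive DFT replaces the FFT so that the example needs only the standard library, the 3-by-3 median filter pads with zeros, amplitude and phase are recombined in the standard polar form *A*·e^{*jP*}, and the magnitude of the inverse transform is taken as the channel map. The helper names and the toy image are hypothetical.

```python
import cmath


def dft2(img, inverse=False):
    """Naive 2-D DFT of a list-of-lists image (Equations (3) and (8))."""
    rows, cols = len(img), len(img[0])
    sign = 1 if inverse else -1
    out = [[0j] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            acc = 0j
            for x in range(rows):
                for y in range(cols):
                    angle = 2 * cmath.pi * (u * x / rows + v * y / cols)
                    acc += img[x][y] * cmath.exp(sign * 1j * angle)
            out[u][v] = acc / (rows * cols) if inverse else acc
    return out


def median3x3(a):
    """3-by-3 median filter with zero padding (the role of Equation (6))."""
    rows, cols = len(a), len(a[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = sorted(
                a[i][j] if 0 <= i < rows and 0 <= j < cols else 0.0
                for i in range(r - 1, r + 2)
                for j in range(c - 1, c + 2)
            )
            out[r][c] = window[4]  # middle of the nine sorted values
    return out


def fd_channel_map(channel):
    """Filter the amplitude spectrum, keep the phase, and transform back."""
    spectrum = dft2(channel)
    amplitude = median3x3([[abs(z) for z in row] for row in spectrum])
    rebuilt = [
        [amplitude[u][v] * cmath.exp(1j * cmath.phase(spectrum[u][v]))
         for v in range(len(channel[0]))]
        for u in range(len(channel))
    ]
    return [[abs(z) for z in row] for row in dft2(rebuilt, inverse=True)]


# Toy channel: bright, uniform "fog" with one darker pixel standing in for an object.
toy = [[1.0] * 4 for _ in range(4)]
toy[1][2] = 0.2
for row in fd_channel_map(toy):
    print([round(v, 3) for v in row])
```

Summing the maps obtained this way for the H, S, and V channels would give the FD saliency map *S*1 described above.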

#### *3.3. SD Based Algorithm*

To reduce the amount of computation and guarantee the integrity of the object, the input foggy image is first divided into superpixels (denoted as *SP*(*i*), *i* = 1, ···, *Num*, *Num* = 300) through the simple linear iterative clustering (SLIC) algorithm [39]. Then, the obtained *Hmap*, *Smap*, and *Vmap* of the H, S, and V channels are regarded as the features of saliency.

The local-global saliency of every superpixel *SP*(*i*) in *Hmap* can be obtained through:

$$S\_{H\_{map}}(i) = 1 - \exp\left\{ -\frac{1}{Num - 1} \sum\_{j=1, j\neq i}^{Num} \frac{d\_{H\_{map}}(SP(i), SP(j))}{1 + E(SP(i), SP(j))} \right\},\tag{9}$$

where $d\_{H\_{map}}(SP(i), SP(j))$ is the difference between the means of *SP*(*i*) and *SP*(*j*) in *Hmap*, and *E*(*SP*(*i*), *SP*(*j*)) is the mean Euclidean distance between *SP*(*i*) and *SP*(*j*).

Applying (9) to *Smap* and *Vmap* in the same way gives the saliency values $S\_{S\_{map}}(i)$ and $S\_{V\_{map}}(i)$ of superpixel *SP*(*i*).

In the end, the saliency value of each superpixel *SP*(*i*) is the sum of $S\_{H\_{map}}(i)$, $S\_{S\_{map}}(i)$, and $S\_{V\_{map}}(i)$, and *S*<sub>2</sub> is the resulting saliency map of the SD.
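A minimal sketch of the local-global contrast in Equation (9) is given below. It assumes that the feature difference is the absolute difference of the superpixel means and that the spatial term is the Euclidean distance between superpixel centroids in normalized coordinates; both are simplifications for illustration, and the function name and sample data are hypothetical.

```python
import math


def superpixel_saliency(means, centers):
    """Equation (9): contrast to all other superpixels, damped by distance."""
    num = len(means)
    scores = []
    for i in range(num):
        total = 0.0
        for j in range(num):
            if j == i:
                continue
            feature_diff = abs(means[i] - means[j])      # d(SP(i), SP(j)), assumed form
            spatial = math.dist(centers[i], centers[j])  # E(SP(i), SP(j)), assumed form
            total += feature_diff / (1.0 + spatial)
        scores.append(1.0 - math.exp(-total / (num - 1)))
    return scores


# Four hypothetical superpixels: three similar background regions and one outlier.
means = [0.1, 0.1, 0.1, 0.9]
centers = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
print([round(s, 3) for s in superpixel_saliency(means, centers)])
```

The superpixel whose mean differs most from the others receives the highest score, and nearby superpixels contribute more to the contrast than distant ones.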

#### *3.4. DSWT Based Image Fusion*

The presented model employs a 2-level DSWT to remove noise from the saliency maps and to perform the wavelet decomposition on them.

The low-pass and high-pass filters of the 1-level transform are denoted *h*1[*n*] and *g*1[*n*]. Upsampling the 1-level filters gives the 2-level filters *h*2[*n*] and *g*2[*n*]. Next, we obtain the approximation low-pass subband *A*2 and the horizontal, vertical, and diagonal high-frequency subbands *H*2, *V*2, and *D*2. Because the DSWT does not downsample, the low-pass and high-pass subbands have the same size as the initial image. Therefore, detail information is adequately preserved, and the DSWT is translation invariant.

Following the steps above, the FD saliency map *S*<sub>1</sub> and the SD saliency map *S*<sub>2</sub> are obtained. Then, we fuse the two maps through the 2-level DSWT as:

$$[A\_1 S\_1, H\_1 S\_1, V\_1 S\_1, D\_1 S\_1] = \text{swt2}(S\_1, 1, \text{'sym2'}),\tag{10}$$

$$[A\_1 S\_2, H\_1 S\_2, V\_1 S\_2, D\_1 S\_2] = \text{swt2}(S\_2, 1, \text{'sym2'}),\tag{11}$$

$$[A\_2 S\_1, H\_2 S\_1, V\_2 S\_1, D\_2 S\_1] = \text{swt2}(A\_1 S\_1, 1, \text{'sym2'}),\tag{12}$$

$$[A\_2 S\_2, H\_2 S\_2, V\_2 S\_2, D\_2 S\_2] = \text{swt2}(A\_1 S\_2, 1, \text{'sym2'}),\tag{13}$$

where the multilevel DSWT is represented as swt2(·), which performs a multilevel 2-D stationary wavelet decomposition using either an orthogonal or a biorthogonal wavelet. Equations (10)–(13) compute the stationary wavelet decomposition of the real-valued 2-D or 3-D matrix at 1-level by using 'sym2'. The output three-dimensional array $A\_i S\_j$ holds the *i*-level low-frequency approximation coefficients of saliency map $S\_j$ obtained with the sym2 filter, and $H\_i S\_j$, $V\_i S\_j$, and $D\_i S\_j$ hold the high-frequency coefficients in the horizontal, vertical, and diagonal directions, respectively.

Next, the 2-level fusion is calculated using the following formulas:

$$A\_2 S\_f = 0.5 \times (A\_2 S\_1 + A\_2 S\_2),\tag{14}$$

$$H\_2 S\_f = D \cdot H\_2 S\_1 + \overline{D} \cdot H\_2 S\_2, \; D = \left( |H\_2 S\_1| - |H\_2 S\_2| \right) \ge 0,\tag{15}$$

$$V\_2 S\_f = D \cdot V\_2 S\_1 + \overline{D} \cdot V\_2 S\_2, \; D = \left( |V\_2 S\_1| - |V\_2 S\_2| \right) \ge 0,\tag{16}$$

$$D\_2 S\_f = D \cdot D\_2 S\_1 + \overline{D} \cdot D\_2 S\_2, \; D = \left( |D\_2 S\_1| - |D\_2 S\_2| \right) \ge 0. \tag{17}$$

The 1-level fusion is calculated using the following formulas:

$$A\_1 S\_f = \text{iswt2}(A\_2 S\_f, H\_2 S\_f, V\_2 S\_f, D\_2 S\_f, \text{'sym2'}), \tag{18}$$

$$H\_1 S\_f = D \cdot H\_1 S\_1 + \overline{D} \cdot H\_1 S\_2, \; D = \left( |H\_1 S\_1| - |H\_1 S\_2| \right) \ge 0,\tag{19}$$

$$V\_1 S\_f = D \cdot V\_1 S\_1 + \overline{D} \cdot V\_1 S\_2, \; D = \left( |V\_1 S\_1| - |V\_1 S\_2| \right) \ge 0,\tag{20}$$

$$D\_1 S\_f = D \cdot D\_1 S\_1 + \overline{D} \cdot D\_1 S\_2, \; D = \left( |D\_1 S\_1| - |D\_1 S\_2| \right) \ge 0,\tag{21}$$

where the inverse DSWT function is represented as iswt2(·). For example, X = iswt2(*A*, *H*, *V*, *D*, 'sym2') reconstructs the matrix X based on the multilevel stationary wavelet decomposition structure [*A*, *H*, *V*, *D*], as in Equation (18) and Equation (22).

Then, the fused image is calculated using the following formula:

$$\text{Salmap} = \text{iswt2}(A\_1 S\_f, H\_1 S\_f, V\_1 S\_f, D\_1 S\_f, \text{'sym2'}). \tag{22}$$

In the end, the proposed method utilizes a Gaussian filter to generate a smoothed saliency map.
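The coefficient-level fusion rules in Equations (14)–(17) and (19)–(21) can be sketched without any wavelet library. The sketch reads *D* as a binary mask that is 1 where the first map's coefficient has the larger (or equal) magnitude, and reads the overlined *D* as its complement; the text does not define the overline, so this reading is an assumption. The wavelet decomposition and reconstruction steps (swt2 and iswt2) are not reproduced here, and the function names and sample subbands are hypothetical.

```python
def fuse_approximation(a1, a2):
    """Equation (14): average the low-frequency approximation subbands."""
    return [[0.5 * (p + q) for p, q in zip(r1, r2)] for r1, r2 in zip(a1, a2)]


def fuse_detail(c1, c2):
    """Equations (15)-(17), (19)-(21): keep the coefficient with larger magnitude.

    The test abs(p) - abs(q) >= 0 plays the role of the mask D; the else
    branch plays the role of its complement (assumed meaning of the overline).
    """
    return [
        [p if abs(p) - abs(q) >= 0 else q for p, q in zip(r1, r2)]
        for r1, r2 in zip(c1, c2)
    ]


# Hypothetical 1 x 2 subbands taken from the FD map (first) and the SD map (second).
print(fuse_approximation([[1.0, 3.0]], [[3.0, 5.0]]))
print(fuse_detail([[0.5, -0.1]], [[-0.2, 0.4]]))
```

Averaging the approximation subbands blends the overall saliency of the two maps, while the max-magnitude rule keeps whichever map has the stronger detail at each position.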

#### *3.5. Object Contour Detection*

The object contour detection model [40] can filter out and ignore edge information in the background and obtain the contour detection result by focusing on the object in the foreground. Inspired by fully convolutional networks and deconvolutional networks [33], an object contour detection model is introduced to extract the target contour and suppress background boundaries.

The layers up to 'fc6' from VGG-16 [41] are used as the encoder of the edge detection network. The deconv6 decoder convolutional layer uses a 1 × 1 kernel, and all remaining decoder convolutional layers use 5 × 5 kernels. All decoder convolutional layers are followed by the ReLU activation function, except for the one next to the output layer, which uses the sigmoid activation function.

We trained the network using Caffe. The parameters of the encoder are fixed during training, and only the parameters of the decoder are optimized. This maintains the generalization ability of the encoder and enables the decoder network to be easily combined with other tasks.

#### **4. Experimental Results**

#### *4.1. Experiment Setup*

**Datasets:** Extensive experiments were conducted on two datasets to assess the performance of the proposed saliency model.

A foggy image dataset (FI) containing 200 foggy images was collected from the Internet. We also provide the corresponding manually labeled ground truths. The FI dataset can be downloaded at https://drive.google.com/file/d/1aqro3U2lU8iRylyfJP1WRKxTWrrFzizh/view?usp=sharing. The other dataset is BSDS500, which includes 500 natural images with boundaries carefully annotated by different users. It is divided into three parts: 200 images for training, 100 for validation, and the other 200 for testing. Because the saliency map obtained by traditional machine learning methods of salient object detection has incomplete edge information, object contour detection is used to optimize it.

**Evaluation Criteria:** For quantitative evaluation, the average computation time, the mean absolute error (MAE) score, the overlapping ratio (OR) score, the precision-recall (PR) curve, the true positive rates (TPRs) and false positive rates (FPRs) curve, the area under the curve (AUC) score, the F-measure curve, and the weighted F-measure (WF) score of the various saliency models are computed.

The precision, recall, TPR and FPR values are generated by converting the saliency map into binary map via thresholding to compare the difference of each pixel with ground truth. β<sup>2</sup> is the parameter to weigh the precision and recall, which is set to 0.3 in our experiments [18,22].

Precision is defined as the ratio of the number of correctly labeled salient pixels to the number of all pixels labeled salient in the binary map. In other words, precision measures how many of the samples judged positive by the model are truly positive. Recall measures how many of the positive samples in the ground-truth map are judged positive by the model:

$$precision = \frac{|TS \cap DS|}{|DS|}, recall = \frac{|TS \cap DS|}{|TS|}. \tag{23}$$

where *TS* and *DS* denote true salient pixels and detected salient pixels by the binary map, respectively.

The TPR is the probability that a positive sample is correctly classified, and the FPR is the probability that a negative sample is misclassified as positive. With TP, FN, FP, and TN denoting the numbers of true positives, false negatives, false positives, and true negatives:

$$\text{TPR} = \frac{\text{TP}}{\text{(TP} + \text{FN)}}, \text{FPR} = \frac{\text{FP}}{\text{(FP} + \text{TN)}},\tag{24}$$

F-measure value, denoted as *F*β, is obtained by computing the weighted harmonic mean of precision and recall.

$$F\_{\beta} = \frac{(1+\beta^2) \times \text{Precision} \times \text{Recall}}{\beta^2 \times \text{Precision} + \text{Recall}},\tag{25}$$

where β<sup>2</sup> is set to 0.3 to weight precision more than recall as suggested in [42].
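Equations (23) and (25) can be illustrated with a short sketch that binarizes a saliency map at one threshold and scores it against a binary ground truth. The function names, the threshold, and the 2 × 2 sample data are hypothetical; sweeping the threshold would trace out the PR and F-measure curves.

```python
def precision_recall(saliency, truth, threshold):
    """Equation (23): compare thresholded saliency (DS) with ground truth (TS)."""
    detected = {
        (r, c)
        for r, row in enumerate(saliency)
        for c, value in enumerate(row)
        if value >= threshold
    }
    salient = {
        (r, c)
        for r, row in enumerate(truth)
        for c, value in enumerate(row)
        if value
    }
    hits = len(detected & salient)
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(salient) if salient else 0.0
    return precision, recall


def f_measure(precision, recall, beta2=0.3):
    """Equation (25) with beta^2 = 0.3, which weights precision more than recall."""
    denom = beta2 * precision + recall
    return (1 + beta2) * precision * recall / denom if denom else 0.0


saliency = [[0.9, 0.8], [0.2, 0.1]]  # hypothetical predicted saliency map
truth = [[1, 0], [1, 0]]             # hypothetical binary ground truth
p, r = precision_recall(saliency, truth, threshold=0.5)
print(p, r, round(f_measure(p, r), 3))
```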

Given a ground truth main-subject region G and a detected main-subject region D, the OR score is the ratio of twice the area of the correctly detected main-subject region to the sum of the areas of the detected and ground truth main-subject regions:

$$\text{OR} = \frac{2 \times A(\text{D} \cap \text{G})}{A(\text{D}) + A(\text{G})}. \tag{26}$$

The percentage of the area under the TPRs-FPRs curve is called the AUC score. It intuitively reflects the classification ability shown by the ROC curve:

$$\text{AUC} = \frac{\sum\_{i \in \text{positiveClass}} rank\_i - \frac{M(1+M)}{2}}{M \times N},\tag{27}$$
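Equation (27) does not define its symbols in the text. The sketch below assumes the usual reading of this rank-sum formula: samples are ranked by score in ascending order starting from 1, *M* is the number of positive samples, and *N* is the number of negative samples. Giving tied scores their average rank is a further assumption, and the function name and sample data are hypothetical.

```python
def auc_from_ranks(scores, labels):
    """Equation (27): rank-sum AUC; tied scores receive their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1  # ranks are 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = avg_rank
        pos = end + 1
    m = sum(1 for y in labels if y)  # number of positive samples (M)
    n = len(labels) - m              # number of negative samples (N)
    rank_sum = sum(r for r, y in zip(ranks, labels) if y)
    return (rank_sum - m * (m + 1) / 2) / (m * n)


# Hypothetical pixel scores with binary ground-truth labels.
print(auc_from_ranks([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
```

The function expects at least one positive and one negative sample; otherwise the denominator *M* × *N* is zero.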

The MAE score is the average per-pixel difference between the predicted saliency map and the ground truth. It is computed by:

$$\text{MAE} = \frac{1}{W \times H} \sum\_{x=1}^{W} \sum\_{y=1}^{H} \left| S(x, y) - G(x, y) \right|, \tag{28}$$

where *S* is the predicted saliency map, *G* is the ground truth, and *W* and *H* are the width and height of the saliency map *S*.
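The two region-level scores in Equations (26) and (28) can be sketched as follows. Areas are counted in pixels on binary masks, which is an assumption about how *A*(·) is measured; the function names and the 2 × 2 sample maps are hypothetical.

```python
def mae(saliency, truth):
    """Equation (28): mean absolute per-pixel difference."""
    height, width = len(saliency), len(saliency[0])
    total = sum(
        abs(s - g)
        for row_s, row_g in zip(saliency, truth)
        for s, g in zip(row_s, row_g)
    )
    return total / (width * height)


def overlap_ratio(detected, truth):
    """Equation (26): 2 * A(D and G) / (A(D) + A(G)) on binary masks."""
    inter = sum(
        1
        for row_d, row_g in zip(detected, truth)
        for d, g in zip(row_d, row_g)
        if d and g
    )
    area_d = sum(1 for row in detected for d in row if d)
    area_g = sum(1 for row in truth for g in row if g)
    return 2 * inter / (area_d + area_g) if area_d + area_g else 0.0


truth = [[1, 0], [1, 0]]                       # hypothetical ground truth
print(mae([[1.0, 0.5], [0.0, 0.0]], truth))    # continuous saliency map
print(overlap_ratio([[1, 1], [0, 0]], truth))  # binarized detection
```

A lower MAE and a higher OR both indicate a saliency map closer to the ground truth.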

#### *4.2. Comparison and Analysis*

The presented method is compared with 16 well-known saliency detection methods: IT [15], CA [16], SMD [19], SR [20], FT [22], MR [41], NP [43], IS [44], LR [45], PD [46], SO [47], BSCA [48], BL [49], GP [50], SC [51], and MIL [52]. The source code provided by the respective authors was used to test on our foggy dataset. Each foggy image in our dataset was processed by all 16 methods to produce the corresponding saliency maps.

Figure 3 shows the PR, TPRs-FPRs, and F-measure curves of various saliency models to evaluate the proposed model quantitatively. The larger the area under the curve is, the better the performance of the saliency model will be.

It can be seen from the three figures that the proposed model is superior to other saliency models, which validates that our saliency result is robust in foggy images.

The top three results in Table 1 are highlighted in red, blue, and green, respectively. Table 1 shows that the presented model achieves the best AUC and OR scores and the second-best MAE and WF scores. These results indicate that the presented saliency model performs better under fog conditions. Moreover, our proposed method has a shorter running time than most of the others, ranking fifth when compared with the other 16 methods.

**Figure 3.** The quantitative comparisons of the proposed saliency model with 16 state-of-the-art models in foggy images.


**Table 1.** The performance comparisons of various saliency models in foggy images.

Figure 4 shows visual comparisons of the saliency detection models on the foggy image dataset, which demonstrate that the saliency maps obtained by our method are much closer to the ground truths. Compared to the baselines, our method performs better: it suppresses background clutter well and generates visually good contour maps. Based on the comparison of the saliency maps, this paper makes a few basic observations:

The IT, NP, IS, and BL models have difficulty suppressing the fog background. Their maps do not highlight the salient objects; instead, they detect the fog background as well, treating fog as foreground. As can be seen from Figure 4, the results are very poor.

Saliency maps of the MR, BSCA, and GP models show that the fog background areas are too bright, and background and foreground are marked as salient regions at the same time. Therefore, the saliency maps are blurred. However, the results are relatively better than those of IT, NP, IS, and BL.

The FT, LR, PD, SC, and CA models detect salient objects in the foreground while also clearly detecting non-salient objects such as trees, streetlights, and roads in the background. Such an algorithm cannot achieve the purpose of saliency detection and is of no use for tracking humans in the wild.

Although the fog background interferes less in SR than in the other models, the salient objects are also not detected. It is the worst model on the test dataset because the features it uses are ineffective in foggy images.

D,QSXWE\*7 F,7 G65 H)7 I13 J&\$K,6 L/5 M3' N05O62 P%6&\$ Q%/ R\*3 S6& T60'U0,/V2856

**Figure 4.** The saliency maps of the proposed model in comparison with 16 models in foggy images. (**a**) testing foggy images, (**b**) ground truth binary masks, (**c**–**r**) saliency maps obtained by 16 state-of-the-art saliency models, (**s**) saliency maps obtained by the proposed model.

The SO, SMD, and MIL models perform poorly on foggy images with a slightly complex background. Although salient objects in the foreground are detected, the brightness of the fog in the image affects the final detection results. In other words, these models are not robust in foggy environments.

The experimental results show that the other models cannot detect salient objects well in foggy weather. It is evident that the proposed method detects salient objects in foggy images better and is more effective than the other models. The reasons are summarized as follows:


#### **5. Conclusions**

In this study, we present a high-efficiency model for salient object detection in foggy images. The proposed model combines a traditional machine learning-based frequency-spatial saliency detection algorithm and a deep learning-based object contour detection algorithm to address salient object detection in foggy environments. In the traditional saliency detection method, the saliency map is acquired by fusing the frequency and spatial saliency maps via DSWT. Then, a fully convolutional encoder–decoder model is used to improve the contours of the salient objects. Experimental results on the foggy image dataset demonstrate that the proposed saliency detection model performs clearly better than the 16 other well-known models.

**Author Contributions:** X.Z. analyzed the data and wrote the paper; X.X. guided the algorithm design and provided the funding support; X.Z. and N.M. designed the algorithm and conducted the experiments with technical assistance.

**Funding:** This research was funded by the Natural Science Foundation of China, grant number 61602349.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Detecting Toe-Off Events Utilizing a Vision-Based Method**

#### **Yunqi Tang 1, Zhuorong Li 1, Huawei Tian 2, Jianwei Ding 3,\* and Bingxian Lin 4,5,\***


Received: 15 February 2019; Accepted: 24 March 2019; Published: 27 March 2019

**Abstract:** Detecting gait events accurately from video data is a challenging problem. Most current detection methods for gait events are based on wearable sensors, which require a high degree of cooperation from users and are restricted by power consumption. This study presents a novel algorithm for accurately detecting toe-off events using a single 2D vision camera without the cooperation of participants. First, a set of novel features, namely consecutive silhouettes difference maps (CSD-maps), is proposed to represent the gait pattern. A CSD-map encodes several consecutive pedestrian silhouettes extracted from video frames into a map. Different numbers of consecutive pedestrian silhouettes result in different types of CSD-maps, which can provide significant features for toe-off event detection. A convolutional neural network is then employed to reduce feature dimensions and classify toe-off events. Experiments on a public database demonstrate that the proposed method achieves good detection accuracy.

**Keywords:** toe-off detection; gait event; silhouettes difference; convolutional neural network

#### **1. Introduction**

Gait is the periodic motion pattern of human walking or running. Different people have different gait patterns, because a gait pattern is uniquely determined by personal factors such as habits, injury, and disease. Based on this characteristic, researchers in pattern recognition use gait patterns to recognize the identity of walkers, which is known as gait recognition. Researchers in medicine also use gait patterns for disease diagnosis, which is known as gait analysis. Gait event detection is the basic problem underlying both gait recognition and gait analysis. Automatic detection of gait events is desirable for artificial intelligence applications, such as gait recognition and medical abnormal gait analysis [1].

A gait cycle is the minimum periodic movement of human walking. Usually, a gait cycle is defined as the period from one heel strike on the ground to the next strike of the same heel. According to the swing characteristics of the legs, a gait cycle can be divided into two phases: the stance phase and the swing phase. There are also six important gait events within each gait cycle (shown in Figure 1): right heel strike, left toe-off, mid stance, left heel strike, right toe-off, and mid swing. Accurate detection of the six gait events would raise the accuracy of gait recognition and analysis. In this paper, we focus on automatic detection of toe-off events using vision methods.

**Figure 1.** Graphic demonstration of the gait events within a gait cycle.

Currently, gait event detection methods can be mainly classified into two types: wearable sensors-based and vision-based methods [2]. The wearable sensors-based methods can accurately detect gait events by collecting motion data from the joints and segments of the human lower limb with wearable devices. This type of method is widely used in medicine for evaluating abnormal gait because of its high accuracy. However, wearable sensors-based methods rely on a high degree of cooperation from participants, who have to first wear particular devices and then walk around a given area.

Conversely, vision-based methods detect gait events directly from video data captured by a single camera or several cameras without the aid of any other special sensors. Various cameras, including structured light cameras [3], stereo cameras [4], and 2D vision cameras [5], have been applied within these methods. Compared with wearable sensors, cameras are cheaper and easier to use. Detecting gait events from 2D video data is a challenging problem due to variations in illumination, perspective, and clothing. Previously, researchers attached markers to the joints of the human limb as participants walked on a clearly marked walkway. This setup requires cooperation from participants.

In this paper, a new method of toe-off event detection based on a single 2D vision camera system is proposed. Consecutive pedestrian silhouettes extracted from video frames are combined to generate consecutive silhouettes difference maps (CSD-maps). Different numbers of consecutive silhouettes result in different CSD-maps, namely *n*-CSD-maps, where *n* represents the number of consecutive silhouettes. A convolutional neural network is finally employed to learn the toe-off event detection features from CSD-maps. The main contribution of this paper is the design of a set of novel features, namely consecutive silhouettes difference maps, for toe-off event detection. This method can be used to accurately detect gait events from video data captured by a single 2D vision camera under different viewing angles. If gait events can be accurately detected from 2D video data without the cooperation of participants, gait recognition and gait analysis would benefit greatly.

The remainder of this study is organized as follows. In Section 2, the advancements of gait events detection methods are reviewed. In Section 3, the proposed method is discussed in detail. Section 4 reports the experimental results on publicly available databases. Finally, Section 5 concludes this study.

#### **2. Related Work**

In this section, we review the recent progress of gait event detection, which can be coarsely classified into two categories: wearable sensors-based methods and vision-based methods.

#### *2.1. Wearable Sensors-Based Methods*

Wearable sensors-based methods employ various wearable sensors placed on joints or segments of human limbs (such as feet, knees, thighs or waist) to collect their motion data. Accelerometers and gyroscopes are desirable sensors for gait event detection, which have drawn much attention from researchers. Rueterbories et al. [6] placed accelerometers on the foot to detect gait events. Aung et al. [7] placed tri-axial accelerometers on the foot, ankle, shank or waist to detect heel strike and toe off events. Formento et al. [8] placed a gyroscope on the shank to determine initial contact and foot-off events. Mannini et al. [9] used a uniaxial gyroscope to measure the angular velocity of foot instep in a sagittal plane. Anoop et al. [10] utilized force myography signals from thighs to determine the heel strike (HS) and toe-off (TO) events. Jiang et al. [11] proposed a gait phase detection method based on force myography technique.

The inertial measurement unit (IMU), which is composed of a gyroscope and an accelerometer, is also a powerful sensor for capturing human limb motion data. Bejarano et al. [12] employed two inertial and magnetic sensors placed on the shanks to detect gait events. Olsen et al. [13] accurately and precisely detected gait events using features from trunk- and distal limb-mounted IMUs. Later, Trojaniello et al. [14] mounted a single IMU at waist level to detect gait events. Ledoux [15] presented a method for walking gait event detection using a single IMU mounted on the shank.

These sensors can accurately capture motion signals at the points where they are placed. Thus, these methods can accurately detect gait events and have been widely used for gait analysis in medicine. The disadvantages of this type of method mainly lie in limited power supply, high cost and the need for user cooperation.

A smartphone typically contains a 3-dimensional accelerometer, a 3-dimensional gyroscope, and a digital compass. Thus, smartphones are new and convenient sensors for gait analysis. Pepa et al. [16] utilized smartphones to detect gait events (such as heel strike) by securing them to an individual's lower back or sternum. Manor et al. [17] proposed a method to detect heel strike and toe-off events by placing a smartphone in the user's pants pocket. Ellis et al. [18] presented a smartphone-based mobile application to quantify gait variability for diagnosing Parkinson's disease. Smartphones are also powerful sensors for gait recognition. Fernandez-Lopez et al. [19] compared the performance of four state-of-the-art algorithms on a smartphone before 2016. Muaaz et al. [20] evaluated the security strength of a smartphone-based gait recognition system against zero-effort and live minimal-effort impersonation attacks under realistic scenarios. Gadaleta et al. [21] proposed a user authentication framework based on smartphone-acquired motion signals. The goal of that work is to recognize a target user from their way of walking, using the accelerometer and gyroscope (inertial) signals provided by a commercial smartphone worn in the front pocket of the user's trousers.

#### *2.2. Vision-Based Methods*

Vision-based methods can also be divided into two sub-categories: marker-based and no marker-based methods.

Marker-based methods calculate human limb motion parameters by tracking markers attached to the joints of the human limb. Ugbolue et al. [22] employed an augmented-video-based-portable-system (AVPS) for gait analysis. In that study, bull's eye markers and retroreflective markers are attached to the human lower limb. In [23], Yang et al. proposed an alternative, inexpensive, and portable gait kinematics analysis system using a single 2D vision camera. Markers are also attached to the hip, knee, and ankle joints for motion data capture. Three years later, the authors enhanced the initial single-camera system by designing a novel autonomous gait event detection method [5]. These methods achieve good gait event detection accuracy. However, a calibration step is needed in which the participant has to walk on a clearly marked walkway, so user cooperation is required.

No marker-based methods can achieve gait event detection without user cooperation. Very few studies have addressed gait event detection with this type of method. The most directly related work is Auvinet's work [3], in which a depth camera (Kinect) is employed for gait analysis on a treadmill in routine outpatient clinics. In [3], heel-strike events are detected by searching for extreme values of the distance between the knee joints along the longitudinal walking axis. Although it achieves accurate detection results, the Kinect used in [3] is still a special camera compared with a widely used web camera. In this study, we attempt to detect toe-off events using a web camera. As far as we know, this paper is the first effort to detect gait events from video data without the cooperation of participants.

Some gait cycle detection algorithms have been presented as part of gait recognition methods. These methods can detect the whole gait cycle or gait phase from video data without the help of markers. In [24], a gait periodicity detection method based on dual-ellipse fitting (DEF) is presented. The period is defined as the interval between the first extreme point and the third extreme point of the DEF signals. Kale et al. [25] employed the norm of the width vector to show a periodic variation. Sarkar et al. [26] estimated the gait cycle by counting the number of foreground pixels in the silhouette in each frame over time. Mori et al. [27] detected the gait period by maximizing the normalized autocorrelation of the gait silhouette sequence along the temporal axis. The methods mentioned above can achieve gait cycle detection, but cannot provide accurate gait event detection results.

#### **3. Toe-Off Events Detection Based on CSD-Maps**

In this section, we present the technical details of the toe-off event detection method. The framework of the proposed method is graphically presented in Figure 2. Several consecutive silhouettes of a pedestrian are first combined to generate a consecutive silhouettes difference map. Convolutional neural networks are then employed to learn the features for toe-off event classification.

**Figure 2.** The framework of the proposed method.

#### *3.1. Consecutive Silhouettes Difference Maps*

Video data contain rich temporal and spatial information, and mining and fusing this information is currently of interest in computer vision. Inspired by the principle of the exclusive OR operation, we employ a frame difference method to encode the temporal and spatial information contained in several consecutive frames into a single map. The difference map generated from *n* consecutive silhouettes is called an *n*-CSD-map. We first take a 2-CSD-map as an example to explain how consecutive silhouette frames are encoded into a map.

#### 3.1.1. 2-CSD-Maps

The main idea of 2-CSD-maps is graphically presented in Figure 3. The 2-CSD-map of the *i*th frame is generated from two consecutive silhouette frames. Let Γ<sub>*i*</sub><sup>2</sup> denote the 2-CSD-map of the *i*th frame, and let *I*<sub>*i*−1</sub> and *I*<sub>*i*</sub> denote the binary silhouette images of the (*i* − 1)th and *i*th frames. For any pixel *P*<sup>2</sup><sub>*j*,*k*</sub> in Γ<sub>*i*</sub><sup>2</sup>, its pixel value can be formulated as follows:

$$\Gamma\_i^2(j,k) = \begin{cases} 1 & \text{if } (P\_{j,k}^2 \notin \Omega\_{i-1}) \cap (P\_{j,k}^2 \in \Omega\_i) \\ 2 & \text{if } (P\_{j,k}^2 \in \Omega\_{i-1}) \cap (P\_{j,k}^2 \notin \Omega\_i) \\ 3 & \text{if } (P\_{j,k}^2 \in \Omega\_{i-1}) \cap (P\_{j,k}^2 \in \Omega\_i) \end{cases} \tag{1}$$

where Ω<sub>*i*−1</sub> represents the pixel set of the silhouette area in *I*<sub>*i*−1</sub>, and Ω<sub>*i*</sub> represents the pixel set of the silhouette area in *I*<sub>*i*</sub>. For better visual effect, the pixel values in Figure 3c are normalized to [0,1].

**Figure 3.** The basic idea of the 2-CSD-map. The pixel values in (**c**) are normalized to [0,1] for good visual effect.

In practice, a pedestrian silhouette is represented as a binary image. Thus, the 2-CSD-map of two consecutive silhouettes can be extracted quickly using the following three steps.

First, copy the gray values of the pixels from *I*<sub>*i*−1</sub> to Γ<sub>*i*</sub><sup>2</sup>. A temporary matrix *I* is then computed as:

$$I = I\_i - I\_{i-1} \tag{2}$$

Secondly, modify the pixel values of Γ<sub>*i*</sub><sup>2</sup> according to the values of matrix *I*; pixels with *I*(*j*, *k*) = 0 are left unchanged:

$$\Gamma\_i^2(j,k) = \begin{cases} 1 & \text{if } I(j,k) > 0 \\ 2 & \text{if } I(j,k) < 0 \end{cases} \tag{3}$$

Finally, the pixel values of Γ<sub>*i*</sub><sup>2</sup> are modified as follows:

$$
\Gamma\_i^2(j,k) = \begin{cases}
3 & \text{if } \Gamma\_i^2(j,k) = 255 \\
\Gamma\_i^2(j,k) & \text{otherwise}
\end{cases}
\tag{4}
$$

Some samples of 2-CSD-maps are graphically presented in Figure 4. We can see that 2-CSD-maps are distinctive features for toe-off events detection compared with original silhouette images.

**Figure 4.** Samples of 2-CSD-maps compared with original silhouettes. The images presented in the first row are original silhouettes of two different persons, and the corresponding 2-CSD-maps are presented in the second row. The images with red edging are the toe-off frames. The pixel values in 2-CSD-maps are normalized to [0,1] for good visual effect.

#### 3.1.2. *n*-CSD-Maps

Suppose that there are *n* consecutive silhouette images *I*<sub>1</sub>, *I*<sub>2</sub>, ..., *I*<sub>*n*</sub>. The *n*-CSD-map Γ<sub>*i*</sub><sup>*n*</sup> can be formulated as follows:

$$
\Gamma^{n}\_{i}(j,k) = \begin{cases}
1 & \text{if } (P^{n}\_{j,k} \in \Omega\_{1}) \cap (P^{n}\_{j,k} \notin \Omega\_{2}) \cap (P^{n}\_{j,k} \notin \Omega\_{3}) \cap \dots \cap (P^{n}\_{j,k} \notin \Omega\_{n}) \\
2 & \text{if } (P^{n}\_{j,k} \notin \Omega\_{1}) \cap (P^{n}\_{j,k} \in \Omega\_{2}) \cap (P^{n}\_{j,k} \notin \Omega\_{3}) \cap \dots \cap (P^{n}\_{j,k} \notin \Omega\_{n}) \\
3 & \text{if } (P^{n}\_{j,k} \in \Omega\_{1}) \cap (P^{n}\_{j,k} \in \Omega\_{2}) \cap (P^{n}\_{j,k} \notin \Omega\_{3}) \cap \dots \cap (P^{n}\_{j,k} \notin \Omega\_{n}) \\
\cdots & \\
2^{n} - 1 & \text{if } (P^{n}\_{j,k} \in \Omega\_{1}) \cap (P^{n}\_{j,k} \in \Omega\_{2}) \cap (P^{n}\_{j,k} \in \Omega\_{3}) \cap \dots \cap (P^{n}\_{j,k} \in \Omega\_{n})
\end{cases} \tag{5}
$$

where Γ<sub>*i*</sub><sup>*n*</sup>(*j*, *k*) is the value of the pixel *P*<sup>*n*</sup><sub>*j*,*k*</sub> in the generated *n*-CSD-map, and Ω<sub>1</sub>, Ω<sub>2</sub>, ..., Ω<sub>*n*</sub> represent the pixel sets of the silhouette areas in frames *I*<sub>1</sub>, *I*<sub>2</sub>, ..., *I*<sub>*n*</sub>, respectively.

Given *n* consecutive silhouette images, the *n*-CSD-map extraction algorithm can be described as Algorithm 1. With this algorithm, the CSD-map generated from the given consecutive silhouette images is itself an image of the same size as the silhouette images, as shown in Figure 3c. Thus, a further normalization step is necessary. In this paper, CSD-map images are initially normalized to a fixed size (such as 90 × 140) using Algorithm 2.

Figure 5 shows some consecutive normalized CSD-maps. We can see that the CSD-maps in the toe-off state are clearly different from the other CSD-maps.

#### **Algorithm 1** Algorithm for generating *n*-CSD-maps

**Require:**

Consecutive silhouette images: *I*[*w*, *h*, *n*]. Parameters *w* and *h* represent the width and height of the silhouette images, respectively. Parameter *n* represents the number of consecutive silhouette images.

**Ensure:**

The CSD-map: Γ

```
for i = 1 to w do
    for j = 1 to h do
        t = I(i, j, :);
        value = 0;
        for k = 1 to n do
            value = value + 2^(k−1) ∗ t(k);
        end for
        Γ(i, j) = value;
    end for
end for
return Γ;
```
**Figure 5.** Samples of normalized CSD-maps. From the first row to the fifth row, the normalized 2-CSD-maps, 3-CSD-maps, 4-CSD-maps, 5-CSD-maps and 6-CSD-maps are respectively presented. The images with red edging are the toe-off frames. The pixel values in all CSD-maps are normalized to [0,1] for good visual effect.
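
For readers who prefer executable code, the binary weighting used in Algorithm 1 can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: silhouettes are assumed to be nested lists of 0/1 values ordered from the first to the last frame, images are indexed by row and column instead of the (width, height) order of the pseudocode, and the function name `n_csd_map` is hypothetical.

```python
def n_csd_map(silhouettes):
    """Encode n binary silhouettes into one map.

    Frame k (1-based) contributes 2**(k-1) to every pixel it covers,
    so each pixel value lies between 0 and 2**n - 1.
    """
    height = len(silhouettes[0])
    width = len(silhouettes[0][0])
    gamma = [[0] * width for _ in range(height)]
    for k, frame in enumerate(silhouettes):   # k runs from 0 to n-1
        weight = 2 ** k                        # equals 2^(k-1) for 1-based k
        for r in range(height):
            for c in range(width):
                gamma[r][c] += weight * frame[r][c]
    return gamma


# Three one-row silhouettes of a blob moving one pixel to the right per frame.
frames = [
    [[1, 1, 0, 0]],
    [[0, 1, 1, 0]],
    [[0, 0, 1, 1]],
]
csd = n_csd_map(frames)
print(csd)  # [[1, 3, 6, 4]]

# Optional rescaling to [0, 1] for display, as done for the figures.
max_value = 2 ** len(frames) - 1
display = [[v / max_value for v in row] for row in csd]
```

Each pixel value is a bit pattern recording which of the *n* frames cover that pixel, which is why a pixel covered by all frames takes the value 2<sup>*n*</sup> − 1.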

#### **Algorithm 2** Algorithm for normalizing a CSD-map

**Require:**

The original CSD-map image: *OM*. The width of the normalized CSD-map: *w*. The height of the normalized CSD-map: *h*.

**Ensure:**

The normalized CSD-map: *NM*

1. [*x*, *y*] = *find*(*OM* > 0);
2. *segm* = *OM*(*min*(*x*) : *max*(*x*), *min*(*y*) : *max*(*y*));
3. *NM* = *imresize*(*segm*, [*h*, *w*]);
4. return *NM*;
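
The crop-and-resize idea of Algorithm 2 can be sketched in plain Python as below. This is a sketch under assumptions, not the authors' implementation: the interpolation method of the resizing step is not stated in the text, so nearest-neighbour sampling is assumed here because it keeps the discrete CSD labels intact; the function name `normalize_csd_map` is hypothetical.

```python
def normalize_csd_map(om, width, height):
    """Crop a CSD-map to its non-zero bounding box and resize it.

    Resizing uses nearest-neighbour sampling (an assumption; the source
    only states that an image resize is applied).
    """
    rows = [r for r, row in enumerate(om) if any(v > 0 for v in row)]
    cols = [c for c in range(len(om[0])) if any(row[c] > 0 for row in om)]
    if not rows:
        raise ValueError("CSD-map contains no foreground pixels")
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    # Bounding-box crop, the counterpart of OM(min(x):max(x), min(y):max(y)).
    segm = [row[left:right + 1] for row in om[top:bottom + 1]]
    src_h, src_w = len(segm), len(segm[0])
    # Map every target pixel back to its nearest source pixel.
    return [[segm[r * src_h // height][c * src_w // width]
             for c in range(width)]
            for r in range(height)]


original = [[0, 0, 0, 0],
            [0, 1, 2, 0],
            [0, 3, 0, 0],
            [0, 0, 0, 0]]
normalized = normalize_csd_map(original, width=4, height=4)
print(normalized)
# [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 0, 0], [3, 3, 0, 0]]
```

Cropping before resizing removes the dependence on where the pedestrian stands in the frame, so that maps from different frames and subjects share a common layout.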

#### *3.2. Convolutional Neural Network*

Convolutional neural networks (CNNs) have a feed-forward network architecture with multiple interconnected layers, which may be of any of the following types: convolution, normalization, pooling and fully connected layers. CNNs have recently achieved many successes in visual recognition tasks, including image classification [28], object detection [29], and scene parsing [30]. CNNs are chosen as the detector for this study because they outperform traditional methods in many image classification challenges, such as ImageNet [28], and in many other image-based recognition problems, e.g., face recognition and digit recognition [31]. Compared with traditional methods, which rely on feature engineering, CNNs are able to learn feature representations through the back propagation algorithm without much manual intervention and also achieve much higher accuracy.

The aim of this study is not to propose another CNN but to use a classic CNN to address the problem of toe-off event detection. In this paper, we employ the CNN architecture presented in Figure 6, which is modified from DeepID [32]. The network includes three convolutional layers and three fully connected layers. The three convolutional layers have 64, 128 and 256 kernels, and their sizes are 5 × 5, 3 × 3 × 64 and 3 × 3 × 128, respectively. The first fully connected layer has 1024 neurons and the second fully connected layer has 512 neurons. The last fully connected layer has two neurons, one for the toe-off frame output and the other for the non-toe-off frame output. Max-pooling with a size of 2 and a stride of 2 follows the three convolutional layers.

**Figure 6.** The architecture of the CNN employed in this study.
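
To get a feeling for the layer sizes quoted above, the feature-map shapes can be traced with a few lines of arithmetic. The sketch below rests on several assumptions that the text does not confirm: a single-channel input of height 140 and width 90, convolutions with stride 1 and no padding, and one max-pooling step after each convolutional layer. The helper names are hypothetical, and the resulting numbers are only valid under these assumptions.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window=2, stride=2):
    """Output length of max-pooling along one spatial axis."""
    return (size - window) // stride + 1


def trace_shapes(height, width, channels=1):
    """Return (channels, height, width) after each conv+pool stage."""
    conv_layers = [(64, 5), (128, 3), (256, 3)]   # (kernels, kernel size)
    shapes = [(channels, height, width)]
    for kernels, k in conv_layers:
        height, width = conv_out(height, k), conv_out(width, k)
        height, width = pool_out(height), pool_out(width)  # assumed per layer
        channels = kernels
        shapes.append((channels, height, width))
    flattened = channels * height * width
    dense_sizes = [flattened, 1024, 512, 2]        # fully connected widths
    return shapes, dense_sizes


shapes, dense_sizes = trace_shapes(height=140, width=90)
for shape in shapes:
    print(shape)
print(dense_sizes)
```

Under these assumptions the last convolutional stage yields 256 maps of 15 × 9 pixels, so the first fully connected layer would receive 34,560 inputs; a different padding or pooling arrangement would change these figures.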

#### **4. Experiments and Results Analysis**

#### *4.1. Database*

Experiments are conducted on the CASIA gait database (Dataset B) [33] to evaluate the accuracy of the proposed method. The data in this database were collected from 124 subjects (93 males and 31 females) in an indoor environment under 11 different viewing angles. Each subject was captured simultaneously by 11 USB cameras (with a resolution of 320 × 240 and a frame rate of 25 fps) placed around the left-hand side of the subject as he/she walked, and the angle between two neighbouring view directions is 18°. Each subject was first asked to walk naturally along a straight line 6 times, so 11 × 6 = 66 normal walking video sequences were captured for each subject. After the normal walks, the subjects were asked to put on their coats or carry a bag, and then to walk twice along the straight line. For each viewing angle, there are 10 videos in total for every subject under three different conditions, namely the normal condition, the coat condition and the bag condition. The CASIA Gait Database is provided free of charge at http://www.cbsr.ia.ac.cn.

In this study, we considered the data captured under the viewing angles of 36°, 54°, 72°, 90°, 108°, 126° and 144° (approximately 500,000 frames in total) for training and testing. The data captured under the frontal viewing angles of 0°, 18°, 162° and 180° are not used in the experiments, primarily because there is very little difference between two consecutive silhouettes. The CSD-maps generated from video data captured in the sagittal plane view do not contain much information for gait event detection. This means that the proposed method cannot deal with video data captured in the sagittal plane view. Even so, the proposed method can deal with video data captured from most viewing angles, which makes it useful in practice.

#### *4.2. Toe-Off Frame Definition and Data Preparation*

The ground truth of all the silhouette frames has to be manually labeled for model training and testing. Thus, the toe-off frame must first be clearly defined.

Human gait is a continuous and periodic movement. In the medical field, the toe-off event is defined as the moment that the stance limb leaves the ground, as shown in Figure 1. Video data, however, are a sampled record of human gait at a certain frame rate *θ*. Usually, the frame rate *θ* is 30 fps, and the gait cycle of a person lasts about 1 s on average. This means that one gait cycle of a person is recorded as about 30 consecutive frames with an interval of 33 ms. The problem is that the exact moment at which the stance limb leaves the ground may not be captured in any of these 30 sampled frames. In this paper, the first frame after the stance limb leaves the ground is defined as the toe-off frame. For example, as shown in Figure 7, if the moment that the stance limb leaves the ground falls within the period *t*<sub>*n*</sub> < *t* < *t*<sub>*n*+1</sub>, then frame (*n* + 1) is defined as the toe-off frame.

**Figure 7.** Toe-off event definition of video frames.

According to this definition, the labeled ground truth contains some error. Let *θ* be the frame rate of the video data. The time interval between two consecutive frames is 1/*θ*, that is, *t*<sub>*n*+1</sub> − *t*<sub>*n*</sub> = 1/*θ*. If the toe-off event happens during the period (*t*<sub>*n*</sub>, *t*<sub>*n*+1</sub>) but nearer to *t*<sub>*n*</sub>, as shown in Figure 7a, then at frame *n* + 1 the foot has already swung in the air for about 1/*θ* seconds. However, if the toe-off event happens during the same period but nearer to *t*<sub>*n*+1</sub>, as shown in Figure 7b, then at frame *n* + 1 the foot has only just left the ground. Frame *n* + 1 is regarded as the toe-off frame in both Figure 7a and Figure 7b, so the toe-off frames in the two cases may look different from each other. This labeling error is at most 1/*θ* seconds and does not affect the validity of the proposed method.
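
The frame-labeling rule and the resulting timing error can be made concrete with a small sketch. It assumes that frame *n* is captured at time *t*<sub>*n*</sub> = *n*/*θ* with frame 0 at *t* = 0, which the text does not state explicitly; the function names are hypothetical.

```python
import math


def toe_off_frame(t_event, fps):
    """Index of the first frame captured after the toe-off time (seconds).

    Assumes frame n is captured at t_n = n / fps.
    """
    return math.floor(t_event * fps) + 1


def labelling_delay(t_event, fps):
    """Time between the true toe-off moment and the labeled frame."""
    return toe_off_frame(t_event, fps) / fps - t_event


fps = 25  # frame rate of the CASIA Dataset B cameras

# Toe-off shortly after frame 10 (t_10 = 0.40 s): the delay is close to 1/fps.
early_frame = toe_off_frame(0.41, fps)
early_delay = labelling_delay(0.41, fps)

# Toe-off shortly before frame 11 (t_11 = 0.44 s): the delay is small.
late_frame = toe_off_frame(0.43, fps)
late_delay = labelling_delay(0.43, fps)

print(early_frame, late_frame)  # 11 11
print(early_delay > late_delay)  # True
```

Both events are labeled with the same frame, although the foot has been off the ground for roughly three times as long in the first case; in either case the delay stays within one frame interval.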

#### *4.3. Experimental Configuration*

The experiments are conducted using Caffe [34], a deep learning framework created by Yangqing Jia during his PhD at UC Berkeley. The experiments are conducted as follows.


**Figure 8.** The relationship between detection accuracy of the proposed method and the size of normalized *n*-CSD-map.


**Table 1.** Detection accuracy of the proposed method.

**Figure 9.** The relationship between detection accuracy of the proposed method and *n*-CSD-map. (**a**) The detection accuracy as a function of *n*-CSD-map. (**b**) The bars of the detection accuracy VS. *n*-CSD-map.

#### *4.4. Experimental Results and Discussion*

In this paper, a new evaluation indicator, namely the *n*-frame-error cumulative detection accuracy, is designed to evaluate the performance of the proposed method in addition to the detection accuracy and the ROC curve. The *n*-frame-error cumulative detection accuracy is similar to cumulative match characteristic (CMC) curves [35]. Let *d* denote the difference between the sequence number of the predicted toe-off frame and that of the ground truth, as shown in Figure 10. The *n*-frame-error cumulative detection accuracy is the detection accuracy under the condition *d* ≤ *n*.

**Figure 10.** Graphical demonstration of the n-frame-error cumulative detection accuracy. The frame difference between the sequence number of predicted toe-off frame and the ground truth is noted as *d*. The image with red edging is the predicted toe-off frame. The image with yellow edging is the ground truth.
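
The *n*-frame-error cumulative detection accuracy defined above can be computed as in the following sketch. Two points are assumptions on our part: each predicted toe-off frame is already paired with its ground-truth frame, and *d* is taken as an absolute difference. The function name and the frame numbers are made up for illustration.

```python
def cumulative_detection_accuracy(predicted, ground_truth, max_error):
    """Fraction of events with |predicted - truth| <= n, for n = 0..max_error."""
    if len(predicted) != len(ground_truth):
        raise ValueError("predictions and ground truth must be paired")
    diffs = [abs(p - g) for p, g in zip(predicted, ground_truth)]
    return [sum(d <= n for d in diffs) / len(diffs)
            for n in range(max_error + 1)]


# Hypothetical frame indices of five toe-off events.
predicted_frames = [12, 40, 66, 95, 121]
true_frames = [12, 41, 66, 93, 121]

curve = cumulative_detection_accuracy(predicted_frames, true_frames, max_error=2)
print(curve)  # [0.6, 0.8, 1.0]
```

The resulting list is non-decreasing in *n*, and its first entry equals the exact-frame detection accuracy over the paired events.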

Table 1 shows the detection accuracy of the proposed method, which is good under all tested viewing angles. Under the viewing angle of 36°, the accuracy is around 93%, with a peak value of 93.63%. Under 54°, the accuracy is around 94%, with a peak value of 95.4%. Under 72°, the accuracy is around 95%, with a peak value of 95.44%. Under 90°, the accuracy is around 96%, with a peak value of 96.78%. Under 108°, the accuracy is around 95%, with a peak value of 95.78%. Under 126°, the accuracy is also around 95%, with a peak value of 95.65%. Under 144°, the accuracy is around 93%, with a peak value of 93.44%. In every case, the peak value is obtained with 6-CSD-maps.

The relationship between the detection accuracy of the proposed method and the *n*-CSD-map is graphically presented in Figure 9. Figure 9a shows the detection accuracy as a function of *n*, and the corresponding bars are presented in Figure 9b. Generally, the detection accuracy improves slightly as *n* increases. The reason is that the larger the parameter *n* is, the more consecutive silhouettes are encoded into a CSD-map, and the more information the CSD-map contains. The detection accuracy improves noticeably when *n* changes from 2 to 3. For example, under the viewing angle of 108°, the accuracy of the proposed method increases from 94.68% to 95.54% when *n* increases from 2 to 3. However, the accuracy increases only slightly when *n* goes to 4, 5 and 6. This demonstrates that the 3-CSD-map is a good choice for toe-off detection, as it achieves good accuracy with little additional computation cost. Figure 11 shows the ROC curves of the proposed method under different viewing angles. The ROC curves under the viewing angles of 36°, 54°, 72°, 90°, 108°, 126° and 144° are presented in Figure 11a–g, respectively. As shown in Figure 11, under all viewing angles, the proposed method achieves higher detection performance with a larger *n*.

The ROC curves of the proposed method using the 3-CSD-map under different viewing angles are presented in Figure 11h. Generally, we can see that the proposed method obtains higher detection accuracy around coronal plane viewing angles than around sagittal plane viewing angles. In particular, the proposed method achieves an accuracy of 96.78% under the viewing angle of 90°, which is higher than under the other viewing angles. This demonstrates that CSD-maps generated from video data captured at sagittal plane viewing angles contain less useful information for gait event detection than those captured at coronal plane viewing angles. The reason is that there is less difference between two consecutive silhouettes in video frames captured at sagittal plane viewing angles than at coronal plane viewing angles.

The plots presented in Figure 12a show the *n*-frame-error cumulative detection accuracy of the proposed method for different viewing angles. The 1-frame-error cumulative detection accuracy reaches 99.3%, 99.86%, 99.9%, 99.9%, 99.9%, 99.8%, and 99.4% for the viewing angles of 36°, 54°, 72°, 90°, 108°, 126°, and 144°, respectively. For the 2-frame-error, the cumulative detection accuracy reaches 100% for the viewing angles of 54°, 72°, 90°, 108°, and 126°. This demonstrates that the maximum time error of the proposed method when detecting toe-off events at coronal plane viewing angles is less than 2/*θ*, where *θ* is the frame rate of the video data. In practice, the temporal accuracy of this method can be improved by increasing the frame rate of the video.

Figure 12b shows the detection accuracy of the proposed method as a function of viewing angle compared with [24,25,36]. Because [24,25] do not provide toe-off event detection results directly, we implemented both algorithms for toe-off event detection according to the main ideas of [24,25]. Ref. [36] is our previous work based on principal component analysis and a support vector machine. In this experiment, all frames are used for training and testing in 5-fold cross validation. We can see that our CNN-based method significantly outperforms Ben's method [24], Kale's method [25] and our previous work [36] at the viewing angles of 36°, 54°, 72°, 90°, 108°, 126°, and 144°.

**Figure 11.** The ROC curves of the proposed method. (**a**) The ROC curves under the viewing angle of 36°. (**b**) The ROC curves under the viewing angle of 54°. (**c**) The ROC curves under the viewing angle of 72°. (**d**) The ROC curves under the viewing angle of 90°. (**e**) The ROC curves under the viewing angle of 108°. (**f**) The ROC curves under the viewing angle of 126°. (**g**) The ROC curves under the viewing angle of 144°. (**h**) The ROC curves of the proposed method with 3-CSD-map under the different viewing angles of 36°, 54°, 72°, 90°, 108°, 126° and 144°.

**Figure 12.** The *n*-frame-error cumulative detection accuracy of the proposed method. (**a**) The detection accuracy of the proposed method against different frame-errors. (**b**) The detection accuracy of the proposed method compared with [24,25,36].

In Figure 13, we use a confusion matrix to evaluate the cross viewing angle detection accuracy of this method using 3-CSD-maps. As can be seen in the figure, this method achieves its best accuracy on the counter-diagonal and around 90% in the other areas, which means that it achieves good accuracy for cross-view toe-off detection. Figure 14 presents the ROC curves of this method under all viewing angles compared with [24,25,36]. We can see that the proposed method significantly outperforms the compared methods.

**Figure 13.** The confusion matrix of cross viewing angle detection accuracy of this method using 3-CSD-maps.

**Figure 14.** The ROC curves of this method compared with [24,25,36] under all viewing angles.

#### **5. Conclusions and Future Work**

This paper presents a promising vision-based method to detect toe-off events. The main contribution of this paper is the design of consecutive silhouettes difference maps for toe-off event detection. A convolutional neural network is employed for feature dimension reduction and toe-off event classification. Experiments on a public database have demonstrated the good performance of our method in terms of detection accuracy. The main advantages of the proposed method can be described as follows.


Although a promising feature representation method is proposed in this paper for toe-off event detection, more efforts are needed to improve the method of gait events detection from video data in our future work.


**Author Contributions:** Data curation, Z.L.; Formal analysis, J.D.; Investigation, Y.T.; Methodology, Y.T. and B.L.; Software, Z.L.; Supervision, J.D. and B.L.; Validation, H.T.; Visualization, H.T.; Writing–original draft, Y.T.; Writing–review & editing, J.D.

**Funding:** This research was funded by National Key Research and Development Program of China (No.2017YFC0803506, 2017YFC0822003), the Fundamental Research Funds for the Central Universities of China (Grant No.2018JKF217), the National Natural Science Foundation of China (No. 61503387, 61772539).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Incremental Market Behavior Classification in Presence of Recurring Concepts**

#### **Andrés L. Suárez-Cetrulo 1,2, Alejandro Cervantes 1,\* and David Quintana <sup>1</sup>**


Received: 28 November 2018; Accepted: 20 December 2018; Published: 1 January 2019

**Abstract:** In recent years, the problem of concept drift has gained importance in the financial domain. The succession of manias, panics and crashes have stressed the non-stationary nature and the likelihood of drastic structural or concept changes in the markets. Traditional systems are unable or slow to adapt to these changes. Ensemble-based systems are widely known for their good results predicting both cyclic and non-stationary data such as stock prices. In this work, we propose RCARF (Recurring Concepts Adaptive Random Forests), an ensemble tree-based online classifier that handles recurring concepts explicitly. The algorithm extends the capabilities of a version of Random Forest for evolving data streams, adding on top a mechanism to store and handle a shared collection of inactive trees, called concept history, which holds memories of the way market operators reacted in similar circumstances. This works in conjunction with a decision strategy that reacts to drift by replacing active trees with the best available alternative: either a previously stored tree from the concept history or a newly trained background tree. Both mechanisms are designed to provide fast reaction times and are thus applicable to high-frequency data. The experimental validation of the algorithm is based on the prediction of price movement directions one second ahead in the SPDR (Standard & Poor's Depositary Receipts) S&P 500 Exchange-Traded Fund. RCARF is benchmarked against other popular methods from the incremental online machine learning literature and is able to achieve competitive results.

**Keywords:** ensemble methods; adaptive classifiers; recurrent concepts; concept drift; stock price direction prediction

#### **1. Introduction**

Financial market forecasting is a field characterized by data intensity, noise, non-stationarity, an unstructured nature, a high degree of uncertainty, and hidden relationships [1], as financial markets are complex, evolutionary, and non-linear dynamical systems [2]. Many approaches try to predict market data using traditional statistical methods. These, however, tend to assume that the underlying data have been created by a linear process and make predictions for future values accordingly [3]. Meanwhile, there is a relatively new line of work based on machine learning, whose success has surprised experts given the theory and evidence from the financial economics literature [4–6]. Many of these algorithms are able to capture nonlinear relationships in the input data with no prior knowledge [7]. For instance, Random Forest [8] has been one of the techniques obtaining better results when predicting stock price movements [9–12].

In recent years, the notion of concept drift [13] has gained attention in this domain [14]. The Asian financial crisis in 1997 and, more recently, the great crisis in 2007–2008 have stressed the non-stationary nature and the likelihood of drastic structural or concept changes in financial markets [14–19].

Incremental machine learning techniques deal actively or passively [20] with the non-stationary nature of the data and its concept changes [13]. However, the problem of recurring concepts [21–23], where previous model behaviors may become relevant again in the future, is still a subject of study. As part of the so-called stability–plasticity dilemma, most of the incremental approaches need to re-learn previous knowledge once forgotten, wasting time and resources, and losing accuracy while the model is out-of-date. Although some authors have started to consider recurring concepts [24–28], the number of contributions focused on the financial forecasting domain is still very limited [29,30]. This might be partially explained by the fact that, in this context, the presence of noise and the uncertainties related to the number of market states, their nature, and the transition dynamics have a severe impact on the feasibility of establishing a ground truth.

Our contribution is an algorithm that deals with gradual and abrupt changes in the market structure through the use of an adaptive ensemble model, able to remember recurring market behaviors to predict ups and downs. The algorithm proposed improves a previous algorithm, namely Adaptive Random Forest (ARF) [31], by being able to react more accurately in the case of abrupt changes in the market structure. This is accomplished through the use of a concept history [21,22,32–34], which stores previously learned concept representations. When a structural change is detected, it replaces drifting classifiers with either a new concept model or with a concept extracted from the history, using dynamic time-windows to make the decision. As this concept representation is already trained, our algorithm is able to react faster than its predecessor, which is unable to profit from previous models.

The remainder of the paper is organized as follows. In Section 2, we review related work and approaches. In Section 3, we propose the algorithm RCARF. In Section 4, we describe the experimental design, present our empirical results and discuss their implications. Finally, in Section 5 we conclude with a summary of our findings and future lines of research.

#### **2. Related Work**

The number of approaches proposed for financial applications is vast. In terms of market price forecasting and trend prediction, these can be approached by looking at fundamental and technical indicators. Even though there is controversy regarding the potential of the latter to produce profitable trading strategies [5,35], the fact is that they are widely used in short-term trading [36]. Kara et al. [37] proposed a set of 10 technical indicators identified by domain experts and previous studies [38–44]. This approach has been used in more recent works (e.g., [4]). Some of them, such as the work of Patel [12], discretize features based on a human approach to investing, deriving the technical indicators using assumptions from the stock market.

Stock markets are non-stationary by nature. Depending on the period, they can show clear trends, cycles, periods where the random component is more prevalent, etc. Furthermore, stock prices are affected by external factors such as the general economic environment and political scenarios that may result in cycles [12]. Under these circumstances, incremental and online machine learning techniques [28,45] that adapt to structural changes, usually referred to as concept drift [13], are gaining traction in the financial domain [14].

In parallel, ensemble techniques are known for their good performance at predicting both cyclic and non-stationary data such as stock prices [9,12,46]. Ensembles are able to cover many different situations by using sets of learners. If a specific type of pattern reappears after a certain time, some of the trained models should be able to deal with it. These techniques, which are commonly used for trend prediction in financial data, are also one of the current trends of research in incremental learning. Lately, several incremental ensembles have been proposed [47] to deal not only with stationary data and recurring drifts but also with non-stationary data in evolving data streams [20,22,34,48–51].

There are different types of concept drift detection mechanisms for handling gradual or abrupt changes, blips or recurring drifts [24,26–28,52] that can be used to deal with changes in the market behavioral structure [53]. As opposed to stationary data distributions, where the error rate of the learning algorithm will decrease when the number of examples increases, the presence of changes affects the learning model continuously [54]. This creates the need to retrain the models over time when they are no longer relevant for the current state of the market [15].

In the case of repeated cycles, handling of recurring concepts can help reduce the cost of retraining a model if a similar one has already been generated in the past. Fast recognition of a reappearing model may also improve the overall model accuracy as the trained model will provide good predictions immediately.

Gomes et al. [31] proposed an adaptive version of Random Forest that creates new trees when the accuracy of a participant in the ensemble decreases down to a certain threshold. These trees, considered background learners, are trained only with new incoming data and replace the model that raised a warning when this is flagged as drifting. Their Adaptive Random Forest algorithm (ARF) provides a mechanism to update decision trees in the ensemble and keep historical knowledge only when this is still relevant. However, once a tree is discarded, it is completely removed from memory. In presence of recurring concepts, ARF needs to train the trees from scratch.

Gonçalves et al. [23] proposed a recurring concept drift framework (RCD) that raises warnings when the error rate of a given classifier increases. Their approach creates a collection of classifiers and chooses one based on the data distribution. This data distribution is stored in a buffer of a limited length for each of the classifiers. When there is a warning, the newest data distribution is compared to the data distributions of other stored classifiers, to verify whether the new context has already occurred in the past.

Elwell et al. [20] dealt with recurrent concepts in a similar way. Their approach, Learn++.NSE, keeps one concept per batch, not limiting the number of classifiers. The idea, along the lines of Hosseini et al. [48], is to keep all the accumulated knowledge in a pool of classifiers to be used eventually, if needed. However, this approach suffers from scalability bottlenecks in continuous data streams as it does not prune the list of active classifiers. Other approaches have proposed explicit handling of recurring concepts by checking for similarity [21,22,32–34]. These store old models in a concept history for comparison when the current model is flagged as changing.

An alternative approach is the use of Evolving Intelligent Systems (EIS) [55]. These have achieved great results classifying non-stationary time series [19,29,30]. The latest EIS works apply meta-cognitive scaffolding theory for tuning the learned model incrementally in what-to-learn, when-to-learn, and how-to-learn [56]. These have also introduced the ability to deal with recurrent concepts explicitly, beating other methods at predicting the S&P500 [29,30]. In this space, Pratama et al. recently proposed pENsemble [57], an evolving ensemble algorithm inspired by Dynamic Weighted Majority (DWM) [50]. pENsemble features explicit drift detection, and it is able to deal with non-stationary environments and handle recurring drifts because of its base classifiers. These have a method that functions as a rule recall scenario, triggering previously pruned rules portraying old concepts to be valid again. However, pENsemble differs from our approach and the rest of the architectures of this work in the fact that it is built upon an evolving classifier. There is still an important gap between EIS and the rest of the literature for data stream classification. Features such as meta-cognition and explicit handling of recurrent concepts are still at an early level of adoption outside EIS. Furthermore, extensive application of EIS to challenging domains such as stock market prediction is only starting.

Our proposal, which is described in detail in the next section, applies explicit recurring drift handling for price direction prediction to intra-day market data. The foundations of the algorithm start with the core ideas of ARF [31] as an evolving and incremental ensemble technique. The proposed approach extends these with the capability to store old models in a concept history. These models are subsequently retrieved when they are deemed suitable to improve the predictions of the current ensemble. The approach leverages certain ideas from some of the papers cited above, also including adaptive windowing to compare old and background learners based on buffers of different sizes, depending on the speed of the changes.

#### **3. Adaptive Ensemble of Classifiers for Evolving and Recurring Concepts**

The idea behind our proposal, Recurring Concepts Adaptive Random Forest (RCARF), is the development of an algorithm able to adapt itself to gradual, abrupt and also recurring drifts in the volatile data streams of the stock market. The main contribution of the approach is the explicit handling of recurring drifts in an incremental ensemble. This process is managed by two key components: the concept history, and the associated Internal Evaluator. Both are represented in Figure 1, which illustrates the overall structure of the algorithm.

**Figure 1.** RCARF structure.

In Algorithm 1, we show the overall pseudocode for the RCARF algorithm. RCARF inherits several features of the Adaptive Random Forest (ARF) algorithm proposed by Gomes et al. [31].

As mentioned above, RCARF is a Random Forest classifier [8]. These algorithms use a collection (ensemble) of "base" classifiers. Traditionally, the forest is a homogeneous set of tree-based classifiers. The full forest classifier performs a prediction for every example in a data stream. Each example or batch of examples is pushed to all the base classifiers, each of which then casts its vote for the most likely class for the example. Each vote is multiplied by the base classifier's weight, a value that is adapted later depending on whether the related base classifier prediction matches the "real" class of the example. The random component arises from the fact that each of the base classifiers in the ensemble takes into account only a random set of the examples' features. Even though each base classifier is deciding its individual vote based on partial information, the voting mechanism usually provides very accurate predictions, in many circumstances due to the reinforcing process of the voting mechanism.
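The weighted voting step described above can be illustrated with a minimal sketch. The function name `weighted_vote` and the example weights are hypothetical; this is not the MOA implementation, only an illustration of how per-tree votes scaled by tree weights combine into one ensemble prediction.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the class label with the largest accumulated weight.

    predictions: one class label per base classifier.
    weights: one non-negative weight per base classifier.
    """
    if len(predictions) != len(weights):
        raise ValueError("one weight is needed per prediction")
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    # Ties are resolved in favour of the label that was seen first.
    return max(totals, key=totals.get)


# Three trees vote "0" but the two trees voting "1" carry more weight.
votes = [1, 0, 1, 0, 0]
weights = [0.9, 0.4, 0.8, 0.5, 0.3]
print(weighted_vote(votes, weights))
```

In this example the minority vote wins because weights, rather than head counts, are summed per class.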

This general approach requires some adaptations to handle structural changes on the fly. RCARF implements a basic drift handling strategy along these lines inherited from ARF. To be ready to react properly to structural breaks, these algorithms have a mechanism to detect potential drifts in advance and ensure a smooth transition to new trees. A signal (warning) is raised by a very sensitive drift detector. This triggers the creation of a background tree and starts its training. In the case drift is confirmed (drift signal) at a later stage, the background tree replaces the associated one. Otherwise, it is discarded.
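The warning and drift signals described above can be sketched as a small state holder for one position in the ensemble. The class `TreeSlot` and the factory argument `new_tree` are hypothetical names, training is omitted, and the behaviour when a drift arrives without a preceding warning is an assumption made for this sketch.

```python
class TreeSlot:
    """One ensemble position: an active tree plus an optional background tree."""

    def __init__(self, new_tree):
        self._new_tree = new_tree      # hypothetical factory returning an untrained tree
        self.active = new_tree()
        self.background = None

    def on_warning(self):
        # A warning starts a fresh background tree; an earlier one is discarded.
        self.background = self._new_tree()

    def on_drift(self):
        # Drift confirmed: the background tree takes over.
        # Assumption: with no background tree available, start from a new tree.
        if self.background is not None:
            self.active = self.background
        else:
            self.active = self._new_tree()
        self.background = None
```

A driver loop would call `on_warning()` and `on_drift()` when the sensitive and the strict detectors fire, respectively.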

Unlike its predecessor, RCARF is also able to spot the recurrence of previously trained trees and retrieve them from a shared collection of inactive classifiers called concept history. Specific mechanisms in the decision process, such as the internal evaluator, are designed to make the best decision under drift conditions by using only the most adequate sample of recent data.

**Algorithm 1** RCARF algorithm. Adapted from ARF in [31]. Symbols: *m*, maximum features evaluated per split; *n*: total number of trees (*n* = |*T*|); *δw*, warning threshold; *δd*, drift threshold; *C*(·), change detection method; *S*, data stream; *B*, set of background trees; *W*(*t*), tree *t* weight; *P*(·), learning performance estimation function; *CH*, concept history; *TC*, temporal concept saved at the start of warning window.


In both adaptive versions of Random Forest, base classifiers are always the Hoeffding Trees used in ARF. That means that, hereafter, we use the term "tree" to refer to each one of these base classifiers. However, it is worth noting that the mechanism we propose does not depend on the type of base classifier, which may be replaced transparently.

For the description of the algorithm, it is important to take into account that every tree generated will be in one of three different states:


Code kept from ARF includes the function responsible for inducing each base tree (Algorithm 2) and the warning and drift detection and handling (Lines 1–21, 27–29 and 43–47 in Algorithm 1). The method retains the mechanisms related to the ensemble itself (bagging, weighting and voting). However, in RCARF, we introduce the steps required to manage the concept history and how to perform an informed decision as to how to replace active trees in case of drift (Lines 23–25, and 35–37). These aspects of RCARF are detailed in the sections that follow.

**Algorithm 2** Random Forest Tree Train (RFTreeTrain). Symbols: *λ*, fixed parameter to Poisson distribution; *GP*, grace period before recalculating heuristics for split test; m: maximum features evaluated per split; t, decision tree selected; (x, y), current training instance. Adapted from [31].

```
function RFTreeTrain(m, t, x, y)
    k ← Poisson(λ = 6)
    if k > 0 then
        l ← FindLeaf(t, x)
        UpdateLeafCounts(l, x, k)
        if examplesSeen(l) ≥ GP then
            AttemptSplit(l)
            if DidSplit(l) then
                CreateChildren(l, m)
            end if
        end if
    end if
end function
```
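The Poisson-weighted update in Algorithm 2 can be sketched in a few lines of Python. This is an illustration under stated assumptions: `CountingLearner` is a hypothetical stand-in for a Hoeffding tree that only tallies weighted class counts, and the Poisson sampler uses Knuth's multiplication method because the standard library offers none.

```python
import math
import random


def poisson(lam, rng):
    """Draw one Poisson(lam) sample with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


class CountingLearner:
    """Hypothetical stand-in for a tree: tallies weighted class counts only."""

    def __init__(self):
        self.counts = {}

    def learn(self, x, y, weight):
        self.counts[y] = self.counts.get(y, 0) + weight


def rf_tree_train(tree, x, y, rng, lam=6.0):
    """Train `tree` on (x, y) with weight k ~ Poisson(lam); skip when k == 0."""
    k = poisson(lam, rng)
    if k > 0:
        tree.learn(x, y, weight=k)
    return k
```

With λ = 6, a weight of zero is rare, so almost every example reaches every tree, each time with a different random weight.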
#### *3.1. Concept History*

As stated previously, one of the core elements of RCARF is the addition of a concept history to the ARF schema. The concept history (*CH*) is a collection of trees shared by all trees in the ensemble. This collection is created during the execution of the algorithm, and is stored for future use when an episode of concept drift impacts the performance of active trees. If an active tree is inserted in the concept history, it becomes available for the whole ensemble. If a tree from the concept history is "promoted" to be an active tree, it is immediately removed from the concept history.
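A minimal sketch of such a shared collection follows. The class name `ConceptHistory` and its methods are hypothetical; the only properties taken from the text are that inserted trees become available to the whole ensemble and that a promoted tree is removed from the collection.

```python
class ConceptHistory:
    """Shared pool of inactive trees, addressed by an integer key."""

    def __init__(self):
        self._trees = {}
        self._next_id = 0

    def insert(self, tree):
        """Store a tree and return the key under which it can be retrieved."""
        key = self._next_id
        self._trees[key] = tree
        self._next_id += 1
        return key

    def extract(self, key):
        """Remove and return a tree, so it cannot be promoted twice."""
        return self._trees.pop(key)

    def items(self):
        return list(self._trees.items())

    def __len__(self):
        return len(self._trees)
```

Keys are never reused in this sketch, which keeps the statistics gathered for a stored tree unambiguous.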

RCARF relies on the assumption that, particularly in the case of abrupt drift, the background tree learned from scratch from the beginning of the warning window may be at a disadvantage compared to an old tree adapted to obtain good results but subsequently discarded. This situation, which would be affected by the speed of the concept drift, is especially likely if we can expect episodes of recurring drift in the data. In that case, the concept history already contains trained trees well-adapted to the recurring concept. Thus, instead of discarding useful trees, the objective would be storing them and then recovering them whenever they become relevant again.

Figure 1 illustrates the structure of RCARF. First, incoming data examples are tested using the ensemble evaluator. Only then, the example is used also for training the active tree.

As shown in Algorithm 1 and described by Gomes et al. [31], when the error statistics increase over time up to a certain threshold, a warning is activated and a background tree is created to replace the active model in the case of drift. After performing these steps, a change detector decides if the algorithm must be prepared for the occurrence of concept drift (warning detection, Line 21) or if a drift has really happened (drift detection, Line 33).

In both ARF and RCARF, the "warning window" is defined as the period of time that starts when an active tree raises a warning and finishes when the same tree detects a drift. Each warning window is specific to an active tree, and resets in the case of false alarm; that is, if a new warning is raised by the same tree before the drift is confirmed. In ARF, if a drift is detected (Line 33), the warning window is finished and the background tree replaces the active tree. In RCARF, during the warning window, there is also an online evaluation on the background tree (the one linked to the active tree that has raised the warning) and all trees in the concept history to compare their performance. This is the task of an "internal evaluator", described below. Only when a drift is detected is the tree with the lowest error according to the internal evaluator promoted to active (Line 35). The previously stored copy of the active tree is then moved to the concept history (Line 37).

#### *3.2. Internal Evaluator*

RCARF has two types of evaluators: the ensemble one and the internal one.


**Algorithm 3** Internal evaluator. It computes the best transition in the case of drift. Symbols: *t*, active tree; *b*, background tree; *CH*, concept history; *c*, tree from *CH*; *WS*(*CH*), fixed window size in *CH*; *WS*(*b*), current window size in *b*, *W*(*c*), error statistics in *c* for the latest examples in *WS*(*CH*); *W*(*b*), error statistics in *b* according to *WS*(*b*).

```
function BestTransition(t, b, CH)
    for all c ∈ CH do                                   ▷ Rank of errors of each tree in CH
        addToRank(c, countErrors(W(c))/WS(CH))
    end for
    if minError(rank) ≤ (countErrors(W(b))/WS(b)) then
        R ← extractClassifier(CH, minErrorKey(rank))    ▷ Get and remove tree from the concept history
    else
        R ← b
    end if
    return R
end function
```
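The selection rule of Algorithm 3 can be sketched in Python as follows. This is an illustration, not the authors' code: the concept history is modelled as a plain dictionary, each evaluation window is a non-empty list of 0/1 outcomes (1 meaning error), and windows are assumed to be full, so that dividing by their length matches dividing by the window size.

```python
def error_rate(window):
    """Fraction of errors in a non-empty window of 0/1 outcomes (1 = error)."""
    return sum(window) / len(window)


def best_transition(background, background_window, concept_history, ch_windows):
    """Pick the replacement for a drifting tree.

    concept_history: dict mapping key -> stored tree (mutated on extraction).
    ch_windows: dict mapping key -> recent 0/1 outcomes of that stored tree.
    """
    rank = {key: error_rate(ch_windows[key]) for key in concept_history}
    background_error = error_rate(background_window)
    if rank:
        best_key = min(rank, key=rank.get)
        # A stored tree wins ties, as in the "≤" test of Algorithm 3.
        if rank[best_key] <= background_error:
            return concept_history.pop(best_key)
    return background
```

Extraction removes the chosen tree from the dictionary, mirroring the rule that a promoted tree leaves the concept history.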
**Algorithm 4** Internal evaluator with dynamic windows for background trees. Symbols: *WS*, evaluator window size; *W*, evaluator window; *SI*, size increments; *MS*, minimum size of window.

```
function AddEvaluationResults(value = correctlyClassifies ? 0 : 1)
    removeFirstElement(W)
    add(W, value)                                       ▷ Add result [1 (error) or 0 (success)] to window
    updateWindowSize()
    if (countOfErrors(W)/WS) < getErrorBeforeWarning then
        WS = WS + SI
    else if WS > MS then
        WS = WS − SI
    end if
end function
```
The adaptation mechanism for the window size in Algorithm 4 is as follows: if the error obtained by a background tree for its internal evaluator window size (*WS*) in the latest testing examples is lower than the error obtained by the active tree when it raised the warning signal, then *WS* increases. Otherwise, it decreases down to a minimum size (*MS*). The window is resized at most once per iteration (that is, per example evaluated). Increments and decrements of *WS* are performed according to an input parameter that defines "size increments" (*SI*).

The logic of the resizing mechanism relies on the interpretation of the error obtained by the background trees. In cases where it is greater than the error obtained by the active tree before warning, we believe that the underlying reason must be either because the background tree has not been trained with enough samples yet, or because the structure of the data stream is continuously changing (in a period of transition). In the second scenario, a smaller sample of the latest examples could be more accurate in estimating which is the best classifier for the latest concept (*WS* decreases). Otherwise, a larger sample would be desirable, as it would provide a more representative set of data (*WS* increases).
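The resizing rule of Algorithm 4 can be sketched as a small Python class. The class name `DynamicWindow` is hypothetical, and two details are assumptions of this sketch: the window is trimmed to the current size instead of always dropping its first element, and the error is divided by the nominal size *WS* even while the window is still filling up.

```python
from collections import deque


class DynamicWindow:
    """Evaluation window of 0/1 outcomes whose size adapts to recent error."""

    def __init__(self, error_before_warning, size=10, increment=1, min_size=5):
        self.error_before_warning = error_before_warning
        self.size = size            # WS
        self.increment = increment  # SI
        self.min_size = min_size    # MS
        self.results = deque()      # W

    def error(self):
        return sum(self.results) / self.size

    def add(self, correct):
        """Record one prediction outcome, then resize the window once."""
        self.results.append(0 if correct else 1)
        self._trim()
        if self.error() < self.error_before_warning:
            self.size += self.increment
        elif self.size > self.min_size:
            self.size = max(self.min_size, self.size - self.increment)
            self._trim()

    def _trim(self):
        while len(self.results) > self.size:
            self.results.popleft()
```

The default arguments are only placeholders for this sketch; any starting size, increment and minimum size can be passed in.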

#### *3.3. Training of the Available Trees*

The addition of the concept history and the differences in the replacement strategy used in RCARF entail the need to discuss the way data are used to train the trees. As in ARF, both the active and background trees are trained with new examples as soon as they are available (Lines 19 and 45 in Algorithm 1). However, trees in the concept history are adapted to data that correspond to a different concept. Therefore, they are not retrained unless they are promoted to active.

As mentioned above, in the case of drift, the active tree is replaced by either the best tree from the concept history or the background tree (Lines 35–37 in Algorithm 1) following Algorithm 3. In the case that the background tree was selected for promotion, the training examples from the warning window would already have been used for its training. Conversely, if a concept history tree were selected for promotion, these training examples would be discarded.

As stated by Alippi et al. [21], there is always a delay from the start of a concept drift to the start of the warning window. During this lag, it is not possible to warrant the isolation of a given concept. In this paper, for simplicity, we avoid taking into consideration this delay as a part of our analysis. Therefore, for the purpose of this work, we assumed that the start of every warning window that ends with the trigger of the drift (thus, when this is not a false alarm), matches the start of a concept drift. For this reason, even though active trees are being updated during warning windows, we consider that the moment in which they are best adapted to a given concept is just before the warning window. Hence, the tree that is pushed to the concept history is a snapshot of the active tree at the start of the warning window (see Lines 25 and 37 in Algorithm 1).

#### **4. Experimentation: Predicting the S&P500 Price Trend Direction**

#### *4.1. Data*

Data for this work were produced in the following way. First, we downloaded Exchange-Traded Fund (ETF) SPY prices for the entire first quarter of 2017 at the second level from QuantQuote (Data source: https://www.quantquote.com). This ETF, one of the most heavily traded, tracks the popular US index S&P 500. Secondly, we selected 10 different technical indicators as feature subsets based on the work by Kara et al. [37]. The default value of 10 s that we set for the number of periods, *n*, was extended in the case of the two moving averages. Once we considered the additional possibilities, 5 and 20 s, we ended up with the 14 features described in Table 1. These were computed with the TA-lib technical analysis library (Technical Analysis library: http://ta-lib.org/) using its default values for all parameters other than the time period.


**Table 1.** Selected technical indicators. Formulas as reported in Kara et al. [37] applied to second-level. Exponential and simple moving averages for 5 and 20 s added as extra features.

$C_t$ is the closing price, $L_t$ the low price, and $H_t$ the high price at time $t$. EMA is the exponential moving average, $EMA(k)_t = EMA(k)_{t-1} + \alpha \times (C_t - EMA(k)_{t-1})$, with smoothing factor $\alpha = 2/(1 + k)$, where $k$ is the time period of the $k$-second exponential moving average. $LL_t$ and $HH_t$ mean lowest low and highest high in the last $t$ seconds, respectively. $M_t = (H_t + L_t + C_t)/3$; $SM_t = \left(\sum_{i=1}^{n} M_{t-i+1}\right)/n$; $D_t = \left(\sum_{i=1}^{n} |M_{t-i+1} - SM_t|\right)/n$. $Up_t$ means the upward price change and $Dw_t$ the downward price change at time $t$. $n$ is the period used to compute the technical indicator in seconds.
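As a worked illustration of the exponential moving average recursion in the table notes, the following sketch applies the update with smoothing factor 2/(1 + *k*) to a short list of closing prices. Seeding the series with the first price is an assumption of this sketch; the library used in the experiments may initialise the average differently.

```python
def ema(prices, k):
    """Exponential moving average with smoothing factor alpha = 2 / (1 + k).

    Assumption: the series is seeded with the first price.
    """
    if not prices:
        return []
    alpha = 2.0 / (1.0 + k)
    values = [prices[0]]
    for price in prices[1:]:
        previous = values[-1]
        values.append(previous + alpha * (price - previous))
    return values


# With k = 3 the smoothing factor is 0.5, so each step moves halfway to the price.
print(ema([10.0, 12.0, 11.0], 3))
```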

The label, categorized as "0" or "1", indicates the direction of the next change in the ETF. If the SPY closing price at time *t* is higher than that at time *t* − 1, direction *t* is "1". If the SPY closing price at time *t* is lower than or equal to that at time *t* − 1, direction *t* is "0". Furthermore, as part of the labeling process, a lag of 1 s has been applied over the feature set. Thus, if the technical indicators belong to the instant *t* − 1, the label reflects the price change from *t* − 1 to *t*.
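The labeling rule and the one-second lag can be sketched as follows. The function name `label_directions` is hypothetical; it pairs the index of the feature row at *t* − 1 with the direction of the price change from *t* − 1 to *t*.

```python
def label_directions(closes):
    """Return (feature_index, label) pairs for a list of closing prices.

    The label is 1 when the close at t is strictly higher than at t - 1,
    and 0 otherwise; it is attached to the features observed at t - 1.
    """
    pairs = []
    for t in range(1, len(closes)):
        label = 1 if closes[t] > closes[t - 1] else 0
        pairs.append((t - 1, label))
    return pairs


print(label_directions([100.0, 100.5, 100.5, 100.2]))
```

The last observation has no successor, so it produces no labeled example.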

Short sellers are usually averse to holding positions over non-market hours and try to close them at the end of the day [58]. The price may jump, or the market can behave very differently the next morning. Therefore, only prices during market hours are considered in this work. In addition, as the technical indicators selected depend on the 35 previous seconds of data, the first 35 s are discarded for each day after processing the technical indicators. This filtering aims to avoid the influence of the previous day's trends, and prices before market hours.

#### *4.2. Experimental Setting*

We designed the experiments presented in this section with two separate purposes in mind.

First, we compared the utility of the recurring drift detection implemented in RCARF vs. the basic ARF approach. To perform a fair comparison between RCARF and ARF, both algorithms used the same ADaptive WINdowing (ADWIN) [59] change detector for warnings and drifts. Furthermore, both learners used the same adapted version of Hoeffding Trees as base classifier, and the same number of trees in their configuration.

Secondly, we aimed to prove that RCARF is a suitable candidate for this task compared to other state-of-the-art learners for data stream classification. For this comparison, we selected the following learners, all of them from the literature of online classification of non-stationary data streams: DWM [50] using Hoeffding Trees as base classifiers, a RCD learner [23] of Hoeffding Trees and Hoeffding Adaptive Tree (AHOEFT) [60]. All of the experiments were performed using the MOA framework [61], which provides implementations of the aforementioned algorithms, in a Microsoft Azure Virtual Machine "DS3 v2" with the Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30 GHz processor and 14 GB RAM.

ARF is able to train decision trees in parallel with multithreading to reduce the running time. However, in RCARF, multithreading would impact the results because of the addition of the concept history as a shared storage space used by all trees. Thus, in this work, all experiments were run on a single thread. The impact of multithreading is out of the scope of our proposal.

The dataset was modeled as a finite, ordered stream of data, and evaluation and training were performed simultaneously using *Interleaved*-*Test*-*Then*-*Train* evaluation [61]. In this method, a prediction is obtained for each example, and success or failure is recorded before the example is used for training and adjusting the model.

We evaluated each algorithm using the accumulated classification error at the end of the period. This error was calculated by dividing the number of misclassified examples by the total number of examples. However, this accumulated error was not adequate to compare how different algorithms behave in particular moments of time. Thus, we also calculated at regular intervals the error of each algorithm calculated over a fixed window of time (500 examples). This sequence could then be compared graphically.
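The interleaved test-then-train loop and the two error measures can be sketched as follows. `MajorityClass` is a hypothetical toy learner used only to make the example runnable, and reporting the windowed error over consecutive, non-overlapping blocks is an assumption of this sketch.

```python
class MajorityClass:
    """Toy binary learner: predicts the class seen most often so far."""

    def __init__(self):
        self.counts = {0: 0, 1: 0}

    def predict(self, x):
        return 1 if self.counts[1] > self.counts[0] else 0

    def learn(self, x, y):
        self.counts[y] += 1


def prequential(stream, model, window=500):
    """Return (accumulated error, list of per-window errors)."""
    errors = total = window_errors = 0
    windowed = []
    for x, y in stream:
        wrong = int(model.predict(x) != y)   # test first ...
        model.learn(x, y)                    # ... then train on the same example
        errors += wrong
        window_errors += wrong
        total += 1
        if total % window == 0:
            windowed.append(window_errors / window)
            window_errors = 0
    accumulated = errors / total if total else 0.0
    return accumulated, windowed


stream = [(None, 1), (None, 1), (None, 0), (None, 1)]
print(prequential(stream, MajorityClass(), window=2))
```

Because each example is scored before it is learned, no separate hold-out set is needed.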

Given the stochastic nature of the algorithms based on Random Forests (RCARF and ARF), in these cases, we performed 20 experiments and averaged the results. The statistical significance of the differences of performance among the algorithmic approaches was formally tested using a protocol that starts verifying the normality of the distribution of prediction errors over the mentioned experiments using the Lilliefors test. In the case that the null hypothesis of normality was rejected, we relied on the Wilcoxon test [62]. Otherwise, we tested for homoscedasticity using Levene's test and, depending on whether we could reject the null hypothesis, the process ends testing for equality of means using either a Welch test or a *t*-test. The significance levels considered in the tests for normality and homoscedasticity were set at 5%. For the rest, we considered both 5% and 1%.
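The decision flow of this testing protocol can be sketched as a small function that takes already computed p-values. The function name `choose_test` and its arguments are hypothetical, and the mapping of unequal variances to the Welch test (and equal variances to the *t*-test) is the conventional reading, assumed here.

```python
def choose_test(p_normality_a, p_normality_b, p_levene, alpha=0.05):
    """Select the test for comparing two samples of prediction errors.

    p_normality_a, p_normality_b: p-values of the normality test per sample.
    p_levene: p-value of Levene's test for equal variances.
    """
    if p_normality_a < alpha or p_normality_b < alpha:
        return "wilcoxon"      # normality rejected for at least one sample
    if p_levene < alpha:
        return "welch"         # normal samples, unequal variances
    return "t-test"            # normal samples, equal variances


print(choose_test(0.40, 0.62, 0.01))
```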

It is worth emphasizing that the approach that we describe predicts short-term market trends, but it does not generate any trading signals (that would require further processing). All the tested algorithms, including our proposed method, used the values of the raw technical indicators at specific points in time to generate a binary class prediction of the price trend (up or stable/down) for new data patterns. Ensemble based approaches have a number of internal classifiers whose predictions are subsequently combined to provide this prediction for the whole ensemble. We emphasize this idea at the end of Section 5.

#### *4.3. Parameter Selection and Sensitivity*

The parameterization effort was not subject to systematic optimization and, therefore, the performance of the algorithm might be understated. The algorithms in the experiments held most of their respective default parameters or recommended setups according to their authors. Nonetheless, there are certain points common to most of the algorithms that deserve a mention.


As mentioned above, in the experiments for ARF and RCARF, as change detector, we used the ADWIN algorithm proposed in [60]. The detailed procedure is described, for instance, in [63]. ADWIN has a variable sized sliding window of performance values. If a drift is detected, the window size is reduced; otherwise, it increases, becoming larger with longer concepts. When used as a drift detector, ADWIN stores two sub-windows to represent older and recent data. When the difference in the averages between these sub-windows surpasses a given threshold, a drift is detected.

The ADWIN change detector uses a parameter, *δ*, the value of which is directly related to its sensitivity. A large value sets a smaller threshold in the number of changes in the monitored statistic (error rate in our case) that triggers the detection event. Specific values for parameters of this sort are dependent on the signal-to-noise ratio of the specific data stream and may impact the overall performance of the algorithm.
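The relation between *δ* and the detection threshold can be illustrated with a simplified check on a single split of the window. The sketch uses the Hoeffding-style cut threshold from the original ADWIN description, $\epsilon_{cut} = \sqrt{\ln(4n/\delta)/(2m)}$ with $m$ the harmonic combination of the sub-window lengths; the function names are hypothetical, and practical implementations differ in details (window compression, tighter bounds) that are not reproduced here.

```python
import math


def adwin_cut_threshold(n0, n1, delta):
    """Hoeffding-style threshold for sub-windows of lengths n0 and n1."""
    n = n0 + n1
    m = 1.0 / (1.0 / n0 + 1.0 / n1)
    delta_prime = delta / n
    return math.sqrt(math.log(4.0 / delta_prime) / (2.0 * m))


def split_signals_change(old, recent, delta):
    """True when the sub-window means differ by at least the threshold."""
    mean_old = sum(old) / len(old)
    mean_recent = sum(recent) / len(recent)
    threshold = adwin_cut_threshold(len(old), len(recent), delta)
    return abs(mean_old - mean_recent) >= threshold


# A larger delta yields a smaller threshold, hence a more sensitive detector.
for delta in (0.3, 0.15, 0.001):
    print(delta, round(adwin_cut_threshold(100, 100, delta), 4))
```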

The RCARF algorithm assumes that a background tree is created and starts learning as soon as a change is reported by the ADWIN detector with *δ<sup>w</sup>* sensitivity. This background tree is only upgraded to "active" status when drift is confirmed by a second change detector triggered with *δ<sup>d</sup>* sensitivity. Thus, the value for warning (*δw*) has to be greater than the value for drift (*δd*). Large values were selected for the RCARF change detector (*δ<sup>w</sup>* and *δd*) to ensure that concept history trees are given a chance to replace the active tree often enough to detect abrupt changes. These were set to *δ<sup>w</sup>* = 0.3 and *δ<sup>d</sup>* = 0.15.

The starting size of the dynamic windows for the internal evaluator was 10 examples, with increments or decrements of 1 example in the background trees, and a minimum size of 5 examples.

Although this should be confirmed by further analysis, our experiments suggest that ARF is in general more sensitive to the values of *δ* than RCARF. We believe that this can be explained by the fact that, in the case of early detection of a drift or abrupt changes, when the background tree is not yet ready to replace the active model, RCARF can still transition to a recurring decision tree that outperforms the incompletely trained background tree. Because of the sensitivity of ARF to these parameters, we tested three configurations for ARF (two of them, "*moderate*" and "*fast*", recommended by the authors in [31]) that are summarized in Table 2. Regarding RCD, given that it uses a single ADWIN change detector, we selected the same value that was used for drift detection in RCARF.

**Table 2.** Sensitivity parameters for the ADWIN change detector in ARF and RCARF.


#### *4.4. Global Performance Comparison*

Table 3 summarizes the results of the experimental work providing the main descriptive statistics for the accumulated error (%) in predicting the market trend, for all the algorithms on the whole dataset over 20 runs. As can be seen, RCARF obtains the most competitive results. The reported differences were formally tested using the previously described protocol, and all of them were statistically significant at 1%.

**Table 3.** Global comparison. Accumulated error (%) for all algorithms on the whole dataset, sorted from best to worst result. Main descriptive statistics over 20 runs. Differences are significant at 1%.


The differences between RCARF and ARF are due to the fact that, in some of the abrupt changes, RCARF is able to replace the active tree with a trained tree from its concept history, the performance of which is better than the performance of the background tree used by ARF under the same circumstances. When these gains are averaged over the whole period (including stable periods without concept drift), the final average difference is small. Because of the low signal-to-noise ratio in this domain, we believe that these small gains in predictive accuracy may create a competitive advantage for any trading system that might use the prediction of RCARF as part of its decision process.

Two configurations of ARF, ARF*moderate* and ARF*fast*, obtained the second- and third-best results, followed by RCD. AHOEFT obtained the worst result, which was expected, as this algorithm maintains a single tree (not an ensemble of trees). It is well-known that ensemble methods can be used for improving prediction performance [64]. We can also conclude that configurations for ARF suggested by the authors are better than ARF*ultra*, which used the same parameters as RCARF. This may be explained by the fact that this configuration may be too sensitive to noise. It produces too many switches to background trees that are not yet accurately trained when they must be promoted to be active trees. RCARF, instead, is able to switch to pre-trained trees stored in the concept history, thus avoiding the corresponding decrease in performance.

As can be seen in Table 4, RCARF performed an average of 85 drifts per decision tree, for a total of 3411 drifts on average per experiment. However, the final number of decision trees in the concept history was on average 118 trees. As each recurring drift pushes one tree but also deletes one tree from the concept history, the table shows that there were only 118 background drifts in an average experiment, while there were more than 3000 recurring drifts in every experiment. This, together with the obtained results for RCARF, shows that the recurring drift mechanism was used to resolve most of the drift situations.
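The bookkeeping behind these figures can be checked with a short calculation. It is a sketch that assumes the concept history starts empty and that only background drifts change its size (a recurring drift pushes one tree and removes one); the ensemble size of 40 trees is the value the text gives for the warning statistics.

```python
ensemble_size = 40           # number of trees in the ensemble
total_drifts = 3411          # average total drifts per experiment
concept_history_trees = 118  # average final size of the concept history

# Each background drift adds one tree to the concept history, while a
# recurring drift leaves its size unchanged, so the final size counts
# the background drifts (under the assumptions stated above).
background_drifts = concept_history_trees
recurring_drifts = total_drifts - background_drifts

drifts_per_tree = total_drifts / ensemble_size
recurring_share = recurring_drifts / total_drifts

print(f"drifts per tree:  {drifts_per_tree:.1f}")
print(f"recurring drifts: {recurring_drifts}")
print(f"recurring share:  {recurring_share:.1%}")
```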

**Table 4.** Internal statistics for RCARF on the whole dataset over 20 runs. # Drifts, number of total drifts during the execution (both recurring and background); Drifts per tree, number of total drifts during the execution (both recurring and background) divided by the ensemble size; # F. Warnings, number of active warnings at the end of the execution; # *CH* Trees, number of decision trees in the concept history at the end of the execution.


Another issue of interest is that the final number of active warnings (at the end of the experiments) was between 9 and 19; that is, with 40 trees, a percentage between 22.5% and 50% of the total were in warning at this point in time. Obviously, this fraction changed continuously during the experiment and was different on every run and for each base classifier. This number depends on the sensitivity parameter in RCARF, *δ<sup>w</sup>*, and may be taken as a measure of the number of "open" warning windows in a given experiment. A lower value for *δ<sup>w</sup>* may be chosen to reduce the number of warning windows opened simultaneously.

In terms of efficiency, the average full running time of RCARF on the whole dataset over 20 experiments was 35,263 s. This is less than the 10 computing hours for an entire quarter of market data at 1-s level. Hence, although the experiments were not run against the market in real time, RCARF demonstrates the ability to operate in an online setting at 1-s level on the server used.

#### *4.5. Evolution of the Ensemble over Time*

To show the overall behavior of RCARF, we have included Figure 2. It shows the evolution of error in RCARF for a short period of time (the first trading day of the year). Vertical lines are used to signal moments where a drift occurred and an active tree was replaced with one of the trees in the ensemble. Red dotted lines indicate times where a background tree became active, while blue dashed lines indicate times where a concept history tree was re-activated (recurrent drift). As we can observe in Figure 2, drifts are detected throughout the whole period of time.

At the beginning of the experiment, the error was higher because the models had not yet had the opportunity to adjust to the data; therefore, drifts occurred quite often and sometimes with very short intervals between them. Later on, drifts were sparser. Most of the transitions were to trees that were stored in the concept history (blue dashed drifts in Figure 2), and not very often to background trees (red dotted drifts). That is, concept history trees were used most of the time instead of background trees, which indicates that storing information from the past helped the RCARF algorithm on this particular dataset.

**Figure 2.** Sample run of RCARF on a single test for the first trading day. Error measured on windows of 500 examples. Red dotted vertical lines mark drifts to background trees, and blue dashed vertical lines mark drifts to recurring trees.

Figure 3 compares the results of all of the algorithms over a portion of the training set. Because the data are sampled every second, we have smoothed the plots by averaging the error over 1000 examples. The first 1000 examples are excluded from the chart for this reason. The aim of the figure is to illustrate the performance of the algorithms over a specific period of time. Given the length of the time series used in the experimental analysis and the fact that the algorithms were run a number of times, it is hard to extract clear conclusions out of it. The performance comparison should be made based on the global performance indicators and statistical tests reported in Table 3.
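The smoothing described above can be sketched as a sliding-window mean over per-example prediction mistakes. The stream below is synthetic, and the function name and the 40% error rate are hypothetical; only the window length of 1000 examples comes from the text.

```python
import random


def windowed_error(mistakes, window=1000):
    """Mean error (%) over every run of `window` consecutive examples.

    mistakes: sequence of 0/1 flags, 1 meaning the prediction was wrong.
    The first value is only available once `window` examples have been seen.
    """
    if window <= 0 or window > len(mistakes):
        raise ValueError("window must be in 1..len(mistakes)")
    running = sum(mistakes[:window])
    curve = [100.0 * running / window]
    for i in range(window, len(mistakes)):
        # Slide the window: add the newest flag, drop the oldest one.
        running += mistakes[i] - mistakes[i - window]
        curve.append(100.0 * running / window)
    return curve


random.seed(0)
mistakes = [1 if random.random() < 0.4 else 0 for _ in range(5000)]  # synthetic
curve = windowed_error(mistakes, window=1000)
print(len(curve), round(curve[0], 2), round(curve[-1], 2))
```

Because no average exists until a full window has been observed, the curve is shorter than the stream, which matches the exclusion of the initial examples from the chart.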

Having said that, the figure is consistent with the mentioned results. ARF and RCARF show a similar behavior, and their average error over time tended to be below the one found for the other algorithms. This is interesting because it suggests that these algorithms might indeed be superior under most circumstances, and not only under some specific market conditions that might be difficult to capture with AHOEFT, RCD, and DWM. RCARF and ARF*moderate*, the closest competitor, often overlapped. However, RCARF was often dominant for short periods of time. This would be consistent with the notion that RCARF should benefit from the use of its concept history to adjust faster to drifts than ARF, which would eventually accumulate enough evidence to converge to a similar model.

**Figure 3.** Algorithm comparison. Average error measured on windows of 1000 examples for an example period of time. For RCARF, ARF*ultra*, ARF*fast* and ARF*moderate*, we show the average result of 20 runs.

#### **5. Summary and Conclusions**

In this paper, we introduce RCARF, an ensemble tree-based online classifier that handles recurring concepts explicitly. The algorithm extends the capabilities of Adaptive Random Forests (ARF) adding a mechanism to store and handle a shared collection of inactive trees, called concept history. This works in conjunction with a decision strategy that reacts to drift by replacing active trees with the best available alternative: either a previously stored tree from the concept history or a newly trained background tree. Both mechanisms are designed to provide fast reaction times and are thus applicable to high-frequency data.

The experimentation was conducted on data from a full quarter of both price and trade volumes for the SPY Exchange-Traded Fund. This ETF, one of the most heavily traded, tracks the S&P 500 index. Both series were downloaded with a resolution of 1 s. We defined a classification problem where the objective was to predict whether the price would rise or fall. For this classification task, we used as attributes a list of technical indicators commonly found in the literature. These indicators were labeled with the predicted behavior (class) and the result was fed as a data stream to our test bench of online stream classifiers, including our proposal, RCARF.

The experimental results show that RCARF offers a statistically significant improvement over the comparable methods. Given that the main difference between ARF and RCARF is the fact that the second one uses recurring concepts, the new evidence would support the hypothesis that keeping a memory of market models adds value versus a mere continuous adaptation. The idea that old models might end up eventually being more useful than the ones that are being fitted at the time, mostly due to faster adaptation to the market state, has interesting implications from a financial point of view. The reported results would support the idea of history repeating in terms of the price generation process. The market would not always transition to completely new market states, but also switch back to previous (or similar) ones. Recognition of the previous aspect is an extra insight for financial experts that might be used to obtain excess returns. This, however, is something to be analyzed in the future.

This work was focused on trend prediction with adaptation to concept drift, but we did not intend to derive any trading system. Actually, the implementation of such a system might require reframing the classification problem to include a larger number of alternatives that could discriminate not only the direction of price changes, but also their magnitude. The current version of the algorithm predicts short-term market trends to a certain extent; whether there is a way to profitably exploit market regularities is yet to be determined. For that reason, while it is clear that our results are compatible with arguments against the efficient-market hypothesis, we cannot claim that we can consistently beat buy and hold and, therefore, we cannot reject it.

Future extensions of this work might include optimization of the algorithm for ultra-high frequencies and the development of further methods to adapt and resize the internal evaluator, such as the possibility of saving the window size as part of the concept to be inherited in case of recurring drifts and new window resizing policies for the historical models. All this might contribute to the optimization of the process that currently selects between recurrent or new decision trees. Finally, another possibility would be the addition of meta-cognition to evaluate recurring behaviors from the history by looking at previous transitions of the model.

**Author Contributions:** A.L.S.-C. and A.C. conceived the algorithm; A.L.S.-C. implemented the solution; A.L.S.-C., A.C., and D.Q. designed the experiments; A.L.S.-C. ran the experiments; and A.L.S.-C., A.C., and D.Q. analyzed the data and wrote the paper.

**Funding:** This research was funded by the Spanish Ministry of Economy and Competitiveness under grant number ENE2014-56126-C2-2-R.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Multi-Modal Deep Hand Sign Language Recognition in Still Images Using Restricted Boltzmann Machine**

#### **Razieh Rastgoo 1,2, Kourosh Kiani 1,\* and Sergio Escalera <sup>2</sup>**


Received: 12 September 2018; Accepted: 3 October 2018; Published: 23 October 2018

**Abstract:** In this paper, a deep learning approach, Restricted Boltzmann Machine (RBM), is used to perform automatic hand sign language recognition from visual data. We evaluate how RBM, as a deep generative model, is capable of generating the distribution of the input data for an enhanced recognition of unseen data. Two modalities, RGB and Depth, are considered in the model input in three forms: original image, cropped image, and noisy cropped image. Five crops of the input image are used, and the hands in these cropped images are detected using a Convolutional Neural Network (CNN). After that, three types of the detected hand images are generated for each modality and input to RBMs. The outputs of the RBMs for two modalities are fused in another RBM in order to recognize the output sign label of the input image. The proposed multi-modal model is trained on all and part of the American alphabet and digits of four publicly available datasets. We also evaluate the robustness of the proposal against noise. Experimental results show that the proposed multi-modal model, using crops and the RBM fusing methodology, achieves state-of-the-art results on the Massey University Gesture Dataset 2012, the American Sign Language (ASL) Fingerspelling Dataset from the University of Surrey's Center for Vision, Speech and Signal Processing, NYU, and ASL Fingerspelling A datasets.

**Keywords:** hand sign language; deep learning; restricted Boltzmann machine (RBM); multi-modal; profoundly deaf; noisy image

#### **1. Introduction**

Profoundly deaf people have many problems in communicating with other people in society. Due to impairment in hearing and speaking, profoundly deaf people cannot have normal communication with other people. A special language is fundamental in order for profoundly deaf people to be able to communicate with others [1]. In recent years, some projects and studies have been proposed to create or improve smart systems for this population to recognize and detect the sign language from hand and face gestures in visual data. While each method provides different properties, more research is required to provide a complete and accurate model for sign language recognition. Using deep learning approaches has become common for improving the recognition accuracy of sign language models in recent years. In this work, we use a generative deep model, Restricted Boltzmann Machine (RBM), using two visual modalities, RGB and Depth, for automatic sign language recognition. A Convolutional Neural Network (CNN) model, using Faster Region-based Convolutional Neural Network (Faster-RCNN) [2], is applied for hand detection in the input image. Then, our goal is to test how a generative deep model, able to generate data from modeled data distribution probabilities, in combination with different visual modalities, can improve recognition performance of state-of-the-art alternatives for sign language recognition. The contributions of this paper are summarized as follows:


The rest of this paper is organized as follows: Section 2 reviews the related materials and methods as well as the details of the proposed model. Experimental results on four publicly available datasets are presented in Section 3. Finally, Section 4 concludes the work.

#### **2. Materials and Methods**

#### *2.1. Related Work*

Sign language recognition has seen a major breakthrough in the field of Computer Vision in recent years [3]. A detailed review of sign language recognition models can be found in [4]. The challenges of developing sign language recognition models range from the image acquisition to the classification process [3]. We present a brief review of some related models of sign language recognition in two categories:

• Deep-based models: In this category, the proposed models use deep learning approaches for accuracy improvement. A profoundly deaf sign language recognition model using the Convolutional Neural Network (CNN) was developed by Garcia and Viesca [5]. Their model classifies correctly some letters of the American alphabet when tested for the first time, and some other letters most of the time. They fine-tuned the GoogLeNet model and trained their model on American Sign Language (ASL) and the Finger Spelling Dataset from the University of Surrey's Center for Vision, Speech, and Signal Processing and Massey University Gesture Dataset 2012 [5]. Koller et al. used Deep Convolutional Neural Network (DCNN) and Hidden-Markov-Model (HMM) to model mouth shapes to recognize sign language. The classification accuracy of their model outperformed state-of-the-art mouth model recognition systems [6]. An RGB ASL Image Dataset (ASLID) and a deep learning-based model were introduced by Gattupalli et al. to improve the pose estimation of the sign language models. They measured the recognition accuracy of two deep learning-based state-of-the-art methods on the provided dataset [7]. Koller et al. proposed a hybrid model, including CNN and Hidden Markov Model (HMM), to handle the sequence data in sign language recognition. They interpreted the output of their model in a Bayesian fashion [8]. Guo et al. suggested a tree-structured Region Ensemble Network (REN) for 3D hand pose estimation by dividing the last convolution outputs of CNN into some grid regions. They achieved state-of-the-art estimation accuracy on three public datasets [9]. Deng et al. designed a 3D CNN for hand pose estimation from a single depth image. This model directly produces the 3D hand pose and does not need further processing. They achieved state-of-the-art estimation accuracy on two public datasets [10]. A model-based deep learning approach has been suggested by Zhou et al. [11]. 
They used a 3D CNN with a kinematics-based layer to estimate the hand geometric parameters. Their experimental results show that they attained state-of-the-art estimation accuracy on some publicly available datasets. A Deep Neural Network (DNN) has been proposed by the LIRIS team of ChaLearn challenge 2014 for hand gesture recognition from two input modalities, RGB and Depth. They achieved the highest accuracy results of the challenge, using early fusion of joint motion features from two input modalities [12]. Koller et al. presented a new approach to classify the input frames using an embedded CNN within an iterative Expectation Maximization (EM) algorithm. The proposed model has been evaluated on over 3000 manually labelled hand shape images of 60 different classes and led to 62.8% top-1 accuracy on the input data [13]. While their model is applied not only to image input but also to frame sequences of a video, there is considerable room to improve the model in terms of time and complexity, due to the use of HMMs and the EM algorithm. Guo et al. [14] proposed a simple tree-structured REN for 3D coordinate regression of depth image input. They partitioned the last convolution outputs of ConvNet into several grid regions and integrated the output of fully connected (FC) regressors from regions into another FC layer.

• Non-deep models: In this category, the proposed models do not use deep learning approaches. Philomena and Jasmin suggested a smart system composed of a group of Flex sensors, machine learning and artificial intelligence concepts to recognize hand gestures and show the suitable form of outputs. Unfortunately, this system has been defined as a research project and the experimental results have not been reported [15]. Narayan Sawant designed and implemented an Indian Sign Language recognition system to recognize the 26-character alphabet by using the HSV color model and Principal Component Analysis (PCA) algorithm. In this work, the experimental results have not been reported [16]. Ullah designed a hand gesture recognition system using the Cartesian Genetic Programming (CGP) technique for American Sign Language (ASL). Unfortunately, the designed system is still restricted and slow. Improving the recognition accuracy and learning ability of the suggested system is necessary [17]. Kalsh and Garewal proposed a real-time system for hand sign recognition using different hand shapes. They used the Canny edge detection algorithm and gray-level images. They selected only six letters of the ASL alphabet and achieved a recognition accuracy of 100% [18]. An Adaptive Neuro-Fuzzy Inference System (ANFIS) was designed to recognize sign language by Wankhade and Zade. They compared the performance of Neural Network, HMM, and ANFIS for sign language recognition. Based on their experimental results for 35 samples, ANFIS had a higher accuracy than the other methods [19]. Plawiak et al. [20] designed a system for efficient recognition of hand body language based on specialized glove sensors. Their model used Probabilistic Neural Network, Support Vector Machine, and K-Nearest Neighbor algorithms for gesture recognition. The proposed model has been evaluated on data collected from ten people performing 22 hand body languages. While the experimental results show high recognition performance, gestures with low inter-class variability are misclassified.

In this work, we propose a deep-based model using RBM to improve sign language recognition accuracy from two input modalities, RGB and Depth. Using three forms of the input images, original, cropped, and noisy cropped, the hands in these images are detected using CNN. Each of these forms for each modality is passed to an RBM, and the outputs of these RBMs are fused in another RBM to recognize the output hand sign language label. Furthermore, we evaluate the noise robustness of the model by generating different test cases, including different types of noise applied to input images. Based on the labels of the input images, some states, including all or parts of the output class labels, are generated. Some of the letters, such as Z and Y, are hard to detect because of the complexities in their signs. In this regard, we generate different states in order to have the freedom to ignore these hard-to-detect letters in some of the states. We expect that the states that do not include the hard-to-detect letters or digits have good recognition accuracy. The proposed model is trained on the Massey, ASL dataset at Surrey, NYU, and ASL Fingerspelling A datasets and achieves state-of-the-art results.

#### *2.2. Proposed Model*

The proposed model includes the following steps:

• Inputs: The original input images are entered into the model in order to extract their features. As Figure 1 shows, we use two modalities, RGB and Depth, in the input images. In the case of one modality in the input images, we use the model illustrated in Figures 2 and 3 for depth and RGB input images.

**Figure 1.** The proposed model.

**Figure 2.** The proposed model in the case of using just depth modality in the input.

**Figure 3.** The proposed model in the case of using just RGB modality in the input.


noisy crops is separately input to the RBM.

Second RBM: Five crops of RGB input image are the inputs of the second RBM.

Third RBM: Only the original detected hand of the RGB input image is considered as the input of third RBM.

Fourth RBM: Five depth noisy cropped images are separately sent to the fourth RBM.

Fifth RBM: The inputs of the fifth RBM are five depth cropped images.

	- decrease the dimension but also to generate the distribution of data to recognize the final hand sign label. In Figure 1, we show how to use these RBMs in our model.

Details of the mentioned parts of the proposed method are explained in the following sub-sections.

#### 2.2.1. Input Image

We use two modalities, RGB and depth, in the input images. In the case that we have only one modality in the input images, we use a part of the model for that input modality. In the proposed multi-modal model, Figure 4, the top part of the model, as seen in Figure 2, is the model for depth inputs and the bottom part, as seen in Figure 3, is the model for RGB inputs.

**Figure 4.** Flowchart of the proposed model.

#### 2.2.2. Hand Detecting

The hands in the input image are detected using the fine-tuned Faster-RCNN [2]. Faster-RCNN is a fast framework for object detection using CNN. The Faster-RCNN network takes an input image and a set of object proposals. The outputs of this network are the real-valued number-encoded refined bounding-box positions for each of the output classes in the network. Faster-RCNN uses a Region Proposal Network (RPN) to share full-image convolutional features with the detection network, which leads to providing approximately cost-free region proposals. RPN is a fully convolutional network that is used to predict the object bounds. Faster-RCNN achieved state-of-the-art object detection accuracy on some public datasets. In addition, Faster-RCNN has a high frame rate detection on very deep networks such as VGG-16. Sharing the convolutional features has led to decreasing the parameters as well as increasing the detection speed in the network. Due to its high speed and low cost in object detection, we used Faster-RCNN to detect the hands in the input images.

#### 2.2.3. Image Cropping

To increase the accuracy of the proposed method in recognizing the hand sign language under different situations, different crops of input images are used, as Figure 5 shows. Using different crops is helpful for increasing the accuracy of the model in recognizing input images in situations where some parts of the images do not exist or have been destroyed. In addition, by using these crops, the size of the dataset is increased, being beneficial for deep learning approaches. The proposed method is evaluated by using different numbers of crops to select the suitable number of crops. Furthermore, the proposed method is trained not only on the input images without any crops but also on the cropped images. A sample generating different crops of an image is shown in Figure 6.

**Figure 5.** Generating different crops of the input image.

**Figure 6.** A sample image and generated crops.
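As an illustration of the cropping step, the sketch below generates five crops of an image with NumPy. The text does not state where the crops are taken, so the four-corners-plus-center layout, the function name, and the 256 × 256 source size are assumptions; the 227 × 227 crop size matches the RBM input size reported later in the paper.

```python
import numpy as np


def five_crops(image: np.ndarray, crop_h: int, crop_w: int) -> list:
    """Return four corner crops and one center crop (assumed layout)."""
    h, w = image.shape[:2]
    if crop_h > h or crop_w > w:
        raise ValueError("crop size must not exceed image size")
    origins = [
        (0, 0),                                # top-left
        (0, w - crop_w),                       # top-right
        (h - crop_h, 0),                       # bottom-left
        (h - crop_h, w - crop_w),              # bottom-right
        ((h - crop_h) // 2, (w - crop_w) // 2),  # center
    ]
    return [image[t:t + crop_h, l:l + crop_w] for t, l in origins]


image = np.zeros((256, 256, 3), dtype=np.uint8)  # placeholder hand image
crops = five_crops(image, 227, 227)
print(len(crops), crops[0].shape)
```

Each crop drops a different border region, which is one way to expose the model to images where part of the hand is missing.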

#### 2.2.4. Add Noise

To increase the noise robustness of the proposed method, three types of noise are added to the input images. Figure 7 shows a sample image as well as the applied noises. Gaussian, Gaussian Blur, and Salt-and-Pepper noises are selected due to some beneficial features such as being additive, independent at each pixel, and independent of signal intensity. Four test sets are generated to evaluate the noise robustness of the proposed method as follows:


4. TSet4: In this test set, Gaussian Blur noise is added to the data.

**Figure 7.** A sample image applying different kinds of noise. (**Left column**): original images, (**Middle column**): Gaussian noise, (**Right column**): Salt-and-pepper noise.
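Two of the noise types can be sketched with NumPy as follows. This is a minimal illustration, not the authors' Matlab code: it assumes pixel intensities scaled to [0, 1], uses the variance (0.16) and noise density (0.13) reported in the implementation details, and omits Gaussian Blur, which would need a convolution routine. The function names are hypothetical.

```python
import numpy as np


def add_gaussian_noise(image, variance=0.16, rng=None):
    """Add zero-mean Gaussian noise and clip back to the [0, 1] range."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = image + rng.normal(0.0, np.sqrt(variance), size=image.shape)
    return np.clip(noisy, 0.0, 1.0)


def add_salt_and_pepper(image, density=0.13, rng=None):
    """Set roughly `density` of the pixels to black (pepper) or white (salt)."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = image.copy()
    u = rng.random(image.shape[:2])  # one draw per pixel, shared by channels
    noisy[u < density / 2] = 0.0                       # pepper
    noisy[(u >= density / 2) & (u < density)] = 1.0    # salt
    return noisy


rng = np.random.default_rng(0)
image = np.full((64, 64, 3), 0.5)  # flat gray placeholder image
gaussian = add_gaussian_noise(image, rng=rng)
salt_pepper = add_salt_and_pepper(image, rng=rng)
corrupted_fraction = float((salt_pepper[..., 0] != 0.5).mean())
print(gaussian.shape, round(corrupted_fraction, 3))
```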

#### 2.2.5. Entry into the RBM

RBM is an energy-based model that is shown via an undirected graph, as illustrated in Figure 8. RBM is used as a generative model in different types of data and applications to approximate data distribution. The RBM graph contains two layers, namely visible and hidden units. While the units of each layer are independent of each other, they are conditioned on the units of the other layer. RBM can be trained by using the Contrastive Divergence (CD) learning algorithm. To acquire a suitable estimator of the log-likelihood gradient in RBM, Gibbs sampling is used. Suitable adjustment of the parameters of RBM, such as the learning rate, the momentum, the initial values of the weights, and the number of hidden units, plays a very important role in the convergence of the model [21,22].

**Figure 8.** RBM network graph.
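The training procedure described above can be sketched with a toy binary RBM trained by one-step Contrastive Divergence (CD-1). This is a minimal NumPy illustration under several assumptions: binary visible and hidden units, full-batch updates without momentum or weight decay, and toy layer sizes and data instead of the image-sized layers used in the paper. The class and method names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class BernoulliRBM:
    """Toy RBM with binary visible and hidden units (illustrative sizes)."""

    def __init__(self, n_visible, n_hidden, rng):
        self.rng = rng
        self.W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))
        self.b = np.zeros(n_visible)  # visible biases
        self.c = np.zeros(n_hidden)   # hidden biases

    def hidden_probs(self, v):
        # Hidden units are conditionally independent given the visible layer.
        return sigmoid(v @ self.W + self.c)

    def visible_probs(self, h):
        # Visible units are conditionally independent given the hidden layer.
        return sigmoid(h @ self.W.T + self.b)

    def cd1_step(self, v0, lr=0.1):
        """One CD-1 update; returns the mean squared reconstruction error."""
        ph0 = self.hidden_probs(v0)                           # positive phase
        h0 = (self.rng.random(ph0.shape) < ph0).astype(float)  # Gibbs sample
        v1 = self.visible_probs(h0)   # reconstruction (probabilities)
        ph1 = self.hidden_probs(v1)   # negative phase
        n = v0.shape[0]
        self.W += lr * (v0.T @ ph0 - v1.T @ ph1) / n
        self.b += lr * (v0 - v1).mean(axis=0)
        self.c += lr * (ph0 - ph1).mean(axis=0)
        return float(np.mean((v0 - v1) ** 2))


rng = np.random.default_rng(0)
prototypes = np.array([[1.0] * 8 + [0.0] * 8, [0.0] * 8 + [1.0] * 8])
labels = (rng.random(200) < 0.2).astype(int)  # unbalanced mix of two patterns
data = prototypes[labels]

rbm = BernoulliRBM(n_visible=16, n_hidden=8, rng=rng)
errors = [rbm.cd1_step(data, lr=0.1) for _ in range(300)]
print(round(errors[0], 3), round(errors[-1], 3))
```

The reconstruction error is only a convenient progress indicator here; it is not the quantity that CD approximately optimizes.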

We are using a reduced set of data where CNN approaches are not able to generalize well. In this case, RBM, a deep learning model with fewer parameters on the generated dataset, can be a good alternative. In the proposed method, we use RBM for hand sign recognition. The results comparing the proposed method with CNN models show that the RBM model outperforms them for hand sign recognition on the tested datasets. We use several RBMs in the proposed method for generating the distribution of the input data as well as for recognizing the hand sign label. For each input image modality, we use three RBMs for three forms of input images, which are: original detected hand image, five cropped detected hand images, and five noisy cropped detected hand images. While the input layer of these RBMs has 227 × 227 × 3 visible neurons, the hidden layer has 500 neurons. Figure 9 shows the RGB cropped detected hand inputs of one of the RBMs used in the proposed model.

**Figure 9.** The RGB cropped detected hand inputs of one of the RBMs used in the proposed model.

#### 2.2.6. Outputs Fusing

The outputs of the RBMs, used for each form of the input image for each input modality, are fused in another RBM for hand sign label recognition. In the case of just one modality, RGB or depth, we fuse the three RBM outputs of the three input image forms; for two-modality inputs, we fuse six RBM outputs. Figure 10 shows the RBM outputs fusing for two-modality inputs of our model.

**Figure 10.** RBM outputs fusing in two-modality inputs of our model.
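One plausible reading of the fusion step is that the hidden activations of the branch RBMs are concatenated to form the visible layer of the fusing RBM. The text does not specify the fusion operator, so concatenation, the function name, and the batch size below are assumptions; the 500 hidden units per branch come from the model description.

```python
import numpy as np

N_HIDDEN = 500  # hidden units per branch RBM, as reported for the model


def fuse_outputs(branch_outputs):
    """Concatenate per-branch hidden activations along the feature axis.

    Concatenation is an assumed fusion operator, used here for illustration.
    """
    return np.concatenate(branch_outputs, axis=1)


rng = np.random.default_rng(0)
batch = 4  # hypothetical batch size
rgb_branches = [rng.random((batch, N_HIDDEN)) for _ in range(3)]
depth_branches = [rng.random((batch, N_HIDDEN)) for _ in range(3)]

fused_rgb_only = fuse_outputs(rgb_branches)                       # one modality
fused_two_modalities = fuse_outputs(rgb_branches + depth_branches)  # both
print(fused_rgb_only.shape, fused_two_modalities.shape)
```

Under this assumption the fusing RBM would see 1500 inputs with a single modality and 3000 inputs with both.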

#### **3. Results and Discussion**

Details of the achieved results of the proposed method on four public datasets are discussed in this section. Results are also compared to state-of-the-art alternatives. Furthermore, we compare the performance of the proposed model across the four datasets used.

#### *3.1. Implementation Details*

We implemented our model on Intel(R) Xeon(R) CPU E5-2699 (2 processors) with 30 GB RAM on Microsoft Windows 10 operating system and Matlab 2017 software on NVIDIA GPU. Training and test sets are set as defined in the public dataset description for all methods. Five crops of input images are generated and used. We use Stochastic Gradient Descent (SGD) with a mini-batch size of 128. The learning rate starts from 0.005 and is divided by 10 every 1000 epochs. The proposed model is trained for a total of 10,000 epochs. In addition, we use a weight decay of 1 × 10<sup>−4</sup> and a momentum of 0.92. Our model is trained from scratch with random initialization. To evaluate the noise robustness of our model, we use the Gaussian and Gaussian Blur noise with zero mean and variance equal to 0.16. The noise density parameter of the Salt-and-Pepper noise is 0.13. Details of the used parameters in the proposed method are shown in Table 1.
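The learning-rate schedule described above is a step decay, which can be written as a short function. The sketch assumes zero-based epoch indices, so the first drop happens at epoch 1000; the function and parameter names are hypothetical.

```python
def learning_rate(epoch, base_lr=0.005, drop_every=1000, factor=10.0):
    """Step decay: divide the base rate by `factor` every `drop_every` epochs."""
    return base_lr / (factor ** (epoch // drop_every))


# Learning rate at a few epochs of the 10,000-epoch training run.
for epoch in (0, 999, 1000, 5000, 9999):
    print(epoch, learning_rate(epoch))
```

With these settings the rate has been divided by 10 nine times by the final epoch, so the last part of training makes only very small updates.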


**Table 1.** Details of the parameters in the proposed method.

#### *3.2. Datasets*

The ASL Fingerspelling Dataset from the University of Surrey's Center for Vision, Speech and Signal Processing [23], Massey University Gesture Dataset 2012 [24], ASL Fingerspelling A [25], and NYU [26] datasets have been used to evaluate the proposed model. Details of these datasets are shown in Table 2. To show the effect of the background in the achieved results, we used not only the datasets without background but also the datasets including background. Figure 11 shows some samples of the ASL Fingerspelling A dataset.


**Figure 11.** Samples of the American Sign Language (ASL) Fingerspelling A dataset.

#### *3.3. Parameter Evaluation*

Changing some parameters in the proposed method led to different accuracies in the method. Suitable values for the parameters are selected after testing different values for these parameters. Figure 12 shows the effect of changing the learning rate and weight decay parameters in the proposed method. After selecting the best values of the parameters, we fixed and tested the model.

**Figure 12.** Accuracy versus Weight decay and Learning rate parameters.

Using the five crops in the training of the proposed method increases not only the size of the dataset but also the robustness of the method in coping with missing or destroyed parts of the input images. The suitable number of crops was selected by testing different values and analyzing the accuracy of the proposed method on the training data. After testing different numbers of crops, the number five was used. Figure 13 shows the best-achieved accuracy of the proposed method for different numbers of crops of the input images. As Figure 13 shows, while the accuracy of the proposed method monotonically increases as the number of crops ranges from 1 to 5, the accuracy is approximately constant for higher numbers of crops. To limit time and cost complexity, five crops were selected.

**Figure 13.** Accuracy versus number of crops of the proposed method on the Massey University Gesture Dataset 2012.

#### *3.4. Self-Comparison*

The proposed model is trained on four public datasets for hand sign recognition. We use two modalities in the input images, RGB and Depth. We used accuracy for model evaluation and comparison, defined as follows:

$$Acc = \frac{NT}{NT + NF},\tag{1}$$

with *NT* being the number of input samples correctly classified and *NF* the number of input samples misclassified. The model has a better accuracy on the Massey University Gesture Dataset 2012 than on the other datasets used for evaluation. This was predictable because this dataset includes only RGB images without background. The other datasets, ASL Fingerspelling Dataset from the University of Surrey's Center for Vision, Speech and Signal Processing, NYU, and ASL Fingerspelling A, have background in their images. Table 3 shows the results of this comparison: the recognition accuracy of the proposed model on the Massey University Gesture Dataset 2012, with RGB input images, was higher than on the other datasets.
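As a small worked illustration of Equation (1), the following sketch computes the accuracy from two label sequences. The function name `accuracy` and the sample labels are hypothetical and only serve the example; they are not taken from the evaluated datasets.

```python
def accuracy(y_true, y_pred):
    """Acc = NT / (NT + NF): fraction of samples classified correctly."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")
    nt = sum(t == p for t, p in zip(y_true, y_pred))  # NT: correctly classified
    nf = len(y_true) - nt                             # NF: misclassified
    return nt / (nt + nf)


# Hypothetical labels, for illustration only: three of four are correct.
print(accuracy(["A", "B", "C", "A"], ["A", "B", "A", "A"]))  # 0.75
```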


**Table 3.** Recognition accuracy of the proposed model on four datasets.

#### *3.5. Evaluating the Robustness to Noise of the Proposed Method*

Four test sets, TSet1, TSet2, TSet3, and TSet4, are generated to evaluate the robustness to noise of the proposed method. Table 4 compares the accuracy of the proposed method in four different states, with the details of the generated test sets being as follows:



**Table 4.** Accuracy of the proposed method on four test sets.

As Table 4 shows, the proposed model achieves higher accuracy on the Massey University Gesture Dataset 2012 than on the other datasets. This is expected: the RGB images of this dataset have no background or occlusion and a high transparency, whereas the input images of the other datasets contain complex backgrounds and occlusion.

#### *3.6. State-of-the-Art Comparison*

The proposed method is compared with state-of-the-art alternatives in hand sign recognition on four publicly available datasets. Comparison is done under the same conditions of training and testing data partitioning as in previous work, for a fair comparison. As one can observe in Table 5, the proposed model achieves the highest performance in all four datasets.

To evaluate the recognition accuracy of the proposed model for hard-to-detect characters such as *Z* and *Y*, we generate three categories from the Massey University Gesture Dataset 2012 in order to compare the proposed method with the model suggested by Garcia et al. [5]. The first category includes all 26 characters. The second category includes only 11 characters and ignores the *Z* and *Y*. Finally, the third category includes only 11 characters and ignores the *Z* and *Y*. Details of the three categories are as follows:



**Table 5.** State-of-the-art comparison.

The results of the comparison of Top-1 and Top-5 accuracies are shown in Tables 6 and 7. The proposed method significantly outperforms the Garcia and Viesca [5] model in recognition accuracy.

**Table 6.** Comparison of Top-1 accuracy of the proposed method and Garcia [5] model in three considered categories on Massey University Gesture Dataset 2012.


**Table 7.** Comparison of Top-5 accuracy of the proposed method and Garcia [5] model in three considered categories on Massey University Gesture Dataset 2012.


#### **4. Conclusions**

We proposed the use of RBM as a deep generative model for sign language recognition in multi-modal RGB-Depth data. We showed that the model generalizes well when little annotated data is available, thanks to its low number of parameters. We also showed the model to be robust against different kinds of noise present in the data, and to benefit from the fusion of RGB and Depth visual modalities. We achieved state-of-the-art results in four public sign recognition datasets. However, the model shows difficulty recognizing characters with low visual inter-class variability, such as in the case of the high similarity of hand poses for defining *Z* and *Y* characters. For future work, we plan to further reduce the complexity of the whole ensemble of RBMs by defining isolated simple RBM models that can share information in early training stages. Furthermore, we plan to extend the model to deal with image sequences and to model spatio-temporal information of sign gestures.

**Author Contributions:** This work is part of R.R.'s Ph.D.; K.K. and S.E. are the work supervisors. Conceptualization, R.R., K.K. and S.E.; Methodology, R.R.; Supervision, K.K. and S.E.; Validation, R.R.; Visualization, R.R.; Writing, R.R.; Review and editing, R.R., K.K. and S.E.

**Funding:** This research received no external funding.

**Acknowledgments:** This work has been partially supported by the Spanish project TIN2016-74946-P (MINECO/FEDER, UE), CERCA Programme/Generalitat de Catalunya, and High Intelligent Solution (HIS) company of Iran. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan XP GPU used for this research.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **References**


© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Multi-Objective Evolutionary Rule-Based Classification with Categorical Data**

**Fernando Jiménez <sup>1,\*</sup>, Carlos Martínez <sup>1</sup>, Luis Miralles-Pechuán <sup>2</sup>, Gracia Sánchez <sup>1</sup> and Guido Sciavicco <sup>3</sup>**


Received: 30 July 2018; Accepted: 6 September 2018; Published: 7 September 2018

**Abstract:** The ease of interpretation of a classification model is essential for the task of validating it. Sometimes it is required to clearly explain the classification process of a model's predictions. Models which are inherently easier to interpret can be effortlessly related to the context of the problem, and their predictions can be, if necessary, ethically and legally evaluated. In this paper, we propose a novel method to generate rule-based classifiers from categorical data that can be readily interpreted. Classifiers are generated using a multi-objective optimization approach focusing on two main objectives: maximizing the performance of the learned classifier and minimizing its number of rules. The multi-objective evolutionary algorithms *ENORA* and *NSGA-II* have been adapted to optimize the performance of the classifier based on three different machine learning metrics: accuracy, area under the *ROC* curve, and root mean square error. We have extensively compared the generated classifiers using our proposed method with classifiers generated using classical methods such as *PART*, *JRip*, *OneR* and *ZeroR*. The experiments have been conducted in full training mode, in 10-fold cross-validation mode, and in train/test splitting mode. To make results reproducible, we have used the well-known and publicly available datasets *Breast Cancer*, *Monk's Problem 2*, *Tic-Tac-Toe-Endgame*, *Car*, *kr-vs-kp* and *Nursery*. After performing an exhaustive statistical test on our results, we conclude that the proposed method is able to generate highly accurate and easy to interpret classification models.

**Keywords:** multi-objective evolutionary algorithms; rule-based classifiers; interpretable machine learning; categorical data

#### **1. Introduction**

*Supervised Learning* is the branch of *Machine Learning* (*ML*) [1] focused on modeling the behavior of systems that can be found in the environment. Supervised models are created from a set of past records, each one of which, usually, consists of an input vector labeled with an output. A supervised model is an algorithm that simulates the function that maps inputs to outputs [2]. The best models are those that predict the output of new inputs in the most accurate way. Thanks to modern computing capabilities, and to the digitization of ever-increasing quantities of data, nowadays, supervised learning techniques play a leading role in many applications. The first classification systems date back to the 1990s; in those days, researchers were focused on both precision and interpretability, and the systems to be modeled were relatively simple. Years later, when it became necessary to model more difficult behaviors, the researchers focused on developing more and more precise models, leaving aside the interpretability. *Artificial Neural Networks* (*ANN*) [3], and, more recently, *Deep Learning Neural Networks* (*DLNN*) [4], as well as *Support Vector Machines* (*SVM*) [5], and *Instance-based Learning* (*IBL*) [6] are archetypical examples of this approach. A *DLNN*, for example, is a large mesh of ordered nodes arranged in a hierarchical manner and composed of a huge number of variables. *DLNN*s are capable of modeling very complex behaviors, but it is extremely difficult to understand the logic behind their predictions, and similar considerations can be drawn for *SVM*s and *IBL*s, although the underlying principles are different. These models are known as *black-box* methods. While there are applications in which knowing the rationale behind a prediction is not necessarily relevant (e.g., predicting a currency's future value, whether or not a user clicks on an advert or the amount of rain in a certain area), there are other situations where the interpretability of a model plays a key role.

The *interpretability* of classification systems refers to the ability they have to explain their behavior in a way that is easily understandable by a user [7]. In other words, a model is considered interpretable when a human is able to understand the logic behind its prediction. In this way, interpretable classification models allow external validation by an expert. Additionally, there are certain disciplines such as medicine, where it is essential to provide information about decision making for ethical and human reasons. Likewise, when a public institution asks an authority for permission to investigate an alleged offender, or when the CEO of a certain company wants to take a difficult decision which can seriously change the direction of the company, some kind of explanation to justify these decisions may be required. In these situations, using transparent (also called grey-box) models is recommended. While there is a general consensus on how the performance of a classification system is measured (popular metrics include *accuracy*, *area under the ROC curve*, and *root mean square error*), there is no universally accepted metric to measure the interpretability of the models. Nor is there an ideal balance between the interpretability and performance of classification systems; rather, this depends on the specific application domain. However, the rule of thumb says that the simpler a classification system is, the easier it is to interpret. *Rule-based Classifiers* (*RBC*) [8,9] are among the most popular interpretable models, and some authors define the degree of interpretability of an *RBC* as the number of its rules or as the number of axioms that the rules have. These metrics tend to reward models with fewer rules that are as simple as possible [10,11]. In general, *RBC*s are classification learning systems that achieve a high level of interpretability because they are based on a human-like logic. Rules follow a very simple schema:

*IF (Condition 1) and (Condition 2) and* ... *(Condition N) THEN (Statement)*

and the fewer rules the models have and the fewer conditions and attributes the rules have, the easier it will be for a human to understand the logic behind each classification. In fact, *RBC*s are so natural in some applications that they are used to interpret other classification models such as *Decision Trees* (*DT*) [12]. *RBC*s constitute the basis of more complex classification systems based on fuzzy logic [13] such as *LogitBoost* or *AdaBoost* [14].

Our approach investigates the conflict between accuracy and interpretability as a *multi-objective optimization problem*. We define a solution as a set of rules (that is, a classifier), and establish two objectives to be maximized: interpretability and accuracy. We decided to solve this problem by applying *multi-objective evolutionary algorithms* (*MOEA*) [15,16] as meta-heuristics, and, in particular, two known algorithms: *NSGA-II* [15] and *ENORA* [17]. They are both state-of-the-art evolutionary algorithms which have been applied, and compared, on several occasions [18–20]. *NSGA-II* is very well-known and has the advantage of being available in many implementations, while *ENORA* generally has a higher performance. In the current literature, *MOEA*s are mainly used for learning *RBC*s based on fuzzy logic [18,21–26]. However, *Fuzzy RBC*s are designed for numerical data, from which fuzzy sets are constructed and represented by linguistic labels. In this paper, on the contrary, we are interested in *RBC*s for categorical data, for which a novel approach is necessary.

This paper is organized as follows. In Section 2, we introduce multi-objective constrained optimization, the evolutionary algorithms *ENORA* and *NSGA-II*, and the well-known rule-based classifier learning systems *PART*, *JRip*, *OneR* and *ZeroR*. In Section 3, we describe the structure of an *RBC* for categorical data, and we propose the use of multi-objective optimization for the task of learning a classifier. In Section 4, we show the result of our experiments, performed on the well-known publicly accessible datasets *Breast Cancer*, *Monk's Problem 2*, *Tic-Tac-Toe-Endgame*, *Car*, *kr-vs-kp* and *Nursery*. The experiments allow a comparison among the performance of the classifiers learned by our technique against those of classifiers learned by *PART*, *JRip*, *OneR* and *ZeroR*, as well as a comparison between *ENORA* and *NSGA-II* for the purposes of this task. In Section 5, the results are analyzed and discussed, before concluding in Section 6. Appendices A and B show the tables of the statistical tests results. Appendix C shows the symbols and the nomenclature used in the paper.

#### **2. Background**

#### *2.1. Multi-Objective Constrained Optimization*

The term *optimization* [27] refers to the selection of the best element, with regard to some criteria, from a set of alternative elements. *Mathematical programming* [28] deals with the theory, algorithms, methods and techniques to represent and solve optimization problems. In this paper, we are interested in a class of mathematical programming problems called *multi-objective constrained optimization problems* [29], which can be formally defined, for *l* objectives and *m* constraints, as follows:

$$\begin{array}{lll}\text{Min./Max.} & f_i(\mathbf{x}), & i = 1, \dots, l \\ \text{subject to} & g_j(\mathbf{x}) \le 0, & j = 1, \dots, m \end{array} \tag{1}$$

where $f_i(\mathbf{x})$ (usually called *objectives*) and $g_j(\mathbf{x})$ are arbitrary functions. Optimization problems can be naturally separated into two categories: those with discrete variables, which we call *combinatorial*, and those with continuous variables. In combinatorial problems, we are looking for objects from a finite, or countably infinite, set $\mathcal{X}$, where objects are typically integers, sets, permutations, or graphs. In problems with continuous variables, instead, we look for real parameters belonging to some continuous domain. In Equation (1), $\mathbf{x} = \{x_1, x_2, \dots, x_w\} \in \mathcal{X}^w$ represents the set of decision variables, where $\mathcal{X}$ is the domain for each variable $x_k$, $k = 1, \dots, w$.

Now, let $\mathcal{F} = \{\mathbf{x} \in \mathcal{X}^w \mid g_j(\mathbf{x}) \le 0,\ j = 1, \dots, m\}$ be the set of all feasible solutions to Equation (1). We want to find a subset of solutions $\mathcal{S} \subseteq \mathcal{F}$ called *non-dominated set* (or *Pareto optimal set*). A solution $\mathbf{x} \in \mathcal{F}$ is *non-dominated* if there is no other solution $\mathbf{x}' \in \mathcal{F}$ that dominates $\mathbf{x}$, and a solution $\mathbf{x}'$ *dominates* $\mathbf{x}$ if and only if there exists $i$ ($1 \le i \le l$) such that $f_i(\mathbf{x}')$ improves $f_i(\mathbf{x})$, and for every $i$ ($1 \le i \le l$), $f_i(\mathbf{x})$ does not improve $f_i(\mathbf{x}')$. In other words, $\mathbf{x}'$ *dominates* $\mathbf{x}$ if and only if $\mathbf{x}'$ is better than $\mathbf{x}$ for at least one objective, and not worse than $\mathbf{x}$ for any other objective. The set $\mathcal{S}$ of non-dominated solutions of Equation (1) can be formally defined as:

$$\mathcal{S} = \{ \mathbf{x} \in \mathcal{F} \mid \nexists\, \mathbf{x}' (\mathbf{x}' \in \mathcal{F} \land \mathcal{D}(\mathbf{x}', \mathbf{x})) \}$$

where (written here for objectives to be minimized):

$$\mathcal{D}(\mathbf{x}', \mathbf{x}) = \exists i\, (1 \le i \le l,\ f_i(\mathbf{x}') < f_i(\mathbf{x})) \land \forall i\, (1 \le i \le l,\ f_i(\mathbf{x}') \le f_i(\mathbf{x})).$$
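The dominance relation and the non-dominated set defined above can be illustrated with a minimal Python sketch. It assumes that every objective is minimized and that solutions are already represented by their objective vectors; the function names and the sample points are hypothetical.

```python
from typing import List, Sequence


def dominates(fa: Sequence[float], fb: Sequence[float]) -> bool:
    """True if objective vector fa dominates fb (all objectives minimized)."""
    not_worse = all(a <= b for a, b in zip(fa, fb))       # no objective is worse
    strictly_better = any(a < b for a, b in zip(fa, fb))  # at least one is better
    return not_worse and strictly_better


def non_dominated(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Return the points that no other point dominates (the set S)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical objective vectors (f1, f2), for illustration only.
points = [(1, 5), (2, 3), (3, 4), (4, 1), (4, 4)]
print(non_dominated(points))  # [(1, 5), (2, 3), (4, 1)]
```

Here (3, 4) and (4, 4) are dropped because (2, 3) is at least as good in both objectives and strictly better in one.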

Once the set of optimal solutions is available, the most satisfactory one can be chosen by applying a preference criterion. When all the functions $f_i$ are linear, then the problem is a *linear programming problem* [30], which is the classical mathematical programming problem and for which extremely efficient algorithms to obtain the optimal solution exist (e.g., the *simplex method* [31]). When any of the functions $f_i$ is non-linear then we have a *non-linear programming problem* [32]. A non-linear programming problem in which the objectives are arbitrary functions is, in general, intractable. In principle, any search algorithm can be used to solve combinatorial optimization problems, although it is not guaranteed that it will find an optimal solution. *Metaheuristic* methods such as *evolutionary algorithms* [33] are typically used to find approximate solutions for complex multi-objective optimization problems, including feature selection and fuzzy classification.

#### *2.2. The Multi-Objective Evolutionary Algorithms ENORA and NSGA-II*

The *MOEA ENORA* [17] and *NSGA-II* [15] use a (*μ* + *λ*) strategy (Algorithm 1) with *μ* = *λ* = *popsize*, where *μ* corresponds to the number of parents and *λ* refers to the number of children (*popsize* is the population size), with *binary tournament selection* (Algorithm 2) and a rank function based on Pareto fronts and *crowding* (Algorithms 3 and 4). The difference between *NSGA-II* and *ENORA* is how the ranking of the individuals in the population is calculated. In *ENORA*, each individual belongs to a slot (as established in [34]) of the objective search space, and the rank of an individual in a population is the non-domination level of the individual in its slot. On the other hand, in *NSGA-II*, the rank of an individual in a population is the non-domination level of the individual in the whole population. Both *ENORA* and *NSGA-II* MOEAs use the same non-dominated sorting algorithm, the *fast non-dominated sorting* [35]. It compares each solution with the rest of the solutions and stores the results so as to avoid duplicate comparisons between every pair of solutions. For a problem with $l$ objectives and a population with $N$ solutions, this method needs to conduct $l \cdot N \cdot (N - 1)$ objective comparisons, which means that it has a time complexity of $O(l \cdot N^2)$ [36]. However, *ENORA* distributes the population in $N$ slots (in the best case); therefore, the time complexity of *ENORA* is $O(l \cdot N^2)$ in the worst case and $O(l \cdot N)$ in the best case.
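To make the notion of non-domination level concrete, the following Python sketch ranks a whole population in the style described for *NSGA-II*, comparing every pair of solutions once and storing the result. It is a simplified illustration that assumes minimized objectives; it does not reproduce the slot-based ranking of *ENORA*, and all names and data are hypothetical.

```python
def dominates(fa, fb):
    """True if fa dominates fb (all objectives minimized)."""
    return all(a <= b for a, b in zip(fa, fb)) and any(a < b for a, b in zip(fa, fb))


def fast_non_dominated_sort(objs):
    """Split solution indices into fronts; front 0 holds the non-dominated ones."""
    n = len(objs)
    dominated = [[] for _ in range(n)]  # dominated[p]: indices that p dominates
    counts = [0] * n                    # counts[p]: how many solutions dominate p
    for p in range(n):
        for q in range(p + 1, n):       # each pair is compared only once
            if dominates(objs[p], objs[q]):
                dominated[p].append(q)
                counts[q] += 1
            elif dominates(objs[q], objs[p]):
                dominated[q].append(p)
                counts[p] += 1
    fronts = [[p for p in range(n) if counts[p] == 0]]
    while fronts[-1]:
        next_front = []
        for p in fronts[-1]:
            for q in dominated[p]:
                counts[q] -= 1          # p is now ranked, so it stops counting against q
                if counts[q] == 0:
                    next_front.append(q)
        fronts.append(next_front)
    return fronts[:-1]                  # drop the trailing empty front


# Hypothetical objective vectors, for illustration only.
objs = [(1, 5), (2, 3), (3, 4), (4, 1), (4, 4)]
print(fast_non_dominated_sort(objs))  # [[0, 1, 3], [2], [4]]
```

The rank of an individual is the index of the front that contains it, so lower ranks are better.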

**Algorithm 1** (*μ* + *λ*) strategy for multi-objective optimization.

```
Require: T > 1 {Number of generations}
Require: N > 1 {Number of individuals in the population}
 1: Initialize P with N individuals
 2: Evaluate all individuals of P
 3: t ← 0
 4: while t < T do
 5:   Q ← ∅
 6:   i ← 0
 7:   while i < N do
 8:     Parent1 ← Binary tournament selection from P
 9:     Parent2 ← Binary tournament selection from P
10:     Child1, Child2 ← Crossover(Parent1, Parent2)
11:     Offspring1 ← Mutation(Child1)
12:     Offspring2 ← Mutation(Child2)
13:     Evaluate Offspring1
14:     Evaluate Offspring2
15:     Q ← Q ∪ {Offspring1, Offspring2}
16:     i ← i + 2
17:   end while
18:   R ← P ∪ Q
19:   P ← N best individuals from R according to the rank-crowding function in population R
20:   t ← t + 1
21: end while
22: return Non-dominated individuals from P
```

**Algorithm 2** Binary tournament selection.

```
Require: P {Population}
 1: I ← Random selection from P
 2: J ← Random selection from P
 3: if I is better than J according to the rank-crowding function in population P then
 4:   return I
 5: else
 6:   return J
 7: end if
```
**Algorithm 3** Rank-crowding function.

```
Require: P {Population}
Require: I, J {Individuals to compare}
 1: if rank(P, I) < rank(P, J) then
 2:   return True
 3: end if
 4: if rank(P, J) < rank(P, I) then
 5:   return False
 6: end if
 7: return Crowding_distance(P, I) > Crowding_distance(P, J)
```
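
Algorithms 2 and 3 can be sketched together in a few lines of Python. The sketch assumes that the rank and the crowding distance of every individual have already been computed and are passed in as lists indexed like the population; these argument names are hypothetical and not part of the original pseudocode.

```python
import random


def rank_crowding_better(rank, crowding, i, j):
    """Algorithm 3: i beats j on lower rank; ties go to the larger crowding distance."""
    if rank[i] < rank[j]:
        return True
    if rank[j] < rank[i]:
        return False
    return crowding[i] > crowding[j]


def binary_tournament(population, rank, crowding, rng=random):
    """Algorithm 2: draw two individuals at random and return the better one."""
    i = rng.randrange(len(population))
    j = rng.randrange(len(population))  # drawn independently, so i == j is possible
    return population[i] if rank_crowding_better(rank, crowding, i, j) else population[j]


# Hypothetical precomputed values, for illustration only.
population = ["A", "B", "C"]
rank = [0, 0, 1]
crowding = [float("inf"), 0.5, float("inf")]
winner = binary_tournament(population, rank, crowding, random.Random(0))
```

Because only the rank function differs between *ENORA* and *NSGA-II*, the same comparison can be reused with either ranking by changing how `rank` is filled.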
The main reason *ENORA* and *NSGA-II* behave differently is as follows. *NSGA-II* never selects the individual dominated by the other in the binary tournament, while, in *ENORA*, the individual dominated by the other may be the winner of the tournament. Figure 1 shows this behavior graphically. For example, if individuals *B* and *C* are selected for a binary tournament with *NSGA-II*, individual *B* beats *C* because *B* dominates *C*. Conversely, individual *C* beats *B* with *ENORA* because individual *C* has a better rank in its slot than individual *B*. In this way, *ENORA* allows the individuals in each slot to evolve towards the Pareto front, encouraging diversity. Even though in *ENORA* the individuals of each slot may not be the best of the total individuals, this approach generates a better hypervolume than that of *NSGA-II* throughout the evolution process.

*ENORA* is our *MOEA*, on which we have been working intensively over the last decade. We have applied *ENORA* to constrained real-parameter optimization [17], fuzzy optimization [37], fuzzy classification [18], feature selection for classification [19] and feature selection for regression [34]. In this paper, we apply it to rule-based classification. The *NSGA-II* algorithm was designed by Deb et al. and has proved to be a very powerful and fast algorithm in multi-objective optimization contexts of all kinds. Most researchers in multi-objective evolutionary computation use *NSGA-II* as a baseline to compare the performance of their own algorithms. Although *NSGA-II* was developed in 2002, it remains a state-of-the-art algorithm, and improving on it is still a challenge. There is a recently updated improved version for *many-objective optimization* problems called *NSGA-III* [38].

**Algorithm 4** Crowding\_distance function.

**Require:** $P$ {Population}

**Require:** $I$ {Individual}

**Require:** $l$ {Number of objectives}

1: **for** $j = 1$ to $l$ **do**

2: $f_j^{max} \leftarrow \max_{I \in P} \{ f_j^I \}$

3: $f_j^{min} \leftarrow \min_{I \in P} \{ f_j^I \}$

4: $f_j^{sup_j^I} \leftarrow$ value of the $j$th objective for the individual higher adjacent in the $j$th objective to the individual $I$

5: $f_j^{inf_j^I} \leftarrow$ value of the $j$th objective for the individual lower adjacent in the $j$th objective to the individual $I$

6: **end for**

7: **for** $j = 1$ to $l$ **do**

8: **if** $f_j^I = f_j^{max}$ or $f_j^I = f_j^{min}$ **then**

9: **return** $\infty$

10: **end if**

11: **end for**

12: $CD \leftarrow 0.0$

13: **for** $j = 1$ to $l$ **do**

14: $CD \leftarrow CD + \dfrac{f_j^{sup_j^I} - f_j^{inf_j^I}}{f_j^{max} - f_j^{min}}$

15: **end for**

16: **return** $CD$
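The steps of Algorithm 4 can be traced with a short Python sketch. It assumes that individuals are represented only by their objective vectors and that the distance is computed over the list passed in; the function name, argument names and sample data are hypothetical.

```python
import math


def crowding_distance(objs, idx):
    """Crowding distance of individual objs[idx] within the population objs."""
    n_obj = len(objs[idx])
    cols = [[o[j] for o in objs] for j in range(n_obj)]  # one column per objective
    # Individuals at the extreme of any objective get an infinite distance.
    if any(objs[idx][j] in (max(col), min(col)) for j, col in enumerate(cols)):
        return math.inf
    cd = 0.0
    for col in cols:
        order = sorted(range(len(objs)), key=lambda k: col[k])
        pos = order.index(idx)
        upper = col[order[pos + 1]]  # higher adjacent individual in this objective
        lower = col[order[pos - 1]]  # lower adjacent individual in this objective
        cd += (upper - lower) / (max(col) - min(col))
    return cd


# Hypothetical objective vectors, for illustration only.
front = [(1, 5), (2, 3), (4, 1)]
print(crowding_distance(front, 1))  # 2.0
print(crowding_distance(front, 0))  # inf
```

The division is safe because an individual that reaches this point lies strictly between the minimum and the maximum of every objective, so the denominator is never zero.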

**Figure 1.** Rank assignment of individuals with *ENORA* vs. *NSGA-II*.

#### *2.3. PART*

*PART* (*Partial DT Method* [39]) is a widely used rule learning algorithm that was developed at the University of Waikato in New Zealand [40]. Experiments show that it is a very efficient algorithm in terms of both computational performance and results. *PART* combines the divide-and-conquer strategy typical of decision tree learning with the separate-and-conquer strategy [41] typical of rule learning, as follows. A decision tree is first constructed (using *C4.5* algorithm [42]), and the leaf with the highest coverage is converted into a rule. Then, the set of instances that are covered by that rule are discarded, and the process starts over. The result is an ordered set of rules, completed by a *default* rule that applies to instances that do not meet any previous rule.

#### *2.4. JRip*

*JRip* is a fast and optimized implementation in *Weka* of the famous *RIPPER* (*Repeated Incremental Pruning to Produce Error Reduction*) algorithm [43]. *RIPPER* was proposed in [44] as a more efficient version of the incrementally reduced error pruning (*IREP*) rule learner developed in [45]. *IREP* and *RIPPER* work in a similar manner. They begin with a default rule and, using a training dataset, attempt to learn rules that predict exceptions to the default. Each rule learned is a conjunction of propositional literals. Each literal corresponds to a split of the data based on the value of a single feature. This family of algorithms, similar to decision trees, has the advantage of being easy to interpret, and experiments show that *JRip* is particularly efficient in large datasets. *RIPPER* and *IREP* use a strategy based on the separate-and-conquer method to generate an ordered set of rules that are extracted directly from the dataset. The classes are examined one by one, prioritizing those that have more elements. These algorithms are based on four basic steps (growing, pruning, optimizing and selecting) applied repetitively to each class until a stopping condition is met [44]. These steps can be summarized as follows. In the growing phase, rules are created taking into account an increasing number of predictors until the stopping criterion is satisfied (in the *Weka* implementation, the procedure selects the condition with the highest information gain). In the pruning phase redundancy is eliminated and long rules are reduced. In the optimization phase, the rules generated in the previous steps are improved (if possible) by adding new attributes or by adding new rules. Finally, in the selection phase, the best rules are selected and the others discarded.

#### *2.5. OneR*

*OneR* (*One Rule*) is a very simple, while reasonably accurate, classifier based on a frequency table. First, *OneR* generates a set of rules for each attribute of the dataset, and, then, it selects only one rule from that set—the one with the lowest error rate [46]. The set of rules is created using a frequency table constructed for each predictor of the class, and numerical classes are converted into categorical values.
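The idea behind *OneR* can be sketched in Python as follows. This is a simplified illustration for categorical attributes only, not the *Weka* implementation; the function name, the returned tuple and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def one_r(rows, labels):
    """Return (attribute index, {value: class}, default class) with the lowest training error."""
    default = Counter(labels).most_common(1)[0][0]  # fallback for unseen values
    best = None
    for j in range(len(rows[0])):
        table = defaultdict(Counter)                # frequency table: value -> class counts
        for row, y in zip(rows, labels):
            table[row[j]][y] += 1
        rule = {value: counts.most_common(1)[0][0] for value, counts in table.items()}
        errors = sum(rule[row[j]] != y for row, y in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, j, rule)
    _, attribute, rule = best
    return attribute, rule, default


# Hypothetical toy data with two categorical attributes, for illustration only.
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "hot"), ("rainy", "mild")]
labels = ["no", "no", "yes", "yes"]
attribute, rule, default = one_r(rows, labels)
print(attribute, rule)  # 0 {'sunny': 'no', 'rainy': 'yes'}
```

A new example `x` would then be classified as `rule.get(x[attribute], default)`.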

#### *2.6. ZeroR*

Finally, *ZeroR* (*Zero Rules* [40]) is a classifier learner that does not create any rules and uses no attributes. *ZeroR* simply creates the class classification table by selecting the most frequent value. Such a classifier is obviously the simplest possible one, and its capabilities are limited to the prediction of the majority class. In the literature, it is not used for practical classification tasks, but as a generic reference to measure the performance of other classifiers.
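As a minimal sketch of this baseline, the following Python snippet builds a majority-class predictor. The function name and the sample labels are hypothetical, and ties between equally frequent classes are resolved by whichever class appears first, which is an assumption of this sketch.

```python
from collections import Counter


def zero_r(labels):
    """Return a predictor that ignores its input and answers the majority class."""
    majority = Counter(labels).most_common(1)[0][0]
    return lambda example: majority


# Hypothetical training labels, for illustration only.
predict = zero_r(["yes", "no", "yes", "yes"])
print(predict(("any", "attributes")))  # yes
```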

#### **3. Multi-Objective Optimization for Categorical Rule-Based Classification**

In this section, we propose a general schema for an *RBC* specifically designed for categorical data. Then, we propose and describe a multi-objective optimization solution to obtain optimal categorical *RBC*s.

#### *3.1. Rule-Based Classification for Categorical Data*

Let $\Gamma$ be a classifier composed of $M$ rules, where each rule $R_i^\Gamma$, $i = 1, \dots, M$, has the following structure:

$$R_i^\Gamma:\quad \text{IF} \quad x_1 = b_{i1}^\Gamma \quad \text{AND} \quad \dots \quad \text{AND} \quad x_p = b_{ip}^\Gamma \quad \text{THEN} \quad y = c_i^\Gamma \tag{2}$$

where for $j = 1, \dots, p$ the attribute $b_{ij}^\Gamma$ (called *antecedent*) takes values in a set $\{1, \dots, v_j\}$ ($v_j > 1$), and $c_i^\Gamma$ (called *consequent*) takes values in $\{1, \dots, w\}$ ($w > 1$). Now, let $\mathbf{x} = \{x_1, \dots, x_p\}$ be an observed example, with $x_j \in \{1, \dots, v_j\}$, for each $j = 1, \dots, p$. We propose *maximum matching* as *reasoning method*, where the *compatibility degree* of the rule $R_i^\Gamma$ for the example $\mathbf{x}$ (denoted by $\varphi_i^\Gamma(\mathbf{x})$) is calculated as the number of attributes whose value coincides with that of the corresponding antecedent in $R_i^\Gamma$, that is

$$\varphi_i^\Gamma(\mathbf{x}) = \sum_{j=1}^{p} \mu_{ij}^\Gamma(\mathbf{x}),$$

where:

$$\mu_{ij}^\Gamma(\mathbf{x}) = \begin{cases} 1 & \text{if } x_j = b_{ij}^\Gamma \\ 0 & \text{if } x_j \neq b_{ij}^\Gamma \end{cases}$$

The *association degree* for the example $\mathbf{x}$ with a class $c \in \{1, \dots, w\}$ is computed by adding the compatibility degrees for the example $\mathbf{x}$ of each rule $R_i^\Gamma$ whose consequent $c_i^\Gamma$ is equal to class $c$, that is:

$$\lambda_c^\Gamma(\mathbf{x}) = \sum_{i=1}^{M} \eta_{ic}^\Gamma(\mathbf{x}),$$

where:

$$\eta_{ic}^\Gamma(\mathbf{x}) = \begin{cases} \varphi_i^\Gamma(\mathbf{x}) & \text{if } c = c_i^\Gamma \\ 0 & \text{if } c \neq c_i^\Gamma \end{cases}$$

Therefore, the *classification* (or output) of the classifier Γ for the example **x** corresponds to the class whose association degree is maximum, that is:

$$f^{\Gamma}(\mathbf{x}) = \arg\max_{c \in \{1, \dots, w\}} \lambda_c^{\Gamma}(\mathbf{x})$$
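The maximum matching reasoning method defined in this section can be sketched in Python as follows. A rule is represented here as a pair of antecedent values and a consequent class, which is an assumed encoding; the function names and the sample rules are hypothetical. The text does not state how ties between classes are broken, so this sketch simply returns the lowest class index among the tied ones.

```python
from typing import List, Sequence, Tuple

Rule = Tuple[Sequence[int], int]  # (antecedents b_i1..b_ip, consequent c_i)


def compatibility(antecedents: Sequence[int], x: Sequence[int]) -> int:
    """Compatibility degree: number of attributes of x that match the antecedents."""
    return sum(1 for b, xj in zip(antecedents, x) if b == xj)


def association(rules: List[Rule], x: Sequence[int], c: int) -> int:
    """Association degree: summed compatibility of the rules whose consequent is c."""
    return sum(compatibility(b, x) for b, ci in rules if ci == c)


def classify(rules: List[Rule], x: Sequence[int], n_classes: int) -> int:
    """Class in 1..n_classes with maximum association degree (first one on ties)."""
    return max(range(1, n_classes + 1), key=lambda c: association(rules, x, c))


# Hypothetical classifier with three rules over p = 3 attributes and w = 2 classes.
rules = [((1, 2, 1), 1), ((2, 2, 3), 2), ((1, 1, 1), 1)]
x = (1, 2, 3)
print([compatibility(b, x) for b, _ in rules])  # [2, 2, 1]
print(classify(rules, x, 2))                    # 1
```

In this example class 1 collects an association degree of 2 + 1 = 3 from its two rules, against 2 for class 2, so the example is assigned to class 1.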

#### *3.2. A Multi-Objective Optimization Solution*

Let $\mathcal{D}$ be a dataset of $K$ instances with $p$ categorical input attributes, $p > 0$, and a categorical output attribute. Each input attribute $j$ can take a category $x_j \in \{1, \dots, v_j\}$, $v_j > 1$, $j = 1, \dots, p$, and the output attribute can take a class $c \in \{1, \dots, w\}$, $w > 1$. The problem of finding an optimal classifier $\Gamma$, as described in the previous section, can be formulated as an instance of the multi-objective constrained problem in Equation (1) with two objectives and two constraints:

$$\begin{array}{ll}\text{Max./Min.} & \mathcal{F}_{\mathcal{D}}(\Gamma) \\ \text{Min.} & \mathcal{N}\mathcal{R}(\Gamma) \\ \text{subject to} & \mathcal{N}\mathcal{R}(\Gamma) \ge w \\ & \mathcal{N}\mathcal{R}(\Gamma) \le M_{max} \end{array} \tag{3}$$

In the problem (Equation (3)), the function $\mathcal{F}_{\mathcal{D}}(\Gamma)$ is a performance measure of the classifier $\Gamma$ over the dataset $\mathcal{D}$, the function $\mathcal{N}\mathcal{R}(\Gamma)$ is the number of rules of the classifier $\Gamma$, and the constraints $\mathcal{N}\mathcal{R}(\Gamma) \ge w$ and $\mathcal{N}\mathcal{R}(\Gamma) \le M_{max}$ limit the number of rules of the classifier $\Gamma$ to the interval $[w, M_{max}]$, where $w$ is the number of classes of the output attribute and $M_{max}$ is given by a user. Objectives $\mathcal{F}_{\mathcal{D}}(\Gamma)$ and $\mathcal{N}\mathcal{R}(\Gamma)$ are in conflict. The fewer rules the classifier has, the fewer instances it can cover, that is, if the classifier is simpler it will have less capacity for prediction. There is, therefore, an intrinsic conflict between problem objectives (e.g., maximize accuracy and minimize model complexity) which cannot be easily aggregated to a single objective. Both objectives are typically optimized simultaneously in many other classification systems, such as neural networks or decision trees [47,48]. Figure 2 shows the Pareto front of a dummy binary classification problem described as in Equation (3), with $M_{max} = 6$ rules, where $\mathcal{F}_{\mathcal{D}}(\Gamma)$ is maximized. This front is composed of three non-dominated solutions (three possible classifiers) with two, three and four rules, respectively. The solutions with five and six rules are dominated (both by the solution with four rules).

**Figure 2.** A Pareto front of a binary classification problem as formulated in Equation (3) where FD(Γ) is maximized and N R(Γ) is minimized.
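The dominance relation behind such a front can be illustrated with a short sketch. The performance values below are hypothetical and are chosen only so that the resulting front matches the description given for the dummy problem; they are not taken from Figure 2.

```python
from typing import List, Tuple

# A candidate classifier summarized as (number of rules, performance).
# Performance is maximized and the number of rules is minimized.
Solution = Tuple[int, float]


def dominates(a: Solution, b: Solution) -> bool:
    """Return True if a is no worse than b on both objectives and better on one."""
    rules_a, perf_a = a
    rules_b, perf_b = b
    no_worse = rules_a <= rules_b and perf_a >= perf_b
    strictly_better = rules_a < rules_b or perf_a > perf_b
    return no_worse and strictly_better


def pareto_front(solutions: List[Solution]) -> List[Solution]:
    """Keep the solutions that no other solution dominates."""
    return [s for s in solutions if not any(dominates(o, s) for o in solutions)]


# Hypothetical performance values for classifiers with two to six rules.
candidates = [(2, 0.70), (3, 0.78), (4, 0.85), (5, 0.84), (6, 0.85)]
front = pareto_front(candidates)
print(front)  # the two-, three- and four-rule classifiers remain
```

With these values, the five-rule and six-rule classifiers are both dominated by the four-rule one, which needs fewer rules and performs at least as well.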

Both *ENORA* and *NSGA-II* have been adapted to solve the problem described in Equation (3) with *variable-length representation* based on a *Pittsburgh approach*, *uniform random initialization*, *binary tournament selection*, *handling constraints*, ranking based on *non-domination level* with *crowding distance*, and *self-adaptive variation operators*. *Self-adaptive variation operators* work on different levels of the classifier: *rule crossover*, *rule incremental crossover*, *rule incremental mutation*, and *integer mutation*.

#### 3.2.1. Representation

We use a variable-length representation based on a Pittsburgh approach [49], where each individual *I* of a population contains a variable number of rules *MI*, and each rule *R<sup>I</sup><sub>i</sub>*, *i* = 1, ..., *MI*, is codified in the following components:


Additionally, to carry out self-adaptive crossing and mutation, each individual has two discrete parameters *dI* ∈ {0, ..., *δ*} and *eI* ∈ {0, ..., *ε*} associated with crossing and mutation, where *δ* ≥ 0 is the number of crossing operators and *ε* ≥ 0 is the number of mutation operators. Values *dI* and *eI* for self-adaptive variation are randomly generated from {0, ..., *δ*} and {0, ..., *ε*}, respectively. Table 1 summarizes the representation of an individual.


**Table 1.** Chromosome coding for an individual *I*.
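As a reading aid, the chromosome can be sketched as a small data structure. The field names (`antecedents`, `consequent`, `d`, `e`) and the helper `is_valid` are hypothetical; the components are inferred from the quantities used in Algorithm 5 (integer antecedents *b<sup>I</sup><sub>ij</sub>*, an integer consequent *c<sup>I</sup><sub>i</sub>*, and the selectors *dI* and *eI*).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rule:
    antecedents: List[int]  # b_ij: category chosen for input attribute j, in 1..v_j
    consequent: int         # c_i: predicted class, in 1..w


@dataclass
class Individual:
    rules: List[Rule]  # variable-length list with M_I rules
    d: int             # crossover selector d_I in 0..delta (0 = no crossover)
    e: int             # mutation selector e_I in 0..epsilon (0 = no mutation)


def is_valid(ind: Individual, v: List[int], w: int, m_max: int,
             delta: int, epsilon: int) -> bool:
    """Check the domains of every gene and the rule-count constraints."""
    if not (w <= len(ind.rules) <= m_max):
        return False
    if not (0 <= ind.d <= delta and 0 <= ind.e <= epsilon):
        return False
    for rule in ind.rules:
        if len(rule.antecedents) != len(v) or not (1 <= rule.consequent <= w):
            return False
        if any(not (1 <= b <= vj) for b, vj in zip(rule.antecedents, v)):
            return False
    return True


# Hypothetical example: p = 2 attributes with 3 and 2 categories, w = 2 classes.
example = Individual(rules=[Rule([1, 2], 1), Rule([3, 1], 2)], d=1, e=2)
print(is_valid(example, v=[3, 2], w=2, m_max=6, delta=2, epsilon=2))
```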

#### 3.2.2. Constraint Handling

The constraints N R(**Γ**) ≥ *w* and N R(**Γ**) ≤ *Mmax* are satisfied by means of specialized initialization and variation operators, which always generate individuals with a number of rules between *w* and *Mmax*.

#### 3.2.3. Initial Population

The initial population (Algorithm 5) is randomly generated with the following conditions:


**Algorithm 5** Initialize population.

**Require:** *p* > 0 {Number of categorical input attributes}

**Require:** *v*1, ..., *vp*, *vj* > 1, *j* = 1, ..., *p* {Number of categories for the input attributes}

**Require:** *w* > 1 {Number of classes for the output attribute}

**Require:** *δ* > 0 {Number of crossing operators}

**Require:** *ε* > 0 {Number of mutation operators}

**Require:** *Mmax* ≥ *w* {Maximum number of rules}

**Require:** *N* > 1 {Number of individuals in the population}

1: *P* ← ∅

2: **for** *k* = 1 to *N* **do**

3: *I* ← new Individual

4: **if** *k* ≤ *Mmax* − *w* + 1 **then**

5: *MI* ← *k* + *w* − 1

6: **else**

7: *MI* ← *Random*(*w*, *Mmax*)

8: **end if**

9: {Random rule *R<sup>I</sup><sub>i</sub>*}

10: **for** *i* = 1 to *MI* **do**

11: {Random integer values associated with the antecedents}

12: **for** *j* = 1 to *p* **do**

13: *b<sup>I</sup><sub>ij</sub>* ← *Random*(1, *vj*)

14: **end for**

15: {Integer value associated with the consequent: the first *w* rules cover the classes 1, ..., *w*; the rest are random}

16: **if** *i* ≤ *w* **then**

17: *c<sup>I</sup><sub>i</sub>* ← *i*

18: **else**

19: *c<sup>I</sup><sub>i</sub>* ← *Random*(1, *w*)

20: **end if**

21: **end for**

22: {Random integer values for adaptive variation}

23: *dI* ← *Random*(0, *δ*)

24: *eI* ← *Random*(0, *ε*)

25: *P* ← *P* ∪ {*I*}

26: **end for**

27: **return** *P*
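The initialization can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: individuals are plain dictionaries with hypothetical keys (`rules`, `b`, `c`, `d`, `e`), and the sketch assumes that the first *w* rules of each individual are assigned the classes 1, ..., *w* in order, so that every class has at least one rule.

```python
import random
from typing import Dict, List


def init_population(v: List[int], w: int, delta: int, epsilon: int,
                    m_max: int, n: int, rng: random.Random) -> List[Dict]:
    """Build n random individuals with between w and m_max rules each."""
    population = []
    for k in range(1, n + 1):
        # The first m_max - w + 1 individuals get w, w + 1, ..., m_max rules,
        # so every admissible number of rules appears at least once.
        if k <= m_max - w + 1:
            m = k + w - 1
        else:
            m = rng.randint(w, m_max)
        rules = []
        for i in range(1, m + 1):
            antecedents = [rng.randint(1, vj) for vj in v]
            # Assumption: the first w rules take the classes 1..w in order.
            consequent = i if i <= w else rng.randint(1, w)
            rules.append({"b": antecedents, "c": consequent})
        population.append({"rules": rules,
                           "d": rng.randint(0, delta),
                           "e": rng.randint(0, epsilon)})
    return population


# Hypothetical setting: six attributes, binary class, at most 12 rules.
population = init_population(v=[3, 3, 2, 3, 4, 2], w=2, delta=2, epsilon=2,
                             m_max=12, n=50, rng=random.Random(0))
print(sorted({len(ind["rules"]) for ind in population}))
```

Because the number of rules is fixed deterministically for the first *Mmax* − *w* + 1 individuals, the initial population spans the whole range of the second objective regardless of the random seed.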

#### 3.2.4. Fitness Functions

Since the optimization model encompasses two objectives, each individual must be evaluated with two fitness functions, which correspond to the objective functions FD(Γ) and N R(Γ) of the problem (Equation (3)). The selection of the best individuals is done using the Pareto concept in a binary tournament.
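A binary tournament based on Pareto dominance can be sketched as follows. This is a simplified illustration: when neither individual dominates the other, the sketch picks one at random, whereas the algorithms discussed here rely on their own ranking criteria (for instance, non-domination level and crowding distance). Both objectives are assumed to be expressed in minimization form.

```python
import random
from typing import Sequence, Tuple

# Objective vector in minimization form: (error, number of rules),
# e.g. error = 1 - accuracy when accuracy is the performance measure.
Objectives = Tuple[float, int]


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is no worse than b everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def binary_tournament(objs: Sequence[Objectives], rng: random.Random) -> int:
    """Return the index of the winner between two randomly drawn individuals."""
    i, j = rng.sample(range(len(objs)), 2)
    if dominates(objs[i], objs[j]):
        return i
    if dominates(objs[j], objs[i]):
        return j
    # Simplification: incomparable individuals are separated at random here.
    return rng.choice((i, j))


objs = [(0.10, 2), (0.30, 5), (0.05, 6)]
winner = binary_tournament(objs, random.Random(42))
print(winner, objs[winner])
```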

#### 3.2.5. Variation Operators

We use *self-adaptive crossover and mutation*, which means that the selection of the operators is made by means of an adaptive technique. As we have explained (cf. Section 3.2.1), each individual *I* has two integer parameters *dI* ∈ {0, ..., *δ*} and *eI* ∈ {0, ..., *ε*} to indicate which crossover or mutation is carried out. In our case, *δ* = 2 and *ε* = 2 (two crossover operators and two mutation operators), so that *dI*, *eI* ∈ {0, 1, 2}. Note that value 0 indicates that no crossover or no mutation is performed. Self-adaptive variation (Algorithm 6) generates two children from two parents by self-adaptive crossover (Algorithm 7) and self-adaptive mutation (Algorithm 8). Self-adaptive crossover of individuals *I*, *J* and self-adaptive mutation of individual *I* are similar to each other. First, with a probability *pv*, the values *dI* and *eI* are replaced by a random value. Additionally, in the case of crossover, the value *dJ* is replaced by *dI*. Then, the crossover indicated by *dI* or the mutation indicated by *eI* is performed. In summary, if an individual comes from a given crossover or a given mutation, that specific crossover and mutation are passed on to its offspring with probability at least 1 − *pv*, so the value of *pv* must be small enough to ensure a controlled evolution (in our case, we use *pv* = 0.1). Although the probability of the crossover and mutation is not explicitly represented, it can be computed as the ratio of the individuals for which the crossover and mutation values are different from 0. As the population evolves, individuals with more successful types of crossover and mutation will be more common, so that the probability of selecting the more successful crossover and mutation types will increase.
Using self-adaptive crossover and mutation operators helps to maintain diversity in the population and to sustain the convergence capacity of the evolutionary algorithm, and it also eliminates the need to set an a priori probability for each operator. In other approaches (e.g., [50]), the probabilities of crossover and mutation vary depending on the fitness value of the solutions.
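The self-adaptive mechanism described above can be sketched as follows. The function names and the dictionary-based individuals (keys `d` and `e`) are hypothetical, and the operators are passed in as callables so that the sketch stays independent of how rules are stored. The helper `operator_rate` reflects one reading of the implicit operator probability, namely the share of individuals whose selector is nonzero.

```python
import random
from typing import Callable, Dict, List, Sequence

Crossover = Callable[[Dict, Dict, random.Random], None]
Mutation = Callable[[Dict, random.Random], None]


def self_adaptive_crossover(ind_i: Dict, ind_j: Dict, p_v: float,
                            operators: Sequence[Crossover],
                            rng: random.Random) -> None:
    """operators[k - 1] implements crossover k; selector 0 means no crossover."""
    if rng.random() < p_v:
        ind_i["d"] = rng.randint(0, len(operators))
    ind_j["d"] = ind_i["d"]  # both individuals end up with the same selector
    if ind_i["d"] > 0:
        operators[ind_i["d"] - 1](ind_i, ind_j, rng)


def self_adaptive_mutation(ind: Dict, p_v: float,
                           operators: Sequence[Mutation],
                           rng: random.Random) -> None:
    """operators[k - 1] implements mutation k; selector 0 means no mutation."""
    if rng.random() < p_v:
        ind["e"] = rng.randint(0, len(operators))
    if ind["e"] > 0:
        operators[ind["e"] - 1](ind, rng)


def operator_rate(population: List[Dict], key: str) -> float:
    """Share of individuals whose selector ("d" or "e") is nonzero."""
    return sum(ind[key] != 0 for ind in population) / len(population)


# Hypothetical population of selectors only.
print(operator_rate([{"d": 0}, {"d": 1}, {"d": 2}, {"d": 0}], "d"))
```

Since the selector travels with the individual, operators that tend to produce surviving offspring become more frequent in the population without any explicit probability being tuned.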

Both *ENORA* and *NSGA-II* have been implemented with two crossover operators, *rule crossover* (Algorithm 9) and *rule incremental crossover* (Algorithm 10), and two mutation operators: *rule incremental mutation* (Algorithm 11) and *integer mutation* (Algorithm 12). *Rule crossover* randomly exchanges two rules selected from the parents, and *rule incremental crossover* adds to each parent a rule randomly selected from the other parent if its number of rules is less than the maximum number of rules. On the other hand, *rule incremental mutation* adds a new rule to the individual if the number of rules of the individual is less than the maximum number of rules, while *integer mutation* carries out a uniform mutation of a random antecedent belonging to a randomly selected rule.

#### **Algorithm 6** Variation.

**Require:** *Parent*1, *Parent*2 {Individuals for variation}


**Algorithm 7** Self-adaptive crossover.

**Require:** *I*, *J* {Individuals for crossing}

**Require:** *pv* (0 < *pv* < 1) {Probability of variation}

**Require:** *δ* > 0 {Number of different crossover operators}

	- {0: No cross}
	- {1: Rule crossover}
	- {2: Rule incremental crossover}

**Algorithm 8** Self-adaptive mutation.

**Require:** *I* {Individual for mutation}

**Require:** *pv* (0 < *pv* < 1) {Probability of variation}

**Require:** *ε* > 0 {Number of different mutation operators}

1: **if** a random Bernoulli variable with probability *pv* takes the value 1 **then**

	- {0: No mutation}
	- {1: Rule incremental mutation}
	- {2: Integer mutation}

#### **Algorithm 9** Rule crossover.

**Require:** *I*, *J* {Individuals for crossing}

1: *i* ← *Random*(1, *MI*)

2: *j* ← *Random*(1, *MJ*)

3: Exchange rules *R<sup>I</sup><sub>i</sub>* and *R<sup>J</sup><sub>j</sub>*

**Algorithm 10** Rule incremental crossover.

```
Require: I, J {Individuals for crossing}
Require: Mmax {Maximum number of rules}
if MI < Mmax then
    j ← Random(1, MJ)
    Add rule R^J_j to individual I
end if
if MJ < Mmax then
    i ← Random(1, MI)
    Add rule R^I_i to individual J
end if
```
**Algorithm 11** Rule incremental mutation.

**Require:** *I* {Individual for mutation}

**Require:** *Mmax* {Maximum number of rules}

1: **if** *MI* < *Mmax* **then**

2: Add a new random rule to *I*

3: **end if**

#### **Algorithm 12** Integer mutation.

**Require:** *I* {Individual for mutation}

**Require:** *p* > 0 {Number of categorical input attributes}

**Require:** *v*1, ..., *vp*, *vj* > 1, *j* = 1, . . . , *p* {Number of categories for the input attributes}

1: *i* ← *Random*(1,*MI*)

2: *j* ← *Random*(1,*p*)

3: *b<sup>I</sup> ij* ← *Random*(1,*vj*)
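Algorithms 9 to 12 can be sketched together in Python. This is an illustrative sketch under stated assumptions: individuals are dictionaries with a hypothetical `rules` list, each rule holding its antecedents under `b` and its consequent under `c`, and rules added to another individual are copied so that the two individuals never share the same object.

```python
import copy
import random
from typing import Dict, List


def rule_crossover(ind_i: Dict, ind_j: Dict, rng: random.Random) -> None:
    """Algorithm 9: exchange one randomly selected rule of each parent."""
    i = rng.randrange(len(ind_i["rules"]))
    j = rng.randrange(len(ind_j["rules"]))
    ind_i["rules"][i], ind_j["rules"][j] = ind_j["rules"][j], ind_i["rules"][i]


def rule_incremental_crossover(ind_i: Dict, ind_j: Dict, m_max: int,
                               rng: random.Random) -> None:
    """Algorithm 10: copy a random rule of the other parent if there is room."""
    if len(ind_i["rules"]) < m_max:
        ind_i["rules"].append(copy.deepcopy(rng.choice(ind_j["rules"])))
    if len(ind_j["rules"]) < m_max:
        ind_j["rules"].append(copy.deepcopy(rng.choice(ind_i["rules"])))


def rule_incremental_mutation(ind: Dict, v: List[int], w: int, m_max: int,
                              rng: random.Random) -> None:
    """Algorithm 11: append a new random rule if there is room."""
    if len(ind["rules"]) < m_max:
        ind["rules"].append({"b": [rng.randint(1, vj) for vj in v],
                             "c": rng.randint(1, w)})


def integer_mutation(ind: Dict, v: List[int], rng: random.Random) -> None:
    """Algorithm 12: redraw one antecedent of one randomly selected rule."""
    rule = rng.choice(ind["rules"])
    j = rng.randrange(len(v))
    rule["b"][j] = rng.randint(1, v[j])
```

None of these operators removes a rule, and the two incremental operators add one only while the individual has fewer than *Mmax* rules, which is how the rule-count constraints remain satisfied without a repair step.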

#### **4. Experiment and Results**

To ensure the reproducibility of the experiments, we have used publicly available datasets. In particular, we have designed two sets of experiments, one using the *Breast Cancer* [51] dataset, and the other using the *Monk's Problem 2* [52] dataset.

#### *4.1. The Breast Cancer Dataset*

*Breast Cancer* encompasses 286 instances. Each instance corresponds to a patient who suffered from breast cancer and is described by nine attributes. The class to be predicted is binary and represents whether the patient has suffered a recurring cancer event. In this dataset, 85 instances are positive and 201 are negative. Table 2 summarizes the attributes of the dataset. Among all instances, nine present some missing values; in the pre-processing phase, these have been replaced by the mode of the corresponding attribute.


**Table 2.** Attribute description of the *Breast Cancer* dataset.

#### *4.2. The Monk's Problem 2 Dataset*

In July 1991, the monks of *Corsendonk Priory* attended a summer course that was being held in their priory, namely the 2nd European Summer School on Machine Learning. After a week, the monks could not yet clearly identify the best *ML* algorithms, or which algorithms to avoid in which cases. For this reason, they decided to create the three so-called *Monk's problems*, and used them to determine which *ML* algorithms were the best. These problems, rather simple and completely artificial, later became famous (because of their peculiar origin), and have been used as a benchmark for many algorithms on several occasions. In particular, in [53], they have been used to test the performance of state-of-the-art (at that time) learning algorithms such as *AQ17-DCI*, *AQ17-HCI*, *AQ17-FCLS*, *AQ14-NT*, *AQ15-GA*, *Assistant Professional*, *mFOIL*, *ID5R*, *IDL*, *ID5R-hat*, *TDIDT*, *ID3*, *AQR*, *CN2*, *WEB CLASS*, *ECOBWEB*, *PRISM*, *Backpropagation*, and *Cascade Correlation*. For our research, we have used the *Monk's Problem 2*, which contains six categorical input attributes and a binary output attribute, summarized in Table 3. The target concept associated with the *Monk's Problem 2* is the binary outcome of the logical formula:

*Exactly two of:* {head\_shape = round, body\_shape = round, is\_smiling = yes, holding = sword, jacket\_color = red, has\_tie = yes}

In this dataset, the original training and testing sets were merged to allow other sampling procedures. The set contains a total of 601 instances, and no missing values.
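The target concept can be checked by enumerating the whole attribute space. The attribute names and domains below are those commonly given for the MONK's problems and are assumed here, since Table 3 is not reproduced in this text. The enumeration covers the full Cartesian product of the domains, not the 601 instances of the merged dataset.

```python
from itertools import product

# Assumed attribute domains of the MONK's problems.
DOMAINS = {
    "head_shape": ["round", "square", "octagon"],
    "body_shape": ["round", "square", "octagon"],
    "is_smiling": ["yes", "no"],
    "holding": ["sword", "balloon", "flag"],
    "jacket_color": ["red", "yellow", "green", "blue"],
    "has_tie": ["yes", "no"],
}

# The six conditions of the target concept.
TARGET = {
    "head_shape": "round",
    "body_shape": "round",
    "is_smiling": "yes",
    "holding": "sword",
    "jacket_color": "red",
    "has_tie": "yes",
}


def monk2(instance: dict) -> bool:
    """True when exactly two of the six conditions hold."""
    return sum(instance[name] == value for name, value in TARGET.items()) == 2


instances = [dict(zip(DOMAINS, values)) for values in product(*DOMAINS.values())]
positives = sum(monk2(x) for x in instances)
print(len(instances), positives)  # 432 142
```

A concept of the form "exactly two of six" has no compact description as a short list of conjunctive rules, which helps to explain why rule learners need many rules to approximate it.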

**Table 3.** Attribute description of the *MONK's Problem 2* dataset.


#### *4.3. Optimization Models*

We have conducted different experiments with different optimization models to calculate the overall performance of our proposed technique and to see the effect of optimizing different objectives for the same problem. First, we have designed a multi-objective constrained optimization model based on the *accuracy*:

$$\begin{array}{ll}\text{Max.} & \mathcal{ACC}\_{\mathcal{D}}(\Gamma) \\ \text{Min.} & \mathcal{N}\mathcal{R}(\Gamma) \\ \text{subject to:} & \mathcal{N}\mathcal{R}(\Gamma) \ge w \\ & \mathcal{N}\mathcal{R}(\Gamma) \le M\_{\max} \end{array} \tag{4}$$

where ACCD(Γ) is the proportion of correctly classified instances (both true positives and true negatives) among the total number of instances [54] obtained with the classifier Γ for the dataset D. ACCD(Γ) is defined as:

$$\mathcal{ACC}\_{\mathcal{D}}(\Gamma) = \frac{1}{K} \sum\_{i=1}^{K} T\_{\mathcal{D}}(\Gamma, i)$$

where *K* is the number of instances of the dataset D, and *T*D(Γ, *i*) is the result of the classification of the instance *i* in D with the classifier Γ, that is:

$$T\_{\mathcal{D}}(\Gamma, i) = \begin{cases} 1 & \text{if } \hat{c}\_{i}^{\Gamma} = c\_{\mathcal{D}}^{i} \\ 0 & \text{if } \hat{c}\_{i}^{\Gamma} \neq c\_{\mathcal{D}}^{i} \end{cases}$$

where *ĉ<sup>Γ</sup><sub>i</sub>* is the value predicted by Γ for the *i*th instance, and *c<sup>i</sup><sub>D</sub>* is the corresponding true value in D. Our second optimization model is based on the *area under the ROC curve*:

$$\begin{array}{ll}\text{Max.} & \mathcal{A} \mathcal{U} \mathcal{C}\_{\mathcal{D}}(\Gamma) \\ \text{Min.} & \mathcal{N} \mathcal{R}(\Gamma) \\ \text{subject to:} & \mathcal{N} \mathcal{R}(\Gamma) \ge w \\ & \mathcal{N} \mathcal{R}(\Gamma) \le M\_{\max} \end{array} \tag{5}$$

where AUCD(Γ) is the area under the *ROC* curve obtained with the classifier Γ on the dataset D. The *ROC* (*Receiver Operating Characteristic*) curve [55] is a graphical representation of the *sensitivity* versus one minus the *specificity* of a classifier as the *discrimination threshold* varies. Such a curve can be used to generate statistics that summarize the performance of a classifier, and it has been shown in [54] to be a simple, yet complete, empirical description of the decision threshold effect, indicating all possible combinations of the relative frequencies of the various kinds of correct and incorrect decisions. The area under the *ROC* curve can be computed as follows [56]:

$$\mathcal{A} \mathcal{U} \mathcal{C}\_{\mathcal{D}}(\Gamma) = \int\_0^1 \mathcal{S}\_{\mathcal{D}}(\Gamma, E\_{\mathcal{D}}^{-1}(\Gamma, v)) dv$$

where *S*D(Γ, *t*) (*sensitivity*) is the proportion of positive instances classified as positive by the classifier Γ in D, 1 − *E*D(Γ, *t*) (*specificity*) is the proportion of negative instances classified as negative by Γ in D, and *t* is the discrimination threshold. Finally, our third constrained optimization model is based on the *root mean square error* (*RMSE*):

$$\begin{array}{ll}\text{Min.} & \mathcal{RMSE}\_{\mathcal{D}}(\Gamma) \\ \text{Min.} & \mathcal{N}\mathcal{R}(\Gamma) \\ \text{subject to:} & \mathcal{N}\mathcal{R}(\Gamma) \ge w \\ & \mathcal{N}\mathcal{R}(\Gamma) \le M\_{\max} \end{array} \tag{6}$$

where RMSED(Γ) is defined as the square root of the *mean square error* obtained with a classifier Γ in the dataset D:

$$\mathcal{RMSE}\_{\mathcal{D}}(\Gamma) = \sqrt{\frac{1}{K} \sum\_{i=1}^{K} (\hat{c}\_{i}^{\Gamma} - c\_{\mathcal{D}}^{i})^2}$$

where *ĉ<sup>Γ</sup><sub>i</sub>* is the value predicted by the classifier Γ for the *i*th instance, and *c<sup>i</sup><sub>D</sub>* is the corresponding output value in the dataset D. Accuracy, area under the *ROC* curve, and root mean square error are all well-accepted measures used to evaluate the performance of a classifier. Therefore, it is natural to use such measures as fitness functions. In this way, we can establish which one behaves better in the optimization phase, and we can compare the results with those in the literature.
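The three performance measures can be sketched in a few lines of Python. Two points are assumptions of this sketch rather than statements about the implementation used in the experiments: the area under the *ROC* curve is estimated with the pairwise (Mann–Whitney) form instead of the integral given above, and the way a rule-based classifier produces a score for each instance is left open, so the scores are simply taken as an input.

```python
import math
from typing import Sequence


def accuracy(predicted: Sequence[int], true: Sequence[int]) -> float:
    """Proportion of instances whose predicted class equals the true class."""
    return sum(p == t for p, t in zip(predicted, true)) / len(true)


def rmse(predicted: Sequence[float], true: Sequence[float]) -> float:
    """Square root of the mean square error."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, true)) / len(true))


def auc(scores: Sequence[float], is_positive: Sequence[bool]) -> float:
    """Pairwise estimate: chance that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, is_positive) if y]
    neg = [s for s, y in zip(scores, is_positive) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical predictions for four instances of a two-class problem.
predicted = [1, 2, 2, 1]
true = [1, 2, 1, 1]
print(accuracy(predicted, true))  # 0.75
print(rmse(predicted, true))      # 0.5
print(auc([0.9, 0.8, 0.4, 0.3], [True, True, False, True]))
```

In this toy example a single misclassification yields an accuracy of 0.75 and a root mean square error of 0.5, since the only nonzero squared difference is 1 and the mean over four instances is 0.25.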

#### *4.4. Choosing the Best Pareto Front*

To compare the performance of *ENORA* and *NSGA-II* as metaheuristics in this particular optimization task, we use the *hypervolume metric* [57,58]. The hypervolume measures, simultaneously, the diversity and the optimality of the non-dominated solutions. The main advantage of using the hypervolume over other standard measures, such as the *error ratio*, the *generational distance*, the *maximum Pareto-optimal front error*, the *spread*, the *maximum spread*, or the *chi-square-like deviation*, is that it can be computed without an optimal population, which is not always known [15]. The hypervolume is defined as the volume of the search space dominated by a population *P*, and is formulated as:

$$HV\left(P\right) = \bigcup\_{i=1}^{|Q|} v\_i\tag{7}$$

where *Q* ⊆ *P* is the set of non-dominated individuals of *P*, and *vi* is the volume of the individual *i*. Subsequently, the *hypervolume ratio* (*HVR*) is defined as the ratio of the volume of the non-dominated search space over the volume of the entire search space, and is formulated as follows:

$$HVR\left(P\right) = 1 - \frac{HV\left(P\right)}{VS} \tag{8}$$

where *VS* is the volume of the search space. Computing *HVR* requires reference points that identify the maximum and minimum values for each objective. For *RBC* optimization, as proposed in this work, the following minimum (FD<sup>*lower*</sup>, N R<sup>*lower*</sup>) and maximum (FD<sup>*upper*</sup>, N R<sup>*upper*</sup>) points, for each objective, are set in the multi-objective optimization models in Equations (4)–(6):

$$\mathcal{F}\_{\mathcal{D}}{}^{lower} = 0, \quad \mathcal{F}\_{\mathcal{D}}{}^{upper} = 1, \quad \mathcal{N}\mathcal{R}^{lower} = w, \quad \mathcal{N}\mathcal{R}^{upper} = M\_{\text{max}}$$
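For two objectives, the hypervolume ratio can be computed with a simple sweep, as in the following sketch. It assumes that both objectives are in minimization form (for example, 1 − ACCD instead of ACCD), that each solution dominates the rectangle between itself and the upper reference point, and that the number of rules is treated as a continuous axis; the function name and the example front are hypothetical.

```python
from typing import List, Tuple


def hypervolume_ratio(front: List[Tuple[float, int]], w: int, m_max: int) -> float:
    """Return 1 - HV / VS for points (f, rules), both objectives minimized.

    Reference points: f in [0, 1] and rules in [w, m_max].
    """
    f_lower, f_upper = 0.0, 1.0
    points = sorted(front)  # ascending in the first objective
    hv = 0.0
    fewest_rules = m_max
    for idx, (f, rules) in enumerate(points):
        fewest_rules = min(fewest_rules, rules)
        next_f = points[idx + 1][0] if idx + 1 < len(points) else f_upper
        # Strip [f, next_f): dominated for every rule count >= fewest_rules.
        hv += (next_f - f) * (m_max - fewest_rules)
    vs = (f_upper - f_lower) * (m_max - w)
    return 1.0 - hv / vs


# Hypothetical front: (error, number of rules), binary class, at most 12 rules.
front = [(0.3, 2), (0.2, 4), (0.1, 8)]
print(hypervolume_ratio(front, w=2, m_max=12))  # approximately 0.18
```

Here the dominated area is 0.1 × 4 + 0.1 × 8 + 0.7 × 10 = 8.2 out of a search space of area 10, so the non-dominated share is 0.18. Under this definition, a smaller ratio indicates a front that dominates more of the search space.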

A first single execution of all six models (three driven by *ENORA*, and three driven by *NSGA-II*), over both datasets, has been designed to show the shape of the final Pareto fronts and to compare the hypervolume ratio of the models. The results of this single execution, with population size equal to 50 and 20,000 generations (1,000,000 evaluations in total), are shown in Figures 3 and 4 (by default, *Mmax* is set to 10, to which we add 2, because both datasets have a binary class). Regarding the configuration of the number of generations and the size of the population, our criterion has been established as follows: once the number of evaluations is set to 1,000,000, we can decide to use a population size of 100 individuals and 10,000 generations, or to use a population size of 50 individuals and 20,000 generations. The first configuration (100 × 10,000) allows a greater diversity with respect to the number of rules of the classifiers, while the second one (50 × 20,000) allows a better adjustment of the classifier parameters and, therefore, a greater precision. Given that the maximum number of rules of the classifiers is not greater than 12, we think that 50 individuals are sufficient to represent four classifiers on average for each number of rules (4 × 12 = 48 ≈ 50). Thus, we prefer the second configuration (50 × 20,000) because having more generations increases the chances of building classifiers with a higher precision.

**Figure 3.** Pareto fronts of one execution of *ENORA* and *NSGA-II*, with *Mmax* = 12, on the *Breast Cancer* dataset, and their respective *HVR*. Note that in the case of multi-objective classification where FD is maximized (ACCD and AUCD), function FD has been converted to minimization for a better understanding of the Pareto front.

**Figure 4.** Pareto fronts of one execution of *ENORA* and *NSGA-II*, with *Mmax* = 12, on the *Monk's Problem 2* dataset, and their respective *HVR*. Note that in the case of multi-objective classification where FD is maximized (ACCD and AUCD), function FD has been converted to minimization for a better understanding of the Pareto front.

Experiments were executed on an x64-based PC with one Intel64 Family 6 Model 60 Stepping 3 GenuineIntel processor at 3201 MHz and 8131 MB of RAM. Table 4 shows the run time for each method over both datasets. Note that, although *ENORA* has less algorithmic complexity than *NSGA-II*, it has taken longer in the experiments than *NSGA-II*. This is because the evaluation time of individuals in *ENORA* is higher than that of *NSGA-II*: *ENORA* maintains more diversity than *NSGA-II* and therefore evaluates classifiers with more rules than *NSGA-II*.

**Table 4.** Run times of *ENORA* and *NSGA-II* for *Breast Cancer* and *Monk's Problem 2* datasets.


From these results, we can deduce that, first, *ENORA* maintains a higher diversity of the population, and achieves a better hypervolume ratio with respect to *NSGA-II*, and, second, using accuracy as the first objective generates better fronts than using the area under the *ROC* curve, which, in turn, performs better than using the root mean square error.

#### *4.5. Comparing Our Method with Other Classifier Learning Systems (Full Training Mode)*

To perform an initial comparison between the performance of the classifiers obtained with the proposed method and the ones obtained with classical methods (*PART*, *JRip*, *OneR* and *ZeroR*), we have executed again the six models in full training mode.

The parameters have been configured as in the previous experiment (population size equal to 50 and 20,000 generations), except for the *Mmax* parameter, which was set to 2 for the *Breast Cancer* dataset and to 9 for *Monk's Problem 2*. Observe that, since *Mmin* = 2 in both cases, executing the optimization models using *Mmax* = 2 leads to a single-objective search for the *Breast Cancer* dataset. In fact, after the preliminary experiments were run, it turned out that the classical classifier learning systems tend to return very small, although not very precise, sets of rules on *Breast Cancer*, and that justifies our choice. On the other hand, executing the classical rule learners on *Monk's Problem 2* returns more diverse sets of rules, which justifies choosing a higher *Mmax* in that case. To decide, a posteriori, which individual is chosen from the final front, we have used the default criterion: the individual with the best value on the first objective is returned. In the case of *Monk's Problem 2*, that individual has seven rules. The comparison is shown in Tables 5 and 6, which report, for each classifier, the following information: *number of rules*, *percent correct*, *true positive rate*, *false positive rate*, *precision*, *recall*, *F-measure*, *Matthews correlation coefficient*, *area under the ROC curve*, *area under precision-recall curve*, and *root mean square error*. As for the *Breast Cancer* dataset, the best result emerged from the proposed method, in the optimization model driven by *NSGA-II* with root mean square error as the first objective (see Table 7); only *PART* was able to achieve similar, although slightly worse, results, but at the price of having 15 rules, making the system clearly not interpretable. In the case of the *Monk's Problem 2* dataset, *PART* returned a model with 47 rules, which is not interpretable by any standard, although it is very accurate.
The best interpretable result is the one with seven rules returned by *ENORA*, driven by the root mean square error (see Table 8). The experiments for classical learners have been conducted using the default parameters.
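The a posteriori choice of one individual from the final front can be sketched as follows. The tuple layout and the tie-breaking rule (fewer rules first) are assumptions of this sketch; the text only states that the individual with the best value on the first objective is returned.

```python
from typing import List, Tuple

# (value of the first objective, number of rules)
Solution = Tuple[float, int]


def select_from_front(front: List[Solution], maximize: bool) -> Solution:
    """Return the solution with the best first objective; ties go to fewer rules."""
    if maximize:
        return min(front, key=lambda s: (-s[0], s[1]))
    return min(front, key=lambda s: (s[0], s[1]))


# Hypothetical fronts: accuracy is maximized, root mean square error is minimized.
print(select_from_front([(0.70, 2), (0.78, 3), (0.85, 7)], maximize=True))
print(select_from_front([(0.50, 2), (0.40, 5)], maximize=False))
```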

**Table 5.** Comparison of the performance of the learning models in full training mode—*Breast Cancer* dataset.


**Table 6.** Comparison of the performance of the learning models in full training mode—*Monk's Problem 2* dataset.


**Table 7.** Rule-based classifier obtained with *NSGA-II-RMSE* for *Breast Cancer* dataset.



**Table 8.** Rule-based classifier obtained with *ENORA-RMSE* for *Monk's Problem 2* dataset.

*4.6. Comparing Our Method with Other Classifier Learning Systems (Cross-Validation and Train/Test Percentage Split Mode)*

To test the capabilities of our methodology in a more significant way, we proceeded as follows. First, we designed a *cross-validated* experiment for the *Breast Cancer* dataset, in which we iterated three times a 10-fold cross-validation learning process [59] and considered the average value of the performance metrics *percent correct*, *area under the ROC curve*, and *serialized model size* of all results. Second, we designed a *train/test percentage split* experiment for the *Monk's Problem 2* dataset, in which we iterated ten times a 66% (training) versus 33% (testing) split and considered, again, the average result of the same metrics. Finally, we performed statistical tests on the results, to understand whether they show any statistically significant difference. An execution of our methodology, and of standard classical learners, has been performed to obtain the models to be tested under precisely the same conditions as in the experiment of Section 4.5. It is worth observing that using two different types of evaluation allows us to make sure that our results are not influenced by the type of experiment. The results of the experiments are shown in Tables 9 and 10.
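The two evaluation protocols can be sketched as index generators. This is a minimal sketch with hypothetical function names: the folds are not stratified by class, which is a simplification with respect to common tools, and the remaining instances after the 66% training cut are all used for testing.

```python
import random
from typing import Iterator, List, Tuple

Split = Tuple[List[int], List[int]]  # (training indices, testing indices)


def repeated_k_fold(n: int, k: int, repeats: int, seed: int = 0) -> Iterator[Split]:
    """Yield repeats * k train/test splits of the indices 0..n-1."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        folds = [idx[f::k] for f in range(k)]  # k nearly equal parts
        for f in range(k):
            train = [i for g in range(k) if g != f for i in folds[g]]
            yield train, folds[f]


def repeated_split(n: int, train_fraction: float, repeats: int,
                   seed: int = 0) -> Iterator[Split]:
    """Yield repeated random percentage splits of the indices 0..n-1."""
    rng = random.Random(seed)
    cut = round(n * train_fraction)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        yield idx[:cut], idx[cut:]


cv_splits = list(repeated_k_fold(286, k=10, repeats=3))    # Breast Cancer
holdouts = list(repeated_split(601, 0.66, repeats=10))     # Monk's Problem 2
print(len(cv_splits), len(holdouts))
```

Each metric is then averaged over the 30 cross-validation results or over the 10 hold-out results, respectively.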

**Table 9.** Comparison of the performance of the learning models in 10-fold cross-validation mode (three repetitions)—*Breast Cancer* dataset.


The statistical tests aim to verify if there are significant differences among the means of each metric: *percent correct*, *area under the ROC curve* and *serialized model size*. We proceeded as follows. First, we checked normality and sphericity of each sample by means of the *Shapiro–Wilk normality test*. Then, if normality and sphericity conditions were met, we applied *one way repeated measures ANOVA*; otherwise, we applied the *Friedman test*. In the latter case, when statistically significant differences were detected, we applied the *Nemenyi post-hoc test* to locate where these differences were. Tables A1–A12 in Appendix A show the results of the performed tests for the *Breast Cancer* dataset for each of the three metrics, and Tables A13–A24 in Appendix B show the results for the *Monk's Problem 2* dataset.
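To make the non-parametric branch of this procedure concrete, the following sketch computes the Friedman chi-square statistic from a table of results with one row per fold or repetition and one column per learning model. It is a simplified illustration: no tie correction is applied, and the p-value and the Nemenyi post-hoc comparison are left to a statistics package. The function names and the example values are hypothetical.

```python
from typing import List, Sequence


def average_ranks(row: Sequence[float]) -> List[float]:
    """Rank the values of one row from 1 (smallest); ties share the mean rank."""
    order = sorted(range(len(row)), key=lambda j: row[j])
    ranks = [0.0] * len(row)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and row[order[end + 1]] == row[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for q in range(pos, end + 1):
            ranks[order[q]] = mean_rank
        pos = end + 1
    return ranks


def friedman_statistic(results: Sequence[Sequence[float]]) -> float:
    """Friedman chi-square for n rows (blocks) and k columns (models)."""
    n, k = len(results), len(results[0])
    rank_sums = [0.0] * k
    for row in results:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


# Hypothetical accuracies of three models over four folds.
results = [[0.70, 0.75, 0.80]] * 4
print(friedman_statistic(results))  # 8.0
```

In this example every fold ranks the three models in the same order, so the statistic reaches its maximum value *n*(*k* − 1) = 8 for *n* = 4 and *k* = 3.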


**Table 10.** Comparison of the performance of the learning models in split mode—*Monk's problem 2* dataset.

#### *4.7. Additional Experiments*

Finally, we show the results of the evaluation with 10-fold cross-validation for *Monk's problem 2* dataset and for the following four other datasets:



**Table 11.** Attribute description of the *Tic-Tac-Toe-Endgame* dataset.

**Table 12.** Attribute description of the *Car* dataset.



**Table 13.** Attribute description of the *kr-vs-kp* dataset.

**Table 14.** Attribute description of the *Nursery* dataset.


We have used the *ENORA* algorithm together with the ACCD and RMSED objective functions in this case because these combinations have produced the best results for the *Breast Cancer* and *Monk's problem 2* datasets evaluated in 10-fold cross-validation (population size equal to 50, 20,000 generations and *Mmax* = 10 + number of classes). Table 15 shows the results of the best combination *ENORA-ACC* or *ENORA-RMSE* together with the results of the classical rule-based classifiers.


**Table 15.** Comparison of the performance of the learning models in 10-fold cross-validation mode—*Monk's Problem 2*, *Tic-Tac-Toe-Endgame*, *Car*, *kr-vs-kp* and *Nursery* datasets.

#### **5. Analysis of Results and Discussion**

The results of our tests allow for several considerations. The first interesting observation is that *NSGA-II* identifies fewer solutions than *ENORA* on the Pareto front, which implies less diversity and therefore a worse hypervolume ratio, as shown in Figures 3 and 4. This is not surprising: on several other occasions [19,34,60], it has been shown that *ENORA* maintains a higher diversity in the population than other well-known evolutionary algorithms, with a generally positive influence on the final results. Comparing the results in full training mode against the results in cross-validation or in splitting mode makes it evident that our solution produces classification models that are more resilient to over-fitting. For example, the classifier learned by *PART* with *Monk's Problem 2* presents a 94.01% accuracy in full training mode that drops to 73.51% in splitting mode. A similar, although more contained, drop in accuracy is shown by the classifier learned by *PART* with the *Breast Cancer* dataset; at the same time, the classifier learned by *ENORA* driven by accuracy shows only a 5.57% drop in one case, and even an improvement in the other case (see Tables 5, 6, 9, and 10). This phenomenon is easily explained by looking at the number of rules: the more rules in a classifier, the higher the risk of over-fitting; *PART* produces very accurate classifiers, but at the price of adding many rules, which affects not only the interpretability of the model but also its resilience to over-fitting. Full training results seem to indicate that when the optimization model is driven by *RMSE* the classifiers are more accurate; nevertheless, they are also more prone to over-fitting, indicating that, on average, the optimization models driven by the accuracy are preferable.

From the statistical tests (whose results are shown in Appendices A and B), we conclude that among the six variants of the proposed optimization model there are no statistically significant differences, which suggests that the advantages of our method do not depend directly on a specific evolutionary algorithm or on the specific performance measure that is used to drive the evolution. Statistically significant differences between our method and very simple classical methods such as *OneR* were to be expected. Statistically significant differences between our method and a well-consolidated one such as *PART* have not been found, but the price to be paid for using *PART* in order to have similar results to ours is a very high number of rules (15 vs. 2 in one case and 47 vs. 7 in the other case).

We would like to highlight that both the *Breast Cancer* dataset and the *Monk's problem 2* dataset are difficult to approximate with interpretable classifiers and that none of the analyzed classifiers obtains high accuracy rates using the cross-validation technique. Even powerful black-box classifiers, such as *Random Forest* and *Logistic*, obtain success rates below 70% in 10-fold cross-validation for these datasets. However, *ENORA* obtains a better balance (trade-off) between precision and interpretability than the rest of the classifiers. For the rest of the analyzed datasets, the accuracy obtained using *ENORA* is substantially higher. For example, for the *Tic-Tac-Toe-Endgame* dataset, *ENORA* obtains a 98.3299% success percentage with only two rules in cross-validation, while *PART* obtains 94.2589% with 49 rules, and *JRip* obtains 97.8079% with nine rules. With respect to the results obtained on the *Car*, *kr-vs-kp* and *Nursery* datasets, we note that better success percentages can be obtained if the maximum number of evaluations is increased. However, better success percentages imply a greater number of rules, which is to the detriment of the interpretability of the models.

#### **6. Conclusions and Future Works**

In this paper, we have proposed a novel technique for categorical classifier learning. Our proposal is based on defining the problem of learning a classifier as a multi-objective optimization problem, and solving it by suitably adapting an evolutionary algorithm to this task; our two objectives are minimizing the number of rules (for a better interpretability of the classifier) and maximizing a metric of performance. Depending on the particular metric that is chosen, (slightly) different optimization models arise. We have tested our proposal, in a first instance, on two different publicly available datasets, *Breast Cancer* (in which each instance represents a patient that has suffered from breast cancer and is described by nine attributes, and the class to be predicted represents the fact that the patient has suffered a recurring event) and *Monk's Problem 2* (which is an artificial, well-known dataset in which the class to be predicted represents a logical function), using two different evolutionary algorithms, namely *ENORA* and *NSGA-II*, and three different choices as a performance metric, i.e., accuracy, the area under the *ROC* curve, and the root mean square error. Additionally, we have shown the results of the evaluation in 10-fold cross-validation of the publicly available *Tic-Tac-Toe-Endgame*, *Car*, *kr-vs-kp* and *Nursery* datasets.

Our initial motivation was to design a classifier learning system that produces interpretable, yet accurate, classifiers: since interpretability is a direct function of the number of rules, we conclude that such an objective has been achieved. As an aside, observe that our approach allows the user to decide, beforehand, a maximum number of rules; this can also be done in *PART* and *JRip*, but only indirectly. Finally, the idea underlying our approach is that multiple classifiers are explored at the same time in the same execution, and this allows us to choose the best compromise between the performance and the interpretability of a classifier a posteriori.
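Choosing a compromise a posteriori can be sketched as a simple filter over the classifiers returned by one execution. The example below assumes that each candidate is summarized by a (number of rules, accuracy) pair; the function name `pick` and the values in `front` are hypothetical and are not results from the paper.

```python
from typing import List, Optional, Tuple

Candidate = Tuple[int, float]  # (number of rules, accuracy)


def pick(front: List[Candidate], max_rules: int) -> Optional[Candidate]:
    """Most accurate candidate within the rule budget; ties go to fewer rules."""
    feasible = [c for c in front if c[0] <= max_rules]
    if not feasible:
        return None
    return max(feasible, key=lambda c: (c[1], -c[0]))


# Hypothetical non-dominated set from a single run (illustrative values only).
front = [(2, 0.71), (3, 0.74), (5, 0.78), (8, 0.79)]

print(pick(front, max_rules=5))  # (5, 0.78)
print(pick(front, max_rules=1))  # None
```

Because the whole set of candidates is available after the run, the rule budget can be changed and the choice repeated without learning the classifiers again.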

As future work, we envisage that our methodology can benefit from an *embedded* feature selection mechanism. In fact, all attributes are (ideally) used in every rule of a classifier learned by our optimization model. By simply relaxing such a constraint, and by suitably re-defining the first objective in the optimization model (e.g., by minimizing the sum of the lengths of all rules, or similar measures), the resulting classifiers will naturally present rules that use more features as well as rules that use fewer (clearly, the implementation must be adapted to obtain an initial population in which the classifiers have rules of different lengths, as well as mutation operators that allow a rule to grow or to shrink). Although this approach does not follow the classical definition of feature selection mechanisms (in which a subset of features is selected that reduces the dataset over which a classifier is learned), it is natural to imagine that it may produce classifiers that are even more accurate and, at the same time, more interpretable.
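The alternative first objective mentioned above (the sum of the lengths of all rules) can be illustrated as follows. This is a sketch of one possible definition, assuming that a rule is encoded as a mapping from the attributes it tests to the required values; the encoding, the function name `total_rule_length`, and the example rules are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding: (attribute -> required value, predicted class).
Rule = Tuple[Dict[str, str], str]


def total_rule_length(rules: List[Rule]) -> int:
    """Sum of the number of attribute tests over all rules (to be minimized)."""
    return sum(len(conditions) for conditions, _ in rules)


# Every rule tests all three attributes, as in the current model.
full = [
    ({"a": "1", "b": "0", "c": "1"}, "pos"),
    ({"a": "0", "b": "0", "c": "1"}, "neg"),
]

# Relaxed constraint: each rule tests only some attributes.
relaxed = [
    ({"a": "1"}, "pos"),
    ({"b": "0", "c": "1"}, "neg"),
]

print(len(full), total_rule_length(full))        # 2 6
print(len(relaxed), total_rule_length(relaxed))  # 2 3
```

Both classifiers have two rules, so the rule-count objective cannot tell them apart, whereas the length-based objective favours the classifier with shorter rules.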

Currently, we are implementing our own version of *multi-objective differential evolution* (*MODE*) for rule-based classification for inclusion in the Weka Open Source Software issued under the GNU General Public License. The implementation of other algorithms, such as *MOEA/D*, their adaptation in the Weka development platform and subsequent analysis and comparison are planned for future work.

**Author Contributions:** Conceptualization, F.J. and G.S. (Gracia Sánchez); Methodology, F.J. and G.S. (Guido Sciavicco); Software, G.S. (Gracia Sánchez) and C.M.; Validation, F.J., G.S. (Gracia Sánchez) and C.M.; Formal Analysis, F.J. and G.S. (Guido Sciavicco); Investigation, F.J. and G.S. (Gracia Sánchez); Resources, L.M.; Data Curation, L.M.; Writing—Original Draft Preparation, F.J., L.M. and G.S. (Guido Sciavicco); Writing—Review and Editing, F.J., L.M. and G.S. (Guido Sciavicco); Visualization, F.J.; Supervision, F.J.; Project Administration, F.J.; and Funding Acquisition, F.J., L.M., G.S. (Gracia Sánchez) and G.S. (Guido Sciavicco).

**Funding:** This research received no external funding.

**Acknowledgments:** This study was partially supported by computing facilities of Extremadura Research Centre for Advanced Technologies (CETA-CIEMAT), funded by the European Regional Development Fund (ERDF). CETA-CIEMAT belongs to CIEMAT and the Government of Spain.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **Appendix A. Statistical Tests for** *Breast Cancer* **Dataset**


**Table A1.** Shapiro–Wilk normality test *p*-values for *percent correct* metric—*Breast Cancer* dataset.

**Table A2.** Friedman *p*-value for *percent correct* metric—*Breast Cancer* dataset.


**Table A3.** Nemenyi post-hoc procedure for *percent correct* metric—*Breast Cancer* dataset.


**Table A4.** Summary of statistically significant differences for *percent correct* metric—*Breast Cancer* dataset.


**Table A5.** Shapiro–Wilk normality test *p*-values for *area under the ROC curve* metric—*Breast Cancer* dataset.



**Table A7.** Nemenyi post-hoc procedure for *area under the ROC curve* metric—*Breast Cancer* dataset.


**Table A8.** Summary of statistically significant differences for *area under the ROC curve* metric—*Breast Cancer* dataset.


**Table A9.** Shapiro–Wilk normality test *p*-values for *serialized model size* metric—*Breast Cancer* dataset.


**Table A10.** Friedman *p*-value for *serialized model size* metric—*Breast Cancer* dataset.


**Table A11.** Nemenyi post-hoc procedure for *serialized model size* metric—*Breast Cancer* dataset.



**Table A12.** Summary of statistically significant differences for *serialized model size* metric—*Breast Cancer* dataset.

#### **Appendix B. Statistical Tests for** *Monk's Problem 2* **Dataset**

**Table A13.** Shapiro–Wilk normality test *p*-values for *percent correct* metric—*Monk's Problem 2* dataset.


**Table A14.** Friedman *p*-value for *percent correct* metric—*Monk's Problem 2* dataset.


**Table A15.** Nemenyi post-hoc procedure for *percent correct* metric—*Monk's Problem 2* dataset.


**Table A16.** Summary of statistically significant differences for *percent correct* metric—*Monk's Problem 2* dataset.



**Table A17.** Shapiro–Wilk normality test *p*-values for *area under the ROC curve* metric—*Monk's Problem 2* dataset.

**Table A18.** Friedman *p*-value for *area under the ROC curve* metric—*Monk's Problem 2* dataset.


**Table A19.** Nemenyi post-hoc procedure for *area under the ROC curve* metric—*Monk's Problem 2* dataset.


**Table A20.** Summary of statistically significant differences for *area under the ROC curve* metric—*Monk's Problem 2* dataset.


**Table A21.** Shapiro–Wilk normality test *p*-values for *serialized model size* metric—*Monk's Problem 2* dataset.




**Table A23.** Nemenyi post-hoc procedure for *serialized model size* metric—*Monk's Problem 2* dataset.


**Table A24.** Summary of statistically significant differences for *serialized model size* metric—*Monk's Problem 2* dataset.


#### **Appendix C. Nomenclature**

**Table A25.** Nomenclature table (Part I).




**Table A26.** Nomenclature table (Part II).




© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
