**1. Introduction**

Human Activity Recognition (HAR) is an emergent research area for autonomous and real-time monitoring of human activities and has been widely explored because of its many practical applications, such as behavior detection, ambient assisted living, elderly care, and rehabilitation [1–8]. Individuals express their routines through activities performed in particular situations, and understanding these situations enables people to improve their daily lives. Monitoring the physical activities performed by an individual (e.g., walking and running) can support recommendations that help avoid the negative impacts of illness. For instance, the time spent sitting is associated with an increased risk of becoming obese and developing diabetes and cardiovascular diseases [3]. A HAR system can observe elderly people by analyzing data from a smart wearable and improve their lifestyle by warning them about impending events such as falls or other health risks [9].

Smartphones have been used to monitor everyday activities automatically through a variety of embedded sensors such as accelerometers, gyroscopes, microphones, cameras and GPS units [1,10]. Understanding how individuals behave by analyzing smartphone data through machine learning is the fundamental challenge in the human activity recognition research area [11].

To recognize physical activity from sensor readings, most proposed solutions rely on the Activity Recognition Process (ARP) protocol: data acquisition, pre-processing/segmentation, feature extraction, classification and evaluation [10,12–14]. Several parameters in each of these stages (sample size, experimental methodology, cross-validation settings and type of application) can affect the overall recognition performance [15]. Even when these parameters are well adjusted, the final evaluation of the system may not reflect the true accuracy when recognizing data from new individuals. The main reason for this is that, in most cases, the methodology used to validate the results does not consider the label that identifies the individuals.

The most commonly adopted validation strategy in the Machine Learning (ML) literature is *k*-fold cross-validation (*k*-CV) [16]. The *k*-CV procedure splits a dataset into two subsets: one for training the ML algorithm and one for testing its performance, repeating this process *k* times. The procedure does not consider whether all samples belong to the same subject (i.e., individual), usually because of the windowing step used to segment the time series during the pre-processing stage. Therefore, in a HAR application that aims for generalization, randomly partitioning the dataset becomes a problem when samples of one subject end up in both the training and test sets at the same time. As a result, an information leak appears, artificially increasing the classifier's accuracy. We can confirm this observation in several studies in the literature [7,17–22].
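
The following minimal sketch, using hypothetical windowed features rather than the datasets evaluated in this article, illustrates the mechanics of the two splitting strategies with scikit-learn: a plain *k*-fold split can place windows of the same subject in both training and test folds, whereas a subject-aware splitter such as `GroupKFold` keeps each subject's windows in a single fold.

```python
# Minimal sketch: random k-fold vs. subject-aware cross-validation.
# X, y and `subjects` below are synthetic placeholders for windowed features,
# activity labels and per-window subject identifiers.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 20))           # 600 windows, 20 features each
y = rng.integers(0, 6, size=600)         # six activity classes
subjects = np.repeat(np.arange(10), 60)  # ten subjects, 60 windows per subject

clf = RandomForestClassifier(n_estimators=100, random_state=0)

# Random k-CV: windows of the same subject may appear in training and test folds.
acc_kcv = cross_val_score(clf, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))

# Subject-aware CV: all windows of a given subject stay in the same fold.
acc_group = cross_val_score(clf, X, y, groups=subjects, cv=GroupKFold(n_splits=10))

print(acc_kcv.mean(), acc_group.mean())
```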

In practice, the introduction of illegitimate information in the evaluation stage is unintentional and facilitated by most data partitioning processes, making it hard to detect and eliminate. Even then, identifying this situation as the reason for the overestimated results might be non-trivial [18,23].

In this article, we use explainable artificial intelligence (XAI) tools to detect and address bias and fairness issues that arise when choosing different validation methodologies. This is a critical topic that has grown rapidly in the community because the decisions of machine learning models can reproduce biases in the historical data used to train them [24]. A variety of factors, such as lack of data, imbalanced datasets and biased datasets, can affect the decisions rendered by learning models. We show that it is possible to explain model behavior and its capabilities in a simple way. Machine learning engineers can use this information to suggest modifications needed in the system to reduce critical issues linked to bias or fairness.

Our work aims to uncover bias problems that overestimate the predictive accuracy of a machine learning algorithm because of an inappropriate choice of validation methodology. We examine how different HAR system approaches generalize to new subjects by using *k*-fold cross-validation, hold-out and leave-one-subject-out cross-validation. In particular, we show how the SHAP (Shapley additive explanations) framework presents itself as a tool that provides graphical insights into how human activity recognition models achieve their results. This is important because it allows us to see which features are relevant to a HAR system under each validation method.
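
A minimal sketch of this kind of analysis is shown below, assuming a tree-based classifier trained on handcrafted features; the data and feature names are synthetic placeholders, and the `shap` library is used to rank features by their contribution to the predictions.

```python
# Minimal sketch: explaining a HAR classifier with SHAP (synthetic data).
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(400, 8)), rng.integers(0, 6, size=400)
X_test = rng.normal(size=(100, 8))
feature_names = [f"feat_{i}" for i in range(8)]  # hypothetical feature names

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Global view: features ranked by mean |SHAP value| across the test windows.
shap.summary_plot(shap_values, X_test, feature_names=feature_names)
```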

We can summarize the main contributions of this work as follows:

(1) We evaluate three different approaches for building a HAR system: personalized, universal and hybrid. Our experiments reveal pitfalls caused by incorrectly dividing the dataset, which can lead to unnoticed over-fitting. We show that *k*-CV achieves an average accuracy of 98% on six human activities, whereas with leave-one-subject-out cross-validation the accuracy drops to 85.37%. We achieved these results by merging three widely used datasets, SHOAIB [2], WISDM [25] and UCI-HAR [26], which contain human activities performed by 59 different subjects.

(2) We propose a new approach that uses XAI methods to show how machine learning models choose different features to make their predictions depending on the selected validation strategy. We performed several experiments that allowed us to measure the impact of each of these methodologies on the final results. With this, we could quantify the importance of choosing the correct evaluation methodology for a HAR system.

The remainder of this paper is organized as follows. Section 2 presents the most common procedures for building a HAR system and Section 3 reviews the work related to this research. Section 4 describes the most common evaluation procedures and Section 5 presents a discussion of a fair evaluation for HAR systems. Section 6 introduces explainable algorithms used to interpret the predictions of machine learning models. Section 7 presents the experimental protocol and Section 8 the results of our evaluation scenarios. Finally, Section 9 presents the conclusions of this work.

#### **2. Human Activity Recognition**

Smartphones are devices capable of monitoring everyday activities automatically through a variety of built-in sensors such as accelerometers, gyroscopes, microphones, cameras and GPS units [10]. Human activity recognition involves complicated tasks which often require dedicated hardware, sophisticated engineering and computational and statistical techniques for data pre-processing and analysis [7].

To find patterns in sensor data and associate them with human activities, the standard pipeline used in most works follows the Activity Recognition Process (ARP) protocol [7,11,13,27–30]. As depicted in Figure 1, ARP consists of five steps: acquisition, pre-processing, segmentation, feature extraction and classification [7]. Our work also includes the evaluation phase in the ARP pipeline to present in detail how the validation methodology impacts the general performance of a HAR system. Extensions of the standard pipeline can be found in the literature, adding specific stages such as annotation and application [31], or even privacy and interpretability [32,33].

**Figure 1.** The common methodology used in HAR: Data acquisition, segmentation, feature extraction, classification and evaluation.

#### *2.1. Data Acquisition*

In the data acquisition phase, motion sensors are used to gather data, such as angle, vibration, rotation and oscillation, from the smartphone. The individual's actions are reflected in the sensor data or linked to the physical environment in which the device is located. For this reason, choosing suitable sensors is crucial. Currently, the accelerometer is the most widely used sensor in HAR systems because it is built into most smartphones and wearable devices and has shown superior results for representing activities when compared with other inertial sensors. The combination of accelerometer and gyroscope allows HAR systems to find patterns in the sensor signals and associate them with the activities performed, such as activities of daily living (ADL) and sports. However, finding such patterns is not trivial, since smartphones are often held near different parts of the user's body and each subject may have a personal activity signature [11,13,28].

#### *2.2. Pre-processing and Segmentation*

After data acquisition, the raw data collected by motion sensors may contain noise and must be processed, converted into a readable format and segmented before being used by the subsequent stages of the HAR application. The segmentation phase consists of dividing the continuously recorded signals into smaller segments. Smaller segments allow activities to be detected faster, since the wait to assemble the segment is shorter and the resource requirements of the process are also reduced. Larger segments allow more complex activities to be recognized, but additional time is required to assemble and process each segment. The HAR literature reports segment sizes ranging from 1 s to 10 s, with recognition rates above 90% [7,25,34,35].
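
A minimal sketch of this windowing step is shown below, assuming a tri-axial accelerometer signal; the sampling rate, window length and overlap are illustrative values, not the settings adopted in this article.

```python
# Minimal sketch: fixed-size sliding-window segmentation with overlap.
import numpy as np

def segment(signal: np.ndarray, window_s: float, overlap: float, fs: int) -> np.ndarray:
    """Split an (n_samples, n_axes) signal into overlapping windows."""
    win = int(window_s * fs)
    step = max(1, int(win * (1.0 - overlap)))
    return np.stack([signal[i:i + win]
                     for i in range(0, len(signal) - win + 1, step)])

fs = 50                                                   # sampling rate in Hz
acc = np.random.default_rng(0).normal(size=(60 * fs, 3))  # one minute of tri-axial data
windows = segment(acc, window_s=2.5, overlap=0.5, fs=fs)  # 2.5 s windows, 50% overlap
print(windows.shape)                                      # (47, 125, 3)
```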

#### *2.3. Feature Extraction Process*

The feature extraction phase aims to find a high-level representation of each activity segment. For sensor-based activity recognition, feature extraction is more difficult because of inter-activity similarity: different activities may have similar characteristics (e.g., walking and running), making it difficult to produce distinguishable features that represent activities uniquely [32].

Many HAR systems are based on shallow approaches, in which features are handcrafted by a domain specialist who transforms the data gathered from sensors into a high-level representation. Handcrafted features can be divided into three domains: time, frequency and symbolic [11,13,14,26,27,35].

Time-domain features are obtained by simple statistical calculations, such as the average, median and standard deviation. These features are simple to calculate and understand and have low computational complexity when compared to other feature extraction processes, such as those based on deep neural networks [12,36]. Frequency-domain features capture natural repetitions by decomposing the time series into a set of real and imaginary values representing wave components, through the use of Fourier or wavelet transforms, for example. Symbolic-domain features represent the sensor signals as a sequence of symbols obtained through a discretization process, allowing the data to be compressed into a smaller space than the original [11,27].
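
The sketch below illustrates a few common time- and frequency-domain features computed for one window; the chosen statistics are typical examples and not the exact feature set used in this article.

```python
# Minimal sketch: handcrafted features for one accelerometer window (samples x axes).
import numpy as np

def extract_features(window: np.ndarray) -> np.ndarray:
    feats = []
    for axis in range(window.shape[1]):
        x = window[:, axis]
        # Time domain: simple statistics of the raw samples.
        feats += [x.mean(), np.median(x), x.std(), x.min(), x.max()]
        # Frequency domain: magnitude spectrum from the FFT.
        mag = np.abs(np.fft.rfft(x))
        feats += [mag.argmax(),                      # dominant frequency bin
                  float((mag ** 2).sum()) / len(x)]  # spectral energy
    return np.asarray(feats)

window = np.random.default_rng(0).normal(size=(125, 3))  # one 2.5 s window at 50 Hz
print(extract_features(window).shape)                    # (21,) -> 7 features per axis
```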

Their low computational complexity and simple calculation process keep handcrafted features practicable. The major disadvantage is that creating or selecting features manually is time-consuming, domain-specific and requires specialized knowledge.

#### *2.4. Human Activity Classification*

A machine learning algorithm can automatically detect patterns in a dataset and can be used to make decisions in situations of uncertainty. There are several supervised learning algorithms, such as decision trees, naive Bayes, support vector machines (SVM), artificial neural networks (ANN), logistic regression and K-Nearest Neighbors (KNN) [37]. For these methods, it is essential that the sensor data be converted into a high-level representation, since machine learning models do not work very well when applied directly to the raw data [25].
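
As a minimal sketch of this shallow classification step, the snippet below trains a few of the classifiers listed above on a synthetic feature matrix; the data stands in for the handcrafted features that would be extracted from the windowed signals.

```python
# Minimal sketch: training shallow classifiers on handcrafted features (synthetic data).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 21))    # hypothetical handcrafted feature vectors
y = rng.integers(0, 6, size=500)  # six activity classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for clf in (DecisionTreeClassifier(random_state=0),
            KNeighborsClassifier(n_neighbors=5),
            SVC()):
    acc = clf.fit(X_tr, y_tr).score(X_te, y_te)
    print(type(clf).__name__, round(acc, 3))
```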

More recently, deep learning models have reached human-level performance in various domains, including HAR. These approaches can automatically learn abstract features from sensor data, eliminating the need for a dedicated feature extraction phase because the entire process is performed within the network's hidden layers. Moreover, they outperform traditional ML algorithms when applied to large volumes of data.

For HAR, the most common solutions found in the literature are based on Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) recurrent networks [28,36,38,39]. Unfortunately, one of the main drawbacks of deep learning algorithms is their high computational cost, which can make them unsuitable for real-time HAR applications running on devices with low computational power [36,39].
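
A minimal sketch of such an architecture is given below, combining a 1D convolutional layer with an LSTM layer over windowed inertial data using the Keras API; the layer sizes and window dimensions are illustrative and not the architectures evaluated in this article.

```python
# Minimal sketch: a small CNN + LSTM for windowed inertial data (illustrative sizes).
import tensorflow as tf

n_timesteps, n_axes, n_classes = 125, 6, 6  # e.g., accelerometer + gyroscope, six activities

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_timesteps, n_axes)),
    tf.keras.layers.Conv1D(64, kernel_size=5, activation="relu"),
    tf.keras.layers.MaxPooling1D(pool_size=2),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```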

A crucial stage of the classification process is to assess the performance of the trained machine learning algorithm. The next section presents the most common evaluation metrics used to estimate the future performance of a classifier.

#### *2.5. Evaluation Metrics for HAR Systems*

The performance of a classification model is evaluated by a set of metrics that show, in mathematical terms, how reliable the model under evaluation is [16,40]. The evaluation metrics commonly used in the smartphone-based HAR literature are accuracy, recall (sensitivity), specificity, precision and *F*-measure [15,41]. In HAR, accuracy is the most popular and is calculated by dividing the number of correctly classified activities by the total number of activities. Accuracy gives a general idea of a classification model's performance. However, this metric treats all classes in a dataset as equally important, which makes it unreliable on unbalanced datasets, where it is strongly biased by dominant classes, usually the less relevant background class [15,42]. To avoid unreliable results on unbalanced datasets, there are other metrics that evaluate classes separately, such as precision and recall, as shown in Table 1.

The precision metric is the ratio of true positives to all predicted positives. A precision of 1 means that every sample the classifier labels as positive is indeed positive, i.e., there are no false positives. The recall metric is the ratio of true positives (*TP*) to all actual positives. A low recall value means that the classifier produces a high number of false negatives. Finally, the *F*-measure combines precision and recall into a single score that represents both metrics. High *F*-measure values imply both high precision and high recall. It provides a balance between precision and recall, which is suitable for imbalanced classification problems, including HAR.

**Table 1.** Summarization of accuracy, recall, precision and *F*-measure. *TP* means true positives, *TN* true negatives, *FP* false positives and *FN* means false negatives.

| Metric | Formula |
|---|---|
| Accuracy | (*TP* + *TN*) / (*TP* + *TN* + *FP* + *FN*) |
| Recall | *TP* / (*TP* + *FN*) |
| Precision | *TP* / (*TP* + *FP*) |
| *F*-measure | 2 · (Precision · Recall) / (Precision + Recall) |

Unlike most works, which address all ARP stages, our focus in this article is on the evaluation process and validation methodologies. By looking deeply into the evaluation stage, we aim to understand how human activity recognition models achieve their results according to the validation methodology.

#### **3. Related Works**

Many works in the literature alert researchers to the correct assessment of activity recognition models and, although this problem is widely known, it is often overlooked. Hammerla and Plötz [43] found inappropriate use of *k*-CV in almost half of the studies retrieved in a systematic literature review of works that used accelerometers, wearable sensors or smartphones to predict clinical outcomes, showing that record-wise cross-validation (i.e., splitting segments of the same user's data across folds) often overestimates the prediction accuracy. Nevertheless, HAR system designers often either ignore these factors or neglect their importance. Widhalm et al. [22] also pointed out unnoticed over-fitting caused by autocorrelation (i.e., dependencies between temporally close samples). Hammerla and Plötz [43] showed that adjacent overlapping frames probably record the same activity in the same context and, therefore, share the same information; these adjacent segments are not statistically independent.

Dehghani et al. [29] extend the work of Banos et al. [44] by investigating the impact of Subject Cross-Validation (Subject CV) on HAR, both with overlapping and with non-overlapping sliding windows. The results show that *k*-CV increases the classifier performance by about 10%, and even by 16% when overlapping windows are used. Bulling et al. [15] provide an educational example, demonstrating how different design decisions in HAR applications impact the overall recognition performance.

Gholamiangonabadi et al. [45] examine how well different machine learning architectures generalize to new subjects by using Leave-One-Subject-Out (LOSO) cross-validation on six deep neural network architectures. Their results show that the reported accuracy rises from 85.1% when evaluated with LOSO to 99.85% when evaluated with the traditional 10-fold cross-validation.

In contrast to the reviewed works that deal with validation methodologies, our study examines bias problems that overestimate the predictive accuracy of a machine learning algorithm, using graphical insights obtained from the SHAP framework to understand how human activity recognition models achieve their results according to the validation methodology. While there are works that make use of explainable methods in the HAR context [9,33,46,47], most explainability methods focus on interpreting and making the entire process of building an AI system transparent. Findings focused on the validation methodology are important because they allow us to see which features are relevant to a HAR system under each validation method. We examine how different HAR system approaches generalize to new subjects by using *k*-fold cross-validation, hold-out and leave-one-subject-out cross-validation.

#### **4. Evaluation Procedures**

A common practice for computing a performance metric (e.g., accuracy) in a supervised machine learning experiment is to hold aside part of the data to be used as a test set [16]. Splitting data into training and test sets can be done using various methods, such as hold-out, *k*-fold cross-validation (*k*-CV), leave-one-out cross-validation (LOOCV) and leave-one-subject-out (LOSO). The classifier is then trained on the training set, while its accuracy is measured on the test set; thus, the test set is treated as new data never seen by the model before [16]. We briefly explain these methods in the following sections, and a minimal sketch of how they partition the data is shown below.
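
The sketch assumes scikit-learn's splitting utilities and synthetic data; `subjects` holds the per-window subject identifier, so that `LeaveOneGroupOut` with subject groups corresponds to LOSO.

```python
# Minimal sketch: the four splitting strategies named above (synthetic data).
import numpy as np
from sklearn.model_selection import (KFold, LeaveOneGroupOut, LeaveOneOut,
                                     train_test_split)

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 10))
y = rng.integers(0, 6, size=120)
subjects = np.repeat(np.arange(12), 10)  # twelve subjects, ten windows each

# Hold-out: a single random train/test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# k-fold CV: k rotating train/test partitions over individual samples.
kcv_folds = list(KFold(n_splits=10, shuffle=True, random_state=0).split(X))

# LOOCV: each single sample is the test set exactly once.
loo_folds = list(LeaveOneOut().split(X))

# LOSO: each subject's samples form the test set exactly once.
loso_folds = list(LeaveOneGroupOut().split(X, y, groups=subjects))

print(len(kcv_folds), len(loo_folds), len(loso_folds))  # 10, 120, 12
```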
