#### *4.1. Hold-Out*

The hold-out is the simplest form of data splitting and relies on a single split of the dataset into two mutually exclusive subsets, called the training set and the test set [16,42]. A common split uses 70% or 80% of the data for training and 30% or 20% for testing. The advantage of this method is its low computational load, since the model only needs to be trained and evaluated once. The hold-out is a pessimistic estimator because the classifier is trained with only part of the samples. If more data are left for testing, the bias of the classifier will be higher, but if only a few samples are used for testing, the confidence interval for the accuracy will be wider [42]. Moreover, if the data are split again, the results of the model will probably change, which means that the estimated accuracy depends on the subject(s) selected for the evaluation [45].
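As an illustration, a minimal hold-out evaluation can be written with scikit-learn's `train_test_split`; the random forest classifier and the synthetic data below are placeholders, not the models or datasets used later in this work.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Placeholder data: 1000 samples with 20 features and binary labels.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 20)), rng.integers(0, 2, size=1000)

# Single 70/30 split: the model is trained once and evaluated once.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("hold-out accuracy:", clf.score(X_test, y_test))
```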

#### *4.2. K-fold Cross-Validation (k-CV)*

The *k*-CV consists of averaging several hold-out estimators corresponding to different data splits [16,37,40]. This procedure randomly divides the dataset (from one subject or from all subjects) into *k* disjoint folds of approximately equal size, and each fold is in turn used to test the classification model induced from the remaining *k* − 1 folds. The overall performance is then computed as the average of the *k* accuracies resulting from the *k*-CV [40,42]. The disadvantage of this strategy is its computational cost when *k* is relatively high and the dataset is large. In addition, no single cross-validation procedure is universally better; the choice should be guided by the particular setting [16].
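A corresponding sketch of *k*-CV with scikit-learn and placeholder data is shown below; the choice of *k* = 10 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier

# Placeholder data: 1000 samples with 20 features and binary labels.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 20)), rng.integers(0, 2, size=1000)

# k disjoint folds; each fold is used exactly once as the test set.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)
print("mean accuracy over 10 folds:", scores.mean())
```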

#### *4.3. Leave-One-Subject-Out Cross-Validation (LOSO)*

The LOSO (Leave-One-Subject-Out Cross-Validation) strategy aims at finding out whether a model trained on a group of subjects generalizes well to new, unseen subjects. It is a variant of the *k*-fold cross-validation approach in which each fold consists of the samples of a single subject [45]. To measure this, we have to ensure that all of the samples in the test fold come from subjects that are not represented at all in the paired training fold.

The LOSO strategy uses a subset of size *p* for testing and *n* − *p* for training, where the *p* test samples all come from a single subject. This procedure ensures that the same subject is not represented in the training and test sets at the same time. This configuration allows evaluating how the model generalizes to data from new subjects; otherwise, if the model learns person-specific features, it may fail to generalize to new subjects.

When the number of subjects in a dataset is small, it is common to adopt LOSO to evaluate the performance of a classification algorithm. LOSO is an extreme case of *k*-CV in which the number of folds equals the number of subjects in the dataset. It has high variability, as only one subject at a time is used as the validation set to test the model prediction. This exhaustive procedure should be used when the random partition in *k*-CV has a large impact on performance evaluation [40,42].
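In scikit-learn terms, LOSO corresponds to `LeaveOneGroupOut` with the subject identifier used as the group label; the subject array below is purely illustrative.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, LeaveOneGroupOut
from sklearn.ensemble import RandomForestClassifier

# Placeholder data with an illustrative subject identifier per sample (10 subjects).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 20)), rng.integers(0, 2, size=1000)
subjects = rng.integers(0, 10, size=1000)

# One fold per subject: all samples of the held-out subject form the test set.
loso = LeaveOneGroupOut()
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=loso, groups=subjects)
print("per-subject accuracies:", scores)
```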

#### *4.4. Types for HAR Systems*

The main goal of machine learning algorithms is to develop models that work not only on the specific dataset on which they were trained but also on new, unseen data. However, what does new data mean in human activity recognition problems? To answer this question, we need to know the purpose for which the algorithm is being developed. If the model is developed specifically for one subject, new data means new samples, or records, from the same subject; this falls into the category of personal systems. However, if the goal is to develop a universal system that can classify activities from a new subject, new data means new subjects.

Each type of HAR system addresses a slightly different learning problem and makes different assumptions about how the learning algorithm is applied [45,48]. There are three types of HAR systems [45,49,50], as shown in Figure 2: universal or generalized, personal or personalized, and hybrid.

**Figure 2.** Visualization of each procedure used to build the model types for HAR systems. To build universal models, the LOSO procedure (**a**) separates training and test sets by subject. For personalized models (**b**), *k*-CV is used with samples from only one subject. Finally, for hybrid models (**c**), *k*-CV is also used, but in this case, for a group of subjects, the data are split by samples into the training and test sets.

Universal systems must be capable of generalizing to the patterns of any subject. The most common validation procedure used in this context is leave-one-subject-out cross-validation (LOSO) [45,51]. LOSO takes the subject information into account when splitting the training and test sets, which prevents data from the same subject from being present in both sets, as shown in Figure 2a.

Personalized systems aim at creating models that are experts in recognizing patterns from the same subject. This is called personalized validation [8,34,49,50,52]. In personalized systems, the selected machine learning algorithm is trained and tested with data from only one subject. Thus, the samples from this subject are divided between training and testing, as shown in Figure 2b.

Most studies in HAR validate model performance with a hybrid system that uses *k*-CV as the validation methodology. It is hybrid because samples from more than one subject are mixed, so data from the same subject can appear in both the training and the test sets, as shown in Figure 2c [25].
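The three configurations in Figure 2 differ only in how the data splitter uses the subject labels. A compact sketch with placeholder arrays might look as follows:

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneGroupOut

# Placeholder samples, activity labels and illustrative subject identifiers.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 20)), rng.integers(0, 6, size=1000)
subjects = rng.integers(0, 10, size=1000)

# (a) Universal model: held-out subjects never appear in the training folds.
universal_cv = LeaveOneGroupOut().split(X, y, groups=subjects)

# (b) Personalized model: k-fold CV restricted to the samples of a single subject.
mask = subjects == 0
personal_cv = KFold(n_splits=5, shuffle=True, random_state=0).split(X[mask], y[mask])

# (c) Hybrid model: plain k-fold CV over the pooled samples of all subjects, so the
#     same subject can contribute data to both the training and the test folds.
hybrid_cv = KFold(n_splits=5, shuffle=True, random_state=0).split(X, y)
```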

Each system has a specialization, and this determines how training and test data are partitioned for evaluation. The next section discusses the evaluation that system designers should consider for each model type.

#### **5. A Fair Evaluation for Human Activity Recognition Systems**

Most machine learning algorithms assume Independent and Identically Distributed (i.i.d.) data [40]. If the recruitment process is i.i.d., the subjects will be a representative sample of the overall population. Otherwise, if the i.i.d. assumption is not ensured, as when several samples are collected from each subject while recording activities, the sampling process might generate groups of dependent samples. Therefore, *k*-CV is not a good procedure for validating universal models because of the temporal dependency among samples from the same subject. This means that a model trained with the *k*-CV procedure already knows the activity patterns of a specific subject, whose data are shared between the training and test sets. This situation may lead the model to learn a biased solution, where the machine learning algorithm finds a strong association with unique features of a subject (e.g., walking speed), artificially increasing its accuracy on the test set [20,45]. This explains why some systems report such high accuracies.

To minimize the problem of weak generalization, the data should be handled with a validation procedure that is fair with respect to the application purpose. Applying *k*-CV to new samples or to new subjects does not measure the same thing, and the choice should be determined by the application scenario, not by the statistics of the data. For instance, if the generative process has some kind of group structure, such as samples collected from different subjects, experiments or measurement devices, it is more appropriate to use cross-validation by subject or by group. In this procedure, preserving independence means that all of a subject's data must be left out of the training folds. For applications that aim at a personalized classification model, the traditional *k*-CV is an acceptable procedure.
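With scikit-learn, such group-aware cross-validation can be expressed with `GroupKFold` (or `LeaveOneGroupOut`, as in Section 4.3), where the subject identifier plays the role of the group; the data below are again placeholders.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Placeholder samples, labels and illustrative subject identifiers.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 20)), rng.integers(0, 2, size=1000)
subjects = rng.integers(0, 10, size=1000)

# Each subject's samples stay together, so no subject is split between
# the training and the test folds.
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=GroupKFold(n_splits=5), groups=subjects)
print("group-wise CV accuracies:", scores)
```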

#### **6. Explainable Algorithms for Creating Fairer Systems**

Explaining the decisions made by a machine learning model is extremely important in many applications. Explainable models can provide valuable information on how to improve them and also help to better understand the problem and the information provided by the input variables [53].

Identifying issues like biased data could allow system designers to select the sensitive attributes on which they want to focus their evaluations. This is a key capability of explainability, both when the explicit purpose is evaluating fairness and in non-fairness-related explanations where certain features should weigh more or less heavily in class selection than others. Mitigating bias and unfairness in the training data is a necessity, both out of ethical duty and because of the impact that perceived inaccuracies have on user trust [54–56].

More recently, XAI methods have been proposed to help interpret the predictions of machine learning models, for example, LIME [57], DeepLIFT [58] and SHAP [59]. XAI methods have been used in the HAR context to understand the rationale behind the predictions of the classifier [9,33,46,47]. In this work, we choose a unified framework for interpreting model predictions, called SHAP (Shapley additive explanations), to explain graphically and intuitively the results of the different validation methodologies used in HAR systems.

SHAP (Shapley additive explanations) [59] is based on a game-theoretic approach extensively used in the literature to explain the predictions of any machine learning model. The Shapley values act as a unified measure of feature importance. SHAP aims to explain the prediction of an instance *x* by computing the contribution of each feature to that prediction. In summary, the Shapley values give each feature of that instance a score that distributes the prediction across the features.

Algorithm 1 shows the pseudo-code for approximating the Shapley estimation for a single feature value [60]. First, select an instance of interest *x*, a feature *j* and the number of iterations *M*. In each iteration, a random instance *z* is selected from the data and a random order of the features is generated. Two new instances are created by combining values from the instance of interest *x* and the sample *z*. The instance $x_{+j}$ is the instance of interest, but all values in the order after feature *j* are replaced by feature values from the sample *z*. The instance $x_{-j}$ is the same as $x_{+j}$, but in addition has feature *j* replaced by the value of feature *j* from the sample *z*. The difference in the prediction from the black box is computed as $\phi_j^m = \hat{f}(x_{+j}^m) - \hat{f}(x_{-j}^m)$, and all differences are averaged, resulting in $\phi_j(x) = \frac{1}{M} \sum_{m=1}^{M} \phi_j^m$. Averaging implicitly weighs samples by the probability distribution of *X*. The procedure has to be repeated for each of the features to get all Shapley values.
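A minimal Python sketch of this sampling-based approximation is given below; it assumes a generic fitted model `f` with a `predict` method and a NumPy data matrix `X`, and it is an illustration of the procedure rather than the implementation provided by the SHAP library.

```python
import numpy as np

def shapley_single_feature(f, x, j, X, M=100, seed=0):
    """Monte Carlo approximation of the Shapley value of feature j for instance x.

    f is any fitted model exposing predict(); for classifiers, a probability
    output such as predict_proba may be preferable. X is the data matrix.
    """
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    contributions = np.empty(M)
    for m in range(M):
        z = X[rng.integers(X.shape[0])]        # random instance from the data
        order = rng.permutation(n_features)    # random feature order
        pos = np.where(order == j)[0][0]
        x_plus, x_minus = x.copy(), x.copy()
        # x_plus: values after feature j (in the random order) come from z.
        x_plus[order[pos + 1:]] = z[order[pos + 1:]]
        # x_minus: same as x_plus, but feature j itself is also taken from z.
        x_minus[order[pos:]] = z[order[pos:]]
        contributions[m] = (f.predict(x_plus[None, :])[0]
                            - f.predict(x_minus[None, :])[0])
    # phi_j(x): average marginal contribution of feature j over M samples.
    return contributions.mean()
```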

#### **Algorithm 1** SHAP basic algorithm

1: **Required**: **M**: Number of iterations; **x**: Instance of interest; **j**: Feature index; **X**: Data matrix; **f**: Machine learning model.


The Shapley values can be combined into global explanations [59,60] by running the SHAP algorithm for every instance to obtain a matrix of Shapley values, with one row per data instance and one column per feature. We can interpret the entire model by analyzing the Shapley values in this matrix. The idea behind SHAP feature importance is simple: features with large absolute Shapley values are important. Since we want the global importance, we average the absolute Shapley values per feature across the data.

In this sense, the SHAP framework can explain the decision-making of a classification model globally by summarizing how the model generates its outputs. Global explanations are beneficial as they might reveal biases and help diagnose model problems [61]. SHAP can also explain model predictions at the instance level, since each observation gets its own set of SHAP values. This greatly increases the transparency of the model.

We have used TreeExplainer [59,62], a method designed for explaining tree-based models, which provides fast and accurate results by computing the SHAP values directly from the structure of the trees.
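A typical TreeExplainer workflow with the `shap` package is sketched below; the random forest model, the placeholder data and the class index used to read the output are assumptions made only for this illustration.

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

# Placeholder data and model; in the paper the features are the time- and
# frequency-domain features extracted from the accelerometer signals.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(500, 10)), rng.integers(0, 2, size=500)
model = RandomForestClassifier(random_state=0).fit(X, y)

# TreeExplainer computes the SHAP values by exploiting the tree structure.
explainer = shap.TreeExplainer(model)
explanation = explainer(X)          # per-instance, per-feature contributions

# For a classifier the values carry an extra class dimension; indexing the
# second class here is an assumption that may vary across shap versions.
values = explanation.values[..., 1]

# Global importance: mean absolute Shapley value of each feature over all instances.
global_importance = np.abs(values).mean(axis=0)
print("feature ranking (most important first):", np.argsort(global_importance)[::-1])
```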

#### **7. Experimental Protocol**

This section describes the experimental protocol, considering four evaluation scenarios. We detail the datasets used in this study, the baselines built with time- and frequency-domain features, and the performance metrics.

#### *7.1. Datasets*

The physical activity data used in this work were obtained from three publicly available datasets: SHOAIB (SH) [2], WISDM [25] and UCI [26]. Table 2 presents a summary of the datasets used in our study [11].


**Table 2.** Summary of the SHOAIB (SH) [2], WISDM [25] and UCI [26] datasets. Items marked with (\*) were not used in the experiments.

In our experiments, we use only accelerometer data. For the WISDM dataset, we chose the users who performed all activities, totaling 19 individuals. For the SHOAIB dataset, we selected the six activities most similar to those of the WISDM and UCI datasets, so that all datasets had the same number of classes and the results could be compared; for this reason, we removed the jogging and biking activities from our experiments. The SHOAIB dataset contains data collected from five different body positions, which were merged to run our experiments. Moreover, SHOAIB is balanced, which should yield a fairer evaluation, meaning a reduction in the bias caused both by individuals with more samples and by unbalanced class labels.
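A sketch of this filtering step with pandas is shown below; the file name, column names and activity labels are hypothetical and do not correspond exactly to the original datasets.

```python
import pandas as pd

# Hypothetical long-format accelerometer dataframe with columns
# "user", "activity", "x", "y", "z"; file and column names are placeholders.
df = pd.read_csv("accelerometer_samples.csv")

# Keep only the users who performed every activity in the dataset.
n_activities = df["activity"].nunique()
activities_per_user = df.groupby("user")["activity"].nunique()
complete_users = activities_per_user[activities_per_user == n_activities].index
df = df[df["user"].isin(complete_users)]

# Drop the activities that are not shared by all three datasets.
df = df[~df["activity"].isin(["jogging", "biking"])]
```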
