#### *7.2. Baselines*

The baselines are shallow approaches based on traditional machine learning algorithms: Random Forest (RF), Naive Bayes (NB), K-Nearest Neighbors (KNN) with *k* = 1, and Simple Logistic (SL). We trained each algorithm on a set of handcrafted features extracted from the time and frequency domains. Table 3 lists the mathematical functions used to create the features for the baselines [7,11,27]. The experiments were executed using the WEKA (Waikato Environment for Knowledge Analysis) library [63].
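For illustration only (the experiments themselves ran in WEKA), the following Python sketch computes a few features of the kind listed in Table 3 for one sensor window; the 50 Hz sampling rate and 128-sample window are assumptions borrowed from common HAR setups:

```python
import numpy as np

def extract_features(window, fs=50.0):
    """A handful of the time- and frequency-domain statistics
    typically listed for HAR baselines (cf. Table 3)."""
    feats = {
        "mean": np.mean(window),
        "std": np.std(window),
        "min": np.min(window),
        "max": np.max(window),
        "median": np.median(window),
        "iqr": np.subtract(*np.percentile(window, [75, 25])),
        "rms": np.sqrt(np.mean(window ** 2)),
    }
    # Frequency domain: spectral energy and dominant frequency of the window.
    spectrum = np.abs(np.fft.rfft(window))
    freqs = np.fft.rfftfreq(len(window), d=1.0 / fs)
    feats["spectral_energy"] = np.sum(spectrum ** 2) / len(window)
    feats["dominant_freq"] = freqs[np.argmax(spectrum)]
    return feats

# Hypothetical 2.56 s window of one accelerometer axis sampled at 50 Hz.
print(extract_features(np.random.randn(128)))
```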

While we acknowledge the benefits of complex models, especially when dealing with large volumes of data, in our HAR context we adopt simple models, such as Random Forest, mainly because of their speed, good performance, and ease of interpretation. Our focus is not on finding the best model for recognizing human activities but on uncovering bias problems that overestimate predictive accuracy because of an inappropriate choice of validation methodology.

**Table 3.** List of all features used in the experiments with the baseline classifiers.


#### *7.3. Evaluation Scenarios*

The experiments are based on the three model types commonly found in the literature: personalized, universal, and hybrid.
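As a sketch of what these three scenarios imply for data splitting (hypothetical toy data and scikit-learn, not the WEKA setup used in the experiments):

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneGroupOut

# Hypothetical stand-ins: feature windows X, activity labels y, one subject id per window.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = rng.integers(0, 6, size=200)
subjects = np.repeat(np.arange(10), 20)

# Universal: leave-one-subject-out (LOSO); the test subject never appears in training.
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=subjects):
    assert not set(subjects[train_idx]) & set(subjects[test_idx])

# Hybrid: plain k-fold CV; each subject's windows can land in both train and test folds.
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    pass  # subjects overlap between train_idx and test_idx here

# Personalized: k-fold CV restricted to one subject's own data.
X_own, y_own = X[subjects == 0], y[subjects == 0]
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_own):
    pass  # train and evaluate on subject 0 only
```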


In addition, we use the SHAP explanation method to understand how machine learning models tend to select different features depending on the validation methodology. For this experiment, we used the UCI dataset, with a 561-feature vector of time- and frequency-domain variables [26], comparing holdout and cross-validation to analyze how the model selects different features to make its predictions. We selected the Random Forest algorithm to conduct this experiment. In Section 8.1 we also provide explanations of individual predictions using SHAP values based on subject number 9 of the UCI dataset.

#### *7.4. Performance Metrics*

To measure the performance of the universal, hybrid, and personalized models, we use standard metrics such as accuracy, precision, recall, and F-measure (or F-score) [1,45] obtained from confusion matrix analysis. We use metrics beyond accuracy because accuracy alone may not be a reliable measure of the actual performance of a classification algorithm, mainly because class imbalance can influence the results. Our research employs the metrics summarized in Table 1 (Section 2).
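For illustration, the sketch below derives these metrics from a confusion matrix with scikit-learn on hypothetical predictions; macro averaging is one common choice under class imbalance:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

# Hypothetical predictions for three of the activity classes.
y_true = ["walk", "walk", "sit", "stand", "sit", "walk"]
y_pred = ["walk", "sit", "sit", "stand", "sit", "walk"]

print(confusion_matrix(y_true, y_pred, labels=["walk", "sit", "stand"]))
print("accuracy:", accuracy_score(y_true, y_pred))

# Macro averaging weights every class equally, which is what makes these
# metrics more informative than accuracy alone under class imbalance.
prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
print(f"precision={prec:.2f}  recall={rec:.2f}  f1={f1:.2f}")
```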

## **8. Results**

This section presents a comparative analysis of the different validation procedures based on the machine learning results and the interpretability methods. We divide the results into four setups. First, we deal with the validation of personalized models. Second, we deal with the validation of universal models. The third scenario deals with the validation of hybrid models. The performance results of the classifiers are presented using accuracy as the main metric for universal models (Figure 3), personalized models (Figure 4) and hybrid models (Figure 5). Finally, we present insights based on Shapley values, which give us an alternative way to analyze and understand the results.

The results presented in Figures 3–5 show that, for all classification algorithms, the personalized models perform very well, the hybrid models perform similarly, and the universal models perform worst. The main reason for this result is that different people may move differently, and a universal model cannot effectively distinguish between activities that are highly specific to the respective user; consequently, it has lower classification performance and a wider confidence interval because of the variance in the population.

The hybrid models performed much closer to the personalized models. Most HAR studies use *k*-fold cross-validation (*k*-CV) to evaluate and compare the performance of algorithms. Mixing the training and test sets results in a classification model that already knows part of its test set. In other words, a model trained using *k*-CV can associate activity patterns of a specific subject across both the training and test sets. The result of this process is a model with higher classification accuracy that, however, does not reflect reality: if we introduce new subjects into the domain, the model will have difficulty recognizing their activities.
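The following sketch contrasts the two accuracy estimates for the same classifier (hypothetical data and scikit-learn, not the original experimental pipeline); on real HAR data, subject mixing makes the *k*-CV score the optimistic one:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, LeaveOneGroupOut, cross_val_score

# Hypothetical stand-ins for sensor windows, activity labels, and subject ids.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = rng.integers(0, 6, size=200)
subjects = np.repeat(np.arange(10), 20)

model = RandomForestClassifier(n_estimators=100, random_state=0)
# k-CV mixes each subject's windows across train and test folds (optimistic).
kcv = cross_val_score(model, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))
# LOSO keeps every test subject entirely out of training (realistic for new users).
loso = cross_val_score(model, X, y, cv=LeaveOneGroupOut(), groups=subjects)
print(f"k-CV accuracy: {kcv.mean():.3f}   LOSO accuracy: {loso.mean():.3f}")
```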

**Figure 3.** Accuracy results based on the universal model for the classifiers 1-NN, Naive Bayes, Random Forest and Simple Logistic.

**Figure 4.** Accuracy results based on the personalized model for the classifiers 1-NN, Naive Bayes, Random Forest and Simple Logistic.

**Figure 5.** Accuracy results based on the hybrid model for the classifiers 1-NN, Naive Bayes, Random Forest and Simple Logistic.

Tables 4 and 5 analyze the performance of the best model (Random Forest) on individual subjects for the universal and personalized scenarios using the SHOAIB dataset. In Table 4, the rows for subjects 1 to 10 represent the folds of the subject-wise cross-validation: for the subject 1 row, the model is trained on subjects 2–10 and evaluated on subject 1, and so on. In Table 5, the rows represent subjects 1 to 10: for the subject 1 row, the model is trained and evaluated using only the data of subject 1, the subject 2 row using only the data of subject 2, and so on.

As can be observed in Table 4, the accuracy of the same model varies greatly among subjects; the *k*-CV used in the personalized models (Table 5) does not expose this variability among subjects. Moreover, the standard deviation is higher for the universal model than for the personalized models. This shows that, if a single generic model is to be used for all users, the standard deviation should be considered when selecting the model.
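A sketch of how the per-subject accuracies and their standard deviation can be computed under LOSO (hypothetical stand-in data; scikit-learn rather than the original WEKA setup):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import LeaveOneGroupOut

# Hypothetical stand-ins for the SHOAIB features, labels, and 10 subject ids.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))
y = rng.integers(0, 6, size=200)
subjects = np.repeat(np.arange(10), 20)

# One accuracy per held-out subject, i.e., one value per row of Table 4.
per_subject = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=subjects):
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    per_subject.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))

print(f"mean={np.mean(per_subject):.3f}  std={np.std(per_subject):.3f}")
```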

Figure 6 allows us to compare the universal model (LOSO) and the personalized model (*k*-CV) using the Random Forest algorithm. As can be seen from the confusion matrices, most classes are correctly classified with very high accuracy. However, the universal model has difficulty differentiating the walking class from the upstairs and downstairs classes. This is expected, as these three activities are very similar, so the underlying data may not be sufficient to accurately differentiate them. For stationary classes, such as standing and sitting, the misclassification rate is very low, demonstrating that the distinction between these classes generalizes well to new subjects.


**Table 4.** Random Forest results for the universal model on the SHOAIB dataset.

**Table 5.** Random Forest results for the personalized model on the SHOAIB dataset.


**Figure 6.** Confusion matrix of the Random Forest algorithm for the universal and personalized models using the SHOAIB dataset. (**a**) Universal model. (**b**) Personalized model.

For methods whose goal is to generate a custom classification model, such as the personalized or hybrid models, *k*-CV works very well. However, it may not be a good validation procedure for universal models. These results show the importance of a careful analysis of these different scenarios.

#### *8.1. Global Explanations Using SHAP Values*


In this section, we show through a summary plot [59,60] how validation methodologies affect feature importance and also discuss strategies to avoid potential issues when pre-processing data.

The summary plot combines feature importance with feature effects, considering the mean absolute Shapley values across all classes. For a multi-class problem, it shows the impact of each feature on each of the different classes. The position on the y-axis is determined by the feature and that on the x-axis by the Shapley value.
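A minimal sketch of producing such a summary plot with the `shap` package; synthetic data and the classic list-per-class `TreeExplainer` output stand in for the 561-feature UCI setup:

```python
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the UCI data: 6 activity classes, 20 features.
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           n_classes=6, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # classic API: one array per class

# Bar summary: mean |SHAP| per feature, stacked across the six classes.
shap.summary_plot(shap_values, X, plot_type="bar", max_display=20)
```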

Figures 7 and 8 show that the features each model selects change with the validation methodology. The slightly different importance that the classification model assigns to features under each validation methodology causes a great boost in performance for cross-validation. Moreover, the CV top 20 has 17 features in common with the holdout top 20, a difference of 15%.

Given a feature, we also extract the proportion of its importance attributed to each class. The results also show that the importance each feature assigns to the classes differs according to the adopted methodology. Analyzing the contribution of each class for each feature, for example, Feature 53 (related to the accelerometer gravity component on the x-axis) contributes most to the class "walking upstairs" under the holdout methodology, while it contributes most to the class "laying" under CV. Similar results can be observed for Features 97, 57, 41 and many others.
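One way to compute this per-class importance proportion from SHAP values; a sketch assuming the classic list-per-class `TreeExplainer` output from the previous snippet, with Feature 53's index used purely as an example:

```python
import numpy as np

def class_importance_share(shap_values, feature_idx):
    """Proportion of a feature's total mean |SHAP| attributed to each class.

    Assumes the classic list-per-class output of TreeExplainer
    (one (n_samples, n_features) array per class), as in the sketch above.
    """
    per_class = np.array([np.abs(sv[:, feature_idx]).mean() for sv in shap_values])
    return per_class / per_class.sum()

# e.g., class_importance_share(shap_values, 53) for Feature 53.
```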

When using cross-validation, the classifier already knows each individual's patterns because their data can be shared between training and testing. Knowing the individual's pattern, the classifier can choose features that best suit the individual's behavior. Models trained using disjoint training and test sets are more realistic because they reflect the approximate performance that would be observed in the real world; thus, it is possible to choose classifiers that select more generic features that better represent the population or group of individuals.

**Figure 7.** Summary plot of the SHAP analysis using the holdout methodology on the UCI dataset. It shows the mean absolute SHAP values of the 20 most important features for the six activities.

**Figure 8.** Summary plot of the SHAP analysis using cross-validation on the UCI dataset. It shows the mean absolute SHAP values of the 20 most important features for the six activities.

While the results presented are promising, in many situations it is not trivial to find meaning in the statistical features extracted from inertial sensors in HAR. Nevertheless, our results show that the adopted validation methodology can significantly influence the selection of features and thereby overestimate the results, which is not appropriate for real-world applications.

#### *8.2. Explaining Individual Predictions*

In this section, we take a closer look at individual predictions to understand how validation influences the model at the instance level. For this purpose, we present results based on the SHAP force plot [60,64]. The force plot shows the SHAP value contributions to the final prediction using an additive force layout: feature attributions are visualized as "forces". The prediction starts from the baseline, which for Shapley values is the average of all predictions. In the plot, each Shapley value is an arrow that pushes the prediction up (positive value, red) or down (negative value, blue). These forces balance each other out at the actual prediction for the data instance.
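A minimal sketch of generating such a force plot with the `shap` package; synthetic stand-in data, the classic list-per-class API, and arbitrary instance/class indices are assumptions:

```python
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Same synthetic stand-in setup as the summary-plot sketch in Section 8.1.
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           n_classes=6, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # classic API: one array per class
instance, cls = 0, 0                    # arbitrary window and class indices

# Arrows push from the baseline (the average prediction) toward the final output.
shap.force_plot(explainer.expected_value[cls],
                shap_values[cls][instance],
                X[instance], matplotlib=True)
```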

We used data from subject 9 of the UCI dataset to conduct this experiment. For the walking activity, the model achieves high accuracy under the cross-validation methodology. These results are expected and also confirm the results presented earlier in this section. Under both methodologies, the top five features that contribute negatively to the model prediction are the same. Beyond those, the model tends to give more importance to a different set of features depending on the chosen methodology.

As shown in Figure 9b, half (50%) of the top 10 most important features differ between holdout and cross-validation. Feature 70, based on the accelerometer, is ranked as one of the most important for the walking class. Features such as 393 and 508 are ranked as important under holdout but do not appear under cross-validation. Cross-validation instead top-ranks features such as Feature 57, based on the accelerometer energy.

For non-stationary activities, such as walking and walking upstairs (Figure 10), the model shows a greater difference in prediction performance than for stationary activities. The model accuracy can be up to 10% higher when the cross-validation methodology is used. Moreover, the model can select up to 50% different features to predict the user's activity depending on the methodology.

Figure 10 shows that the difference in model accuracy can reach up to 10%, confirming that the classifier can overestimate the results when it knows an individual's patterns, choosing the features that best represent that individual. The results for the walking upstairs activity are similar to those for walking: among the top 10 features, up to 50% differ between holdout and cross-validation. Features such as 71 are marked as relevant under holdout for the walking upstairs class but do not even appear under cross-validation.

Figures 11 and 12 show that, for stationary activities (e.g., the standing activity), the model presents similar performance in terms of accuracy under both methodologies, CV and holdout. However, there are differences in the features selected for decision making. For the standing class, features such as 439 are marked as important under cross-validation but do not appear in the holdout top 10.


**Figure 9.** Walking activity for cross-validation (**a**) and holdout (**b**) validation methodology.

**Figure 10.** Walking upstairs activity for cross-validation (**a**) and holdout (**b**) validation methodology.

**Figure 11.** Standing activity for cross-validation (**a**) and holdout (**b**) validation methodology.

**Figure 12.** Sitting activity for cross-validation (**a**) and holdout (**b**) validation methodology.

We can observe in these studies that, when analyzing predictions individually for a given class, the classifier can change the order of importance among the features, and these changes are most drastic when holdout and cross-validation are compared.
