**9. Conclusions**

In this paper, we presented and discussed in detail the importance of a proper evaluation procedure during the design and assessment of a HAR system based on inertial sensors. We conducted several experiments that exemplify the overall impact of the evaluation procedure on accuracy and fairness, and we illustrated how these procedures can be set up so as to reduce accuracy overestimation. For this task, the tests were performed on three datasets (UCI-HAR, SHOAIB and WISDM) using *k*-fold cross-validation (*k*-CV) and leave-one-subject-out (LOSO) validation procedures. The main conclusions drawn from these results are summarized below.

The models evaluated with *k*-CV on the data achieved 98% accuracy. However, when the individual information is taken into account (i.e., the label associating each segment with its subject), accuracy reaches 85.37% in the best scenario. Choosing the more appropriate evaluation method therefore costs roughly 12 percentage points of accuracy; in other words, the initial result was overestimated by about 12 points.
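The mechanism behind this overestimation can be reproduced with a minimal, self-contained sketch. Everything below is synthetic and illustrative (the nearest-neighbour classifier, the subject "fingerprint" offsets and all variable names are our own assumptions, not the paper's pipeline): record-wise *k*-fold lets a classifier exploit subject-specific signal, while LOSO exposes it to unseen subjects.

```python
# Illustrative sketch (synthetic data, not the paper's datasets):
# record-wise k-fold mixes segments from the same subject across train
# and test, so a classifier can latch onto a subject-specific offset
# and look far better than it would on people it has never seen.
import random

random.seed(0)

def make_data(n_subjects=6, n_segments=40):
    """Synthetic segments: (feature, activity_label, subject_id)."""
    data = []
    for s in range(n_subjects):
        offset = random.uniform(0.0, 100.0)   # subject "fingerprint"
        for i in range(n_segments):
            activity = i % 2                  # two activities
            x = offset + activity + random.gauss(0.0, 0.4)
            data.append((x, activity, s))
    return data

def nn_accuracy(train, test):
    """1-nearest-neighbour accuracy on 1-D features."""
    correct = 0
    for x, y, _ in test:
        pred = min(train, key=lambda t: abs(t[0] - x))[1]
        correct += pred == y
    return correct / len(test)

data = make_data()

# Record-wise 5-fold CV: each subject's segments land on both sides.
random.shuffle(data)
folds = [data[i::5] for i in range(5)]
kfold_acc = sum(
    nn_accuracy([seg for j, f in enumerate(folds) if j != i for seg in f],
                folds[i])
    for i in range(5)
) / 5

# LOSO: the test subject contributes nothing to the training set.
subjects = {s for _, _, s in data}
loso_acc = sum(
    nn_accuracy([d for d in data if d[2] != s],
                [d for d in data if d[2] == s])
    for s in subjects
) / len(subjects)

print(f"record-wise 5-fold accuracy: {kfold_acc:.2f}")  # optimistic
print(f"LOSO accuracy:               {loso_acc:.2f}")   # realistic, lower
```

Because the subject offset is the dominant signal here, the record-wise score is driven largely by subject identity rather than activity, which is exactly the leakage LOSO is designed to remove.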

The universal model performs poorly in the test phase and also has wider margins of error than the personalized models, which shows that this model struggles to recognize new subjects. In general, such a model may perform well during training yet degrade on the test set, a symptom of overfitting. Traditional *k*-CV is therefore not the best choice for building a universal model. For this scenario, the recommended validation procedure is LOSO, or even a subject-wise holdout in scenarios where implementing LOSO has a significant impact on training cost.
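When LOSO is too expensive (one training run per subject), a subject-wise holdout keeps the essential property, namely that no subject appears on both sides of the split, while training only once. A minimal sketch, with placeholder data and a hypothetical `(feature, activity, subject_id)` segment layout of our own choosing:

```python
# Sketch of a subject-wise holdout split (placeholder data): reserve a
# fixed group of subjects for testing instead of iterating over all of
# them as LOSO does, so the model is trained only once while still
# being evaluated on people it has never seen.
import random

random.seed(1)

# Synthetic segments: (feature, activity_label, subject_id).
segments = [(random.random(), i % 3, s)
            for s in range(10) for i in range(20)]

# Hold out ~20% of the subjects (not 20% of the segments).
subjects = sorted({s for _, _, s in segments})
random.shuffle(subjects)
held_out = set(subjects[: max(1, len(subjects) // 5)])

train = [seg for seg in segments if seg[2] not in held_out]
test = [seg for seg in segments if seg[2] in held_out]

# No subject contributes segments to both sides of the split.
assert {s for _, _, s in train}.isdisjoint({s for _, _, s in test})
```

The key design choice is splitting on subject identifiers rather than on segments; a plain record-wise holdout would reintroduce the same leakage as record-wise *k*-CV.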

For personalized models, there is no problem in using *k*-CV as the validation procedure since, for this type of application, the algorithm should aim at a model that fits a single user. In this scenario, the classification algorithms reach higher accuracy because the model is trained on instances very similar to those found in the test set. The very high accuracy values indicate that this is a suitable way of evaluating a customized application for a specific user. In addition, personalized models can be trained with dramatically less data.
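In the personalized setting, *k*-CV over a single user's own recordings is legitimate because the deployed model will only ever serve that user. A minimal per-user sketch with synthetic data (the two-activity generator and the nearest-neighbour classifier are illustrative assumptions, not the paper's method):

```python
# Sketch of per-user k-fold CV for a personalized model: every fold is
# drawn from one subject's own recordings, which is acceptable here
# because train/test similarity mirrors the deployment scenario.
import random

random.seed(2)

# One user's synthetic segments: (feature, activity_label).
user_segments = [(random.gauss(a, 0.5), a)
                 for a in (0, 1) for _ in range(30)]
random.shuffle(user_segments)

k = 5
folds = [user_segments[i::k] for i in range(k)]

def nearest_neighbor_acc(train, test):
    correct = 0
    for x, y in test:
        pred = min(train, key=lambda t: abs(t[0] - x))[1]
        correct += pred == y
    return correct / len(test)

scores = [
    nearest_neighbor_acc(
        [seg for j, f in enumerate(folds) if j != i for seg in f],
        folds[i])
    for i in range(k)
]
print(f"per-user {k}-fold accuracy: {sum(scores) / k:.2f}")
```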

Hybrid models are used in many related works, but they are not well suited to real-world situations; most commercial HAR applications favor either universal or personalized models. The results obtained with this model also show higher classification accuracy because segments belonging to the same subject may appear in both the training and test sets, leading to an overoptimistic result. Again, this does not reflect the true accuracy.

We have shown how the SHAP framework can provide graphical insights into how human activity recognition models achieve their results. Our work has presented ways in which explainable algorithms can be used to improve the transparency of machine learning model building. The SHAP results reinforce that an incorrect choice of validation methodology changes which attributes the models rely on to improve their performance. This situation may cause poor prediction performance and can lead to unreliable results.

Our evaluations also reveal that, while current XAI tools provide important functions for data and model analysis, they still fall short when it comes to analyzing results in scenarios where it is not trivial to find meaning in the statistical features extracted from sensor data.

**Author Contributions:** All authors (H.B., J.G.C., H.A.B.F.O. and E.S.) designed the study. H.B. and J.G.C. implemented and interpreted the experiments and wrote the manuscript. H.B., J.G.C., H.A.B.F.O. and E.S. reviewed and edited the article. J.G.C., H.A.B.F.O. and E.S. supervised the overall work and reviewed the experimental results. All authors (H.B., J.G.C., H.A.B.F.O. and E.S.) contributed to discussing, reviewing and revising the article and approved the final manuscript. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research, according to Article 48 of Decree nº 6.008/2006, was partially funded by Samsung Electronics of Amazonia Ltda, under the terms of Federal Law nº 8.387/1991, through agreement nº 003/2019, signed with ICOMP/UFAM, and by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior—Brasil (CAPES)—Finance Code 001. The funding bodies played no role in the design of the study, the collection, analysis and interpretation of data, or the writing of the manuscript.

**Data Availability Statement:** The WISDM dataset can be found at http://www.cis.fordham.edu/wisdm/includes/datasets/latest/WISDM\_ar\_latest.tar.gz (accessed on 23 May 2019). The SHOAIB dataset can be found on the author's ResearchGate profile or at https://www.researchgate.net/profile/Muhammad\_Shoaib20/publication/266384007\_Sensors\_Activity\_Recognition\_DataSet/data/542e9d260cf277d58e8ec40c/Sensors-Activity-Recognition-DataSet-Shoaib.rar (accessed on 12 February 2020). The UCI-HAR dataset is available at https://archive.ics.uci.edu/ml/machine-learning-databases/00240/UCI%20HAR%20Dataset.zip (accessed on 22 May 2019).

**Conflicts of Interest:** The authors declare that they have no competing interests. The funders had no role in the design of the study; in the collection, analyses or interpretation of data; in the writing of the manuscript or in the decision to publish the results.
