**6. Conclusions**

This paper hypothesizes that models using HC features generalize better across domains than deep learning models in HAR tasks. Three OOD settings were implemented by testing on unseen users and on unseen (single- or multi-source) datasets. Five public datasets were homogenized so that they could be combined in different ways to create diverse tasks.

Several metrics were used to quantify the degree of OOD across the four domain generalization settings. The DC metric was used to validate our OOD settings. In turn, the Wasserstein distance ratio was chosen as our primary metric for the study, since it ranked our three OOD settings in the expected order.
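As an illustration of the idea, a distance ratio of this kind can be computed by dividing the Wasserstein distance between training and test features by a within-training reference distance. The sketch below is a minimal, hypothetical version, not the exact formulation used in the paper: it approximates the multivariate distance by averaging per-dimension 1-D Wasserstein distances, and all function names are illustrative.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def avg_wasserstein(a: np.ndarray, b: np.ndarray) -> float:
    """Mean per-dimension 1-D Wasserstein distance between two feature sets."""
    return float(np.mean([wasserstein_distance(a[:, j], b[:, j])
                          for j in range(a.shape[1])]))

def distance_ratio(train_feats: np.ndarray, test_feats: np.ndarray,
                   seed: int = 0) -> float:
    """Train-test distance divided by a within-train reference distance.

    The reference is obtained by randomly splitting the training set in half,
    so a ratio near 1 suggests an ID-like test set, while larger values
    suggest a stronger distribution shift.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(train_feats))
    half = len(train_feats) // 2
    reference = avg_wasserstein(train_feats[idx[:half]], train_feats[idx[half:]])
    return avg_wasserstein(train_feats, test_feats) / reference
```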

In our main experiments, it was verified that, although deep models achieve better ID performance, they are outperformed in all three OOD settings by shallow models using features computed from domain knowledge. Furthermore, as the drop in F1-score in the OOD settings is less pronounced for the classic models, it can be inferred that HC features are more robust. Hybrid models achieved intermediate results between the deep and classic methods, supporting the idea that HC features can stabilize training, which helps to validate our hypothesis.
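To make the robustness comparison concrete, the relative drop in F1-score from the ID to an OOD setting can serve as a simple robustness proxy. The following sketch uses illustrative numbers only, not results from the paper:

```python
def relative_f1_drop(f1_id: float, f1_ood: float) -> float:
    """Relative drop in F1-score from the ID to an OOD setting (0 = no drop)."""
    return (f1_id - f1_ood) / f1_id

# Illustrative numbers only, not results from the paper: the model with the
# smaller relative drop is considered more robust, even with a lower ID score.
print(f"{relative_f1_drop(0.95, 0.70):.0%}")  # deep model: 26% drop
print(f"{relative_f1_drop(0.90, 0.80):.0%}")  # HC-feature model: 11% drop
```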

Acknowledging that current deep learning techniques are less robust in OOD settings than models based on HC features, we believe our work can pave the way for further research on novel training methods that make deep learning models more robust, bridging the generalization gap toward new, more trustworthy gold standards in the field of HAR.

**Author Contributions:** Conceptualization, N.B., J.R., M.B. and A.V.C.; data curation, N.B., J.R. and M.B.; methodology, N.B., J.R., M.B., A.V.C. and A.C.; software, N.B. and J.R.; validation, N.B., J.R., M.B., A.V.C. and A.C.; formal analysis, N.B., J.R. and A.C.; writing—original draft preparation, N.B.; writing—review and editing, N.B., J.R., M.B., A.V.C., A.C., F.C. and H.G.; visualization, N.B., J.R. and M.B.; supervision, M.B., A.V.C., F.C. and H.G. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work is financed by national funds through FCT—Fundação para a Ciência e a Tecnologia, I.P., within the scope of SAIFFER project under the Eureka Eurostars program (E!114310).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A. Supplementary Experiments**

Figure A1 shows the behavior of different models over all four domain generalization settings addressed in the study, in comparison to TSFEL+LR, the approach with the highest generalization performance. As in the main results, an inversion tendency can be observed from the ID to the OOD regime.

Figure A1a shows the widest performance gap in the OOD regime. This gap is mitigated for the hybrid model (Figure A1b) and becomes much smaller in Figure A1c, where handcrafted features are the only source of information.

**Figure A1.** F1-score vs. log(distance ratio). Each marker represents a different task. Distance ratios are based on the CNN-base embeddings. Error bars represent one standard deviation away from the mean. (**a**) TSFEL + LR vs. ResNet. (**b**) TSFEL + LR vs. CNN-base hybrid. (**c**) TSFEL + LR vs. TSFEL + MLP.

Using TSFEL features to compute the distance ratios (see Figure A2) leads to the same conclusions. However, the plots in Figures 6 and A1 are based on the CNN-base embeddings, as those distance ratios presented fewer outliers.

Figure A3 shows the confusion matrices for the ID, OOD-U, and OOD-MD settings of the SAD dataset. As expected, performance decreases in the OOD settings.

**Figure A2.** TSFEL + LR vs. CNN-base. Distance ratios are based on TSFEL features.

**Figure A3.** Confusion matrices for the SAD dataset. (**a**) In-distribution (ID). Accuracy: 99.0%, F1-score: 98.9%; (**b**) Out-of-distribution, leaving users out (OOD-U). Accuracy: 91.5%, F1-score: 91.6%; (**c**) Out-of-distribution, leaving a dataset out (OOD-MD). Accuracy: 76.0%, F1-score: 73.5%.
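For reference, the accuracy and F1-scores reported alongside these confusion matrices are standard multi-class metrics. Below is a minimal sketch using scikit-learn, with placeholder label arrays rather than the SAD data, and assuming macro-averaged F1:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Placeholder label arrays standing in for one evaluation setting
# (e.g., ID, OOD-U, or OOD-MD predictions on the SAD dataset).
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])

cm = confusion_matrix(y_true, y_pred)           # rows: true class, columns: predicted
acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average="macro")  # assuming macro averaging
print(cm)
print(f"accuracy={acc:.1%}, F1-score={f1:.1%}")
```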
