**5. Discussion**

This work aimed to compare the generalization performance of HC features and deep representations, focusing in particular on generalization in OOD settings.

In the first experiment, several metrics were compared to validate and quantify our OOD settings. For TSFEL representations, all the considered metrics were in agreement with our ordering hypothesis. In particular, the DC was able to clearly identify each of the OOD settings as such. In contrast, for deep representations, there was some disagreement among the considered metrics. Still, the MMD and Wasserstein distance ratios remained in agreement with the adopted hypothesis. These two metrics therefore appear to be more robust to the change of data representation induced by the deep learning model.
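As an illustration of how a distance ratio of this kind can flag a distribution shift, the following is a minimal sketch that computes a biased squared-MMD estimate with an RBF kernel and divides the train-to-OOD distance by the train-to-ID distance. It is not the implementation used in this work: the ratio definition, the median-heuristic bandwidth, the function names (`rbf_mmd2`, `mmd_ratio`), and the synthetic Gaussian data are all assumptions made for the example.

```python
import numpy as np

def sq_dists(a, b):
    """Pairwise squared Euclidean distances between rows of a and rows of b."""
    d2 = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.maximum(d2, 0.0)  # clip tiny negatives caused by round-off

def median_gamma(x):
    """RBF bandwidth from the median heuristic (an assumed, common default)."""
    d2 = sq_dists(x, x)
    pos = d2[d2 > 0]
    return 1.0 / np.median(pos) if pos.size else 1.0

def rbf_mmd2(x, y, gamma):
    """Biased (V-statistic) estimate of the squared MMD with an RBF kernel."""
    kxx = np.exp(-gamma * sq_dists(x, x))
    kyy = np.exp(-gamma * sq_dists(y, y))
    kxy = np.exp(-gamma * sq_dists(x, y))
    return kxx.mean() + kyy.mean() - 2.0 * kxy.mean()

def mmd_ratio(train, id_test, ood_test):
    """Hypothetical ratio: MMD^2(train, OOD test) / MMD^2(train, ID test)."""
    gamma = median_gamma(train)  # same bandwidth for both comparisons
    return rbf_mmd2(train, ood_test, gamma) / rbf_mmd2(train, id_test, gamma)

# Synthetic stand-ins for feature matrices (rows = samples, columns = features).
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(200, 5))
id_test = rng.normal(0.0, 1.0, size=(200, 5))   # same distribution as train
ood_test = rng.normal(1.5, 1.0, size=(200, 5))  # mean-shifted distribution

ratio = mmd_ratio(train, id_test, ood_test)
print(f"MMD ratio (OOD / ID): {ratio:.1f}")
```

Under this assumed definition, a ratio well above 1 indicates that the test set is farther from the training distribution than an ID test set would be. The same feature matrices could be swapped for HC or deep representations to compare how the ratio behaves under each.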

In our experiments involving HAR tasks, models based on HC features reached lower F1-scores in the ID setting but were more robust in OOD settings. This higher robustness may be due to the stability of HC features: they are fixed a priori based on domain knowledge, which should remain valid across tasks. Conversely, deep features are learned automatically and could thus fail to capture generally helpful features, as there are known inefficiencies in the current methods for training neural networks. Networks trained with these methods are typically biased toward simple solutions [15] and rely on spurious correlations [10] rather than prior knowledge or causal relations.

Regarding the generalizability of our results to other settings, we note that although we focused on HAR, our experiments and analyses could be replicated, with minor adaptations, in a wide range of fields. For example, similar deep learning models and handcrafted features could be used and compared in fields that depend on sensor data, such as fall detection, predictive maintenance, or physiological signal processing (e.g., EEG, EMG, and ECG). For image or video processing, different deep learning architectures and feature extraction libraries would have to be employed.

From a practical standpoint, HC features, being more robust, appear to be better suited for real-world HAR systems. However, reimplementing them on mobile or edge devices may be an arduous task. CNNs do not share this limitation, as their representations are encoded in weight matrices and can, in principle, be ported to these devices without significant effort [58]. More studies should thus be devoted to exploring this trade-off between increased robustness and reimplementation effort, possibly considering hybrid approaches (such as the ones also considered in this paper), as well as alternative training techniques for CNNs that attempt to improve robustness.
