**1. Introduction**

Human Activity Recognition (HAR) aims to automatically recognize patterns in human movement from sensor inputs, typically inertial measurement units (IMUs), which are currently available in most wearables and smartphones [1]. HAR is an important enabling technology for applications such as remote patient monitoring, locomotor rehabilitation, security, and pedestrian navigation [1].

The IMU itself may contain several sensors, such as accelerometers and gyroscopes, which are microelectromechanical systems whose capacitance varies with movement [2]. The accelerometer measures acceleration, while the gyroscope measures angular velocity [3]. Machine Learning (ML) is typically used to associate the signals obtained from these sensors with specific human activities [2]. The typical HAR system comprises the following steps [4]: data acquisition, preprocessing, segmentation, feature extraction, and classification.
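As a concrete illustration of these steps, the sketch below segments a raw IMU stream into fixed-length windows, extracts a few simple handcrafted statistics, and fits a classifier. The signal shape, window length, and classifier choice are illustrative assumptions, not the exact pipeline used in this work.

```python
# Minimal HAR pipeline sketch: segmentation, feature extraction, classification.
# Data, sampling parameters, and classifier are placeholders for illustration.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def segment(signal, labels, win=128, step=64):
    """Slice a (n_samples, n_channels) IMU stream into overlapping windows."""
    windows, y = [], []
    for start in range(0, len(signal) - win + 1, step):
        windows.append(signal[start:start + win])
        # label each window by the majority label of its samples
        y.append(np.bincount(labels[start:start + win]).argmax())
    return np.stack(windows), np.array(y)

def extract_features(windows):
    """Compute a few handcrafted statistics per channel for each window."""
    feats = [windows.mean(axis=1), windows.std(axis=1),
             windows.min(axis=1), windows.max(axis=1)]
    return np.concatenate(feats, axis=1)

signal = np.random.randn(10_000, 6)        # placeholder accelerometer + gyroscope stream
labels = np.random.randint(0, 5, 10_000)   # placeholder per-sample activity ids
X, y = segment(signal, labels)
clf = RandomForestClassifier().fit(extract_features(X), y)
```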

Similar to most ML tasks, HAR models perform well when tested on a randomly sampled subset of a carefully acquired dataset (i.e., out-of-sample validation) and struggle in Out-of-Distribution (OOD) settings (i.e., external validation). These settings arise when the source and target domains differ, such as when models are tested across different datasets or sensor positions [5–7].


Deep learning is becoming increasingly popular in HAR applications [8]. While the typical pipeline includes a feature extraction step before training a classifier, deep neural networks automatically learn and extract features through iterative minimization of a cost function. In principle, a neural network may have millions of learnable parameters, which translates into a large capacity to learn more complex and discriminative features [9]. These models have potential for HAR applications since sensor signals may contain many inherent subtleties that Handcrafted (HC) features may not capture. Although promising, deep learning models have shown significant limitations when deployed in real-world environments. Current methods for training deep neural networks may converge to solutions that rely on spurious correlations [10], resulting in models that lack robustness and fail in test domains that are trivial for humans [11].

On the other hand, HC features in this field are well studied [1,12], more interpretable, and can reach high performance. In HAR, results with HC features approximate those of deep learning [13,14], even in tasks where the latter thrives, namely when train and test sets are created by randomly shuffling the data and thus share similar distributions [15].

Since both approaches have advantages and limitations, a more detailed comparison between them across domains is needed. This translates into a need for benchmarks in which the similarity between train and test distributions varies considerably.

As HAR naturally includes many kinds of possible domains, it can be considered an excellent sandbox for studying the OOD generalization ability of learning algorithms (Domain Generalization), and it has previously been used for this purpose [16].

This paper compares the performance of learning algorithms based on HC features with deep learning approaches in In-Distribution (ID) and OOD settings. For this comparison, we use five public datasets, homogenized to have the same label space and input shape so that the models can easily be trained and tested across them. To verify whether the tasks are in fact OOD, several metrics are considered and compared to assess the disparity between train and test sets. HC features are extracted with the Time Series Feature Extraction Library (TSFEL) [12], and one-dimensional Convolutional Neural Networks (CNNs) serve as our deep learning baselines.
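For illustration, the sketch below extracts TSFEL features from a single segmented window and defines a minimal 1D CNN of the kind used as a deep learning baseline. The window size, channel count, sampling frequency, and layer sizes are assumptions made for this example, not the exact configuration used in this paper.

```python
# Two feature pipelines: handcrafted features via TSFEL vs. a small 1D CNN.
# Window shape, fs, and architecture are illustrative assumptions.
import numpy as np
import tsfel
import torch
import torch.nn as nn

window = np.random.randn(128, 6)   # one segmented IMU window (placeholder)

# Handcrafted features: all TSFEL domains (statistical, temporal, spectral)
cfg = tsfel.get_features_by_domain()
hc_features = tsfel.time_series_features_extractor(cfg, window, fs=50)

# Deep representation: a minimal one-dimensional CNN over the time axis
class SmallCNN(nn.Module):
    def __init__(self, n_channels=6, n_classes=5):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_channels, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.fc = nn.Linear(64, n_classes)

    def forward(self, x):          # x: (batch, channels, time)
        return self.fc(self.conv(x).squeeze(-1))

logits = SmallCNN()(torch.randn(8, 6, 128))
```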

In summary, the major contributions of this work are the following:


#### **2. Related Work**

Several studies have compared classic ML approaches using HC features with deep learning methods. The authors of [13,14,17,18] compared CNNs with models based on support vector machines, multilayer perceptrons, and random forests. In all these studies, deep learning approaches outperformed classic methods. However, in their experiments, data splits were created by randomly shuffling the datasets, so samples from possibly different domains are represented in both the train and test sets with similar data distributions.

The use of data similarity to quantify the degree of distribution shift, and its relation to generalization, is both an old and important question in the ML literature, as several ML methods implicitly rely on properties related to similarity (e.g., the large margin assumption in SVM learning) to guarantee good generalization performance [19]. The potential relationship between data similarity and the generalization properties of ML models was first investigated from an empirical point of view in [20], where the authors discovered that datasets found to be substantially dissimilar likely stemmed from different distributions. Based on these findings, the authors of [21] demonstrated that information about similarity can be used to understand why a model performs poorly on a validation set, while the same information can be used to understand when and how to successfully perform domain adaptation (see, for example, the recent review [22]). To that end, several metrics for measuring data similarity have been proposed in the literature. Bousquet et al. [20] developed a measure (Data Agreement Criterion, DAC) based on the Kullback–Leibler divergence, which has since become frequently used to assess the similarity of distributions [23]. More recently, Schat et al. [24] suggested a modification to the DAC measure (Data Representativeness Criterion, DRC) and investigated the link between data similarity and generalization performance. Cabitza et al. [25] instead proposed a different approach based on a multivariate statistical testing procedure to obtain a hypothesis test for OOD data, the Degree of Correspondence (DC), and also studied the correlation between DC scores and the generalization of ML models. By contrast, in the deep learning literature, approaches based on statistical divergence measures, such as the Wasserstein distance [26] or the Maximum Mean Discrepancy (MMD) [27], have become increasingly popular for designing OOD detection methods; see also the recent review by Shen et al. [28].
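As a concrete example of such a similarity measure, the sketch below computes a biased estimate of the squared Maximum Mean Discrepancy with an RBF kernel between feature matrices from a source and a target domain. The median-heuristic bandwidth and the placeholder data are assumptions for illustration.

```python
# Biased MMD^2 estimate with an RBF kernel between two sample sets.
import numpy as np

def rbf_mmd2(X, Y, gamma=None):
    """X: (n, d) source features, Y: (m, d) target features."""
    Z = np.vstack([X, Y])
    sq_dists = np.sum((Z[:, None] - Z[None, :]) ** 2, axis=-1)
    if gamma is None:                              # median heuristic bandwidth
        gamma = 1.0 / np.median(sq_dists[sq_dists > 0])
    K = np.exp(-gamma * sq_dists)
    n = len(X)
    return K[:n, :n].mean() + K[n:, n:].mean() - 2 * K[:n, n:].mean()

source = np.random.randn(200, 10)                  # e.g., features from the train set
target = np.random.randn(150, 10) + 0.5            # shifted features from a test domain
print(rbf_mmd2(source, target))
```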

Deep learning approaches have been explored in OOD settings by testing the models on data from unseen domains [4,29–32]. Gholamiangonabadi et al. [33] verified that accuracy dropped from 99.85% with *k*-fold cross-validation to 85.1% when validating with leave-one-subject-out (LOSO) cross-validation. Bragança et al. [34] reported similar results with HC features: an accuracy of 85.37% for LOSO and 98% for *k*-fold. The most important features used by each model differed significantly. They concluded that LOSO would be a better validation method for assessing generalization. Li et al. [4] and Logacjov et al. [30] compared several deep learning models with classic ML pipelines using LOSO validation. In contrast to what was verified in the previous studies involving ID settings, in the OOD context, classic methods were mostly on par with deep learning approaches, outperforming them in some cases. Still, data acquired from different subjects of the same dataset may not be as diverse as the data encountered by HAR systems in real-world environments, since datasets are usually recorded under controlled conditions with similar devices worn in the same positions. Hoelzemann et al. [7] reported significant drops in performance when testing on different positions and different datasets, which were then mitigated by the use of transfer learning techniques.
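The contrast between the two validation protocols discussed above can be made explicit with scikit-learn splitters, as in the sketch below; the feature matrix, labels, and subject identifiers are placeholders.

```python
# Random k-fold vs. leave-one-subject-out (LOSO) validation splits.
import numpy as np
from sklearn.model_selection import KFold, LeaveOneGroupOut

X = np.random.randn(1000, 24)                # feature vectors, one per window
y = np.random.randint(0, 5, 1000)            # activity labels
subjects = np.random.randint(0, 10, 1000)    # subject id of each window

# k-fold: windows from the same subject may appear in both train and test (ID)
for train_idx, test_idx in KFold(n_splits=5, shuffle=True).split(X):
    pass

# LOSO: each fold holds out all windows of one subject (closer to OOD)
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=subjects):
    pass
```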

Transfer learning has previously been applied to HAR in cases where feature representations can be used in downstream tasks or across domains [6,35]. These methods leverage information about the target task or domain to approximate the source and target representations [5]. For example, Soleimani et al. [5] used a Generative Adversarial Network (GAN) to adapt the model to each user, outperforming other domain adaptation methods. However, the performance was poor when no transfer learning method was used (see Table 2 of [5]). The same phenomenon can be noticed in [35], where the domain adaptation methods outperformed the baseline model, which did not have access to data from the target domain. These studies illustrate the difficulty of generalizing to different domains, even when using deep learning models.

Gagnon et al. [16] included a HAR dataset in a benchmark to compare domain generalization methods applied to deep neural networks. The results indicate a 9.07% drop in accuracy from 93.35% ID to 84.28% OOD on a dataset where different devices worn in different positions characterize the possible domains. The same study showed that domain generalization techniques [11,36] did not improve results in a significant manner, and that empirical risk minimization (ERM) is still a strong baseline [37].

Boyer et al. [38] compared HC features and deep representations on an ID supervised classification task and on an OOD detection task. They concluded that, while a k-nearest neighbors (KNN) model using deep features as input outperforms the same model using HC features on the ID task, the situation is partially reversed for the OOD detection task, where models based on HC features achieve the best results in two out of three datasets. However, the ID and OOD tasks are not directly comparable, since they are of different kinds and use different evaluation methods.

Trabelsi et al. [39] compared three deep learning approaches with a random forest classifier taking handcrafted features as input. Similar to the experiments in our work, the datasets were homogenized by including only common activities, and the test sets were separated by user. They concluded that only one of the deep learning approaches outperformed the baseline model with handcrafted features. While they formulated two different domain generalization settings (OOD-U and OOD-MD), the results for these settings are not directly comparable since the test sets were combined when reporting the results for the OOD-MD setting.

This paper adds to previous work by explicitly comparing the OOD robustness of HC features and deep representations in four domain generalization settings with different distances between train and test sets.
