#### *3.1. Datasets*

The datasets used in this study include human activity data recorded using smartphones and wearable inertial measurement units (IMUs). Table 1 contains a detailed description of these publicly available datasets.

**Table 1.** Description of the datasets, including activities, positions, devices, and number of subjects.


Several criteria guided the selection of datasets for this study. Only datasets with a sampling rate close to or above 50 Hz were considered, to avoid the need for upsampling. The search was restricted to datasets that included most of the main activities seen in the literature (e.g., walking, sitting, standing, running, and ascending/descending stairs). For better compatibility, and to avoid the large drops in performance caused by considerably different sensor positions [7], we selected datasets whose sensor positions overlapped with those of at least one other dataset that fulfilled the remaining criteria.

The accelerometer was the selected sensor for this work. The magnitude values were computed as the Euclidean norm of the three axes (*x*, *y*, and *z*), as this quantity is invariant to the orientation of the device and provides information that is more stable across domains. The magnitude signal was used along with the signal from each axis, so that all the information given by the accelerometer was retained. From those four channels, five-second windows were extracted without overlap.
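As a minimal sketch of this step (assuming a NumPy recording of shape (n_samples, 3) already resampled to 50 Hz; the function and variable names are hypothetical), the magnitude channel and the non-overlapping five-second windows could be obtained as follows:

```python
import numpy as np

def make_windows(acc: np.ndarray, fs: int = 50, win_s: int = 5) -> np.ndarray:
    """Build four-channel (x, y, z, magnitude) non-overlapping windows.

    acc: accelerometer recording of shape (n_samples, 3).
    Returns an array of shape (n_windows, win_s * fs, 4).
    """
    # Euclidean norm of the three axes: the orientation-invariant magnitude channel.
    magnitude = np.linalg.norm(acc, axis=1, keepdims=True)
    channels = np.hstack([acc, magnitude])        # (n_samples, 4)

    win_len = win_s * fs                          # 250 samples per window
    n_windows = len(channels) // win_len          # drop the incomplete tail
    return channels[: n_windows * win_len].reshape(n_windows, win_len, 4)
```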

All selected datasets were homogenized [47] so that a model trained on one dataset could be directly tested on any other. This procedure included resampling all the recordings to 50 Hz and mapping the different activity labels to a common nomenclature: walking, running, sitting, standing, and stairs. Stair-related labels were merged into a single "stairs" label, as distinguishing between going up and down the stairs would add unnecessary complexity to the task: it is hard to infer the direction of vertical displacement without access to a barometer [48]. The RealWorld dataset [46] generated considerably more windows than the other datasets, so one-third of its windows was randomly sampled and used in the experiments. The final distribution of windows and activities per dataset is shown in Table 2, which contains the percentage of samples (five-second windows) of each activity in a given dataset, as well as the total number of samples and the corresponding percentage of each activity and dataset. The table shows that, while not perfectly balanced, every activity has a substantial number of samples in all the datasets. On the other hand, even after this reduction, the RealWorld and SAD datasets have a larger influence on the experiments; this should not be an issue, since the conditions remain identical for both deep and classic approaches.
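A hedged sketch of the homogenization step is given below. The target label vocabulary follows the text, but each dataset's native sampling rate and label strings vary, so the entries in `LABEL_MAP` are illustrative assumptions:

```python
from scipy.signal import resample_poly

# Map each dataset's native labels onto the common nomenclature;
# stair-related labels are merged into a single "stairs" class.
# The source-label strings below are assumed examples.
LABEL_MAP = {
    "walking": "walking", "jogging": "running", "running": "running",
    "sitting": "sitting", "standing": "standing",
    "upstairs": "stairs", "downstairs": "stairs",
}

def resample_to_50hz(signal, fs_in: int, fs_out: int = 50):
    """Polyphase resampling of a (n_samples, n_channels) recording."""
    return resample_poly(signal, up=fs_out, down=fs_in, axis=0)
```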


**Table 2.** Distribution of samples and activity labels per dataset. The # symbol represents the number of samples.

#### *3.2. Handcrafted Features*

To extract HC features, TSFEL [12] was used. This library extracted features directly from the five-second accelerometer windows generated from each public dataset. To decrease computation time, we removed the features that consist of individual coefficients, such as Fast Fourier Transform (FFT), empirical Cumulative Distribution Function (eCDF), and histogram values. Nonetheless, the high-level spectral features computed from the FFT were kept. We also did not extract wavelet and audio-related features, such as Mel-Frequency Cepstral Coefficients (MFCC) and Linear Prediction Cepstral Coefficients (LPCC). The total number of features per window was 192.
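An illustrative sketch of this extraction is shown below, using TSFEL's documented `get_features_by_domain` and `time_series_features_extractor` functions. The specific configuration keys removed here are assumptions standing in for the per-coefficient features described above:

```python
import numpy as np
import tsfel

# Stand-in for one five-second, four-channel window (250 samples at 50 Hz).
window = np.random.randn(250, 4)

# Start from TSFEL's full configuration (statistical, temporal, spectral domains).
cfg = tsfel.get_features_by_domain()

# Assumed example of pruning per-coefficient features (FFT coefficients,
# eCDF values, histogram bins); exact key names may differ per TSFEL version.
for domain, names in {"spectral": ["FFT mean coefficient"],
                      "statistical": ["ECDF", "Histogram"]}.items():
    for name in names:
        cfg[domain].pop(name, None)

# One row of features per window; column names encode channel and feature.
features = tsfel.time_series_features_extractor(cfg, window, fs=50)
```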

After the features were computed, samples were split according to each task (see Section 4). Subsequently, features were scaled by subtracting the mean of the train set and dividing by its standard deviation (Z-score normalization). The classifiers used were Logistic Regression (LR) and a Multilayer Perceptron (MLP) with a single hidden layer of 128 neurons and Rectified Linear Unit (ReLU) activation. These classifiers were chosen to enable a fair comparison with deep learning, as they resemble the last layer(s) of a deep neural network, usually responsible for the final prediction after feature learning.
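A minimal scikit-learn sketch of this pipeline follows, assuming feature matrices `X_train`, `X_test` and labels `y_train` from the split described above; apart from the 128-unit hidden layer and ReLU activation stated in the text, all hyperparameters are assumptions:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# Z-score normalization: mean and standard deviation come from the train set only.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Linear classifier over the handcrafted features.
lr = LogisticRegression(max_iter=1000).fit(X_train_s, y_train)

# Single hidden layer of 128 ReLU units, as described in the text.
mlp = MLPClassifier(hidden_layer_sizes=(128,),
                    activation="relu").fit(X_train_s, y_train)
```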

#### *3.3. Deep Learning*

Convolutional neural networks (CNNs) were the selected deep learning models for this study, since they achieved significantly better performance and converged faster than recurrent neural networks (RNNs) in preliminary experiments, which is consistent with the literature [49,50]. A scheme of the baseline CNN architectures is presented in Figure 1. We chose three different architectures, which we named CNN-base, CNN-simple, and ResNet. The training process was identical for all the architectures and is explained in Section 4. CNN-simple is a simplified version of CNN-base, with only two convolutional layers and a logistic regression applied directly to the flattened feature maps. ReLU was used as the activation function for the hidden layers of both architectures. The ResNet (Figure 1c) is a residual network inspired by Ferrari et al. [18], with a few modifications; its convolutional block is represented in Figure 2.
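Since the exact layer settings are given in Figure 1, the PyTorch sketch below only illustrates the CNN-simple design: two 1D convolutional layers with ReLU, followed by a logistic regression (a single linear layer) on the flattened feature maps. The kernel sizes and filter counts are placeholder assumptions, not the values from the figure:

```python
import torch
import torch.nn as nn

class CNNSimple(nn.Module):
    """Illustrative CNN-simple: two conv layers + logistic regression head.

    Input: (batch, 4 channels, 250 samples) five-second windows at 50 Hz.
    Kernel sizes and filter counts are assumed placeholders.
    """

    def __init__(self, n_classes: int = 5, n_channels: int = 4, win_len: int = 250):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_channels, 16, kernel_size=5),   # stride=1, padding=0
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2, stride=1),      # stride=1 per the Figure 1 caption
            nn.Conv1d(16, 32, kernel_size=5),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2, stride=1),
        )
        with torch.no_grad():  # infer the flattened size from a dummy window
            flat = self.features(torch.zeros(1, n_channels, win_len)).numel()
        self.classifier = nn.Linear(flat, n_classes)    # logistic regression head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(start_dim=1))
```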

In an attempt to bridge the performance gap between HC features and deep representations, we built a hybrid version of each architecture, in which the HC features are concatenated with the flattened representations of the model and fed to a fusion layer before entering the final classification layer. The fusion layer has 128 hidden units in both CNN-simple and CNN-base, increasing to 256 in the ResNet. An illustration of the hybrid version of CNN-base is shown in Figure 3.
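The fusion step could be sketched as follows, reusing the hypothetical `CNNSimple` backbone above. The 192 HC features and the 128-unit fusion layer follow the text; everything else is an assumption:

```python
import torch
import torch.nn as nn

class HybridCNN(nn.Module):
    """Fuse handcrafted features with the flattened CNN feature maps."""

    def __init__(self, backbone: CNNSimple, n_hc: int = 192,
                 fusion_units: int = 128, n_classes: int = 5):
        super().__init__()
        self.backbone = backbone.features
        flat = backbone.classifier.in_features     # flattened feature-map size
        self.fusion = nn.Sequential(
            nn.Linear(flat + n_hc, fusion_units),  # 128 on CNN-simple/base, 256 on ResNet
            nn.ReLU(),
        )
        self.classifier = nn.Linear(fusion_units, n_classes)

    def forward(self, x: torch.Tensor, hc: torch.Tensor) -> torch.Tensor:
        deep = self.backbone(x).flatten(start_dim=1)
        return self.classifier(self.fusion(torch.cat([deep, hc], dim=1)))
```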

For all these models, the input windows were scaled by Z-score normalization, with mean and standard deviation computed across all the windows of the train set.
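In contrast to the per-feature scaling in Section 3.2, this is read here as a single mean and standard deviation taken over every value in the train-set windows (per-channel statistics would be an equally plausible reading); a sketch, with assumed variable names:

```python
import numpy as np

# train_windows / test_windows: (n_windows, 250, 4) arrays of raw windows.
mean, std = train_windows.mean(), train_windows.std()  # scalars over the whole train set

train_windows = (train_windows - mean) / std
test_windows = (test_windows - mean) / std             # train statistics reused at test time
```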


**Figure 1.** Convolutional neural network architectures. The values above the representation of each feature map indicate their shape (Signal length × Number of channels). Convolutional layers (1D): k = kernel size; nr\_f = number of filters; stride = 1; padding = 0. Max pooling layers: k = kernel size; stride = 1; padding = 0. (**a**) CNN-simple Architecture. (**b**) CNN-base Architecture. (**c**) ResNet Architecture. The convolutional block is depicted in Figure 2.

**Figure 2.** ResNet convolutional block. The letter k stands for "kernel size".

**Figure 3.** Simplified illustration of the hybrid version of CNN-base (excluding the CNN backbone for ease of visualization).
