*3.4. Evaluation*

To quantify the degree to which a test domain is OOD, five metrics were applied: the Euclidean distance, cosine similarity, Wasserstein distance, MMD, and DC. Each metric was applied to the representations produced by each model before the classification stage. Regarding the Wasserstein distance [51], the Wasserstein-1 version was used, which is given by:

$$\mathcal{W}\_1(\mathbf{X}, \mathbf{Y}) = \inf\_{\pi \in \Gamma(\mathbf{X}, \mathbf{Y})} \int\_{\mathbb{R} \times \mathbb{R}} |\mathbf{x} - \mathbf{y}| \mathrm{d}\pi(\mathbf{x}, \mathbf{y}), \tag{1}$$

where Γ(**X**, **Y**) is the set of joint distributions whose marginals on the first and second factors are **X** and **Y**, respectively, and *x* and *y* are samples drawn from a distribution *π*(*x*, *y*) in this set. Intuitively, the distance is the optimal cost of moving one distribution until it overlaps with the other. In our experiments, *x* and *y* are the feature representations of subsets of the train and test data, so *W*<sub>1</sub> represents the cost of mapping the distribution of *x* onto the distribution of *y* (or vice versa).
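As an illustration, the sketch below estimates this quantity between two sets of feature vectors. Since SciPy's `wasserstein_distance` operates on one-dimensional samples, the sketch averages per-feature distances; this is one possible approximation for multivariate representations, and not necessarily the exact computation used in our pipeline.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def w1_per_feature(train_feats: np.ndarray, test_feats: np.ndarray) -> float:
    """Average 1D Wasserstein-1 distance over feature dimensions.

    train_feats, test_feats: arrays of shape (n_samples, n_features),
    e.g., HC features or penultimate-layer embeddings.
    """
    dists = [
        wasserstein_distance(train_feats[:, j], test_feats[:, j])
        for j in range(train_feats.shape[1])
    ]
    return float(np.mean(dists))

# Toy usage: an "in-distribution" sample vs. a mean-shifted one
rng = np.random.default_rng(0)
tr = rng.normal(0.0, 1.0, size=(500, 16))
ts = rng.normal(0.5, 1.0, size=(300, 16))   # shifted -> larger W1
print(w1_per_feature(tr, tr[:300]), w1_per_feature(tr, ts))
```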

The MMD, in turn, is a kernel-based statistical procedure that aims to determine whether two given datasets come from the same distribution [52]. Given a fixed kernel function *k* : *X* × *X* → ℝ and two datasets *X*, *Y* with sizes |*X*| = *n* and |*Y*| = *m*, the MMD can be estimated as:

$$\text{MMD}(X, Y) = \frac{1}{n(n-1)} \sum\_{i \neq j} k(\mathbf{x}\_i, \mathbf{x}\_j) + \frac{1}{m(m-1)} \sum\_{i \neq j} k(\mathbf{y}\_i, \mathbf{y}\_j) - \frac{2}{nm} \sum\_{i,j} k(\mathbf{x}\_i, \mathbf{y}\_j) \tag{2}$$

Intuitively, the MMD measures the distance between *X* and *Y* by computing the average similarity within *X* and within *Y* separately, and then subtracting the average cross-similarity between the two datasets, where the similarity between two instances is quantified by the selected kernel *k*. In this work, a simple linear kernel was selected. As for the Wasserstein distance, *x* and *y* represent the feature representations of subsets of the train and test data, so the MMD quantifies how similar the train and test feature distributions are under the chosen kernel.
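A minimal sketch of the estimator in Equation (2) with a linear kernel is given below; the *i* ≠ *j* sums reduce to matrix products minus the diagonal, and the toy data are for illustration only.

```python
import numpy as np

def mmd_linear(X: np.ndarray, Y: np.ndarray) -> float:
    """Unbiased MMD estimate from Equation (2) with a linear kernel k(x, y) = x.y."""
    n, m = len(X), len(Y)
    Kxx = X @ X.T          # pairwise linear-kernel values within X
    Kyy = Y @ Y.T          # within Y
    Kxy = X @ Y.T          # across X and Y
    # Exclude the diagonal (i == j) terms, as in the i != j sums of Eq. (2)
    term_x = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    term_xy = 2.0 * Kxy.sum() / (n * m)
    return float(term_x + term_y - term_xy)

rng = np.random.default_rng(1)
X = rng.normal(0, 1, size=(400, 32))
print(mmd_linear(X, rng.normal(0, 1, size=(400, 32))))   # ~0 for the same distribution
print(mmd_linear(X, rng.normal(1, 1, size=(400, 32))))   # larger under a mean shift
```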

The DC, by contrast, is a multivariate hypothesis testing procedure for the hypothesis that two data samples come from the same distribution: having fixed a representative data sample, the obtained *p*-value can then be considered as a measure of how much any other data sample is OOD with respect to the representative one. In particular, scores close to 0 can be interpreted as most likely OOD, since, under the null hypothesis that the two data samples come from the same distribution, observing a *p*-value close to 0 has low probability. Since the DC cannot be defined and computed by means of a closed-form procedure, in [25] a permutation-resampling algorithm (see Algorithm 1) was defined to compute the corresponding *p*-value, based on the selection of a base distance metric.

**Algorithm 1** The procedure used to compute the similarity between two datasets *T* and *V*, using the Degree of Correspondence (DC).

**procedure** DC(*T*, *V*: datasets, *d*: distance, *∂*: distance metric)
  *d<sub>T</sub>* = {*d*(*t*, *t*′) : *t*, *t*′ ∈ *T*}
  For each *v* ∈ *V*, find *t<sub>v</sub>* ∈ *T*, the nearest neighbor of *v* in *T*
  *T*|*V* = {*t* ∈ *T* : ∃ *v* ∈ *V* s.t. *t* = *t<sub>v</sub>*} ∪ *V*
  *d*<sub>*T*|*V*</sub> = {*d*(*t*, *t*′) : *t*, *t*′ ∈ *T*|*V*}
  *δ* = *∂*(*d<sub>T</sub>*, *d*<sub>*T*|*V*</sub>)
  Compute DC = *Pr*(*δ*′ ≥ *δ*) using a permutation procedure
  **return** DC
**end procedure**

The selection of the distance metric *∂* in Algorithm 1 is important to obtain sensible results for the DC. Intuitively, *∂* should represent the appropriate notion of distance in the instance space of interest. In [53], lacking any appropriate definition of distance in the instance space, the authors suggest the use of a general baseline, e.g., the Euclidean or cosine distance, or robust non-parametric deviation metrics, e.g., the MMD or Kolmogorov–Smirnov statistics.
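The following is a minimal sketch of Algorithm 1 as we read it, using the Euclidean distance as the base distance *d* and the Kolmogorov–Smirnov statistic as *∂* (one of the baselines mentioned above); the details of the permutation procedure (pooling and reshuffling the two distance samples) and the number of permutations are assumptions for illustration, not necessarily the implementation of [25].

```python
import numpy as np
from scipy.spatial.distance import pdist, cdist
from scipy.stats import ks_2samp

def dc_pvalue(T: np.ndarray, V: np.ndarray, n_perm: int = 1000, seed: int = 0) -> float:
    """Sketch of the Degree of Correspondence (Algorithm 1).

    d : Euclidean distance between instances (base distance).
    ∂ : Kolmogorov-Smirnov statistic between two sets of pairwise distances.
    """
    rng = np.random.default_rng(seed)

    d_T = pdist(T)                                    # pairwise distances within T
    nn_idx = cdist(V, T).argmin(axis=1)               # nearest neighbor of each v in T
    T_given_V = np.vstack([T[np.unique(nn_idx)], V])  # T|V = nearest neighbors ∪ V
    d_TV = pdist(T_given_V)                           # pairwise distances within T|V

    delta = ks_2samp(d_T, d_TV).statistic             # observed deviation ∂(d_T, d_{T|V})

    # Permutation procedure: pool the two distance samples, reshuffle,
    # and recompute ∂ to estimate Pr(δ' >= δ) under the null hypothesis.
    pooled = np.concatenate([d_T, d_TV])
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        a, b = pooled[: len(d_T)], pooled[len(d_T):]
        if ks_2samp(a, b).statistic >= delta:
            count += 1
    return count / n_perm
```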

In previous work, model performance has been evaluated using metrics such as accuracy, sensitivity, specificity, precision, recall, and f1-score [1]. As class imbalance is common in most publicly available HAR datasets (see Table 2), the f1-score is used as the main performance metric, since it is more robust than accuracy in these settings [30]. To compare the deep learning models with the classic models based on HC features, f1-scores are compared across tasks covering the different OOD scenarios and five public HAR datasets.

#### **4. Experiments and Results**

The main purpose of this paper is to compare the performance of HC features and deep representations in different OOD settings for HAR. A scheme of the full pipeline used for the experiments is presented in Figure 4.

HAR is a classification task that usually involves multiple domains, easily turning into a domain generalization task if the domains are taken into account when splitting the data. We devise four domain generalization settings, starting with a baseline ID setting in which 30% of each dataset is randomly sampled for testing, and three OOD settings: (a) splitting by user within the same dataset, where approximately 30% of the users are assigned to the test set—OOD by user (OOD-U); (b) leaving one dataset out for testing while including all the others for training—OOD with multiple source datasets (OOD-MD); (c) training on one dataset and leaving another for testing, running all the possible combinations—OOD with a single source dataset (OOD-SD). To obtain a direct comparison, the test set of OOD-U is used as the test set for all the OOD settings. Of the three OOD settings, OOD-U is expected to be closest to the training distribution, since it is drawn from the same dataset, where devices and acquisition conditions are usually similar. It is followed by OOD-MD, since joining all the datasets (except one) for training averages their distributions onto a more general space. Finally, as it includes only a single dataset for training, OOD-SD should capture the largest distances between train and test distributions.
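As an illustration of the OOD-U split, the sketch below holds out roughly 30% of the user identifiers for testing; the function name, arguments, and rounding are placeholders, not the paper's actual code.

```python
import numpy as np

def split_by_user(X: np.ndarray, y: np.ndarray, users: np.ndarray,
                  test_frac: float = 0.3, seed: int = 0):
    """Hold out ~30% of the users for testing (OOD-U style split)."""
    rng = np.random.default_rng(seed)
    unique_users = rng.permutation(np.unique(users))
    n_test = max(1, int(round(test_frac * len(unique_users))))
    test_users = set(unique_users[:n_test])
    test_mask = np.isin(users, list(test_users))
    return (X[~test_mask], y[~test_mask]), (X[test_mask], y[test_mask])
```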

**Figure 4.** Scheme of the experimental pipeline.

In order to validate our hypothesis about the ordering of the distances between the train and test splits on our four settings, different metrics were applied to the feature representations. This experiment has the following objectives: (1) to validate that our three OOD settings are in fact OOD; and (2) to obtain the best metric for our main experiments, which should output values that agree with our ordering hypothesis for both HC features and deep representations. For models based on HC features, metrics were computed directly from the features. In contrast, for deep models, metrics were calculated from the hidden representations of the last layer before classification.
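For the deep models, one common way to obtain the hidden representation of the last layer before classification in PyTorch is a forward hook, as in the minimal sketch below; the toy model and layer names are placeholders rather than our actual architectures.

```python
import torch
import torch.nn as nn

def penultimate_features(model: nn.Module, feature_layer: nn.Module,
                         x: torch.Tensor) -> torch.Tensor:
    """Capture the output of `feature_layer` (the layer feeding the classifier)."""
    captured = {}

    def hook(_module, _inputs, output):
        captured["feats"] = output.detach()

    handle = feature_layer.register_forward_hook(hook)
    with torch.no_grad():
        model(x)
    handle.remove()
    return captured["feats"]

# Toy usage: a small network whose layer before the classifier is model[0]
model = nn.Sequential(
    nn.Sequential(nn.Linear(60, 64), nn.ReLU()),  # backbone -> representations
    nn.Linear(64, 6),                             # classification layer
)
feats = penultimate_features(model, model[0], torch.randn(8, 60))
print(feats.shape)  # torch.Size([8, 64])
```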

We note that different distance metrics have different scales, which makes their interpretation and comparison more difficult. For this reason, we computed distance ratios instead of raw distances, so as to make the values of the different metrics more consistent across tasks. The distance ratios were computed for each task, i.e., each setting/dataset combination, using the following equation:

$$\text{Distance\\_ratio} = \frac{\partial(tr\_1, ts\_1)}{\partial(tr\_2, tr\_3)}, \tag{3}$$

where *∂* is a distance metric and *tr<sub>i</sub>* and *ts<sub>i</sub>* are subsets randomly sampled (with replacement) from the train and test sets, respectively. The sample size is half the minimum of the train and test set lengths. By contrast, for the DC, the raw value without any ratio-based normalization was used, since it is already normalized in the [0, 1] range and is able to deal with any data representation directly.
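A sketch of how Equation (3) could be computed is given below, following our reading that the numerator is a train-test distance and the denominator a train-train distance between independently resampled subsets, each of size half the minimum of the train and test set lengths. The `distance_fn` argument can be any of the metrics discussed earlier (e.g., the linear-kernel MMD sketched above).

```python
import numpy as np

def distance_ratio(train_feats: np.ndarray, test_feats: np.ndarray,
                   distance_fn, seed: int = 0) -> float:
    """Equation (3): train-test distance normalized by train-train distance."""
    rng = np.random.default_rng(seed)
    k = min(len(train_feats), len(test_feats)) // 2   # subset size

    def sample(data):
        idx = rng.integers(0, len(data), size=k)      # sampling with replacement
        return data[idx]

    tr1, tr2, tr3 = sample(train_feats), sample(train_feats), sample(train_feats)
    ts1 = sample(test_feats)
    return distance_fn(tr1, ts1) / distance_fn(tr2, tr3)
```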

A comparison of the considered metrics based on the TSFEL features is presented in Table 3. All the metrics agree with the OOD ordering hypothesis stated above. Indeed, every metric took higher values for OOD-U, OOD-MD, and OOD-SD (in this order) than for the ID setting. In particular, the DC with Euclidean-based metrics saturates to values close to zero for all three OOD settings, indicating, following the interpretation of this score discussed above, that the test sets are likely to be OOD.

Table 4 shows a comparison of the considered metrics based on the CNN-base representations. In contrast to the case of the TSFEL features, the metrics showed a much lower degree of agreement with the OOD ordering hypothesis. First, it can be noted that only the Wasserstein distance and MMD have values that clearly increase with the expected degree of OOD, in agreement with the results of the TSFEL representations and, consequently, with our OOD ordering hypothesis. Nonetheless, both metrics had a large degree of variation, with the confidence intervals for the ID, OOD-U, and OOD-MD settings partially overlapping. In the case of DC Cosine, the score for the OOD datasets was higher than that for the ID one. This seemingly paradoxical behavior may have an intuitive geometric explanation: it may be a consequence of the transformations that take place during training, which influence the shape of the instance space and possibly bring the representations of instances that would otherwise be OOD closer to the training data manifold. In support of this hypothesis, most metrics reported a significantly different value for the OOD-SD setting than for the other OOD settings, showing that the training of the deep learning model had an important influence on the natural representation of the data manifold. In this sense, both the Wasserstein and MMD metrics seemed better able to adapt naturally to this change of representation.

**Table 3.** Comparison of metrics over all four domain generalization settings based on the TSFEL feature representations. For each setting, values were averaged over every test set. All metrics are ratios except the ones with (\*).


**Table 4.** Comparison of metrics over all four domain generalization settings based on the CNN-base representations. For each setting, values were averaged over all the datasets. All metrics are ratios except the ones with (\*).


As a consequence of these results, we chose the Wasserstein distance ratio as our main metric to quantify the degree of OOD, since it agrees with our hypothesis when using both TSFEL features and deep representations as input. This metric has also been applied by Soleimani et al. [5] to compute distances between source and target distributions.

Our experiments were run on an NVIDIA (Santa Clara, CA, USA) A16-8C GPU and an AMD (Santa Clara, CA, USA) Epyc 7302 processor, with Python version 3.8.12 and Visual Studio Code (Microsoft, Redmond, WA, USA) as the development environment. All the learning models were implemented using the PyTorch library [54]. Adam [55] was adopted as the optimizer for the training process. To reduce bias [16], results were averaged over nine combinations of three batch sizes (64, 128, and 256) and three learning rates (0.0008, 0.001, and 0.003). To account for class imbalance, the percentage of instances per class in the training set was given to the cross-entropy loss function as class weights.
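A condensed sketch of this configuration is shown below: Adam, cross-entropy with the per-class percentages of the training set as weights, and the nine (batch size, learning rate) combinations over which results are averaged. The toy labels and model are placeholders for illustration only.

```python
import itertools
import numpy as np
import torch
import torch.nn as nn

# Toy training labels standing in for a HAR training split (6 activity classes)
y_train = np.random.default_rng(0).integers(0, 6, size=1000)

# Class weights: percentage of instances per class in the training set,
# passed to the cross-entropy loss as described in the text.
counts = np.bincount(y_train, minlength=6)
class_weights = torch.tensor(counts / counts.sum(), dtype=torch.float32)
criterion = nn.CrossEntropyLoss(weight=class_weights)

model = nn.Sequential(nn.Linear(60, 64), nn.ReLU(), nn.Linear(64, 6))

# Results are averaged over the nine (batch size, learning rate) combinations
for batch_size, lr in itertools.product([64, 128, 256], [0.0008, 0.001, 0.003]):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    # ... build a DataLoader with this batch_size, train with `criterion` and
    #     `optimizer`, and record the resulting f1-score for this run ...
```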

To make the experiments as agnostic to the training method as possible, the same procedure was used for training the classifiers based on HC features and the deep learning models. Figure 5 shows the training and validation loss over the course of training for a single task. The chosen task was the OOD-U setting on the SAD dataset, an example of a task in which training instability was observed. One way to handle this instability is to end the training process earlier, i.e., early stopping [56].

**Figure 5.** Evolution of loss by epoch on SAD dataset in the OOD-U setting. The red dots indicate the minimum loss of each curve.

Over all the tasks, most models reached plateaus on validation performance after 30 to 50 epochs, so the training process was limited to 140 epochs to leave a margin for models to converge, but not so much as to fully overfit the data. For validation, we randomly sampled a 10% subset of the training data without replacement. While training, a checkpoint model was saved every time the validation loss achieved its best value since the start of training. Our early stopping method consisted of stopping training if the validation loss did not improve for 30 epochs in a row, which proved helpful in cases where training was not very stable. In these cases, the validation error oscillates, increasing for a certain number of epochs before decreasing again and, on many occasions, achieving a slightly lower error rate than in any of the previous epochs, which can be seen in the loss curves for the CNN models in Figure 5. This resembles the effects of double descent [57]. In our case, one of the causes of such unstable training may be the fact that these datasets are noisy, due to the diversity in users, devices, and positions, among other factors. It may also be a consequence of overparameterization, as the phenomenon was much more pronounced when training CNNs, which have significantly more parameters than our MLP and LR models. Both these potential causes were documented by Nakkiran et al. [57].
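A minimal sketch of the checkpointing and early stopping loop described above is shown below (at most 140 epochs, checkpoint on the best validation loss, stop after 30 epochs without improvement); the `train_one_epoch` and `validate` callables are placeholders for the actual training and validation steps.

```python
import copy

def train_with_early_stopping(model, train_one_epoch, validate,
                              max_epochs=140, patience=30):
    """Checkpoint on the best validation loss; stop after `patience` epochs without improvement.

    `train_one_epoch(model)` and `validate(model) -> float` are assumed callables.
    """
    best_loss = float("inf")
    best_state = copy.deepcopy(model.state_dict())
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_loss = validate(model)

        if val_loss < best_loss:                        # checkpoint the best model so far
            best_loss = val_loss
            best_state = copy.deepcopy(model.state_dict())
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:  # early stopping
                break

    model.load_state_dict(best_state)                   # restore the checkpointed model
    return model
```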

The evolution of the f1-score over the Wasserstein distance ratio for the best-performing model of each family (CNN-base and TSFEL+LR) is shown in Figure 6. For each combination of model, dataset, and setting, the average and standard deviation of the f1-score were computed over nine different runs with varying learning rates and batch sizes. The CNN-base embeddings were chosen to compute the distance ratios for this figure since they contain fewer outliers than the distance ratios computed from the TSFEL representations (see Figure A2). It can be seen that, initially, the CNN model outperforms the model using HC features. However, as the distance between the train and test domains increases, the situation is reversed, with the classic approach outperforming the CNN. This suggests that HC features are more robust to the shifts that occur in OOD data. The regression curves reinforce the idea of OOD stability. As expected, there is a negative correlation between the f1-score and the distance ratio, meaning that performance decreases as the test data becomes more distant from the distribution seen during training. In general, the distance ratios given by the Wasserstein distance agree with the previously stated OOD ordering hypothesis, with OOD-SD being the most OOD of the three settings, followed by OOD-MD and OOD-U, respectively. Still, a few outliers can be seen in the figure. The higher values of the standard deviation for the CNN indicate that these models are more susceptible to the choice of hyperparameters, which is reasonable given their much larger number of trainable parameters. However, such variability is not always desirable, as it indicates that the validation loss has become less correlated with the test loss. In practice, an apparently good model may perform surprisingly well in some settings while failing in situations that would otherwise be trivial for a simple model.

**Figure 6.** F1-score vs. log(distance ratio). Each marker represents a different task. Distance ratios are based on the CNN-base embeddings. Error bars represent one standard deviation away from the mean. The natural logarithm was applied to the distance ratios to make the regression curves linear.

More detailed results are presented in Table 5. For each combination of model and setting, the average and standard deviation of the f1-score were computed over all five datasets. The last column represents the average of the three OOD settings, which gives an idea of the overall generalization performance. The significant reversal from the ID to the OOD settings can be seen in the table. TSFEL + LR, which had the worst ID f1-score (90.54%), turned out to be the best overall in the OOD regime, with an f1-score of 70% averaged over all three OOD settings. Using an MLP instead of LR slightly decreased the overall OOD performance to 69.55%, while increasing the ID performance to 92.87%, closer to the deep learning results. This phenomenon may be related to the increase in the number of trainable parameters. Including HC features as an auxiliary input to deep models improved both ID and OOD results, with the hybrid version of CNN-base being the deep learning model with the strongest generalization performance (average OOD f1-score of 66.95%). However, this improvement is still insufficient to reach the OOD robustness of the models based solely on HC features.

**Table 5.** Average f1-score in percentage over all the tasks in a given setting. Values in bold indicate the best performance for each setting.


Despite being simpler than the ResNet, the CNN-base model achieved slightly higher generalization performance. On the other hand, CNN-simple, the simplest deep learning model, did not perform well in OOD tasks. There appears to be an optimal number of parameters, possibly dependent on the architecture, so further studies should be conducted to understand this trade-off.
