CMI-Net: cross-modality interaction network; CE: cross-entropy; CB: class-balanced; \* the value of *γ* is 0.5 (see Table 3).

#### *3.2. Ablation Study*

#### 3.2.1. Evaluation of CMIM

To explore the effectiveness of CMIM and the impact of its position in the network on classification performance, the results for four variants are shown in Table 2. Our proposed CMI-Net with softmax CE loss outperformed Variant0 (i.e., the network without CMIM), indicating the effectiveness of our interaction module. Variant1, Variant2, and Variant3 (i.e., networks with CMIM inserted after the 1st, 2nd, and 3rd max-pooling layer, respectively) did not outperform Variant0 in precision, and only Variant2 exceeded it in recall (78.44%); Variant0 obtained precision and recall values of 79.02% and 77.09%, respectively. This might be explained by the fact that modality-specific features learned in the shallow layers were simple and contained noise, which interfered with the process by which CMIM learned complex intermodality correlations, leading to poor predictions [40]. In addition, our architecture obtained the best performance since it applied the CMIM after a deeper layer, which enabled the network to discover more discriminative patterns and suppress irrelevant variations more effectively [41].

**Table 2.** Performance comparison of our CMI-Net with its variants. The best results for each metric are highlighted in bold.

| Methods & | Precision (%) | Recall (%) | F1-score (%) | Accuracy (%) |
| --- | --- | --- | --- | --- |
| Variant0 # | 79.02 | 77.09 | 76.88 | 91.76 |
| Variant1 \* | 78.18 | 77.07 | 77.40 | 92.17 |
| Variant2 \* | 77.50 | 78.44 | 77.91 | 92.92 |
| Variant3 \* | 78.36 | 76.94 | 77.02 | 92.62 |
| CMI-Net + softmax CE loss | **79.74** | **79.57** | **79.02** | **93.37** |

& denotes that all networks presented in this table were trained using softmax CE loss; # denotes the network without a cross-modality interaction module (CMIM); \* denotes the networks in which the CMIM was inserted after the 1st, 2nd, and 3rd max-pooling layers, respectively.

The results above show that including the CMIM in the network provided quantifiable improvements in identification performance. This was also reflected in the qualitative visualization of the embeddings and the corresponding clusters in Figure 3, obtained with t-distributed stochastic neighbor embedding (t-SNE), a technique for visualizing high-dimensional data by giving each data point a location in a two- or three-dimensional map [42]. Figure 3 shows the two-dimensional embedded features from part of the test dataset after the fully connected layers of both CNN branches, for the network without and with CMIM, computed using t-SNE with an init of 'pca' and a perplexity of 30. Comparing the left and right columns in Figure 3, it can be observed that the network with CMIM generated more compact clusters, with reduced intraclass distance and enlarged interclass distance. The core technical point is that the joint interaction module enabled adaptive amplification of salient features and suppression of unrelated features based on information from the two-modality data. To provide further insight into its contribution, we present two spatial attention maps for features extracted from the triaxial accelerometer and triaxial gyroscope data (Figure 4). As illustrated in Figure 4, the value of each pixel represents the degree of contribution of each temporal period and each axis, and it was adaptively recalibrated through intermodality interaction. Therefore, both quantitative and qualitative findings reinforce the suitability of our proposed CMI-Net for tasks using two-modality sensor data.

**Figure 3.** Embedding visualization of the features extracted from triaxial accelerometer and gyroscope data under the network without and with CMIM, respectively.

**Figure 4.** Attention maps for features extracted from the triaxial accelerometer (**a**) and gyroscope (**b**) data.

#### 3.2.2. Evaluation of CB Focal Loss

To study the effect of CB focal loss on the optimization of CMI-Net, we show the quantitative performance in Table 3 and explore the sensitivity to its hyperparameter *γ*. CMI-Net with CB focal loss (*γ* = 0.5) achieved the best precision of 82.50%, recall of 83.73%, and F1-score of 82.94%. This indicates that CB focal loss improved classification performance when the modulation strength was controlled appropriately, whereas negative effects occurred if the value of *γ* was too large or too small.

**Table 3.** Performance comparison between softmax CE loss and CB focal loss with different *γ*. The best results for each metric are highlighted in bold.

| Loss Functions | Precision (%) | Recall (%) | F1-Score (%) | Accuracy (%) |
| --- | --- | --- | --- | --- |
| Softmax CE loss (baseline) | 79.74 | 79.57 | 79.02 | **93.37** |
| CB focal loss (*γ* = 0.1) | 81.31 | 83.60 | 81.97 | 89.57 |
| CB focal loss (*γ* = 2) | 78.92 | 78.48 | 77.97 | 91.05 |
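As a rough illustration of how *γ* and the class-balanced term interact in a CB focal loss, the sketch below computes the loss for a single sample. It assumes the commonly used formulation in which each class is weighted by the inverse of its effective number of samples, (1 − β)/(1 − β^n), and the focal term (1 − p)^γ is applied to the softmax probability of the true class; the exact form used for CMI-Net is the one given in the "Optimization" section. The function names, the value β = 0.999, and the class counts are hypothetical.

```python
import math


def class_balanced_weights(samples_per_class, beta=0.999):
    """Per-class weights (1 - beta) / (1 - beta**n), rescaled to sum to the class count."""
    raw = [(1.0 - beta) / (1.0 - beta ** n) for n in samples_per_class]
    scale = len(raw) / sum(raw)  # normalisation is a convention, not a requirement
    return [w * scale for w in raw]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cb_focal_loss(logits, label, weights, gamma=0.5):
    """Loss for one sample: -w_y * (1 - p_y)**gamma * log(p_y)."""
    p = softmax(logits)[label]
    return -weights[label] * (1.0 - p) ** gamma * math.log(p)


# Hypothetical class counts: class 0 is a majority class, class 1 a minority class.
counts = [5000, 200]
weights = class_balanced_weights(counts)

# The same logits favour class 0, so the sample is easy if its label is 0
# and hard if its label is 1.
easy_majority = cb_focal_loss([3.0, 0.0], label=0, weights=weights)
hard_minority = cb_focal_loss([3.0, 0.0], label=1, weights=weights)
```

In this sketch the hard minority-class sample receives a much larger loss than the easy majority-class sample. Setting `gamma=0` removes the focal term and leaves a class-balanced CE loss, while larger values of `gamma` shrink the loss of every sample further, which is one way to see why the choice of *γ* matters.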




To provide further insight into the influence of CB focal loss (*γ* = 0.5) on the classification performance, we present the classification results of each activity under CMI-Net with CB focal loss and softmax CE loss, respectively, in Figure 5. The precision, recall, and F1-score of walking-natural improved markedly, while those of the other activities varied only slightly when using CB focal loss. This indicates that the overall improvement in classification performance was driven mainly by walking-natural, because CB focal loss focuses more on difficult samples and samples of minority classes. However, the overall accuracy of CMI-Net with CB focal loss decreased by 2.69% (Table 3), which was related to the different variations of recall values across activities and the current imbalanced dataset. In particular, the overall accuracy can be expressed as the average of the per-activity recall values weighted by each activity's share of the samples. As shown in Figure 5, the recall increases were 35.92% for walking-natural, 1.17% for standing, and 0.91% for galloping, and the recall decreases were 8.41% for walking-rider, 4.26% for eating, and 0.36% for trotting when using CB focal loss. All activities with increased recall belonged to the minority classes, while the remaining activities with decreased recall belonged to the majority classes, resulting in a decrease in overall accuracy. Thus, it is necessary to collect a more balanced dataset in the future.

*Sensors* **2021**, *21*, x FOR PEER REVIEW 12 of 18
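The relationship between overall accuracy and per-activity recall can be checked numerically. The sketch below uses two hypothetical two-class confusion matrices (the counts are invented for illustration and are not taken from the dataset) to show that accuracy equals the sample-share-weighted average of recall, so a recall gain on a small class can be outweighed by a smaller recall loss on a large class.

```python
def per_class_recall(confusion):
    """Recall of class i: correct predictions for i divided by true samples of i."""
    return [row[i] / sum(row) for i, row in enumerate(confusion)]


def overall_accuracy(confusion):
    """Fraction of all samples on the diagonal of the confusion matrix."""
    correct = sum(row[i] for i, row in enumerate(confusion))
    total = sum(sum(row) for row in confusion)
    return correct / total


def support_weighted_recall(confusion):
    """Average of per-class recall, weighted by each class's share of the samples."""
    total = sum(sum(row) for row in confusion)
    recalls = per_class_recall(confusion)
    return sum((sum(row) / total) * r for row, r in zip(confusion, recalls))


def macro_recall(confusion):
    """Unweighted mean of per-class recall."""
    recalls = per_class_recall(confusion)
    return sum(recalls) / len(recalls)


# Hypothetical confusion matrices (rows: true class, columns: predicted class).
# Class 0 is a majority class (900 samples), class 1 a minority class (100 samples).
baseline = [[880, 20], [60, 40]]
rebalanced = [[820, 80], [25, 75]]

for name, matrix in (("baseline", baseline), ("rebalanced", rebalanced)):
    print(name, overall_accuracy(matrix), support_weighted_recall(matrix), macro_recall(matrix))
```

In this toy example the minority-class recall rises from 0.40 to 0.75 and the macro-averaged recall rises from about 0.69 to about 0.83, yet overall accuracy falls from 0.92 to 0.895 because the majority class loses recall on nine times as many samples.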

**Figure 5.** Precision (**a**), recall (**b**), and F1-score (**c**) comparison of each activity under softmax cross-entropy (CE) loss and class-balanced (CB) focal loss.


In addition, experiments under different loss functions were conducted to verify the effectiveness of the CB focal loss, as illustrated in Table 4. The losses compared included CS\_CE loss, CB\_CE loss, focal loss, and ACS loss, as mentioned in the "Optimization" section. We found that CB focal loss, which combines CB loss and focal loss, performed better than either of them used alone, which indicated that adding the CB term to the focal loss function improved the overall classification performance on the imbalanced dataset. In addition, the precision, recall, and F1-score of CS\_CE loss and CB focal loss increased by different degrees, while both accuracies decreased compared with softmax CE loss. Specifically, for CS\_CE loss, the accuracy was only 83.79%, although the recall reached the highest value of 85.11%. This was because the recall of walking-rider was only 72.49%, even though that of walking-natural reached 69.16% (Figure 6). This result further verified that accuracy decreased when balancing techniques were used on the imbalanced dataset. In addition, we found that the recall of majority classes decreased while that of minority classes increased when using CS\_CE loss and CB focal loss (Figure 6). This result revealed that both losses effectively focused on the samples of minority classes during training, but inevitably more samples in majority classes were misclassified as minority classes, so that overall accuracy decreased.

**Table 4.** Classification performance comparison with different loss functions. The best two results for each metric are highlighted in bold.

| Loss Functions # | Precision (%) | Recall (%) | F1-Score (%) | Accuracy (%) |
| --- | --- | --- | --- | --- |
| Softmax CE loss | 79.74 | 79.57 | 79.02 | **93.37** |

