**1. Introduction**

The behavior of horses provides rich insight into their mental and physical status and is one of the most important indicators of their health, welfare, and subjective state [1]. However, behavioral monitoring of animals to date largely relies on manual observation, which is labor-intensive, time-consuming, and prone to the subjective judgment of individual observers [1]. The use of sensors and machine learning is well established in monitoring gait change [2] and in lameness detection as part of the equine veterinary examination, increasing the accuracy of identifying subtle lameness, which is one of the most expensive health issues in the equine industry [3,4]. Therefore, it is of significant importance to investigate and develop an automatic, objective, accurate, and quantifiable measurement system for equine behaviors. Such a system would allow caretakers to identify variations in the animal behavioral repertoire in real time, decreasing workloads in veterinary clinics and improving the husbandry and management of animals [5,6].

Over recent decades, automated animal activity recognition has been studied widely with the aid of various sensors (e.g., accelerometers, gyroscopes, and magnetometers) and machine learning techniques. For instance, a naïve Bayes (NB) classifier was applied to recognize horse activities (e.g., eating, standing, and trotting) using triaxial acceleration and obtained 90% classification accuracy [7]. Four classifiers, including linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), a support vector machine (SVM), and a decision tree (DT), were utilized to detect dog behaviors (e.g., galloping, lying on chest, and sniffing) based on accelerometer and gyroscope data, and the results revealed that the sensors placed on the back and collar yielded at best 91% and 75% accuracy, respectively [8]. A random forest (RF) algorithm was applied to categorize cow activities using triaxial acceleration and achieved high classification accuracy of 91.4%, 99.8%, 88%, and 99.8% for feeding, lying, standing, and walking events, respectively [9]. In horses, receiver-operating characteristic curve analysis classified standing, grazing, and ambulatory activities with a sensitivity of 94.7–97.7% and a specificity of 94.7–96.8% [10]. However, to classify animal behaviors accurately using these machine learning methods, feature extraction and method selection are often conducted manually and separately, which requires expert domain knowledge and easily induces feature engineering issues [11]. Moreover, handcrafted features often fail to capture general and complex patterns, resulting in low generalization ability, i.e., the extracted features perform well in recognizing the activities of some subjects but poorly for others.

Along with recent advances in internet technology and fast graphics processing units, various deep learning approaches have been increasingly and successfully adopted in animal activity recognition with wearable sensors. Classification models based on deep learning achieve automatic, data-driven feature learning and subsequent animal activity recognition. For example, feed-forward neural networks (FNNs) and long short-term memory (LSTM) models were applied to automatically recognize cattle behaviors (e.g., feeding, lying, and ruminating) using data collected from inertial measurement units (IMUs) [12,13]. Convolutional neural networks (CNNs), which accurately capture local temporal dependency and scale invariance in signals, were applied to automated equine activity classification based on triaxial accelerometer and gyroscope data [1,14,15]. FilterNet, built on CNN and LSTM architectures, was adopted to classify important health-related canine behaviors (e.g., drinking, eating, and scratching) using a collar-mounted accelerometer [16].

However, multi-modal data fusion has not been well handled when different sensors are used simultaneously in existing studies. Multi-modal data with different characteristics are often simply processed using common fusion strategies such as early fusion, feature fusion, and result fusion [17]. The early fusion strategy used in previous studies [12,13], i.e., extracting the same features without distinction of modalities, often caused interference between multimodal information due to their distribution gap [18]. The result fusion scheme was suboptimal since rich modality information was gradually compressed and lost in separate processes, ignoring the intermodality correlations. As a better choice, the feature fusion strategy fuses the intermediate information of multiple modalities, which avoids the distribution gap problem and achieves intermodality interaction simultaneously [19,20]. However, feature fusion is often limited to linear fusion (e.g., simple concatenation and addition) and fails to explore deep multi-modality interactions and achieve complementary-redundant information combinations between multiple modalities [17].

In addition, the collected sensor datasets often present class imbalance problems due to the inconsistent frequency and duration of each activity resulting from specific animal physiology. Deep learning methods trained on imbalanced datasets tend to be biased toward majority classes and away from minority classes, which easily causes poor model generalization ability and high classification error rates for rare categories [21]. Commonly used methods for imbalanced datasets mainly involve two techniques, namely, resampling and reweighting. Resampling attempts to sample the data to obtain an evenly distributed dataset, e.g., by oversampling or undersampling [22]. However, oversampling and undersampling come with high potential risks of overfitting and information loss, respectively [21]. Reweighting is more flexible and convenient, directly assigning a weight to the loss of each training sample to alleviate the sensitivity of the model to the data distribution [23]. This method is further divided into class-level and sample-level reweighting. The former, such as cost-sensitive (CS) loss [24] and class-balanced (CB) loss [25], depends on the prior category frequency, while the latter, such as focal loss [26] and adaptive class suppression (ACS) loss [27], relies on the network output confidences of each instance. In addition, CB focal loss, which combines a CB term with a modulating factor, effectively focuses on difficult samples while simultaneously considering the proportional impact of the effective number of samples per class [25].

To improve the recognition performance for equine activities while tackling the abovementioned challenges, we developed a cross-modality interaction network (CMI-Net), which achieved good classification performance in our previous work [28], and a CB focal loss [25] was adopted to supervise its training. The CMI-Net consisted of a dual CNN trunk architecture and a joint cross-modality interaction module (CMIM). Specifically, the dual CNN trunk architecture extracted modality-specific features from the accelerometer and gyroscope data, respectively, and the CMIM, based on an attention mechanism, adaptively recalibrated the importance of the elements in the two modality-specific feature maps by leveraging multi-modal knowledge. The attention mechanism has been widely utilized in different tasks using multi-modal datasets such as RGB-D images [17,29]. It has also been adopted to focus on important elements along the channel and spatial dimensions of the same input feature [30,31]. The favorable performance presented in these studies indicates the rationality of our proposed CMIM. In our method, softmax cross-entropy (CE) loss was initially used to supervise the training of CMI-Net. However, softmax CE loss suffered from inferior classification performance, especially for minority classes [23]. In contrast, CB focal loss, which adds a CB term to focal loss, focuses more on minority-class samples and hard-to-classify samples and can alleviate the class imbalance problem. Therefore, a CB focal loss [25] was also adopted. In this study, the CMI-Net was trained on an extensively labeled dataset [32] to automatically recognize equine activities, including eating, standing, trotting, galloping, walking-rider (walking while carrying a rider), and walking-natural (walking with no rider). The leave-one-out cross-validation (LOOCV) method was applied to test the generalization ability of our model, and the results were compared with those of existing algorithms. The main contributions of this paper can be summarized as follows:


#### **2. Materials and Methods**

#### *2.1. Data Description*

The dataset used in this study was a public dataset created by Kamminga et al. [32]. In this dataset, more than 1.2 million 2 s data samples were collected from 18 individual equines using neck-attached IMUs. The sampling rate was set to 100 Hz for both the triaxial accelerometer and gyroscope and 12 Hz for the triaxial magnetometer. The majority of the samples were unlabeled, but data from six equines and six activities, including eating, standing, trotting, galloping, walking-rider, and walking-natural, were labeled extensively (87,621 2 s samples in total) and were used to classify equine activities in previous studies [7,34]. In this study, the triaxial accelerometer and gyroscope data of the 87,621 samples were exploited separately, forming two tensors of size 1 × 3 × 200 for each sample. As demonstrated in Figure 1, the activities of eating, standing, trotting, galloping, walking-rider, and walking-natural occupied 18.32%, 5.84%, 28.62%, 4.50%, 38.94%, and 3.80% of the total sample number, respectively, producing a maximum imbalance ratio of 10.25. In addition, the input sample of each axis per sensor modality was normalized by removing the mean and scaling to unit variance, which can be formulated as follows:

$$S\_i' = \frac{S\_i - \mu\_i}{\sigma\_i}, \tag{1}$$

where *S<sub>i</sub>* denotes all samples of a particular axis per sensor modality (i.e., the X-, Y-, and Z-axes of the accelerometer and of the gyroscope), *S<sub>i</sub>*′ denotes the corresponding normalized samples, and *µ<sub>i</sub>* and *σ<sub>i</sub>* denote the mean and standard deviation of each axis per sensor modality, respectively.
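
For illustration, a minimal sketch of this per-axis standardization, assuming the samples of one modality have been stacked into an array of shape (N, 3, 200); the function and variable names are ours, not from the released code:

```python
import numpy as np

def standardize_per_axis(samples: np.ndarray) -> np.ndarray:
    """Standardize each axis (X, Y, Z) of one sensor modality as in Equation (1).

    samples: array of shape (N, 3, 200) -- N two-second windows,
             3 axes, 200 time steps (100 Hz x 2 s).
    """
    # Mean and standard deviation are computed per axis over all samples
    # and time steps, then applied elementwise.
    mu = samples.mean(axis=(0, 2), keepdims=True)     # shape (1, 3, 1)
    sigma = samples.std(axis=(0, 2), keepdims=True)   # shape (1, 3, 1)
    return (samples - mu) / sigma

# Example: acc and gyr would be hypothetical arrays of shape (87621, 3, 200).
# acc_norm = standardize_per_axis(acc)
# gyr_norm = standardize_per_axis(gyr)
```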

**Figure 1.** Histogram of class distribution.


#### *2.2. Cross-Modality Interaction Network*

#### 2.2.1. Dual CNN Trunk Architecture

Our proposed CMI-Net, where accelerometer and gyroscope data were fed into two CNN branches (represented by CNN<sub>acc</sub> and CNN<sub>gyr</sub>) separately, is shown in Figure 2a. The dual CNN was constructed to extract modality-specific features and concatenate these features before the final dense layer. To achieve deep interaction between the two-modality data, capture the complementary information, and suppress unrelated information, a joint CMIM was designed and inserted in the upper layer. The details are described below.

The CNN<sub>acc</sub> and CNN<sub>gyr</sub> contained four convolution blocks, three max-pooling layers, one global average-pooling layer, and one fully connected layer, followed by concatenation and one joint fully connected layer. Inspired by the residual unit in the deep residual network, which behaves like ensembles and has smaller magnitudes of responses [33], we designed a residual-like convolution block (Res-LCB) to promote the representation ability and robustness of the model, as demonstrated in Figure 2b. The definition is given below.

$$X\_{l+1} = RELU\left(Conv^{1 \times 1}(X\_l) \oplus Conv^{1 \times 3}(X\_l)\right), \tag{2}$$

where *X<sub>l</sub>* and *X<sub>l+1</sub>* denote the feature maps in the *l* and *l* + 1 layers, respectively, *Conv*<sup>1×1</sup>(•) and *Conv*<sup>1×3</sup>(•) represent 1 × 1 and 1 × 3 convolution operations, respectively, ⊕ denotes elementwise addition, and *RELU*(•) denotes the rectified linear unit activation function [35].
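
As a concrete illustration, a minimal PyTorch sketch of the Res-LCB defined in Equation (2); the 2D layout (axes × time) follows the 1 × 3 × 200 input described above, batch normalization after each convolution follows the note in Section 2.2.2, and the channel counts in the usage example are our own assumptions rather than values from the released code:

```python
import torch
import torch.nn as nn

class ResLCB(nn.Module):
    """Residual-like convolution block: RELU(Conv1x1(X) + Conv1x3(X)), Eq. (2)."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # 1x1 branch and 1x3 branch; each convolution is followed by batch normalization.
        self.conv1x1 = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=(1, 1)),
            nn.BatchNorm2d(out_channels),
        )
        self.conv1x3 = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=(1, 3), padding=(0, 1)),
            nn.BatchNorm2d(out_channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Elementwise addition of the two branches, then ReLU.
        return self.relu(self.conv1x1(x) + self.conv1x3(x))

# Example: a batch of accelerometer windows of size 1 x 3 x 200 (channels x axes x time).
# x = torch.randn(256, 1, 3, 200)
# out = ResLCB(in_channels=1, out_channels=16)(x)   # -> (256, 16, 3, 200)
```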

**Figure 2.** The architecture of our proposed cross-modality interaction network (CMI-Net). (**a**) Our proposed CMI-Net. The size of the feature maps is marked after every residual-like convolution block (Res-LCB) layer. Here, "A" and "G" denote the modality-specific features for the accelerometer and gyroscope, respectively, and "*A*′" and "*G*′" denote the refined features after modality interaction. "GAP" and "FC" are the global average-pooling layer and fully connected layer, respectively. (**b**) Res-LCB and (**c**) cross-modality interaction module (CMIM).
#### 2.2.2. Cross-Modality Interaction Module

Inspired by the multi-modal transfer module that recalibrates channel-wise features of each modality based on multi-modal information [36] and the convolutional block attention module that focuses on the spatial information of the feature maps [30], we devised a CMIM based on an attention mechanism to adaptively recalibrate temporal- and axis-wise features in each modality by utilizing multi-modal information. The detailed CMIM is illustrated in Figure 2c.

Let *A* ∈ *R*<sup>*C*×*H*×*W*</sup> and *G* ∈ *R*<sup>*C*×*H*×*W*</sup> represent the features at a given layer of CNN<sub>acc</sub> and CNN<sub>gyr</sub>, respectively. Here, *C*, *H*, and *W* denote the channel number and spatial dimensions of the features. Specifically, *H* and *W* correspond to the axial and temporal signals, respectively. The CMIM receives *A* and *G* as input features. We first applied average-pooling operations along the channels of the input features, generating two spatial maps. These two maps were then concatenated and mapped into a joint representation *Z* ∈ *R*<sup>*C*′×*H*×*W*</sup>. The operation was as follows:

$$Z = RELU\left(Conv^{1 \times 3}([Avgpool(A), Avgpool(G)])\right), \tag{3}$$

where *C*′ denotes the channel number of the feature *Z*, *Avgpool*(•) denotes the average-pooling operation, and [•] denotes the concatenation operation. Furthermore, two spatial attention maps *A<sub>A</sub>* ∈ *R*<sup>1×*H*×*W*</sup> and *A<sub>G</sub>* ∈ *R*<sup>1×*H*×*W*</sup> were generated through two independent convolution layers with a sigmoid function *σ*(•) using the joint representation *Z*:

$$A\_A = \sigma\left(Conv^{1 \times 3}(Z)\right), \; A\_G = \sigma\left(Conv^{1 \times 3}(Z)\right), \tag{4}$$

*A<sub>A</sub>* and *A<sub>G</sub>* were then used to recalibrate the input features, generating two final refined features, i.e., *A*′ ∈ *R*<sup>*C*×*H*×*W*</sup> and *G*′ ∈ *R*<sup>*C*×*H*×*W*</sup>:

$$A' = A \otimes A\_A \oplus A, \; G' = G \otimes A\_G \oplus G, \tag{5}$$

where ⊗ denotes the elementwise multiplication. Specifically, each convolution operation under this study was followed by a batch normalization operation. The increases in channel numbers and decreases in spatial dimensions were implemented through Res-LCB and max-pooling operations, respectively.
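
A sketch of the CMIM in Equations (3)–(5), written as a PyTorch module under our own reading of Figure 2c; the joint channel number *C*′ and the use of 1 × 3 kernels in the attention branches are assumptions, not taken from the released implementation:

```python
import torch
import torch.nn as nn

class CMIM(nn.Module):
    """Cross-modality interaction module, Equations (3)-(5)."""
    def __init__(self, joint_channels: int = 8):
        super().__init__()
        # Joint representation Z from the two channel-averaged spatial maps, Eq. (3).
        self.joint = nn.Sequential(
            nn.Conv2d(2, joint_channels, kernel_size=(1, 3), padding=(0, 1)),
            nn.BatchNorm2d(joint_channels),
            nn.ReLU(inplace=True),
        )
        # Two independent convolutions producing the spatial attention maps, Eq. (4).
        self.att_a = nn.Conv2d(joint_channels, 1, kernel_size=(1, 3), padding=(0, 1))
        self.att_g = nn.Conv2d(joint_channels, 1, kernel_size=(1, 3), padding=(0, 1))

    def forward(self, a: torch.Tensor, g: torch.Tensor):
        # Average-pool along the channel dimension -> two (B, 1, H, W) spatial maps.
        z = self.joint(torch.cat([a.mean(dim=1, keepdim=True),
                                  g.mean(dim=1, keepdim=True)], dim=1))
        a_att = torch.sigmoid(self.att_a(z))
        g_att = torch.sigmoid(self.att_g(z))
        # Recalibrate each modality and keep a residual connection, Eq. (5).
        return a * a_att + a, g * g_att + g

# a, g: modality-specific feature maps of shape (B, C, H, W) from CNNacc and CNNgyr.
# a_refined, g_refined = CMIM()(a, g)
```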

#### *2.3. Optimization*

As the most widely utilized loss in the multiclass classification task, softmax CE loss was applied to optimize the parameters of CMI-Net. The formulation of softmax CE loss was defined as

$$L\_{CE}(z) = -\sum\_{i=1}^{C} y\_i \log(p\_i), \tag{6}$$

$$\text{with } p\_i = \frac{e^{z\_i}}{\sum\_{j=1}^{C} e^{z\_j}}, \tag{7}$$

where *C* and *z* = [*z*<sub>1</sub>, . . . , *z<sub>C</sub>*] are the total number of classes and the predicted logits of the network, respectively. In addition, *y<sub>i</sub>* ∈ {0, 1}, 1 ≤ *i* ≤ *C*, is the one-hot ground-truth label. However, models based on softmax CE loss often suffer from inferior classification performance, especially for minority classes, due to the imbalanced data distribution [23]. Therefore, we further introduced an effective loss function, namely, CB focal loss, to supervise the training of CMI-Net and alleviate the class imbalance problem.

CB focal loss, which added the CB term to the focal loss function, focused not only on samples of minority classes, preventing them from being overwhelmed during optimization, but also on samples that were hard to distinguish. The CB term was related to the inverse effective number of samples per class, and focal loss added a modulating factor to the sigmoid CE loss to reduce the relative loss for well-classified samples and focus more on difficult samples. The CB focal loss was presented as

$$L\_{CB\_{FL}}(z) = \frac{1}{E\_{n\_y}} L\_{FL}(z) = -\frac{1-\beta}{1-\beta^{n\_y}} \sum\_{i=1}^{C} \left(1 - p\_i^t\right)^{\gamma} \log\left(p\_i^t\right), \tag{8}$$

$$\text{with } p\_i^t = \frac{1}{1 + e^{-z\_i^t}}, \tag{9}$$

$$z\_i^t = \begin{cases} z\_i, & \text{if } i = y, \\ -z\_i, & \text{otherwise,} \end{cases} \tag{10}$$

where *n<sub>y</sub>* and *E<sub>n<sub>y</sub></sub>* represent the actual number and the effective number of samples of the ground-truth label *y*, respectively. The hyperparameter *β* ∈ [0, 1) controlled how fast *E<sub>n<sub>y</sub></sub>* grows as *n<sub>y</sub>* increases, and *γ* ≥ 0 smoothly adjusted the rate at which easy samples were down-weighted [26]. The value of *β* was set to 0.9999, and the search space of the hyperparameter *γ* was set to {0.5, 1.0, 2.0} [25] in this study. In particular, CB loss and focal loss rebalanced the loss function based on class-level and sample-level reweighting, respectively. Thus, we also utilized class-level reweighted losses, including cost-sensitive cross-entropy loss (CS\_CE loss) [24] and class-balanced cross-entropy loss (CB\_CE loss) [25], and sample-level reweighted losses, including focal loss [26] and adaptive class suppression loss (ACS loss) [27], to validate the effectiveness of the CB focal loss.
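
A minimal sketch of the CB focal loss in Equations (8)–(10); `samples_per_class` stands for the per-class training counts *n<sub>y</sub>*, and all names are ours rather than those of the released code:

```python
import torch
import torch.nn.functional as F

def cb_focal_loss(logits: torch.Tensor, targets: torch.Tensor,
                  samples_per_class: torch.Tensor,
                  beta: float = 0.9999, gamma: float = 0.5) -> torch.Tensor:
    """Class-balanced focal loss, Equations (8)-(10).

    logits:  (B, C) raw network outputs z.
    targets: (B,)   integer class labels y.
    samples_per_class: (C,) number of training samples per class n_y.
    """
    num_classes = logits.size(1)
    one_hot = F.one_hot(targets, num_classes).float()

    # z_i^t = z_i if i == y else -z_i, then p_i^t = sigmoid(z_i^t), Eqs. (9)-(10).
    z_t = torch.where(one_hot.bool(), logits, -logits)
    p_t = torch.sigmoid(z_t)

    # Focal term summed over classes, i.e., Eq. (8) without the CB weight.
    focal = -((1.0 - p_t) ** gamma * torch.log(p_t.clamp(min=1e-12))).sum(dim=1)

    # Class-balanced weight (1 - beta) / (1 - beta ** n_y) for each sample's label.
    cb_weight = (1.0 - beta) / (1.0 - beta ** samples_per_class.float())
    return (cb_weight[targets] * focal).mean()
```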

#### *2.4. Evaluation Metrics*

The comprehensive performance of the equine activity classification model was indicated by the following four evaluation metrics, defined in Equations (11)–(14). Each metric value was multiplied by 100 to reflect the differences between values more clearly.

$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}, \tag{11}$$

$$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}, \tag{12}$$

$$\text{F1-Score} = \frac{2\text{TP}}{2\text{TP} + \text{FP} + \text{FN}}, \tag{13}$$

$$\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}, \tag{14}$$

where TP, FP, TN, and FN are the number of true positives, false positives, true negatives, and false negatives, respectively. In particular, the overall precision, recall, and F1-score were calculated by using a macro-average [37].
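
For reference, the four metrics with macro-averaging can be computed, for example, with scikit-learn; this is a sketch of one possible evaluation helper, not the authors' evaluation code:

```python
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

def evaluate(y_true, y_pred):
    """Overall metrics as in Equations (11)-(14), macro-averaged and scaled by 100."""
    return {
        "precision": 100 * precision_score(y_true, y_pred, average="macro"),
        "recall":    100 * recall_score(y_true, y_pred, average="macro"),
        "f1":        100 * f1_score(y_true, y_pred, average="macro"),
        "accuracy":  100 * accuracy_score(y_true, y_pred),
    }
```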

#### *2.5. Implementation Details*

To attain subject-independent results, the LOOCV method was used, in which four subjects were chosen for training, one for validation, and one for testing each time, rotated in a circular manner. During training, an L2 regularization term with a weight decay of 0.1 was added to the loss function to avoid overfitting. An Adam optimizer with an initial learning rate of 1 × 10<sup>−4</sup> was employed, and the learning rate was decreased by a factor of 0.1 every 20 epochs. The number of epochs and the batch size were set to 100 and 256, respectively. The best model with the highest validation accuracy was saved and verified using the test data. To evaluate the classification performance of our CMI-Net, we compared it against various existing methods, including three machine learning methods (i.e., NB, DT, and SVM) and two deep learning methods used in equine activity recognition (i.e., CNN and ConvNet7) [14,15], based on the same public dataset. Specifically, the hand-crafted features used in the machine learning methods were the same as those used by Kamminga et al. [7]. To further explore the performance of our CMIM, we ran the network without CMIM and with it inserted after the 1st, 2nd, and 3rd max-pooling layers, obtaining four different variants, i.e., Variant0, Variant1, Variant2, and Variant3, respectively. The softmax CE loss was used as the loss function for all variants. All experiments were executed using the PyTorch framework on an NVIDIA Tesla V100 GPU. The developed source code will be available at https://github.com/Max-1234-hub/CMI-Net from 1 September 2021.
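
A sketch of the optimization setup described above (Adam, initial learning rate 1 × 10<sup>−4</sup>, decay by a factor of 0.1 every 20 epochs, weight decay 0.1, 100 epochs, batch size 256); the placeholder model keeps the snippet runnable and stands in for the released CMI-Net class, whose constructor we do not assume:

```python
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

# Placeholder module standing in for CMI-Net (hypothetical, for illustration only).
model = nn.Linear(10, 6)

# Adam with initial learning rate 1e-4; weight_decay implements the L2 regularization term.
optimizer = Adam(model.parameters(), lr=1e-4, weight_decay=0.1)

# Learning rate decreased by a factor of 0.1 every 20 epochs.
scheduler = StepLR(optimizer, step_size=20, gamma=0.1)

for epoch in range(100):
    # ... iterate over mini-batches of 256 samples, compute the loss
    #     (e.g., the CB focal loss sketched earlier), backpropagate,
    #     and call optimizer.step() here ...
    scheduler.step()
    # Keep the checkpoint with the highest validation accuracy (omitted).
```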

#### **3. Results and Discussion**

Overall, experiments conducted on the public dataset demonstrated that our proposed CMI-Net outperformed the existing algorithms. Ablation studies were then carried out to verify the effectiveness of CMIM and that applying the CMIM in the upper layer of CMI-Net could obtain better performance. Different loss functions were adopted to validate that CB focal loss performed better than any class-level or sample-level reweighted loss used alone, and it effectively improved the overall precision, recall, and F1-score, although the overall accuracy decreased due to the imbalanced dataset used. Furthermore, recognition performance analysis was presented to help us probe the predicted performance on each activity using our CMI-Net with CB focal loss. The details are described as follows.

#### *3.1. Comparison with Existing Methods*

The comparison results of our CMI-Net with three machine learning methods (i.e., NB, DT, and SVM) and two deep learning methods (i.e., CNN and ConvNet7) [14,15] are illustrated in Table 1. The results revealed that the CMI-Net with softmax CE loss outperformed the machine learning algorithms with higher precision, recall, F1-score, and accuracy of 79.74%, 79.57%, 79.02%, and 93.37%, respectively. The reason for this superior performance was the convolution and pooling operations in CNN, which could achieve automated feature learning and aggregate more complex and general patterns without any domain knowledge [38]. The other CNN-based method [15] obtained inferior precision of 72.07% and accuracy of 82.94% compared to DT and SVM. This result is consistent with the "No Free Lunch" theorem [39] because this CNN-based method [15] was developed using leg-mounted sensor data. In addition, our CMI-Net with softmax CE loss performed better than ConvNet7 [14], which obtained lower precision, recall, F1-score, and accuracy of 79.03%, 77.79%, 77.90%, and 91.27%, respectively. This was attributed to the ability of our architecture to effectively capture the complementary information and inhibit unrelated information of multi-modal data through deep multi-modality interaction. In addition, CMI-Net with CB focal loss (*γ* = 0.5) enabled the values of precision, recall, and F1-score to increase by 2.76%, 4.16%, and 3.92%, respectively, compared with CMI-Net with softmax CE loss. This revealed that the adoption of CB focal loss effectively improved the overall classification performance.

**Table 1.** Classification performance comparison with existing methods. The best two results for each metric are highlighted in bold.

