Article

Multihead-Res-SE Residual Network with Attention for Human Activity Recognition

by Hongbo Kang, Tailong Lv, Chunjie Yang and Wenqing Wang *
School of Automation, Xi’an University of Posts & Telecommunications, Xi’an 710100, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(17), 3407; https://doi.org/10.3390/electronics13173407
Submission received: 6 May 2024 / Revised: 22 June 2024 / Accepted: 18 July 2024 / Published: 27 August 2024
(This article belongs to the Special Issue Deep Learning-Based Object Detection/Classification)

Abstract: Human activity recognition (HAR) typically uses wearable sensors to identify and analyze the time-series data they collect, enabling recognition of specific actions. As such, HAR is increasingly applied in human–computer interaction, healthcare, and other fields, making the accurate and efficient recognition of various human activities essential. In recent years, deep learning methods have been extensively applied in sensor-based HAR, yielding remarkable results. However, complex HAR research, which involves specific human behaviors in varied contexts, still faces several challenges. To address these challenges, we propose a multi-head neural network based on the attention mechanism. This framework contains three convolutional heads, each designed using a one-dimensional CNN to extract features from sensory data. The model uses a channel attention module (the squeeze–excitation module) to enhance the representational capabilities of convolutional neural networks. We evaluated our model on two publicly available benchmark datasets, UCI-HAR and WISDM, obtaining overall recognition accuracies of 96.72% and 97.73%, respectively. The experimental results demonstrate the effectiveness of the network structure for HAR, ensuring a high level of accuracy.

1. Introduction

Human activity recognition (HAR) involves extracting meaningful features from human activity data, learning and understanding performed activities, and identifying activity types [1]. The field of HAR combines knowledge from various disciplines, including machine learning, smart biotechnology, smart wearable computing technology, and computer vision [2]. Understanding how users interact with their environment is crucial, especially in sectors like health monitoring, sports tracking, navigation and positioning, and various Internet of Things applications, making HAR a vibrant and sought-after research domain [3].
Over recent decades, a variety of HAR systems based on cameras, sensors, and Wi-Fi wireless signals, among others, have been developed. Compared with these alternatives, wearable sensors are particularly well suited to HAR owing to their robustness across test environments, their ability to function without optimal lighting, and their limited intrusion on personal privacy. The increasing prevalence of portable devices, such as smartphones and smartwatches, has led to a surge in the utilization of sensor-based methodologies for HAR applications [1]. The portability and processing capabilities of smartphones have rendered them indispensable in our daily lives. Consequently, sensor data generated by smartphone devices have emerged as a dominant research focus, exhibiting clear advantages over other sensors [4].
Traditional machine learning methods require the manual design of complex feature extraction and dimensionality reduction. Statistical or structural features, such as the mean, median, and standard deviation, are typically used, and obtaining the most relevant set of handcrafted features requires domain expertise. Although handcrafted approaches perform well on limited training datasets, they become increasingly inadequate at handling the complexity of multisensor data as the volume of data grows. Traditional methods were developed for relatively simple problems and are therefore less suited to complex data analysis [5]. When handling time-series data, traditional machine learning techniques often fail to adequately capture the relationship between time and space, relying instead on statistical models such as linear regression and autoregressive models. These models tend to ignore both the temporal and spatial dimensions, leading to unsatisfactory predictions.
With the popularity of wearable devices such as smartphones and smartwatches, the use of HAR technology has become broader and deeper. The general HAR pipeline comprises [6]: the collection of motion data using sensors, preprocessing of the data, action segmentation, feature extraction, and action classification. Most HAR research uses machine learning methods, including SVMs, Bayesian classifiers, and decision trees.
Deep learning methods have shown great potential and value in the field of human activity recognition due to their high efficiency and classification accuracy [7]. In today’s technological advancement, deep learning has become a research hotspot in the field of HAR, demonstrating its unique advantages and potential [8]. By utilizing sensor-based data, deep learning methods are able to automatically extract high-level features, greatly improving the accuracy of activity recognition. This approach not only reduces the time-consuming feature engineering phase, but also provides deep learning-based HAR systems with higher classification accuracy than traditional machine learning methods.
The major contributions of our work are as follows:
  • We propose a multi-channel neural network architecture constructed through the integration of convolutional layers and residual networks. The residual module is tailored to extract crucial features, while the incorporation of a squeeze–excitation attention module (SEM) facilitates the modeling of prolonged data sequences and the classification of activities.
  • The abundant representational capabilities of convolutional neural networks (CNNs), including weight sharing and local connectivity, are utilized for initial feature extraction at various scales through multi-head convolutional networks. The ultimate classifications are then mapped using the global average pooling (GAP) technique to improve the smoothness of the transformation and bolster the model’s robustness.
  • To enhance the effectiveness of our model, we introduced the residual module as a foundational framework for advanced action recognition, integrating the attention mechanism (SEM). This incorporation facilitates the modeling of extensive data sequences, the efficient extraction of spatio-temporal features, and the classification of complex human activities.
  • To evaluate the robustness and generalization capability of our model, we conducted experiments on two publicly accessible datasets: UCI-HAR and WISDM. Our model exhibited superior performance compared to baseline experimental models, significantly enhancing the recognition accuracy of analogous behaviors.
The remainder of this paper is organized as follows: Section 2 furnishes an overview of the related literature pertaining to human activity recognition (HAR) and deep learning methodologies. Section 3 delineates the architecture of the proposed HAR framework. Section 4 expounds upon the two public datasets employed in the experiments, the experimental setup, and the results. Section 5 encapsulates conclusions and deliberates on prospective avenues for future research.

2. Related Work

Human activity recognition (HAR) has evolved into a demanding research subject that seeks to identify a person’s actions from relevant activity data. Initially, researchers explored HAR using machine learning approaches. For example, Margarito et al. [9] mounted accelerometers on respondents’ wrists and collected acceleration data to categorize typical physical activities. Indeed, the bulk of HAR investigations have used machine learning approaches. However, these algorithms usually require human assistance to evaluate and categorize data. To address this, Ronao and Cho [10] devised a model with alternating convolutional and pooling layers that extracts characteristics from raw sensor data automatically and effectively, allowing for more accurate predictions of human activity.
Deep learning offers remarkable capabilities for self-learning and adaptation, making it a powerful tool for automatically extracting high-level features, a task that poses challenges for traditional machine learning approaches [11]. This inherent adaptability empowers deep learning methods to efficiently tackle complex real-world problems. The field of human activity recognition (HAR) has seen a surge in the adoption of deep learning techniques, yielding impressive results. Zeng et al. introduced a convolutional neural network (CNN)-based approach that effectively captures local dependencies and scale invariance within signals. Their method includes a novel partial weighting technique tailored for accelerometer signals, resulting in further performance enhancements [12]. Yang et al. [13] employed a 1D convolution (Conv1D) technique within a shared time window to consolidate and distribute weights across multiple sensor data streams. In [14], the authors designed a mobile phone sensor-based HAR model using a CNN. Wang and Liu [15] proposed a hierarchical Long Short-Term Memory (LSTM) architecture for human activity recognition, leveraging the temporal dynamics inherent in sequential data. Furthermore, CNNs have been extensively utilized to extract temporal features in HAR, often yielding superior performance. Bianchi et al. [16] introduced a CNN model comprising four convolutional layers and one fully connected layer, achieving notable results in HAR tasks.
Mohib et al. [17] proposed a novel approach for human activity recognition (HAR) by employing a stacked Long Short-Term Memory (LSTM) network, utilizing smartphone sensor data as input. Building upon this, other researchers have explored hybrid architectures by combining convolutional neural networks (CNNs) with LSTM or Bidirectional LSTM (BiLSTM) units to enhance the capabilities of deep learning models for HAR tasks. In a recent development, a multi-class wearable user recognition framework was introduced, showcasing the effectiveness of a hybrid CNN-LSTM model [18]. This innovative fusion model outperformed both standalone CNN and LSTM models, highlighting the synergistic benefits of combining these architectures. Chen et al. [19] utilized a one-dimensional CNN-LSTM network to extract features from lengthy sequences of data, incorporating an attention mechanism for improved performance. This strategy enables the model to focus on relevant information within the input sequences, enhancing feature extraction and classification accuracy. Furthermore, Challa et al. [20] proposed a hybrid network architecture that integrates convolutional neural networks (CNNs) with Bidirectional Long Short-Term Memory (BiLSTM) units. Their approach aims to boost the accuracy and efficiency of HAR recognition tasks. Experimental results validate the significant performance gains achieved by this hybrid model compared to traditional machine learning algorithms, underscoring the effectiveness of deep learning techniques in HAR applications.
The attention mechanism is a sequence modeling approach that captures important information in long sequences by simulating how humans pick out information of interest from a sequence. In contrast to RNNs, the attention mechanism enables the model to capture the dependencies between all positions in a sequence, which is especially important for handling complex sequences. Ma et al. [21] proposed the AttnSense model, which combines attention with CNN and RNN networks to capture the dependencies of sensed signals in the spatial and temporal domains. Temporal attention was first applied to the hidden layer of the LSTM to highlight relevant features, and multiple sensor modalities were then fused according to their importance. The Transformer model incorporates a multi-head attention mechanism; its main concept is to utilize self-attention to capture contextual associations between different positions in the input sequence and to assign higher weights to significant information. Khan and Ahmad [22] proposed a framework of three lightweight convolutional heads, each using a one-dimensional CNN to extract features from sensed data. Their lightweight multi-head model introduces an attention mechanism based on a squeeze–excitation module to enhance the representational capabilities of the CNNs, allowing for the automatic selection of salient features and the suppression of unimportant ones. Zhang et al. [23] proposed a Bidirectional Long Short-Term Memory network (BiLSTM) based on an attention mechanism and residual connectivity for human activity recognition. By combining a one-dimensional convolutional neural network (1DCNN), ResBLSTM, and the attention mechanism, the method effectively captures the temporal features of action sequences and improves the recognition accuracy of similar and transitional actions.
These approaches provide effective strategies for human activity recognition based on wearable sensing devices. To address the challenges associated with complex HAR, we utilize a deep neural network that combines a residual network and an attention module. This paper proposes an approach that incorporates attention into the system through a squeeze-and-excitation mechanism. Experimental results on public datasets demonstrate the effectiveness and stability of this approach.

3. Proposed Architectures

3.1. Network Infrastructure

This paper presents a basic deep learning architecture that aims to solve complex HAR problems. The first step is to collect human activity data with sensors. In the second step, the data from the smartphone sensors are segmented using a sliding window of fixed size, and the resulting sequences are input into the attention-equipped CNN model to obtain an effective feature representation. In the third step, the processed data are fed into the proposed neural network model for feature extraction and classification using a classification layer activated by softmax. Finally, model performance is evaluated by standard metrics such as accuracy, precision, and F1 score. Figure 1 shows the overall structure of the framework.
The proposed Multi-Res-SE model is composed of three main parts. The first part consists of three parallel initial blocks. The second part consists of three residual and squeeze–excitation (Res-SE) blocks connected in series with each initial block. The third component is the classification module, which is responsible for classifying human activities using softmax classifiers.
The model’s head comprises three convolutional heads, each utilizing one convolutional block. The convolutional block uses Conv1D to extract features from the input time-series data. The convolutional layer is followed by a batch normalization (BN) layer and a rectified linear unit (ReLU) activation function, which helps to reduce the problem of vanishing or exploding gradients and to mitigate overfitting. The convolutional output is transferred to the following residual network layer. Introducing branching into the Res-SE structure effectively alleviates the common problems of information loss and gradient vanishing. The Res-SE module comprises several layers, including convolution, BN, ReLU, the squeeze–excitation attention module (SEM), and a bypass connection. The output is finally passed through a softmax layer to obtain the classification result.
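As a rough illustration, the following Keras sketch shows how such a three-head front end could be assembled, using the kernel sizes (3, 5, 7), filter count (64), stride (1), and pooling stride (2) reported in Section 4.3; any layer ordering beyond what is described above is our assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_head(inputs, kernel_size):
    """One convolutional head: Conv1D -> BN -> ReLU -> max pooling."""
    x = layers.Conv1D(64, kernel_size, strides=1, padding="same")(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.MaxPooling1D(pool_size=2, strides=2)(x)
    return x

# Input shape for the UCI-HAR setting: 128 time steps, 9 raw feature channels.
inputs = tf.keras.Input(shape=(128, 9))
heads = [conv_head(inputs, k) for k in (3, 5, 7)]  # three parallel heads
```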

3.2. Residual Module

As networks become deeper, research has focused on overcoming barriers to information and gradient transmission. The Microsoft research team constructed residual networks [24]. The main idea is that optimizing the residual mapping is easier than optimizing the original, unreferenced mapping. The identity shortcut of the residual network is parameter-free and is never closed, so all information passes through it, with an additional residual function to be learned.
Residual networks have an important advantage in that they are easier to train. This is because the gradient can pass more directly through the layers via the additive shortcut, bypassing layers that would otherwise constrain it. This enables better training of deeper networks, since the residual connections do not hinder the gradient while still contributing to refining the output of the stacked layers they span. At the beginning of a set of residual connections there is a bottleneck where the next layer is no longer residual. BN layers are often used to normalize and restrict the feature space represented by the layer.
ResNet addresses this challenge by integrating shortcut residual connections between convolutional layers, thereby facilitating the flow of gradients throughout the network. Based on this, a lightweight residual module is constructed, which comprises two convolutional layers. The BN layer [25] and ReLU [26] layer follow the first convolutional layer, while the SEM follows the second convolutional and BN layers. Its structure is shown in Figure 2, and can be defined as
$$y_{h+1} = F(y_h, \sigma_h) + E(y_h),$$
where $F$ is the residual function, $\sigma_h = \{\sigma_{jk}^{h} \mid 1 \le j \le n,\; 1 \le k \le m\}$ is the set of two-layer weights associated with the $h$-th residual block, and $\sigma_{jk}^{h}$ is the weight from the $k$-th neuron in the first layer to the $j$-th neuron in the second layer. The function $E$ represents an identity mapping. Each block accepts an input $y_h$ and produces a forward output $y_{h+1}$.
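A minimal Keras sketch of this block is given below, following the layer ordering in Figure 2 (the SEM that follows the second BN layer is sketched separately in Section 3.3); the 1 × 1 projection used when the shortcut’s channel count does not match is a common convention and an assumption here.

```python
from tensorflow.keras import layers

def residual_block(y_h, filters=(32, 64)):
    """Sketch of y_{h+1} = F(y_h, sigma_h) + E(y_h).
    F: two Conv1D layers (kernel 5, stride 1; 32 then 64 filters, per Section 4.3),
    with BN + ReLU after the first and BN after the second.
    E: the shortcut mapping."""
    f = layers.Conv1D(filters[0], 5, strides=1, padding="same")(y_h)
    f = layers.BatchNormalization()(f)
    f = layers.ReLU()(f)
    f = layers.Conv1D(filters[1], 5, strides=1, padding="same")(f)
    f = layers.BatchNormalization()(f)
    # Shortcut E(y_h): identity when channel counts match; otherwise a 1x1
    # convolution projects the input to the right width (our assumption).
    e = y_h if y_h.shape[-1] == filters[1] else layers.Conv1D(filters[1], 1)(y_h)
    return layers.Add()([f, e])
```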

3.3. Attention Mechanisms

The attention mechanism was initially developed to enable selective attention to distinct spatial regions within an image, in contrast to the earlier approach of processing the entire image indiscriminately, thereby enhancing a model’s ability to recognize and comprehend intricate visual content. In recent years, it has been applied in many fields and has demonstrated positive results in various scenarios [27,28]. Al-Qaness et al. [29] proposed a human activity recognition algorithm based on the Multi-ResAtt model, which introduces an attention mechanism to consider the temporal context before and after each activity, effectively improving the accuracy of activity recognition. Wang et al. [30] proposed a two-channel network model based on residual blocks, an efficient channel attention module (ECA) [31], and a gated recurrent unit (GRU). The model can capture long-term data sequences, effectively extract spatio-temporal features, and demonstrates the effectiveness of different sensor data combinations in human activity recognition.
Based on this concept, an attention network is used that focuses on key features and filters out important representations for recognition. The attention layer’s task is to perform this filtering. A network with an attention mechanism is better equipped to process more information and save computational costs, allowing for faster learning compared to a network without one. To improve existing intelligent activity recognition methods, we consider combining this attention with CNNs.
Hu et al. [32] proposed the Squeeze-and-Excitation module in 2017, which aims to increase the representational power of the network by modeling the relationships between channels. SE is a computationally efficient architectural unit that enables the network to recalibrate features by enhancing optimal ones and suppressing unnecessary ones.
Given the input feature map $T \in \mathbb{R}^{C \times H \times W}$, the SEM generates channel statistics using global average pooling. Formally, $T$ is squeezed over the spatial dimensions $H \times W$ to generate the statistic $Z \in \mathbb{R}^{C}$, whose $c$-th element is computed as
$$z_c = F_{sq}(T_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} T_c(i, j),$$
where $F_{sq}$ is the squeeze function, and $H$ and $W$ represent the height and width of the input feature map. The aggregated information obtained from the squeeze operation is then passed to the excitation operation, a simple gating mechanism with a sigmoid activation function:
$$s = F_{ex}(Z, W) = \sigma(g(Z, W)) = \sigma(W_2\, \delta(W_1 Z)),$$
where $\delta$ denotes the ReLU activation function, $\sigma$ is the sigmoid activation function, $W_1 \in \mathbb{R}^{C/r \times C}$, $W_2 \in \mathbb{R}^{C \times C/r}$, and $r$ is the reduction ratio.
The final output of the block is obtained by rescaling $u$ with the activations $s$ as
$$\tilde{x}_c = F_{scale}(u_c, s_c) = s_c \cdot u_c,$$
where $F_{scale}(u_c, s_c)$ denotes the channel-wise multiplication between the scalar $s_c$ and the feature map $u_c \in \mathbb{R}^{H \times W}$, and $\tilde{X} = [\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_C]$.
Figure 3 shows the SEM, which consists of two parts: a squeeze function that aggregates the summary information of each feature map, and an excitation function that modulates the relevance of each feature map according to its size. Squeezing extracts only the most relevant information from each channel using global average pooling.
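The following Keras sketch implements the squeeze and excitation steps defined above for 1D feature maps (time steps × channels), as used in this model; the reduction ratio r = 16 follows the original SE paper [32] and is an assumption here, since the value used in this network is not stated.

```python
from tensorflow.keras import layers

def se_module(t, reduction_ratio=16):
    """Squeeze-and-excitation over a 1D feature map t of shape (time, channels).
    Squeeze: global average pooling yields z in R^C.
    Excitation: s = sigmoid(W2 * relu(W1 * z)).
    Scale: each channel of t is rescaled by its activation s_c."""
    channels = t.shape[-1]
    z = layers.GlobalAveragePooling1D()(t)                               # squeeze
    s = layers.Dense(channels // reduction_ratio, activation="relu")(z)  # W1, delta
    s = layers.Dense(channels, activation="sigmoid")(s)                  # W2, sigma
    s = layers.Reshape((1, channels))(s)
    return layers.Multiply()([t, s])                                     # rescale
```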

4. Experimental Results

4.1. Datasets

The University of California Irvine Human Activity Recognition (UCI-HAR) dataset [33] was released by the Laboratory of Nonlinear Complex Systems, University of Genoa, Italy. The dataset was collected from 30 healthy volunteers aged between 19 and 48 years. A smartphone was attached to the waist of each participant during the data collection process. The data were sampled at a frequency of 50 Hz and processed using a sliding window method with a window length of 128 and an overlap rate of 50%. The dataset consists of 10,299 samples, each containing 128 data points, and covers three types of data: acceleration, linear acceleration, and angular velocity. The training set constitutes 70% of the data, while the remaining 30% forms the test set. This division ensures a thorough evaluation of the model’s performance on unseen data.
The Wireless Sensor Data Mining (WISDM) dataset is a publicly available resource for human activity recognition. It was published by the Wireless Sensor Data Mining Laboratory [34], and contains sensor data from 36 volunteers moving in a specific environment. The participants placed their mobile phones in their trouser pockets while performing six actions. The data were collected using the smartphones’ tri-axial accelerometers, yielding a total of 1,098,207 samples at a sampling frequency of 20 Hz. The data were segmented using the sliding window method with a window length of 128 and an overlap rate of 50%. This segmentation yielded 17,158 data samples, each comprising 128 data points. The dataset was divided into an 80% training set and a 20% test set for model training and testing. This division ensures that the model can be evaluated on previously unseen data and generalizes well to new instances.

4.2. Pre-Processing

Due to the non-uniform distribution inherent in the raw data, data normalization adjusts each channel to zero mean and unit standard deviation. Moreover, meticulous attention was devoted to data segmentation to bolster the efficacy of the proposed model. The raw data form a time series of diverse user activities, but in practical applications the data of the user being predicted are unknown in advance. After segmenting the raw data with a sliding window, the dataset is randomly divided into training and test sets in a fixed proportion. As a result, samples from the same user’s activity may appear in both the training and test sets, and dividing the dataset in this way improves the accuracy of the proposed model.
To use the data for feature extraction and activity recognition, the time series signals must first be split into sequences based on the window size. In this approach, a fixed-length sliding window is employed due to its technical simplicity and computational efficiency. In the case of the UCI-HAR dataset, the length of the sliding window was set to 128, while for the WISDM dataset, the length of the sliding window was set to 90, with an overlap of 50% in both cases.
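A minimal NumPy sketch of this preprocessing, assuming per-channel z-score normalization over the whole recording followed by fixed-length windowing, is shown below; the array `raw` is a hypothetical placeholder for the loaded sensor signal.

```python
import numpy as np

def sliding_windows(signal, window=128, overlap=0.5):
    """Split a (timesteps, channels) signal into fixed-length windows.
    window=128 matches the UCI-HAR setting; the text uses 90 for WISDM."""
    step = int(window * (1 - overlap))
    count = (len(signal) - window) // step + 1
    return np.stack([signal[i * step : i * step + window] for i in range(count)])

raw = np.random.randn(10000, 9)                          # placeholder sensor data
normalized = (raw - raw.mean(axis=0)) / raw.std(axis=0)  # zero mean, unit std
segments = sliding_windows(normalized)                   # (num_windows, 128, 9)
```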

4.3. Testbench

The experiments were conducted on a Windows 10 Professional 64-bit operating system. The hardware configuration included an Intel Core i7-7700HQ CPU @ 2.80 GHz and 16 GB of RAM. In terms of software, the models in this paper were developed and trained using the Keras deep learning framework with Python 3.9 on TensorFlow 2.7.4.
The network model built in this paper is shown in Figure 1, where the sensor data are preprocessed and fed into the model for training. The input samples are initially processed by the convolutional layer of the channel, which accepts three-dimensional input data. The first dimension represents the number of samples, the second dimension represents the size of the sliding window (128), and the third dimension represents the number of original features (nine for the UCI-HAR dataset and three for the WISDM dataset). The three convolutional layers have kernel sizes of 3, 5, and 7, a stride of 1, and 64 filters. The P_Stride of the Max Pooling layer is set to 2. The residual block’s two convolutional layers have a kernel size of 5 and a stride of 1. The first convolutional layer has 32 filters, while the second has 64.
Hyperparameter optimization is a critical component of deep learning and machine learning projects that ensures optimal model performance. It is an iterative and trial-and-error procedure that seeks to improve a model’s performance by fine-tuning its internal configuration settings. These internal configuration settings, such as the learning rate, batch size, and regularization factor, have a substantial influence on the model’s training pace and overall performance. Importantly, hyperparameter optimization is a complicated operation whose efficiency is influenced by a number of parameters, including dataset properties, model design, and computing resource constraints. As a result, in actual applications, we must pick and adapt the optimization approach based on the individual scenario in order to attain the optimum hyperparameter configuration and model performance. To determine the ideal hyperparameter combinations, we employed a large number of parameter combinations that encompassed a wide variety of search spaces and parameter ranges, and looked to the literature [22,23,35] for hyperparameter settings on the same datasets.
By testing and comparing model performance under various configurations, we may conclude, to some extent, that the following hyperparameter combinations yield outstanding performance. The proposed model used the following hyperparameters: (i) epochs, (ii) batch size, (iii) learning rate, (iv) optimizer, and (v) loss function. Due to memory constraints, the training batch size was set to 240 for the UCI-HAR dataset and 260 for the WISDM dataset, and the number of epochs was set to 200 in both cases. An early stopping callback terminated training if no improvement in the validation loss was observed for 30 epochs. The learning rate was initially set to 0.001, and the Adam optimizer was used to reduce the error. Cross-entropy was used as the loss function to measure the difference between the model predictions and the true labels.
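Putting these settings together, a Keras sketch of the training configuration might look as follows; the stand-in model and data are placeholders for the Multi-Res-SE network and the windowed sensor data, and the validation split is an assumption, since the paper does not state one.

```python
import numpy as np
import tensorflow as tf

# Stand-in model and data so the configuration below runs end to end.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128, 9)),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(6, activation="softmax"),
])
x_train = np.random.randn(480, 128, 9).astype("float32")
y_train = tf.keras.utils.to_categorical(np.random.randint(0, 6, 480), 6)

# As reported: Adam with learning rate 0.001, cross-entropy loss, batch size 240
# (UCI-HAR), up to 200 epochs, early stopping after 30 stagnant epochs.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="categorical_crossentropy",   # assumes one-hot activity labels
    metrics=["accuracy"],
)
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=30)
model.fit(
    x_train, y_train,
    batch_size=240, epochs=200,
    validation_split=0.1,   # validation fraction not stated; an assumption
    callbacks=[early_stop],
)
```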

4.4. Evaluation Metrics

Common evaluation metrics for classification models include accuracy, precision, recall, and F1-Score. These metrics will be used to evaluate the proposed model.
The experiment’s results are evaluated based on recognition accuracy and F1-Score. Overall recognition accuracy represents the percentage of correctly recognized samples out of the total number of samples, while per-activity recognition accuracy represents the percentage of correctly recognized samples in each category out of the total number of samples in that category. The F1-Score is the harmonic mean of precision and recall.
Accuracy: the ratio of the number of samples correctly classified by the classifier to the total number of samples in a given test dataset.
$$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$$
where TP denotes the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives. Precision: the ratio of the number of correctly identified positive samples to the total number of samples identified as positive.
$$\text{Precision} = \frac{TP}{TP + FP}$$
Recall: also known as sensitivity, it is the proportion of actual positive samples that are correctly classified as positive.
$$\text{Recall} = \frac{TP}{TP + FN}$$
F1-Score: a measure of a model’s accuracy on a dataset, used to evaluate classification systems; it is the harmonic mean of precision and recall.
$$\text{F1-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
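For reference, these metrics can be computed with scikit-learn as sketched below; the labels are hypothetical, and macro averaging over activity classes is an assumption, since the paper does not state its averaging scheme.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 1, 2, 2, 1, 0]   # hypothetical ground-truth activity labels
y_pred = [0, 1, 2, 1, 1, 0]   # hypothetical model predictions

# "macro" treats each activity class equally regardless of class frequency.
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, average="macro"))
print("Recall   :", recall_score(y_true, y_pred, average="macro"))
print("F1-Score :", f1_score(y_true, y_pred, average="macro"))
```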

4.5. Analysis of Results

As neural networks exhibit a degree of randomness, this paper presents the final results of the model as an average of several experiments. The proposed model was tested on two public datasets to evaluate its effectiveness. The accuracy of the model in various types of activity recognition is shown in Table 1.
To provide a clearer representation of the proposed network model’s performance on the UCI-HAR and WISDM datasets, separate confusion matrix plots were created to display their recognition results. Both confusion matrices have six rows and six columns. The left-hand side represents the true activity categories of the sample, while the bottom represents the predicted activity categories of the sample. The confusion matrix displays the number of correctly classified samples in the upper half of the squares on the main diagonal, while the lower half shows the percentage accuracy of correct predictions. The integer values in the upper half of the remaining squares indicate the number of misclassified samples, while the percentage values in the lower half indicate the error rate in identifying other activities.
For the UCI-HAR dataset, the classification results in Table 1 show that the proposed model performs well, achieving more than 93% accuracy on each of the six activities, exceeding all of the comparative experimental models and surpassing the non-attentive residual model by 4%. The proposed model demonstrates maximum improvements of 4.59% in overall recognition accuracy and 3.56% in F1 score over the comparative models.
The confusion matrix is given in Figure 4. There was a high rate of misclassification of seated activities, with 5.1% misclassified as standing and 4.5% of standing cases predicted as seated. The reason may be that sitting and standing are very similar static activities. Moreover, it is evident that the proposed model accurately identifies the majority of downstairs activity, with only one instance being misclassified as walking activity. However, upstairs activities have a significantly higher rate of misclassification at 3.4%, compared to downstairs activities.
On the WISDM dataset, the proposed network model achieves an accuracy of more than 94% for all six activities, which is 5% higher than the non-attentive residual model, and provides maximum improvements of 5.64% in overall recognition accuracy and 6.38% in F1 score over the comparative experimental models. Although sitting and standing are both static activities and occur less frequently than other activities, our model accurately classifies most instances of both. We also observe from the confusion matrix that our model successfully recognizes most jogging and walking instances. However, the misclassification rate was higher for the dynamic activities of ascending and descending stairs than for other activities. Figure 5 shows that 3.5% of downstairs instances were misclassified as upstairs, and 5.8% of upstairs instances were confused with downstairs activities. There are several misclassifications between walking and jogging due to their striking similarity in lower-limb motions.
As shown in Table 2, we begin by comparing the proposed model with baseline classification models (CNN and LSTM), and then with two hybrid deep learning approaches: CNN-LSTM and Res-LSTM. Both the recognition accuracy and the F1 score of our model were better than those of the other models, with our model achieving the highest accuracy (97.73%) and the highest F1-score (97.61%). Adding the SEM to the residual network improved the overall recognition accuracy and F1 score on this dataset by 1.63% and 1.71%, respectively, compared to the non-attentive residual network Res-LSTM. The results demonstrate that the network model presented in this paper exhibits superior stability across all activity recognition results compared to the comparative experimental models, and is more effective at recognizing easily confused activity categories.
Overall, the proposed multi-head model, which is built on a CNN architecture and includes an attention mechanism in each head, efficiently extracts optimal characteristics from sensor data. As a consequence, it achieves greater accuracy on the UCI-HAR and WISDM datasets than non-attentive models and other cutting-edge HAR approaches.

4.6. Ablation Studies

4.6.1. Impact of Residual Blocks

To investigate the impact of the added residual blocks on model performance, we performed an ablation experiment on both the UCI-HAR and WISDM datasets.
The experiment uses a CNN without residual blocks as the baseline architecture. The results are shown in Table 3. The plain CNN has the lowest recognition performance, which may be attributed to information loss that prevents it from successfully capturing spatial relationships. In contrast, the model using the residual linkage module performs better: the F1-Scores on the UCI-HAR and WISDM datasets improved by 1.92% and 2.15%, respectively.

4.6.2. Effect of the SEM

We ran an additional ablation experiment on our proposed model to see how well the SEM captures spatial patterns in time-series data. In this experiment, we employed a modified network design with no SEM as the baseline. As shown in Table 4, the F1 scores obtained with the SEM improved by 1.89% and 1.58% on the two datasets, respectively.
In summary, a multi-channel CNN network model based on residual blocks and squeeze–excitation modules can effectively extract optimal features from sensors. Compared with other state-of-the-art HAR methods, higher F1-Scores were obtained for the UCI-HAR and WISDM datasets.

5. Conclusions and Future Works

This paper introduces a multi-head network model for human activity recognition based on residual networks. The residual connectivity ensures the effectiveness of information transfer, which enables improved recognition of similar human activities. The addition of attention (the SEM) allows for a simpler and lighter network structure capable of modeling long-term time-series data; the attention mechanism enables the automatic extraction of relevant information while disregarding irrelevant details. The model extracts spatial features from the input data using a one-dimensional convolutional neural network, which applies convolutional operations to capture patterns and structures in the data and uses pooling layers with appropriate strides to reduce the length of the time series while retaining relevant information. To evaluate our approach, experiments were conducted on two benchmark datasets: UCI-HAR and WISDM. The experiments show that, compared to other non-attentive networks, our proposed model with attention performs well on both public datasets, further improving the model’s ability to distinguish between different actions.
The approach suggested in this study is primarily intended for sensor-based HAR, whereas lightweight real-time HAR methods based on computer vision and WiFi require additional investigation. In the future, we intend to apply these modules to HAR in specific settings, with a focus on lowering computational complexity and increasing real-time interaction. Our future work will refine the model by optimizing hyperparameters to decrease the model’s size and computational demands, and we plan to integrate spatial and channel attention mechanisms into the CNN to further enhance recognition accuracy. This line of research aims to explore the design of shorter and more efficient deep learning architectures, which will help to minimize computational and parameter overheads. Additionally, we will investigate more effective methods of parameter tuning. Although grid search is feasible, the search range must be manually adjusted and the candidate values are fixed; to make the search process and network architecture adaptive, automatic adjustments such as reshaping, adding, and removing layers should be considered. Performance improvements can also be achieved by developing more complex yet efficient lightweight deep learning networks and innovative data representations based on time-frequency analysis.

Author Contributions

Conceptualization, T.L.; methodology, T.L.; software, T.L.; validation, T.L.; investigation, H.K.; resources, C.Y.; data curation, W.W.; writing—original draft preparation, T.L.; writing—review and editing, W.W., H.K. and C.Y.; data visualization, T.L.; supervision, W.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Shaanxi Provincial Science and Technology Department (Grant No. 2018ZDXM-GY-039).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The UCI-HAR dataset used in this work is available at https://archive.ics.uci.edu/dataset/240/human+activity+recognition+using+smartphones, accessed on 5 April 2024. The WISDM dataset used in this work is available at https://archive.ics.uci.edu/dataset/507/wisdm+smartphone+and+smartwatch+activity+and+biometrics+dataset, accessed on 10 April 2024.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Nweke, H.F.; Teh, Y.W.; Al-Garadi, M.A.; Alo, U.R. Deep learning algorithms for human activity recognition using mobile and wearable sensor networks: State of the art and research challenges. Expert Syst. Appl. 2018, 105, 233–261. [Google Scholar] [CrossRef]
  2. Sousa Lima, W.; Souto, E.; El-Khatib, K.; Jalali, R.; Gama, J. Human activity recognition using inertial sensors in a smartphone: An overview. Sensors 2019, 19, 3213. [Google Scholar] [CrossRef]
  3. Kim, Y.W.; Joa, K.L.; Jeong, H.Y.; Lee, S. Wearable IMU-based human activity recognition algorithm for clinical balance assessment using 1D-CNN and GRU ensemble model. Sensors 2021, 21, 7628. [Google Scholar] [CrossRef] [PubMed]
  4. Cornacchia, M.; Ozcan, K.; Zheng, Y.; Velipasalar, S. A survey on activity detection and classification using wearable sensors. IEEE Sens. J. 2016, 17, 386–403. [Google Scholar] [CrossRef]
  5. Sena, J.; Barreto, J.; Caetano, C.; Cramer, G.; Schwartz, W.R. Human activity recognition based on smartphone and wearable sensors using multiscale DCNN ensemble. Neurocomputing 2021, 444, 226–243. [Google Scholar] [CrossRef]
  6. Yousefi, B.; Loo, C.K. Biologically-inspired computational neural mechanism for human action/activity recognition: A review. Electronics 2019, 8, 1169. [Google Scholar] [CrossRef]
  7. Gil-Martín, M.; San-Segundo, R.; Fernandez-Martinez, F.; Ferreiros-López, J. Improving physical activity recognition using a new deep learning architecture and post-processing techniques. Eng. Appl. Artif. Intell. 2020, 92, 103679. [Google Scholar] [CrossRef]
  8. Gao, W.; Zhang, L.; Teng, Q.; He, J.; Wu, H. DanHAR: Dual attention network for multimodal human activity recognition using wearable sensors. Appl. Soft Comput. 2021, 111, 107728. [Google Scholar] [CrossRef]
  9. Margarito, J.; Helaoui, R.; Bianchi, A.M.; Sartor, F.; Bonomi, A.G. User-independent recognition of sports activities from a single wrist-worn accelerometer: A template-matching-based approach. IEEE Trans. Biomed. Eng. 2015, 63, 788–796. [Google Scholar] [CrossRef]
  10. Ronao, C.A.; Cho, S.B. Human activity recognition with smartphone sensors using deep learning neural networks. Expert Syst. Appl. 2016, 59, 235–244. [Google Scholar] [CrossRef]
  11. Demrozi, F.; Pravadelli, G.; Bihorac, A.; Rashidi, P. Human activity recognition using inertial, physiological and environmental sensors: A comprehensive survey. IEEE Access 2020, 8, 210816–210836. [Google Scholar] [CrossRef]
  12. Zeng, M.; Nguyen, L.T.; Yu, B.; Mengshoel, O.J.; Zhu, J.; Wu, P.; Zhang, J. Convolutional Neural Networks for human activity recognition using mobile sensors. In Proceedings of the 6th International Conference on Mobile Computing, Applications and Services, Austin, TX, USA, 6–7 November 2014; pp. 197–205. [Google Scholar]
  13. Yang, J.; Nguyen, M.N.; San, P.P.; Li, X.; Krishnaswamy, S. Deep convolutional neural networks on multichannel time series for human activity recognition. In Proceedings of the IJCAI, Buenos Aires, Argentina, 25–31 July 2015; Volume 15, pp. 3995–4001. [Google Scholar]
  14. Wan, S.; Qi, L.; Xu, X.; Tong, C.; Gu, Z. Deep learning models for real-time human activity recognition with smartphones. Mob. Netw. Appl. 2020, 25, 743–755. [Google Scholar] [CrossRef]
  15. Wang, L.; Liu, R. Human activity recognition based on wearable sensor using hierarchical deep LSTM networks. Circuits Syst. Signal Process. 2020, 39, 837–856. [Google Scholar] [CrossRef]
  16. Bianchi, V.; Bassoli, M.; Lombardo, G.; Fornacciari, P.; Mordonini, M.; De Munari, I. IoT wearable sensor and deep learning: An integrated approach for personalized human activity recognition in a smart home environment. IEEE Internet Things J. 2019, 6, 8553–8562. [Google Scholar] [CrossRef]
  17. Ullah, M.; Ullah, H.; Khan, S.D.; Cheikh, F.A. Stacked lstm network for human activity recognition using smartphone data. In Proceedings of the 2019 8th European Workshop on Visual Information Processing (EUVIP), Roma, Italy, 28–31 October 2019; pp. 175–180. [Google Scholar]
  18. Mekruksavanich, S.; Jitpattanakul, A. Biometric user identification based on human activity recognition using wearable sensors: An experiment using deep learning models. Electronics 2021, 10, 308. [Google Scholar] [CrossRef]
  19. Chen, Z.; Wu, M.; Cui, W.; Liu, C.; Li, X. An attention based CNN-LSTM approach for sleep-wake detection with heterogeneous sensors. IEEE J. Biomed. Health Inform. 2020, 25, 3270–3277. [Google Scholar] [CrossRef]
  20. Challa, S.K.; Kumar, A.; Semwal, V.B. A multibranch CNN-BiLSTM model for human activity recognition using wearable sensor data. Vis. Comput. 2022, 38, 4095–4109. [Google Scholar] [CrossRef]
  21. Ma, H.; Li, W.; Zhang, X.; Gao, S.; Lu, S. AttnSense: Multi-level attention mechanism for multimodal human activity recognition. In Proceedings of the IJCAI, Macao, China, 10–16 August 2019; pp. 3109–3115. [Google Scholar]
  22. Khan, Z.N.; Ahmad, J. Attention induced multi-head convolutional neural network for human activity recognition. Appl. Soft Comput. 2021, 110, 107671. [Google Scholar] [CrossRef]
  23. Zhang, J.; Liu, Y.; Yuan, H. Attention-based Residual BiLSTM Networks for Human Activity Recognition. IEEE Access 2023, 11, 94173–94187. [Google Scholar] [CrossRef]
  24. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  25. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, Lille, France, 7–9 July 2015; pp. 448–456. [Google Scholar]
  26. Jin, X.; Xu, C.; Feng, J.; Wei, Y.; Xiong, J.; Yan, S. Deep learning with s-shaped rectified linear activation units. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30. [Google Scholar]
  27. Zhai, S.; Zhang, W.; Cheng, D.; Bai, X. Text Classification Based on Graph Convolution Neural Network and Attention Mechanism. In Proceedings of the 2022 5th International Conference on Artificial Intelligence and Pattern Recognition, Chengdu, China, 19–21 August 2022; pp. 137–142. [Google Scholar]
  28. Mekruksavanich, S.; Jitpattanakul, A.; Sitthithakerngkiet, K.; Youplao, P.; Yupapin, P. Resnet-se: Channel attention-based deep residual network for complex activity recognition using wrist-worn wearable sensors. IEEE Access 2022, 10, 51142–51154. [Google Scholar] [CrossRef]
  29. Al-Qaness, M.A.; Dahou, A.; Abd Elaziz, M.; Helmi, A. Multi-ResAtt: Multilevel residual network with attention for human activity recognition using wearable sensors. IEEE Trans. Ind. Inform. 2022, 19, 144–152. [Google Scholar] [CrossRef]
  30. Wang, X.; Shang, J. Human activity recognition based on two-channel residual–GRU–ECA module with two types of sensors. Electronics 2023, 12, 1622. [Google Scholar] [CrossRef]
  31. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
  32. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  33. Anguita, D.; Ghio, A.; Oneto, L.; Parra, X.; Reyes-Ortiz, J.L. A public domain dataset for human activity recognition using smartphones. In Proceedings of the Esann, Bruges, Belgium, 24–26 April 2013; Volume 3, p. 3. [Google Scholar]
  34. Sousa Lima, W.; de Souza Bragança, H.L.; Montero Quispe, K.G.; Pereira Souto, E.J. Human activity recognition based on symbolic representation algorithms for inertial sensors. Sensors 2018, 18, 4045. [Google Scholar] [CrossRef]
  35. Shi, W.; Fang, X.; Yang, G.; Huang, J. Human activity recognition based on multichannel convolutional neural network with data augmentation. IEEE Access 2022, 10, 76596–76606. [Google Scholar] [CrossRef]
  36. Coelho, Y.; Rangel, L.; dos Santos, F.; Frizera-Neto, A.; Bastos-Filho, T. Human activity recognition based on convolutional neural network. In XXVI Brazilian Congress on Biomedical Engineering: CBEB 2018, Armação de Buzios, RJ, Brazil, 21–25 October 2018; Springer: Singapore, 2019; Volume 2, pp. 247–252. [Google Scholar]
  37. Zhao, Y.; Yang, R.; Chevalier, G.; Xu, X.; Zhang, Z. Deep residual bidir-LSTM for human activity recognition using wearable sensors. Math. Probl. Eng. 2018, 2018, 7316954. [Google Scholar] [CrossRef]
  38. Mutegeki, R.; Han, D.S. A CNN-LSTM approach to human activity recognition. In Proceedings of the 2020 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), Fukuoka, Japan, 19–21 February 2020; pp. 362–366. [Google Scholar]
  39. Gupta, A.; Semwal, V.B. Multiple task human gait analysis and identification: Ensemble learning approach. In Emotion and Information Processing: A Practical Approach; Springer: Cham, Switzerland, 2020; pp. 185–197. [Google Scholar]
  40. Ordóñez, F.J.; Roggen, D. Deep convolutional and lstm recurrent neural networks for multimodal wearable activity recognition. Sensors 2016, 16, 115. [Google Scholar] [CrossRef] [PubMed]
  41. Xia, K.; Huang, J.; Wang, H. LSTM-CNN architecture for human activity recognition. IEEE Access 2020, 8, 56855–56866. [Google Scholar] [CrossRef]
Figure 1. Network model.
Figure 2. Res-SE module structure.
Figure 3. Squeeze-and-Excite module.
Figure 4. Confusion matrix for UCI-HAR dataset.
Figure 5. Confusion matrix for WISDM dataset.
Table 1. Accuracy of various types of activity recognition on UCI-HAR and WISDM datasets (accuracy/%).

| Dataset | Activity | CNN | LSTM | CNN-LSTM | Res-LSTM | Proposed |
|---|---|---|---|---|---|---|
| UCI-HAR | Walking | 99.51 | 99.35 | 99.65 | 99.33 | 99.60 |
| | WalkingUp | 97.21 | 98.64 | 97.51 | 96.52 | 96.60 |
| | WalkingDown | 89.36 | 90.75 | 97.34 | 90.47 | 98.17 |
| | Sitting | 77.63 | 79.61 | 74.54 | 87.56 | 93.08 |
| | Standing | 78.34 | 77.28 | 91.35 | 93.82 | 95.11 |
| | Laying | 99.64 | 99.57 | 99.50 | 99.27 | 99.81 |
| WISDM | Downstairs | 86.31 | 85.44 | 72.45 | 92.44 | 96.48 |
| | Jogging | 99.10 | 93.92 | 97.47 | 95.66 | 99.35 |
| | Sitting | 88.21 | 90.15 | 95.31 | 89.67 | 98.02 |
| | Standing | 81.67 | 83.36 | 68.64 | 90.31 | 98.79 |
| | Upstairs | 71.46 | 75.28 | 85.49 | 93.18 | 94.12 |
| | Walking | 95.84 | 96.43 | 96.91 | 98.64 | 99.21 |
Table 2. Accuracy and F1-Scores on UCI-HAR and WISDM datasets with different algorithms.

| Dataset | Model | Accuracy/% | F1-Score/% |
|---|---|---|---|
| UCI-HAR | CNN [36] | 91.71 | 91.93 |
| | LSTM [37] | 90.80 | 90.80 |
| | CNN-LSTM [38] | 92.13 | 92.10 |
| | Res-LSTM [39] | 91.60 | 91.50 |
| | Proposed | 96.72 | 96.67 |
| WISDM | CNN [40] | 93.67 | 93.80 |
| | LSTM [35] | 95.85 | 95.53 |
| | LSTM-CNN [41] | 95.85 | 95.53 |
| | Res-LSTM [39] | 96.10 | 95.90 |
| | Proposed | 97.73 | 97.61 |
Table 3. Impact of residual modules (F1-Score/%).

| Model | UCI-HAR | WISDM |
|---|---|---|
| CNN without residual blocks | 94.82 | 95.21 |
| Our model using residual blocks | 96.74 | 97.36 |
Table 4. Impact of the SEM (F1-Score/%).

| Model | UCI-HAR | WISDM |
|---|---|---|
| No SEM added | 95.67 | 96.34 |
| SEM added | 97.56 | 97.92 |
