Article

A Hierarchical Ensemble Deep Learning Activity Recognition Approach with Wearable Sensors Based on Focal Loss

School of Computer and Information Engineering, Chuzhou University, Chuzhou 239000, China
*
Author to whom correspondence should be addressed.
Int. J. Environ. Res. Public Health 2022, 19(18), 11706; https://doi.org/10.3390/ijerph191811706
Submission received: 31 July 2022 / Revised: 5 September 2022 / Accepted: 13 September 2022 / Published: 16 September 2022

Abstract

Abnormal activity in daily life is a relatively common symptom of chronic diseases, such as dementia. There will probably be a variety of repetitive activities in dementia patients’ daily life, such as repeated handling of objects and repeated packing of clothes. It is particularly important to recognize the daily activities of the elderly, which can be further used to predict and monitor chronic diseases. In this paper, we propose a hierarchical ensemble deep learning activity recognition approach with wearable sensors based on focal loss. Seven basic everyday life activities including cooking, keyboarding, reading, brushing teeth, washing one’s face, washing dishes and writing are considered in order to show its performance. Based on hold-out cross-validation results on a dataset collected from elderly volunteers, the average accuracy, precision, recall and F1-score of our approach are 98.69%, 98.05%, 98.01% and 97.99%, respectively, in identifying the activities of daily life for the elderly.

1. Introduction

Elderly people may suffer from the consequences of dementia. Dementia may cause a decrease in the ability to speak, write and perform complex functional tasks, such as preparing a meal.
Most common types of dementia can be identified by a change in daily activities such as sleep disturbances, difficulty walking and an inability to complete tasks. Such changes can provide key information about the memory, mobility and cognition of a person. For instance, an inhabitant suffering from Alzheimer’s may forget his lunch, or go to the toilet frequently. The best markers of cognitive decline may not necessarily be detected based on a person’s activities at any single point in time, but rather by monitoring the trend over time and the variability of change in a duration. Therefore, it is important to recognize and monitor the activities that can better detect the health status of the elderly. In recent years, with the development of microelectronics and low-power wireless technology, the cost of wearable sensor devices has been greatly reduced. In addition, wearable devices have the advantages of small size, low power consumption, easy integration, and high recognition accuracy of human activities. Wearable devices can collect human activity data, which provides the possibility for activity recognition without affecting the comfort of daily activities.
Early research focused on using different machine learning models to recognize users’ activity, such as the HMM [1], naive Bayes classifier [2] and decision tree [3]. However, manual feature selection not only requires a wealth of medical knowledge but also involves trial and error, which consumes much time and effort and tends to lead to low recognition accuracy. Recently, deep learning has been successfully applied in image classification [4] and image description [5]. For example, researchers have implemented a wearable sensor activity recognition system based on deep learning [6], which extracts the hidden features of sensor data automatically, captures complex activity details and improves the accuracy and robustness of activity recognition.
However, the number of samples per category in human activity-sensing data is often unbalanced: some categories have more samples than others. For example, typing or writing activities have more samples than washing dishes. In this case, the trained model will be biased towards the category that has more data. Thus, the minority categories will be misclassified, or even treated as noise. In other words, because the categories seen in each epoch are unbalanced, the model becomes more and more accurate in classifying the samples of the majority category, while its recognition of the minority category gets worse. Accuracy alone therefore cannot be used as the key indicator for evaluating the model.
In addition, human beings’ daily activities are complex. On the one hand, the distribution of activity data in the same category is different because of a person’s different exercise habits at different times. On the other hand, the sensor has various heterogeneities that make the sensitive information of human activities unable to keep synchronized after the fusion of multiple sensor data. Furthermore, a person’s different categories of activities have similarities. To sum up, the traditional single model cannot guarantee accurate recognition performance.
To address these problems, this paper designs and proposes a hierarchical ensemble deep learning activity recognition scheme. In this scheme, patients wear wearable sensors on both wrists, and a variety of human daily activity data are collected by the sensors. Then, after data preprocessing and analysis, a hierarchical ensemble deep learning activity recognition model based on focal loss is designed to handle the imbalance of dataset categories, and the trained model is tested. The contributions of this paper can be summarized in the following aspects:
(1) This paper analyzes the sensitivity of a wearable inertial sensor on the wrist to human activity. For the same sensor, the data generated by different activities are quite different, and for different sensors, the data generated by the same action are relatively different.
(2) In view of the complexity and imbalance of human daily activity data, after preprocessing the data, this paper proposes a deep hierarchical ensemble learning model based on focal loss, and designs an elderly daily activity recognition system based on wearable sensors.
(3) This paper employs real experimental data to evaluate the performance of the proposed method and compares it with some state-of-the-art methods in the literature. Furthermore, this paper evaluates the impact of some key hyperparameters using experimental data.
This paper is divided into five sections. Section 1 is the introduction. Section 2 includes the previous studies that have been carried out so far. The proposed scheme is examined in Section 3. The experimental results and analysis are described in Section 4, and the conclusion is discussed in Section 5.

2. Related Work

With the development of a wireless sensor network and the gradual popularization of wearable sensors, it is worthwhile to build activity recognition systems based on wearable wireless sensors. Activity recognition systems have been widely used and scientifically studied by many scholars and institutions. Researchers attach sensors to the human body’s key parts, and use acceleration sensors to measure the acceleration data of each part continuously. After that, these data are sent to a base station through the Bluetooth wireless network. Usually, the base station is a sensor that is connected to a computer or mobile phone. Therefore, these sensor data provide effective support for in-depth research on activity recognition.
In the early days, different machine learning methods were mainly used to identify wearable-based human activities. The common methods include: KNN, HMM, SVM, RF, XGBoost, etc. For example, Lee and Cho [7] used a hierarchical hidden Markov model to identify five types of activities, including standing, walking, running, going upstairs and going downstairs. Data for these activities were acquired via a three-axis accelerometer on a smartphone. Kwapisz et al. [8] placed smartphones in the front pockets of users’ pants and collected 29 users’ accelerometer daily activities data, including walking, jogging, stair climbing, sitting and standing. They used these data to extract 6 different features and used 4 classifiers for identification. The recognition rate reached more than 90%. Sun et al. [9] proposed a sports activity recognition scheme based on SVM, which placed smartphones in 6 different pockets, collected 7 sports activity data, and trained an SVM activity recognition classifier. The total F-score reached 94.8% given the pocket position.
With these traditional methods, activity recognition requires a large amount of domain knowledge, and features are extracted by trial and error, which represents a major expenditure of time and effort. In recent years, with the development and application of deep learning technology [10], there has been a lot of related work in the field of activity recognition.
Jiang et al. [11] constructed an activity feature map through the signal sequence of the accelerometer and the gyroscope. Then they used the deep CNN network to learn the optimal features of multiple dimensions automatically, and achieved a better recognition effect. Ronao et al. [12] used time-series sensor data to predict activities, confirming the effectiveness of 2D-CNN for activity recognition. Ravi et al. [13] collected activity data with low-power wearable devices. They processed the time series through short-time Fourier transform (STFT) spectrograms, then designed a deep learning-based human activity recognition architecture, and finally achieved accurate real-time classification. Amroun et al. [14] collected four types of activity data, including standing, sitting, lying down and walking, to extract the best feature descriptors of activities, and identified human activities through the CNN model, with a recognition accuracy rate of over 98%. Reference [15] designed a LSTM network, then performed experimental evaluation on three standard benchmark (Opportunity, PAMAP2, Skoda) datasets, and finally achieved better recognition results. The above systems all used a single model for activity recognition. However, existing studies have shown that the integrated model has better performance [16].
To learn hierarchical features, Ref. [17] adopted RBMs, using multi-layer RBMs to capture local and multimodal features for human action recognition. Ordóñez et al. [18] used wearable sensors to build convolutional and recurrent network architectures to extract behavioral features automatically and improved system performance. Chen et al. [19] designed an integrated ELM algorithm based on smartphone sensors. The algorithm identified human activities such as walking, going upstairs, going downstairs, sitting, standing and walking, and the recognition accuracy reached 97.35%. Reference [20] proposed a lightweight and efficient integrated incremental learning activity recognition system based on the heterogeneous activity recognition datasets of multiple users and sensing devices. After model testing, the results showed a 35% improvement in accuracy.
To address the problem of unbalanced data categories, there are mainly two kinds of methods. The first is the data-level method, which operates on the training set and changes its class distribution. For example, reference [21] simply replicated randomly selected samples from the minority class to solve the problem of data class imbalance, and reference [22] adopted a clustering-based oversampling method that first clustered the dataset and then oversampled each cluster. The second is the classifier (algorithmic)-level method, which keeps the training dataset distribution and adjusts the training or inference algorithm. For example, to keep the sample classes balanced, OHEM [23] selected more minority class samples in each mini-batch iteration. Reference [24] reduced the weight of the negative samples of the minority class in the training process by weighting the instances, focusing on the hard-to-classify and misclassified samples.
The comparison of related work is shown in Table 1. We can find that most of the related work mainly uses the data collected by the sensors in the smartphone for activity recognition. In contrast to existing work, we mainly focus on wearable sensor-based activity recognition at home.
Aiming at the complexity of human daily activities and the imbalance of data categories, this paper designs a human activity recognition architecture based on hierarchical ensemble learning that applies the focal loss algorithm to the system and improves the recognition effectively.

3. System Framework

In order to identify the daily activities of the elderly, in this paper, wearable sensors were worn on both wrists of the volunteers to collect raw data, and for the class imbalanced dataset, an activity recognition network for the elderly based on hierarchical ensemble deep learning architecture was designed. The specific module design is shown in Figure 1.
In the activity recognition system, when the class samples are not balanced the trained model will be biased towards the class with more instances, resulting in the misclassification of the minority class samples. In this paper, the focal loss algorithm is applied to the activity recognition system, which can reduce the impact of sample imbalance.

3.1. Formal Description of Data

Before the dataset is inputted into the training model, the training data need to be reconstructed into the data format required by the time series prediction model, in the same way that, for example, the size of an image input is fixed to h × w × c, where h, w, and c are the height, width and number of channels of the image, respectively. In this section we describe in detail the pipeline for data preprocessing and the method for signal representation.
As shown in Figure 1, the IMU signals from the sensors at different body positions are synchronized by timestamp, and the signal sequence is then sampled using a sliding time window with a width of $T$ timestamps and a step size of $\Delta t$ between consecutive windows. After sampling, the dataset is represented as $D = \{(D_1, y_1), \ldots, (D_n, y_n), \ldots, (D_N, y_N)\}$, and the $n$th data item is represented as $D_n = \{d_n^1, d_n^2, \ldots, d_n^s, \ldots, d_n^S\}$, $n \in \{1, \ldots, N\}$, where $S$ is the total number of IMU sensors at different body positions, $d_n^s$ represents the sample set of discrete time-series IMU signals from the $s$th sensor, and $y_n$ is the activity class label. More specifically, $d_n^s = \{d_{n,1}^s, d_{n,2}^s, \ldots, d_{n,t}^s, \ldots, d_{n,T}^s\}$ is a discrete-time data sequence over $T$ timestamps, and each element can be expressed as
$$d_{n,t}^s = \big(\underbrace{a_{n,t}^x, a_{n,t}^y, a_{n,t}^z}_{a_{n,t}:\ \text{acceleration}},\ \underbrace{w_{n,t}^x, w_{n,t}^y, w_{n,t}^z}_{w_{n,t}:\ \text{angular velocity}},\ \underbrace{ag_{n,t}^x, ag_{n,t}^y, ag_{n,t}^z}_{ag_{n,t}:\ \text{angle}}\big),$$
where $a$, $w$, and $ag$ represent the sensor readings of acceleration, angular velocity, and angle, respectively.
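The windowing step described above can be illustrated with a minimal sketch. The function name `sliding_windows` and the toy stream are hypothetical and are not taken from the authors' implementation; the sketch also assumes that a trailing partial window is simply dropped, which the text does not specify.

```python
def sliding_windows(signal, width, step):
    """Split a list of per-timestamp readings into overlapping windows.

    signal: list of readings, one entry per timestamp
    width:  window length T, in timestamps
    step:   offset between the starts of consecutive windows
    """
    if width <= 0 or step <= 0:
        raise ValueError("width and step must be positive")
    windows = []
    # Assumption: only full windows are kept; a trailing partial window is dropped.
    for start in range(0, len(signal) - width + 1, step):
        windows.append(signal[start:start + width])
    return windows


# Toy stream: 10 timestamps, each holding 9 values
# (3 acceleration, 3 angular velocity, 3 angle readings).
stream = [[float(t)] * 9 for t in range(10)]
windows = sliding_windows(stream, width=4, step=2)
print(len(windows), len(windows[0]), len(windows[0][0]))
```

With a step equal to half the window width, consecutive windows overlap by 50%, so each reading (apart from those near the ends of the stream) appears in two windows.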

3.2. Wavelet Transform

The human activity signals collected by wearable sensors are nonlinear and non-stationary, so the wavelet decomposition method [25] is well suited to analyzing them. To better represent the inertial signal and capture both time and frequency information, we decompose the original signal into high-frequency and low-frequency components and obtain the frequency information of each decomposition layer.
Let the input signal be $x$. At scale $j$, the wavelet coefficients $\langle x, \psi_{j,k} \rangle$ and the scaling coefficients $\langle x, \phi_{j,k} \rangle$, where $k = 0, 1, \ldots, N_j - 1$, are obtained by decomposition, that is, by convolving the input signal with the given filters $h$ and $g$ at the same time:
$$h(t) = 2^{j/2}\, \psi(2^{j} t)$$
$$g(t) = 2^{j/2}\, \phi(2^{j} t)$$
Here, $\psi(\cdot)$ represents the wavelet function and $\phi(\cdot)$ represents the scaling function. A smooth output is obtained by discarding the high-frequency components (details) and preserving the low-frequency components.
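As an illustration of the decompose-then-smooth idea, the following sketch performs a single-level Haar transform in plain Python. The choice of the Haar wavelet and of one decomposition level is an assumption made here for brevity; the text does not state which wavelet family or how many levels were used, and the function names are hypothetical.

```python
import math


def haar_dwt(x):
    """One level of the Haar discrete wavelet transform.

    Returns (approx, detail): low-frequency and high-frequency coefficients,
    each half the length of x. The length of x must be even.
    """
    if len(x) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Invert haar_dwt and return the reconstructed signal."""
    s = math.sqrt(2.0)
    x = []
    for a, d in zip(approx, detail):
        x.append((a + d) / s)
        x.append((a - d) / s)
    return x


signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
approx, detail = haar_dwt(signal)
# Smoothing: discard the details (set them to zero), then reconstruct.
smooth = haar_idwt(approx, [0.0] * len(detail))
print(smooth)
```

With the Haar wavelet, discarding the details replaces each pair of neighbouring samples by their mean, which makes the smoothing effect easy to inspect by hand.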

3.3. Hierarchical Ensemble Deep Learning Architecture

In order to extract the deep features of activities, we propose a novel hierarchical ensemble of neural networks. The architecture firstly extracts the features of each sensor data, considering that comprehensive analysis of correlations across each sensor data is essential for learning sensitivity features of activities. Hence, we extract features and learn the correlations across each sensor data through the fusion layer.

3.3.1. Single-Channel Sensor Signal Feature Extraction

By combining the wavelet transform with the LSTM network to extract the features of each sensing activity window, the temporal characteristics of each channel are acquired. Then, a 1D convolutional neural network (CNN) is used to extract local spatial features, as shown in Figure 2.

LSTM Layer

The cell state of an LSTM can only be changed by specific gates. A typical LSTM contains a forget gate, an input gate, and an output gate, which are represented by $f_t$, $i_t$, and $o_t$, respectively. The cell state, input and output are vectors represented by $c_t$, $x_t$ and $h_t$, respectively.
$$f_t = \sigma(W_f h_{t-1} + U_f x_t + b_f)$$
$$i_t = \sigma(W_i h_{t-1} + U_i x_t + b_i)$$
$$a_t = \tanh(W_c h_{t-1} + U_c x_t + b_c)$$
$$c_t = f_t \circ c_{t-1} + i_t \circ a_t$$
$$o_t = \sigma(W_o h_{t-1} + U_o x_t + b_o)$$
$$h_t = o_t \circ \tanh(c_t)$$
The forget gate determines whether to delete the contents of the cell state. The input gate decides what information will be stored in the memory cell. Together, the forget gate and input gate determine the contents of the new cell state. The input of the output gate is determined by the previous output vector $h_{t-1}$ and the current input vector $x_t$. Here, $a_t$ represents the information to be input to the memory, $\circ$ denotes the element-wise product, $W_f, U_f, W_i, U_i, W_c, U_c, W_o, U_o$ represent the weights, $b_f, b_i, b_c, b_o$ represent the offsets, and $\sigma(x) = (1 + e^{-x})^{-1}$.
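To make the gate equations concrete, the sketch below runs one LSTM time step with scalar input, output and cell state. Using scalars instead of vectors and matrices, the parameter dictionary layout, and the parameter value 0.1 are all simplifications assumed here for illustration; they are not the authors' settings.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step for scalar input, output and cell state.

    p is a dict of scalar parameters W_g, U_g, b_g for g in {f, i, c, o}.
    """
    f_t = sigmoid(p["W_f"] * h_prev + p["U_f"] * x_t + p["b_f"])    # forget gate
    i_t = sigmoid(p["W_i"] * h_prev + p["U_i"] * x_t + p["b_i"])    # input gate
    a_t = math.tanh(p["W_c"] * h_prev + p["U_c"] * x_t + p["b_c"])  # candidate memory
    c_t = f_t * c_prev + i_t * a_t                                   # new cell state
    o_t = sigmoid(p["W_o"] * h_prev + p["U_o"] * x_t + p["b_o"])    # output gate
    h_t = o_t * math.tanh(c_t)                                       # new output
    return h_t, c_t


# Hypothetical parameters: every weight and offset set to 0.1.
params = {f"{k}_{g}": 0.1 for k in ("W", "U", "b") for g in "fico"}

h, c = 0.0, 0.0
for x_t in [0.1, -0.2, 0.3]:  # a short toy sequence
    h, c = lstm_step(x_t, h, c, params)
print(h, c)
```

Because the output is a sigmoid gate multiplied by a tanh, its magnitude always stays below 1, whatever the cell state is.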

1D-CNN Layer

With a one-dimensional sensor signal, a 1D kernel is used in a temporal convolution. A kernel can be viewed as a filter or a feature detector in the 1D domain. The method of extracting the feature map by using the one-dimensional convolution operation is as follows:
$$x_j^{l+1}(\tau) = \sigma\Big(b_j^l + \sum_{f=1}^{F^l} K_{jf}^l(\tau) * x_f^l(\tau)\Big) = \sigma\Big(b_j^l + \sum_{f=1}^{F^l} \sum_{p=1}^{P^l} K_{jf}^l(p)\, x_f^l(\tau - p)\Big)$$
where $x_j^l$ represents the $j$th feature map of layer $l$, $\sigma$ is the nonlinear activation function, $F^l$ represents the number of feature maps in layer $l$, $K_{jf}^l$ is the kernel convolved over feature map $f$ in layer $l$ to create feature map $j$ in layer $l+1$, $P^l$ represents the length of the convolution kernel in layer $l$, and $b^l$ is the offset vector.
In the process of model training, in order to reduce the internal covariate shift, a batch normalization layer is placed after each activation layer [26]. For the one-dimensional signal of the $k$th sensing channel, we obtain the output $x^k$ through the 1D-CNN. In mini-batch processing there are $\gamma$ activation values, which can be represented as $B = \{x^k_{1 \ldots \gamma}\}$. The output of the batch normalization layer is thus defined by:
$$\hat{x}^k = \mathrm{BN}(x^k_{1 \ldots \gamma})$$
where $\hat{x}^k$ represents the output of the batch normalization layer that follows the 1D-CNN layer in the $k$th sensing channel. We apply a max-pooling layer of size 2 to the data flow output by the batch normalization layer. With $x_j^l$, the $j$th feature map of the $l$th layer, as the input to the pooling layer,
$$x_j^{l+1} = \mathrm{MaxPooling}(x_j^l)$$
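The temporal convolution and the size-2 max pooling can be sketched in plain Python for a single feature map and a single kernel. ReLU as the nonlinearity, "valid" borders, and zero-based indexing of the kernel are assumptions made here, and batch normalization is left out for brevity; none of these details are fixed by the text.

```python
def conv1d(x, kernel, bias=0.0):
    """Valid 1D convolution of one feature map with one kernel, then ReLU.

    Each output is bias + sum over p of kernel[p] * x[tau - p]; p is
    zero-based here, whereas the formula in the text counts from 1.
    """
    length = len(kernel)
    out = []
    for tau in range(length - 1, len(x)):
        s = bias + sum(kernel[p] * x[tau - p] for p in range(length))
        out.append(max(0.0, s))  # ReLU is an assumed choice of sigma
    return out


def max_pool1d(x, size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]


signal = [0.0, 1.0, 3.0, 2.0, 5.0, 4.0]
# The kernel [1, -1] computes x[tau] - x[tau - 1], i.e. the local increase.
features = conv1d(signal, kernel=[1.0, -1.0])
pooled = max_pool1d(features, size=2)
print(features, pooled)
```

The difference kernel responds only where the signal rises, and pooling then keeps the strongest response in each pair of positions, halving the temporal resolution.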

Fusion Layer

In order to extract the correlation features between the sensor channels, the output vectors of all channels are combined in the fusion layer, as shown in the following formula, where $C_i$ represents the splice result of the $i$th sensor's vectors:
$$C_i = x_1^i \odot x_2^i \odot \cdots \odot x_n^i$$
where $x_k^i$ represents the output of the $k$th channel of the $i$th sensor after it flows through the 1D-CNN layer, and $\odot$ represents the splicing of vectors.

3.3.2. Feature Fusion Extraction of Multi-Sensor Signals

After the fusion of the feature data stream extracted from each sensor, the fusion features of each sensor data are firstly extracted through the 2D-CNN network. Then, the feature data extracted from multiple sensors are further fused, and the relevant features of each sensor are extracted again through the 2D-CNN network, as shown in Figure 3.

2D-CNN Layer

For the fusion data of each sensor, the one-dimensional time data stream is first converted to a two-dimensional time data stream, and a 2D convolution kernel is used for convolution in the two-dimensional space. Multiple convolution kernels are set between the convolution layers, and multiple feature mappings are learned from the feature maps of the previous layer. Let $C_j^l$ represent the $j$th feature map of layer $l$; then
$$C_j^{l+1}(\tau) = \sigma\Big(b_j^l + \sum_{f=1}^{F^l} K_{jf}^l(\tau) * C_f^l(\tau)\Big) = \sigma\Big(b_j^l + \sum_{f=1}^{F^l} \sum_{m \in S^l} K_{jf}^l(m)\, C_f^l(\tau - m)\Big)$$
where $\sigma$ is the nonlinear activation function, $F^l$ represents the number of feature maps in layer $l$, $K_{jf}^l$ is the convolution kernel applied to the $f$th feature map in layer $l$ to create the $j$th feature map in layer $l+1$, $S^l$ represents the feature map set in layer $l$, and $b^l$ is the offset vector.
In the process of model training, in order to reduce the internal covariate shift, a batch normalization layer is set behind each activation layer [26]. We selected the continuous range of feature mapping as the pooling area, and set the max-pooling layer with a size of 2 × 2.

Fusion Layer

The human activity data of each sensor are correlated. In order to obtain the correlation features among the sensor activity data and extract the sensitivity fusion features of human activity, the fusion layer was used to fuse the sensor features.
$$C = C_1 \odot C_2 \odot \cdots \odot C_N$$
where $C_i$ represents the output of the $i$th sensor's fusion features through the 2D-CNN layer, $C$ represents the matrix obtained after fusing the features of all sensors, and $\odot$ represents the splicing of matrices.

3.4. Loss Function

In the process of model training, the ultimate goal of training is to minimize the difference between the predicted labels and the actual labels. In general, cross-entropy loss is used to measure the correlation between labels, as shown in the following formula. The point with the minimum loss is the point with the maximum correlation between the predicted labels and the real labels.
$$\mathrm{loss} = -\sum_{i=1}^{M} y_i \log(p_i)$$
where $M$ represents the number of categories and $y$ is the one-hot label vector: $y_i$ is 1 if class $i$ is the real label of the sample, and 0 otherwise. The output $p$ of the model is a vector of length $M$, and $p_i$ represents the predicted probability that the sample belongs to class $i$.
When the samples of classes are unbalanced, the trained model will be biased towards the classes with more samples, leading to the wrong classification of a few sample classes. Therefore, in this paper, focal loss is used as a loss function to reduce the impact of sample imbalance in the hierarchical ensemble deep learning activity recognition model.
$$\mathrm{loss} = -\sum_{i=1}^{M} y_i\, \alpha_i\, (1 - p_i)^{\gamma_i} \log(p_i)$$
where the hyper-parameter $\alpha_i$ represents the balance factor of class $i$, and the hyper-parameter $\gamma_i$ represents the adjustable focusing parameter, which controls how strongly the weight of easily classified samples is reduced. The greater $\gamma_i$ is, the greater the reduction in weight will be.
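The effect of the focusing parameter can be checked numerically with a small sketch that evaluates both losses for an easy and a hard sample. The function names and the example probabilities are hypothetical; the values α = 0.25 and γ = 2 follow the settings reported in the experiments, applied here uniformly to all classes.

```python
import math


def cross_entropy(y_true, probs, eps=1e-12):
    """Cross-entropy loss for one sample with a one-hot label."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_true, probs))


def focal_loss(y_true, probs, alpha, gamma, eps=1e-12):
    """Focal loss for one sample; alpha and gamma are per-class lists."""
    loss = 0.0
    for y, p, a, g in zip(y_true, probs, alpha, gamma):
        if y:  # only the real class contributes, since y is one-hot
            loss -= a * (1.0 - p) ** g * math.log(max(p, eps))
    return loss


y = [0, 1, 0]                # real class is index 1
easy = [0.05, 0.90, 0.05]    # confidently correct prediction
hard = [0.50, 0.30, 0.20]    # real class receives low probability
alpha, gamma = [0.25] * 3, [2.0] * 3

for name, probs in (("easy", easy), ("hard", hard)):
    ce = cross_entropy(y, probs)
    fl = focal_loss(y, probs, alpha, gamma)
    print(name, round(ce, 4), round(fl, 6), round(fl / ce, 4))
```

The ratio of focal loss to cross-entropy is α(1 − p)^γ, so the easy sample is down-weighted far more strongly than the hard one, which is the mechanism that shifts training effort towards hard and minority-class samples.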
In the process of activity recognition, we collect data from multiple wearable sensors and propose a novel hierarchical ensemble of neural networks, which applies the focal loss algorithm to the activity recognition system for sample imbalance scenarios. The model can reduce the influence of sample imbalance. The training process of the hierarchical ensemble of neural networks model is described in Algorithm 1.
Algorithm 1 Hierarchical ensemble deep learning model based on focal loss
 input: raw wearable sensor data
 output: activities
 1: encode the raw data as a numeric vector;
 2: wavelet transform;
 3: normalize the numeric vector;
 4: /* Model training */
 5: while the loss has not converged do
      forward propagation;
      use Softmax to get predicted labels;
      calculate the focal loss;
      backpropagation;
      update all parameters by gradient descent;
    end
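The training loop of Algorithm 1 can be illustrated on a toy problem. The sketch below swaps the hierarchical network for a one-feature linear softmax classifier so that the gradient of the focal loss can be written out by hand; the data, learning rate and stopping tolerance are hypothetical, and full-batch gradient descent is used instead of the Adam optimizer reported in the experiments.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def focal_loss_and_grad(z, t, alpha=0.25, gamma=2.0):
    """Focal loss for real class t and its gradient w.r.t. the logits z."""
    p = softmax(z)
    pt = p[t]
    loss = -alpha * (1.0 - pt) ** gamma * math.log(pt)
    # d(loss)/d(pt), then the chain rule through softmax:
    # d(pt)/d(z_j) = pt * (1[j == t] - p_j)
    dloss_dpt = alpha * (gamma * (1.0 - pt) ** (gamma - 1.0) * math.log(pt)
                         - (1.0 - pt) ** gamma / pt)
    grad = [dloss_dpt * pt * ((1.0 if j == t else 0.0) - p[j])
            for j in range(len(z))]
    return loss, grad


def train(xs, ts, n_classes, lr=0.5, max_epochs=500, tol=1e-6):
    w, b = [0.0] * n_classes, [0.0] * n_classes
    history = []
    for _ in range(max_epochs):
        total, gw, gb = 0.0, [0.0] * n_classes, [0.0] * n_classes
        for x, t in zip(xs, ts):
            z = [w[j] * x + b[j] for j in range(n_classes)]  # forward propagation
            loss, gz = focal_loss_and_grad(z, t)             # softmax + focal loss
            total += loss
            for j in range(n_classes):                       # backpropagation
                gw[j] += gz[j] * x
                gb[j] += gz[j]
        history.append(total / len(xs))
        if len(history) > 1 and abs(history[-2] - history[-1]) < tol:
            break                                            # loss has converged
        for j in range(n_classes):                           # gradient descent update
            w[j] -= lr * gw[j] / len(xs)
            b[j] -= lr * gb[j] / len(xs)
    return w, b, history


xs, ts = [-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1]
w, b, history = train(xs, ts, n_classes=2)
print(history[0], history[-1])
```

Each pass through the outer loop corresponds to one iteration of the while loop in Algorithm 1; the preprocessing steps (encoding, wavelet transform, normalization) are not part of this sketch.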

4. Experiment

In this section, the experimental settings, data collection and analysis of experimental results are introduced. The TensorFlow and Keras Python deep learning libraries were mainly used to implement the algorithm. The specific settings and results are as follows.

4.1. Experimental Settings

4.1.1. The Overview of Experiments

In order to verify the effectiveness of the proposed approach HAR-FL, we designed two types of experiments. One experiment used a dataset that we recruited elderly volunteers to collect, and the other experiment used a public dataset [27].
We took HAR-CE as the benchmark in the following experiments. There is only one difference between HAR-FL and HAR-CE, and that is the adopted loss function, i.e., the loss function adopted by our proposed method HAR-FL was focal loss and the loss function of HAR-CE was cross-entropy loss.

4.1.2. Neural Networks Models

For the two approaches described in Section 4.1.1, we used the same structure of the deep ensemble neural network model. Specifically, we set the learning rate to $2 \times 10^{-4}$ and batch_size to 128, and used the Adam optimizer. The layers and parameters are shown in Table 2 and Table 3.
Table 2 shows the layers and parameter settings of the first part of the deep ensemble neural network model, which is named Layer1. In the experiment, the data of each sensor were divided into nine channels, and each channel was given one LSTM layer, one 1D-CNN layer, a batch normalization layer (BN layer) and a maximum pooling layer. The nine channels were then inputted into the fusion layer, where all channel features of each sensor were fused.
In this paper, we collected data from two sensors. The fused features of each sensor were inputted into a 2D-CNN layer, a BN layer and a maximum pooling layer. A fusion layer then combined the features of all sensors, and the result was inputted into a further 2D-CNN layer, BN layer and maximum pooling layer to extract the sensor-fusion features. The layers and parameter settings are shown in Table 3.
Table 4 shows the layers and parameter settings of the regression layer, the third part of the model, which contained five dense layers and a dropout layer. The output was obtained from the Softmax layer (a dense layer with the Softmax activation function).

4.2. Data Collection and Processing

As discussed in Section 1, our goal in this paper is to identify the daily activities of the elderly to support the monitoring of their health. In order to verify the performance of the activity recognition approach, we recruited ten elderly people in the community and equipped them with our wearable sensors to collect daily activity data. In the future, we will continue to recruit more elderly people for data collection to further expand our dataset. The elderly volunteers wore attitude sensors (model number: BWT61CL) on both wrists, and we collected the raw sensor data of seven different actions: cooking, keyboarding, reading a book, brushing their teeth, washing their face, washing the dishes and writing.
Actually, there are many sensors available for recognizing human activities, such as the attitude sensor, triaxial accelerometer and gyroscope sensor. The differences among them are listed in Table 5.
As shown in Table 5, with one attitude sensor (BWT61CL), we can collect the data of the 3-axis acceleration, 3-axis angular velocity (gyroscope) and 3-axis angle sensors simultaneously. Therefore, we selected the attitude sensor for our work.
Specifically, the model number of the attitude sensor used in our work is BWT61CL and the manufacturer is WitMotion Shenzhen Co., Ltd. (Shenzhen, China). According to the product introduction from the manufacturer’s official website (https://wit-motion.cn, accessed on 31 July 2022), the accuracy of the attitude sensor is guaranteed by the sensor manufacturer’s research and development facilities, e.g., all finished items were calibrated through the world’s top-level triaxial nonmagnetic turntable, ensuring the X Y Z angle’s accuracy. Each sensor of the original data contains nine dimensions (acceleration: 3D, angular velocity: 3D, angle: 3D). When volunteers perform some activities, the sensor on the wrist can receive data in real-time through the host computer. The data collection time for each action is about 7 min.
The specific data collection and processing process was as follows:
Prior to inputting the training datasets into the model, we segmented the data stream of each sensor into windows of equal size, using a window size of 50 and a sliding-window step size of 25. Each sample was a matrix of size 50 (about 5 s) × 2 (motion sensors) × 9 (nine-axis sensor data). The data, reconstructed into the time-series format required by the model, were used as the input of the hierarchical ensemble neural network model.
Table 6 shows the class imbalance settings, where S1 represents the benchmark case with uniformly distributed classes and S2–S5 denote four cases in which the number of samples in two classes is 200 and the number of samples in each of the remaining classes is 708.
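An imbalanced scenario of this kind can be derived from a balanced dataset by downsampling the chosen classes, as in the following sketch. The helper `make_imbalanced` is hypothetical, the sample ids are placeholders, and the two classes reduced here ("cooking" and "writing") are picked arbitrarily for illustration; the actual minority classes of S2–S5 are those listed in Table 6.

```python
import random


def make_imbalanced(samples_by_class, minority_classes, minority_size, seed=0):
    """Randomly downsample the chosen classes to minority_size samples each."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    out = {}
    for label, samples in samples_by_class.items():
        if label in minority_classes:
            out[label] = rng.sample(samples, minority_size)
        else:
            out[label] = list(samples)
    return out


activities = ["cooking", "keyboarding", "reading", "brushing teeth",
              "washing face", "washing dishes", "writing"]
# Balanced benchmark (S1): 708 placeholder sample ids per class.
balanced = {a: list(range(708)) for a in activities}
# Hypothetical imbalanced case: two classes reduced to 200 samples.
scenario = make_imbalanced(balanced, {"cooking", "writing"}, 200)
counts = {a: len(s) for a, s in scenario.items()}
print(counts)
```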

4.3. Analysis of Experimental Results

4.3.1. Analysis of Experimental Results with Private Dataset

Figure 4 and Figure 5, respectively, show the curves of the accuracy of training and validation sets with epochs under different types of imbalances based on different loss functions. The adjustable focusing parameter γ is 2, and the balance weight α is 0.25. It can be seen from the curves of the two figures that, with the epochs getting larger, the overall trend of train-accuracy and val accuracy tends to 1. The convergence rate of the training set accuracy of HAR-CE is faster than that of HAR-FL. Except for S1, the convergence rate of the accuracy of HAR-CE with validation set is faster than that of HAR-FL.
Table 7 shows the results of the precision, recall and F1-score of HAR-CE and HAR-FL under different classes of imbalance. It can be seen from the results in the table that HAR-FL is significantly better than HAR-CE, and the experimental results under balanced data classes are better than those under unbalanced data classes.
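To see why precision, recall and F1-score are reported rather than accuracy alone, the sketch below computes them for a classifier that always predicts the majority class. Macro averaging over classes and the toy labels are assumptions made here; the text does not state which averaging scheme was used.

```python
def macro_metrics(y_true, y_pred):
    """Return macro-averaged (precision, recall, F1) over all classes."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0  # 0 when the class is never predicted
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy imbalanced test set: 8 majority samples (0) and 2 minority samples (1).
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10  # a model that ignores the minority class
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = macro_metrics(y_true, y_pred)
print(accuracy, precision, recall, f1)
```

The accuracy of this degenerate classifier is 0.8, yet its macro-averaged recall is only 0.5 because the minority class is never recognized, which is exactly the failure mode that per-class metrics expose.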
Figure 6, Figure 7 and Figure 8 show the histogram of the results of various metrics of HAR-CE and HAR-FL. It can be seen from the figure that the results of various metrics of HAR-FL of most classes are better than those of HAR-CE except for some classes. In the case of balanced S1, the difference between HAR-CE and HAR-FL is small. In the case of unbalanced S2, HAR-FL is significantly better than HAR-CE.
By adjusting the γ value in the focal loss function, the weights of easily classified samples and hard-to-classify samples in the loss function can be dynamically adjusted, so that the model focuses more on hard-to-classify samples. In the case of S2, the balance weight α is set to 0.25, the learning rate is $2 \times 10^{-4}$, batch_size is set to 128, and the Adam optimizer is used. Figure 9 and Figure 10 show the accuracy curves of the training set and validation set over epochs for different values of the parameter γ. It can be seen from the figures that the accuracy of both the training set and the validation set tends to 1 as the epoch increases, and the performance when γ is 2 is better than that for the other values of γ.
Precision, recall and the F1-score were used to evaluate the model performance under different values of γ. The experimental results are shown in Figure 11. It can be seen from the figure that when the parameter γ is 2, the model performance of each metric is significantly better than that when γ is other values.
In summary, when the datasets are class balanced, there is little difference between the performance of each metric of HAR-CE and HAR-FL. In the case of class imbalance, the performance of HAR-FL is significantly better than that of HAR-CE. When the balance weight α is 0.25 and γ is 2, the performance of HAR-FL is the best.

4.3.2. Analysis of Experimental Results with Public Dataset

To evaluate the performance of our proposed approach more comprehensively, the heterogeneity dataset (DH) [27] was used to verify the model. The DH dataset consists of data collected by eight smartphones from nine users performing six daily activities (‘Biking’, ‘Sitting’, ‘Standing’, ‘Walking’, ‘StairsUp’, ‘StairsDown’). The original data has six dimensions (accelerometer: 3D, gyroscope: 3D). To ensure consistency, each activity was recorded for 5 min. The dataset attributes are shown in Table 8.
In this paper, we used the DH data collected by users “b” and “e” carrying four mobile phones: “NexUS4_1”, “NexUS4_2”, “S3mini_1” and “S3mini_2”. Table 9 shows the class distributions. S1 is the benchmark distribution, in which the classes are approximately uniformly distributed. In S2, the number of data windows for the activity “upstairs” is reduced to 2560. In S3, the number of data windows for the activity “riding a bike” is reduced to 2559.
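One way to build an imbalanced distribution such as S2 from the benchmark S1 is to randomly downsample the windows of a single class, as sketched below. How the windows to drop were actually chosen is not described here, so random selection is an assumption. The helper `downsample_class` and the placeholder window array are hypothetical; only the per-class counts come from Table 9.

```python
import numpy as np

def downsample_class(windows, labels, target_class, keep, seed=0):
    """Keep only `keep` randomly chosen windows of `target_class`."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    idx_target = np.flatnonzero(labels == target_class)
    idx_other = np.flatnonzero(labels != target_class)
    kept = rng.choice(idx_target, size=min(keep, idx_target.size), replace=False)
    idx = np.sort(np.concatenate([idx_other, kept]))  # preserve original order
    return windows[idx], labels[idx]

# S1 window counts per class, taken from Table 9.
names = ["stand", "sit", "walk", "stairsup", "stairsdown", "bike"]
s1_counts = [7932, 8089, 10225, 7519, 6607, 9580]
labels_s1 = np.repeat(np.arange(len(names)), s1_counts)
# Placeholder windows: one dummy feature per window instead of real sensor data.
windows_s1 = np.arange(labels_s1.size).reshape(-1, 1)

# S2-like distribution: reduce "stairsup" to 2560 windows.
windows_s2, labels_s2 = downsample_class(
    windows_s1, labels_s1, names.index("stairsup"), keep=2560
)
s2_counts = np.bincount(labels_s2, minlength=len(names))
print(dict(zip(names, s2_counts.tolist())))
```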
Figure 12 and Figure 13 show the accuracy on the training set and validation set per epoch under the different class distributions of dataset DH. The adjustable focusing parameter γ is 2, and the balance weight α is 0.25. In both figures, the training accuracy and validation accuracy converge to a stable value as the number of epochs increases. During epochs 0–5 the accuracy rises quickly; during epochs 5–50 it changes slowly and levels off.
Table 10 shows the precision, recall and F1-score of HAR-CE and HAR-FL under the different class distributions. HAR-FL scores higher than HAR-CE under every distribution. Both models score highest under S2, mainly because S2 contains fewer samples of a class that is more difficult to classify.
Table 11 presents the per-class precision, recall and F1-score of HAR-FL and HAR-CE. For most classes, HAR-FL scores higher than HAR-CE. For the classes “StairsUp” and “StairsDown”, every metric is markedly lower than for the other classes, which suggests that the sensor data for going upstairs and going downstairs are highly similar.

5. Conclusions

It is particularly important to recognize the daily activities of the elderly, which can be further used to predict and monitor chronic diseases, such as dementia. To address the class imbalance of activity data, in this paper we propose a hierarchical ensemble deep learning activity recognition approach with wearable sensors based on focal loss. In our approach, wearable sensor devices are worn on both wrists to collect a variety of human daily activity data.
The experimental results show that, for activity data with imbalanced classes, the hierarchical ensemble deep learning model based on focal loss recognizes activities effectively.
The daily activities of the elderly affect sensors worn on different parts of the body differently. Therefore, balancing the influence of each sensor on activity recognition is the focus of our future work.

Author Contributions

Conceptualization, H.C. and T.Z.; methodology, T.Z. and H.C.; software, T.Z., Y.Z.; formal analysis, H.C. and T.Z.; investigation, Y.Z., Y.B. and S.Z.; writing—original draft preparation, T.Z.; writing—review and editing, H.C. and Y.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Natural Science Foundation of the Department of Education, Anhui Province, China, Grant number KJ2020B09, Key Program Natural Science Foundation of the Department of Education, Anhui Province, China, Grant number KJ2021A1068, National Natural Science Foundation of Anhui Province, China, Grant number 2108085MF208, and Science and Technology Project of Chuzhou City, Anhui Province, China, Grant number 2021ZD012.

Institutional Review Board Statement

The study was approved by the Institutional Review Board of School of Computer and Information Engineering, Chuzhou University (CHZU-CSCI-2022-002) for studies involving humans.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study, in the collection, analyses, or interpretation of data, in the writing of the manuscript, or in the decision to publish the results.

References

  1. Xie, B.; Wu, Q. HMM-based tri-training algorithm in human activity recognition with smartphone. In Proceedings of the 2012 IEEE 2nd International Conference on Cloud Computing and Intelligence Systems, Hangzhou, China, 30 October–1 November 2012.
  2. Sarcevic, P.; Kincses, Z.; Pletl, S. Comparison of different classifiers in movement recognition using WSN-based wrist-mounted sensors. In Proceedings of the 2015 IEEE Sensors Applications Symposium (SAS), Zadar, Croatia, 13–15 April 2015.
  3. Fan, L.; Wang, Z.; Wang, H. Human activity recognition model based on decision tree. In Proceedings of the 2013 International Conference on Advanced Cloud and Big Data, Nanjing, China, 13–15 December 2013.
  4. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
  5. Donahue, J.; Anne Hendricks, L.; Guadarrama, S.; Rohrbach, M.; Venugopalan, S.; Saenko, K.; Darrell, T. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015.
  6. Lane, N.D.; Georgiev, P. Can deep learning revolutionize mobile sensing? In Proceedings of the 16th International Workshop on Mobile Computing Systems and Applications, New York, NY, USA, 12–13 February 2015.
  7. Lee, Y.S.; Cho, S.B. Activity recognition using hierarchical hidden Markov models on a smartphone with 3D accelerometer. In Proceedings of the International Conference on Hybrid Artificial Intelligence Systems, Melacca, Malaysia, 5–8 December 2011.
  8. Kwapisz, J.R.; Weiss, G.M.; Moore, S.A. Activity recognition using cell phone accelerometers. ACM Sigkdd Explor. Newsl. 2011, 12, 74–82.
  9. Sun, L.; Zhang, D.; Li, B.; Guo, B.; Li, S. Activity recognition on an accelerometer embedded mobile phone with varying positions and orientations. In Proceedings of the International Conference on Ubiquitous Intelligence and Computing, Xi’an, China, 26–29 October 2010.
  10. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
  11. Jiang, W.; Yin, Z. Human activity recognition using wearable sensors by deep convolutional neural networks. In Proceedings of the 23rd ACM International Conference on Multimedia, Brisbane, Australia, 26–30 October 2015.
  12. Ronao, C.A.; Cho, S.B. Human activity recognition with smartphone sensors using deep learning neural networks. Expert Syst. Appl. 2016, 59, 235–244.
  13. Ravi, D.; Wong, C.; Lo, B.; Yang, G.Z. Deep learning for human activity recognition: A resource efficient implementation on low-power devices. In Proceedings of the 2016 IEEE 13th International Conference on Wearable and Implantable Body Sensor Networks (BSN), San Francisco, CA, USA, 14–17 June 2016.
  14. Amroun, H.; Temkit, M.H.; Ammi, M. Best feature for CNN classification of human activity using IOT network. In Proceedings of the 2017 IEEE International Conference on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData), Exeter, UK, 21–23 June 2017.
  15. Guan, Y.; Plötz, T. Ensembles of deep LSTM learners for activity recognition using wearables. In Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, Maui, HI, USA, 11–15 September 2017.
  16. Kuncheva, L.I. Combining Pattern Classifiers: Methods and Algorithms; John Wiley & Sons: Hoboken, NJ, USA, 2014.
  17. Radu, V.; Lane, N.D.; Bhattacharya, S.; Mascolo, C.; Marina, M.K.; Kawsar, F. Towards multimodal deep learning for activity recognition on mobile devices. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing, Heidelberg, Germany, 12–16 September 2016.
  18. Ordóñez, F.J.; Roggen, D. Deep convolutional and LSTM recurrent neural networks for multimodal wearable activity recognition. Sensors 2016, 16, 115.
  19. Chen, Z.; Jiang, C.; Xie, L. A novel ensemble ELM for human activity recognition using smartphone sensors. IEEE Trans. Ind. Inform. 2018, 15, 2691–2699.
  20. Sundaramoorthy, P.; Gudur, G.K.; Moorthy, M.R.; Bhandari, R.N.; Vijayaraghavan, V. HARNet: Towards on-device incremental learning using deep ensembles on constrained devices. In Proceedings of the 2nd International Workshop on Embedded and Mobile Deep Learning, Munich, Germany, 15 June 2018.
  21. Wang, K.J.; Makond, B.; Chen, K.H.; Wang, K.M. A hybrid classifier combining SMOTE with PSO to estimate 5-year survivability of breast cancer patients. Appl. Soft Comput. 2014, 20, 15–24.
  22. Jo, T.; Japkowicz, N. Class imbalances versus small disjuncts. ACM Sigkdd Explor. Newsl. 2004, 6, 40–49.
  23. Hradsky, O.; Ohem, J.; Zarubova, K.; Mitrova, K.; Durilova, M.; Kotalova, R.; Nevoral, J.; Zemanova, I.; Dryak, P.; Bronsky, J. Disease activity is an important factor for indeterminate interferon-γ release assay results in children with inflammatory bowel disease. J. Pediatr. Gastroenterol. Nutr. 2014, 58, 320–324.
  24. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017.
  25. Borovykh, A.; Bohte, S.; Oosterlee, C.W. Conditional time series forecasting with convolutional neural networks. arXiv 2017, arXiv:1703.04691.
  26. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015.
  27. Stisen, A.; Blunck, H.; Bhattacharya, S.; Prentow, T.S.; Kjærgaard, M.B.; Dey, A.; Sonne, T.; Jensen, M.M. Smart devices are different: Assessing and mitigating mobile sensing heterogeneities for activity recognition. In Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems, Seoul, Korea, 1–4 November 2015.
Figure 1. Activity recognition network architecture based on deep ensemble learning.
Figure 2. Single-channel sensor signal feature extraction.
Figure 3. Feature extraction of multi-channel sensor signals.
Figure 4. The curve of accuracy with training set.
Figure 5. The curve of accuracy with validation set.
Figure 6. The precision comparison of HAR-CE and HAR-FL in each class.
Figure 7. The recall comparison of HAR-CE and HAR-FL in each class.
Figure 8. The F1-score comparison of HAR-CE and HAR-FL in each class.
Figure 9. The curve of accuracy of training set with different epoch when parameter γ is different.
Figure 10. The curve of accuracy of validation set with different epoch when parameter γ is different.
Figure 11. The comparison of model performance under different γ.
Figure 12. The curves of the accuracy of DH training sets with epochs.
Figure 13. The curves of the accuracy of DH validation sets with epochs.
Table 1. The comparison of related work.
| Reference | Main Contributions | Sensor | Classes |
|---|---|---|---|
| [7] | The real-time activity recognition application on a smartphone with the Google Android platform | smartphone | stand, walk, stair up/down, run, shopping, taking bus, moving (by walk) |
| [8] | The activity recognition model permits users to gain useful knowledge about the habits of millions of users passively just by having them carry cell phones | smartphone | walking, jogging, climbing stairs, sitting, and standing |
| [12] | Proposed a deep convolutional neural network (convnet) to perform HAR using smartphone sensors by exploiting the inherent characteristics of activities and 1D time-series signals | smartphone | walking, upstairs, downstairs, sitting and standing, lying |
| [14] | Evaluating what is the best descriptor to recognize human activity using Convolutional Neural Network in a non-controlled environment using a network of smart objects | smartphone | standing, sitting, lying and walking |
| [15] | Developed modified training procedures for LSTM networks and combine sets of diverse LSTM learners into classifier collectives | wearable sensors | close/open dishwasher, close/open drawer, close/open door, close/open fridge, toggle switch, drink from cup, clean table |
| [17] | Investigating the opportunity to use deep learning to perform this integration of sensor data from multiple sensors | smartphone | sitting, standing, walking, climbing stairs, descending stairs, biking |
| [18] | Proposed a generic deep framework for activity recognition based on convolutional and LSTM recurrent units | wearable sensors | close/open dishwasher, close/open drawer, close/open door, close/open fridge, toggle switch, drink from cup, clean table |
| [19] | Introduced a novel ensemble ELM algorithm for human activity recognition using smartphone sensors | smartphone | sitting, standing, lying, walking, walking upstairs, and downstairs |
Table 2. Layer 1 construction.
| Layers | #Feature Maps | Feature Map Size | #Parameters |
|---|---|---|---|
| LSTM | 32 | 28 | 4352 |
| 1D-CNN | 8 | 28 | 1288 |
| BN | 8 | 28 | 32 |
| Max-pooling1D | 8 | 14 | 0 |
| Concatenate | 72 | 14 | 0 |
| Reshape | 1 | 14 × 72 | 0 |
Table 3. Layer 2 construction.
| Layers | #Feature Maps | Feature Map Size | #Parameters |
|---|---|---|---|
| 2D-CNN | 8 | 14 × 72 | 80 |
| BN | 8 | 14 × 72 | 32 |
| Max-pooling2D | 8 | 7 × 36 | 0 |
| Concatenate | 16 | 7 × 36 | 0 |
| 2D-CNN | 32 | 7 × 36 | 4640 |
| BN | 32 | 7 × 36 | 128 |
| Max-pooling2D | 32 | 4 × 18 | 0 |
Table 4. Regression layer construction.
| Layers | #Feature Maps | Feature Map Size | #Parameters |
|---|---|---|---|
| Flatten | 1 | 2304 | 0 |
| Dense-1 | 1 | 64 | 147,520 |
| Dense-2 | 1 | 32 | 2080 |
| Dense-3 | 1 | 16 | 528 |
| Dropout | 1 | 16 | 0 |
| Dense-4 | 1 | 8 | 136 |
| Dense-5 | 1 | 7 | 63 |
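The parameter counts of the dense layers in Table 4 can be checked with a few lines of arithmetic. The sketch below assumes standard fully connected layers with a bias term, so that a layer with n_in inputs and n_out outputs has n_in × n_out + n_out parameters. The helper `dense_params` is hypothetical; the layer sizes are those listed in Table 4.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer (bias assumed)."""
    return n_in * n_out + n_out

# Output sizes from Table 4: Flatten, Dense-1, ..., Dense-5.
# Dropout sits between Dense-3 and Dense-4 and adds no parameters.
sizes = [2304, 64, 32, 16, 8, 7]
counts = [dense_params(a, b) for a, b in zip(sizes, sizes[1:])]
print(counts)  # [147520, 2080, 528, 136, 63]
```

The flattened size 2304 equals 32 × 4 × 18, the output of the last max-pooling layer in Table 3.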
Table 5. The comparison of different sensors.
Sensors | Attitude Sensor (BWT61CL) | Triaxial Accelerometer | Gyroscope Sensor
3-Axis Acceleration
3-Axis Angular Velocity (Gyroscope)
3-Axis Angle
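Table 6. Classes distribution.
| Classes | S1 | S2 | S3 | S4 | S5 |
|---|---|---|---|---|---|
| Cooking | 708 | 200 | 708 | 708 | 200 |
| Keyboarding | 708 | 200 | 708 | 708 | 708 |
| Reading | 708 | 708 | 200 | 708 | 708 |
| Brushing teeth | 708 | 708 | 200 | 708 | 708 |
| Washing face | 708 | 708 | 708 | 200 | 708 |
| Washing dishes | 708 | 708 | 708 | 200 | 708 |
| Writing | 708 | 708 | 708 | 708 | 200 |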
Table 7. Comparison of HAR-CE and HAR-FL.
| Samples | HAR-FL Precision | HAR-FL Recall | HAR-FL F1-Score | HAR-CE Precision | HAR-CE Recall | HAR-CE F1-Score |
|---|---|---|---|---|---|---|
| S1 | 0.9870 | 0.9869 | 0.9868 | 0.9799 | 0.9798 | 0.9797 |
| S2 | 0.9759 | 0.9746 | 0.9747 | 0.9539 | 0.9517 | 0.9511 |
| S3 | 0.9723 | 0.9720 | 0.9715 | 0.9549 | 0.9543 | 0.9534 |
| S4 | 0.9850 | 0.9847 | 0.9846 | 0.9729 | 0.9720 | 0.9710 |
| S5 | 0.9823 | 0.9822 | 0.9821 | 0.9636 | 0.9632 | 0.9625 |
Table 8. Heterogeneity dataset (DH) characterized by their respective attributes.
| Activities | Devices | FS (Hz) | Users |
|---|---|---|---|
| [“Biking”, “Sitting”, “Walking”, “StairsUp”, “StairsDown”, “Standing”] | Nexus 4 | 200 | [a,b,c,d,e,f,g,h,i] |
| | Samsung S3 | 150 | |
| | Samsung S3 Mini | 100 | |
| | Samsung S+ | 50 | |
Table 9. The class distribution of DH dataset.
| Classes | S1 | S2 | S3 |
|---|---|---|---|
| stand | 7932 | 7932 | 7932 |
| sit | 8089 | 8089 | 8089 |
| walk | 10,225 | 10,224 | 10,225 |
| stairsup | 7519 | 2560 | 7519 |
| stairsdown | 6607 | 6607 | 6607 |
| bike | 9580 | 9580 | 2559 |
Table 10. The comparison results of HAR-CE and HAR-FL with DH data set.
| Samples | HAR-FL Precision | HAR-FL Recall | HAR-FL F1-Score | HAR-CE Precision | HAR-CE Recall | HAR-CE F1-Score |
|---|---|---|---|---|---|---|
| S1 | 0.9640 | 0.9641 | 0.9640 | 0.9569 | 0.9571 | 0.9570 |
| S2 | 0.9720 | 0.9717 | 0.9718 | 0.9660 | 0.9656 | 0.9657 |
| S3 | 0.9474 | 0.9465 | 0.9466 | 0.9388 | 0.9362 | 0.9363 |
Table 11. The distribution of indicators for each class of HAR-FL and HAR-CE with DH dataset.
| Classes | HAR-FL Precision | HAR-FL Recall | HAR-FL F1-Score | HAR-CE Precision | HAR-CE Recall | HAR-CE F1-Score |
|---|---|---|---|---|---|---|
| stand | 0.9950 | 0.9965 | 0.9970 | 0.9965 | 0.9960 | 0.9962 |
| sit | 1.000 | 0.9990 | 0.9997 | 1.0000 | 0.9995 | 0.9998 |
| walk | 0.9626 | 0.9660 | 0.9590 | 0.9618 | 0.9460 | 0.9588 |
| stairsup | 0.9089 | 0.9128 | 0.9067 | 0.9008 | 0.9021 | 0.9055 |
| stairsdown | 0.8777 | 0.8660 | 0.8783 | 0.8747 | 0.8644 | 0.8710 |
| bike | 0.9858 | 0.9880 | 0.9894 | 0.9744 | 0.9871 | 0.9864 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Zhao, T.; Chen, H.; Bai, Y.; Zhao, Y.; Zhao, S. A Hierarchical Ensemble Deep Learning Activity Recognition Approach with Wearable Sensors Based on Focal Loss. Int. J. Environ. Res. Public Health 2022, 19, 11706. https://doi.org/10.3390/ijerph191811706
