1. Introduction
Human Action Recognition (HAR) aims to identify the actions performed by humans through the automatic analysis of data coming from different types of sensors. In recent years, HAR has become widely used in numerous relevant and heterogeneous application fields, from the most commercial to the most assistive ones, such as Ambient Assisted Living (AAL) [1,2,3,4,5,6,7]. In the latter, HAR provides an array of solutions for improving individuals' quality of life, allowing elderly people to live healthier and independent lives for longer, helping people with disabilities, and supporting caregivers and medical staff [3,6,7]. HAR is mainly based on the analysis of data acquired using several hardware devices (e.g., RGB cameras, RGB-D devices, or inertial sensors) and is carried out with a plethora of Artificial Intelligence (AI) algorithms [4,5]. Data acquisition for these purposes can be categorized as visual sensor-based or non-visual sensor-based (e.g., wearable inertial sensors), depending on which devices are used. The visual sensor-based approach using RGB-D systems, such as the Microsoft Kinect, allows collecting RGB, depth, and skeleton data, also providing rich 3D structural information about the scene. Unfortunately, these data are frequently noisy and require pre-processing for a robust estimation of the body position and the identification of human actions. Furthermore, the data quality depends strongly on the position of the subject with respect to the camera, on the movement complexity, and on the density of the furniture in the scene. More reliable data are obtained if the subject is facing the camera, or at least clearly visible in the camera's optimal capture volume, with no body segments occluded by other segments or by other objects inside the room. A possible solution to handle these two latter constraints is a multiple-camera setup covering as many areas of the room as possible from different points of view.
As briefly mentioned above, the data for HAR are analyzed to classify and identify human actions with AI techniques, such as machine learning and deep learning models. The machine learning algorithms most frequently employed are Support Vector Machine (SVM), Dynamic Time Warping (DTW), Hidden Markov Model (HMM), Random Forest (RF), and some kinds of Artificial Neural Networks (ANN) [5,8]. Malekmohamadi et al. compared the results of three different machine learning algorithms (Naïve Bayes (NB), Multi-Layer Perceptron (MLP), and RF) for identifying 13 possible human daily activities performed in front of the camera (i.e., standing, sitting, lying down in sleep position, etc.), using Kinect skeletal joint coordinates. They obtained average precision values of 84.1% with NB, 98.7% with MLP, and 99.0% with RF [9]. Alternatively, Akyash et al. proposed a new kernel function based on DTW for SVM classification of eight different human postures (e.g., sit, walk, lie down, etc.) in two different datasets (the TST fall detection dataset and the UTD-MHAD dataset) [10,11]. The data of both datasets were collected with the subject positioned in front of the camera. The proposed kernel was applied to each coordinate of every joint. With this method they obtained an overall classification accuracy of 98.8% on the TST fall detection dataset and 98.75% on the UTD-MHAD dataset [12]. Su et al. suggested a multi-level hierarchical recognition model, using a custom classification algorithm for processing Microsoft Kinect skeletal joint coordinates. At the first level, they used an SVM classifier, and at the second level, an HMM algorithm. With this solution, they aimed at identifying 20 human actions, such as bend, hand catch, pick up, and throw, using the MSRAction3D dataset [13], in which the actors were positioned in front of the camera during the acquisitions. They obtained an average recognition rate of 91.41% [14]. In the same vein, Ahad et al. trained an SVM classifier for human activity identification (e.g., walk, sit down, stand up, etc.) with kinematic features (3D linear joint positions and angles between bone segments) from 3D skeletal joint datasets in which subjects were positioned in front of the camera (UT-Kinect Action 3D, Kinect Activity Recognition Dataset, MSR 3D Action Pairs, Florence 3D, and Office Activity Dataset) [15,16,17,18,19]. The number of classes varied from 9 to 18, depending on the dataset used. The SVM classifier was trained with a linear kernel function, obtaining, for each dataset, the following results in terms of accuracy and precision: 93.91%, 97.51%, 74.78%, 71.58%, and 94.92%, respectively [20].
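For concreteness, the following minimal Python sketch illustrates the kind of pipeline used in the studies above: a linear-kernel SVM trained on flattened kinematic skeleton features (3D joint coordinates plus inter-segment angles). All data shapes, class counts, and values are illustrative placeholders, not those of the cited works.

```python
# Hedged sketch: linear-kernel SVM posture/action classifier on kinematic
# skeleton features. Shapes and labels below are illustrative assumptions.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stand-in data: 500 frames, 20 joints x 3 coordinates flattened,
# plus 19 inter-segment angles -> 79 features; 10 action classes.
X = rng.normal(size=(500, 20 * 3 + 19))
y = rng.integers(0, 10, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Standardize the features, then fit the linear-kernel SVM.
clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
clf.fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.3f}")
```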
Deep learning-based approaches are receiving increasing attention in the HAR domain thanks to the progress they have brought in the detection and recognition of human actions, especially in visual sensor-based studies regarding AAL environments. In particular, Convolutional Neural Networks (CNN) have achieved great success in image-based tasks, while Recurrent Neural Networks (RNN) have outperformed other approaches on time series. For instance, Long Short-Term Memory (LSTM) networks are frequently used to solve sequence-based problems thanks to their strengths in modeling the dependencies and dynamics in sequential data [1,2,3,6,21,22]. Ahad et al. trained three different deep learning models using temporal statistical features computed through a sliding time window on 3D skeletal joint data from five public datasets and compared their performances with that of the SVM classifier. The first deep model was composed of two LSTM layers, the second was arranged with one CNN layer followed by an LSTM network (CNNRNN), and the last model was organized with two CNN networks followed by an LSTM network as the last layer (ConvRNN). The best model for all the datasets used was the ConvRNN architecture, which obtained accuracies ranging from 94.7% to 98.1% [20]. Zhu et al. proposed a new spatial model with an end-to-end bidirectional LSTM-CNN (BLSTM-CNN). First, a hierarchical spatial–temporal dependent relational model was used to explore the rich spatial–temporal information in the skeleton data. Then, a new framework was implemented to fuse CNN and LSTM. The LSTM was used to extract the temporal features, and a standard CNN was applied to the output of the LSTM to exploit spatial information. They used two well-known CNN architectures: VGG16 and AlexNET. The proposed models were trained and tested on the NTU RGB+D, SBU Interaction, and UTD-MHAD datasets, with the number of classification labels ranging from 8 in the SBU Interaction dataset to 60 in the NTU RGB+D one [23,24]. In terms of overall accuracy, the BLSTM-CNN implemented with VGG16 provided the best results on the NTU RGB+D dataset (87.1% and 93.3% in the cross-subject and cross-view benchmarks, respectively) and on the UTD-MHAD dataset (93.1%), while the AlexNET implementation was the best algorithm on the SBU Interaction dataset with 98.8% [25]. Alternatively, Devanne et al. compared two kinds of temporally hierarchical deep learning models to identify human activities of daily living through skeletal data captured with a Kinect V2 sensor. The first model was a conventional LSTM architecture with a single LSTM layer, a fully connected layer, and a Softmax layer. The second was similar but used an additional LSTM layer. They decomposed human activity sequences into a set of short temporal segments with the purpose of classifying 21 types of activity (10 human behaviors in a domestic environment and 11 in an office context). They obtained an overall accuracy of 58.9% for the domestic environment and 58.5% for the office one [26]. Zhu et al. proposed a deep LSTM network with three bidirectional LSTM layers and two feedforward layers. The last LSTM layer was custom-designed, including dropout in order to prevent overfitting. They trained and tested the classifier on three different online databases: the SBU Kinect Interaction Dataset, the HDM05 Dataset, and the CMU Dataset [27,28]. Depending on the dataset used, they had a total of 8, 65, and 45 classes, and they obtained an overall accuracy of 90.41%, 97.25%, and 81.04%, respectively [29]. Liu et al. proposed a tree-structure-based method to explore the kinematic relationship between the skeletal joints. They used these data as input to the first LSTM layer, whose output was in turn fed to the second LSTM layer and finally to a Softmax layer. In the two LSTM layers, a new gate was added to the LSTM block to handle the noise and occlusion in 3D skeleton data. They trained and tested this model with five different online databases (NTU RGB+D Dataset, SBU Interaction Dataset, UT-Kinect Dataset, and MHAD) and obtained an overall accuracy of 77.7%, 93.3%, 97.0%, 95.0%, and 100%, respectively, for each dataset [30]. On the other hand, Liu et al. proposed a new class of LSTM networks, Global Context-Aware Attention LSTM for skeleton-based action recognition, capable of selectively focusing on the informative Kinect joints in each frame by using a global context memory cell. The model is structured with a first LSTM layer, which encodes the skeleton sequence and generates an initial global context representation for the action sequence, and a second layer that performs attention over the inputs by using the global context memory cell. They trained the network on five different datasets, i.e., NTU RGB+D, SYSU-3D, UT-Kinect, SBU-Kinect, and MHAD, and achieved the following accuracies: 76.1%, 78.6%, 99%, 94.9%, and 100%, respectively [31].
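As an illustration of the simplest of the LSTM-based designs discussed above (a single LSTM layer followed by a fully connected layer and a Softmax, as in Devanne et al.), the following is a minimal PyTorch sketch of a sequence-level skeleton action classifier. Layer widths, feature counts, and class counts are assumptions chosen for the example, not values from the cited papers.

```python
# Hedged sketch: one-layer LSTM action classifier over a skeleton sequence
# (one label per sequence). All sizes are illustrative assumptions.
import torch
import torch.nn as nn

class SkeletonLSTM(nn.Module):
    def __init__(self, n_features=75, hidden=128, n_classes=21):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)  # Softmax applied in the loss

    def forward(self, x):              # x: (batch, time, n_features)
        out, _ = self.lstm(x)
        return self.fc(out[:, -1, :])  # classify from the last time step

model = SkeletonLSTM()
x = torch.randn(32, 60, 75)            # 32 sequences of 60 frames each
logits = model(x)                      # (32, 21); train with nn.CrossEntropyLoss
```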
Working with high-dimensional data increases the difficulty of knowledge discovery and of pattern classification due to the presence of many redundant and irrelevant features. Dimensionality reduction, achieved by filtering or removing redundant and noisy information, makes it possible to reduce or eliminate irrelevant patterns in the dataset, improving the quality of the data and, therefore, making the classification process more efficient [32,33,34]. Feature selection is one of the techniques used to achieve dimensionality reduction by finding the smallest possible subset of features that efficiently defines the data for the given problem [35,36,37]. It can be accomplished using different methods, i.e., filter, wrapper, embedded, and the more recent hybrid approaches [37,38,39]. The wrapper method selects the optimal feature subset by evaluating alternative sets, running the classification algorithm on the training data and using the classifier's estimated accuracy as its metric [38] (see the sketch below). The most commonly used iterative algorithms are Recursive Feature Elimination with SVM, the Sequential Feature Selection algorithm, and the Genetic Algorithm. Compared to the filter method, the wrapper method achieves better performance and higher accuracy [36,38]; nevertheless, it increases computational complexity due to the need to re-run the learning algorithm for each feature set considered. Starting from the multimodal output of the Microsoft Kinect system, many sets of features of different natures (color-, depth-, and skeleton-based) are used to train classification models for HAR. To this aim, features usually range from RGB images, depth-based global features such as space-time volume, and silhouette information, to kinematic skeleton descriptors such as joint position and motion (velocity and acceleration), joint distances, joint angles, and 3D relative geometric relationships between rigid body parts [40,41,42]. The possibility of fusing the different multimodal information obtained by RGB-D cameras has recently been explored, with good results [43,44].
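As a concrete illustration of the wrapper approach, the following minimal sketch applies Recursive Feature Elimination around a linear SVM using scikit-learn; the data, class count, and number of features to retain are illustrative assumptions.

```python
# Hedged sketch: wrapper-style feature selection with Recursive Feature
# Elimination (RFE) around a linear SVM, one of the iterative wrapper
# algorithms mentioned above. Data shapes are illustrative.
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 40))    # 300 samples, 40 candidate features
y = rng.integers(0, 4, size=300)  # 4 posture/action classes

# RFE repeatedly refits the SVM and drops the weakest features until 10 remain.
selector = RFE(SVC(kernel="linear"), n_features_to_select=10, step=2)
selector.fit(X, y)
print("selected feature indices:", np.flatnonzero(selector.support_))
```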
In a previous study, we focused on skeleton-based features to classify the three most frequent postures taken by a person in a room during daily life: standing, sitting, and lying down [45]. We also considered a further posture, called "dangerous sitting," representing a subject slumped in a chair with his/her head lying forward or backward as if unconscious. This allowed us to perform a first distinction between routine activities and alarm situations. To develop a monitoring system able to deal with ecological data, we built a homemade database of skeleton data simultaneously acquired with four Kinect devices placed in different locations of the equipped room. In this way, the data are as close as possible to real daily scenarios, in which the subject moves around the room, taking different orientations and positions with respect to the camera. A subset of 10 features, computed from the Kinect skeletal joint coordinates and chosen using the ReliefF algorithm, was used to train and test a two-hidden-layer MLP neural network, obtaining, on the test set, an average posture classification accuracy of 83.9% [45]. This promising result was not, however, satisfactory for our purpose, since the classifier was the core of a more complex safety system aimed at generating an alarm when dangerous situations occur during everyday life inside a room. Therefore, hoping to increase the MLP performance, in a later study we proposed a pre-processing algorithm based on a velocity threshold and anthropometric constraints [46]. In this case, the overall accuracy reached by the classifier was 92%, and it increased to 95% when the test data were also averaged in a time window of 15 frames (corresponding to 0.5 s). This procedure, despite performing well, had the weighty drawback of increasing the computational time considerably, making the process unsuitable for the online demands of the monitoring system: the pre-processing phase took about 1.031 s, and the frame-by-frame MLP classification about 0.300 s, when considering a sequence of 60 frames. At this point, two different developments could be attempted to improve the classification accuracy: 1. a computational optimization of the pre-processing algorithm previously described (not discussed in this context); 2. the tuning of the previously implemented MLP classification model and a search for a new model, more appropriate for managing the raw data noise and the constraints of an online safety system. This latter path is the aim of the present work, in which, first, we defined a further class representing the transition between two consecutive postures (for example, between the sitting and lying down postures and vice versa) to enable the system to handle the continuous stream of data from the Kinect device; second, we optimized the previous MLP neural network model [45], selecting a new subset of features using an SVM algorithm, a new set of network hyperparameters, and a novel architecture; third, we trained and tested an LSTM sequence network model with a subset of features selected using a genetic algorithm (sketched below). Both feature selection processes were carried out on the training data and started from the previously selected set of 10 features [45]. We chose an LSTM network expecting to take advantage of its ability to produce a frame-by-frame output based on a sequence of data rather than only on the current input, and thereby its potential to capture a wide range of dependencies among the data. Using this dynamic network, we therefore also expect to be able to classify the data referring to the transitions between two successive postures, e.g., when the subject passes from the standing to the sitting posture or from the sitting to the lying posture, thereby configuring our system for usage in a more ecological daily life scenario in which the subject freely moves in the room. We analyzed different LSTM sequence architectures to find the one that gives the highest level of performance. Each one was configured to separately classify each data frame in the sequence in order to have results comparable with those of the optimized MLP. The final step of this work was then to compare the performances of the two optimized algorithms.
4. Discussion
This study is part of a research project aimed at developing a monitoring system allowing frail individuals to live autonomously while being non-invasively monitored for the occurrence of dangerous situations that may require external intervention. In a previous paper [45], we defined an MLP network aimed at classifying the postures of a subject performing daily living activities in a mock-up bedroom. Namely, we considered a set of features computed on skeleton data to recognize the 'standing,' 'sitting,' 'lying,' and 'dangerous sitting' postures. As previously described, our database was built using four different Kinect devices that simultaneously acquired the subject from four different points of view, without constraints on the position and orientation of the subject with respect to the camera, in order to train and test our model on more ecological data [47]. The present work builds on our previous paper, expanding the original dataset by including a fifth class, 'transition,' collecting all transitions between two consecutive postures. We considered this choice necessary in order to be able to provide the network with a continuous stream of data while reducing the risk of incurring false positives during such transition movements, e.g., while sitting down. With this new database of classified postures, we performed a new feature selection on the 10 parameters chosen to describe each captured frame in order to optimize the classification ability of an MLP network. The network hyperparameters were then, in turn, optimized, and we trained the MLP with our training set, composed of the data relative to 8 of our 10 acquired subjects. The remaining two subjects' data made up the test set, on which the MLP network achieved an average percentage of correct classifications (%CC) of 78.4%.
The same five-class dataset was considered to train an LSTM sequence network, with the addition of '999' frames, i.e., frames in which the Kinect was not able to identify a proper skeleton. Such noisy frames were frequent within each acquired subject's data, probably due to the varying orientations and positions of the subjects with respect to the Kinect systems. The presence of such noisy frames is a further element contributing to a dataset representative of the intended deployment conditions of the overall monitoring system.
A new feature selection was developed exploiting a genetic algorithm aimed at maximizing a fitness function consisting of the %CC computed over the test set using a reference LSTM sequence network. Such a network consisted of two LSTM layers, one 25% dropout layer, a fully connected layer, and a Softmax layer. The outcome of the feature selection led us to use 8 of the 10 available features, which were presented as sequences of 60 frames in mini-batches of 32 sequences each. As above, the training set was built using frames from eight subjects, and the remaining two subjects' data made up the test set. Several LSTM sequence architectures (one LSTM layer, two LSTM layers, one bidirectional LSTM layer, and two bidirectional LSTM layers), using different numbers of neurons in the hidden layer(s) and different dropout arrangements, were tested in a hyperparameter optimization algorithm.
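For reference, the following PyTorch sketch reproduces the structure of the reference LSTM sequence network just described (two LSTM layers, a 25% dropout layer, a fully connected layer, and a Softmax, here folded into the loss), emitting one label per frame for sequences of 60 frames and 8 features. The hidden-layer width and the exact label count are assumptions; the actual implementation and toolbox used in this work may differ.

```python
# Hedged re-sketch of the reference LSTM sequence network; sizes are assumptions.
import torch
import torch.nn as nn

class PostureSequenceNet(nn.Module):
    def __init__(self, n_features=8, hidden=100, n_classes=5):
        # n_classes = 5 postures incl. 'transition'; whether the '999'
        # no-skeleton frames form a sixth label is left open here (assumption).
        super().__init__()
        self.lstm1 = nn.LSTM(n_features, hidden, batch_first=True)
        self.lstm2 = nn.LSTM(hidden, hidden, batch_first=True)
        self.drop = nn.Dropout(p=0.25)            # the 25% dropout layer
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x):                         # x: (batch, 60 frames, 8 features)
        out, _ = self.lstm1(x)
        out, _ = self.lstm2(out)
        return self.fc(self.drop(out))            # (batch, 60, n_classes): one label per frame

model = PostureSequenceNet()
batch = torch.randn(32, 60, 8)                    # mini-batch of 32 sequences of 60 frames
logits = model(batch)                             # Softmax folded into nn.CrossEntropyLoss
```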
As expected, the addition of the 'transition' class worsened the classification ability of the MLP network. Indeed, frames that may be very similar to those required to be classified in one of the other classes are now requested to be assigned to the 'transition' class. In other words, a transition between sitting and standing, for example, contains frames that are very similar to those belonging to the 'sitting' class and to those belonging to the 'standing' class. This can be appreciated by examining the last row of the confusion matrix in Figure 4, which shows how a significant percentage of frames classified in Class 5 was supposed to be classified in the other classes. Clearly, a static network such as the MLP has no means to correctly classify such frames.
The classification accuracy improved when using the LSTM architectures, and even more so when using bidirectional LSTM sequence architectures, in which half of the neurons are presented with the regular sequence of inputs while the other half is presented with the backward input sequence. Such a network architecture, exploiting data shuffling and 45% dropout, reached the best results, with a mean classification accuracy of over 85%. The effectiveness of this approach was especially evident both in terms of correct classifications for frames belonging to the 'transition' class, in which fewer than 3% of frames belonging to other classes are now classified (see Figure 7), and in terms of correct classifications of '999' frames, for which the misclassified percentage was as low as about 9%.
Comparing the behavior of the two networks across the five classes shows a higher specificity for the MLP network on all five classes, yet also a much lower sensitivity than the LSTM on each class. An important contribution to the overall higher %CC obtained by the LSTM is due to the improved ability to correctly classify frames pertaining to the 'transition' class, as mentioned; a significant improvement also regards the increased sensitivity, F-score, and precision in classifying the 'dangerous sitting' posture, which represents a critical condition for the real usage setting of the application, one requiring raising an alarm for the safety of the monitored user. From this standpoint, the LSTM mean sensitivity of 0.95 in detecting true positives for the 'dangerous sitting' posture represents an important improvement over the 0.76 achieved by the MLP. Indeed, a higher chance of false positives (the specificity decreased from 0.93 for the MLP to 0.85 for the LSTM) is an acceptable cost for being more certain that dangerous situations will not be missed.
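For clarity, the per-class metrics discussed here can be derived from a confusion matrix as in the following numpy sketch; the matrix values are toy placeholders, not those of Figures 4 and 7.

```python
# Hedged sketch: per-class sensitivity, specificity, precision, and F-score
# from a confusion matrix C, where C[i, j] counts frames of true class i
# assigned to class j. The 3x3 matrix below is illustrative only.
import numpy as np

C = np.array([[50, 2, 1],
              [3, 45, 4],
              [2, 5, 48]])

TP = np.diag(C)
FN = C.sum(axis=1) - TP   # true class i, predicted elsewhere
FP = C.sum(axis=0) - TP   # predicted class i, true class elsewhere
TN = C.sum() - TP - FN - FP

sensitivity = TP / (TP + FN)   # true-positive rate per class
specificity = TN / (TN + FP)   # true-negative rate per class
precision = TP / (TP + FP)
f_score = 2 * precision * sensitivity / (precision + sensitivity)
print(sensitivity, specificity, precision, f_score, sep="\n")
```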
In terms of computation times for data classification, the two proposed models (MLP, 0.3 s; BLSTM, 0.03 s) were significantly faster than the previously proposed MLP with the pre-processing procedure (1.331 s). Moreover, the accuracy and sensitivity of the 2BLSTM2D model, which does not need a pre-processing phase, reached satisfactory levels.
As a final consideration, the results reported in the literature and mentioned in the Introduction appear to achieve very high accuracy rates, typically around or above 90%. These are generally higher than those reported in our work and are obtained on a larger number of classes, so a more in-depth comparison of those experiments with ours is called for.
The differences between these studies and the one presented in this work are broad and regard both the acquisition protocol, and hence the resulting database, and the goal of the classification approach. The studies reported in the literature aim at recognizing the daily action carried out in a sequence of data frames rather than recognizing one or more specific postures. Therefore, these works employ databases commonly available in the literature and adapt the set of actions to be classified according to the chosen database, without the need to create ad hoc ones. From this point of view, such studies are often focused on an artificial intelligence goal per se, although it may be tailored to specific aims such as monitoring compliance with a rehabilitation or behavioral protocol, identifying the daily living actions carried out to provide a measure of the subject's daily activity, or identifying changes with respect to his/her routine behavior. Our approach is instead quite specific, as it aims at recognizing individual postures during scenarios of everyday life, independently of the action that produced them (e.g., recognizing the lying down posture rather than the falling action). In this setting, building our database using different camera points of view helps us reproduce the data acquired on a subject freely moving in the room under natural living conditions. However, this realistic approach increases the amount of noise in our data, threatening the accuracy of the classification results.
Altogether, these important differences lead us to consider our results as hardly comparable to those that can be found in the HAR literature. Nonetheless, we are aware that other deep learning architectures and approaches may lead to even better results than the ones presented in this work.