Spatiotemporal Graph Autoencoder Network for Skeleton-Based Human Action Recognition
Abstract
1. Introduction
1.1. Human Action Recognition Based on Skeleton Data
1.2. Graph Autoencoders
1.3. Proposed Spatiotemporal Model for Human Action Recognition
- The development of a novel spatiotemporal graph-autoencoder network for skeleton-based HAR that effectively captures the complex spatial and temporal dynamics of human movements, offering a significant advancement in feature extraction and representation.
- Outperforming most of the existing methods on two widely used skeleton-based HAR datasets.
- Achieving notable performance improvements by incorporating additional modalities, as demonstrated in the experimental evaluation presented in Section 4.
2. Related Work
2.1. Graph Convolutional Networks (GCNs)
2.2. GCN-Based Skeleton Action Recognition
- Static and dynamic techniques: In static techniques, the GCN topology is fixed and remains constant during inference, whereas in dynamic techniques the topology is inferred dynamically from the input during inference.
- Topology-shared and topology-non-shared techniques: In topology-shared techniques, a single topology is shared across all channels, whereas topology-non-shared techniques employ different topologies for different channels or channel groups (a minimal code sketch contrasting the two follows this list).
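To make the topology-shared versus topology-non-shared distinction concrete, the following minimal PyTorch sketch applies one shared V × V adjacency to every channel, and then a separate adjacency per channel in the style of channel-wise topology refinement (e.g., CTR-GCN [11]). All tensor names and shapes are illustrative assumptions, not taken from any particular implementation:

```python
# Minimal sketch: topology-shared vs. topology-non-shared graph convolution.
# Shapes are illustrative: N batch, C channels, T frames, V joints.
import torch

N, C, T, V = 8, 64, 16, 25
x = torch.randn(N, C, T, V)  # skeleton feature tensor

# Topology-shared: one V x V adjacency is applied to every channel.
A_shared = torch.softmax(torch.randn(V, V), dim=-1)
y_shared = torch.einsum('nctv,vw->nctw', x, A_shared)

# Topology-non-shared: each channel (or channel group) gets its own
# adjacency, as in channel-wise topology refinement.
A_channel = torch.softmax(torch.randn(C, V, V), dim=-1)
y_nonshared = torch.einsum('nctv,cvw->nctw', x, A_channel)

print(y_shared.shape, y_nonshared.shape)  # both (8, 64, 16, 25)
```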
3. Materials and Methods
3.1. Datasets
- The 40 subjects are split in half: 20 subjects provide the training samples, while the remaining 20 provide the testing samples. This standard is named cross-subject (x-sub).
- The testing samples are derived from the views of camera 1, while the training samples are derived from the views of cameras 2 and 3. This standard is named cross-view (x-view).
- The 106 subjects are split in half: 53 subjects provide the training samples, while the remaining 53 provide the testing samples. This standard is named cross-subject (x-sub).
- The 32 collection setups are split in half: sequences with even-numbered setup IDs provide the training samples, while those with odd-numbered setup IDs provide the testing samples. This standard is named cross-setup (x-setup); a code sketch of all four splits follows this list.
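The four evaluation protocols reduce to simple membership tests on each sequence's metadata. The sketch below is a minimal illustration, assuming each sample carries subject, camera, and setup IDs (the official NTU releases encode these in the sequence filename); the x-sub training-subject set is the one published with NTU RGB+D, and NTU RGB+D 120 publishes an analogous 53-subject list that is omitted here for brevity:

```python
# Minimal sketch of the evaluation protocols. Metadata field names
# ('subject', 'camera', 'setup') are hypothetical dictionary keys.

# Published training-subject IDs for NTU RGB+D cross-subject (x-sub).
XSUB_TRAIN_SUBJECTS = {1, 2, 4, 5, 8, 9, 13, 14, 15, 16, 17, 18,
                       19, 25, 27, 28, 31, 34, 35, 38}

def split(sample, protocol):
    """Return 'train' or 'test' for one sample under a given protocol."""
    if protocol == 'x-sub':    # half the subjects train, half test
        return 'train' if sample['subject'] in XSUB_TRAIN_SUBJECTS else 'test'
    if protocol == 'x-view':   # cameras 2 and 3 train, camera 1 tests
        return 'train' if sample['camera'] in (2, 3) else 'test'
    if protocol == 'x-setup':  # even setup IDs train, odd setup IDs test
        return 'train' if sample['setup'] % 2 == 0 else 'test'
    raise ValueError(f'unknown protocol: {protocol}')

print(split({'subject': 1, 'camera': 1, 'setup': 3}, 'x-view'))  # test
```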
3.2. Preliminaries
3.3. Spatiotemporal Graph Autoencoder Network for Skeleton-Based HAR Algorithm
3.4. Spatiotemporal Input Representations
3.5. Modalities of GA-GCN
4. Results
4.1. Implementation Details
4.2. Experimental Results
5. Discussion
5.1. Comparison of GA-GCN Modalities
5.2. Comparison with the State-of-the-Art
6. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Shahroudy, A.; Liu, J.; Ng, T.T.; Wang, G. NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
2. Liu, J.; Shahroudy, A.; Perez, M.; Wang, G.; Duan, L.Y.; Kot, A.C. NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2684–2701.
3. Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. The Kinetics human action video dataset. arXiv 2017, arXiv:1705.06950.
4. Liu, J.; Shahroudy, A.; Wang, G.; Duan, L.Y.; Kot, A.C. Skeleton-based online action prediction using scale selection network. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 1453–1467.
5. Johansson, G. Visual perception of biological motion and a model for its analysis. Percept. Psychophys. 1973, 14, 201–211.
6. Li, M.; Chen, S.; Chen, X.; Zhang, Y.; Wang, Y.; Tian, Q. Actional-structural graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3595–3603.
7. Liu, J.; Shahroudy, A.; Xu, D.; Kot, A.C.; Wang, G. Skeleton-based action recognition using spatio-temporal LSTM network with trust gates. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 3007–3021.
8. Liu, J.; Wang, G.; Duan, L.Y.; Abdiyeva, K.; Kot, A.C. Skeleton-based human action recognition with global context-aware attention LSTM networks. IEEE Trans. Image Process. 2017, 27, 1586–1599.
9. Hamilton, W.; Ying, Z.; Leskovec, J. Inductive representation learning on large graphs. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30.
10. Kipf, T.N.; Welling, M. Variational graph auto-encoders. arXiv 2016, arXiv:1611.07308.
11. Chen, Y.; Zhang, Z.; Yuan, C.; Li, B.; Deng, Y.; Hu, W. Channel-wise topology refinement graph convolution for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 13359–13368.
12. Cai, Y.; Ge, L.; Liu, J.; Cai, J.; Cham, T.J.; Yuan, J.; Thalmann, N.M. Exploiting spatial-temporal relationships for 3D pose estimation via graph convolutional networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2272–2281.
13. Malik, J.; Elhayek, A.; Guha, S.; Ahmed, S.; Gillani, A.; Stricker, D. DeepAirSig: End-to-End Deep Learning Based In-Air Signature Verification. IEEE Access 2020, 8, 195832–195843.
14. Bruna, J.; Zaremba, W.; Szlam, A.; LeCun, Y. Spectral Networks and Locally Connected Networks on Graphs. In Proceedings of the International Conference on Learning Representations (ICLR), Banff, AB, Canada, 14–16 April 2014.
15. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
16. Srivastava, R.K.; Greff, K.; Schmidhuber, J. Highway networks. arXiv 2015, arXiv:1505.00387.
17. Defferrard, M.; Bresson, X.; Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; Volume 29.
18. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017.
19. Duvenaud, D.K.; Maclaurin, D.; Iparraguirre, J.; Bombarell, R.; Hirzel, T.; Aspuru-Guzik, A.; Adams, R.P. Convolutional networks on graphs for learning molecular fingerprints. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 28.
20. Niepert, M.; Ahmed, M.; Kutzkov, K. Learning convolutional neural networks for graphs. In Proceedings of the International Conference on Machine Learning (PMLR), New York, NY, USA, 19–24 June 2016; pp. 2014–2023.
21. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018.
22. Liu, Z.; Zhang, H.; Chen, Z.; Wang, Z.; Ouyang, W. Disentangling and unifying graph convolutions for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 143–152.
23. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12026–12035.
24. Tang, Y.; Tian, Y.; Lu, J.; Li, P.; Zhou, J. Deep progressive reinforcement learning for skeleton-based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5323–5332.
25. Veeriah, V.; Zhuang, N.; Qi, G.J. Differential recurrent neural networks for action recognition. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4041–4049.
26. Yan, S.; Xiong, Y.; Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018.
27. Ye, F.; Pu, S.; Zhong, Q.; Li, C.; Xie, D.; Tang, H. Dynamic GCN: Context-enriched topology learning for skeleton-based action recognition. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 55–63.
28. Zhao, R.; Wang, K.; Su, H.; Ji, Q. Bayesian graph convolution LSTM for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6882–6892.
29. Huang, Z.; Shen, X.; Tian, X.; Li, H.; Huang, J.; Hua, X.S. Spatio-temporal inception graph convolutional networks for skeleton-based action recognition. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 2122–2130.
30. Zhang, P.; Lan, C.; Zeng, W.; Xing, J.; Xue, J.; Zheng, N. Semantics-guided neural networks for efficient skeleton-based human action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1112–1121.
31. Cheng, K.; Zhang, Y.; Cao, C.; Shi, L.; Cheng, J.; Lu, H. Decoupling GCN with DropGraph module for skeleton-based action recognition. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXIV; Springer: Berlin/Heidelberg, Germany, 2020; pp. 536–553.
32. Chen, Y.; Dai, X.; Liu, M.; Chen, D.; Yuan, L.; Liu, Z. Dynamic convolution: Attention over convolution kernels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11030–11039.
33. Li, S.; Li, W.; Cook, C.; Zhu, C.; Gao, Y. Independently recurrent neural network (IndRNN): Building a longer and deeper RNN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5457–5466.
34. Li, C.; Zhong, Q.; Xie, D.; Pu, S. Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. arXiv 2018, arXiv:1804.06055.
35. Si, C.; Chen, W.; Wang, W.; Wang, L.; Tan, T. An attention enhanced graph convolutional LSTM network for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1227–1236.
36. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Skeleton-based action recognition with directed graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7912–7921.
37. Cheng, K.; Zhang, Y.; He, X.; Chen, W.; Cheng, J.; Lu, H. Skeleton-based action recognition with shift graph convolutional network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 183–192.
38. Song, Y.F.; Zhang, Z.; Shan, C.; Wang, L. Stronger, faster and more explainable: A graph convolutional baseline for skeleton-based action recognition. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1625–1633.
39. Korban, M.; Li, X. DDGCN: A dynamic directed graph convolutional network for action recognition. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XX; Springer: Berlin/Heidelberg, Germany, 2020; pp. 761–776.
40. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Decoupled spatial-temporal attention network for skeleton-based action-gesture recognition. In Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan, 30 November–4 December 2020.
41. Plizzari, C.; Cannici, M.; Matteucci, M. Skeleton-based action recognition via spatial and temporal transformer networks. Comput. Vis. Image Underst. 2021, 208, 103219.
42. Chen, Z.; Li, S.; Yang, B.; Li, Q.; Liu, H. Multi-scale spatial temporal graph convolutional network for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 2–9 February 2021; pp. 1113–1122.
43. Trivedi, N.; Sarvadevabhatla, R.K. PSUMNet: Unified Modality Part Streams are All You Need for Efficient Pose-based Action Recognition. In Proceedings of the Computer Vision—ECCV 2022 Workshops, Tel Aviv, Israel, 23–27 October 2022; Proceedings, Part V; Springer: Berlin/Heidelberg, Germany, 2023; pp. 211–227.
44. Liu, J.; Shahroudy, A.; Xu, D.; Wang, G. Spatio-temporal LSTM with trust gates for 3D human action recognition. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part III; Springer: Berlin/Heidelberg, Germany, 2016; pp. 816–833.
45. Ke, Q.; Bennamoun, M.; An, S.; Sohel, F.; Boussaid, F. Learning clip representations for skeleton-based 3D action recognition. IEEE Trans. Image Process. 2018, 27, 2842–2855.
Modality | Bone | Velocity | Fast Motion
---|---|---|---
joint | FALSE | FALSE | FALSE |
joint motion | FALSE | TRUE | FALSE |
bone | TRUE | FALSE | FALSE |
bone motion | TRUE | TRUE | FALSE |
joint fast motion | FALSE | FALSE | TRUE |
joint motion fast motion | FALSE | TRUE | TRUE |
bone fast motion | TRUE | FALSE | TRUE |
bone motion fast motion | TRUE | TRUE | TRUE |
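The three flags above compose into the eight modalities. The sketch below is one plausible reading of how each modality can be derived from raw joint coordinates: Bone replaces joints with joint-to-parent offset vectors, Velocity takes frame-to-frame differences, and Fast Motion is assumed here to take differences over a larger temporal stride (2 frames). The stride, the composition order, and the toy parent list are our assumptions, not details taken from the paper:

```python
# Minimal sketch: deriving a modality tensor from raw joint coordinates.
import numpy as np

def make_modality(x, parents, bone=False, vel=False, fast=False):
    """x: (T, V, 3) joint coordinates -> modality tensor of the same shape."""
    if bone:                          # joint-to-parent offset vectors
        x = x - x[:, parents, :]
    if vel:                           # frame-to-frame motion, zero-padded
        d = np.zeros_like(x)
        d[:-1] = x[1:] - x[:-1]
        x = d
    if fast:                          # larger-stride motion (stride 2 here)
        d = np.zeros_like(x)
        d[:-2] = x[2:] - x[:-2]
        x = d
    return x

# Toy example: 16 frames, 5 joints in a chain rooted at joint 0
# (stand-in for the real 25-joint NTU skeleton topology).
x = np.random.randn(16, 5, 3)
parents = [0, 0, 1, 2, 3]
bone_fast = make_modality(x, parents, bone=True, fast=True)
print(bone_fast.shape)  # (16, 5, 3)
```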
Methods | Accuracy (%) |
---|---|
GA-GCN joint modality | 95.14 |
GA-GCN joint motion modality | 93.05 |
GA-GCN bone modality | 94.77 |
GA-GCN bone motion modality | 91.99 |
GA-GCN ensemble of the joint, joint motion, bone, and bone motion modalities (reproduced on our machine) | 96.51
GA-GCN joint fast motion modality | 94.63 |
GA-GCN joint motion fast motion modality | 92.61 |
GA-GCN bone fast motion modality | 94.41 |
GA-GCN bone motion fast motion modality | 91.54 |
GA-GCN ensemble of the joint fast motion, joint motion fast motion, bone fast motion, and bone motion fast motion modalities (reproduced on our machine) | 96.36
GA-GCN ensemble of all eight modalities: joint, joint motion, bone, bone motion, joint fast motion, joint motion fast motion, bone fast motion, and bone motion fast motion | 96.8
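The ensemble rows above combine independently trained modality streams at the score level. The sketch below shows the common recipe of summing per-modality class scores before taking the argmax; uniform weights are an assumption, since the paper's weighting scheme is not specified in this excerpt:

```python
# Minimal sketch of score-level modality ensembling.
import numpy as np

def ensemble(scores):
    """scores: list of (num_samples, num_classes) arrays, one per modality."""
    fused = np.sum(scores, axis=0)   # sum per-modality class scores
    return fused.argmax(axis=1)      # predicted class per sample

# Toy example: 4 modality streams, 10 samples, 60 action classes.
streams = [np.random.rand(10, 60) for _ in range(4)]
pred = ensemble(streams)
print(pred.shape)  # (10,)
```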
Comparison with the state-of-the-art on the NTU-RGB+D dataset:

Methods | X-Sub (%) | X-View (%)
---|---|---
Ind-RNN [33] | 81.8 | 88.0 |
HCN [34] | 86.5 | 91.1 |
ST-GCN [26] | 81.5 | 88.3 |
2s-AGCN [23] | 88.5 | 95.1 |
SGN [30] | 89.0 | 94.5 |
AGC-LSTM [35] | 89.2 | 95.0 |
DGNN [36] | 89.9 | 96.1 |
Shift-GCN [37] | 90.7 | 96.5 |
DC-GCN+ADG [31] | 90.8 | 96.6 |
PA-ResGCN-B19 [38] | 90.9 | 96.0 |
DDGCN [39] | 91.1 | 97.1 |
Dynamic GCN [27] | 91.5 | 96.0 |
MS-G3D [22] | 91.5 | 96.2 |
CTR-GCN [11] | 92.4 | 96.8 * |
DSTA-Net [40] | 91.5 | 96.4 |
ST-TR [41] | 89.9 | 96.1 |
4s-MST-GCN [42] | 91.5 | 96.6 |
PSUMNet [43] | 92.9 | 96.7 |
GA-GCN | 92.3 | 96.8 |
Comparison with the state-of-the-art on the NTU-RGB+D 120 dataset:

Methods | X-Sub (%) | X-Set (%)
---|---|---
ST-LSTM [44] | 55.7 | 57.9 |
GCA-LSTM [8] | 61.2 | 63.3 |
RotClips+MTCNN [45] | 62.2 | 61.8 |
ST-GCN [26] | 70.7 | 73.2 |
SGN [30] | 79.2 | 81.5 |
2s-AGCN [23] | 82.9 | 84.9 |
Shift-GCN [37] | 85.9 | 87.6 |
DC-GCN+ADG [31] | 86.5 | 88.1 |
MS-G3D [22] | 86.9 | 88.4 |
PA-ResGCN-B19 [38] | 87.3 | 88.3 |
Dynamic GCN [27] | 87.3 | 88.6 |
CTR-GCN [11] | 88.9 | 90.6 |
DSTA-Net [40] | 86.6 | 89.0 |
ST-TR [41] | 82.7 | 84.7 |
4s-MST-GCN [42] | 87.5 | 88.8 |
PSUMNet [43] | 89.4 | 90.6 |
GA-GCN | 88.8 | 90.5 |