Article

Improved Graph Convolutional Network with Enriched Graph Topology Representation for Skeleton-Based Action Recognition

Tamam Alsarhan, Osama Harfoushi, Ahmed Younes Shdefat, Nour Mostafa, Mohammad Alshinwan and Ahmad Ali
1 Faculty of Information Technology, Applied Science Private University, Amman 11931, Jordan
2 Department of Information Technology, The University of Jordan, Amman 11931, Jordan
3 College of Engineering and Technology, American University of the Middle East, Egaila 54200, Kuwait
4 Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
* Authors to whom correspondence should be addressed.
Electronics 2023, 12(4), 879; https://doi.org/10.3390/electronics12040879
Submission received: 10 January 2023 / Revised: 6 February 2023 / Accepted: 7 February 2023 / Published: 9 February 2023

Abstract

Skeleton-based action recognition with graph convolutional networks (GCNs) has lately drawn remarkable attention. Because graph topology is the key to GCNs, recent methods have focused on graph learning. We propose to align graph learning at the channel level by introducing a graph convolution with an enriched topology based on attentive channel-wise correlations, namely the attentive channel-wise correlation graph convolution (ACC-GC). To let the model learn channel-wise enriched topologies, ACC-GC learns a shared graph topology spanning many channels and enhances it with attentive channel-wise correlations. These attentive correlations are generated by encoding the intra-correlation between different nodes within each channel, boosting informative channel-wise correlations, and suppressing trivial ones. Our enhanced ACC-GCN is created by substituting our ACC-GC for the graph convolution in a standard GCN. Extensive experiments on the NTURGB60 and Northwestern-UCLA datasets demonstrate that our proposed ACC-GCN performs comparably to state-of-the-art methods while reducing the computational cost.

1. Introduction

Action recognition, which has a wide range of applications in various areas, including video surveillance [1], human–robot interaction [2,3], and medical services [4], is one of the most alluring areas of computer vision. Due to several factors, such as (i) the vast variations in the visual and motion appearance of people and actions, (ii) the variations within classes of the same actions, and (iii) the similarities between classes of different actions, identifying the precise type of human motion in videos is particularly challenging.
Recent skeleton-based action recognition models mainly learn human action representations by exploiting deep learning networks [5,6,7,8,9]. Among deep models, the graph convolutional network (GCN) has shown superior performance over handcrafted features, recurrent neural networks (RNNs) [10,11], and convolutional neural networks (CNNs) [12,13,14]. The key factor behind the supremacy of GCNs is their ability to depict intricate interrelationships and dependencies between nodes (joints) in the skeletal graph [15]. Graph topology, represented by the static adjacency matrix $A \in \mathbb{R}^{N \times N}$, is the key to GCNs, where N is the number of nodes (joints) and A describes the physical connections between joints in the skeleton graph. For any two joints i, j, $A_{ij} = 1$ if the two joints are physically connected and $A_{ij} = 0$ otherwise.
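To make the static-topology formulation concrete, the following minimal Python sketch builds such a binary adjacency matrix for a hypothetical five-joint toy skeleton; the joint names and edge list are purely illustrative (the real NTU RGB+D skeleton has 25 joints).

```python
import numpy as np

# Hypothetical 5-joint toy skeleton: 0-head, 1-neck, 2-torso, 3-left hand, 4-right hand.
# This is only an illustration of how A is constructed, not the NTU RGB+D layout.
edges = [(0, 1), (1, 2), (1, 3), (1, 4)]

N = 5
A = np.zeros((N, N), dtype=np.float32)
for i, j in edges:
    A[i, j] = 1.0  # joints i and j are physically connected
    A[j, i] = 1.0  # the skeleton graph is undirected

print(A)
```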
Recent GCN-based methods rely on a static graph topology shared across all layers and input samples, which is unfavorable for skeleton-based action recognition. For instance, it limits the modeling of multilevel semantic information. Many actions require cooperation between joints that are not physically connected, which a static graph cannot represent: clapping and playing the piano depend on the coordination of the right- and left-hand joints, which are physically unconnected. Additionally, it is preferable for samples belonging to different action classes to have different topologies. Therefore, to improve the representative capability of GCNs, dependency links between non-connected joints must be drawn in a sample-specific graph [16].
Because of their greater generalization capacity, dynamic graphs that overcome the shortcomings of static graphs are drawing a lot of interest, and there have been several attempts to use them for skeleton-based action recognition. For instance, the context-encoding network (CeN), a dynamic GCN, was presented to learn the skeleton topology automatically [14]. Along with learning contextual characteristics from the remaining joints, which are combined and integrated globally, CeN also learns the dependency between any two joints. As a result, different graphs are built for different input samples and for graph convolutional layers at different depths. In the same context, several approaches [12,13,17,18] learn different graphs for different samples by utilizing non-local operations [19]. Although these approaches enhance the expressive power of the model, they are dynamic shared-topology approaches, since the same topology is shared among all channels. Different channels capture different kinds of motion data, so the correlations between joints vary across channels, and applying the same topology to all channels prevents the model from capturing more insightful features.
Topology non-shared GC-based methods for skeleton-based action recognition have rarely been investigated. Among the few works on non-shared graph topology are [20,21]. In [20], the authors import the idea of decoupled aggregation from CNNs to GCNs to increase the expressiveness of spatial aggregation: they divide the channels into g decoupled sets, with a trainable adjacency matrix for each set. In [21], different graph topologies are learned via channel-wise topology refinement, which aggregates features in different channels.
In this paper, motivated by the finding that each feature channel focuses on a specific pattern, we aim to align graph learning at the channel level. Specifically, we propose a graph convolution with an enriched graph topology based on attentive channel-wise correlations (ACC-GC) for skeleton-based action recognition. Our ACC-GC is designed to learn a shared graph topology over a range of channels and to enhance it using attentive channel-wise correlations, allowing the model to learn channel-wise enriched topologies. ACC-GC encodes the intra-correlation between joints within each channel, enhancing informative channel-wise correlations and suppressing trivial ones. In this manner, channel attention and intra-correlation modeling are tied into a single structure: channel attention improves the features from the channel perspective during feature extraction, while intra-correlation modeling enables the model to use channel-wise information to update the shared graph, a parameterized adjacency matrix. Because different topologies are learned dynamically for different channels, the network is compelled to find and strengthen the informative channel-wise features that capture distinct information. The principal contributions of this paper are as follows:
  • This paper proposes aligning the graph learning on the channel level by introducing a graph convolution with an enriched topology based on attentive channel-wise correlations (ACC-GCs). By integrating ACC-GC in GCN, we obtain an enhanced GCN with an enriched topology representation (ACC-GCN) for skeleton-based action recognition.
  • This paper explores the advantage of integrating our ACC-GC configuration over dynamic topology non-shared GCN-based models.
  • By performing extensive experiments, we demonstrate the effectiveness of our proposed ACC-GCN using two large-scale human action recognition datasets, mainly the NTURGB60 and Northwestern-UCLA datasets.
The rest of this paper is organized as follows: Section 2 reviews the work most related to ours. The ACC-GCN method is presented in Section 3, and Section 4 describes the experimental settings and results. Finally, this work is concluded in Section 5.

2. Related Work

This section introduces prominent works on skeleton-based action recognition and on attention. It is divided into four subsections: we first introduce deep neural network (DNN) based approaches for skeleton-based action recognition, followed by graph convolutional network (GCN) based approaches, Transformer-based approaches, and finally attention in deep learning.

2.1. DNN-Based Approaches

Skeleton-based action recognition has received remarkable attention recently. Approaches that utilize deep neural networks (DNNs) [5,6] have shown better performance than traditional approaches that rely on hand-crafted features [22,23,24]. Deep models include recurrent neural networks (RNNs), such as LSTM [25] and GRU [26], convolutional neural networks (CNNs), and graph convolutional networks (GCNs). RNN-based approaches model the temporal dynamics of the skeleton sequence by aggregating temporal information sequentially [10,27]. In contrast, CNN-based approaches model spatiotemporal information jointly and represent the skeleton data as a pseudo-image, where the three coordinates of each skeleton joint (x, y, z) are treated as three image channels [28,29,30]. These attempts to apply CNNs to action recognition have certainly shown success. However, topological information about the skeleton is lost when the irregularly structured skeleton data are transformed into regularly structured images.

2.2. GCN-Based Models

GCNs are a family of DNNs designed to handle irregularly structured graphs, such as skeletal data. Recent GCN-based approaches [31,32,33], which model skeleton data as graphs, have shown outstanding performance for skeleton-based action recognition. For instance, the authors of [34] argued that recent GCN-based models have paid too little attention to embedding the intricacies of human behavior into the latent representations of human action. Hence, a GCN-based model, InfoGCN, was proposed to fill this gap: a learning framework that combines a novel learning objective with an encoding method for skeleton-based action recognition. The authors built an information bottleneck-based learning objective to direct the model toward informative but compact latent representations.
Depending on whether the graph topology is dynamically adjusted during inference, GCN-based approaches are either static or dynamic. Static approaches define a fixed graph topology based on the physical connections of the human body, as proposed in [35]. In contrast, dynamic approaches adjust the topology during inference with the help of non-local operations [13,18]. As shown in Figure 1, each of these categories can further be divided into topology-shared approaches, in which the graph topology is shared among all channels, and topology non-shared approaches, in which the topology differs in each channel.

2.3. Transformer-Based Methods

Transformer-based approaches model the joints of the skeleton sequence using the idea of self-attention [36]. To reduce the computational cost, most techniques calculate the correlations of joints separately in the space and time domains. For instance, the spatio-temporal cross (STAR)-Transformer [37] encodes two cross-modal features as a distinctive vector: from the combined input video and skeleton sequence, the video frames are represented as global grid tokens and the skeleton sequences as joint map tokens.

2.4. Attention in Deep Learning

Attention, a complex cognitive function, has become one of the deep learning field's most influential ideas. It is inspired by human biological systems, which tend to highlight distinguishing features when dealing with large amounts of data. At an abstract level, humans selectively focus on a subset of information when and where it is required while ignoring the rest. For example, when viewing a scene, humans typically do not pay attention to all of the details; instead, they concentrate on certain parts of the scene as needed. Recently, various computer vision applications have made extensive use of attention mechanisms [38,39,40,41,42,43]. Channel attention is one such mechanism that has attracted remarkable interest in recent years; it allows a network to learn what it should focus on. The squeeze-and-excitation (SE) network [44] is a channel-attention network that adaptively re-calibrates channel-wise feature responses by explicitly modeling inter-dependencies between channels, adding a content-aware mechanism that weights each channel adaptively. Selective kernel (SK) networks [45] utilize several branches with different kernel sizes, which are then fused via Softmax to enhance the effectiveness of object recognition. Furthermore, a channel hard-attention mechanism is proposed in [46]: by enabling the network to focus repeatedly on its own filters, it uses sequential processing to improve classification performance.

3. Proposed Method

In the context of skeleton-based action recognition, we first review the spatial graph convolution in standard GCN in this section (Section 3.1). Then, we show the details of our enhanced graph convolution (Section 3.2). Finally, the overall model is introduced in Section 3.3.

3.1. GCN for Action Recognition Using Skeletons

The graph is initially depicted by vertices and edges as G = (V, E, A), where the skeleton graph is composed of V joints connected by E edges and the adjacency matrix A encodes the joint connections. A graph convolutional network (GCN) consists of several spatial graph convolutional blocks (SGC-blocks) and temporal convolutional blocks (TGC-blocks), which are responsible for explicitly exploring the spatial and temporal correlations between human joints, respectively. Specifically, an SGC-block can be formulated as follows:
Y = \sum_{k}^{K_v} \Lambda_k^{-\frac{1}{2}} A_k \Lambda_k^{-\frac{1}{2}} X W_k    (1)
Here, X denotes the input feature maps and $W_k$ the learnable weight matrices. Following [35], $K_v$ is the number of spatial configurations, $A_k$ is the adjacency matrix of the k-th configuration, and $\Lambda_k$ is the corresponding diagonal degree matrix. The temporal convolutional blocks are traditional convolutional layers with a kernel size of $k_t \times 1$. Learning spatiotemporal features requires constructing a GCN that alternately stacks SGC-blocks and TGC-blocks.
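As a concrete illustration of Equation (1), the following PyTorch sketch implements one SGC-block with symmetric normalization and one weight matrix per spatial configuration. The class name, tensor shapes, and the use of 1 × 1 convolutions as $W_k$ are our assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class SpatialGraphConv(nn.Module):
    """Minimal sketch of the SGC-block in Equation (1):
    Y = sum_k Lambda_k^{-1/2} A_k Lambda_k^{-1/2} X W_k."""
    def __init__(self, in_channels, out_channels, A):
        super().__init__()
        # A: (K, V, V) stack of adjacency matrices, one per spatial configuration k
        K, V, _ = A.shape
        norm_A = torch.zeros_like(A)
        for k in range(K):
            deg = A[k].sum(dim=1).clamp(min=1e-6)        # degree of each joint
            d_inv_sqrt = torch.diag(deg.pow(-0.5))       # Lambda_k^{-1/2}
            norm_A[k] = d_inv_sqrt @ A[k] @ d_inv_sqrt   # symmetric normalization
        self.register_buffer("norm_A", norm_A)
        # one 1x1 convolution per spatial configuration plays the role of W_k
        self.convs = nn.ModuleList(
            [nn.Conv2d(in_channels, out_channels, kernel_size=1) for _ in range(K)]
        )

    def forward(self, x):
        # x: (batch, C, T, V)
        out = 0
        for k, conv in enumerate(self.convs):
            xk = conv(x)                                              # X W_k
            out = out + torch.einsum("nctv,vw->nctw", xk, self.norm_A[k])
        return out
```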

3.2. Attention-Based Correlation-Driven Graph Convolution (ACC-GC)

Our attentive channel-wise correlation graph convolution (ACC-GC) takes $X \in \mathbb{R}^{C \times T \times V}$ as input, where C, T, and V denote the number of channels, the number of frames, and the number of joints, respectively. ACC-GC consists of four parts, namely graph modeling, channel-wise enhancement, topology augmentation, and feature aggregation, as illustrated in Figure 2.
The concept behind our suggested architecture is to obtain a different topology in each channel so as to gather discriminative information crucial for recognizing overall human motion. This is done in several steps. First, we learn the intra-correlations between joints within each channel to dynamically obtain channel-wise correlations. Then, in the channel enhancement stage, we enhance informative channel-wise correlations and suppress trivial ones. In the topology augmentation stage, the attentive channel-wise correlations from the enhancement stage are used to update the learned shared topology, which serves as a generalized correlation between joints and is essentially a parameterized adjacency matrix shared by all channels. Finally, after obtaining a graph topology for each channel, we aggregate the features in each channel with the corresponding topology to produce the final output. Each part is explained in detail below, with reference to Figure 3.
Graph modeling: In this stage, we utilize two 1 × 1 2D convolution layers, $\phi$ and $\psi$, to reduce the number of feature channels for efficiency, as illustrated in Equations (2) and (3):
x_{r1} = \phi(X), \quad x_{r1} \in \mathbb{R}^{C/r \times T \times V}    (2)
x_{r2} = \psi(X), \quad x_{r2} \in \mathbb{R}^{C/r \times T \times V}    (3)
Here, $x_{r1}$ and $x_{r2}$ denote the channel-reduced features, and r is the reduction ratio; in our experiments, we use r = 8 and r = 16.
Next, we utilize a global average pooling layer to compact the temporal information, as our goal is to model spatial features, and the temporal layouts are insignificant at this stage:
S_1 = \mathrm{Pool}(x_{r1}), \quad S_2 = \mathrm{Pool}(x_{r2})    (4)
where $S_1$ and $S_2$ are the tensors summarized over the temporal dimension. We then calculate the correlations between every pair of joints in each channel simultaneously. Given a pair of joints $(v_i, v_j)$ and their corresponding features $(S_1^i, S_2^j)$, the correlation $c(\cdot)$ is given by:
c(S_1^i, S_2^j) = \sigma(f(S_1^i, S_2^j))    (5)
where f is a correlation function and $\sigma$ is a nonlinear function. In the experiments, we use two modeling methods (matrix multiplication and element-wise subtraction) as correlation functions, and sigmoid and Tanh as non-linear functions. $c(\cdot)$ thus holds a non-linear transformation of these correlations as the channel-specific topological relationship between $v_i$ and $v_j$.
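A minimal PyTorch sketch of this graph-modeling stage is given below, assuming element-wise subtraction as the correlation function f and Tanh as the non-linearity $\sigma$; module and variable names are ours for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

class GraphModeling(nn.Module):
    """Sketch of the graph-modeling stage (Equations (2)-(5)), assuming element-wise
    subtraction as the correlation function f and Tanh as the non-linearity sigma."""
    def __init__(self, in_channels, r=8):
        super().__init__()
        mid = in_channels // r
        self.phi = nn.Conv2d(in_channels, mid, kernel_size=1)   # channel reduction phi
        self.psi = nn.Conv2d(in_channels, mid, kernel_size=1)   # channel reduction psi

    def forward(self, x):
        # x: (batch, C, T, V)
        xr1, xr2 = self.phi(x), self.psi(x)                      # (N, C/r, T, V)
        s1 = xr1.mean(dim=2)                                     # temporal pooling -> (N, C/r, V)
        s2 = xr2.mean(dim=2)
        # pairwise joint correlations per channel: c[i, j] = tanh(s1_i - s2_j)
        corr = torch.tanh(s1.unsqueeze(-1) - s2.unsqueeze(-2))   # (N, C/r, V, V)
        return corr
```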
Channel enhancement: In this stage, our GC is required to recognize and strengthen the discriminative channel-wise correlations that capture distinct information. To achieve this, we generate channel-wise statistics using global average pooling (GAP) over the spatial dimensions. Formally, a statistic $e \in \mathbb{R}^{C}$ is generated by shrinking the correlation tensor $c(\cdot)$ over its spatial dimensions:
e = \mathrm{GAP}(c(x))    (6)
Then we apply a transformation to these statistics to obtain the attentive weight vector, computed with a 1 × 1 convolution (conv) followed by the sigmoid function:
A = \sigma(\mathrm{conv}(e)), \quad A \in \mathbb{R}^{C/r \times 1 \times 1}    (7)
To excite different channel-wise correlations, we perform channel-wise multiplication between the attentive weights A and the correlation map c to re-weight $c(\cdot)$, obtaining the excited channel-wise correlation tensor z:
z = c + c \cdot A, \quad z \in \mathbb{R}^{C/r \times V \times V}    (8)
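The channel-enhancement stage can be sketched in PyTorch as follows; this is an illustrative reading of Equations (6)-(8) with assumed module names, not the paper's official code.

```python
import torch
import torch.nn as nn

class ChannelEnhancement(nn.Module):
    """Sketch of the channel-enhancement stage (Equations (6)-(8)): squeeze the
    correlation tensor with GAP, compute attentive channel weights, and re-weight."""
    def __init__(self, channels):
        super().__init__()
        # `channels` is the reduced channel count C/r of the correlation tensor
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, corr):
        # corr: (N, C/r, V, V) channel-wise correlation tensor
        e = corr.mean(dim=(2, 3), keepdim=True)      # GAP over spatial dims -> (N, C/r, 1, 1)
        a = self.sigmoid(self.conv(e))               # attentive weights A in (0, 1)
        z = corr + corr * a                          # excite informative channels (Eq. (8))
        return z
```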
Topology augmentation: Now, we apply a 1 × 1 conv to recover the original channel dimension. Next, we augment the parameterized graph G with the excited channel-wise correlations z from the channel enhancement stage using the concatenation operation (Concat), as illustrated in Figure 3:
T = \mathrm{Concat}(G, \mathrm{conv}(z))    (9)
where T denotes the channel-wise topologies.
Feature aggregation: Feature aggregation takes the channel-wise topologies T and the transformed feature map $x_{trans}$ as inputs. T is the result of the topology augmentation stage, and $x_{trans}$ is obtained using a transformation function: we apply a 1 × 1 conv ($\zeta$) to transform the input feature into a high-level representation. We then use $x_{trans}$ and T to aggregate features channel-wise, obtaining a graph for each channel; the aggregation is implemented with batch matrix multiplication to produce the final output feature.
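The sketch below illustrates topology augmentation and channel-wise feature aggregation under one plausible reading of Equation (9): the concatenation of the shared graph G with the expanded correlations z, followed by a convolution, is approximated here by a broadcast addition, and the aggregation uses batched matrix multiplication. All names and the exact fusion are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class TopologyAugmentAggregate(nn.Module):
    """Sketch of topology augmentation (Equation (9)) and channel-wise feature
    aggregation. The fusion of G with z is shown as a broadcast addition, one
    plausible simplification of the concatenation-based augmentation."""
    def __init__(self, in_channels, out_channels, num_joints, r=8):
        super().__init__()
        self.shared_G = nn.Parameter(torch.eye(num_joints))          # parameterized shared topology
        self.expand = nn.Conv2d(in_channels // r, out_channels, 1)   # restore channel dimension
        self.zeta = nn.Conv2d(in_channels, out_channels, 1)          # feature transformation zeta

    def forward(self, x, z):
        # x: (N, C, T, V) input features; z: (N, C/r, V, V) excited correlations
        z = self.expand(z)                                           # (N, C_out, V, V)
        T_topo = self.shared_G.unsqueeze(0).unsqueeze(0) + z         # channel-wise topologies
        x_trans = self.zeta(x)                                       # (N, C_out, T, V)
        # batched matrix multiplication: aggregate joint features with each channel's topology
        out = torch.einsum("nctv,ncvw->nctw", x_trans, T_topo)
        return out
```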

3.3. Our Enhanced Graph Convolution Network

The proposed graph convolutional network comprises a spatial-GC, represented by ACC-GC, and a temporal-GC stacked together in one unit. Our final model is composed of 10 units, and the numbers of output channels for the ten units are 64, 64, 64, 64, 128, 128, 128, 256, 256, and 256. Our temporal-GC follows the architecture of [13]; however, only some of its branches are used, for efficiency.
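A schematic of how the ten units could be stacked is shown below; ACCUnit is a hypothetical placeholder for one spatial ACC-GC plus temporal-GC unit, and the input channel count of 3 corresponds to the (x, y, z) joint coordinates.

```python
import torch.nn as nn

# Sketch of the overall backbone: ten spatial-temporal units with the channel widths
# listed above. `acc_unit_cls` stands in for a unit combining ACC-GC and temporal-GC.
channel_plan = [64, 64, 64, 64, 128, 128, 128, 256, 256, 256]

def build_backbone(acc_unit_cls, in_channels=3):
    units, c_in = [], in_channels
    for c_out in channel_plan:
        units.append(acc_unit_cls(c_in, c_out))   # one ACC-GC + temporal-GC unit
        c_in = c_out
    return nn.Sequential(*units)
```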

4. Experimental Settings

4.1. Datasets

Two benchmarks are used to evaluate the proposed approach: NTURGB60 [47], and Northwestern-UCLA [48].
NTURGB60: NTU RGB+D 60 comprises 56,880 samples performed by 40 different subjects. Each skeleton frame consists of the (x, y, z) coordinates of 25 joints. Samples are captured by three Microsoft Kinect v2 cameras from different views. The dataset provides two evaluation settings: Cross-Subject (X-Sub) and Cross-View (X-View). In Cross-Subject, the 40 subjects are divided into training and test groups; the training set includes 40,320 samples and the testing set 16,560 samples. In Cross-View, the samples captured by camera 1 form the testing set, and the samples captured by cameras 2 and 3 form the training set; the training set includes 37,920 samples and the testing set 18,960 samples.
Northwestern-UCLA: This dataset consists of 1494 video clips covering ten action classes, each performed by different subjects. We use the same protocol as [20], with training data from the first two cameras and validation samples from the third camera.

4.2. Implementation Details

The experiments are conducted with PyTorch [49] on a single RTX 2080 Ti GPU. The model is trained for 70 epochs with an initial learning rate (lr) of 0.1, which decays by a factor of 0.1 at epochs 34 and 54; that is, lr is set to 0.01 at epoch 34 and to 0.001 from epoch 54 until the last epoch. We use the cross-entropy loss as the classification loss and stochastic gradient descent with momentum 0.9 as the optimizer. For training stability, we adopt a warmup strategy [50] in the first five epochs. The batch size is set to 64. For preprocessing, we follow [12].
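The training recipe above can be summarized with the following PyTorch sketch (SGD with momentum, step decay at epochs 34 and 54, and a linear warmup over the first five epochs); the exact warmup shape and the variable names are assumptions, and `model` stands for the ACC-GCN defined earlier.

```python
import torch

# Sketch of the training schedule described in the text; `model` is assumed to exist.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

def lr_at_epoch(epoch, base_lr=0.1, warmup_epochs=5, milestones=(34, 54)):
    if epoch < warmup_epochs:                         # linear warmup for stability
        return base_lr * (epoch + 1) / warmup_epochs
    return base_lr * 0.1 ** sum(epoch >= m for m in milestones)   # step decay

for epoch in range(70):
    for g in optimizer.param_groups:
        g["lr"] = lr_at_epoch(epoch)
    # ... standard forward pass, cross-entropy loss, backward, and optimizer step
    #     over mini-batches of 64 samples
```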

4.3. Ablation Study

To highlight the effectiveness of our proposed ACC-GCN model, we conducted extensive experiments on the NTU60 dataset. First, we tested the model with different implementations to verify the effectiveness of each component, as described in Section 4.3.1. We then built the model with different stream structures, namely the joint, bone, joint-motion, and bone-motion streams, and calculated the accuracy of each separately, as described in Section 4.3.2.

4.3.1. Effect of the Model Components

Different channels frequently concentrate on different data patterns. To show how well attention-based channel-wise correlations can be used to construct channel-wise graph structures, we constructed five networks with varying configurations and ran several experiments on NTU60. C(.) denotes the type of correlation function (either matrix multiplication or element-wise subtraction), E(.) indicates whether the enhancement stage is applied, and Act(.) denotes the type of activation function.
We compared the results with our baseline, a shared-topology graph convolutional structure [5]. For a fair comparison, we integrated the temporal-GC described in Section 3.3. Table 1 shows the performance gains achieved by our ACC-GCN on the cross-subject benchmark of the NTU RGB+D dataset. We first evaluated the strong baseline model. As shown in Table 1, our best ACC-GCN improves the accuracy over the baseline by 1.6 percentage points with few extra parameters. The best result is obtained when using element-wise subtraction as the correlation function, applying the enhancement stage, and using Tanh as the activation function.

4.3.2. Effect of Multi-Stream Structure

Skeleton-based action recognition is one of the fields that benefits from a multi-stream structure [18,51,52], which yields significant improvements in recognition accuracy. In this section, we evaluate the importance of the multi-stream structure in our model. We report our model's performance using four streams, namely joint, bone, joint motion, and bone motion, following [51]. The same model is used for all streams, and the final prediction is generated by fusing the weighted sum of the scores of the four streams, as shown in Figure 4.
In Table 2, we present the results of the multi-stream structure using the best configuration from Table 1, model (D). As Table 2 shows, the contribution of each stream varies, so we fuse the results of the four streams with a weighted sum to obtain the final result. For example, assigning a high weight to the bone-motion stream degrades performance because its contribution to the overall multi-stream structure is smaller than that of the joint and bone streams; its weight must be reduced slightly to achieve better performance.
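The weighted score fusion of the four streams can be sketched as follows; the weights shown are illustrative placeholders rather than the values used in the paper, the point being that weaker streams such as bone motion receive smaller weights.

```python
import numpy as np

# Sketch of weighted score fusion across the four streams. The weights below are
# hypothetical examples, not the paper's tuned values.
def fuse_scores(joint, bone, joint_motion, bone_motion,
                weights=(1.0, 1.0, 0.7, 0.5)):
    # each input: (num_samples, num_classes) classification scores from one stream
    fused = (weights[0] * joint + weights[1] * bone +
             weights[2] * joint_motion + weights[3] * bone_motion)
    return np.argmax(fused, axis=1)   # final class prediction per sample
```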

4.4. Visualize Learned Topology

The learned topologies on the NTU RGB+D dataset are shown in Figure 5. The graphs show that our augmentation strategy improves the common topology, resulting in more distinct features. Furthermore, the variety of refined channel-wise topologies demonstrates the capability of our method to learn unique topologies based on different motion data for different channels.

4.5. Improving over GCN-Based Models

This section explores the advantage of integrating the ACC-GC configuration into dynamic topology non-shared GCN-based models. Specifically, we replaced the graph convolution (GC) in two graph-based action recognition models, namely ST-GCN [35] and 2s-AGCN [18], with our ACC-GC configuration.
As shown in Table 3, replacing the GCs with our ACC-GC configuration increases the accuracy of both recognition models, by 1.0% for ST-GCN [35] and 1.1% for 2s-AGCN [18], which validates the effectiveness of the ACC-GC configuration.

4.6. Comparison with the State of the Art

On the NTURGB dataset, we compare our ACC-GCN model with various state-of-the-art baselines, as shown in Table 4, and we report the comparison on the Northwestern-UCLA dataset in Table 5. On the NTU RGB+D dataset, our model obtains top-1 accuracies of 92% and 96.5% in the X-Sub and X-View settings, respectively, and it achieves 96.1% on the Northwestern-UCLA dataset. Our results on both datasets exceed almost all previous methods and are on par with the best method. Regarding model complexity, ours is the least complex model, at 1.93 GFLOPs, whereas the state-of-the-art method uses 2.9 GFLOPs, so our model outperforms the state of the art in terms of complexity, resulting in a faster model. Similarly, on the Northwestern-UCLA dataset, our ACC-GCN outperforms the other methods in terms of model complexity. Notably, our method is the first to apply channel enhancement alongside dynamic modeling of the graph topology, which our experiments show to have strong learning ability, demonstrating the effectiveness of our approach.

5. Conclusions

The proposed ACC-GCN model obtains a useful, informative feature representation that improves model accuracy. ACC-GCN aligns graph learning at the channel level by introducing a graph convolution with an enriched topology based on attentive channel-wise correlations. To let the model learn channel-wise enriched topologies, ACC-GC learns a shared graph topology spanning many channels and enhances it with attentive channel-wise correlations, which are generated by encoding the intra-correlation between nodes within each channel, boosting informative channel-wise correlations, and suppressing trivial ones. As a result, the model is compelled to find and strengthen the informative channel-wise features that capture distinct information. We also examined several graph topology representation configurations, which can serve as a roadmap for topology learning. The model's state-of-the-art performance on two datasets demonstrates its efficacy. We believe that our idea and results open interesting new perspectives for designing efficient and effective human action recognition models. However, the model complexity of GCN-based models, including ours, remains somewhat high, which motivates us to explore Transformers in future work.

Author Contributions

Conceptualization, T.A.; Investigation, A.Y.S.; Methodology, T.A.; Writing—original draft, T.A., O.H. and M.A.; Writing—review and editing, T.A., N.M. and A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Chang, Y.; Tu, Z.; Xie, W.; Yuan, J. Clustering driven deep autoencoder for video anomaly detection. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 329–345. [Google Scholar]
  2. Zengeler, N.; Kopinski, T.; Handmann, U. Hand gesture recognition in automotive human–machine interaction using depth cameras. Sensors 2019, 19, 59. [Google Scholar] [CrossRef] [PubMed]
  3. Liu, J.; Shahroudy, A.; Xu, D.; Wang, G. Spatio-temporal lstm with trust gates for 3d human action recognition. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 816–833. [Google Scholar]
  4. Tu, Z.; Xie, W.; Qin, Q.; Poppe, R.; Veltkamp, R.C.; Li, B.; Yuan, J. Multi-stream CNN: Learning representations based on human-related regions for action recognition. Pattern Recognit. 2018, 79, 32–43. [Google Scholar] [CrossRef]
  5. Yang, Z.; Li, Y.; Yang, J.; Luo, J. Action recognition with spatio–temporal visual attention on skeleton image sequences. IEEE Trans. Circuits Syst. Video Technol. 2018, 29, 2405–2415. [Google Scholar] [CrossRef]
  6. Li, C.; Xie, C.; Zhang, B.; Han, J.; Zhen, X.; Chen, J. Memory attention networks for skeleton-based action recognition. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 4800–4814. [Google Scholar]
  7. Tu, Z.; Zhang, J.; Li, H.; Chen, Y.; Yuan, J. Joint-bone Fusion Graph Convolutional Network for Semi-supervised Skeleton Action Recognition. arXiv 2022, arXiv:2202.04075. [Google Scholar] [CrossRef]
  8. Ali, A.; Zhu, Y.; Zakarya, M. Exploiting dynamic spatio-temporal graph convolutional neural networks for citywide traffic flows prediction. Neural Netw. 2022, 145, 233–247. [Google Scholar] [CrossRef]
  9. Si, C.; Chen, W.; Wang, W.; Wang, L.; Tan, T. An attention enhanced graph convolutional lstm network for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1227–1236. [Google Scholar]
  10. Zhang, P.; Lan, C.; Xing, J.; Zeng, W.; Xue, J.; Zheng, N. View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2117–2126. [Google Scholar]
  11. Du, Y.; Wang, W.; Wang, L. Hierarchical recurrent neural network for skeleton based action recognition. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1110–1118. [Google Scholar] [CrossRef]
  12. Zhang, P.; Lan, C.; Zeng, W.; Xing, J.; Xue, J.; Zheng, N. Semantics-guided neural networks for efficient skeleton-based human action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1112–1121. [Google Scholar]
  13. Liu, Z.; Zhang, H.; Chen, Z.; Wang, Z.; Ouyang, W. Disentangling and unifying graph convolutions for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 143–152. [Google Scholar]
  14. Ye, F.; Pu, S.; Zhong, Q.; Li, C.; Xie, D.; Tang, H. Dynamic gcn: Context-enriched topology learning for skeleton-based action recognition. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 55–63. [Google Scholar]
  15. Pen, L.Z.; Xian Xian, K.; Yew, C.F.; Hau, O.S.; Sumari, P.; Abualigah, L.; Ezugwu, A.E.; Shinwan, M.A.; Gul, F.; Mughaid, A. Artocarpus Classification Technique Using Deep Learning Based Convolutional Neural Network. In Classification Applications with Deep Learning and Machine Learning Technologies; Springer: Berlin/Heidelberg, Germany, 2022; pp. 1–21. [Google Scholar]
  16. Abd Elaziz, M.; Dahou, A.; Abualigah, L.; Yu, L.; Alshinwan, M.; Khasawneh, A.M.; Lu, S. Advanced metaheuristic optimization techniques in applications of deep neural networks: A review. Neural Comput. Appl. 2021, 33, 1–21. [Google Scholar] [CrossRef]
  17. Li, B.; Li, X.; Zhang, Z.; Wu, F. Spatio-temporal graph routing for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8561–8568. [Google Scholar]
  18. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12026–12035. [Google Scholar]
  19. Buades, A.; Coll, B.; Morel, J.M. A non-local algorithm for image denoising. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; Volume 2, pp. 60–65. [Google Scholar]
  20. Cheng, K.; Zhang, Y.; Cao, C.; Shi, L.; Cheng, J.; Lu, H. Decoupling GCN with DropGraph Module for Skeleton-Based Action Recognition. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 536–553. [Google Scholar]
  21. Chen, Y.; Zhang, Z.; Yuan, C.; Li, B.; Deng, Y.; Hu, W. Channel-wise topology refinement graph convolution for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 13359–13368. [Google Scholar]
  22. Garcia-Hernando, G.; Kim, T.K. Transition forests: Learning discriminative temporal transitions for action recognition and detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 432–440. [Google Scholar]
  23. Xia, L.; Chen, C.C.; Aggarwal, J.K. View invariant human action recognition using histograms of 3d joints. In Proceedings of the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Providence, RI, USA, 16–21 June 2012; pp. 20–27. [Google Scholar]
  24. Yu, G.; Liu, Z.; Yuan, J. Discriminative Orderlet Mining for Real-Time Recognition of Human-Object Interaction. In Proceedings of the ACCV, 12th Asian Conference on Computer Vision, Singapore, 1–5 November 2014. [Google Scholar]
  25. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar]
  26. Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv 2014, arXiv:1406.1078. [Google Scholar]
  27. Zhang, P.; Xue, J.; Lan, C.; Zeng, W.; Gao, Z.; Zheng, N. Adding Attentiveness to the Neurons in Recurrent Neural Networks. IEEE Trans. Image Process. 2019, 29, 1061–1073. [Google Scholar]
  28. Du, Y.; Fu, Y.; Wang, L. Skeleton based action recognition with convolutional neural network. In Proceedings of the 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), Kuala Lumpur, Malaysia, 3–6 November 2015; pp. 579–583. [Google Scholar]
  29. Ke, Q.; Bennamoun, M.; An, S.; Sohel, F.; Boussaid, F. A new representation of skeleton sequences for 3d action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3288–3297. [Google Scholar]
  30. Li, B.; Dai, Y.; Cheng, X.; Chen, H.; Lin, Y.; He, M. Skeleton based action recognition using translation-scale invariant image mapping and multi-scale deep CNN. In Proceedings of the 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Hong Kong, China, 10–14 July 2017; pp. 601–604. [Google Scholar]
  31. Bruna, J.; Zaremba, W.; Szlam, A.; LeCun, Y. Spectral networks and locally connected networks on graphs. arXiv 2013, arXiv:1312.6203. [Google Scholar]
  32. Defferrard, M.; Bresson, X.; Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; Volume 29. [Google Scholar]
  33. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907. [Google Scholar]
  34. Chi, H.g.; Ha, M.H.; Chi, S.; Lee, S.W.; Huang, Q.; Ramani, K. InfoGCN: Representation Learning for Human Skeleton-Based Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–20 June 2022; pp. 20186–20196. [Google Scholar]
  35. Yan, S.; Xiong, Y.; Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
  36. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A survey on vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 87–110. [Google Scholar] [CrossRef] [PubMed]
  37. Ahn, D.; Kim, S.; Hong, H.; Ko, B.C. STAR-Transformer: A Spatio-Temporal Cross Attention Transformer for Human Action Recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 3330–3339. [Google Scholar]
  38. Ma, L.; Xie, H.; Liu, C.; Zhang, Y. Learning Cross-Channel Representations for Semantic Segmentation. IEEE Trans. Multimed. 2022, 1. [Google Scholar] [CrossRef]
  39. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  40. Zhao, H.; Jia, J.; Koltun, V. Exploring self-attention for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10076–10085. [Google Scholar]
  41. Zhang, Z.; Lan, C.; Zeng, W.; Jin, X.; Chen, Z. Relation-Aware Global Attention for Person Re-Identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
  42. Zhang, J.; Xie, Z.; Sun, J.; Zou, X.; Wang, J. A cascaded R-CNN with multiscale attention and imbalanced samples for traffic sign detection. IEEE Access 2020, 8, 29742–29754. [Google Scholar] [CrossRef]
  43. Niu, B.; Wen, W.; Ren, W.; Zhang, X.; Yang, L.; Wang, S.; Zhang, K.; Cao, X.; Shen, H. Single image super-resolution via a holistic attention network. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 191–207. [Google Scholar]
  44. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  45. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective kernel networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 510–519. [Google Scholar]
  46. Stollenga, M.F.; Masci, J.; Gomez, F.; Schmidhuber, J. Deep networks with internal selective attention through feedback connections. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, USA, 8–13 December 2014; Volume 27. [Google Scholar]
  47. Shahroudy, A.; Liu, J.; Ng, T.T.; Wang, G. NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1010–1019. [Google Scholar]
  48. Wang, J.; Nie, X.; Xia, Y.; Wu, Y.; Zhu, S.C. Cross-View Action Modeling, Learning, and Recognition. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 2649–2656. [Google Scholar]
  49. Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic Differentiation in Pytorch. 2017. Available online: https://openreview.net/forum?id=BJJsrmfCZ (accessed on 9 January 2023).
  50. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  51. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Skeleton-Based Action Recognition With Multi-Stream Adaptive Graph Convolutional Networks. IEEE Trans. Image Process. 2020, 29, 9532–9545. [Google Scholar] [CrossRef]
  52. Li, W.; Liu, X.; Liu, Z.; Du, F.; Zou, Q. Skeleton-Based Action Recognition Using Multi-Scale and Multi-Stream Improved Graph Convolutional Network. IEEE Access 2020, 8, 144529–144542. [Google Scholar] [CrossRef]
  53. Vemulapalli, R.; Arrate, F.; Chellappa, R. Human Action Recognition by Representing 3D Skeletons as Points in a Lie Group. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 588–595. [Google Scholar] [CrossRef]
  54. Song, S.; Lan, C.; Xing, J.; Zeng, W.; Liu, J. An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI’17, San Francisco, CA, USA, 4–9 February 2017; pp. 4263–4270. [Google Scholar]
  55. Liu, M.; Liu, H.; Chen, C. Enhanced skeleton visualization for view invariant human action recognition. Pattern Recognit. 2017, 68, 346–362. [Google Scholar] [CrossRef]
  56. Tang, Y.; Tian, Y.; Lu, J.; Li, P.; Zhou, J. Deep progressive reinforcement learning for skeleton-based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5323–5332. [Google Scholar]
  57. Li, M.; Chen, S.; Chen, X.; Zhang, Y.; Wang, Y.; Tian, Q. Actional-structural graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3595–3603. [Google Scholar]
  58. Cheng, K.; Zhang, Y.; He, X.; Chen, W.; Cheng, J.; Lu, H. Skeleton-Based Action Recognition with Shift Graph Convolutional Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
  59. Song, Y.F.; Zhang, Z.; Shan, C.; Wang, L. Stronger, faster and more explainable: A graph convolutional baseline for skeleton-based action recognition. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1625–1633. [Google Scholar]
  60. Song, Y.F.; Zhang, Z.; Shan, C.; Wang, L. Richly activated graph convolutional network for robust skeleton-based action recognition. IEEE Trans. Circuits Syst. Video Technol. 2020, 31, 1915–1925. [Google Scholar]
  61. Trivedi, N.; Sarvadevabhatla, R.K. PSUMNet: Unified Modality Part Streams are All You Need for Efficient Pose-based Action Recognition. arXiv 2022, arXiv:2208.05775. [Google Scholar]
  62. Veeriah, V.; Zhuang, N.; Qi, G.J. Differential recurrent neural networks for action recognition. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4041–4049. [Google Scholar]
  63. Wang, J.; Liu, Z.; Wu, Y.; Yuan, J. Learning actionlet ensemble for 3D human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 36, 914–927. [Google Scholar]
  64. Lee, I.; Kim, D.; Kang, S.; Lee, S. Ensemble deep learning for skeleton-based action recognition using temporal sliding lstm networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1012–1020. [Google Scholar]
Figure 1. Flow chart of the graph topology modeling structures.
Figure 2. The flow chart of the proposed ACC-GC. ACC-GC models the intra-correlation between different joints within each channel separately, meanwhile enhancing the informative channel-wise correlations and suppressing trivial ones, as this contributes to the acquisition of valuable, informative features. The shared parametrized adjacency matrix generates the channel-wise enriched graph topology, and channel-wise aggregation yields the output feature (z).
Figure 3. Illustration of the overall model (ACC-GCN) (a). It consists of spatial-GC represented by ACC-GC (b), and temporal-GC (c). To extract significant correlations across human joints and to aggregate their results as output, we conducted 3 ACC-GC simultaneously. The output from the spatial GC is fused into the temporal GC, and the class label is obtained by using GAP followed by a fully connected layer.
Figure 4. Illustration of the multi-stream model’s architecture. Joint (J-Stream), joint motion (J-M-Stream), bone (B-Stream), and bone motion (B-M-Stream) were the four streams we fed into our model. We aggregated the four streams using weighted summation to obtain the final prediction.
Figure 5. (a) Illustrates the representation of the shared graph topology, while (b,c) represent our enriched graph topology built based on the attentive channel-wise correlations ACC in two different channels.
Table 1. Comparisons of the validation accuracy of ACC-GC with different configurations.

| Model    | C(.)                   | E(.) | Act(.)  | Acc (%) | Param  |
|----------|------------------------|------|---------|---------|--------|
| Baseline | -                      | -    | -       | 88.1    | 1.21 M |
| A        | matrix multiplication  | Yes  | Tanh    | 89.3    | 1.55 M |
| B        | element-wise sub       | No   | Tanh    | 89.5    | 1.46 M |
| C        | element-wise sub       | Yes  | Sigmoid | 89.4    | 1.55 M |
| D        | element-wise sub       | Yes  | Tanh    | 89.7    | 1.55 M |
| E        | element-wise sub       | Yes  | Relu    | 89.6    | 1.55 M |
Table 2. Comparisons of the validation accuracy of ACC-GCN with different modalities.

| Modality          | Top-1 (%) |
|-------------------|-----------|
| Joint (J)         | 89.70     |
| Joint-motion (JM) | 87.87     |
| Bone (B)          | 90.30     |
| Bone-motion (BM)  | 87.45     |
| J + B + JM + BM   | 92.0      |
Table 3. Comparison of the validation accuracy when replacing the GC of existing models with our ACC-GC.

| Model        | GFLOPs | Acc (%) | Original Accuracy |
|--------------|--------|---------|-------------------|
| ST-GCN [35]  | 16.3   | 82.5    | 81.5              |
| 2s-AGCN [18] | 37.3   | 89.6    | 88.5              |
| Ours         | 1.9    | 92.0    | -                 |
Table 4. Comparison of the validation accuracy with the state-of-the-art methods on the NTU-RGBD dataset. The best results are in blue, and the second best is in red.

| Model              | x-Sub (%) | x-View (%) | GFLOPs |
|--------------------|-----------|------------|--------|
| Lie Group [53]     | 50.1      | 52.8       | -      |
| H-RNN [11]         | 59.1      | 64.0       | -      |
| PA-LSTM [47]       | 62.9      | 70.3       | -      |
| ST-LSTM+TS [3]     | 69.2      | 77.7       | -      |
| STA-LSTM [54]      | 73.4      | 81.2       | -      |
| Visualize CNN [55] | 76.0      | 82.6       | -      |
| C-CNN+MTLN [29]    | 79.6      | 84.8       | -      |
| VA-LSTM [10]       | 79.2      | 87.7       | -      |
| ST-GCN [35]        | 81.5      | 88.3       | 16.3   |
| DPRL [56]          | 83.5      | 89.8       | -      |
| AS-GCN [57]        | 86.8      | 94.2       | -      |
| DC-GCN+ADG [20]    | 90.8      | 96.6       | 25.7   |
| 4s-ShiftGCN [58]   | 90.7      | 96.5       | 10.0   |
| 2s-AGCN [18]       | 88.5      | 95.1       | 37.3   |
| PA-ResGCN [59]     | 90.9      | 96.0       | 18.5   |
| RA-GCN [60]        | 87.3      | 93.6       | 32.8   |
| MS-G3D [13]        | 91.5      | 96.2       | 48.8   |
| PSUMNet [61]       | 92.9      | 96.7       | 2.7    |
| Our ACC-GCN        | 92.0      | 96.5       | 1.93   |
Table 5. Comparison of the validation accuracy with the state-of-the-art methods on the Northwestern-UCLA dataset.

| Model                   | Northwestern-UCLA Top-1 (%) | GFLOPs |
|-------------------------|-----------------------------|--------|
| Lie Group [62]          | 72.2                        | -      |
| Actionlet ensemble [63] | 76.0                        | -      |
| H-RNN [11]              | 78.5                        | -      |
| Ensemble TS-LSTM [64]   | 89.2                        | -      |
| AGC-LSTM [9]            | 93.3                        | -      |
| Shift-GCN [58]          | 94.6                        | -      |
| 2s AGC-LSTM             | 93.3                        | 10.9   |
| CTR-GCN                 | 96.5                        | 1.97   |
| ACC-GCN (ours)          | 96.1                        | 1.93   |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
