Article

Part-Wise Adaptive Topology Graph Convolutional Network for Skeleton-Based Action Recognition

1 School of Electronic Information, Wuhan University, Wuhan 430072, China
2 Hubei Three Gorges Laboratory, Yichang 443007, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(9), 1992; https://doi.org/10.3390/electronics12091992
Submission received: 16 March 2023 / Revised: 13 April 2023 / Accepted: 18 April 2023 / Published: 25 April 2023
(This article belongs to the Section Artificial Intelligence)

Abstract

Human action recognition is a computer vision challenge that involves identifying and classifying human movements and activities. Human behavior comprises the movements of multiple body parts, and Graph Convolutional Networks (GCNs) have emerged as a promising approach for this task. However, most contemporary GCN methods perform graph convolution on the entire skeleton graph without considering that the human body consists of distinct body parts. To address this shortcoming, we propose a novel method that optimizes the representation of the skeleton graph by designing temporal and spatial convolutional blocks and introducing the Part-wise Adaptive Topology Graph Convolution (PAT-GC) technique. PAT-GC adaptively learns the segmentation of different body parts and dynamically integrates the spatial relevance between them. Furthermore, we utilize hierarchical modeling to divide the skeleton graph, capturing a more comprehensive representation of the human body. We evaluate our approach on three publicly available large-scale datasets: NTU RGB + D 60, NTU RGB + D 120, and Kinetics Skeleton 400. Our experimental results demonstrate that our approach achieves state-of-the-art performance, validating the effectiveness of the proposed technique for human action recognition.

1. Introduction

Human action recognition is a significant task with numerous applications, including video comprehension, robotic vision, autonomous driving, and virtual reality. Skeleton-based action recognition has recently received increasing attention due to the development of low-cost motion sensors and human pose estimation algorithms. The goal of this task is to identify actions from skeleton graph sequences. Compared to video data [1,2,3,4], skeleton data has several advantages for action recognition: it is more robust to occlusion, provides a more compact representation, and is less sensitive to lighting conditions. Furthermore, since skeleton data does not include any private visual information about the person being observed, it offers increased privacy. These benefits have made skeleton-based action recognition increasingly popular.
Unlike RGB or thermal video, skeleton data contains only the 2D or 3D positions of human joints. This removes much of the redundancy of raw video while greatly reducing the data volume. The joints of the human skeleton can be treated as nodes in a sparse graph, with the natural connections between them as edges; these connections are typically described by an adjacency matrix. Since graph convolution is a natural choice for handling graph-structured data, graph convolutional networks have been widely used to process the graph structure of the human skeleton. Recent methods typically describe the relationships within the skeleton using a predetermined topology. A predetermined topology, however, has difficulty modeling relationships between joints that are not physically connected, such as the brain-controlled motor coordination between the arms and legs.
Some methods add manually crafted edges to the physically connected graph structure in order to extract complex information; attention mechanisms and multiscale operators are also used to improve network performance. Other methods divide the adjacency matrix into subgraphs corresponding to the arms, legs, or torso in order to extract fine-grained features from the human skeleton. Body-part representations can highlight the significance of each component and how the components relate to one another in both space and time. However, the effectiveness of these methods depends on how the adjacency matrix is partitioned, and different segmentation schemes may produce different outcomes. How to accurately describe the human skeleton and exploit the information it contains remains unclear. In our view, a practical segmentation approach should be data-driven rather than relying solely on a predetermined topology.
Understanding and analyzing human motion requires examining movement patterns and the relationships between body parts. The interaction between different body parts is a critical aspect of movements such as walking, running, jumping, and throwing. The human body is organized as an assembly of body parts linked through joints, all controlled by the brain. While the precise mechanism through which the brain generates and controls movements is unclear, specific effects can be observed. As depicted in Figure 1a, joints within the same body part tend to exhibit similar movement patterns. For instance, the shoulder and elbow joints move similarly because both are situated in the upper arm and connected to the same muscles, and likewise for the knee and hip joints in the lower body. Figure 1b illustrates limb coordination, which involves both symmetrical and asymmetrical movements. A crucial dynamic effect in human walking is that limb movements are nearly synchronized: the limbs move in a rhythmic pattern, with the arms and legs swinging in opposite directions almost simultaneously. This synchronized movement helps maintain balance and enables the body to move forward more efficiently. Figure 1c shows that each person's movement patterns are unique; due to differences in age, gender, health, and level of physical fitness, these patterns can vary greatly between people. A predetermined topology may therefore lead to inaccurate action recognition.
In this paper, we focus on optimizing the description of the human skeleton. Specifically, we describe the skeleton data using both static and dynamic relationships; for the static relationships, we use a predetermined topology. Motivated by the approach of [5], which divides the human skeleton into different parts, we propose a hierarchical approach that divides the skeleton topology at fine and coarse scales. At the fine scale, the skeleton is subdivided into five parts: trunk (including the head and torso), left arm, right arm, left leg, and right leg. This allows the model to concentrate on the movement patterns of specific body parts and capture their role in the overall motion. At the coarse scale, the joints are combined into larger groups based on similar movements or relationships between body parts. For instance, the arm and thigh on the same side can be combined, or the left and right arms can be combined. This larger grouping enables the model to capture more complex movement patterns and relationships between different body parts during motion, such as hand-foot coordination.
The static topology provides a set of fixed connections between joints, defined before training the model. While a static topology can be effective for certain graphs, it may not capture all the relevant relationships and dependencies within the data. This is because manually created topologies are constrained by prior knowledge of the graph and may not accurately represent more intricate patterns and relationships. Therefore, we treat the relationships between joints as data-driven and aim to capture the latent dependencies between joints. Specifically, we propose a part-wise adaptive topology graph convolution (PAT-GC), which models an adaptive topology and extracts discriminative features. The adaptive topology obtained through PAT-GC can be seen as a refinement of the static topology. PAT-GC enables fine-grained adjustments to the connections between joints, capturing the local structure of the graph in each layer and learning more meaningful representations.
We propose the part-wise adaptive topology graph convolutional network (PAT-GCN), which stacks multiple spatio-temporal blocks to extract features from human skeleton stream data. The spatio-temporal block includes two parts: spatial modeling and temporal modeling. Spatial modeling is achieved using PAT-GC as the basic module and utilizing a hierarchical structure to aggregate motion information. Temporal modeling is accomplished using the multi-scale temporal convolution module, consisting of several branches with different temporal graph convolutions.
To validate the effectiveness of our proposed PAT-GCN, we conducted extensive experiments on three public datasets: NTU RGB + D 60, NTU RGB + D 120, and Kinetics Skeleton 400. The results demonstrate that our method achieves state-of-the-art performance. To show the importance of each component, we also conducted ablation experiments. In summary, the main contributions of our work are as follows:
  • We propose a hierarchical approach to partition the skeleton topology into multiple parts at two different scales. This method enables the exploration of movement patterns for various body parts, as well as their interrelationships during motion.
  • We propose a part-wise adaptive topology graph convolution design that leverages data-driven methods to obtain an adaptive topology and extract discriminative features.
  • The extensive experimental results highlight the benefits of the part-wise adaptive topology graph convolution. Our proposed PAT-GCN outperforms state-of-the-art methods on three different skeleton-based action recognition benchmarks.
The remainder of this paper is organized as follows: In Section 2, we review related studies, including graph convolutional networks, GCN-based skeleton action recognition, and part-based skeleton action recognition. In Section 3, we elaborate on our approach. Section 4 presents and analyzes experimental details and results. Finally, in Section 5, we summarize our work and provide an outlook for the future.

2. Related Work

2.1. Graph Convolutional Network

Convolutional Neural Networks (CNNs) have been remarkably successful at processing Euclidean data, such as images. Traditional convolution, however, is restricted to regular grid data and is ineffective for handling general graphs. Graph convolution was developed to overcome this limitation and enable the extraction of local patterns from graph-structured data. Graph Convolutional Networks (GCNs) were designed specifically to extract features directly from non-Euclidean data. GCN-based methods can be divided into spectral methods [6,7,8] and spatial methods [9,10,11,12]. Spectral methods define convolution in the spectral domain using a set of learned filters in the graph Fourier domain. The main idea is to leverage the eigenvalues and eigenvectors of the graph Laplacian matrix, which captures the structural properties of the input graph, to define a set of filters that can be applied to the graph signal. A significant disadvantage of spectral methods is that the convolution filters, being defined in the spectral domain, are not localized in the vertex domain. Spatial methods, on the other hand, typically update each node layer by layer by selecting neighbors, merging the features of the selected neighbors, and applying an activation function to the merged features. In this work, we use spatial methods and aggregate features based on skeleton graphs with pre-designed structures.

2.2. GCN-Based Skeleton Action Recognition

Early studies on skeleton action recognition [13,14,15,16] generally utilized hand-crafted features to record human body motion. However, these approaches primarily focused on utilizing the relative 3D rotations and translations between joints, resulting in complex feature design and sub-optimal performance. In recent years, deep learning methods have made significant progress in the field of skeleton action recognition.
GCN-based models are built on a series of skeleton graphs, enabling the extraction of as much discriminative information as possible in the spatial and temporal domains. As a result, GCN-based approaches have shown significant improvement in skeleton action recognition. ST-GCN [17], which defines a spatial and temporal graph convolutional network, is one notable example; it takes into account both the natural human body structure and the temporal motion correlation in the spatio-temporal domain. Another well-known model is 2s-AGCN [18], which takes the original 3D joint coordinates as joint stream data and the 3D coordinate difference between two adjacent joints as bone stream data. To learn non-local dependencies in the spatial dimension, this model constructs an adaptive graph that assigns adaptive attention to each joint. MS-G3D [19] introduces cross-space-time aggregation, which combines spatial graph convolution and temporal convolution and uses a multi-scale aggregation scheme to differentiate the importance of nodes in different neighborhoods for effective long-range modeling. Its G3D module uses dense cross-space-time edges as skip connections to directly propagate information on the space-time graph. Shift-GCN [20], a lightweight graph convolutional network combining spatial and temporal shift graph convolutions, was designed for skeleton action recognition. Compared to standard graph convolution, its non-local spatial shift performs noticeably better while using less computation; its adaptive temporal shift graph convolution can adapt the receptive field while requiring less computation and fewer parameters, without sacrificing recognition accuracy. In this study, we extract the skeleton feature representation using a GCN-based network.

2.3. Part-Based Skeleton Action Recognition

The contribution of individual joints to recognition varies across motion categories. Some methods re-weight the joint data to produce better representations; they may be motivated by a variety of goals, such as solving the sample imbalance problem [21] or discovering discriminative joints [22]. The attention mechanism enables the network to better capture temporal and spatial information [23] and helps it focus on the important regions [24]. The definition and connection relationships of joints can be viewed as prior information for skeleton data. An accurate prior can speed up convergence, increase accuracy, and strengthen the model's robustness. Part-based approaches modify this prior information; like the methods mentioned above, they are a form of re-weighting mechanism that adjusts the joints and the adjacency matrix at the neighborhood level.
In order to learn part representations, Du et al. [25] divide the human skeleton into five parts and feed each part into a separate recurrent neural network (RNN); these part representations are then combined into higher-level features for action recognition using a hierarchical RNN. To extract spatial features from different skeleton parts and learn temporal features from stacked frames, Si et al. [26] propose a novel model composed of a spatial reasoning network (SRN) and a temporal stack learning network (TSLN). To focus on the most important body parts for each action sequence, PA-ResGCN [27] uses multiple input branches and a part-wise attention module. Wang et al. propose the Multi-View Interactional Graph Network (MV-IGNet) [28], which uses multiple views of the graph to produce complementary features; these views are obtained by considering different joint subsets or by representing different connections with different edge types. Fixed or single-scale segmentation, however, offers limited scalability and flexibility. By incorporating an adaptive topology, our proposed model can capture more complex motion patterns.
The representation of skeleton data can also be improved by multiscale techniques [29]. In earlier works, joints were clustered to produce a coarser pose. MSR-GCN [30] abstracts a human pose by merging nearby joints and replacing each group with a pseudo joint. DMS-GCN [31] uses a different number of joints at each level; the coarse-level joints are computed by averaging the locations of the joints in the corresponding fine-level group. Our approach produces neither pseudo joints nor coarse poses. Instead, to find more informative joints and connections, we learn the communication relationships between joints at both coarse and fine levels. Joints are shared across regions of importance, and the hierarchical representation is therefore more flexible.

3. Methods

In this section, we first formulate the part-wise graph convolution. Then we present the part-wise adaptive topology graph convolution (PAT-GC) and describe its goal. Finally, we introduce the overall structure of our PAT-GCN.

3.1. Part-Wise Graph Convolution

Formally, a human skeleton can be represented as a graph with joints as vertices and bones as edges. The graph is defined as $G = (V, E)$, where $V = \{v_1, v_2, \ldots, v_N\}$ is the set of $N$ joints and $E$ is the set of edges. The graph connectivity can be represented by the adjacency matrix $A \in \mathbb{R}^{N \times N}$, whose elements take the value 1 or 0 to indicate whether $v_i$ and $v_j$ are connected. The neighborhood of $v_i$ is denoted as $N(v_i) = \{v_j \mid a_{ij} \neq 0;\ v_i, v_j \in V\}$. Given a skeleton sequence, we first compute a set of features $X = \{x_{t,n} \mid 1 \le t \le T,\ 1 \le n \le N;\ n, t \in \mathbb{Z}\}$ that can be represented as a feature tensor $X \in \mathbb{R}^{C \times T \times N}$, where $x_{t,n} = X_{t,n}$ denotes the $C$-dimensional feature of node $v_n$ at time $t$.
The standard graph convolution optimizes the weights $W$ for extracting the output features $Y$, and is denoted as

$$y_i = \sum_{v_j \in N(v_i)} a_{ij}\, W x_j.$$
In this paper, we use the spatial graph convolution to capture the spatial correlations between human joints within each frame. The spatial graph convolution can be defined as:

$$Y = \rho\left(D^{-\frac{1}{2}} (A + I)\, D^{-\frac{1}{2}}\, W X\right),$$

where $D$ is the degree matrix of $A$, $X$ denotes the input features, $Y$ denotes the output features, $W$ are the graph convolution weights, and $\rho(\cdot)$ is the activation function.
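To make the formulation concrete, the following is a minimal PyTorch sketch of this normalized spatial graph convolution for features laid out as (batch, C, T, N); the module name and implementation details are illustrative assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn

class SpatialGraphConv(nn.Module):
    """Sketch of Y = rho(D^{-1/2} (A + I) D^{-1/2} W X) for features of shape (B, C, T, N)."""

    def __init__(self, in_channels, out_channels, A):
        super().__init__()
        N = A.size(0)
        A_hat = A + torch.eye(N)                       # add self-loops: A + I
        deg = A_hat.sum(dim=1)                         # vertex degrees (>= 1, so no division by zero)
        d_inv_sqrt = torch.diag(deg.pow(-0.5))         # D^{-1/2}
        self.register_buffer("A_norm", d_inv_sqrt @ A_hat @ d_inv_sqrt)
        self.W = nn.Conv2d(in_channels, out_channels, kernel_size=1)  # weights W as a 1x1 conv
        self.rho = nn.ReLU()                           # activation rho

    def forward(self, x):                              # x: (B, C, T, N)
        x = self.W(x)                                  # feature transformation
        x = torch.einsum("bctn,nm->bctm", x, self.A_norm)  # aggregate over the normalized graph
        return self.rho(x)
```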
The human skeleton graph can be divided into multiple parts, where each sub-graph is a part of the body. Formally, the whole skeleton graph can be described in terms of its parts as:

$$G = \bigcup_{k \in \{1, \ldots, K\}} P_k, \qquad P_k = (V_k, E_k),$$
where $P_k$ is the $k$-th sub-graph, indicating that the human skeleton is divided into $K$ parts. Analogous to the definition of the overall graph, $V_k$ and $E_k$ represent the set of joints and the set of bones, respectively. For a single frame, the graph convolution on the sub-graph of a single part can be defined as follows:

$$Y(v_i) = \sum_{v_j \in S(v_i)} W(v_j)\, X(v_j); \quad v_i, v_j \in P_k,$$

where $Y(v_i)$ represents the output feature map for $v_i$ and $S(v_i)$ is the sampling region for $v_i$. The output of the graph convolution is the weighted average of the input feature maps of the nodes within the sampling region, whose size depends on the size of the convolution kernel.
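One straightforward way to realize this part-wise convolution, sketched below under our own assumptions, is to mask the adjacency matrix so that feature aggregation happens only inside the joint subset $V_k$ of each part:

```python
import torch

def part_adjacency(A, part_joints):
    """Restrict the adjacency matrix to a sub-graph P_k: keep only edges whose
    two endpoints both lie in the joint subset V_k (part_joints)."""
    mask = torch.zeros_like(A)
    idx = torch.tensor(part_joints)
    mask[idx.unsqueeze(1), idx] = 1.0      # 1 where both row and column index lie in the part
    return A * mask

# Hypothetical usage: build a SpatialGraphConv over the left-arm sub-graph.
# The joint index list is a placeholder; it depends on the skeleton layout.
# A_left_arm = part_adjacency(A, left_arm_joints)
```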
Here, we present two different division strategies for obtaining spatial features from the skeleton topology, as illustrated in Figure 2. The first strategy involves a fine-scale division that separates the skeleton into five distinct parts: the left arm, right arm, left leg, right leg, and trunk. This approach enables the detection of more subtle differences in motion between various body parts.
The second strategy explores the relationships between different body parts during motion by combining the five parts into a coarse-scale division. For instance, one combination might explore movements on one side of the body by grouping the arm and leg on the same side, while another might examine coordination between the hands and feet by grouping the arm and leg on different sides. These body-part combinations make it possible to better understand the relationships between various movements. Each separated part is denoted as $P_k$.

3.2. Adaptive Topology Graph Convolution

The graph convolution on the skeleton graph is computed over a predetermined graph, which may not be the best option: a predetermined topology may fail to capture each person's distinctive features and movement traits, and it cannot adjust to evolving movement patterns. As a result, we introduce part-wise adaptive topology graph convolution (PAT-GC) and extend $P_k$ into $\hat{P}_k$. This extension is accomplished using data-driven methods that allow more flexible modeling of the relationships between joints.
We present a part-wise adaptive topology graph convolution (PAT-GC) block for extracting spatial features, as illustrated in Figure 2. The PAT-GC block contains a total of three paths: the top and middle paths are used for feature transformation, and the bottom path is used to generate the adaptive topology.

3.2.1. Feature Transformation

In the feature transformation stage, we utilize an attention-based module to enhance the vanilla graph convolution. Specifically, we utilize a 1 × 1 convolution to convert input features into high-level representations. Additionally, we incorporate a channel attention module comprising global pooling, convolution, and activation functions to recalibrate the features. The inputs are converted to 1 × 1 × C by global pooling. After a ReLU, a 1 × 1 convolution makes the number of channels consistent with that of the skeleton features. A sigmoid activation function then transforms the result into a range between 0 and 1, representing the attention weights.
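Read this way, the channel attention branch resembles a squeeze-and-excitation module. A minimal sketch under that interpretation is given below; the intermediate reduction factor is an assumption:

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of the channel attention branch: global pooling, 1x1 convolutions
    with ReLU, and a final sigmoid producing weights in (0, 1)."""

    def __init__(self, channels, reduction=4):        # reduction factor is our assumption
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)           # (B, C, T, N) -> (B, C, 1, 1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),  # restore channel count
            nn.Sigmoid(),                             # attention weights in (0, 1)
        )

    def forward(self, x):                             # x: (B, C, T, N)
        return x * self.fc(self.pool(x))              # recalibrate features channel-wise
```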

3.2.2. Adaptive Topology

The adaptive topology strategy comprises static and dynamic components. The static topology graph convolution utilizes the predetermined topology as mentioned before, whereas the dynamic topology graph convolution is based on the input features. The dynamic adjacency matrix is generated in a data-driven manner and optimized alongside the other parameters during the training phase.
For topology generation, we use a bilinear transformation to extract the feature correlations between vertices. This process is formulated as:

$$\tilde{D} = \psi(X)\, \phi(X)^{T},$$

where $\tilde{D} \in \mathbb{R}^{N \times N}$ is the correlation matrix, and $\psi(X)$ and $\phi(X)$ are the transformed features after convolution and temporal pooling. Here, we choose the same $\psi$ and $\phi$ with reduction rate $r$ to extract compact representations. After applying the activation function $\sigma$, we obtain the correlation matrix $D \in \mathbb{R}^{N \times N}$, which represents the relationships between vertices under a certain type of motion feature. Finally, by combining the dynamic topology $D$ with the predetermined static topology $A$, we obtain the adaptive topology $\tilde{A}$:

$$\tilde{A} = A + \alpha \cdot D,$$

where $\alpha$ is a trainable scalar that adjusts the intensity of the dynamic topology.
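Under this formulation, the topology-generation path can be sketched compactly in PyTorch as follows; the placement of the temporal pooling and the exact convolution shapes are our assumptions, and $\sigma$ is taken to be Tanh, the best-performing choice in Table 2:

```python
import torch
import torch.nn as nn

class AdaptiveTopology(nn.Module):
    """Sketch of the adaptive topology: D~ = psi(X) phi(X)^T after temporal pooling,
    then A~ = A + alpha * sigma(D~), with sigma = Tanh (the best choice in Table 2)."""

    def __init__(self, channels, A, reduction=8):     # r = 8, the best setting in Table 2
        super().__init__()
        hidden = channels // reduction                # compact representation via reduction rate r
        self.psi = nn.Conv2d(channels, hidden, kernel_size=1)
        self.phi = nn.Conv2d(channels, hidden, kernel_size=1)
        self.alpha = nn.Parameter(torch.zeros(1))     # trainable intensity of the dynamic topology
        self.register_buffer("A", A)                  # predetermined static topology

    def forward(self, x):                             # x: (B, C, T, N)
        f_psi = self.psi(x).mean(dim=2)               # temporal pooling -> (B, C', N)
        f_phi = self.phi(x).mean(dim=2)
        D = torch.tanh(torch.einsum("bcn,bcm->bnm", f_psi, f_phi))  # per-sample correlations
        return self.A.unsqueeze(0) + self.alpha * D   # adaptive topology A~, shape (B, N, N)
```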

3.3. Network Architecture

Using the proposed adaptive topology graph convolution and part aggregation strategy, we design a powerful graph convolutional network called PAT-GCN for skeleton action recognition. The network architecture is illustrated in Figure 3. The proposed network leverages the relationships between different body parts through the GCN architecture to perform action recognition. The backbone of the network comprises ten basic spatial-temporal blocks, followed by a global average pooling layer and a softmax classifier to obtain the prediction. Formally, given an input sequence of 2D or 3D joint locations $X = \{x_{t,n} \in \mathbb{R}^{d},\ d \in \{2, 3\} \mid 1 \le t \le T,\ 1 \le n \le N;\ t, n \in \mathbb{Z}\}$, the output of the backbone can be defined as:

$$Y^{(\mathrm{backbone})} = \left\{ y_{t,n}^{(\mathrm{backbone})} \in \mathbb{R}^{C^{(\mathrm{backbone})}} \;\middle|\; 1 \le t \le T,\ 1 \le n \le N;\ t, n \in \mathbb{Z} \right\},$$

where $y_{t,n}^{(\mathrm{backbone})}$ is the output of joint $n$ at frame $t$, and $C^{(\mathrm{backbone})}$ represents the channel dimension of the output features. The number of channels in the first basic spatial-temporal block is 64. After the fifth and eighth blocks, the channel dimension is doubled, and the temporal dimension is halved using strided temporal convolution. Each block contains a spatial modeling module, a temporal modeling module, and two residual connections.
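The channel and stride schedule of the ten-block backbone can be summarized in code as below; STBlock is a named placeholder standing in for the full spatial-temporal block described in the following paragraphs:

```python
import torch.nn as nn

class STBlock(nn.Module):
    """Placeholder for one basic spatial-temporal block (spatial modeling,
    temporal modeling, and two residual connections, as described in the text)."""
    def __init__(self, in_c, out_c, temporal_stride=1):
        super().__init__()
        self.body = nn.Conv2d(in_c, out_c, kernel_size=1, stride=(temporal_stride, 1))  # stand-in

    def forward(self, x):
        return self.body(x)

# 64 channels in the first block; channels doubled and temporal dimension halved
# after blocks 5 and 8 (via strided temporal convolution).
channels = [64] * 5 + [128] * 3 + [256] * 2
strides  = [1] * 5 + [2, 1, 1] + [2, 1]

blocks = nn.ModuleList()
in_c = 3                                      # 3D joint coordinates per node (2 for 2D skeletons)
for out_c, s in zip(channels, strides):
    blocks.append(STBlock(in_c, out_c, temporal_stride=s))
    in_c = out_c

num_classes = 60                              # e.g., NTU RGB + D 60
classifier = nn.Linear(256, num_classes)      # applied after global average pooling over T and N
```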
Spatial Modeling. Figure 3 illustrates that we use a hierarchical structure to stack the PAT-GC blocks. Specifically, we utilize the residual structure to organize the PAT-GC module, separately processing the segmented parts of two different scales. Using multiple PAT-GCs for the same region allows for obtaining multiple adaptive variants. Therefore, there are p × k PAT-GC blocks in the spatial modeling, where p is the number of parts and k is the number of adaptive variants for each part. Each block is responsible for processing a specific topology. After batch normalization and ReLU activation, the aggregated features from each block are transformed into the inputs of temporal modeling.
Temporal Modeling. To model actions of different durations, we utilize a multi-scale temporal modeling module that can capture both short-term and long-term temporal dependencies. Our module is inspired by the work of Liu et al. [19], who employed multiple branches with different temporal resolutions to disentangle appearance and motion features; unlike their method, however, we use fewer branches to avoid slowing down inference. As shown in Figure 3, each branch begins with a 1 × 1 convolution layer that reduces the number of channels in the input feature map. Four branches additionally contain a temporal layer operating on the temporal dimension, which is either a dilated convolution with a specific dilation rate or a max pooling operation. The dilation rates are set to 1, 2, and 4, respectively; using various dilation rates effectively enlarges the receptive field and captures multi-scale temporal patterns. The output feature maps of all branches are concatenated along the channel dimension to form the final output of the ST block.
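A sketch of such a multi-scale temporal module is shown below; the temporal kernel size and the even channel split across the four branches are our assumptions:

```python
import torch
import torch.nn as nn

class MultiScaleTemporalConv(nn.Module):
    """Sketch of the multi-scale temporal module: three dilated temporal-convolution
    branches (dilations 1, 2, 4) plus a max-pooling branch, each preceded by a 1x1
    convolution, concatenated along the channel dimension."""

    def __init__(self, channels, kernel_size=5):
        super().__init__()
        branch_c = channels // 4

        def dilated_branch(dilation):
            pad = (kernel_size - 1) // 2 * dilation            # keeps the temporal length
            return nn.Sequential(
                nn.Conv2d(channels, branch_c, kernel_size=1),  # reduce channels
                nn.Conv2d(branch_c, branch_c, kernel_size=(kernel_size, 1),
                          padding=(pad, 0), dilation=(dilation, 1)),  # temporal-only conv
            )

        self.branches = nn.ModuleList([dilated_branch(d) for d in (1, 2, 4)])
        self.pool_branch = nn.Sequential(
            nn.Conv2d(channels, branch_c, kernel_size=1),
            nn.MaxPool2d(kernel_size=(3, 1), stride=1, padding=(1, 0)),
        )

    def forward(self, x):                                      # x: (B, C, T, N)
        outs = [b(x) for b in self.branches] + [self.pool_branch(x)]
        return torch.cat(outs, dim=1)                          # concatenate back to C channels
```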
Multi-stream Ensemble. Following previous works [20,32,33,34], we generate four streams for each skeleton sequence: joint, bone, joint motion, and bone motion. The joint stream contains the raw 3D positions of each joint. The bone stream is produced from the displacement between two adjacent joints in the predetermined human skeleton structure. The joint motion stream comes from the displacement of joint data between two adjacent frames, and the bone motion stream is produced analogously from the displacement of bone data. The final prediction is obtained by adding the softmax scores of the four streams.
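The four streams can be derived from the raw joint positions as in the following sketch; zero-padding the last motion frame is our assumption:

```python
import torch

def make_streams(joint, bone_pairs):
    """Sketch of the four input streams. `joint` has shape (C, T, N) with C = 3 coordinates;
    `bone_pairs` lists (child, parent) joint indices of the predetermined skeleton."""
    bone = torch.zeros_like(joint)
    for child, parent in bone_pairs:
        bone[:, :, child] = joint[:, :, child] - joint[:, :, parent]  # displacement along a bone
    joint_motion = torch.zeros_like(joint)
    joint_motion[:, :-1] = joint[:, 1:] - joint[:, :-1]               # frame-to-frame displacement
    bone_motion = torch.zeros_like(bone)
    bone_motion[:, :-1] = bone[:, 1:] - bone[:, :-1]
    return joint, bone, joint_motion, bone_motion

# At test time, the four independently trained streams are fused by summing softmax scores:
# scores = sum(model(x).softmax(dim=-1) for model, x in zip(models, streams))
```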

4. Experiments

We evaluate our action recognition method on three widely used skeleton action recognition datasets: NTU RGB + D 60, NTU RGB + D 120, and Kinetics Skeleton 400. The experimental details and results are presented in the following subsections.

4.1. Datasets

NTU RGB + D 60 [2] is a widely known dataset for skeleton action recognition consisting of 56,880 skeleton sequences of 60 action classes. The sequences were recorded from 40 different subjects and three distinct camera perspectives using Microsoft Kinect v2 cameras. Each skeleton sequence includes the 3D spatial coordinates of 25 joints. The dataset offers two evaluation benchmarks: Cross-Subject (X-sub) and Cross-View (X-view). In X-sub, the training set comprises 40,320 sequences from 20 subjects, while the testing set contains 16,560 sequences from the remaining 20 subjects. In X-view, the training set consists of 37,920 sequences from the front and two side views, and the testing set contains 18,960 sequences from the left and right 45-degree views. Both evaluation benchmarks contain all skeleton sequences, which are divided for cross-validation according to subjects or views. We train on both benchmarks from scratch and verify the recognition accuracy independently.
NTU RGB + D 120 [35] is currently the largest dataset available for 3D-joint-annotation-based human action recognition, including 113,945 action samples across 120 action classes performed by 106 volunteers and captured from three camera views. The dataset includes 32 different setups, each representing a specific location and background. The authors of this dataset recommend two evaluation benchmarks: Cross-Subject (X-sub) and Cross-Setup (X-setup). In X-sub, the training data (63,026 videos) comes from 53 subjects, and the testing data (50,922 videos) comes from the other 53 subjects. In X-setup, the training data (54,471 videos) comes from samples with even setup IDs, and the testing data (59,477 videos) comes from samples with odd setup IDs. NTU RGB + D 120 extends NTU RGB + D 60 with additional action categories and performers. We evaluate the model using a strategy similar to that for NTU RGB + D 60.
Kinetics Skeleton 400 [36] is a large-scale dataset of human action video clips sourced from YouTube, comprising 306,245 video clips in 400 classes, with at least 400 clips per class. The original dataset provides only raw video clips, approximately 10 s long, without skeleton sequences; the publicly available OpenPose toolbox was used to estimate the positions of 18 joints in each frame, and for multi-person clips, the two people with the highest average joint confidence are selected. The dataset is divided into three parts: training with 250–1000 videos per class, validation with 50 videos per class, and testing with 100 videos per class. We employ a subject-invariant validation strategy to evaluate the performance of our model on the released data (Kinetics-Skeleton).

4.2. Implementation Details

The experiments were conducted using the PyTorch 1.6 deep learning framework on four GTX 1080 Ti GPUs. Stochastic gradient descent (SGD) was employed with a momentum of 0.9 and a weight decay of 0.0001. The model was trained for 50 epochs on NTU RGB + D 60 and 65 epochs on NTU RGB + D 120. The initial learning rate was set to 0.1, decreasing by a factor of 0.1 at epoch 35. For these two datasets, the batch size was set to 64, and input preprocessing followed previous research [19]. For Kinetics Skeleton 400, the batch size was set to 128, with a learning rate of 0.1 that decreased by a factor of 0.1 at epoch 45; the model was trained for a total of 80 epochs. To improve training stability, the learning rate was gradually increased during the first 5 epochs in all experiments.
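The optimization setup can be summarized as follows; since the paper only states that the learning rate increases gradually during warmup, the linear form below is an assumption:

```python
import torch.optim as optim

def build_optimizer(model):
    """SGD with the stated settings: momentum 0.9, weight decay 0.0001, base LR 0.1."""
    return optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)

def lr_at(epoch, base_lr=0.1, warmup_epochs=5, decay_epoch=35):
    """Gradual warmup over the first 5 epochs (linear form is our assumption),
    then a 0.1x decay at the step epoch (35 for the NTU RGB + D 60 schedule)."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    return base_lr * (0.1 if epoch >= decay_epoch else 1.0)

# Per-epoch usage: for g in optimizer.param_groups: g["lr"] = lr_at(epoch)
```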

4.3. Ablation Study

We analyze the individual components and their configurations to demonstrate their effectiveness. All reported results are obtained on the X-Sub benchmark of NTU RGB + D 60 using only the joint data.
Configuration Exploration of PAT-GCN. We explore various configurations of PAT-GCN, including the number of adaptive variants of each part, whether or not to use the division configuration, and the choice of spatial modeling structure. As shown in Table 1, the accuracy improvement from increasing the number of adaptive variants is limited. Next, we verify the influence of the static topology A and the dynamic topology D by removing each from PAT-GC. Without the dynamic topology, the accuracy decreases by 0.8%, indicating its importance. Similarly, the performance of PAT-GCN without A decreases by 1.1%, confirming that it is difficult to learn a specific topology for each part without any constraints. In addition, we explore the effect of the spatial modeling structure, as shown in Figure 4. The experimental results show that fine-scale parts benefit from the stacking of PAT-GC blocks, while the hierarchical structure reduces the number of parameters and achieves comparable accuracy.
Effectiveness of PAT-GC. To validate the efficacy of the PAT-GC block in capturing discriminative spatial features, we employ the GC blocks of ST-GCN [17] for spatial modeling as the baseline in controlled experiments. Unless otherwise stated, the other results are obtained using the optimal settings in Table 1. We modify each component of the PAT-GC block and its parameters individually to obtain the accuracy in different situations. The experimental results are shown in Table 2. First, we observe that the accuracy improves when we replace the GC blocks in the baseline with our proposed PAT-GC blocks; the best result outperforms the baseline by 2.3%. We also explore different configurations of PAT-GC, including the reduction ratio r and the activation function σ. As shown in Table 2, PAT-GC outperforms the baseline under all configurations, confirming its robustness. (1) Comparing models A, B, and C, the model with r = 8 (model B) achieves the best result. (2) Comparing models B, D, and E, the Tanh function performs better than Sigmoid and ReLU. Hence, we choose model B as our final model.

4.4. Comparisons with the State-of-the-Art Methods

The majority of state-of-the-art methods adopt multi-stream fusion frameworks to exploit complementary information. For a fair comparison, we adopt the same strategy as in [20,33,37]. Specifically, we fuse the results of four data modalities: joint, bone, joint motion, and bone motion.
We compare our network with the state-of-the-art methods on NTU RGB + D 60, NTU RGB + D 120, and Kinetics Skeleton 400 in Table 3, Table 4 and Table 5, respectively. On all three datasets, our method outperforms the existing methods on almost all evaluation benchmarks. Specifically, as shown in Table 3, our method achieves state-of-the-art performance on the X-sub and X-view benchmarks of NTU RGB + D 60, outperforming the state-of-the-art DualHead-Net by 0.7% and 0.5%, respectively. We show qualitative results for several difficult actions in Figure 5. These actions are easily confused with other categories, such as “writing” and “type on a keyboard”; our method achieves better performance on them. However, our temporal modeling can still be improved: DualHead-Net, with more complex temporal modeling, achieves better performance on “put on a shoe” and “take off a shoe”, which are confused in terms of their temporal relationship. The inference speed of our method is 32.9 sequences per second per GPU. For the NTU RGB + D 120 dataset, as shown in Table 4, our method outperforms EfficientGCN-B4 by 0.8% on X-sub and DualHead-Net by 1.3% on X-set. In addition, for the Kinetics Skeleton 400 dataset, as shown in Table 5, our four-stream model outperforms DualHead-Net by 0.7% in Top-1 accuracy. These results demonstrate the effectiveness of the proposed model.

5. Conclusions

This paper presents a novel part-wise adaptive topology graph convolution (PAT-GC) for skeleton action recognition. PAT-GC processes multiple parts instead of the entire skeleton graph. We divide the topology representation of the skeleton into two types, static and dynamic, corresponding to a predetermined adjacency matrix and a trainable matrix, respectively. The static topology provides effective prior information, and the dynamic topology guides the adjacency matrix to focus on more informative connections. We also design a hierarchical spatial modeling structure to integrate the features of two-scale parts. Our method extracts more discriminative features by considering the part-wise information of skeleton data. Finally, we combine the proposed spatial and multi-scale temporal modules to construct the PAT-GCN for skeleton action recognition. Extensive experiments on three large datasets demonstrate that our model outperforms state-of-the-art methods. In the future, we will extend the temporal modeling and multi-stream ensemble strategies by customizing the structure of each stream to produce more efficient models.

Author Contributions

J.W. completed the main work, including proposing the idea, coding, training the model, and writing the paper. L.Z. and C.F. reviewed and edited the paper. J.W., L.Z., C.F. and R.C. participated in the revision of the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by National Natural Science Foundation of China Enterprise Innovation and Development Joint Fund (Project No. U19B2004) and Open and Innovation Fund of Hubei Three Gorges Laboratory, grant number SK215002.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The numerical calculations in this paper were performed on the supercomputing system in the Supercomputing Center of Wuhan University.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2014; Volume 27. [Google Scholar]
  2. Feichtenhofer, C.; Pinz, A.; Zisserman, A. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1933–1941. [Google Scholar]
  3. Tu, Z.; Li, H.; Zhang, D.; Dauwels, J.; Li, B.; Yuan, J. Action-stage emphasized spatiotemporal VLAD for video action recognition. IEEE Trans. Image Process. 2019, 28, 2799–2812. [Google Scholar] [CrossRef] [PubMed]
  4. Tu, Z.; Xie, W.; Dauwels, J.; Li, B.; Yuan, J. Semantic cues enhanced multimodality multistream CNN for action recognition. IEEE Trans. Circuits Syst. Video Technol. 2018, 29, 1423–1437. [Google Scholar] [CrossRef]
  5. Thakkar, K.; Narayanan, P. Part-based graph convolutional network for action recognition. arXiv 2018, arXiv:1809.04983. [Google Scholar]
  6. Defferrard, M.; Bresson, X.; Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2016; Volume 29. [Google Scholar]
  7. Li, R.; Wang, S.; Zhu, F.; Huang, J. Adaptive graph convolutional neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–3 February 2018; Volume 32. [Google Scholar]
  8. Welling, M.; Kipf, T.N. Semi-supervised classification with graph convolutional networks. In Proceedings of the International Conference on Learning Representations (ICLR 2017), Toulon, France, 24–26 April 2017. [Google Scholar]
  9. Xu, K.; Hu, W.; Leskovec, J.; Jegelka, S. How powerful are graph neural networks? arXiv 2018, arXiv:1810.00826. [Google Scholar]
  10. Ying, Z.; You, J.; Morris, C.; Ren, X.; Hamilton, W.; Leskovec, J. Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2018; Volume 31. [Google Scholar]
  11. Wu, F.; Souza, A.; Zhang, T.; Fifty, C.; Yu, T.; Weinberger, K. Simplifying graph convolutional networks. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 6861–6871. [Google Scholar]
  12. Abu-El-Haija, S.; Perozzi, B.; Kapoor, A.; Alipourfard, N.; Lerman, K.; Harutyunyan, H.; Ver Steeg, G.; Galstyan, A. Mixhop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 21–29. [Google Scholar]
  13. Song, S.; Lan, C.; Xing, J.; Zeng, W.; Liu, J. An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31. [Google Scholar]
  14. Wang, J.; Liu, Z.; Wu, Y.; Yuan, J. Mining actionlet ensemble for action recognition with depth cameras. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 June 2012; pp. 1290–1297. [Google Scholar]
  15. Zhang, P.; Lan, C.; Xing, J.; Zeng, W.; Xue, J.; Zheng, N. View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2117–2126. [Google Scholar]
  16. Li, C.; Xie, C.; Zhang, B.; Han, J.; Zhen, X.; Chen, J. Memory attention networks for skeleton-based action recognition. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 4800–4814. [Google Scholar] [CrossRef] [PubMed]
  17. Li, Y.; He, Z.; Ye, X.; He, Z.; Han, K. Spatial temporal graph convolutional networks for skeleton-based dynamic hand gesture recognition. EURASIP J. Image Video Process. 2019, 2019, 78. [Google Scholar] [CrossRef]
  18. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 12026–12035. [Google Scholar]
  19. Liu, Z.; Zhang, H.; Chen, Z.; Wang, Z.; Ouyang, W. Disentangling and unifying graph convolutions for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 143–152. [Google Scholar]
  20. Cheng, K.; Zhang, Y.; He, X.; Chen, W.; Cheng, J.; Lu, H. Skeleton-based action recognition with shift graph convolutional network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 183–192. [Google Scholar]
  21. Shah, A.; Mishra, S.; Bansal, A.; Chen, J.C.; Chellappa, R.; Shrivastava, A. Pose and joint-aware action recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 3850–3860. [Google Scholar]
  22. Oikonomou, K.M.A.I.; Manaveli, P.; Grekidis, A.; Menychtas, D.; Aggelousis, N.; Sirakoulis, G.C.; Gasteratos, A. Joint-Aware Action Recognition for Ambient Assisted Living. In Proceedings of the 2022 IEEE International Conference on Imaging Systems and Techniques (IST), Kaohsiung, Taiwan, 21–23 June 2022; pp. 1–6. [Google Scholar]
  23. Li, J.; Liu, X.; Zhang, W.; Zhang, M.; Song, J.; Sebe, N. Spatio-temporal attention networks for action recognition and detection. IEEE Trans. Multimed. 2020, 22, 2990–3001. [Google Scholar] [CrossRef]
  24. Santavas, N.; Kansizoglou, I.; Bampis, L.; Karakasis, E.; Gasteratos, A. Attention! A lightweight 2d hand pose estimation approach. IEEE Sens. J. 2020, 21, 11488–11496. [Google Scholar] [CrossRef]
  25. Du, Y.; Wang, W.; Wang, L. Hierarchical recurrent neural network for skeleton based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1110–1118. [Google Scholar]
  26. Si, C.; Jing, Y.; Wang, W.; Wang, L.; Tan, T. Skeleton-based action recognition with spatial reasoning and temporal stack learning. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 103–118. [Google Scholar]
  27. Song, Y.F.; Zhang, Z.; Shan, C.; Wang, L. Stronger, faster and more explainable: A graph convolutional baseline for skeleton-based action recognition. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1625–1633. [Google Scholar]
  28. Wang, M.; Ni, B.; Yang, X. Learning multi-view interactional skeleton graph for action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 1. [Google Scholar] [CrossRef] [PubMed]
  29. Gou, R.; Yang, W.; Luo, Z.; Yuan, Y.; Li, A. Tohjm-Trained Multiscale Spatial Temporal Graph Convolutional Neural Network for Semi-Supervised Skeletal Action Recognition. Electronics 2022, 11, 3498. [Google Scholar] [CrossRef]
  30. Dang, L.; Nie, Y.; Long, C.; Zhang, Q.; Li, G. Msr-gcn: Multi-scale residual graph convolution networks for human motion prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 11467–11476. [Google Scholar]
  31. Yan, Z.; Zhai, D.H.; Xia, Y. DMS-GCN: Dynamic mutiscale spatiotemporal graph convolutional networks for human motion prediction. arXiv 2021, arXiv:2112.10365. [Google Scholar]
  32. Chen, T.; Zhou, D.; Wang, J.; Wang, S.; Guan, Y.; He, X.; Ding, E. Learning multi-granular spatio-temporal graph network for skeleton-based action recognition. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 4334–4342. [Google Scholar]
  33. Chen, Z.; Li, S.; Yang, B.; Li, Q.; Liu, H. Multi-scale spatial temporal graph convolutional network for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 1113–1122. [Google Scholar]
  34. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Skeleton-based action recognition with directed graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7912–7921. [Google Scholar]
  35. Liu, J.; Shahroudy, A.; Perez, M.; Wang, G.; Duan, L.Y.; Kot, A.C. NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2684–2701. [Google Scholar] [CrossRef] [PubMed]
  36. Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. The kinetics human action video dataset. arXiv 2017, arXiv:1705.06950. [Google Scholar]
  37. Ye, F.; Pu, S.; Zhong, Q.; Li, C.; Xie, D.; Tang, H. Dynamic gcn: Context-enriched topology learning for skeleton-based action recognition. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 55–63. [Google Scholar]
  38. Yan, S.; Xiong, Y.; Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
  39. Li, M.; Chen, S.; Chen, X.; Zhang, Y.; Wang, Y.; Tian, Q. Actional-structural graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3595–3603. [Google Scholar]
  40. Zhang, P.; Lan, C.; Zeng, W.; Xing, J.; Xue, J.; Zheng, N. Semantics-guided neural networks for efficient skeleton-based human action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1112–1121. [Google Scholar]
  41. Xu, K.; Ye, F.; Zhong, Q.; Xie, D. Topology-aware convolutional neural network for efficient skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2022; Volume 36, pp. 2866–2874. [Google Scholar]
  42. Song, Y.F.; Zhang, Z.; Shan, C.; Wang, L. Constructing stronger and faster baselines for skeleton-based action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 1474–1488. [Google Scholar] [CrossRef] [PubMed]
Figure 1. (a) An example showing that joints within the same body part tend to have similar movement trends. (b) An example of the coordination of movements of the human body; there is a pattern in the swing of the human hands during running. (c) An example of the timing differences between individuals performing the same action.
Figure 2. (a) Fine-scale configuration partitioning: one trunk, two arms, and two legs. (b) Coarse-scale configuration partitioning.
Figure 3. Network architecture. The full architecture consists of ten basic ST blocks; each ST block contains three components: a spatial modeling module with PAT-GC blocks, a multi-scale temporal modeling module, and two residual connections.
Figure 4. (a) One PAT-GC block. (b) Two PAT-GC blocks. (c) Hierarchical structure. The upper PAT-GC block is used for the coarse-scale parts and the lower one for the fine-scale parts. The dashed arrows represent the features of other parts.
Figure 5. Comparison of the classification accuracy for several difficult action classes.
Table 1. Comparing the model accuracy of PAT-GCN with different settings. # represents the number of variants.

Design             #Variant   Hierarchical   Acc (%)
PAT-GCN            1          ✓              89.9
PAT-GCN            2          ✓              90.4
PAT-GCN            4          ✓              90.3
w/o D              2          ✓              89.6
w/o A              2          ✓              89.3
w/o fine-scale     2          ✓              90.0
w/o coarse-scale   2          ✓              89.5
1 PAT-GC           2          ×              90.1
2 PAT-GC           2          ×              90.4
Table 2. Comparing the model accuracy with different reduction ratios and activation functions.

Methods    r    σ         Acc (%)
Baseline   –    –         88.1
A          4    Tanh      90.2
B          8    Tanh      90.4
C          16   Tanh      90.0
D          8    Sigmoid   89.9
E          8    ReLU      90.2
Table 3. Comparison of the Top-1 accuracy (%) with the state-of-the-art methods on the NTU RGB + D 60 skeleton dataset.

Methods                        X-Sub (%)   X-View (%)
ST-GCN (2018) [38]             81.5        88.3
AS-GCN (2019) [39]             86.8        94.2
2s-AGCN (2019) [18]            88.5        95.1
DGNN (2019) [34]               89.9        96.1
SGN (2020) [40]                89.0        94.5
Shift-GCN (2020) [20]          90.7        96.5
Dynamic GCN (2020) [37]        91.5        96.0
MS-G3D (2020) [19]             91.5        96.2
MST-GCN (2021) [33]            91.5        96.6
4s DualHead-Net (2021) [32]    92.0        96.6
Ta-CNN (2022) [41]             90.4        94.8
Ta-CNN+ (2022) [41]            90.7        95.1
EfficientGCN-B4 (2022) [42]    91.7        95.7
PAT-GCN (Joint Only)           90.4        95.1
PAT-GCN (Joint + Bone)         92.2        96.4
PAT-GCN                        92.7        97.1
Table 4. Comparison of the Top-1 accuracy (%) with the state-of-the-art methods on the NTU RGB + D 120 skeleton dataset.

Methods                        Param. (M)   X-Sub (%)   X-Set (%)
2s-AGCN (2019) [18]            6.9          82.9        84.9
SGN (2020) [40]                1.8          79.2        81.5
Shift-GCN (2020) [20]          2.7          85.9        87.6
MS-G3D (2020) [19]             6.4          86.9        88.4
Dynamic GCN (2020) [37]        14.4         87.3        88.6
MST-GCN (2021) [33]            12.0         87.5        88.8
4s DualHead-Net (2021) [32]    12.0         88.2        89.3
Ta-CNN+ (2022) [41]            4.4          85.7        87.3
EfficientGCN-B4 (2022) [42]    1.1          88.4        89.1
PAT-GCN                        5.4          89.2        90.6
Table 5. Comparison of the Top-1 accuracy (%) and Top-5 accuracy (%) with the state-of-the-art methods on the Kinetics Skeleton 400 dataset.

Methods                        Top-1 (%)   Top-5 (%)
ST-GCN (2018) [38]             30.7        52.8
AS-GCN (2019) [39]             34.8        56.5
2s-AGCN (2019) [18]            36.1        58.7
DGNN (2019) [34]               36.9        59.6
MS-G3D (2020) [19]             38.0        60.9
Dynamic GCN (2020) [37]        37.9        61.3
MST-GCN (2021) [33]            38.1        60.8
4s DualHead-Net (2021) [32]    38.4        61.3
PAT-GCN                        39.2        62.4