Article

Adaptive Channel-Enhanced Graph Convolution for Skeleton-Based Human Action Recognition

1 School of Computer and Information Engineering, Harbin University of Commerce, Harbin City 150028, China
2 Postdoctoral Research Workstation of Northeast Asia Service Outsourcing Research Center, Harbin City 150000, China
3 Post-Doctoral Flow Station of Applied Economics, Harbin City 150000, China
4 Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin City 150000, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(18), 8185; https://doi.org/10.3390/app14188185
Submission received: 19 August 2024 / Revised: 8 September 2024 / Accepted: 10 September 2024 / Published: 11 September 2024
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

Abstract: Obtaining discriminative joint features is crucial for skeleton-based human action recognition. Current models mainly focus on skeleton topology encoding; however, their predefined topology is identical and fixed for all action samples, which makes it difficult to obtain discriminative joint features. Although some studies have considered the complex non-natural connections between joints, existing methods that rely on high-order adjacency matrices or additional trainable parameters cannot fully capture this complexity and instead increase the number of computational parameters. Therefore, this study constructs a novel adaptive channel-enhanced graph convolution (ACE-GCN) model for human action recognition. The model generates similarity and affinity attention maps by encoding channel attention in the input features. These maps are applied complementarily to the input feature map and the graph topology, which refines the joint features and constructs an adaptive, non-shared channel-based adjacency matrix. This way of constructing the adjacency matrix improves the model's capacity to capture intricate non-natural connections between joints, prevents the accumulation of redundant information, and minimizes the number of computational parameters. In addition, integrating the Edgeconv module into a multi-branch aggregation improves the model's ability to aggregate features at different scales and in the temporal domain. Finally, comprehensive experiments were carried out on the NTU-RGB+D 60, NTU-RGB+D 120, and NW-UCLA datasets. On the NTU RGB+D 60 dataset, the accuracy of human action recognition was 92% (X-Sub) and 96.3% (X-View), and the model achieved an accuracy of 96.6% on the NW-UCLA dataset. The experimental results confirm that the ACE-GCN exhibits superior recognition accuracy and lower computational complexity compared to current methodologies.

1. Introduction

Human action recognition is a field of study in computer vision that has gained significant attention because of its applications in areas such as security monitoring and virtual games [1]. Human action data encompasses various modalities, including RGB video, depth video, and skeleton sequences. However, human action recognition based on RGB data is easily disturbed by the background, illumination intensity, and viewpoint, which affects recognition accuracy [2]. In contrast, skeleton data consists of sequences of human joint points, including the head, shoulders, elbows, wrists, hips, knees, and ankles, with the position of each joint usually represented by 2D or 3D coordinates. Skeleton data is robust to background interference, illumination changes, and viewpoint variation [3]. Due to progress in sensor technology and the increasing availability of large-scale datasets, there has been significant interest in skeleton-based human action recognition, which is valued for remaining effective under environmental variations.
Because skeleton sequences are sequences of graph data rather than pseudo-images or vector sequences with grid-like structures, traditional deep learning networks such as CNNs and RNNs struggle to fully represent and utilize the natural structural information of human bones [4]. To overcome the limitation of grid-like structural features, graph convolutional networks (GCNs) extend convolutional neural networks to graphs with diverse structures. Thus, for skeleton-based human action recognition, GCNs hold an advantage over the other two deep learning approaches. Yan et al. [5] were the pioneers in employing graph convolutional networks for skeleton action recognition, developing the spatial temporal graph convolutional network (ST-GCN). The ST-GCN divides the input human skeleton data into different partitions, assigns different weight labels, and then uses graph convolution to compute a weighted average of, or concatenate, each node's features with those of its neighboring nodes to obtain the aggregated features. This feature extraction method effectively improves the accuracy of human action recognition on skeleton data and opened up the application of graph convolutional networks to human skeleton action recognition. Shi et al. [6] introduced the two-stream adaptive graph convolutional network (2s-AGCN), which successfully utilized the length and direction features of skeleton data to enhance the precision of human action recognition. Furthermore, other GCN-based human skeleton action recognition techniques have demonstrated impressive results [2,7,8]. Consequently, using GCNs to extract features from human skeleton data has progressively become the predominant approach in human action recognition. However, GCN-based human skeleton action recognition algorithms still face certain difficulties:
  • The graph topology of the network is predetermined, shared, and remains constant. The model therefore lacks flexibility and cannot accurately represent the various levels of semantic information in human activities, so it struggles to precisely depict the interrelationships among the joints of the human body. The interdependence of joints is intricate and fluctuates across different actions. An adjacency matrix that is shared equally across learning channels makes it difficult to fit these correlations, so the model loses the ability to distinguish the subtle differences between actions.
  • In human skeleton action recognition, the local receptive field is closely related to the body's physical structure, and different joint points have different numbers of neighborhood nodes. However, existing methods simply aggregate the acquired features, resulting in a limited receptive field. Therefore, it is difficult for the model to effectively reflect the diverse long- and short-range relationships within skeletons.
This study introduces an ACE-GCN for skeleton-based human action recognition. The proposed network addresses the aforementioned issues to a certain degree. The ACE-GCN can enhance the representation capability of GCNs by providing a more flexible and discriminative feature learning process. The main goal of this study is to create a relevance matrix and an affinity matrix of the graph topology using the attention mechanism. These matrices utilize not only the local information in each channel but also the global contextual information across channels. This approach further refines the graph topology in each channel by adaptively assigning weights to all channels. Since the adjacency matrix in each channel is adaptive and non-shared, the aggregation of redundant information is avoided. In addition, this study integrates Edgeconv [9] into the model to constitute a multi-branch aggregation module. Edgeconv captures local geometric features in the spatial domain by dynamically constructing the neighborhood relations of points, while the other parallel aggregation branches capture the spatio-temporal features of the joints globally. The utilization of several branches in this aggregation strategy improves the model's extraction of spatio-temporal information at various scales.
The experimental findings on the NTU RGB+D 60 [10], NTU RGB+D 120 [11], and NW-UCLA [12] datasets demonstrate that the proposed model achieves superior accuracy while requiring fewer computational resources. The main contributions of this study are as follows:
  • In this study, a novel adaptive channel-enhanced graph convolution network (ACE-GCN) is proposed. The ACE-GCN utilizes a channel attention mechanism to dynamically construct adaptive and non-shared channel-based adjacency matrices, which achieves topology refinement in each feature channel and enables the model to capture effective correlations between joints in different actions.
  • In this study, Edgeconv is integrated to form a multi-branch aggregation method, which can effectively aggregate spatio-temporal features at different global–local scales in the spatial and temporal domains. This multi-branch aggregation method not only boosts the model's representational capacity but also improves its flexibility on complex activities.
  • This design not only improves the overall performance of the model but also optimizes the use of computing resources, which provides a more practical solution for practical applications.

2. Related Work

2.1. Skeleton-Based Action Recognition

Traditional approaches to human action recognition based on skeleton data typically involve the manual extraction of human action features [13,14,15,16,17,18]. These manual feature-based pipelines are quite complicated, and their performance is unsatisfactory. As deep learning has gradually become the mainstream method in computer vision, attempts have been made to utilize RNNs and CNNs to recognize and classify human actions from skeleton data. RNN-based approaches typically treat skeleton data of human activities as a collection of time series [19,20,21], whereas CNN-based approaches first transform the skeleton data into a pseudo-image. RNN-based approaches prioritize the temporal aspects of human actions, whereas CNN-based approaches prioritize the spatial aspects [22,23,24]. Due to its strong parallel computing capability, the CNN is extensively employed in skeleton-based human activity recognition. Nevertheless, research has indicated that both RNNs and CNNs are insufficient for accurately representing the intricate structure of human action skeleton data [4]. Skeleton data is non-Euclidean, whereas RNNs and CNNs are usually used to process Euclidean data, which makes it difficult for them to obtain useful features from skeleton data.
GCNs are well suited to non-Euclidean data. Yan et al. [5] developed the ST-GCN and used it to recognize human actions based on skeletons. The ST-GCN combines a GCN and a TCN, which can extract the spatial and temporal features of human action skeleton data at the same time. This method was the first to show great promise in skeleton-based human action recognition. One problem with the ST-GCN is that it cannot fully capture the complex dynamic changes in actions because it uses a fixed graph structure. Subsequently, Shi et al. [6] merged the two-stream architecture with a GCN to produce the two-stream adaptive graph convolutional network (2s-AGCN). The 2s-AGCN learns the graph topology by constructing a parameterized adjacency matrix so that the network can adaptively adjust the graph structure according to the input data. While this approach enhances the model's recognition accuracy, it also introduces a substantial number of computational parameters. To reduce this cost, Cheng et al. [25] proposed Shift-GCN by combining shift convolution with the ST-GCN; Shift-GCN greatly reduces the computational cost while maintaining good recognition accuracy. Similarly, Song et al. [26] processed the skeleton data of human actions to obtain three information streams: joints, bones, and velocities. They fused these streams and used a compound scaling strategy to propose an efficient GCN with high accuracy and a small number of trainable parameters.

2.2. Attention Mechanisms in Action Recognition

Attention mechanisms [27] are widely used in deep learning models. In human action recognition, Baradel et al. [28] were the first to introduce temporal and spatial attention in order to extract spatio-temporal aspects from human action RGB data and human joint postures. Si et al. [29] introduced an attention-enhanced graph convolutional LSTM network (AGC-LSTM) for the purpose of identifying human behaviors based on skeletal data. The AGC-LSTM model utilizes an attention mechanism to amplify the input from important joints in each AGC-LSTM layer. This allows the model to effectively capture the distinguishing characteristics of spatial arrangement and temporal changes. In addition, Wu et al. [30] proposed the multi-grain contextual focus module (MCF) to capture relation information related to actions from body joints and parts.
In summary, several GCN-based human skeleton action recognition methods have been proposed, demonstrating the potential of GCN models in human action recognition. To focus more effectively on the feature information of skeletal joint points, many researchers have proposed models that use different attention mechanisms. However, current GCN methods pay little attention to optimizing the channel-wise adjacency matrix. To improve the feature extraction ability of GCN models in human skeleton action recognition, this study proposes the ACE-GCN, which uses a channel attention mechanism to re-encode the adjacency matrix, obtaining similarity attention maps and affinity attention maps for the construction of non-shared adaptive adjacency matrices. This addresses the limited ability of existing GCN models to abstract channel-wise graph topological features in human action recognition tasks.

3. Method

This section first defines the relevant symbols and then explains the structure and mathematical concepts of the ACE-GCN in detail.

3.1. Preliminaries

Notation. Human joints are used as graph nodes, and the human bones are used as the edges connecting the nodes, so the skeleton data of a human action forms an undirected graph. The skeleton graph of a human action is represented as G = (V, E, X), where V = {v1, v2, …, vN} is the set of N joint points, E is the edge set of the skeleton graph, and X ∈ R^{N×C} is the feature set of the N joint points, with C the number of feature channels. The adjacency matrix A ∈ R^{N×N} represents the predefined connections of the human joints. The elements a_{i,j} and a_{j,i} of the adjacency matrix represent the connection between joint points vi and vj: when vi and vj are connected, a_{i,j} = a_{j,i} ≠ 0. A skeleton sequence Z = {z1, z2, …, zT} is input, where zt contains the 3D coordinates of every joint point in the skeleton graph at time t and indicates the spatial state of the human action in frame t.
Spatial Temporal Graph Convolutional Network. The skeleton data for human action recognition comprises the spatial and temporal attributes of the action. The ST-GCN model efficiently captures the spatio-temporal features of the action by employing a hierarchical network that alternates between a GCN and a TCN. Spatial graph convolution can be divided into two steps: feature embedding and feature aggregation. Feature embedding converts the input feature X into a high-dimensional feature X̃, as shown in Equation (1):
$\tilde{X} = \mathrm{conv2D}_{1 \times 1}(X)$
In spatial graph convolution, each node's features can be aggregated with those of its neighboring nodes through the product of the adjacency matrix A and the transformed node feature matrix X̃. To prevent the influence of the out-degree and in-degree of the nodes, the adjacency matrix is usually first symmetrically normalized, as shown in Equation (2):
$\tilde{A} = \Lambda^{-\frac{1}{2}}(A + I)\Lambda^{-\frac{1}{2}}$
The matrix Λ is the degree matrix of the nodes, defined as Λ = diag(d_0, …, d_{N−1}), and I is the identity (self-connection) matrix. After the normalization of the adjacency matrix A, the node aggregation is given by Equation (3):
$\hat{X} = \tilde{A}\tilde{X}$
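For concreteness, the sketch below implements Equations (2) and (3) for a single frame: symmetric normalization of the skeleton adjacency matrix followed by neighborhood feature aggregation. The joint count, the example bone, and the random features are illustrative placeholders, not values from the paper.

```python
# A minimal sketch of Equations (2) and (3): symmetric normalization of the
# adjacency matrix, Λ^{-1/2}(A + I)Λ^{-1/2}, followed by neighbor aggregation.
import torch

def normalize_adjacency(A: torch.Tensor) -> torch.Tensor:
    """Return the symmetrically normalized adjacency of Equation (2)."""
    A_hat = A + torch.eye(A.size(0))             # add self-connections (A + I)
    degree = A_hat.sum(dim=1)                    # node degrees d_0, ..., d_{N-1}
    d_inv_sqrt = torch.diag(degree.clamp(min=1e-6).pow(-0.5))
    return d_inv_sqrt @ A_hat @ d_inv_sqrt       # Λ^{-1/2}(A + I)Λ^{-1/2}

N, C = 25, 64                                    # joints and feature channels (assumed)
A = torch.zeros(N, N)
A[0, 1] = A[1, 0] = 1.0                          # one example bone between joints 0 and 1
X_tilde = torch.randn(N, C)                      # embedded joint features from Eq. (1)
X_hat = normalize_adjacency(A) @ X_tilde         # Eq. (3): aggregate neighbor features
```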
In each layer of spatial graph convolution, the new feature representation x̂_i^{(l+1)} of node vi is obtained by aggregating the current feature x̂_i^{(l)} of node vi with the features x̂_j^{(l)} of its neighbor nodes vj, as shown in Equation (4):
$\hat{x}_i^{(l+1)} = \sigma\Big(\sum_{j \in N(i)} \tilde{A}_{ij}^{(l)} \hat{x}_j^{(l)} w^{(l)}\Big)$
N(i) denotes the set of neighbor nodes of node vi, containing node vi itself, and w denotes the weight.
Through the aggregation and updating of nodes, the spatial graph convolution of each layer gradually extracts richer graph structure information. As the number of network layers increases, the node features gradually integrate the influence from more distant neighbors, which makes the feature representation extend from the local neighborhood to a larger range of graph structures. The feature extraction process of spatial graph convolution can be represented by Equation (5):
$\hat{X}^{(l+1)} = \varphi\big(\mathrm{BN}\big(\tilde{A}^{(l)} \hat{X}^{(l)} W^{(l)}\big)\big)$
where φ is the activation function, BN(·) is the batch normalization function, and W is the linear mapping matrix.
To capture the structural information of the temporal dimension in human skeleton action samples, the temporal convolution network (TCN) employs a one-dimensional convolution kernel to convolve the sequence data along the temporal dimension. The features X̂ obtained from the spatial graph convolution are input into the TCN to obtain the spatio-temporal features of the human skeleton action samples. The output of the TCN is shown in Equation (6):
$Z^{(l+1)} = \sigma\big(\mathrm{Conv}_{1 \times K_t}(\hat{X}^{(l)})\big)$
where K_t denotes the size of the convolution kernel along the time axis, σ is the activation function, and Z^{(l+1)} contains the spatio-temporal features of the human action skeleton extracted by the temporal convolution.
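Putting the pieces of Section 3.1 together, the following sketch shows one ST-GCN-style layer: a 1 × 1 feature embedding (Equation (1)), aggregation with a normalized adjacency (Equation (5)), and a 1 × K_t temporal convolution (Equation (6)). The layer sizes, the identity adjacency placeholder, and the module name are assumptions for illustration, not the authors' exact implementation.

```python
# A compact sketch of one spatial-temporal graph convolution layer, assuming a
# precomputed normalized V x V adjacency; sizes are illustrative only.
import torch
import torch.nn as nn

class STGCNBlock(nn.Module):
    def __init__(self, in_channels, out_channels, A_norm, kernel_t=9):
        super().__init__()
        self.register_buffer("A", A_norm)                     # normalized adjacency (V, V)
        self.embed = nn.Conv2d(in_channels, out_channels, 1)  # Eq. (1): 1x1 feature embedding
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        self.tcn = nn.Conv2d(out_channels, out_channels,      # Eq. (6): 1 x K_t temporal conv
                             kernel_size=(kernel_t, 1),
                             padding=(kernel_t // 2, 0))

    def forward(self, x):                                     # x: (batch, C, T, V)
        x = self.embed(x)                                     # feature embedding
        x = torch.einsum("nctv,vw->nctw", x, self.A)          # spatial neighbor aggregation
        x = self.relu(self.bn(x))                             # Eq. (5): BN + activation
        return self.tcn(x)                                    # temporal convolution

V = 25                                                        # NTU skeleton joint count
block = STGCNBlock(3, 64, torch.eye(V))                       # identity adjacency as placeholder
out = block(torch.randn(8, 3, 64, V))                         # -> (8, 64, 64, 25)
```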

3.2. Adaptive Channel-Enhanced Graph Convolution

The skeleton sequence consists of frames, each of which contains a collection of 2D or 3D joint coordinates that comprise the skeleton. The GCN is designed to represent the topological organization of human actions. In this model, the graph consists of nodes representing joints and edges representing bones. In this study, we propose a new network architecture that stacks the ACE-GCN to extract the spatio-temporal features of skeleton sequences and uses a classifier to predict the class score to obtain the correct action classification. Figure 1 illustrates the fundamental structure of the ACE-GCN, which consists of distinct sub-modules for handling the spatial and temporal data of human action skeletal sequences. These sub-modules are referred to as the ACE-GC module and the TCN module, respectively. In this section, we provide a detailed description of the network structure of the ACE-GC module and explain the model’s structure and implementation process through equations.
The attention mechanism can refine the graph topology in each channel to obtain the discriminative graph topology that is critical for the recognition of overall human actions. The ACE-GC network structure is shown in Figure 2. ACE-GC consists of three parts: feature transformation, channel graph topology modeling, and feature aggregation.
Feature transformation: ACE-GC takes the human action skeleton graph as the input feature tensor X ∈ R^{N×C×T×V}, where C denotes the number of feature channels, T the number of time frames of the action, and V the number of joint points in the skeleton graph. To minimize the number of parameters, this study uses a 1 × 1 convolution to carry out feature transformation on the input data, resulting in weighted features across multiple channels, as shown in Equation (7):
$\tilde{X} = \theta(X) = XW, \quad \tilde{X} \in \mathbb{R}^{N \times \frac{C}{r} \times T \times V}$
where r is the channel reduction rate (r = 8 and r = 16 are used in the experiments in this study), and W ∈ R^{N×C×(C/r)} is the weight matrix.
Channel graph topology modeling: The heuristic adjacency matrix is shared by all channels and trained through backpropagation. This study uses channel affinity attention to learn the channel affinity matrix, which assigns adaptive weights to different channels to avoid clustering similar redundant information. It also refines the graph topology features in each channel by weighting all channels to obtain a non-shared channel graph topology A′, as shown in Figure 3.
For channel topology modeling, we initially employ a global average pooling layer to aggregate the time dimension. This pooling preserves both long-term and short-term temporal information without sacrificing any spatial information about the graph structure. The input feature X is then transformed into a high-level feature X̃, which is used as both the query matrix X̃_1 and the key matrix X̃_2. By multiplying the transpose of the query matrix, X̃_1^T, with the key matrix X̃_2, the channel similarity matrix S is obtained, as shown in Equation (8):
$S(\theta(X_1), \theta(X_2)) = \mathrm{dot}(\theta(X_1), \theta(X_2)) = \mathrm{dot}(\tilde{X}_1^{\top}, \tilde{X}_2)$
The channel similarity matrix S captures the correlation between joint pairs in each channel. By computing the inner product, long-range dependencies in the spatial data can be obtained. Conversely, from the similarity between joint pairs, the irrelevance between joint pairs in a channel, i.e., the channel affinity, can be obtained. In our method, the affinity matrix M is obtained by selecting the maximum similarity along the rows of the similarity matrix S and then expanding it to the same size as S; in the expanded affinity matrix M, joint pairs with higher similarity have lower affinity. In addition, since the affinity matrix M is an adaptive weight matrix used for channel topology refinement, the Softmax function is used to normalize it, as shown in Equation (9):
$M = \mathrm{softmax}\big(\mathrm{expand}_{1 \to V}(\mathrm{MAX}(S)\,\mathbf{1}_V) - S\big), \quad S \in \mathbb{R}^{V \times V}, \; M \in \mathbb{R}^{N \times \frac{C}{r} \times V}$
The resulting attention matrix is used to compute weighted features through matrix operations so that each channel has an adaptive merging weight for integrating the nodes. The channel correlation modeling function C(·) then uses the weighted features to calculate the distance between features along the channel dimension, and a nonlinear transformation of these distances determines the channel-specific topological relationship between joint pairs. These specific topological relationships are merged to obtain a higher-level semantic graph. The correlation Z of the channel-specific topological relationship between joint pairs is calculated as shown in Equation (10):
$Z = C(M\tilde{X}_1, \tilde{X}_2) = \tanh\big(M\tilde{X}_1 - \tilde{X}_2\big), \quad Z \in \mathbb{R}^{N \times \frac{C}{r} \times V}$
Z reflects the specific cross-channel topological relationships between joints. Lastly, by applying the channel-specific topological connection Z to refine the shared graph topology A, the adaptive channel-specific graph topology A′ is obtained, as shown in Equation (11):
$A' = Z \cdot \varepsilon + A, \quad A' \in \mathbb{R}^{N \times \frac{C}{r} \times T \times V}$
where ε is a trainable parameter that adjusts the strength of the channel topology refinement.
Channel aggregation: Given the refined adaptive channel graph topology A′ and the high-level feature X̃, ACE-GC aggregates features along the channel dimension, as shown in Figure 4.
The channel-wise feature aggregation operation uses the Einstein summation convention to aggregate the features of all nodes, thereby obtaining the feature representation of each node. The aggregated features, denoted as Z′, are shown in Equation (12):
$Z' = \sum_{i}^{3} \mathrm{einsum}(A', \tilde{X}), \quad Z' \in \mathbb{R}^{N \times C_{out} \times T \times V}$
einsum(·) refers to the Einstein summation convention. ACE-GC creates a distinct graph topology for each channel to represent the node relationships of a particular type of action feature. The Edgeconv operation is added as an aggregation branch to build a multi-branch aggregation module that aggregates local neighborhood information and captures different levels of physical and semantic attributes. The aggregation connects all channel graph topologies in parallel, thus achieving global–local feature information aggregation and improving the model's understanding of the input data. φ_θ is a learned nonlinear mapping function, typically implemented as an MLP. The output of the ACE-GCN is shown in Equation (13):
$G_{out} = \mathrm{BN}\Big(\mathrm{ReLU}\big(Z' + \max_{j \in N(i)} \phi_\theta(z_i, z_i - z_j)\big)\Big)$
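The sketch below is one possible reading of the channel graph topology modeling in Equations (7)–(12): temporal pooling, 1 × 1 query/key embeddings, a joint-pair similarity matrix, an affinity map built from its row-wise maximum, refinement of the shared adjacency, and einsum-based channel aggregation. Because the published equations are reconstructed here, the exact operator choices (e.g., the subtraction inside tanh) and all shapes beyond those stated in the paper are assumptions.

```python
# A hedged sketch of ACE-GC channel topology refinement and aggregation.
import torch
import torch.nn as nn

class ChannelTopology(nn.Module):
    def __init__(self, in_channels, r=8):
        super().__init__()
        mid = in_channels // r                         # channel reduction rate r (Eq. (7))
        self.query = nn.Conv2d(in_channels, mid, 1)    # theta: 1x1 embedding
        self.key = nn.Conv2d(in_channels, mid, 1)
        self.eps = nn.Parameter(torch.zeros(1))        # trainable refinement strength (Eq. (11))

    def forward(self, x, A):                           # x: (batch, C, T, V), A: (V, V) shared
        pooled = x.mean(dim=2, keepdim=True)           # global average pooling over time
        q = self.query(pooled).squeeze(2)              # (batch, C/r, V)
        k = self.key(pooled).squeeze(2)                # (batch, C/r, V)
        S = torch.einsum("ncv,ncw->nvw", q, k)         # Eq. (8): joint-pair similarity
        M = torch.softmax(S.max(-1, keepdim=True).values - S, dim=-1)  # Eq. (9): affinity
        q_w = torch.einsum("nvw,ncw->ncv", M, q)       # affinity-weighted query features
        Z = torch.tanh(q_w.unsqueeze(-1) - k.unsqueeze(-2))  # Eq. (10): channel-wise distances
        return A + self.eps * Z                        # Eq. (11): refined non-shared topology

# Usage: refine a shared 25-joint adjacency and aggregate embedded features (Eq. (12)).
x = torch.randn(4, 64, 32, 25)                         # (batch, C, T, V)
A_shared = torch.eye(25)                               # placeholder shared topology
A_refined = ChannelTopology(64)(x, A_shared)           # (batch, C/r, V, V)
x_tilde = nn.Conv2d(64, 64 // 8, 1)(x)                 # embedded features (batch, C/r, T, V)
out = torch.einsum("nkuv,nktv->nktu", A_refined, x_tilde)   # channel-wise aggregation
```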

3.3. Network Architecture

This study utilizes ACE-GC to develop a graph convolutional network, ACE-GCN, for the purpose of recognizing human skeleton actions. Figure 5 illustrates that the ACE-GCN has ten fundamental network layers. Following the 10 fundamental layers, the network is linked to a global average pooling layer and a Softmax classifier [31]. Each basic network layer is mainly composed of a spatial convolution, a temporal convolution module, and a residual connection. The output of each basic network layer is shown in Equation (14):
$F_{out} = \sigma\big(\mathrm{res}(X) + \mathrm{TCN}(G_{out})\big), \quad X \in \mathbb{R}^{N \times T \times C \times V}$
The design of the temporal convolution module in the ACE-GCN basic network layer is based on the design in [31]. The multi-scale temporal convolution module in the basic network layer contains four branches, as shown in Figure 6. Finally, the output of the four branches is aggregated to provide the next network layer with more advanced spatio-temporal features as input.
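A hedged sketch of one basic network layer from Equation (14) follows: a spatial module (ACE-GC), a temporal convolution module, and a residual connection. The four-branch temporal module of Figure 6 is reduced here to a single 9 × 1 convolution for brevity, and the class and argument names are assumptions.

```python
# One basic layer of the ACE-GCN: F_out = sigma(res(X) + TCN(spatial(X))), per Eq. (14).
import torch
import torch.nn as nn

class BasicLayer(nn.Module):
    def __init__(self, spatial_module, in_channels, out_channels, stride=1):
        super().__init__()
        self.spatial = spatial_module                          # e.g., an ACE-GC block
        self.tcn = nn.Sequential(                              # simplified temporal module
            nn.Conv2d(out_channels, out_channels, (9, 1),
                      stride=(stride, 1), padding=(4, 0)),
            nn.BatchNorm2d(out_channels),
        )
        # residual branch: identity when shapes match, otherwise a strided 1x1 conv
        if in_channels == out_channels and stride == 1:
            self.res = nn.Identity()
        else:
            self.res = nn.Conv2d(in_channels, out_channels, 1, stride=(stride, 1))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                                      # x: (batch, C, T, V)
        return self.relu(self.res(x) + self.tcn(self.spatial(x)))

# Usage with a placeholder 1x1 "spatial" module; ten such layers are stacked in Figure 5.
layer = BasicLayer(nn.Conv2d(3, 64, 1), 3, 64)
out = layer(torch.randn(8, 3, 64, 25))                         # -> (8, 64, 64, 25)
```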

3.4. Metrics

To validate the effectiveness of the proposed method, this study evaluates the performance of the ACE-GCN model using the Top-1 classification accuracy (%) metric. Top-1 classification accuracy measures the proportion of instances where the predicted class matches the true class, directly reflecting the model’s accuracy in the human skeleton action recognition task. The formula for calculating the Top-1 classification accuracy is shown in Equation (15):
$\mathrm{Top\text{-}1}_{acc} = \frac{1}{N}\sum_{j=1}^{N} \mathbb{1}(\hat{y}_j = y_j) \times 100\%$
N is the total number of samples in the dataset, ŷ_j is the predicted class for the j-th action sample, and y_j is its true class. Furthermore, 𝟙(ŷ_j = y_j) is an indicator function that equals 1 when the predicted class ŷ_j matches the true class y_j and 0 otherwise.
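As a minimal illustration, the snippet below computes the Top-1 accuracy of Equation (15) from class scores: the share of samples whose highest-scoring predicted class matches the ground-truth label.

```python
import torch

def top1_accuracy(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """logits: (N, num_classes) class scores; labels: (N,) true class indices."""
    predictions = logits.argmax(dim=1)             # predicted class \hat{y}_j per sample
    correct = (predictions == labels).float()      # indicator 1(\hat{y}_j = y_j)
    return correct.mean().item() * 100.0           # average over N samples, in percent

# Toy example: one of two samples is classified correctly -> 50.0
logits = torch.tensor([[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]])
labels = torch.tensor([0, 2])
print(top1_accuracy(logits, labels))
```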

4. Experimental Evaluation

4.1. Datasets

NTU-RGB+D 60. The NTU-RGB+D 60 [10] dataset is one of the most widely used datasets in skeleton-based action recognition. It was published by the Rose Lab of Nanyang Technological University in 2016. The dataset comprises 56,880 action clips, amounting to a total of 4,000,000 frames, and its modalities include RGB video, depth information, and skeleton data. There are 60 action classes in this dataset, and each frame may contain up to two individuals. The dataset provides the three-dimensional coordinates of 25 key joints for each subject, captured by the Kinect v2 depth-sensing camera. The dataset defines two evaluation benchmarks: (1) Cross Subject (X-Sub): the training set contains 40,320 skeleton sequences from 20 volunteers, and the test set contains 16,540 skeleton sequences from the other volunteers. (2) Cross View (X-View): the training set contains 37,920 skeleton sequences from two camera views, while the test set contains 18,960 skeleton sequences from the remaining camera view.
NTU-RGB+D 120. NTU-RGB+D 120 [11] is an expanded version of NTU RGB+D 60. This dataset offers a greater number of samples in terms of environmental conditions, human characteristics, and camera view angles and is presently the most extensive dataset for recognizing actions in indoor settings. It offers data in three modalities: RGB video, depth video, and skeleton sequences. The NTU RGB+D 120 dataset consists of 120 action classes, which encompass 82 daily actions, 12 health-related actions, and 26 interactive actions. The actions were demonstrated by 106 volunteers for a total of 114,480 videos. The NTU-RGB+D 120 dataset likewise defines two evaluation benchmarks: (1) Cross Setting (X-Set): among the 32 camera position settings, samples captured with an even setting ID form the training set, and samples captured with an odd setting ID form the test set. (2) Cross Subject (X-Sub): the 106 volunteers who perform the action demonstrations are divided into a training group and a test group, each consisting of 53 volunteers.
Northwestern-UCLA. The Northwestern-UCLA dataset [12] is a multi-view dataset for 3D action recognition proposed by Wang et al. in 2014, specifically designed for evaluating and developing deep learning-based action recognition models. The dataset contains actions from 10 different classes, which were performed by 10 different actors in three different Kinect v1 camera views. The 3D skeleton data for each action was recorded in the dataset, including the 3D coordinate information of 10 joints for each subject. This study follows the cross-view setting in [12]: the data acquired by the first two cameras is used as the training set data, and the data acquired by the other camera is used as the test set data.

4.2. Experiment Settings

Every experiment was carried out on a deep learning DCU accelerator running the PyTorch framework. The model was trained for 65 epochs using SGD with Nesterov momentum (0.9), a weight decay of 0.0004, and a warm-up strategy for the first five epochs to increase the stability of the training process. The initial learning rate was set to 0.1 and decayed by a factor of 0.1 at epochs 35 and 55. The batch size was uniformly fixed to 64 across all datasets, and the data preprocessing method outlined in reference [32] was employed; for the Northwestern-UCLA dataset, the data preprocessing method outlined in [25] was applied. The experimental results on the NTU RGB+D 60 (X-Sub) dataset indicate that as the base learning rate was gradually decreased (0.1, 0.01, 0.001, 0.0001), the model's Top-1 accuracy was 92%, 89.55%, 75.97%, and 40.92%, respectively, with increasing training time. This suggests that the model did not achieve optimal performance with excessively low learning rates. Conversely, when the learning rate was increased (0.2 and 0.3), the model's Top-1 accuracy reached 90.86% and 90.4%, respectively, indicating that higher learning rates accelerated convergence but caused a slight decline in accuracy. In summary, a base learning rate of 0.1 performed well in training, balancing the model's convergence speed and stability. The method proposed in this study and the corresponding benchmark model use the above settings in the ablation studies and comparative experiments.
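The optimizer and learning-rate schedule described above can be reproduced roughly as in the sketch below: SGD with Nesterov momentum 0.9, weight decay 0.0004, a five-epoch warm-up, and a base rate of 0.1 decayed by a factor of 0.1 at epochs 35 and 55 over 65 epochs. The linear form of the warm-up and the placeholder model are assumptions.

```python
# A sketch of the training configuration in Section 4.2 (not the authors' exact script).
import torch

def build_optimizer_and_schedule(model, base_lr=0.1, warmup_epochs=5,
                                 decay_epochs=(35, 55), decay_rate=0.1):
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr, momentum=0.9,
                                nesterov=True, weight_decay=0.0004)

    def lr_at_epoch(epoch):
        if epoch < warmup_epochs:                      # warm-up (linear form assumed)
            return base_lr * (epoch + 1) / warmup_epochs
        return base_lr * decay_rate ** sum(epoch >= e for e in decay_epochs)

    return optimizer, lr_at_epoch

model = torch.nn.Linear(75, 60)                        # placeholder standing in for the ACE-GCN
optimizer, lr_at_epoch = build_optimizer_and_schedule(model)
for epoch in range(65):
    for group in optimizer.param_groups:
        group["lr"] = lr_at_epoch(epoch)               # step decay at epochs 35 and 55
    # ... iterate over mini-batches of size 64 and call optimizer.step() here ...
```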

4.3. Ablation Study and Discussion

In order to assess the efficacy of the suggested approach, we established three networks with distinct configurations to conduct ablation experiments on the NTU-RGB+D 60 (X-Sub) dataset. The objective was to test the usefulness of each module in the ACE-GCN. This study used the ST-GCN [5] as the baseline. The ST-GCN is a non-adaptive topology-sharing graph convolutional network, and its topology is fixed and cannot be learned. In this study, we enhance the ST-GCN model by using residual connections as the fundamental convolutional module. Additionally, we substitute the time convolution module of the ST-GCN with the time convolution module outlined in Section 3.3.
The experimental results of the ACE-GCN without integrated Edgeconv are shown in Table 1. The symbol in the Edgeconv column indicates whether the network uses Edgeconv for feature aggregation, and joint (J) and bone (B) denote the model's recognition performance on the different data streams. The ACE-GCN improves the Top-1 accuracy of the joint stream by 0.1% compared to the baseline and the Top-1 accuracy of the bone stream by 0.8%. Furthermore, fusing the J and B streams with the ACE-GCN results in a 0.7% increase in Top-1 accuracy. The joint stream consists primarily of the 3D coordinates of human joints, while the bone stream is formed by the coordinate differences between connected joints in space. The proposed ACE-GCN obtains an attention weight matrix of the joint points through the attention mechanism and adaptively optimizes the graph topology in each channel through this weight matrix, thereby better capturing the correlation between paired joints. Therefore, the improvement of the Top-1 accuracy of the ACE-GCN on the bone stream is more significant.
When the ACE-GCN includes Edgeconv for human action recognition, the Top-1 accuracy of the joint stream (J) is improved by 0.2%, the Top-1 accuracy of the bone stream (B) by 1%, and the Top-1 accuracy of the fused joint and bone streams (J+B) by 0.8%. Edgeconv is one of the parallel aggregation branches. Without destroying the topology of the graph, Edgeconv can effectively aggregate the spatio-temporal features of the local topology of the human skeleton through clustering. The other aggregation branch focuses on global information aggregation, and through global–local information aggregation, the module achieves global feature aggregation that retains local key features. Therefore, adding the Edgeconv aggregation branch to the ACE-GCN improves the Top-1 accuracy.
In the confusion matrices shown in Figure 7 and Figure 8, the data on the diagonal represents the Top-1 classification accuracy of the model for each action class. Figure 9 shows cropped portions of the confusion matrices in Figure 7 and Figure 8. The left-hand-side plots (a), (c), and (e) show the confusion matrices of the baseline model for different action classes, and the right-hand-side plots (b), (d), and (f) show the confusion matrices of the ACE-GCN model for the corresponding classes. The horizontal coordinates of plots (a)–(f) are the predicted action labels, the vertical coordinates are the correct action labels, and the data in the plots represent the recognition accuracies. The values in Figure 9 show that, compared to the baseline model, the Top-1 classification accuracy of the ACE-GCN is improved in each action class, and the overall accuracy of human action classification is better than that of the baseline model. Figure 10 further compares the Top-1 classification accuracies of the ACE-GCN and the baseline across action classes, and the results show that the recognition performance of the ACE-GCN is generally better than that of the baseline. In summary, the ACE-GCN proposed in this paper has excellent action recognition performance.
However, through the analysis of the experimental data and its visualization, we also found some limitations of the ACE-GCN model. Hand actions usually involve finer-grained spatio-temporal features and mainly rely on subtle movements of the finger joints. As can be seen from the comparison in Figure 9, although the recognition accuracy for hand-related action classes (e.g., reading and using a mobile phone/tablet) is improved, it is still low compared with that for other action classes. In addition, since the skeleton data does not contain information about the objects involved in hand actions, the model struggles to accurately differentiate between such actions, leading to misrecognition. Therefore, the ACE-GCN still has considerable room for improvement in feature extraction, especially when dealing with fine-grained features.

4.4. Comparison Study and Discussion

Within this section, we conduct a comparative analysis of the ACE-GCN alongside other cutting-edge models using the NTU-RGB+D 60, NTU-RGB+D 120, and NW-UCLA datasets. The empirical findings presented in Table 2, Table 3 and Table 4 illustrate that the ACE-GCN can attain a Top-1 identification accuracy of 92% (X-Sub) and 96.3% (X-View) for the NTU RGB+D 60 dataset. The ACE-GCN achieves a Top-1 recognition accuracy of 88.6% (X-Sub) and 90.1% (X-Set) for the NTU RGB+D 120 dataset. The ACE-GCN achieves 96.6% Top-1 recognition accuracy for the NW-UCLA dataset.
The ACE-GCN outperforms the ST-GCN in terms of Top-1 recognition accuracy, achieving a 10.5% improvement on the X-Sub benchmark and an 8.3% improvement on the X-View benchmark of the NTU RGB+D 60 dataset, while the parameter count is decreased by 1.2 million. Thanks to the adaptive channel topology refined by the attention mechanism and the multi-branch aggregation method formed with Edgeconv, the ACE-GCN effectively aggregates important global and local information in the spatial and temporal domains, thereby comprehensively capturing and fusing channel and spatio-temporal features. Therefore, the ACE-GCN's Top-1 human action recognition accuracy on the NTU RGB+D 60, NTU RGB+D 120, and NW-UCLA datasets is significantly better than that of other cutting-edge methods. In addition, the ACE-GCN has fewer parameters than other cutting-edge models and achieves better performance on the four benchmarks.

5. Conclusions

The aim of this study is to propose a new model, the ACE-GCN, for recognizing human actions based on skeleton data. Unlike existing methods, the ACE-GCN adaptively encodes the graph topological features in each channel through an attention mechanism, achieving feature refinement between joints and effectively extracting more representative human action features to improve recognition accuracy. Simultaneously, this strategy prevents the buildup of redundant information and decreases the number of computational parameters. In addition, by introducing Edgeconv to build a multi-branch aggregation module, the model enhances its ability to aggregate temporal and spatial features at different scales. The results of the ablation and comparison experiments show that the ACE-GCN model proposed in this study achieves 92% (X-Sub) and 96.3% (X-View) accuracy on the NTU RGB+D 60 dataset and 88.6% (X-Sub) and 90.1% (X-Set) accuracy on the NTU RGB+D 120 dataset, respectively. It achieves an accuracy of 96.6% on the NW-UCLA dataset while maintaining a relatively low number of computational parameters. These results indicate that the ACE-GCN has significant advantages in spatio-temporal feature extraction and global–local context capture. Since no skeletal features are created for the objects in hand–object action classes, the model's accuracy in recognizing fine hand actions is low relative to that for other actions, which may affect its performance in tasks with a high proportion of fine action samples. In this paper, we have focused on improving the model's performance in the spatial dimension; future work will build skeletal features for objects and focus on achieving balanced performance across different action types.

Author Contributions

Conceptualization, X.-Y.C. and X.-W.H.; methodology, X.-Y.C.; software, X.-Y.C.; validation, X.-Y.C.; formal analysis, X.-Y.C. and X.-W.H.; investigation, X.-Y.C.; resources, X.-Y.C.; data curation, X.-Y.C. and Y.C.; writing—original draft preparation, X.-Y.C.; writing—review and editing, X.-Y.C. and Q.-Y.G.; visualization, X.-Y.C.; supervision, X.-W.H. and Y.C.; project administration, X.-W.H. and W.H.; funding acquisition, X.-W.H. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the Heilongjiang Postdoctoral Fund for scientific research in Heilongjiang Province in 2019 (Level II, RMB 70,000; serial number LBH-Z19072).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was obtained from all the subjects involved in this study.

Data Availability Statement

The data presented in this study are openly available in [NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis] at [10.1109/CVPR.2016.115], reference number [10.1109/CVPR.2016.115].

Acknowledgments

The Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing and the Postdoctoral research workstation of Northeast Asia Service Outsourcing Research Center provided academic support.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ahmad, T.; Jin, L.; Zhang, X.; Lai, S.; Tang, G.; Lin, L. Graph Convolutional Neural Network for Human Action Recognition: A Comprehensive Survey. IEEE Trans. Artif. Intell. 2021, 2, 128–145.
  2. Chaquet, J.M.; Carmona, E.J.; Fernández-Caballero, A. A Survey of Video Datasets for Human Action and Activity Recognition. Comput. Vis. Image Underst. 2013, 117, 633–659.
  3. Wang, C.; Yan, J. A Comprehensive Survey of RGB-Based and Skeleton-Based Human Action Recognition. IEEE Access 2023, 11, 53880–53898.
  4. Sun, Z.; Ke, Q.; Rahmani, H.; Bennamoun, M.; Wang, G.; Liu, J. Human Action Recognition from Various Data Modalities: A Review. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3200–3225.
  5. Yan, S.; Xiong, Y.; Lin, D. Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. AAAI Conf. Artif. Intell. 2018, 32, 7444–7452.
  6. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 12018–12027.
  7. Yu, L.; Tian, L.; Du, Q.; Bhutto, J.A. Multi-Stream Adaptive Spatial-Temporal Attention Graph Convolutional Network for Skeleton-Based Action Recognition. IET Comput. Vis. 2022, 16, 143–158.
  8. Xie, Y.; Zhang, Y.; Ren, F. Temporal-Enhanced Graph Convolution Network for Skeleton-Based Action Recognition. IET Comput. Vis. 2022, 16, 266–279.
  9. Wang, Y.; Sun, Y.; Liu, Z.; Sarma, S.E.; Bronstein, M.M.; Solomon, J.M. Dynamic Graph CNN for Learning on Point Clouds. ACM Trans. Graph. 2019, 38, 1–12.
  10. Shahroudy, A.; Liu, J.; Ng, T.; Wang, G. NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
  11. Liu, J.; Shahroudy, A.; Perez, M.; Wang, G.; Duan, L.-Y.; Kot, A.C. NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2684–2701.
  12. Wang, J.; Nie, X.; Xia, Y.; Wu, Y.; Zhu, S.-C. Cross-View Action Modeling, Learning and Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 24–27 June 2014.
  13. Lv, F.; Nevatia, R. Recognition and Segmentation of 3-D Human Action Using HMM and Multi-Class AdaBoost; Leonardis, A., Bischof, H., Pinz, A., Eds.; Springer: Berlin/Heidelberg, Germany, 2006; Volume 3954, pp. 359–372.
  14. Wang, J.; Liu, Z.; Chorowski, J.; Chen, Z.; Wu, Y. Robust 3D Action Recognition with Random Occupancy Patterns. In Computer Vision—ECCV 2012; Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2012; Volume 7573, pp. 872–885. ISBN 978-3-642-33708-6.
  15. Wang, J.; Liu, Z.; Wu, Y.; Yuan, J. Mining Actionlet Ensemble for Action Recognition with Depth Cameras. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 1290–1297.
  16. Yang, X.; Tian, Y. Effective 3D Action Recognition Using EigenJoints. J. Vis. Commun. Image Represent. 2014, 25, 2–11.
  17. Cai, X.; Zhou, W.; Wu, L.; Luo, J.; Li, H. Effective Active Skeleton Representation for Low Latency Human Action Recognition. IEEE Trans. Multimed. 2016, 18, 141–154.
  18. Su, B.; Wu, H.; Sheng, M.; Shen, C. Accurate Hierarchical Human Actions Recognition From Kinect Skeleton Data. IEEE Access 2019, 7, 52532–52541.
  19. Zhu, W.; Lan, C.; Xing, J.; Li, Y.; Shen, L.; Zeng, W.; Xie, X. Co-Occurrence Feature Learning for Skeleton Based Action Recognition Using Regularized Deep LSTM Networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016.
  20. Liu, J.; Shahroudy, A.; Xu, D.; Wang, G. Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016.
  21. Liu, J.; Wang, G.; Duan, L.-Y.; Abdiyeva, K.; Kot, A.C. Skeleton-Based Human Action Recognition with Global Context-Aware Attention LSTM Networks. IEEE Trans. Image Process. 2018, 27, 1586–1599.
  22. Zhang, P.; Lan, C.; Xing, J.; Zeng, W.; Xue, J.; Zheng, N. View Adaptive Neural Networks for High Performance Skeleton-Based Human Action Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 1963–1978.
  23. Caetano, C.; Sena, J.; Brémond, F.; Santos, J.A.d.; Schwartz, W.R. SkeleMotion: A New Representation of Skeleton Joint Sequences Based on Motion Information for 3D Action Recognition. In Proceedings of the 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Taipei, Taiwan, 18–21 September 2019.
  24. Caetano, C.; Brémond, F.; Schwartz, W.R. Skeleton Image Representation for 3D Action Recognition Based on Tree Structure and Reference Joints. In Proceedings of the 2019 32nd SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), Rio de Janeiro, Brazil, 28–31 October 2019.
  25. Cheng, K.; Zhang, Y.; He, X.; Chen, W.; Cheng, J.; Lu, H. Skeleton-Based Action Recognition with Shift Graph Convolutional Network. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 180–189.
  26. Song, Y.-F.; Zhang, Z.; Shan, C.; Wang, L. Constructing Stronger and Faster Baselines for Skeleton-Based Action Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 1474–1488.
  27. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 6000–6010.
  28. Baradel, F.; Wolf, C.; Mille, J. Human Action Recognition: Pose-Based Attention Draws Focus to Hands. In Proceedings of the 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), Venice, Italy, 22–29 October 2017; pp. 604–613.
  29. Si, C.; Chen, W.; Wang, W.; Wang, L.; Tan, T. An Attention Enhanced Graph Convolutional LSTM Network for Skeleton-Based Action Recognition. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 1227–1236.
  30. Wu, L.; Zhang, C.; Zou, Y. SpatioTemporal Focus for Skeleton-Based Action Recognition. Pattern Recognit. 2023, 136, 109231.
  31. Liu, Z.; Zhang, H.; Chen, Z.; Wang, Z.; Ouyang, W. Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 140–149.
  32. Zhang, P.; Lan, C.; Zeng, W.; Xing, J.; Xue, J.; Zheng, N. Semantics-Guided Neural Networks for Efficient Skeleton-Based Human Action Recognition. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 1109–1118.
  33. Chen, Z.; Li, S.; Yang, B.; Li, Q.; Liu, H. Multi-Scale Spatial Temporal Graph Convolutional Network for Skeleton-Based Action Recognition. AAAI 2021, 35, 1113–1122.
  34. Wen, Y.-H.; Gao, L.; Fu, H.; Zhang, F.-L.; Xia, S.; Liu, Y.-J. Motif-GCNs with Local and Non-Local Temporal Blocks for Skeleton-Based Action Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 2009–2023.
  35. Zhu, Q.; Deng, H. Spatial Adaptive Graph Convolutional Network for Skeleton-Based Action Recognition. Appl. Intell. 2023, 53, 17796–17808.
  36. Chen, H.; Li, M.; Jing, L.; Cheng, Z. Lightweight Long and Short-Range Spatial-Temporal Graph Convolutional Network for Skeleton-Based Action Recognition. IEEE Access 2021, 9, 161374–161382.
  37. Yang, H.; Yan, D.; Zhang, L.; Li, D.; Sun, Y.; You, S.; Maybank, S.J. Feedback Graph Convolutional Network for Skeleton-Based Action Recognition. IEEE Trans. Image Process. 2020, 31, 164–175.
  38. Li, X.; Kang, J.; Yang, Y.; Zhao, F. A Lightweight Attentional Shift Graph Convolutional Network for Skeleton-Based Action Recognition. Int. J. Comput. Commun. Control 2023, 18, e5061.
  39. Li, C.; Huang, Q.; Mao, Y. DD-GCN: Directed Diffusion Graph Convolutional Network for Skeleton-Based Human Action Recognition. In Proceedings of the 2023 IEEE International Conference on Multimedia and Expo (ICME), Brisbane, QLD, Australia, 10–14 July 2023; pp. 786–791.
Figure 1. Basic block of the proposed ACE-GCN.
Figure 2. The structure of the ACE-GC module.
Figure 3. Detailed flowchart of channel graph topology modeling.
Figure 4. Detailed flowchart of the channel aggregation process in the ACE-GCN.
Figure 5. The architecture of the ACE-GCN.
Figure 6. The architecture of the temporal convolution module.
Figure 7. Confusion matrix of the baseline model for the NTU-RGB+D 60 (X-sub) dataset.
Figure 8. Confusion matrix of the ACE-GCN for the NTU-RGB+D 60 (X-sub) dataset.
Figure 9. Confusion matrix of the baseline and ACE-GCN models for different action classes. The confusion matrix of the baseline is shown on the left, and the confusion matrix of the ACE-GCN is shown on the right. Subfigures (a), (c), and (e) display the accuracy performance of the baseline model across different action classes, while subfigures (b), (d), and (f) illustrate the accuracy performance of ACE-GCN for the same action classes.
Figure 10. Comparison of the recognition accuracy between the ACE-GCN and baseline across action classes in the NTU-RGB+D 60 X-sub dataset. The red dashed line indicates the accuracy of the ACE-GCN in each action class, and the blue bars indicate the accuracy of the baseline model in each action class.
Table 1. Comparisons of the Top-1 validation accuracy of the ACE-GCN with various configurations.

Model    | Edgeconv | J (%)  | B (%)  | J+B (%)
Baseline | ×        | 89.3   | 89.1   | 91.2
ACE-GCN  | ×        | 89.4 ↑ | 89.9 ↑ | 91.9 ↑
ACE-GCN  | ✓        | 89.5 ↑ | 90.1 ↑ | 92 ↑
Table 2. Top-1 classification accuracy comparisons for the NTU RGB+D 60 dataset.

Model                | Param. | X-Sub (%) | X-View (%)
ST-GCN [5]           | 3.1 M  | 81.5      | 95.1
2s-AGCN [6]          | 9.94 M | 88.5      | 94.5
SGN [32]             | 0.69 M | 89        | 96.5
Shift-GCN [25]       | 2.76 M | 90.7      | 96.2
MS-G3D [31]          | 6.4 M  | 91.5      | 96.6
MST-GCN [33]         | 12 M   | 91.5      | 96.1
SMotif-GCN [34]      | -      | 90.5      | 95.8
DD-GCN [35]          | -      | 88.9      | 95.7
EfficientGCN-B4 [26] | 1.1 M  | 91.7      | 94.83
SARGCN [36]          | 1.09 M | 88.9      | 96.3
ACE-GCN (ours)       | 1.9 M  | 92        | 95.1
Table 3. Top-1 classification accuracy comparisons for the NTU RGB+D 120 dataset.

Model                | Param. | X-Sub (%) | X-Set (%)
ST-GCN [5]           | 3.1 M  | 70.7      | 73.2
2s-AGCN [6]          | 9.94 M | 82.5      | 84.2
SGN [32]             | 0.69 M | 79.2      | 81.5
Shift-GCN [25]       | 2.76 M | 85.9      | 87.6
MS-G3D [31]          | 6.4 M  | 86.9      | 88.4
MST-GCN [33]         | 12 M   | 87.5      | 88.8
SMotif-GCN [34]      | -      | 88.4      | 88.9
DD-GCN [35]          | -      | 84.9      | 86
EfficientGCN-B4 [26] | 1.1 M  | 88.3      | 89.1
SARGCN [36]          | 1.09 M | 83.8      | 85.1
ACE-GCN (ours)       | 1.9 M  | 88.6      | 90.1
Table 4. Top-1 classification accuracy comparisons for the NW-UCLA dataset.

Model           | Param. | NW-UCLA (%)
SGN [32]        | 0.69 M | 92.5
AGC-LSTM [29]   | -      | 93.3
Shift-GCN [25]  | 2.76 M | 94.6
NLB-ACSE [37]   | 1.21 M | 95.3
FGCN [38]       | -      | 95.3
LA-SGCN [39]    | 0.43 M | 95.7
ACE-GCN (ours)  | 1.9 M  | 96.6
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

