Article

Attention-Guided and Topology-Enhanced Shift Graph Convolutional Network for Skeleton-Based Action Recognition

Graduate School of Computer Science and Engineering, University of Aizu, Tsuruga, Ikki-machi, Aizuwakamatsu 965-8580, Japan
Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Electronics 2024, 13(18), 3737; https://doi.org/10.3390/electronics13183737
Submission received: 7 July 2024 / Revised: 12 September 2024 / Accepted: 17 September 2024 / Published: 20 September 2024
(This article belongs to the Special Issue Artificial Intelligence in Image Processing and Computer Vision)

Abstract

Graph Convolutional Networks (GCNs) have emerged as a game-changer in skeleton-based action recognition. However, most previous works are resource-heavy, with large numbers of FLoating-point OPerations (FLOPs) limiting the models’ potential. A recent work introducing shift operators into GCNs (Shift-GCN) established a lightweight baseline, but a performance gap remains compared to earlier results. Inspired by Shift-GCN, we propose a novel model named attention-guided and topology-enhanced shift graph convolutional network (AT-Shift-GCN), which retains the lightweight design while providing more powerful performance. We employ a topological transfer operation to aggregate the information flow across channels and extract spatial information. In addition, to extract temporal information across scales, we apply attention to the interaction of shift convolution kernels of different lengths. Furthermore, we integrate an ultralight spatiotemporal attention module to fuse spatiotemporal details and provide a robust neighborhood representation. In summary, AT-Shift-GCN is a lightweight model for skeleton-based action recognition that delivers enhanced performance on three datasets.

1. Introduction

Deep learning has revolutionized various fields, including healthcare, robotics, and autonomous systems, enabling advanced applications ranging from medical diagnostics to intelligent robotic systems. In particular, human motion recognition has gained significant attention in healthcare and security due to its ability to enhance human–robot interaction and automate complex tasks. This focus on recognizing human movements has proven valuable for monitoring, analysis, and interaction.
Human motion recognition has been gaining increasing attention in computer vision due to its potential applications in various fields such as healthcare, sports, and security [1,2,3]. In recent years, the recognition of skeleton-based human movements has emerged as a promising approach due to the availability of depth sensors. Compared to RGB images, skeleton data provides a lightweight and highly abstract representation of information, which contains only 2D or 3D positions of important human joints. This simplicity of data representation makes it possible to process large amounts of data efficiently and effectively. Moreover, the lack of environmental factors, such as lighting and clothing variations, makes skeleton-based recognition more robust against complicated backgrounds. Therefore, skeleton-based human motion recognition has become a popular research topic in computer vision and has shown great potential for practical applications.
Earlier methods of skeleton-based human motion recognition relied on handcrafted features extracted from the coordinates of joints, limiting the model’s capability. With the advent of deep learning, newer methods have utilized Recurrent Neural Networks (RNNs) or Convolutional Neural Networks (CNNs) to achieve superior performance. These deep-learning-based methods [4,5,6] convert skeleton data into either pseudo-images or coordinate sequence vectors as input to the neural network. This has led to significant improvements in the accuracy of human motion recognition. The ability of deep-learning methods to automatically learn relevant features from the input data has made them particularly well-suited for this task.
In current research on human pose estimation, there is an ongoing need to address the inherent correlations between joints and the spatial relationships that define human body topology. One early attempt at addressing this issue is the ST-GCN (Spatial-Temporal Graph Convolutional Network) method [7], which applies spatial graph convolutions and interleaving temporal convolutions for spatial-temporal modeling. However, ST-GCN has limitations due to its manually defined topology, which restricts its ability to model relationships beyond the natural connections of joints and hinders its representation capability. To overcome these limitations, researchers have developed several ST-GCN variants [8,9,10,11,12,13] that can go beyond joint connectivity by adding attention modules or multi-scale feature aggregation. These variants have improved the model’s representation capability by allowing it to capture more complex relationships between body parts. However, the introduction of incremental modules has increased the computational complexity of these methods significantly compared to ST-GCN. To further improve human pose estimation, ongoing research explores new methods that can effectively model the complex spatial relationships between human body parts while minimizing computational complexity. Shift-GCN [14] is a recent advancement in human pose estimation that introduces shift operators into the graph convolutional network (GCN) architecture, effectively reducing computational complexity.
Despite its advantages, Shift-GCN has some shortcomings that need to be addressed. First, the flexibility of shifting can lead to information redundancy and reduce the robustness of the model. Although Shift-GCN is designed to interact freely with neighboring joint information, excessive information shifts may decrease model performance. Second, sharing topological information across all channels limits the flexibility of feature extraction. As a result, the model may not be able to extract features specific to a particular topology, which may lead to reduced accuracy. Third, Shift-GCN uses a single adaptive convolution to extract temporal information, which may not be sufficient to capture all the temporal nuances of a given action. For example, hitting requires a higher level of attention at the moment of impact, so a multi-scale flow of information is necessary to capture all the relevant temporal information. Finally, although Shift-GCN provides a good baseline for reducing computational complexity, its performance falls short of some previous approaches. Therefore, there is a need for further research to optimize the model’s architecture and improve its accuracy. In conclusion, while Shift-GCN has demonstrated promising results in human pose estimation, there is still room for improvement. Future research should focus on developing more robust and flexible models to capture the intricate spatial and temporal relationships between human body parts while minimizing computational complexity.
To address the shortcomings of Shift-GCN, we propose a novel graph convolutional network model called AT-Shift-GCN that enhances the performance of Shift-GCN without significantly increasing the computational burden. Our proposed model improves the modeling of channel-wise topological relations by introducing a simple and efficient transfer operation that captures the subtle relationships between joint pairs within each channel. This transfer operation allows for dynamic and efficient modeling of topological relations while minimizing computational burden. Additionally, our weight distribution scheme for neighborhood information aggregation enhances the differences between joints and improves the stability of learned representations. Furthermore, we incorporate both adaptive and multi-scale temporal modeling by learning adaptive weights for channel attention and fusing information flows at different scales. Our experimental comparisons show that AT-Shift-GCN outperforms previous work in temporal modeling while maintaining computational efficiency. We evaluate AT-Shift-GCN on three datasets: NTU RGB+D [15], NTU-120 RGB+D [16], and Northwestern-UCLA [17], and demonstrate that our model achieves state-of-the-art results with lower computational cost. Our model notably outperforms Shift-GCN with comparable computation cost on the NTU RGB+D dataset, as shown in Figure 1. AT-Shift-GCN is an efficient and powerful graph convolutional network model that combines attention-guided and topology-enhanced methods for skeleton-based action recognition.
The main contributions of this work are summarized as follows:
  • We propose a simple yet efficient transfer operation to enhance space modeling, achieving more powerful performance with negligible additional computation.
  • We employ attention neighborhood aggregation instead of averaging, which enhances the robustness and expressiveness of graph learning while promoting spatiotemporal information fusion.
  • We propose an attention-guided adaptive multi-scale temporal convolution that outperforms previous work with a very low computational burden and excellent performance.
  • The proposed AT-Shift-GCN outperforms existing state-of-the-art methods on three datasets for skeleton-based action recognition while requiring significantly fewer computational resources. Our approach achieves this superior performance with more than twice the computational efficiency of previous methods.

2. Related Work

To provide the necessary context for our work, this section begins by reviewing graph convolutional networks, which serve as a fundamental basis for our approach. We also examine previous successful applications of graph neural networks to skeleton action recognition, focusing on models that capture information from space and time domains. Additionally, we introduce the concepts of topology-shared and topology-non-shared methods and emphasize the importance of understanding the differences between these two approaches. This understanding is crucial for appreciating the improvements that we propose in this work.

2.1. Graph Convolutional Networks

With the great success of convolutional neural networks (CNNs) on Euclidean data, there has been a strong desire to migrate them to non-Euclidean data (e.g., graphs). Because graph convolutional networks (GCNs) can model local structure and the node dependencies prevalent on graphs, they have gained much attention. GCNs are often categorized into spectral and spatial methods. Spectral methods [21] employ the convolution theorem on graphs to define the graph convolution in the spectral domain. Spatial methods [21] aggregate each central node and its adjacent nodes in the node domain by defining aggregation functions.
The GCN variant proposed by Kipf et al. [22] is widely applied to numerous downstream tasks due to its simplicity and efficiency; our work is also based on this variant.

2.2. GCN-Based Skeleton Action Recognition

Graph Convolutional Networks (GCNs) have proven successful in the skeleton-based action recognition domain, as demonstrated by several research studies [7,8,9,10,14,23,24]. ST-GCN [7] is a widely used GCN-based model that employs spatial and temporal graph convolutions interleaved for effective spatial-temporal modeling. The spatial graph convolution component of ST-GCN utilizes graph convolutions to model the relationships between joints in the skeleton. This method involves defining the graph structure using the joints as nodes and the edges as the relationships between them. The graph convolution operation is then applied to the graph to capture the spatial dependencies between the joints. This approach has shown great success in capturing the complex spatial relationships between joints in the skeleton, leading to improved action recognition performance. The temporal graph convolution component of ST-GCN is used to model the temporal dynamics of the actions. This is achieved by modeling the temporal evolution of the joint movements as a sequence of graphs. The temporal convolution operation is then applied to the sequence of graphs to capture the temporal dependencies between the joint movements. This method has shown great success in capturing the complex temporal dynamics of actions, leading to improved action recognition performance. So, we will introduce GCN-based models in the following two parts: spatial graph convolution and temporal graph convolution.

2.2.1. Spatial Graph Convolution

In the spatial modeling component of ST-GCN, the human skeleton is represented as a graph, denoted as $G = (V, E)$. The node set $V = \{v_1, v_2, \ldots, v_N\}$ represents the $N$ joints of the skeleton. The edge set $E$ represents the bones connecting the joints and is captured by an adjacency matrix $A$. The spatial GCN operation performed for each frame in a skeleton sequence is formulated as follows:
$$F_{out} = \sum_{d=0}^{D} W_d F_{in} \left( \tilde{D}_d^{-\frac{1}{2}} \tilde{A}_d \tilde{D}_d^{-\frac{1}{2}} \odot M_d \right)$$
Here, $F_{in}$ and $F_{out}$ denote the input and output feature maps of the graph convolution, respectively. The adjacency matrix $\tilde{A}_d$ indicates which pairs of joints are connected at graph distance $d$, and the maximum graph distance is given by the parameter $D$.
To perform the convolution, the GCN uses the symmetrically normalized adjacency $\tilde{D}_d^{-\frac{1}{2}} \tilde{A}_d \tilde{D}_d^{-\frac{1}{2}}$, where $\tilde{D}_d$ is the degree matrix of $\tilde{A}_d$, and the importance of each edge is tuned by the learnable parameter $M_d$. The term $F_{in} \tilde{D}_d^{-\frac{1}{2}} \tilde{A}_d \tilde{D}_d^{-\frac{1}{2}}$ can be interpreted as a spatial mean aggregation of features from the direct neighborhood.
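To make this aggregation concrete, the following PyTorch sketch implements the spatial graph convolution above for a single frame. It is a minimal illustration under stated assumptions: the partitioned adjacency tensor is supplied by the caller, the edge-importance mask $M_d$ is a learnable parameter initialized to ones, and the weights $W_d$ are plain linear projections; it is not the reference ST-GCN implementation.

```python
import torch
import torch.nn as nn

class SpatialGCN(nn.Module):
    """Minimal sketch of the ST-GCN spatial convolution for one frame:
    F_out = sum_d W_d F_in (D_d^{-1/2} A_d D_d^{-1/2} * M_d)."""

    def __init__(self, in_channels, out_channels, adjacency):
        # adjacency: (D + 1, N, N) binary matrices, one per graph distance d
        super().__init__()
        self.register_buffer("A", self.normalize(adjacency))
        self.M = nn.Parameter(torch.ones_like(adjacency))      # learnable edge importance
        self.W = nn.ModuleList(
            [nn.Linear(in_channels, out_channels, bias=False) for _ in range(adjacency.size(0))]
        )

    @staticmethod
    def normalize(A):
        # symmetric normalization D^{-1/2} A D^{-1/2} for each distance slice
        deg = A.sum(-1).clamp(min=1e-6)
        d_inv_sqrt = deg.pow(-0.5)
        return d_inv_sqrt.unsqueeze(-1) * A * d_inv_sqrt.unsqueeze(-2)

    def forward(self, f):                                       # f: (N_joints, C_in)
        out = 0
        for d, w_d in enumerate(self.W):
            out = out + w_d((self.A[d] * self.M[d]) @ f)        # aggregate neighbors, then project
        return out                                              # (N_joints, C_out)
```

In a full model, this operation is applied frame by frame (in practice as a batched tensor operation) and interleaved with temporal convolutions.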
The ST-GCN model has been widely recognized for its effectiveness in action recognition tasks. However, its design is heuristically determined and may limit its performance. Some improvements to ST-GCN rely on more complex network designs, which can increase the computational burden. To address these challenges, Shift-GCN has been introduced. This novel approach introduces a lightweight shift operation to the heavy GCN-based action recognition models. By leveraging this shift operation, Shift-GCN achieves improved performance while maintaining computational efficiency. Specifically, Shift-GCN modifies the graph convolution operation by shifting the positions of the convolution filters instead of using fixed positions. This shift operation effectively expands the receptive field of the model, allowing it to capture more complex spatiotemporal features. Additionally, Shift-GCN incorporates the shift operation into the graph pooling operation to further enhance the model’s ability to capture spatial and temporal information.
Shift-GCN employs a novel spatial modeling approach that utilizes local/non-local shift graph operations and point-wise convolutions instead of traditional graph convolutional networks. The local shift graph convolution leverages the physical structure of the human body as pre-defined by skeleton datasets. In this approach, each node is denoted as $v$, and its set of neighbor nodes is denoted as $B_v = \{v_1, v_2, \ldots, v_n\}$, where $n$ represents the number of neighbor nodes of $v$.
To enable the shift operation, the channels of node $v$ are equally divided into $n + 1$ partitions, with the first partition retaining the features of $v$. The remaining $n$ partitions are shifted in from the neighbor nodes $v_1, v_2, \ldots, v_n$, respectively. This approach effectively expands the receptive field of the model, allowing it to capture more complex spatial relationships among the joints in the human skeleton.
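The partition-and-copy behaviour of the local shift can be sketched as follows; the neighbor-list format and the handling of the channel remainder are illustrative assumptions rather than the original implementation.

```python
import torch

def local_spatial_shift(f, neighbors):
    """Sketch of the local shift for one frame. For node v with n neighbors, its C
    channels are split into n + 1 partitions: partition 0 keeps v's own features,
    and partition k is copied from the k-th neighbor.
    f: (N_joints, C); neighbors: list of neighbor-index lists (assumed format)."""
    n_joints, c = f.shape
    out = f.clone()
    for v, nbrs in enumerate(neighbors):
        parts = len(nbrs) + 1
        size = c // parts
        for k, u in enumerate(nbrs, start=1):
            lo = k * size
            hi = (k + 1) * size if k < parts - 1 else c   # last partition absorbs any remainder
            out[v, lo:hi] = f[u, lo:hi]
    return out
```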
An illustration of the shift operation for node one is shown in Figure 2. This approach enables the model to capture both local and non-local spatial dependencies among the joints in the human body, facilitating more accurate action recognition. The point-wise convolution further enhances the model’s ability to capture fine-grained spatial features, resulting in improved performance compared to traditional GCN-based approaches. Overall, the local/non-local shift graph operation and point-wise convolution in Shift-GCN represent a significant advancement in action recognition.
In addition to the local shift graph operation, Shift-GCN utilizes a non-local shift operation to capture long-range spatial dependencies among joints. This operation is achieved by shifting the channels of the input feature map $F \in \mathbb{R}^{N \times C}$ according to their position in the channel dimension. Specifically, the shift distance of the $i$-th channel is $i \bmod N$, where $N$ is the total number of nodes in the skeleton.
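A minimal sketch of the non-local shift is given below; since only the shift distance $i \bmod N$ is specified above, the roll direction along the joint dimension is an assumption.

```python
import torch

def non_local_spatial_shift(f):
    """Sketch of the non-local shift for one frame: channel i of the (N_joints x C)
    feature map is circularly shifted by i mod N positions along the joint dimension."""
    n_joints, c = f.shape
    shifted = [torch.roll(f[:, i], shifts=i % n_joints, dims=0) for i in range(c)]
    return torch.stack(shifted, dim=1)                    # (N_joints, C)
```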
The non-local shift operation in Shift-GCN provides a means to capture long-range dependencies among joints and incorporate more information about the neighborhood. However, it has some drawbacks that need to be addressed. One of the drawbacks is that the non-local shift operation may introduce redundant information, which can decrease the importance of critical joints in the skeleton. Additionally, the operation may also introduce useless noise, which can reduce the robustness of the model.

2.2.2. Temporal Graph Convolution

In the temporal modeling of ST-GCN, a regular 1D convolution along the temporal dimension is used as the temporal graph convolution. However, this approach has a heavy computational cost, particularly when a convolution kernel of size 9 is used to extract information across time frames.
MS-G3D proposes a multi-scale fusion module that decomposes a large temporal convolution kernel into a fusion of smaller kernels at multiple scales; each kernel is then convolved with the input feature map to extract features at a specific scale. However, the different scales are fused with the same fixed weighting, which limits further improvement of its performance.
To further improve the model’s performance, Shift-GCN introduces an adaptive temporal shift graph operation. Instead of using fixed shift distances for each channel, they learn a set of dynamic shift distances by treating them as learnable parameters. Specifically, they use a one-layer MLP to generate the shift distances for each channel. This allows the model to adaptively adjust the shift distances based on the input data, capturing the dynamics of the action more effectively. After the adaptive temporal shift operation, the output feature maps are passed through a point-wise convolution to capture more complex temporal patterns.
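The sketch below shows one way such an adaptive temporal shift can be made differentiable. As a simplification of the description above, the per-channel shift distances are plain learnable parameters rather than MLP-generated values, and fractional shifts are realized by linear interpolation between neighboring frames so that the distances remain trainable.

```python
import torch
import torch.nn as nn

class AdaptiveTemporalShift(nn.Module):
    """Sketch of an adaptive temporal shift: each channel is shifted in time by a
    learnable real-valued distance, realized via linear interpolation of frames."""

    def __init__(self, channels):
        super().__init__()
        self.shift = nn.Parameter(torch.zeros(channels))    # one shift distance per channel

    def forward(self, x):                                   # x: (batch, C, T, V)
        b, c, t, v = x.shape
        base = torch.arange(t, device=x.device, dtype=x.dtype).view(1, 1, t, 1)
        pos = (base + self.shift.view(1, c, 1, 1)).clamp(0, t - 1)
        low = pos.floor()
        frac = pos - low                                    # interpolation weight (keeps gradients)
        low_idx = low.long().expand(b, c, t, v)
        high_idx = (low + 1).clamp(max=t - 1).long().expand(b, c, t, v)
        return (1 - frac) * x.gather(2, low_idx) + frac * x.gather(2, high_idx)
```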
However, the temporal modeling of both shifts extracts information through a single scale, which reduces the complexity in terms of results and still falls short of the performance of multi-scale temporal modeling.

2.3. Topology-Shared/Topology-Non-Shared Methods

Topology-shared methods, while effective in reducing the complexity of the model, have limitations in terms of performance. As mentioned earlier, these methods utilize the same topology in all channels, aggregating features with the same topology across different channels. Although ST-GCN and Shift-GCN have shown promising results, they still have room for improvement in model performance. Specifically, as shown in Figure 2 and Figure 3, these methods only allow information to interact in the joint dimension, and the channels’ information is shared within the same joint. This limitation may hinder the model’s ability to capture the complex temporal relationships and interactions between different channels, resulting in suboptimal performance.
In contrast, topology-non-shared methods employ different topology information in different channels or channel groups. These methods enhance the model’s expressiveness and decouple graph learning from the limitations of a fixed topology. However, they build on ST-GCN, and the resulting models remain large. This prompted us to instead continue improving the lightweight baseline.

3. Model

The discussion above motivates introducing more powerful and lightweight improvements for the shift graph convolutional network. This section reviews the basic framework of Shift-GCN and presents our improved model’s framework for comparison. The section also explains how to achieve lightweight topology enhancement, improved spatial modeling, and improved temporal modeling. The improved model builds upon the existing Shift-GCN, enhancing its capabilities by introducing multi-scale modeling and adaptive temporal shift graph operations. The goal is to improve the model’s performance while reducing computational complexity.

3.1. Network Architecture

Here, we elaborate on the improvements made to the Shift-GCN backbone. We utilize the same backbone structure to allow for a fair comparison with Shift-GCN and ST-GCN. The Shift-GCN backbone comprises 10 blocks, each containing a spatial non-local shift operation, a spatial point-wise convolution, an adaptive temporal shift operation, and a temporal point-wise convolution, as shown in Figure 4. In contrast, our improved model replaces the original shift operation with the non-shared topology shift operation, thus providing greater flexibility in the model’s design, as shown in Figure 5. Moreover, we learn multi-scale adaptive fusion by fusing temporal convolutions of different scales, which enhances the model’s performance. We employ an attention module to facilitate robust neighborhood representations and spatiotemporal information fusion. These enhancements significantly improve the model’s performance while maintaining low computational complexity.

3.2. Non-Shared Topology Shift Graph Convolution

To address the limitation of shared topology in local/non-local shift graph convolution, we propose a straightforward and efficient method, as shown in Figure 6 and Figure 7. Our method allows information to flow in the channel stream and allows channel information from different joints to be fused, as opposed to being restricted to the same joint as in previous methods. By allowing a more flexible information flow, we can better capture inter-channel correlations and enhance the model’s ability to extract spatiotemporal features. Specifically, we introduce a channel-specific topology, in which each channel has its own shift operations in addition to the topology shared by all channels. This allows for more diverse and nuanced feature extraction, improving the model’s performance.
To enhance the capability of non-shared topology shift graph convolution, we propose a simple yet effective approach. Starting from a spatial skeleton feature map $F \in \mathbb{R}^{N \times C}$, we apply non-shared topology shift graph convolution by multiplying by an elementary matrix $G \in \mathbb{R}^{N \times C \times C}$. However, to avoid redundancy in the flow of topological information between channels, we introduce a learnable mask that can dynamically adjust the strength of the topology for each channel. Specifically, we combine the elementary matrix with a mask matrix $M \in \mathbb{R}^{C \times C}$, which is trained jointly with the rest of the model. The resulting matrix after masking represents a unique topology for each channel and enhances the model’s ability to extract spatial features.
$$F' = (G + \mathit{mask}) \times F$$
Here, $G \in \mathbb{R}^{N \times C \times C}$ is an elementary matrix, $\mathit{mask} \in \mathbb{R}^{N \times C \times C}$ is the mask matrix, and $F \in \mathbb{R}^{N \times C}$ and $F' \in \mathbb{R}^{N \times C}$ are the input and output feature maps, respectively.
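For illustration, the sketch below applies a per-joint channel-mixing matrix $G + \mathit{mask}$ to the feature map as in the equation above. The concrete choice of the elementary matrix $G$ (a cyclic channel permutation here) and the zero initialization of the mask are assumptions, since only their roles are described in the text.

```python
import torch
import torch.nn as nn

class NonSharedTopologyShift(nn.Module):
    """Sketch of F' = (G + mask) x F: every joint owns a fixed elementary
    channel-permutation matrix plus a learnable mask, so each channel can
    follow its own (non-shared) topology."""

    def __init__(self, n_joints, channels):
        super().__init__()
        # elementary matrix: a per-joint cyclic channel permutation (illustrative choice)
        g = torch.stack([torch.roll(torch.eye(channels), shifts=n % channels, dims=1)
                         for n in range(n_joints)])          # (N, C, C)
        self.register_buffer("G", g)
        self.mask = nn.Parameter(torch.zeros(n_joints, channels, channels))

    def forward(self, f):                                     # f: (N_joints, C)
        # batched matrix-vector product: one (C x C) mixing matrix per joint
        return torch.einsum("ncd,nd->nc", self.G + self.mask, f)
```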

3.3. Adaptive Multi-Scale Temporal Shift Graph Convolution

To further elaborate, our proposed attention-guided adaptive multi-scale temporal shift graph convolution method introduces a new attention module to the traditional multi-scale convolution. This attention module learns the importance of different scales by assigning weights to each scale during training, allowing the model to focus on the most informative scales. By doing so, our method can effectively capture temporal information at different scales and with different weights, which enables the model to achieve better performance without a significant increase in computational cost. Moreover, the attention-guided adaptive multi-scale temporal shift graph convolution can also avoid the problem of weight bias, which is common in the single adaptive shifted convolution kernel. With our proposed method, the weights of each scale are adaptively adjusted during training, making the model more robust to different input conditions. Our method can significantly improve the temporal shift graph convolution performance by effectively incorporating multi-scale information and optimizing the weights using channel attention.
In our proposed method, we adopt a similar strategy as SKNet [25], which divides the feature map into multiple groups and processes them in parallel. Specifically, as shown in Figure 8, we divide the input feature map equally into four groups in the channel dimension, each processed independently by a group of convolutional layers. This allows for a more efficient and parallelized computation, as well as promoting diversity in the learned features. Furthermore, we apply a channel attention mechanism to learn a set of weights that adaptively adjust the importance of each group so that the network can dynamically allocate resources to the most informative features. By doing so, our method can achieve lightweight topology enhancement without significantly increasing the computational cost.
In addition, we incorporate an attention module to adaptively adjust the weight of each branch in the fusion process. This attention module enhances the network’s performance by focusing on more informative branches while suppressing less informative ones. The fused feature map is then fed into the temporal shift graph convolutional layer for temporal modeling. Finally, we apply global average pooling and fully connected layers for classification. Our model can achieve competitive performance with fewer parameters and less computation than previous methods.
$$F = F_1 \oplus F_2 \oplus F_3 \oplus F_4$$
After obtaining the fused feature map from the four branches, we apply global average pooling (GAP) to extract global information for each channel. Next, we pass the resulting vector through a linear layer, generating dynamic weights for each channel. These weights are then normalized using SoftMax across channels, guiding the selection of information at different scales while keeping the feature information compact.
$$A_1 = \mathrm{SoftMax}\left(\sigma\left(W \, \mathrm{Pooling}(F_1) + b\right)\right)$$
Here, $W \in \mathbb{R}^{C \times C}$ is the weight matrix, $b$ is the bias vector, $\sigma$ denotes the activation function, $\mathrm{Pooling}$ denotes global average pooling, $F_1$ is the first branch (taken as an example), $A_1$ is the corresponding activation values, and $\mathrm{SoftMax}$ denotes the SoftMax function. In detail, after obtaining the global information of each channel via GAP, we pass it through a linear layer to generate dynamic weights for each channel and normalize them with SoftMax across channels, which guides the selection of information at different scales while keeping the feature information compact. Finally, we multiply the activation values of the different weights with each branch’s original features to obtain the re-weighted branches and fuse them. Although the feature maps are equally partitioned, the weights have changed, allowing the model to adjust its focus on different scales and channels based on the input data.
$$F' = A_1 \times F_1 \oplus A_2 \times F_2 \oplus A_3 \times F_3 \oplus A_4 \times F_4$$
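The sketch below shows one way to realize this attention-guided multi-scale fusion: the channels are split into four branches, three of which use temporal convolutions with different kernel sizes while one uses max-pooling (as in the ablation study in Section 5.4), each branch is re-weighted by SoftMax-normalized channel attention derived from global average pooling, and the branches are concatenated back. The kernel sizes, activation, and layer shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleTemporalAttention(nn.Module):
    """Sketch of the attention-guided multi-scale temporal block: split channels into
    four branches (three temporal convolutions of different kernel sizes plus one
    max-pooling branch), re-weight each branch with channel attention, then concatenate."""

    def __init__(self, channels, kernel_sizes=(3, 5, 7)):
        super().__init__()
        assert channels % 4 == 0, "channels are split equally into four groups"
        self.group = channels // 4
        convs = [nn.Conv2d(self.group, self.group, kernel_size=(k, 1), padding=(k // 2, 0))
                 for k in kernel_sizes]
        self.branches = nn.ModuleList(convs + [nn.MaxPool2d(kernel_size=(3, 1), stride=1,
                                                            padding=(1, 0))])
        self.fc = nn.ModuleList([nn.Linear(self.group, self.group) for _ in range(4)])

    def forward(self, x):                                    # x: (batch, C, T, V)
        outs = []
        for i, (branch, fc) in enumerate(zip(self.branches, self.fc)):
            xi = x[:, i * self.group:(i + 1) * self.group]   # channel group for branch i
            fi = branch(xi)                                  # temporal conv or max-pooling
            gap = fi.mean(dim=(2, 3))                        # global average pooling -> (B, group)
            a = F.softmax(torch.relu(fc(gap)), dim=1)        # attention over the group's channels
            outs.append(a.view(-1, self.group, 1, 1) * fi)   # re-weight the branch
        return torch.cat(outs, dim=1)                        # fuse branches along channels
```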

3.4. Attention-Guided Local Shift Graph Convolution

In the context of Shift-GCN, the non-local shift operation has been found to yield better results compared to the local shift operation. The primary reason is that the non-local shift operation helps integrate more information, leading to improved performance. However, this also leads to the transfer of redundant information, which can potentially impact the robustness of the model. In the ablation study, it was observed that the non-local shift operation does not offer significant gains when used with non-shared topology. This is in contrast to the behavior observed with the local shift operation.
Our motivation for pairing the attention module with the local shift operation is multifaceted. First, the attention module enables the local convolution to interact with information beyond the physical connections, allowing it to capture more context and make more informed decisions. Furthermore, the attention module distributes weights instead of averaging them, which helps to filter out secondary information and noise, thus maintaining the robustness and potential of the model. This is particularly important in the case of local shift operations, where the transfer of redundancy information can pose a challenge.
In contrast to Shift-GCN, where non-local information is used directly, and all neighboring information is treated as equally weighted, our approach is more in line with the human kinetic specification. In action recognition, neighboring joints naturally require higher weights because interconnected joints perform most actions. Second, by reusing the channel shift information, our model can enhance the ability of non-shared topology. Finally, by facilitating spatial modeling, attention can effectively integrate with the temporal module, thus improving the model’s overall performance.
Previous work [8,9] in action recognition has often utilized traditional graph convolution combined with a self-attention module to incorporate global information. However, these approaches can be computationally burdensome due to incremental modules. In contrast, we have designed a lightweight and powerful self-attention module to address these issues. Figure 9 demonstrates the architecture of our self-attention module, which can efficiently capture long-range dependencies while maintaining model efficiency. Our approach avoids incremental modules and instead uses a single attention layer with an adaptive gating mechanism to adjust the contribution of each node to the final output. The resulting module is powerful and lightweight, making it well-suited for action recognition tasks.
In the attention module, the feature map $F$ is split into two branches: the channel attention branch and the spatiotemporal attention branch. Let us focus on the channel attention branch. We begin with a spatial skeleton feature map $F \in \mathbb{R}^{N \times C \times T}$. To obtain the channel-wise vector $f$, we apply global average pooling (GAP) to $F$, which shrinks $F$ along the $T \times C$ dimensions and yields a tensor of shape $\mathbb{R}^{N \times 1 \times 1}$. The resulting tensor contains a single value for each of the batch’s $N$ samples, representing each channel’s average activation over the spatial and temporal dimensions. This channel-wise vector $f$ captures the global information about the activations of each channel in the feature map. It is then fed into a channel attention block to learn channel-wise attention weights. These weights are applied to the original feature map $F$ to obtain a channel-wise attention map, highlighting the most informative channels for the task. By employing channel attention, the attention module can selectively enhance the important features in the feature map while suppressing the less informative ones. This allows for a more focused and efficient use of the network’s capacity, leading to better performance on the given task.
$$F = F_c \oplus F_s$$
$$F_c' = \sigma\left(W \, \mathrm{Pooling}(F_c) + b\right) \times F_c$$
In the attention module, the channel attention block takes as input the feature maps $F_c$ of dimensions $\mathbb{R}^{N \times C/2 \times T}$ and outputs the feature maps $F_c'$ of the same dimensions. The channel attention block aims to learn channel-wise attention weights and selectively enhance the important features while suppressing the less informative ones.
To obtain the channel-wise attention weights, we first employ global average pooling (GAP) on the feature maps $F_c$ along the time dimension, resulting in a space–time vector $s$ of dimensions $\mathbb{R}^{N \times C \times 1}$. The space–time vector $s$ represents the average activation of each channel over the temporal dimension and all spatial locations. Next, we feed the vector into a fully connected layer with weight matrix $W$ of dimensions $\mathbb{R}^{C \times C}$ and bias vector $b$, apply an activation function $\sigma$ to the resulting tensor, and follow it with another fully connected layer with weight matrix $W$ and bias vector $b$. Finally, we apply the sigmoid function to obtain the channel-wise attention weights. We multiply the channel-wise attention weights element-wise with the original feature maps $F_c$ to obtain the channel-wise attention map, which highlights the most informative channels and suppresses the less important ones. The resulting attention map is then added to the original feature maps $F_c$ to obtain the final output feature maps $F_c'$. The channel attention block can improve the network’s performance on the given task by selectively enhancing important features while suppressing less informative ones.
$$F_s' = \sigma\left(W \, \mathrm{Pooling}(F_s) + b\right) \times F_s$$
Our attention generation approach differs from previous work in that we generate attention weights through a 1-dimensional weight generation process rather than a fully connected layer. This approach substantially reduces the complexity of the model, making it more efficient and faster to train. Moreover, we use global average pooling (GAP) to compress the extra information and generate the attention weights for both the channel and space dimensions. The GAP operation helps to reduce the spatial dimensions and retain only the most relevant features. We can effectively learn the most informative features by compressing the extra information and generating attention weights through a simple process. Finally, we stitch together the channel and space attention maps in the channel dimension to obtain the final attention map, which can improve the network’s performance on the given task. By selectively enhancing the most informative features while suppressing the less important ones, our attention module can improve the discriminative power of the model and achieve better performance.
$$F' = F_c' \oplus F_s'$$
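A hedged sketch of the two-branch attention module follows: the feature map is split into two channel halves, one half is gated by channel attention and the other by joint-wise spatiotemporal attention, and the halves are concatenated again. The 1-D convolutional weight generators reflect the "1-dimensional weight generation" mentioned above, but the kernel size and the exact pooling axes are assumptions.

```python
import torch
import torch.nn as nn

class UltraLightSTAttention(nn.Module):
    """Sketch of the ultralight spatiotemporal attention: one channel half is re-weighted
    per channel, the other half per joint, and the halves are stitched back together."""

    def __init__(self, kernel_size=9):
        super().__init__()
        pad = kernel_size // 2
        self.channel_att = nn.Conv1d(1, 1, kernel_size, padding=pad)   # slides over channels
        self.spatial_att = nn.Conv1d(1, 1, kernel_size, padding=pad)   # slides over joints

    def forward(self, x):                                   # x: (batch, C, T, V), C assumed even
        c_half = x.size(1) // 2
        xc, xs = x[:, :c_half], x[:, c_half:]

        # channel branch: GAP over time and joints, 1-D conv, sigmoid gate
        wc = xc.mean(dim=(2, 3)).unsqueeze(1)               # (B, 1, C/2)
        wc = torch.sigmoid(self.channel_att(wc)).squeeze(1).view(-1, c_half, 1, 1)

        # spatiotemporal branch: GAP over channels and time, 1-D conv, sigmoid gate
        ws = xs.mean(dim=(1, 2)).unsqueeze(1)               # (B, 1, V)
        ws = torch.sigmoid(self.spatial_att(ws)).squeeze(1).view(-1, 1, 1, xs.size(3))

        return torch.cat([wc * xc, ws * xs], dim=1)         # stitch halves in the channel dim
```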

4. Experiment

In this section, we introduce three large-scale datasets that are commonly used for evaluating the performance of human action recognition models, namely NTU RGB+D 60 [15], NTU RGB+D 120 [16], and Northwestern-UCLA [17]. These datasets consist of diverse human actions captured from different viewpoints and in various environments, making them challenging benchmarks for evaluating the effectiveness of human action recognition models. We conduct exhaustive ablation studies on each improved section to verify the effectiveness and efficiency of our proposed improvements. This involves systematically evaluating the model’s performance by removing or modifying specific components and analyzing the impact on the overall performance. Through these experiments, we can determine which components are critical for achieving high accuracy and efficiency and make informed design decisions.

4.1. Performance Metrics

In this paper, we evaluate the performance of our model using several standard metrics commonly employed in action recognition tasks:
Top-1 Accuracy: This metric measures the proportion of correctly classified samples among all test samples. It is defined as:
$$\text{Top-1 Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Samples}}$$
A higher Top-1 accuracy indicates better performance.
FLOPs (G): FLoating-point OPerations (FLOPs) quantify the computational complexity of the model. FLOPs are calculated from the number of operations performed in the forward pass and are often reported in giga-operations (G), i.e., $10^9$ operations. Lower FLOPs suggest better computational efficiency.
X-Sub and X-Setup: These refer to the two evaluation protocols used in the NTU datasets [15,16]. X-Sub (Cross-Subject) divides the dataset into training and test sets based on subjects. X-Setup (Cross-Setup) splits the dataset based on different camera setups, providing a more challenging evaluation of the model’s generalization ability across different perspectives.
F1-Score: The F1-score is the harmonic mean of precision and recall, defined as:
$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
where precision is the ratio of true positive predictions to all positive predictions, and recall is the ratio of true positive predictions to all actual positive cases. The F1-score provides a balance between precision and recall, which is especially useful in imbalanced datasets.
Ratio: This metric compares the computational efficiency of different models by considering the relationship between performance (e.g., accuracy or F1-score) and computational cost (e.g., FLOPs). It can be expressed as:
$$\text{Ratio} = \frac{\text{Performance Metric}}{\text{FLOPs (G)}}$$
A higher ratio indicates that the model achieves better performance with fewer computational resources.
These metrics collectively provide a comprehensive evaluation of our model’s performance, covering aspects such as accuracy, computational efficiency, and robustness in different testing conditions.
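For reference, the metrics above translate into a few lines of code; the function names and the assumed inputs (class logits, a label array, confusion counts, and FLOPs in gigaflops) are our own conventions for this sketch.

```python
import numpy as np

def top1_accuracy(logits, labels):
    """Top-1 accuracy: fraction of samples whose highest-scoring class is correct."""
    return float(np.mean(np.argmax(logits, axis=1) == labels))

def f1_score(tp, fp, fn):
    """F1 = 2 * precision * recall / (precision + recall), from confusion counts."""
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return 2 * precision * recall / max(precision + recall, 1e-12)

def efficiency_ratio(performance, flops_g):
    """Ratio of a performance metric (e.g., Top-1 accuracy) to FLOPs in gigaflops."""
    return performance / flops_g
```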

4.2. Datasets and Settings

4.2.1. NTU60

NTU60 RGB+D Dataset [15] consists of 60 action categories, including daily activities (e.g., reading, writing), health-related actions (e.g., falling, sneezing), and interactive behaviors (e.g., shaking hands, hugging). These activities involve varying levels of motion complexity, making it difficult to distinguish between similar actions, particularly those involving subtle joint movements. The dataset includes 56,880 action clips performed by 40 actors and recorded from 17 different camera setups. It contains 25 joint points and follows two evaluation protocols: (1) Cross-View, where the samples from cameras 2 and 3 are used for training and the samples from camera 1 are used for testing, and (2) Cross-Subject, where half of the subjects in the dataset are used for training and the remaining samples are used for testing.

4.2.2. NTU120

NTU120 RGB+D Dataset [16] extends the NTU60 dataset to 120 action categories, adding a wider range of activities such as sports movements and multi-person interactions. This results in more diverse and complex actions captured from more varied camera placements and performed by more actors, though the number of joint points remains unchanged at 25. The dataset provides two evaluation protocols: (1) Cross-Setup (C-Setup) and (2) Cross-Subject (C-Sub). These diverse activities further challenge models to accurately classify actions, especially those with similar motions.

4.2.3. Northwestern-UCLA

Northwestern-UCLA Dataset [17] contains 1494 video clips, categorized into 10 action types, such as picking up, walking around, and sitting down, performed by 10 actors. Despite having fewer categories compared to the NTU datasets, the variation in camera placements introduces complexity in accurately capturing spatiotemporal features. The protocol follows [14], where samples from the first two cameras are used for training, and the third camera’s samples are used for testing.

4.2.4. Experiment Settings

For NTU RGB+D and NTU-120 RGB+D, we use Adam with an initial learning rate of 0.001 to train the model for 150 epochs with a batch size of 64; the learning rate decays by a factor of 10 at the 70th, 90th, and 110th epochs. For Northwestern-UCLA, the batch size is 16, and the model is trained with SGD with a momentum of 0.9 and a weight decay of 0.0005 for 80 epochs; the learning rate is set to 0.1 and divided by 10 at epochs 40 and 60. All experiments are conducted on the PyTorch platform with one RTX 3060 GPU. The data processing is similar to Shift-GCN [14].
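For reproducibility, the optimizer and learning-rate schedules described above map directly onto PyTorch, as sketched below; the dataset-name strings and the helper function are our own conventions, with all hyperparameter values taken from the text.

```python
import torch

def build_optimizer(model, dataset):
    """Sketch of the training schedules described in the experiment settings."""
    if dataset in ("ntu60", "ntu120"):
        # Adam, lr 0.001, 150 epochs, batch size 64; decay x0.1 at epochs 70, 90, 110
        optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
        scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                         milestones=[70, 90, 110], gamma=0.1)
    else:  # Northwestern-UCLA
        # SGD, lr 0.1, momentum 0.9, weight decay 0.0005, 80 epochs; decay x0.1 at 40, 60
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                    momentum=0.9, weight_decay=0.0005)
        scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                         milestones=[40, 60], gamma=0.1)
    return optimizer, scheduler
```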

4.2.5. Network Settings

Figure 4 shows that we employ the same backbone (Shift-GCN) to construct our spatiotemporal model. The Shift-GCN backbone comprises one input block and nine residual blocks, each containing a shift spatial convolution and a shift temporal convolution.

5. Ablation Study

5.1. Topology Shared/Non-Shared Shift Graph Convolution

In this subsection, we present evidence that our spatial non-shared topology shift graph operation significantly enhances the performance of the local spatial shift graph operation. We use two spatial shift graph operations from Shift-GCN as a baseline: (1) A point-wise convolution with a local shift operation and (2) A point-wise convolution with a non-local shift operation. As shown in Table 1, our non-shared topology significantly outperforms the baseline models. Specifically, our non-shared shift operation improves local shift operations by 1.0% on the NTU RGB+D 120 X-sub task. Interestingly, introducing learnable masks does not significantly benefit the non-local shift operation. In the non-local shift operation, each slice is shifted into the information on a different joint, making the individual nodes of the same slice more specific. Moreover, our shift operations are computed against pre-defined indexes, ensuring improved performance without introducing additional computational burden in the model.

5.2. Attention-Guided/Attention-Non-Guided Shift Graph Convolution

In this subsection, we verify that the local shift graph operation with the attention module can outperform the non-local shift graph operation with comparable computational complexity. Our results, shown in Table 2, demonstrate that the attention block can effectively aggregate global information with a minimal computational burden for both regular and local shift graph convolution. Moreover, conventional convolution with attention performs better than non-local shift graph convolution. These results suggest that while non-local operations can extract global information to a certain extent, they can become a bottleneck for model fitting. On the other hand, the attention mechanism can efficiently capture the relevant global information and enhance model performance, making it a better choice for spatial modeling in action recognition tasks.
In this subsection, we introduce an attention module with a slightly increased computational burden of 0.2%. Specifically, we adopt a dot-product 0-vector-inspired attention mechanism to effectively improve performance without introducing a significant computational cost. As shown in Table 2, the attention module significantly enhances the model’s ability to aggregate global information, even for regular convolution and local shift graph convolution. The negligible computational cost of the attention module makes it a viable and effective option for improving model performance.

5.3. Temporal Graph Convolution Modeling

In this subsection, we evaluate the effectiveness and efficiency of different temporal models while keeping the spatial model fixed as the regular spatial convolution of ST-GCN [7]. The results are summarized in Table 3. Traditional methods using a single convolution kernel are computationally burdensome and ineffective. Multi-scale convolution kernels can combine receptive fields at several scales, but their computational effort remains on the order of conventional convolution. Shift temporal convolution has a considerable computational advantage, but its performance is only comparable to regular convolution. Adaptive shift convolution outperforms regular convolution but falls short of multi-scale convolution. Regarding our proposed method, although we follow regular multi-scale convolution [19] in replacing one of the branches with max-pooling, introducing attention inevitably increases the computational overhead by 0.6%. Nevertheless, our method achieves significantly higher accuracy than the previous approaches at a computational magnitude comparable to the shift method, demonstrating that it effectively combines the strengths of shift temporal convolution and multi-scale convolution.

5.4. Different Combinations of Adaptive Multi-Scale

In this subsection, we aim to evaluate the effectiveness of different sizes of native shift convolution kernels and investigate the role of the max-pooling layer. As shown in Table 4, we use a single native shift convolution with kernel size 7 as the baseline while attempting to fuse native shift branches at different scales. We observe that with two branches, the effect is comparable to that of a single branch, whereas with three or more branches, the model slightly outperforms the baseline, indicating that the multi-scale gain plays a role in temporal modeling. Furthermore, the performance improvement is more significant when we replace one of the larger convolutional branches with a max-pooling layer, highlighting the importance of condensed time-frame information for the network, which has been overlooked in previous works. Finally, the attention module helps the model adaptively adjust the weights of the different branches, resulting in a 2.3% improvement in accuracy over the baseline, proving the effectiveness of the proposed approach.
In this subsection, we analyze the differences in mean attention weights associated with the three kernels and the max-pooling branch over all layers on NTU-120 RGB+D X-sub, as shown in Figure 10. The results show that different layers require different temporal receptive fields. The top layers (near the output) tend to have larger temporal receptive fields than the bottom layers (near the input), similar to Shift-GCN’s adaptive shift operation. This implies that the top layers require a larger temporal context, while the bottom layers tend to learn spatial relations. The intermediate layers are biased towards learning temporal information with mixed scales. Moreover, max-pooling is crucial in purifying temporal information, especially in the top and middle layers. Our attention-guided adaptive multi-scale temporal convolution provides three key benefits: (1) it exploits multi-scale information effectively; (2) it significantly reduces the computational burden compared to conventional convolution while achieving substantial accuracy improvements; and (3) it adapts during training and transfers well, adjusting the weights of different branches to match each layer’s receptive-field needs and thereby improving model performance. Overall, our proposed method is a computationally efficient and effective solution for multi-scale temporal modeling, providing improved accuracy with a reduced computational burden.

6. Comparison with the State-of-the-Art

6.1. Comparison with the State-of-the-Art on Three Datasets

In many current state-of-the-art methods, multi-stream fusion strategies are employed. To ensure a fair comparison, we also utilize a multi-stream fusion strategy in our experiments, following the approach used in [18], which employs four streams: joint, bone, joint motion, and bone motion. Our model has three settings: 1-stream, which only uses the joint stream; 2-stream, which uses both joint and bone streams; and 4-stream, which uses all four streams. We compare our model with state-of-the-art methods on three datasets: NTU RGB+D 120, NTU RGB+D, and NW-UCLA, as shown in Table 5, Table 6 and Table 7, respectively. Compared to Shift-GCN, our method only increases the computational overhead by 0.8% while significantly improving the accuracy on all three datasets. Our method achieves accuracy comparable to state-of-the-art methods with more than twice their computational efficiency. This demonstrates the effectiveness and efficiency of our approach in comparison to current state-of-the-art methods.
Our method demonstrates superior performance compared to other methods on both the NTU RGB+D 60 dataset and the Northwestern-UCLA dataset. Additionally, on the NTU RGB+D 120 dataset, our model achieves comparable results to the previous state-of-the-art method EfficientGCN-B4 [9] while requiring significantly less computation, with a reduction of 6.07×. Furthermore, our method outperforms Shift-GCN in each stream by a substantial margin. This highlights the effectiveness of our attention-guided adaptive multi-scale temporal convolution in improving the accuracy of action recognition while maintaining a reasonable computational cost.

6.2. Model Complexity and Efficiency Comparison

To further emphasize the correlation between model accuracy and efficiency, we adopt a more compelling approach to comparison. In Table 8, we can see that our model holds a significant accuracy advantage over previous models while also maintaining a large efficiency advantage. Compared to Shift-GCN, our model has a similar computational burden but achieves significantly better accuracy. Moreover, our 4s-AT-Shift-GCN model with 4-stream fusion outperforms all previous methods in terms of accuracy while effectively balancing the computational burden. This demonstrates the efficiency and effectiveness of our proposed method in achieving state-of-the-art performance while maintaining reasonable computational costs.
In the top part of Table 8, it can be observed that our 1s AT-Shift-GCN is comparable to the 1s Shift-GCN in terms of computational scale while significantly outperforming the other state-of-the-art method in terms of accuracy. In the middle part, the 2s AT-Shift-GCN outperforms the 4s-Shift-GCN, indicating that our model has a significant advantage over the Shift-GCN and other existing works. In the last part, our model surpasses previous models to achieve an optimal level of accuracy, further demonstrating the validity and capability of the proposed approach. With this persuasive comparison, we can conclude that our model achieves a superior balance between efficiency and accuracy compared to existing state-of-the-art methods.

6.3. Discussion

In Table 9, we compare the performance of our model with Shift-GCN on some challenging action categories that are often confused, such as “reading a book” and “writing on a paper”. Our model often outperforms Shift-GCN, indicating its ability to effectively distinguish between similar actions. This further demonstrates the validity and effectiveness of our approach.
The comparison of our model with Shift-GCN on confusing action categories in Table 9 highlights the effectiveness of our approach. We have selected four sets of similar movements that are challenging to distinguish, such as writing and reading, which involve similar joints and similar motion magnitudes over time. By learning non-shared topological information in the spatial dimension with the enhanced shift graph convolution and integrating information at multiple scales in the temporal dimension, our model can better distinguish finer temporal differences. Additionally, the attention mechanism facilitates the interaction of spatial and temporal information, leading to improved recognition of similar actions. Our model shows a significant improvement in accuracy compared to Shift-GCN at the same computational scale, demonstrating the validity and capability of our approach.
This performance advantage can be further understood by analyzing key architectural choices within our model. The choice between single-head and multi-head attention impacts performance and computational efficiency. While multi-head attention offers richer feature representations, it increases complexity. In contrast, single-head attention is more computationally efficient, which aligns with our model’s goal of balancing performance and efficiency. Additionally, the adaptive multi-scale temporal convolution is essential for capturing temporal dependencies across different scales. Removing or altering it would reduce the model’s flexibility in handling temporal variations, likely degrading performance. Finally, the ultralight spatiotemporal attention module plays a critical role in efficiently fusing spatial and temporal information. Without it, the model would lose its ability to focus on key interactions, resulting in lower accuracy, though with reduced computational cost. Thus, all these components contribute to the model’s superior performance with minimal computational overhead.

7. Conclusions

This work presents an attention-guided and topology-enhanced shift graph convolutional network for skeleton-based action recognition. Our model captures spatiotemporal relationships while effectively learning channel-wise topological features. Compared to Shift-GCN, our model offers significant improvements and enhancements with a comparable computational burden. Compared to previous work, our model operates at a multiplicatively smaller computational scale while improving accuracy, validating its efficiency. On three datasets, the proposed AT-Shift-GCN outperforms state-of-the-art methods.
Despite its success, AT-Shift-GCN has some limitations. Its performance may decline in environments with noisy or incomplete skeleton data, reducing its robustness in real-world applications. Additionally, while computational efficiency is improved, scaling to larger datasets or real-time processing with high-frequency input remains challenging. Future work could enhance robustness by integrating hybrid approaches combining skeleton data with modalities like RGB or depth. Further optimizations for real-time processing and scalability, such as more efficient attention mechanisms, are needed. Expanding AT-Shift-GCN to handle multi-modal data would also increase its versatility for broader action recognition tasks.

Author Contributions

Conceptualization, H.C.; methodology, H.C. and C.L.; software, H.C.; validation, H.C. and M.L.; formal analysis, H.C.; investigation, H.C. and C.L.; resources, L.J.; data curation, H.C.; writing—original draft preparation, H.C.; writing—review and editing, H.C. and C.L.; visualization, M.L.; supervision, L.J.; project administration, L.J.; funding acquisition, L.J. H.C. and C.L. contributed equally to this paper. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by JSPS KAKENHI Grant Number 22K12114 and NEDO Intensive Support for Young Promising Researchers Number 21502121-0 and JKA and its promotion funds from KEIRIN RACE.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. NTU RGB+D 60 and NTU RGB+D 120 dataset: https://rose1.ntu.edu.sg/dataset/actionRecognition/, Northwestern-UCLA dataset: https://wangjiangb.github.io/my_data.html, all accessed on 7 July 2024.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Wang, J.; Liu, Z.; Wu, Y.; Yuan, J. Mining actionlet ensemble for action recognition with depth cameras. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 1290–1297.
2. Zhang, Y.; Cao, C.; Cheng, J.; Lu, H. EgoGesture: A new dataset and benchmark for egocentric hand gesture recognition. IEEE Trans. Multimed. 2018, 20, 1038–1050.
3. Gui, L.Y.; Zhang, K.; Wang, Y.X.; Liang, X.; Moura, J.M.; Veloso, M. Teaching robots to predict human motion. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 562–567.
4. Du, Y.; Fu, Y.; Wang, L. Skeleton based action recognition with convolutional neural network. In Proceedings of the 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), Kuala Lumpur, Malaysia, 3–6 November 2015; pp. 579–583.
5. Li, C.; Zhong, Q.; Xie, D.; Pu, S. Skeleton-based action recognition with convolutional neural networks. In Proceedings of the 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Hong Kong, China, 10–14 July 2017; pp. 597–600.
6. Lee, I.; Kim, D.; Kang, S.; Lee, S. Ensemble deep learning for skeleton-based action recognition using temporal sliding lstm networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1012–1020.
7. Yan, S.; Xiong, Y.; Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018.
8. Song, Y.F.; Zhang, Z.; Shan, C.; Wang, L. Stronger, faster and more explainable: A graph convolutional baseline for skeleton-based action recognition. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1625–1633.
9. Song, Y.; Zhang, Z.; Shan, C.; Wang, L. Constructing stronger and faster baselines for skeleton-based action recognition. arXiv 2021, arXiv:2106.15125.
10. Chen, Z.; Li, S.; Yang, B.; Li, Q.; Liu, H. Multi-Scale Spatial Temporal Graph Convolutional Network for Skeleton-Based Action Recognition. AAAI Conf. Artif. Intell. 2021, 35, 1113–1122.
11. Rahevar, M.L.; Ganatra, A. Spatial–Temporal gated graph attention network for skeleton-based action recognition. Pattern Anal. Appl. 2023, 26, 929–939.
12. Rahevar, M.; Ganatra, A.; Saba, T.; Rehman, A.; Bahaj, S.A. Spatial–Temporal Dynamic Graph Attention Network for Skeleton-Based Action Recognition. IEEE Access 2023, 11, 21546–21553.
13. Shiraki, K.; Hirakawa, T.; Yamashita, T.; Fujiyoshi, H. Spatial Temporal Attention Graph Convolutional Networks with Mechanics-Stream for Skeleton-Based Action Recognition. In Proceedings of the Computer Vision—ACCV 2020, Kyoto, Japan, 30 November–4 December 2020; Ishikawa, H., Liu, C.L., Pajdla, T., Shi, J., Eds.; Springer: Cham, Switzerland, 2021; pp. 341–357.
14. Cheng, K.; Zhang, Y.; He, X.; Chen, W.; Cheng, J.; Lu, H. Skeleton-based action recognition with shift graph convolutional network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 183–192.
15. Shahroudy, A.; Liu, J.; Ng, T.T.; Wang, G. NTU RGB+D: A large scale dataset for 3D human activity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1010–1019.
16. Liu, J.; Shahroudy, A.; Perez, M.; Wang, G.; Duan, L.Y.; Kot, A.C. NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2684–2701.
17. Wang, J.; Nie, X.; Xia, Y.; Wu, Y.; Zhu, S.C. Cross-view action modeling, learning and recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 2649–2656.
18. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 12026–12035.
19. Liu, Z.; Zhang, H.; Chen, Z.; Wang, Z.; Ouyang, W. Disentangling and unifying graph convolutions for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 143–152.
20. Chen, H.; Li, M.; Jing, L.; Cheng, Z. Lightweight Long and Short-Range Spatial-Temporal Graph Convolutional Network for Skeleton-Based Action Recognition. IEEE Access 2021, 9, 161374–161382.
21. Bruna, J.; Zaremba, W.; Szlam, A.; LeCun, Y. Spectral networks and locally connected networks on graphs. arXiv 2013, arXiv:1312.6203.
22. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907.
23. Bavil, A.F.; Damirchi, H.; Taghirad, H.D. Action Capsules: Human Skeleton Action Recognition. Comput. Vis. Image Underst. 2023, 233, 103722.
24. Cheng, K.; Zhang, Y.; Cao, C.; Shi, L.; Cheng, J.; Lu, H. Decoupling gcn with dropgraph module for skeleton-based action recognition. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXIV 16. Springer: Cham, Switzerland, 2020; pp. 536–553.
25. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective kernel networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 510–519.
26. Si, C.; Jing, Y.; Wang, W.; Wang, L.; Tan, T. Skeleton-based action recognition with spatial reasoning and temporal stack learning. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 103–118.
27. Thakkar, K.; Narayanan, P. Part-based graph convolutional network for action recognition. arXiv 2018, arXiv:1809.04983.
28. Gao, X.; Hu, W.; Tang, J.; Liu, J.; Guo, Z. Optimized skeleton-based action recognition via sparsified graph regression. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 601–610.
29. Li, M.; Chen, S.; Chen, X.; Zhang, Y.; Wang, Y.; Tian, Q. Actional-structural graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 3595–3603.
30. Song, Y.F.; Zhang, Z.; Wang, L. Richly activated graph convolutional network for action recognition with incomplete skeletons. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 1–5.
31. Si, C.; Chen, W.; Wang, W.; Wang, L.; Tan, T. An attention enhanced graph convolutional lstm network for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 1227–1236.
32. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Skeleton-based action recognition with directed graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 7912–7921.
33. Zhang, P.; Lan, C.; Zeng, W.; Xing, J.; Xue, J.; Zheng, N. Semantics-guided neural networks for efficient skeleton-based human action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 1112–1121.
34. Peng, W.; Hong, X.; Chen, H.; Zhao, G. Learning graph convolutional network for skeleton-based human action recognition by neural searching. AAAI Conf. Artif. Intell. 2020, 34, 2669–2676.
35. Huang, L.; Huang, Y.; Ouyang, W.; Wang, L. Part-level graph convolutional network for skeleton-based action recognition. AAAI Conf. Artif. Intell. 2020, 34, 11045–11052.
36. Ye, F.; Pu, S.; Zhong, Q.; Li, C.; Xie, D.; Tang, H. Dynamic GCN: Context-enriched topology learning for skeleton-based action recognition. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 55–63.
37. Veeriah, V.; Zhuang, N.; Qi, G.J. Differential recurrent neural networks for action recognition. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4041–4049.
38. Wang, J.; Liu, Z.; Wu, Y.; Yuan, J. Learning actionlet ensemble for 3D human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 36, 914–927.
39. Du, Y.; Wang, W.; Wang, L. Hierarchical recurrent neural network for skeleton based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1110–1118.
40. Zhang, Q.L.; Yang, Y.B. Sa-net: Shuffle attention for deep convolutional neural networks. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 2235–2239.
Figure 1. The FLOPs vs. Accuracy on NTU120 X-sub setting. (2S-AGCN [18], 4s-Shift-GCN [14], Efficient-Net [9], MS-G3D [19], NLB-ACSE [20], PA-ResGCN-B19 [8]).
Figure 2. Local shift operation dynamically adjusts how a node aggregates information from its immediate neighbors, staying within its local neighborhood for flexible information gathering.
Figure 3. Non-local shift operation allows a node to aggregate information from distant, non-adjacent nodes in the graph, extending beyond its immediate neighborhood for a more global perspective.
Figure 4. The structure of Shift-GCN.
Figure 5. The structure of AT-Shift-GCN.
Figure 6. Non-shared topology in local shift operation: Each channel has its own unique topology for aggregating information from its immediate neighboring nodes.
Figure 7. Non-shared topology in non-local shift operation: Each channel has its own distinct topology for aggregating information from distant, non-adjacent nodes.
Figure 8. Attention-guided adaptive multi-scale temporal shift graph convolution.
Figure 9. Lightweight self-attention module.
Figure 10. The mean attention weights and activation patterns across layers for max-pooling and native kernels (3, 5, 7).
Table 1. Comparisons between the baseline and our non-shared topology spatial shift graph convolution on NTU-120 RGB+D X-sub task.
Model | Top-1
Local shift [14] | 79.8
Non-local shift [14] | 80.3
Non-shared local shift | 80.8
Non-shared local shift with mask | 81.0
Non-shared non-local shift | 80.3
Non-shared non-local shift with mask | 80.4
Table 2. Comparisons between the non-local shift operation graph convolution and our attention-guided local shift graph convolution on NTU-120 RGB+D X-sub task.
Model | Top-1 | FLOPs (G)
Regular conv [7] | 73.2 | 4.0
Local shift graph conv [14] | 79.8 | 1.1
Non-local shift graph conv [14] | 80.3 | 1.1
Regular conv with att | 80.5 | 4.0+
Local shift graph conv with att | 81.4 | 1.1+
Non-local shift graph conv with att | 79.9 | 1.1+
Table 3. Comparisons between other temporal convolution and our attention-guided adaptive multi-scale temporal convolution on NTU-120 RGB+D X-sub task.
Model | Top-1 | FLOPs (G)
Regular conv [7] | 73.2 | 12.17
Regular multi-scale conv [19] | 75.1 | 4.06−
Native shift conv [14] | 73.4 | 1.1
Adaptive shift conv [14] | 74.4 | 1.1
Adaptive multi-scale conv | 75.7 | 1.1+
Table 4. Comparisons between different scales and the role of max-pooling on NTU-120 RGB+D X-sub task.
Native 3Native 5Native 7Native 9PoolingAttTop-1
×××--79.8
××--79.6
××--79.9
××--79.4
××--80.0
××--79.7
×--80.4
×--80.1
×--80.2
×--80.3
--80.4
×-80.9
×-80.6
×-80.4
×-80.2
×82.1
×81.7
×81.2
×81.0
Table 5. Accuracy result with X-Sub and X-View on NTU60.
Method | X-Sub | X-View
ST-GCN [7] | 81.5 | 88.3
SR-TSL [26] | 84.8 | 92.4
PB-GCN [27] | 87.5 | 93.2
GR-GCN [28] | 87.5 | 94.3
AS-GCN [29] | 85.9 | 93.5
2S-AGCN [18] | 88.5 | 95.1
RA-GCN [30] | 88.7 | 94.3
AGC-LSTM [31] | 89.2 | 95.0
DGNN [32] | 89.9 | 96.1
SGN [33] | 89.0 | 94.5
NAS-GCN [34] | 89.4 | 95.7
4s-Shift-GCN [14] | 90.7 | 96.5
PL-GCN [35] | 89.2 | 95.0
Dynamic-GCN [36] | 91.5 | 96.0
PA-ResGCN-B19 [8] | 90.9 | 96.0
MS-G3D [19] | 91.5 | 96.2
NLB-ACSE [20] | 91.0 | 96.1
MST-GCN [10] | 91.5 | 96.6
EfficientGCN-B4 [9] | 91.7 | 95.7
Shift-GCN (1-stream) | 87.8 | 95.1
Shift-GCN (2-stream) | 89.7 | 96.0
Shift-GCN (4-stream) | 90.7 | 96.5
AT-Shift-GCN (1-stream) | 89.1 | 95.7
AT-Shift-GCN (2-stream) | 91.2 | 96.8
AT-Shift-GCN (4-stream) | 91.7 | 97.1
Bold values in the table represent the best results in the comparison.
Table 6. Accuracy result with X-Sub and X-Setup on NTU120.
Method | X-Sub | X-Setup
ST-GCN [7] | 70.7 | 73.2
AS-GCN [29] | 77.9 | 78.5
2S-AGCN [18] | 82.5 | 84.2
SGN [33] | 79.2 | 81.5
4s-Shift-GCN [14] | 85.9 | 87.6
PA-ResGCN-B19 [8] | 87.3 | 88.3
MS-G3D [19] | 86.9 | 88.4
NLB-ACSE [20] | 86.2 | 88.1
MST-GCN [10] | 87.5 | 88.8
EfficientGCN-B4 [9] | 88.3 | 89.1
Shift-GCN (1-stream) | 80.9 | 83.2
Shift-GCN (2-stream) | 85.3 | 86.6
Shift-GCN (4-stream) | 85.9 | 87.6
AT-Shift-GCN (1-stream) | 82.8 | 85.2
AT-Shift-GCN (2-stream) | 86.8 | 88.2
AT-Shift-GCN (4-stream) | 88.5 | 89.0
Bold values in the table represent the best results in the comparison.
Table 7. Accuracy result on the Northwestern–UCLA dataset.
Method | Top-1
Lie Group [37] | 74.2
Actionlet ensemble [38] | 76.0
HBRNN-L [39] | 78.5
AGC-LSTM [31] | 93.3
4s-Shift-GCN [14] | 94.6
Shift-GCN (1-stream) | 92.5
Shift-GCN (2-stream) | 94.2
Shift-GCN (4-stream) | 94.6
AT-Shift-GCN (1-stream) | 93.4
AT-Shift-GCN (2-stream) | 94.8
AT-Shift-GCN (4-stream) | 95.4
Bold values in the table represent the best results in the comparison.
Table 8. Accuracy result with Top-1, FLOPs, and Ratio on NTU60.
Method | Top-1 | FLOPs (G) | Ratio
1s-AT-Shift-GCN | 95.7 | 2.51 | 1×
SR-TSL [26] | 92.4 | 4.20 | 1.67×
RA-GCN [30] | 93.5 | 32.80 | 13.06×
AS-GCN [29] | 93.5 | 26.76 | 10.66×
2S-AGCN [18] | 95.1 | 37.32 | 14.87×
EfficientGCN-B0 [9] | 94.7 | 3.08 | 1.23×
EfficientGCN-B2 [9] | 95.5 | 6.12 | 2.44×
EfficientGCN-B4 [9] | 95.7 | 15.24 | 6.07×
1s-Shift-GCN [14] | 95.1 | 2.49 | 0.99×
2s-AT-Shift-GCN | 96.8 | 5.02 | 1×
PA-ResGCN-B19 [8] | 96.0 | 18.52 | 3.69×
NLB-ACSE [20] | 96.1 | 14.83 | 2.95×
2s-Shift-GCN [14] | 96.0 | 4.99 | 0.99×
4s-Shift-GCN [14] | 96.5 | 9.98 | 1.99×
MS-G3D [19] | 96.2 | 48.88 | 9.73×
4s-AT-Shift-GCN | 97.1 | 10.05 | 1×
Bold values in the table represent the best results in the comparison.
Table 9. Comparison of accuracy and F1-score by 1-stream model on NTU60 (X-sub setting).
Action | Shift-GCN | F1-Score | AT-Shift-GCN | F1-Score
Writing | 58.46 | 0.54 | 64.10 | 0.61
Reading | 59.34 | 0.53 | 68.13 | 0.59
Put on a shoe | 75.09 | 0.67 | 80.95 | 0.71
Take off a shoe | 72.16 | 0.71 | 72.99 | 0.74
Play with phone/tablet | 58.18 | 0.58 | 67.27 | 0.68
Type on a keyboard | 58.18 | 0.62 | 69.09 | 0.66
Chest pain | 74.82 | 0.73 | 81.02 | 0.83
Back pain | 78.10 | 0.83 | 90.14 | 0.87
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
