1. Introduction
Human action recognition has been extensively applied in fields such as video understanding [1], human–computer interaction [2], and virtual reality [3]. Compared to methods that operate on raw RGB video [4], skeleton-based action recognition approaches, which explicitly encode human joint coordinates, are less affected by lighting conditions or moving backgrounds. They are also more robust in representing action variations from less data. Consequently, interest in the field has increasingly shifted toward skeleton-based action recognition methods [5,6,7].
Early skeleton-based action recognition algorithms typically relied on hand-crafted features, using geometric transformations to describe the spatial relationships among joints, such as relative joint positions [8] and the movements of different body parts [9]. However, these techniques generalize poorly and struggle to capture spatial and temporal features concurrently. In recent years, with the rapid development of deep learning [10,11], data-driven approaches have attracted increasing attention, notably Recurrent Neural Networks (RNNs) [12] and Convolutional Neural Networks (CNNs) [13]. RNNs inherently excel at modeling sequential data, making them readily applicable to skeleton-based action recognition. Shahroudy et al. [14] transformed the 3D coordinates of human joints into a time series and leveraged an RNN for feature extraction. Following [14], many contemporary methods have adopted RNNs and reported promising results [15,16,17]. Alternatively, a CNN can transform skeleton data into pseudo-images to model spatiotemporal dynamics. The dual-stream CNN method [18] introduces a skeleton transformer module for learning joint representations. However, RNN- and CNN-based methods cannot directly exploit the natural graph structure of the human skeleton.
Because skeletal data form a non-Euclidean structure, the modeling capacity of RNNs [12] and CNNs [13] falls short of capturing inter-joint relationships. To tackle this issue, Graph Convolutional Networks (GCNs) were introduced to skeleton-based action recognition, yielding excellent results. Yan et al. [5] pioneered the use of GCNs for skeleton data modeling, proposing the Spatio-Temporal Graph Convolutional Network (ST-GCN), which is built on a predefined graph subject to topological constraints. However, the ST-GCN struggles to learn data-dependent relationships between skeletal nodes that lack physical connections. Hence, 2s-AGCN [19] was proposed, an adaptive dual-stream graph convolution that allows new connections beyond the natural ones for dynamic graph structure adjustment; its graph topology can be trained end-to-end or independently. Liu et al. [20] proposed a 3D graph convolution, unifying spatial and temporal feature extraction for the first time. Zhang et al. [21] enriched node information by introducing joint semantics as an additional feature dimension. Chen et al. [22] proposed the Channel Topology Refinement Graph Convolution Network (CTR-GCN), which captures per-channel spatial dependencies between nodes.
While skeleton action recognition based on GCNs has made some progress in increasing recognition accuracy, the approach still has several drawbacks:
(1). The ST-GCN [5] addressed the challenge of manual graph topology design by learning an adjacency matrix, using edge weight multiplication to construct the graph structure. However, the ST-GCN only forms a graph reflecting natural human connections; it overlooks links between joints without physical connections, which prevents new connections from being added to the graph. Because its structure is fixed, predictions may be suboptimal for samples from diverse action categories. Existing models [19,20,22] also fail to make full use of prior knowledge about the specific movement patterns the human body follows in daily activities.
(2). According to human motion coordination theory, encoding processes can track relative movements among body parts while remaining invariant to varying body sizes. Ref. [23] notes that high-order encoded features can be easily incorporated into existing action recognition frameworks, complementing joint and bone features. A coordination representation could therefore be designed to include such higher-order features for understanding motion characteristics, something existing models do not consider.
(3). The relationships between the position coordinates of skeletal nodes are often overlooked, as is the fact that the importance of each node's position differs across actions. Moreover, when processing a skeleton sequence, it is more appropriate to focus on the frames that carry representative action features.
This work tackles the aforementioned issues from two angles. First, spatiotemporal representation learning is split into spatial and temporal modeling. For spatial modeling, knowledge of daily human activities is used to preprocess and interpret skeletal data, since studying everyday human movements can reveal underlying patterns shared across behaviors. To this end, we present a Multi-level Topological Channel Attention Module (MTC) combined with a Human Movement Coordination Module (CM). For temporal modeling, we devise a Multi-scale Global Spatiotemporal Attention Module (MGS) built on multi-scale temporal convolution. Second, the consistent application of attention mechanisms in both spatial and temporal modeling accommodates variations in the significance of spatiotemporal data. The key aspects of these modules can be summarized as follows:
(1). The Multi-level Topological Channel Attention Module (MTC) and the Coordination Module (CM) extract prior knowledge and coordination features of the human body. These extracted features effectively enhance the base model's precision at both coarse and fine granularity.
(2). The Multi-scale Global Spatiotemporal Attention Module (MGS) unifies the causal convolution module and the time attention module with a masking approach, targeting two critical goals. First, this design effectively prevents future information leakage, ensuring that the model can only predict and compute attention using past and present information. Second, by introducing the masked time attention module, the model can adaptively focus on key feature areas at different time locations, thereby better capturing significant information in the time-series data. This comprehensive attention mechanism gives the model stronger expressive ability and better context understanding when processing time-series data. The framework diagram of the proposed Multi-level Topological Channel Attention Network can be viewed in Figure 1.
The structure of this paper is as follows: Section 1 provides a succinct overview of the history of action recognition and previously employed methods. Section 2 offers a concise introduction to skeleton action recognition, along with background on spatiotemporal representation learning and attention mechanisms. Section 3 gives detailed descriptions of the three primary modules of the proposed graph convolution skeleton action recognition model. Section 4 presents the effectiveness of the proposed model, validated on three large public datasets, accompanied by ablation and comparative experiments. Section 5 discusses the proposed model in light of the experimental results. Section 6 summarizes the paper and outlines future research directions.
3. Methodology
In this section, we first review the sequential representation method of skeleton action recognition and spatiotemporal graph convolution operators. Following this, we provide a detailed description of the multi-level topological channel module based on attention and the multi-scale global spatiotemporal module.
3.1. Preliminaries
3.1.1. Skeleton Sequence Representation
The original skeleton sequence consists of a series of coordinate data, represented by the 3D joints of the human body in each video frame. Since the topological structure of the human skeleton is a natural graph, skeleton-based human actions can be represented as spatiotemporal graphs. The ST-GCN [5] is the earliest graph neural network to model skeletal points with spatiotemporal graphs in the time and space dimensions. Specifically, an undirected graph G = (V, E) is constructed on the skeleton sequence X, which comprises N skeletal nodes and a time length of T. The set of nodes can be expressed as:

V = { v_ti | t = 1, ..., T; i = 1, ..., N }

Here, V denotes the set of nodes, v_ti represents the ith skeleton point in the tth frame, and V includes all nodes in the skeleton sequence X. The skeleton edge set E consists of the intra-frame edge set E_S = { (v_ti, v_tj) | (i, j) ∈ H }, which connects skeletal points within the same frame, and the inter-frame edge set E_F = { (v_ti, v_(t+1)i) }, which links the same skeletal point between successive frames, where H represents the naturally connected human skeletal joints.
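The node and edge sets above can be constructed explicitly for a toy skeleton. The bone list H below is a hypothetical 5-joint example, not the NTU joint map:

```python
# Sketch: building the spatiotemporal graph of an ST-GCN-style model.
T, N = 3, 5                            # frames, joints
H = [(0, 1), (1, 2), (1, 3), (3, 4)]   # naturally connected joint pairs

# Node set V: one node per joint per frame
V = [(t, i) for t in range(T) for i in range(N)]

# Intra-frame edges: bones within each frame
E_S = [((t, i), (t, j)) for t in range(T) for (i, j) in H]

# Inter-frame edges: the same joint in consecutive frames
E_F = [((t, i), (t + 1, i)) for t in range(T - 1) for i in range(N)]

print(len(V), len(E_S), len(E_F))   # 15 12 10
```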
According to the defined graph G, the spatial graph convolution operator over the spatial dimension is represented as:

f_out = W_s ⊗ (f_in A)

Here, f_in denotes the input skeletal sequence of dimensions C_in × T × N; f_out denotes the output skeletal sequence of dimensions C_out × T × N; ⊗ represents the convolution operation; W_s denotes the spatial convolution kernel with dimensions C_out × C_in × 1 × 1; and A is the adjacency matrix with dimensions N × N.
The temporal graph convolution operator along the temporal dimension is similar to the classic 2D convolution operation, because each vertex v_ti has consistent corresponding joint vertices on the two adjacent frames, i.e., it possesses two neighboring nodes on the timeline. The temporal graph convolution operator is represented as:

f_out = W_t ⊗ f_in

Here, W_t denotes the temporal graph convolution kernel with dimensions C_out × C_in × K_t × 1, which contains the trainable parameters of the temporal graph convolution.
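The two operators above can be sketched with NumPy einsum. Shapes follow the text (features of size C × T × N, adjacency of size N × N); adjacency normalization and bias terms are omitted, and the variable names are illustrative:

```python
import numpy as np

# Sketch of the spatial and temporal graph convolution operators.
rng = np.random.default_rng(0)
C_in, C_out, T, N = 3, 8, 10, 25

f_in = rng.standard_normal((C_in, T, N))
A = rng.random((N, N))                      # adjacency matrix
W_s = rng.standard_normal((C_out, C_in))    # 1x1 spatial kernel

# Spatial graph convolution: aggregate neighbors via A, mix channels via W_s
f_spatial = np.einsum('oc,ctm,mn->otn', W_s, f_in, A)

# Temporal graph convolution: a K_t x 1 kernel sliding along the T axis
K_t = 3
W_t = rng.standard_normal((C_out, C_out, K_t))
padded = np.pad(f_spatial, ((0, 0), (K_t // 2, K_t // 2), (0, 0)))
f_temporal = np.zeros((C_out, T, N))
for k in range(K_t):
    f_temporal += np.einsum('oc,ctn->otn', W_t[:, :, k], padded[:, k:k + T, :])

print(f_spatial.shape, f_temporal.shape)
```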
3.1.2. Datasets
NTU-RGB+D 60 Dataset [14]: The NTU-RGB+D 60 Dataset is a publicly available large-scale dataset for 3D skeleton-based action recognition. It comprises 56,578 action sequences spanning 60 categories of everyday interactions, including individual actions, interactions with objects, and interactions between people. The model is evaluated using two benchmarks: cross-subject (xsub) and cross-view (xview). For cross-subject, 3D skeleton sequences from 20 specific actor IDs are used for training, with the remaining samples used for testing. Cross-view uses the skeleton data from three cameras, with cameras 2 and 3 used for training and camera 1 for testing.
NTU-RGB+D 120 Dataset [47]: The NTU-RGB+D 120 Dataset extends the NTU-RGB+D 60 Dataset, encompassing 113,945 skeletal sequences that cover a more diverse range of everyday activities, totaling 120 categories. Specifically, the dataset contains skeletal sequences from 106 performers of varying ages, captured across 32 different scenes and 155 camera viewpoints. It has two conventional evaluation criteria: cross-subject (xsub), in which the skeletal data from 53 specific performer IDs are used for training, with the remaining samples for testing, and cross-setup (xset), in which even setup IDs are designated for training and odd setup IDs for testing.
NW-UCLA [48]: The NW-UCLA dataset includes 1497 videos of 10 different action classes, captured simultaneously by three cameras. In this paper, the data from the first two cameras are used for training, while the remaining data are used for testing, following the protocol outlined in [48].
3.1.3. Experimental Settings
All experiments in this paper were conducted with the PyTorch deep learning framework on an RTX 3080 12 GB graphics card, with Python 3.9 and PyTorch 1.9.1. All models used Stochastic Gradient Descent (SGD) with a momentum of 0.9, weight decay of 0.0004, batch size of 64, and an initial learning rate of 0.1. The cross-entropy loss function was employed for a total of 65 epochs, and the learning rate was divided by 10 at the 35th and 55th epochs. A warm-up strategy was applied during the first 5 epochs to stabilize training. For NTU-RGB+D 60 and NTU-RGB+D 120, the preprocessing method from [22] was applied to adjust each skeleton sequence to 64 frames. For the NW-UCLA dataset, the batch size was set to 16 and the preprocessing method from [28] was used. Additionally, four data modalities were used for training: joint, bone, joint-motion, and bone-motion; the predictions of the four modalities were then combined to obtain the final accuracy. To enhance the reliability of the experimental results, the ablation and comparison experiments described in Section 4.2 and Section 4.4 were each repeated 10 times, and every reported result is the average over the 10 runs.
The experimental evaluation metric is the probability of correctly identifying all actions, that is, accuracy. Since all classes are equally essential, it is widely employed. It is defined by the following formula:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

TP (True Positive) is the number of samples correctly identified as positive; TN (True Negative) is the number of samples correctly identified as negative; FP (False Positive) is the number of samples incorrectly identified as positive; and FN (False Negative) is the number of samples incorrectly identified as negative. This formula takes all possible classification results into account and calculates the accuracy across all test samples.
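The metric reduces to a one-line function over the four counts:

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

# Toy counts (not results from the paper): 90 of 100 samples correct.
print(accuracy(tp=40, tn=50, fp=6, fn=4))  # 0.9
```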
3.2. Multi-Level Topological Channel Attention Network (MTC)
This module models the channel relationships of the input skeletal feature X and the coordination of human limbs in kinematics. This paper divides the prior knowledge of human motion into two categories:
(1). From a detailed perspective, human motion is carried out on a limb-by-limb basis. This section depicts the relationship between individual limb movements and overall body motion.
(2). From a coordination standpoint, human motion involves inter-limb movements. This section articulates the relationship between the movements of different limbs.
3.2.1. The Multi-Level Topological Channel Attention Module (MTC)
Following the laws of human motion, this paper categorizes the human skeletal structure into two hierarchical levels. The first level divides the body into two segments: the upper body, consisting of everything above the last lumbar vertebra, and the lower body, consisting of everything below it. The second level consists of four parts: the left arm, right arm, left leg, and right leg. Channel attention initially calculates the attention on the dimension of the first-level topological structure, rendering a coarse-grained representation in the feature map. Subsequently, based on the coarse-grained information, it computes the attention on the second-level topological structure’s channel dimension, producing a finer-grained representation in the feature map.
Initially, as shown in Figure 2, we use an action recognition dataset based on human skeletal structure as input; this carefully processed dataset consists of human skeleton data. The input data have dimensions N × C × T × V, where N represents the batch size, C the number of channels, T the temporal length, and V the number of skeletal points. The figure shows the channel attention module "att" used in Multi-level Topological Channel Attention, forming a multi-level topology, while the lower part of the figure shows the Coordination Module. The output of the model is jointly weighted by the Multi-level Topological Channel Attention Module and the Coordination Module.
As depicted in Figure 3, both global average pooling and global max pooling layers are employed to extract high-level topological features, since the two types of global pooling capture complementary topological information. The input skeletal data X, of shape C × T × V, where C denotes the number of channels, T the sequence length of the skeleton, and V the number of skeleton joints, yield channel features of shape C × 1 × 1 after passing through the two global pooling layers. The channel features are then dimensionality-reduced by a convolutional layer with a kernel size of 1. The process can be represented as:
F_avg = Conv(AvgPool(X)),  AvgPool(X) = (1/(T·V)) Σ_{j=1}^{T} Σ_{k=1}^{V} X(i, j, k);
F_max = Conv(MaxPool(X)),  MaxPool(X) = max_{j,k} X(i, j, k).

Here, AvgPool and MaxPool represent global average pooling and global max pooling, respectively, while i, j, and k denote positions in the N, T, and V dimensions. Conv stands for a convolution layer with a kernel size of 1. F_avg and F_max represent the extracted global average pooling and global max pooling features, respectively. The outputs of the two pooling branches are concatenated and fed into a convolution layer with a kernel size of 1; this layer serves as a selector, adaptively weighting the features produced by the two types of global pooling. Finally, the features are reweighted using a sigmoid activation function. This process can be illustrated as follows:
X' = Sigmoid(Conv(Cat(F_avg, F_max))) ⊙ X

In this context, Cat denotes the concatenation operation, Sigmoid refers to the activation function, and Conv stands for a convolution layer with a kernel size of 1. X represents the input, and the final outcome, X', refers to the features weighted by channel attention.
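The pooling, concatenation, and sigmoid reweighting steps above can be sketched in NumPy as follows. The 1 × 1 convolutions are written as plain matrix multiplies, and the reduction ratio r is an illustrative choice, not a value taken from the paper:

```python
import numpy as np

# Sketch of the channel-attention block: global average and max pooling
# over the (T, V) axes, a "selector" 1x1 convolution (here a matmul),
# and sigmoid reweighting of the input channels.
rng = np.random.default_rng(0)
Nb, C, T, V = 2, 16, 64, 25   # batch, channels, frames, joints
r = 4                         # assumed channel-reduction ratio

x = rng.standard_normal((Nb, C, T, V))
avg = x.mean(axis=(2, 3))     # (Nb, C): global average pooling
mx = x.max(axis=(2, 3))       # (Nb, C): global max pooling

W_reduce = rng.standard_normal((C, C // r))          # 1x1 reduction conv
feat = np.concatenate([avg @ W_reduce, mx @ W_reduce], axis=1)  # Cat

W_select = rng.standard_normal((2 * C // r, C))      # 1x1 selector conv
att = 1.0 / (1.0 + np.exp(-(feat @ W_select)))       # sigmoid, (Nb, C)

y = x * att[:, :, None, None]  # channel-reweighted features
print(y.shape)
```

The sigmoid keeps every channel weight strictly between 0 and 1, so the block can only rescale channels, never change their sign.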
The Multi-level Topological Channel Attention Module first uses a feature linear transformation layer to convert the input features X into U, thereby extracting high-level representations:

U = WX

where W denotes the learnable weights of the linear transformation layer.
Afterwards, using the predefined first-level topology, U is split into U1 and U2, the upper- and lower-body features. U1 and U2 are then separately fed into the channel attention modules, yielding two channel feature descriptors, Z1 and Z2. Finally, the Cat operation concatenates the two descriptors, forming the channel feature descriptor of the first-level topology:

Z = Cat(Z1, Z2)
Following the aforementioned process, we obtain a coarse-grained feature map. Subsequently, treating this first-level channel descriptor as input, we divide it into four parts according to a predefined partition, P1, P2, P3, and P4, corresponding respectively to the left arm, right arm, left leg, and right leg. We then repeat the above formula to calculate the fine-grained channel attention for each part, obtaining the mixed channel feature. This defines the secondary topology as follows:

Z_f = Cat(att(P1), att(P2), att(P3), att(P4))

where att(·) denotes the channel attention module.
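The two-level split and per-part descriptors can be sketched as below. The joint index ranges are hypothetical placeholders, not the actual NTU-RGB+D joint numbering, and the channel attention block is replaced by a simple mean descriptor for brevity:

```python
import numpy as np

# Sketch of the two-level topological partition: upper/lower body at the
# first level, four limbs at the second, each producing a channel
# descriptor that is concatenated with the others.
rng = np.random.default_rng(1)
Nb, C, T, V = 2, 8, 16, 25

upper = list(range(0, 12))    # assumed upper-body joint indices
lower = list(range(12, 25))   # assumed lower-body joint indices
parts = [list(range(0, 6)), list(range(6, 12)),
         list(range(12, 18)), list(range(18, 25))]  # four limb partitions

def channel_descriptor(x):
    """Toy stand-in for the channel-attention block: mean over (T, V)."""
    return x.mean(axis=(2, 3))  # (Nb, C)

U = rng.standard_normal((Nb, C, T, V))

# First level: upper/lower descriptors, concatenated along channels
z1 = np.concatenate([channel_descriptor(U[..., upper]),
                     channel_descriptor(U[..., lower])], axis=1)  # (Nb, 2C)

# Second level: four limb descriptors, concatenated along channels
z2 = np.concatenate([channel_descriptor(U[..., p]) for p in parts],
                    axis=1)                                       # (Nb, 4C)
print(z1.shape, z2.shape)
```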
3.2.2. Coordination Module (CM)
Even though this paper achieves high accuracy in ablation experiments using first-order and second-order skeleton features (derived from Formulas (11) and (12) in Section 3.2.1, corresponding to the first-order and second-order information, respectively), similar motion trajectories of related action types still lead to misjudgments. Hence, it becomes necessary to obtain higher-order features to support the lower-order ones. A person always maintains balance during motion, which requires dynamic coordination between the limbs; from a coordination perspective, human motion involves inter-limb movements, i.e., relationships between limb movements. As shown in Figure 4, in human kinetics, motion is typically classified into contralateral coordination, as in (a), usually contralateral movements of the left hand and right foot (such as running and walking), and ipsilateral coordination, as in (b), ipsilateral movements of the left and right hands (such as swimming and Tai Chi). Both describe coordinated movements between limbs. Therefore, this study constructs a coarse-grained ratio graph to extract the coordination characteristics between limbs and applies weights to the original skeleton by generating a coordination matrix.
Next, using the given skeleton sequence X, we construct a coarse-grained proportion map. The human skeleton is divided into five parts: the central torso, the left arm, the right arm, the left leg, and the right leg. Each of these parts is processed separately. Typically, the most common method is to calculate the centroid of each region to represent its approximate location. However, this method can be flawed for non-convex shapes, such as a bent arm. A protrusion at the elbow, or any indentations, might lead to offsetting the centroid coordinates, thereby failing to accurately depict the features of that part. Consequently, this study adopts the mean coordinate method, which is applicable to various non-convex shapes present in skeleton data, thereby preserving more detailed information. The positions of the skeleton points included in each part are processed by calculating the mean coordinates, which are then merged into new skeleton points. This is referred to as the coarse-grained proportion map.
Figure 5 illustrates the graph structure constructed corresponding to the coarse-grained proportion map.
In biomechanics, movements are typically classified as contralateral or ipsilateral coordination. Therefore, in the Coordination Module, correlation coefficients are calculated separately for the left and right arms (a2, a3), the left arm and right leg (a2, a5), and the right arm and left leg (a3, a4) in the coarse-grained proportion map (in the analysis of movement types in the dataset, instances of ipsilateral coordination involving both legs are sparse; that is, if the hands are coordinated, so are the legs). First, the Euclidean distances between a2 and a3, a2 and a5, and a3 and a4 are calculated to obtain the distance parameters d1, d2, and d3, which are subsequently processed using exponential weighting and normalization.
Figure 5 below represents a schematic of the coarse-grained proportion map:
Here, w1, w2, and w3 represent the coordination correlation coefficients. A weighted calculation is then performed on the skeleton, X_cm = W_c ⊙ X, where W_c is the coordination matrix built from w1, w2, and w3, and X_cm denotes the weighted skeleton sequence.
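The distance-to-coefficient computation described above can be sketched as follows. The part-center coordinates are toy values, and exp(-d) is an assumed concrete form of the "exponential weighting and normalization" step:

```python
import numpy as np

# Sketch of the coordination coefficients: Euclidean distances between
# coarse-grained part centers, exponential weighting, then normalization
# so the three coefficients sum to 1.
a2 = np.array([0.3, 1.2, 0.0])    # left arm center (toy coordinates)
a3 = np.array([-0.3, 1.2, 0.0])   # right arm center
a4 = np.array([0.2, 0.1, 0.0])    # left leg center
a5 = np.array([-0.2, 0.1, 0.0])   # right leg center

d = np.array([np.linalg.norm(a2 - a3),    # d1: left arm vs right arm
              np.linalg.norm(a2 - a5),    # d2: left arm vs right leg
              np.linalg.norm(a3 - a4)])   # d3: right arm vs left leg

w = np.exp(-d)          # exponential weighting (assumed form)
w = w / w.sum()         # normalization: w1 + w2 + w3 = 1
print(w.round(3))
```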
3.3. Multi-Scale Global Spatiotemporal Attention Module (MGS)
Since the spatial graph convolutional layer only aggregates information in space, it cannot effectively exchange information within the time window; thus, the temporal features of the skeleton sequence must also be modeled. This paper designs a Multi-scale Global Spatiotemporal Attention Module, which employs multi-scale temporal graph convolutional layers for multi-branch expansion, captures spatiotemporal patterns of different feature granularities, and computes the correlation between the current feature position and all other spatiotemporal positions to capture global dependencies among spatiotemporal features. The network model is illustrated in Figure 6.

The Multi-scale Global Spatiotemporal Attention Module (MGS) first takes the skeleton feature X of shape C × T × V, where C denotes the channels, T the skeleton temporal scale, and V the number of skeleton joints, and passes it through three convolutional layers with a kernel size of 1, yielding x1, x2, and x3. Inputs x1 and x2 are fed into the Multi-scale Temporal Convolutional Layer, which contains depth-wise causal convolution blocks, producing y1 and y2. Next, y1, y2, and x3 are reshaped via matrix transformations so that every spatiotemporal position can be related to every other. The relationship between the current spatiotemporal features and the others is computed by matrix multiplication of y2 and x3, followed by a Softmax operation, yielding the global spatiotemporal attention weight coefficients. These coefficients are then combined with y1, producing an attention-weighted feature map. Finally, the features are passed to the module output through a residual connection with the input features.
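The attention flow above can be sketched in NumPy as follows. The multi-scale temporal convolution branch is replaced by an identity for brevity, and all shapes are illustrative:

```python
import numpy as np

# Sketch of the MGS attention flow: three 1x1 convolutions, pairwise
# relations between all spatiotemporal positions via matmul + softmax,
# attention-weighted features, and a residual connection.
rng = np.random.default_rng(0)
C, T, V = 4, 8, 5

def conv1x1(x, W):
    """1x1 convolution over a (C, T, V) feature map as a channel matmul."""
    return np.einsum('oc,ctv->otv', W, x)

X = rng.standard_normal((C, T, V))
x1 = conv1x1(X, rng.standard_normal((C, C)))
x2 = conv1x1(X, rng.standard_normal((C, C)))
x3 = conv1x1(X, rng.standard_normal((C, C)))

y1, y2 = x1, x2    # stand-ins for the MST branch outputs

# Flatten spatiotemporal positions and relate every position to every other
q = y2.reshape(C, T * V)
k = x3.reshape(C, T * V)
scores = q.T @ k                                   # (TV, TV) relations
att = np.exp(scores - scores.max(axis=-1, keepdims=True))
att = att / att.sum(axis=-1, keepdims=True)        # softmax per position

out = (y1.reshape(C, T * V) @ att).reshape(C, T, V) + X   # residual
print(out.shape)
```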
The MST module is shown in Figure 7; this paper improves the classic Multi-scale Temporal Graph Convolution Layer (MSTGCL). Behind these improvements, we became aware of a specific scenario in which the MSTGCL, with its different dilation rates, can leak future information, as depicted in Figure 8. Such leakage must be avoided when dealing with temporal data: "information leakage into the future" simply means that, during temporal convolution, the model accesses data it should only be able to see in the future. Dilated convolution plays a crucial role in the MSTGCL, computing convolutions at different dilation rates and thus capturing patterns in the input data at varied scales. Therefore, as shown in Figure 7b, we implement a depth-wise causal convolution module within the dilated convolution, ensuring that the convolution is performed under the premise of causal relationships.
Specifically, when performing convolution operations along the temporal dimension T, disregarding causality on the time axis, that is, allowing the convolution kernel to access data beyond the current timestep, leads to leakage of future information. To mitigate this issue, we adopt a padding method that ensures the convolution kernel only accesses current and past data, eliminating access to future temporal information and maintaining causality. As per Figure 7b, skeleton features of shape C × T × V are input into the depth-wise convolution, in which each channel convolves only with itself, reducing parameter and computational demands. The outputs are then directed into a pointwise convolution, which merges the output channels. Thereafter, a two-dimensional convolution layer is defined in "Remove causal pad", setting the padding parameters, the dilation factor of the convolution kernel, and the stride determining the convolution step length. Input data of varying scales are first convolved through this layer; the remove operation then discards the outputs trailing by R timesteps, where R = (kernel size − 1) × dilation. Lastly, the output is returned after normalization using Batch Normalization (BN).
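The causal padding trick can be sketched on a 1D signal as follows. Left-padding by R = (K − 1) × dilation zeros is equivalent to padding symmetrically and removing the last R outputs, as in "Remove causal pad":

```python
import numpy as np

# Sketch of causal dilated convolution: each output depends only on the
# current and past (dilated) samples, never on future timesteps.
def causal_dilated_conv1d(x, w, dilation=1):
    """x: (T,) signal; w: (K,) kernel; returns (T,) causal output."""
    K = len(w)
    R = (K - 1) * dilation                  # amount of causal padding
    xp = np.concatenate([np.zeros(R), x])   # left-pad with R zeros
    return np.array([sum(w[k] * xp[t + R - k * dilation] for k in range(K))
                     for t in range(len(x))])

x = np.arange(1.0, 7.0)        # [1, 2, 3, 4, 5, 6]
w = np.array([1.0, 1.0])       # sums the current and previous dilated sample
print(causal_dilated_conv1d(x, w, dilation=2))
```

With dilation 2, output t sums x[t] and x[t−2]; the first two outputs see only zero padding in place of the missing past, never future samples.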
The MST module adopts a bottleneck design, reducing the parameter count to a certain extent. As per Figure 7a, six branches are designed. Each uses a 1 × 1 temporal graph convolution kernel to reduce the channel dimension of the skeleton sequence to C/6, minimizing computational complexity. Four branches employ temporal graph convolution kernels with dilation rates of 2, 3, 4, and 5 to extract multi-scale temporal features from the skeleton sequence. Moreover, to expand the receptive field of the model, an additional temporal graph convolution and max pooling branch is included. Finally, the Concat operation restores the channel dimension to C, and a residual connection is introduced to ease gradient propagation, yielding the output of the multi-scale temporal graph convolution layer.
In the Multi-scale Global Spatiotemporal Attention Module (MGS), a Self-Attention Temporal Module (SATM) is introduced, which combines self-attention with a time mask and a temporal convolution network. This paper opts for masked time attention for two reasons. First, it enables the model to adaptively focus on the key feature areas of different time positions as needed, extracting features effectively and improving performance with limited parameters. Second, the masked time attention can obscure information at specific time moments, thereby preventing data leakage.
Specifically, the Self-Attention Temporal Module (SATM) takes a skeleton sequence of dimensions C × T × V as input. Initially, the input goes through a linear transformation layer followed by the Tanh nonlinear mapping function, obtaining the attention distribution over the T time points. This attention distribution is then replicated T times to yield the attention distribution matrix A1 of dimensions T × T, as depicted in Figure 9.
Concurrently, the input features undergo the same processing to produce the feature matrix B1. Next, a masking operation fills the upper-right triangle of the attention matrix A1 with negative infinity, creating the masked attention distribution matrix A2; after the Softmax function, the weight coefficients of the masked entries become zero, preventing the model from attending to future information. Subsequently, the masked attention distribution matrix A2 and the feature matrix B1 are matrix-multiplied to generate the time-weighted feature matrix C. Finally, a global average pooling operation is applied to feature matrix C: by averaging the weighted features over all timesteps, the module output is obtained, where the weighted feature at each moment results from time-attentive weighting of past features. Through this procedure, the SATM can adaptively focus on key feature regions at different time positions and extract more effective feature representations with limited parameters.
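The masking step can be verified with a small NumPy sketch: filling the strictly upper triangle with negative infinity before the softmax makes every post-softmax weight on a future timestep exactly zero:

```python
import numpy as np

# Sketch of the SATM time mask: future positions of the attention matrix
# are set to -inf, so the softmax assigns them zero weight.
T = 4
rng = np.random.default_rng(0)
A1 = rng.standard_normal((T, T))                   # raw attention matrix

mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # strictly-future entries
A2 = np.where(mask, -np.inf, A1)                   # masked attention matrix

w = np.exp(A2 - A2.max(axis=-1, keepdims=True))    # numerically stable
w = w / w.sum(axis=-1, keepdims=True)              # row-wise softmax

print(np.allclose(np.triu(w, k=1), 0.0))           # future weights are zero
```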
5. Discussion
In this research, we delve into the significance of our method in skeletal action recognition while analyzing its correlations and differences with existing research, and exploring potential limitations. The primary innovation proposed herein is that the channel and coordination relationships of the human skeleton, along with the temporal positioning features of skeletal nodes, could greatly boost the performance of action recognition. The experiments on challenging datasets like NTU-RGB+D 60, NTU-RGB+D 120, and NW-UCLA have convincingly validated these innovations. Despite the significant success of graph convolution in action recognition as demonstrated by existing research, we find that current models do not fully leverage prior knowledge of human body structure and the coordination between limbs. Therefore, we propose a Multi-level Topological Channel Attention Module based on the human skeleton, integrating limb coordination and skeletal node position features into the model.
In the experimental stage, the model presented in this paper achieved accuracy rates of 91.9% (Xsub) and 96.3% (Xview) on the NTU-RGB+D 60 dataset, surpassing the current mainstream model STF-Net by 0.8% on Xsub while falling slightly short, by 0.5%, on Xview. On the NTU-RGB+D 120 dataset, the model achieved accuracy rates of 88.5% (Xsub) and 90.3% (Xset), outperforming the current mainstream model RSA-Net by 0.1% and 0.6%, respectively. Additionally, the model achieved an accuracy rate of 95.6% on the NW-UCLA dataset, although it did not surpass the current mainstream model CTR-GCN.
This paper acknowledges that notwithstanding the model’s robust performance across all three datasets, it exhibits a notable limitation in detecting actions that follow the same trajectory but in a direction opposite to the original action. Indeed, our model may inaccurately categorize certain complex or unique actions when their trajectories resemble known ones. This constitutes a key limitation of our model, a constraint we recognize and plan to address in future research. We aim to enhance action recognition accuracy by extracting more attributes from human skeletal data.