1. Introduction
In recent years, the rapid development of deep learning has resulted in its great success in computer vision tasks [1]. As a research hotspot in this field, action recognition has been widely used in human–computer intelligent interaction, virtual reality, video surveillance and other practical applications [2,3,4]. Human action recognition extracts discriminative behavior features to fully describe the temporal dynamics and spatial structure of human motion [5]. Currently, however, due to the high complexity and variability of human action, recognition models cannot fully meet the practical requirements of both light weight and recognition accuracy.
At present, skeleton-based human data representation has been extensively studied and applied in human action recognition. Firstly, skeleton data, as a natural representation of human joint positions in the form of 2D or 3D coordinates, can be readily extracted from depth sensors such as the Microsoft Kinect [6]. Secondly, human skeleton data contain relatively rich behavior characteristics in a compact data form [7] for describing human motion. In addition, skeleton-based data representations are less susceptible to lighting, camera angle and other background changes, and are therefore more robust. These advantages of skeleton-based human data representation have promoted the exploration and utilization of the information features in skeleton motion sequences for action recognition.
Traditional methods used to deal with skeleton data mainly include convolutional neural networks (CNNs) [8,9] and recurrent neural networks (RNNs) [10,11]; current mainstream action recognition methods utilize graph convolutional networks (GCNs) [12,13]. CNN-based methods model the skeleton data as pseudo-images using manually designed transformation rules. For example, Kim and Reiter [9] concatenated joint coordinates and used a 1D residual CNN to identify skeleton sequences, providing a method to learn and interpret 3D human action recognition. RNN-based methods usually model the skeleton data in the spatial and temporal dimensions as a sequence of coordinate vectors. For example, Du et al. [10] constructed a hierarchical bidirectional RNN model to capture the sequence features between different body parts. However, human skeleton data are naturally connected in the form of graphs, rather than 2D grids or vector sequences. Therefore, it is difficult for CNN-based and RNN-based methods to represent the topological structure of skeleton data and fully express the spatial structure information between human joints. The GCN is a universal deep learning framework that, in contrast to other mature models [14], can be directly applied to non-Euclidean structured data. More recently, Yan et al. [12] first constructed a spatial–temporal graph convolutional network (ST-GCN) for skeleton-based human action recognition. They innovated the research methods of human action recognition by modeling the human skeleton data as a spatial–temporal graph.
Inspired by the ST-GCN, more and more researchers have applied GCNs to skeleton-based action recognition tasks. Si et al. [15] proposed the AGC-LSTM model, wherein a graph convolution module is embedded into a multi-layer LSTM module, improving the graph convolutional network's ability to extract joint features in the temporal dimension. Shi et al. [16] constructed a two-stream action recognition model, overcoming the limitation of input data containing only node features; the model can adaptively learn the topological structure in the spatial–temporal dimension, enhancing its recognition ability. Compared with previous studies, PA-ResGCN-N51 [17] and EfficientGCN-B0 [18] innovatively used three early-fused input features to reduce the complexity and number of parameters and introduced a residual network to increase the stability of the model. Xie et al. [19] proposed a new strategy for constructing graph convolution kernels that automatically adjusts the number of kernels according to the complexity of the topology. Wang et al. [20] calculated the weight values of the edges between non-physically connected nodes in a human skeleton graph according to the node distance, increasing the receptive field of the GCN; at the same time, a modified partitioning strategy was adopted to extract a large amount of non-adjacent joint information.
These GCN-based models can significantly improve recognition accuracy, but they also have some limitations: (1) The adjacency matrix set according to the predefined human body topology is usually fixed across all graph convolutional layers and input samples, which is not the best choice for the action recognition task. Meanwhile, graph convolution kernels artificially defined based on the adjacency matrix can only extract the feature information of neighbor nodes; it is difficult for them to capture the feature information between two nodes that are far apart in the topology of the human body graph, so they lack the ability to perceive node features globally. (2) The multi-stream framework model [16], by making use of the first-order information (joint coordinates) and second-order information (motion speed and bone characteristics) of skeleton data, has achieved good performance. However, human behavior is a consecutive motion stream over multiple frames, in which each sub-stage has different characteristic information. The direct fusion of the first-order and second-order information of skeleton data ignores the fact that different input features have different degrees of importance for different action samples, leading to redundant feature information and confusing the identification features extracted by the model. (3) The effectiveness and necessity of attention mechanisms in action recognition tasks have been demonstrated. Some related works [17,21] used a single attention mechanism with GCNs so as to discover the relationships between joints and between frames and identify the differences between action samples. However, in a GCN model with a hierarchical architecture like that of a CNN, different graph convolutional layers contain different semantic information; a single attention mechanism lacks the ability to understand the multi-layer semantic information of all graph convolutional layers, smoothing the output features of the model.
To address the above issues, we propose an enhanced adjacency matrix-based graph convolutional network with a combinatorial attention mechanism (CA-EAMGCN) to improve the extraction of skeleton data features. Specifically, our main contributions in proposing the CA-EAMGCN model are as follows: (1) A novel method to construct an adjacency matrix is proposed. The new adjacency matrix is optimized for different graph convolution layers, making up for the shortcoming that graph convolution kernels can only extract the features of neighbor nodes, enhancing the ability of the model to capture the features of global nodes and provide global regularization for graph learning. (2) A feature selection fusion module (FSFM) is designed, which can adaptively calibrate the fusion ratio of multiple input features to enhance the differences of input data and reduce the amount of redundant feature information. (3) A combinatorial attention mechanism is proposed to enhance the model’s ability to understand the semantic information contained in different graph convolution layers. Specifically, in the multi-input branch stage of the model, a spatial–temporal (ST) attention module is designed to process the semantic information at the spatial and temporal levels of different graph convolution layers; in the mainstream network stage of the model, a limb attention module (LAM) is designed to process the joint semantic information between local joints in the human body topology in one graph convolution layer.
In order to verify the effectiveness of our proposed model, we conduct extensive experiments on three large-scale datasets: NTU RGB+D 60 [22], NTU RGB+D 120 [23] and UAV-Human [24]. The experimental results show that our model achieves good performance on all datasets. The proposed model is lightweight while ensuring recognition accuracy.
The main contributions of this paper are outlined as follows:
A novel method to construct an enhanced adjacency matrix is proposed, improving the ability of capturing the global features among joints and providing global regularization for graph learning;
A feature selection fusion module is designed to provide a more suitable fusion ratio for multi-stream input features;
A combinatorial attention mechanism is proposed to enhance the model’s ability to understand the semantic information in different graph convolution layers;
Extensive experiments on three large-scale datasets, namely NTU RGB+D 60, NTU RGB+D 120 and UAV-Human, verify the validity of our proposed method.
3. Methodology
In this section, we illustrate the main steps, framework and key modules of our proposed method.
3.1. Graph Structure
By definition, the typical graph structure of an undirected graph can be represented by G = (V, A, X), where V = {v1, v2, ···, vN} represents a set of nodes, and A ∈ {0, 1}^(N×N) is an adjacency matrix. If there is a connection between vi and vj, then Aij = 1; otherwise, Aij = 0. X ∈ R^(N×D) represents the feature matrix of the nodes, where N is the number of nodes and D is the number of feature channels.
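As a minimal illustration of this definition, the following sketch builds A and X with NumPy for a hypothetical 5-joint skeleton (the joint names and edge list are illustrative only, not the datasets' actual skeleton layout):

```python
import numpy as np

# A toy 5-joint skeleton (hypothetical indexing): 0 = head, 1 = neck,
# 2 = spine, 3 = left hand, 4 = right hand.
edges = [(0, 1), (1, 2), (1, 3), (1, 4)]

N = 5                                   # number of nodes
A = np.zeros((N, N), dtype=int)
for i, j in edges:
    A[i, j] = A[j, i] = 1               # undirected graph: Aij = Aji = 1

D = 3                                   # feature channels per node (3D coordinates)
X = np.random.randn(N, D)               # node feature matrix X in R^(N x D)
```

Note that A is symmetric with a zero diagonal, matching the definition of an undirected graph without self-connections.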
3.2. Data Preprocessing
According to previous research methods [16,17], data preprocessing is a very important phase for skeleton-based action recognition. In order to make the preprocessed skeleton data more consistent with human movement, inspired by Song et al. [17], we used a new preprocessing method to divide the skeleton data into three factors: (1) joint position, (2) motion speed and (3) bone characteristics. We assume that the original 3D coordinate set of a skeleton sequence is X ∈ R^(C×T×N), where C, T and N represent the numbers of original 3D coordinate channels, frames and joints of the nodes, respectively. The relative positions R = {r_i | i = 1, 2, ···, N} can be obtained, where r_i = x_i − x_w. The original 3D coordinate set and the relative position set of the nodes are concatenated into a single sequence as the input branch of joint positions. Then, two sets of velocities can be obtained using the original coordinate set of the skeleton sequence, F = {f_t | t = 1, 2, ···, T} and S = {s_t | t = 1, 2, ···, T}, where f_t = x_{t+2} − x_t and s_t = x_{t+1} − x_t. A feature vector obtained by concatenating the two groups of velocities F and S is used as the input branch of the velocity. Finally, the bone feature input branch includes the bone length l_i = x_i − x_{i_adj} and the bone angle a_{i,d} = arccos(l_{i,d} / ‖l_i‖), d ∈ {x, y, z}, where i_adj refers to the adjacent joint of the i-th joint, and x_w is the 3D coordinate of the central node of the human skeleton.
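The three input branches above can be sketched as follows. This is a simplified NumPy version under stated assumptions: the function name `preprocess`, the toy `center` index and `parents` table are hypothetical, and the bone angle is represented by the direction cosines of each bone vector (the cosines of the paper's arccos angles) rather than the angles themselves:

```python
import numpy as np

def preprocess(x, center=1, parents=(1, 1, 1, 1, 2)):
    """Split a skeleton sequence x of shape (3, T, N) into the three input
    branches: joint positions, velocities and bone features.
    `center` and `parents` are illustrative choices for a toy 5-joint skeleton."""
    C, T, N = x.shape
    # (1) joint branch: absolute coordinates + positions relative to the central node
    rel = x - x[:, :, center:center + 1]
    joint = np.concatenate([x, rel], axis=0)               # (6, T, N)
    # (2) velocity branch: fast (two-frame) and slow (one-frame) motion, zero-padded
    fast = np.zeros_like(x); fast[:, :-2] = x[:, 2:] - x[:, :-2]
    slow = np.zeros_like(x); slow[:, :-1] = x[:, 1:] - x[:, :-1]
    velocity = np.concatenate([fast, slow], axis=0)        # (6, T, N)
    # (3) bone branch: bone vector towards each joint's adjacent (parent) joint,
    # plus that vector's direction cosines with the coordinate axes
    bone = x - x[:, :, list(parents)]
    norm = np.linalg.norm(bone, axis=0, keepdims=True) + 1e-6
    angle = bone / norm                                    # cos of bone vs. each axis
    bones = np.concatenate([bone, angle], axis=0)          # (6, T, N)
    return joint, velocity, bones
```

Each branch ends up with 6 channels per joint, which is why the three streams can later be fed to structurally identical input branches.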
3.3. Spatial–Temporal Graph Convolutional Network (ST-GCN)
Yan et al. [12] proposed a human spatial–temporal skeleton graph used to simulate the structured information between joints in the spatial and temporal dimensions, which refer to the joints in the same frame and the same joints across all frames, respectively. Figure 1a is an example of a constructed spatial–temporal skeleton graph, where joints are represented by vertices and their natural connections in the human body are represented by spatial edges, i.e., the black lines in Figure 1a. Each temporal edge is the connection of corresponding joints in two consecutive frames, i.e., each red dotted line in Figure 1a. The ST-GCN was composed of 9 basic blocks, each of which performed spatial and temporal dimensional convolution. The spatial dimension convolution operation of each frame in the skeleton sequence data could be expressed as:

f_out = Σ_{k=1}^{K_v} W_k f_in (Ā_k ⊙ M_k),    (7)

where K_v represents the size of the convolution kernel in the spatial dimension, namely the number of adjacency matrices. In the ST-GCN, the neighbor nodes of each node were divided into three categories according to a partition strategy. Figure 1b shows the partition strategy for neighbor nodes, where the red × sign represents the center of gravity of the human skeleton. Specifically, each node and its neighbors form a neighbor set (the region circled by the blue line in Figure 1b). The partition strategy divided the neighbor set into three subsets: (1) the root node itself (the red circle); (2) a centripetal subset containing adjacent nodes that are closer to the center of gravity of the skeleton than the root node (yellow circle); and (3) a centrifugal subset containing adjacent nodes further away from the center of gravity of the skeleton than the root node (green circle). Therefore, there were three types of adjacency matrices of the graph, i.e., K_v = 3 in Formula (7). f_in and f_out denote the input and output features, and ⊙ denotes element-wise multiplication. Ā_k = Λ_k^(−1/2)(A_k + I)Λ_k^(−1/2), where Ā_k represents the k-th normalized adjacency matrix; it is used to extract the connected vertices in a specific subset from the input feature f_in with the corresponding weight vector. A_k represents the adjacency matrix without self-connections, I represents the identity matrix and Λ_k is used for the normalization of Ā_k. W_k and M_k are learnable parameters: W_k is the weight vector of the convolution operation and M_k can be used to adjust the importance level of each edge.
For temporal dimension convolution, the number of neighbors of each node was 2, so traditional convolution operations could be used to achieve feature extraction in the temporal dimension. Specifically, a convolution layer of size L × 1 was designed to extract contextual features between adjacent frames, where L is a predefined parameter used to define the length of the time window. By performing convolution operations in the spatial and temporal dimensions, a basic graph convolution module was constructed.
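A minimal sketch of the spatial graph convolution of Formula (7) for a single frame, in NumPy. The function names are illustrative; the dense matrix products stand in for the 1×1 convolutions of the real implementation:

```python
import numpy as np

def normalize(A_k):
    """A_bar_k = Lambda_k^(-1/2) (A_k + I) Lambda_k^(-1/2): symmetric
    normalization of one partition's adjacency matrix with self-loops."""
    A_hat = A_k + np.eye(A_k.shape[0])
    d = A_hat.sum(axis=1)                         # degree per node (>= 1)
    inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return inv_sqrt @ A_hat @ inv_sqrt

def spatial_gcn(f_in, A_list, W_list, M_list):
    """Formula (7) for one frame: f_out = sum_k W_k f_in (A_bar_k * M_k).
    f_in: (N, C_in); each A_k, M_k: (N, N); each W_k: (C_in, C_out)."""
    f_out = 0.0
    for A_k, W_k, M_k in zip(A_list, W_list, M_list):
        # elementwise mask (*), then graph aggregation (@), then channel mixing
        f_out = f_out + (normalize(A_k) * M_k) @ f_in @ W_k
    return f_out
```

With K_v = 3 partition matrices, each term aggregates features only from the root, centripetal or centrifugal subset before the weighted contributions are summed.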
3.4. The Proposed Model—Enhanced Adjacency Matrix-Based Graph Convolution Network
Although the ST-GCN model has become a baseline network, it has certain limitations: (1) The partitioning strategy is not the best choice for action recognition tasks. As shown in Figure 1b, there are few neighbor nodes surrounding each root node, which means that there are many 0 values in the adjacency matrix A_k based on the partitioning strategy. (2) The dot multiplication operation between the input feature f_in and the adjacency matrix A_k means that if an element in A_k is 0, the result of their multiplication will always be 0 regardless of the value of the input feature f_in; as a result, the graph convolutional network can only aggregate the feature information of the neighbor nodes of the root node, leaving the model without awareness of global node information. A similar situation occurs for the mask matrix M_k.
In order to address this issue, we propose a parameterized quasi-adjacency matrix, mainly aimed at assigning different virtual connection strengths to nonadjacent nodes in the skeleton graph at the corresponding positions (with value 0) of the original adjacency matrix. We then combine this parameterized quasi-adjacency matrix with the original adjacency matrix to form a new adjacency matrix. The values of the elements in the parameterized quasi-adjacency matrix are not obtained by a specific calculation but are given certain initial values before the model starts training; they are then adjusted and optimized according to the training situation to cope with different action categories, with their values restricted between 0 and 1 during training. This not only maintains the predefined skeleton graph topology but also establishes long-range dependency relationships within it and extends the feature extraction capability of the graph convolutional network from adjacent nodes to all nodes, so as to realize the enhanced adjacency matrix.
Specifically, the construction of the quasi-adjacency matrix and the implementation of the adjacency matrix enhancement strategy are as follows: Firstly, the adjacency matrix Ā_k is obtained according to the predefined human skeleton graph, where A_k represents the adjacency matrix without self-connections, I represents the identity matrix and k = 1, 2, 3; the quasi-adjacency matrix QAM(E_k), with the same dimensions as Ā_k, is created at the same time. Then, the parameterized quasi-adjacency matrix E_k is added to the adjacency matrix Ā_k. The above steps can be expressed by the following formulas:

Ā_k = Λ_k^(−1/2)(A_k + I)Λ_k^(−1/2),    (8)

G_k = Ā_k + E_k.    (9)

Finally, we achieve the enhanced adjacency matrix G_k. Therefore, the graph convolution formula based on the adjacency matrix enhancement is modified as Formula (10):

f_out = Σ_{k=1}^{K_v} W_k f_in (G_k ⊙ M_k).    (10)
Figure 2 shows the architecture of the EAMGCN unit. The values of the elements in the new adjacency matrix G_k represent whether there is a connection between joints, as well as the strength of the connection. At the same time, as a self-learning method for the connection strength between nodes, the adjacency matrix enhancement strategy can effectively reduce both the calculation amount and the number of parameters of the model. When using the adjacency matrix enhancement strategy, we need to initialize the elements of the created quasi-adjacency matrix E_k to 0, so that the adjacency matrix A_k dominates the initial stage of training, which is conducive to stabilizing the training process and improving the stability of the model. In conclusion, the enhanced adjacency matrix strategy not only maintains the predefined graph topology but also makes up for the absence of new connections in the predefined graph topology, enhances the model's ability to capture the features of global nodes, and reduces the complexity of the model. In addition, the bottleneck block structure, which has shown its effectiveness in the work of Song et al. [17], was introduced into the EAMGCN to further improve the efficiency of the model.
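The enhancement step itself is a small operation; a sketch in NumPy follows (the function name is hypothetical, and the clip to [0, 1] stands in for the constraint that is enforced on E_k during training):

```python
import numpy as np

def enhanced_adjacency(A_bar_k, E_k):
    """G_k = A_bar_k + clip(E_k, 0, 1): the learnable quasi-adjacency matrix
    E_k adds virtual connection strengths at the zero entries of A_bar_k."""
    return A_bar_k + np.clip(E_k, 0.0, 1.0)

N = 5
A_bar = np.eye(N)            # stand-in normalized adjacency matrix
E = np.zeros((N, N))         # initialized to 0 so A_bar dominates early training
G0 = enhanced_adjacency(A_bar, E)   # identical to A_bar at initialization
E[0, 4] = 0.3                # after some training: a virtual long-range link
G1 = enhanced_adjacency(A_bar, E)   # joint 0 now attends to distant joint 4
```

Because E_k starts at zero, the network behaves exactly like the predefined-topology baseline at initialization and only gradually learns the long-range connections.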
3.5. Feature Selection Fusion Module
So far, multi-stream framework models have shown great potential to extract rich behavior features and have achieved excellent performance in action recognition tasks. For example, Shi et al. [16] used joint and bone information to build a two-stream framework model, which significantly improved recognition accuracy. The essence of multi-stream frameworks is data enhancement, which provides rich feature information for models. However, the richer the feature information, the greater the number of parameters to be adjusted during training, which increases the calculation cost and the difficulty of parameter optimization. An early-fused multi-input branch structure was proposed [17], where the fused features are input into a mainstream network, significantly reducing the complexity of the model while retaining rich input features. However, human behavior contains a series of continuous motions, and an action stream contains movement information in multiple stages; different input characteristics have different degrees of importance for different sub-stages of movement. The above-mentioned early-fused multi-input branch architecture could reduce the complexity of the model and the number of training parameters, but its direct fusion of multiple input features still caused input feature redundancy, which indirectly led to difficulty in extracting rich identification features, thus affecting the recognition accuracy. Sometimes, the motion speed of adjacent nodes is more crucial for distinctions such as that between "walk" and "run"; in this situation, the model should pay more attention to the input feature of the motion speed of nodes during training rather than to other, more redundant information.
To solve the above issue, we designed a module called the FSFM, which is essentially a channel attention mechanism. Via the FSFM, the model can match an appropriate fusion ratio to each input feature before multi-input feature fusion, enriching the discriminant features of the recognition model. The calculation formula can be expressed as in (11):

f_out = α(W(AvgPool(f_in))) ⊙ f_in,    (11)

where f_in and f_out represent the input and output features, AvgPool(·) represents the average pooling operation, α represents the sigmoid function and W represents the weight vector of the convolution operation.
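A minimal NumPy sketch of this channel attention, assuming a dense weight matrix W in place of the convolution (the function name `fsfm` is illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fsfm(f_in, W):
    """Formula (11) sketch: f_out = sigmoid(W . AvgPool(f_in)) * f_in.
    f_in: (C, T, N) fused multi-branch features; W: (C, C) channel weights."""
    pooled = f_in.mean(axis=(1, 2))          # global average pooling -> (C,)
    ratio = sigmoid(W @ pooled)              # per-channel fusion ratio in (0, 1)
    return f_in * ratio[:, None, None]       # recalibrate each input channel

C, T, N = 6, 4, 5
f = np.random.randn(C, T, N)
out = fsfm(f, np.eye(C))
```

Because the sigmoid output lies in (0, 1), each channel is attenuated in proportion to its learned fusion ratio rather than amplified, which keeps the recalibration stable.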
3.6. Combinatorial Attention Mechanism
From the perspective of data structure, skeleton data comprise a temporal sequence composed of the 3D coordinates of human joints. For different action categories, different attention mechanisms should be adopted based on the joints, frames and channels of skeleton sequences under different training conditions in order to extract the relatively critical features. For instance, a spatial–temporal-channel attention module was introduced to determine the distinguishing features in skeleton sequences based on the data structure [21]. On the other hand, movement has integrity and continuity. When a person is moving, the movement is usually dominated by some parts of the body, with other body parts cooperating to complete a series of actions. For example, when it comes to drinking water, the upper limbs of the human body are definitely more important than the lower limbs. Therefore, we need to pay different amounts of attention to the node features of different limb parts in skeleton sequences during human movement. Some skeleton-based action recognition methods use a single attention mechanism to discover the key information of skeleton sequences and do not comprehensively consider the node features of human movement from multiple perspectives. Moreover, different graph convolution layers contain different semantic information; if only a single attention mechanism were taken into account, the model would become dependent on it and smooth the output features as the graph convolution layers deepen.
In view of the above-mentioned facts, we propose a combinatorial attention mechanism that focuses on the node features of human motion from multiple perspectives and enhances the model's ability to understand the semantic information contained in different graph convolution layers. Specifically, in the multi-input branch stage of the model, a spatial–temporal attention module is designed to determine the key node features in skeleton sequences from the perspective of data structure and to process the semantic information in the spatial and temporal dimensions of the graph convolution, so that rich feature information can be extracted prior to the stage of feature fusion. The calculation formula can be expressed as:

f_out = f_in ⊙ β(W_s(AvgPool(f_in))) ⊙ β(W_t(AvgPool(f_in))),    (12)

where f_in and f_out represent the input and output features, AvgPool(·) represents the average pooling operation, β represents the sigmoid function, and W_s and W_t represent the weight vectors of the convolution operations in the spatial and temporal dimensions, respectively.
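The two gates of the ST attention module can be sketched as below, with dense weight matrices standing in for the spatial and temporal convolutions (the function name and pooling axes are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def st_attention(f_in, W_s, W_t):
    """Sketch of the ST attention module: sigmoid gates over the spatial
    (joint) and temporal (frame) axes of f_in, shaped (C, T, N).
    W_s: (N, N) joint weights; W_t: (T, T) frame weights."""
    joint_desc = f_in.mean(axis=(0, 1))          # pool channels + frames -> (N,)
    frame_desc = f_in.mean(axis=(0, 2))          # pool channels + joints -> (T,)
    a_s = sigmoid(W_s @ joint_desc)              # spatial attention weights
    a_t = sigmoid(W_t @ frame_desc)              # temporal attention weights
    return f_in * a_t[None, :, None] * a_s[None, None, :]
```

Each joint and each frame receives its own multiplicative weight, so the module can emphasize discriminative joints and time steps independently before the branches are fused.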
In addition, in order to determine those body parts carrying more effective information in skeleton sequences of human movement, a limb attention module is designed in the mainstream network stage of the model; its structure is shown in Figure 3. In this new design, the relations between local and global semantic information in graph convolutional layers can be better learned. The implementation steps are as follows: (1) All node features of the human body are average-pooled in the temporal dimension, and then node features are extracted in the spatial dimension through a 2D convolution layer plus a BatchNorm layer. (2) The skeleton is divided into trunk, upper limb and lower limb sections according to the human body structure. The trunk includes the head and spine nodes of the skeleton graph, the upper limb section includes the left and right arm nodes, and the lower limb section includes the left and right leg nodes. This allows for a more compact representation of the skeleton graph of the human body and a better understanding of human behavior, since classifying the symmetrical body parts into the same section corresponds to the coordination features of human movement. (3) The attention matrix is calculated by a 2D convolutional layer plus a BatchNorm layer, and the attention of the body parts is determined by the Softmax function. (4) The node features of the three sections are connected into an output feature matrix with different attention weights. The limb attention module can be represented by the following formulas:

f′ = μ(W(Pool(f_in))),

f_out = f_in ⊙ Concat_{l=1,2,3}( λ(W_l(f′_l)) ),

where f_in and f_out represent the input and output features, ⊙ represents element-level multiplication, Pool(·) represents the adaptive average pooling operation, λ represents the Softmax function, μ represents the ReLU activation function, W denotes the weight vector of the convolution operation on all node features and W_l represents the weight vector of the convolution operation on each limb part.
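The steps above can be sketched as follows in NumPy. This is a simplified stand-in: the ReLU on pooled features replaces the conv + BatchNorm stack, the `parts` grouping is illustrative, and the per-section rescaling by section size is one possible design choice for keeping a uniform section unchanged:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def limb_attention(f_in, parts):
    """Sketch of the LAM for f_in of shape (C, T, N). `parts` groups joint
    indices into trunk / upper limb / lower limb sections, e.g.
    [[0, 1, 2], [3], [4]] for a toy 5-joint skeleton."""
    pooled = f_in.mean(axis=1).mean(axis=0)      # pool time, then channels -> (N,)
    scores = np.maximum(pooled, 0.0)             # ReLU, standing in for conv + BN
    att = np.ones(f_in.shape[2])
    for part in parts:
        # Softmax over the joints inside one body section; rescaled by the
        # section size so a uniform section keeps its features unchanged.
        att[part] = softmax(scores[part]) * len(part)
    return f_in * att[None, None, :]
```

Computing the Softmax per section, rather than over all joints, lets the module redistribute attention among the joints of one limb without suppressing other limbs entirely.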
In summary, the input branches first obtain rich features through the spatial–temporal attention module and then fuse them in a proper ratio in the feature selection fusion module proposed in Section 3.5. The fused features are further processed by the mainstream network via the limb attention module. Finally, through the combination of these two attention modules, we obtain the proposed enhanced adjacency matrix-based graph convolutional network with a combinatorial attention mechanism (CA-EAMGCN), which significantly improves recognition performance.
3.7. Model Architecture
Figure 4 shows the overall structure of the network. It is mainly composed of a multi-input branch module and a mainstream network. The multi-input branch module comprises 3 stacked basic blocks, with 64, 64 and 32 output channels, respectively. The mainstream network consists of 6 stacked basic blocks with 128, 128, 128, 256, 256 and 256 channels, respectively.
Figure 2 shows the structural composition of each basic block, with the LAM embedded only in the basic block of the mainstream network. We start by adding a BatchNorm layer to normalize the input data, and end with a fully connected layer to classify action categories. The area bordered by the green dashed line indicates the spatial–temporal attention module, the area bordered by the black dashed line represents the feature selection fusion module and the area bordered by the blue dashed line represents the graph convolution basic block embedded with the limb attention module.
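The overall data flow can be sketched as follows, with placeholder per-channel linear maps standing in for the EAMGCN basic blocks (the `block` and `forward` functions, and the 6-channel branch inputs, are illustrative assumptions; the channel counts match the architecture described above):

```python
import numpy as np

def block(f, c_out):
    """Placeholder basic block: a fixed per-channel mixing map that changes
    the channel count of f, shaped (C, T, N), to c_out."""
    C = f.shape[0]
    W = np.ones((c_out, C)) / C
    return np.einsum('oc,ctn->otn', W, f)

def forward(branches):
    """branches: 3 input streams (joint, velocity, bone), each (6, T, N)."""
    outs = []
    for f in branches:
        for c in (64, 64, 32):                   # multi-input branch module
            f = block(f, c)
        outs.append(f)
    f = np.concatenate(outs, axis=0)             # fusion (FSFM would reweight here)
    for c in (128, 128, 128, 256, 256, 256):     # mainstream network
        f = block(f, c)
    return f.mean(axis=(1, 2))                   # pooled feature before the FC layer
```

The 256-dimensional pooled feature is what the final fully connected layer would classify into action categories.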