Human Pose Estimation Based on a Spatial Temporal Graph Convolutional Network

Wu, Meng; Shi, Pudong

doi:10.3390/app13053286

Open AccessArticle

Human Pose Estimation Based on a Spatial Temporal Graph Convolutional Network

by

Meng Wu

^1,2,*

and

Pudong Shi

¹

School of Information and Control Engineering, Xi’an University of Architecture and Technology, Xi’an 710055, China

²

Institute for Interdisciplinary and Innovate Research, Xi’an University of Architecture and Technology, Xi’an 710055, China

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2023, 13(5), 3286; https://doi.org/10.3390/app13053286

Submission received: 1 February 2023 / Revised: 25 February 2023 / Accepted: 2 March 2023 / Published: 4 March 2023

(This article belongs to the Section Computing and Artificial Intelligence)

Download

Browse Figures

Versions Notes

Abstract

:

To address the problem of poor detection and under-utilization of the spatial relationship between nodes in human pose estimation, a method based on an improved spatial temporal graph convolutional network (ST-GCN) model is proposed. Firstly, upsampling and segmented random sampling strategies are used to effectively solve the problems of class imbalance and the large sequence length of the dataset. Secondly, an improved detection transformer (DETR) structure is added to effectively suppress the generation of non-maximal suppression (NMS) and anchor points, a multi-head attention (M-ATT) module is introduced into each ST-GCN cell to capture richer feature information, and a residual module is introduced into the 9th ST-GCN cell to avoid possible network degradation in deep networks. In addition, strategies such as warmup, regularization, loss functions, and optimizers are configured to improve the model’s performance. The experimental results show that the average percentage of correct keypoints (PCK) of this method are 93.2% and 92.7% for the FSD and MPII datasets, respectively, which is 1.9% and 1.7% higher than the average PCK of the original ST-GCN method. Moreover, the confusion matrix corresponding to this method also indicated that the model has high recognition accuracy. In addition, comparison experiments with ST-GCN and other methods show that the computation of the model corresponding to this method is about 1.7 GFLOPs and the corresponding MACs are about 6.4 GMACs, which is a good performance.

Keywords:

human pose estimation; graph convolutional network; transformer; attention mechanism; residual network

1. Introduction

In recent years, computer vision has become increasingly mature, with the successful development of artificial intelligence technology and a dramatic increase in the computing power of hardware devices. A more fundamental task in this field, human pose estimation, is gradually becoming a hot topic. Thanks to the development of video capture devices and networks, it is essential to analyze and understand human posture from video data or image data [1] and then prepare it for tasks such as motion recognition, anomaly detection, autonomous driving, and human–computer interactions.

The essence of human pose estimation is to detect the essential parts and major joints of the human body in the video or picture, such as the eyes, nose, hands, shoulders, knees, feet, etc. It generally includes single-person scene detection and multi-person scene detection, the former of which is relatively simple and only requires the algorithm to extract the feature points of body parts and create a connected pose. In contrast, multi-person scene detection is more complex, mainly consisting of “top-down” and “bottom-up” methods. “Top-down” refers to identifying all the bodies in the video or picture first, followed by identifying the feature points of each person. The “bottom-up” approach first identifies all the feature points in the video or image [2] and then corresponds all the feature points to the relative human body through an algorithm.

In this paper, we take the application scenario of human pose estimation, i.e., an anomaly detection task, as there is no fixed basis for determining “abnormal human behavior”. For example, actions occurring at different positions in a fixed scene, actions occurring at different moments in a fixed scene, and different actions occurring at the same position and at the same moment in a specific scene can be judged as “abnormal behavior”. This is because whether human or an animal, behavior often depends on posture, movement, and the environment. For example, drawing on a blackboard is normal behavior, while drawing on a monument is judged as “abnormal”.

In addition, the lower probability of abnormal behavior compared to normal behavior makes data acquisition more difficult, which leads to an imbalance of positive and negative samples and makes it difficult for the model to learn enough features of abnormal behavior. In addition, the model performance is limited by the lighting, angle, and object movement speed [3]. Most detection methods also face the problems of complex behavior recognition, a multi-parameter search space, and high computational cost [4]. This is due to the differences in the appearance of the human body and the need to ensure multiple targets associated with time sequence are not skipped or missed when detecting different behaviors in video or image data, which leads to a sharp increase in the algorithm parameters and computation, making the training process increasingly complicated and ultimately reducing the performance of the model.

For example, the supervised convolutional neural network proposed by Newell et al. [5] to accomplish detection and grouping tasks can perform pose estimation in multi-person scenes. The AlphaPose [6] algorithm proposed by Shanghai Jiao Tong University can perform self-body estimation tasks while ensuring a balanced real-time performance and pose estimation accuracy. Park et al. [7] used convolutional neural networks to connect 2D pose results with image features for end-to-end learning to solve 3D human pose estimation tasks.

In addition, the human behavior detection algorithm proposed by Satybaldina [8] to detect the front of ATMs based on gesture features has poor model performance due to the influence of the surrounding environment, such as pedestrians, fallen leaves, and lighting. Although the algorithm proposed by Chengfei [9] to estimate the human posture of students in schools has improved significantly in model performance, it has many parameters and the model training process is complicated. The DNN model proposed by Toshev et al. [10] uses the AlexNet network to capture different joint features; however, the weight parameters obtained by this method are more coupled with the training data distribution.

In contrast, graph convolutional networks are more adept at handling typical non-Euclidean structured data such as human skeleton information. For example, the ST-GCN algorithm based on human skeleton joints proposed by Yan [11] combines a graph convolutional network (GCN) [12] and a temporal convolutional network (TCN) [13]. It extends to the spatial-temporal graph model, which forms a hierarchical representation of the skeleton sequence through a spatial-temporal graph. The GCN module is used to learn the local features of adjacent joints in space, and the TCN module is used to learn the local features of joint changes with time.

The space-time-separable graph convolutional network (STS-GCN) proposed by Sofianos et al. [14] includes the temporal evolution and the spatial joint interaction within a single-graph framework, which favors the cross-talk of space and time, while bottlenecking the space-time interaction allows to better learn the fully-trainable joint and time interactions. A novel graph-based method was proposed by Y. Cai et al. [15] to tackle the problem of 3D human body and 3D hand pose estimation from a short sequence of 2D joint detections. The method first uses a pre-trained cascaded pyramid network [16] for 2D pose prediction and then feeds it into a spatial temporal graph convolution network. The spatial temporal graph convolution connects each joint to its corresponding joint in adjacent frames in the time dimension and connects the joint points in each frame that have direct and indirect kinematic dependencies (adjacent and symmetric) in the spatial dimension. The joints are also classified into six classes, and different convolution kernels are used for training for different node classes.

The manuscript structure is shown in Figure 1 and the rest of this paper is organized as follows. In Section 2, we introduce the work related to data preprocessing and model structure design. In Section 3, we introduce the improved algorithm in detail including the model structure, the model configuration strategy, and the model training strategy. Section 4 is the experimental part, including an introduction to the dataset used in the improved algorithm, model evaluation indicators, comparison experiments, ablation experiments, and experimental results analysis. Furthermore, finally, we conclude the paper in Section 5.

2. Related Work

2.1. Data Preprocessing

In this paper, we will test two datasets, MPII [17] and FSD [18], as shown in Figure 2, which also shows the proportion of different categories of data in the FSD dataset. An up-sampling strategy is used in the data preprocessing stage to address the problem of unbalanced data categories in the FSD training set. It will randomly repeat a small number of sample categories so that the data in the processed training set have the same percentage in each category.

For a sample x in the minority class, the distance from the sample to all samples in the minority class sample set is calculated using the Euclidean distance as the criterion to obtain its k-nearest neighbors. Then, a sampling ratio is set according to the sample imbalance ratio to determine the sampling multiplier N (N should be less than the number of minority samples x minus 1). For each minority class sample x, some samples are randomly selected from its k-nearest neighbors. Assuming that the selected nearest neighbors are

\tilde{x}

, each randomly selected nearest neighbor

\tilde{x}

can be constructed with samples x according to Equation (1) (

x_{n e w}

):

x_{n e w} = x + r a n d (0, 1) * (\tilde{x} - x)

(1)

The model’s accuracy on the test set with or without the upsampling strategy is shown in Table 1, and it can be seen that the accuracy of the model on the test set is improved after processing with the upsampling strategy.

In addition, since the critical point sequence length in the FSD training set data varies greatly, as shown in Figure 3, it is necessary to select a suitable sequence length to input into the model. However, the maximum value of sequence length in the training set data is 2037 frames, and the minimum value is 42 frames, in which fewer than 500 frames account for 27% of the samples. Therefore, the sample data can be divided into 350 intervals according to the effective frame length. A random frame is sampled from each interval to form a sample frame with a frame length of 350 during training, and it is used as the input of the model, which can effectively improve the model performance, as shown in Table 2.

It is worth noting that since the original training data only contains human joint coordinates and the confidence degree, this paper draws on the angular feature extraction and human activity model design methods proposed by Zhenyue Qin et al. [19] for AngNet regarding joint angle and He Jian et al. [20] for a human activity model based on the Cartesian coordinate system, respectively.

2.2. Design Model Structure

A technical view of the improved ST-GCN network is shown in Figure 4, which contains the improved algorithm structure, the algorithm application, and the algorithm model deployment. The algorithm structure combines OpenPose, DETR [21], and M-ATT [22]. An improved algorithm structure diagram, that redesigns both the model configuration and the model training strategy to address the problems of a high computational complexity, the difficulty in modeling multi-level semantic information, and the overlooked relationship between non-physically connected nodes (which leads to lower model accuracy), and more details of this algorithm structure are described in detail in Section 3. The algorithm application section shows the scenario tasks to which the improved algorithm proposed in this paper can be applied, including anomalous behavior detection [23], motion tracking [24], intelligent security [25], and so on. The algorithm model deployment section shows the tool platforms on which the improved algorithm can be deployed, including OpenVino [26], TensorRT [27], MediaPipe [28], and so on.

3. Proposed Methods

The flowchart of this algorithm is shown in Figure 5, the general flow is as follows. Firstly, the skeletal points are extracted frame by frame from the dataset used in this paper by the OpenPose algorithm, and the skeletal points are used as the pre-input of the improved ST-GCN network. Then, the temporal and spatial dimensions are transformed by the ST-GCN unit, and the TCN and GCN modules are used alternately to transform the joint features. It is worth noting that this algorithm also introduces the multi-headed attention mechanism and residual network and redesigns the transformer DETR structure. Finally, after processing the ST-GCN unit, the output results go through the average pooling layer, the fully connected layer and finally the softmax to complete the output.

3.1. Overview of Model Structure

The network structure of the improved ST-GCN is shown in Figure 6, and its core idea contains the following three aspects. Firstly, in terms of the network model structure, the improved network model still maintains the original ST-GCN nine-layer structure. The TCN module is used alternately with the GCN module with the following differences: Firstly, an improved DETR structure is added, i.e., the OpenPose algorithm module is used to complete skeletal point extraction frame by frame and replace the previous CNN module, which can detect all objects in parallel and avoid the generation of NMS and anchor points. Secondly, the improved network changes the attention model before the graph convolution operation to M-ATT and adds batch normalization (BN) [29] and rectified linear units (ReLU) [30] between the GCN module and the TCN module. The structure of the first eight ST-GCN units remains the same. Thirdly, a 9th ST-GCN unit is introduced into the residual network. From the training strategies, the expressiveness, detection performance, and generalization ability of the model are significantly improved by using warmup [31], stepwise-cosine, mix-up [32], and regularization strategies.

3.1.1. End-to-End Detection with an Improved DETR

Most detection methods perform detection based on detectors such as an anchor or Dense Box. However, since these methods perform repeated detection of the same target body, i.e., redundant detection frames are predicted, this problem must be addressed by means such as NMS. In contrast, the DETR structure proposed by Carion et al. does not require prediction of a candidate anchor and directly uses the model for violent regression.

Therefore, this method combines DETR and improves it. The whole DETR architecture consists of extracting skeletal points frame by frame by the OpenPose algorithm and encoder-decoder architecture, which does not require any custom network structure and simplifies the training process. It only requires a fixed set of object queries (query object); the DETR will predict the results in parallel based on the relationship between the target object and the global image context. The structure of the improved DETR is shown in Figure 7.

The input data will first be extracted by the OpenPose algorithm frame by frame and then combined with positional encoding as the input data of the encoder. The decoder will use a small number of fixed query objects and the encoder output as the input of the decoder at the same time. The decoder-processed data will flow into the improved ST-GCN-1 unit and be used as the input of the improved ST-GCN network. In the original ST-GCN network, which contains nine layers, the GCN and TCN modules are used alternately to transform the temporal and spatial dimensions. In the GCN module, a random division is used to combine the graph convolution operations of three subgraphs into one, which effectively improves the computational performance of the algorithm.

The DETR uses a decoder to predict a fixed number of detection frames; it does not need to use post-processing means such as NMS. It uses the set loss function as a supervised signal for end-to-end training, where the set loss function uses the bipartite matching algorithm to match the prediction object with the ground truth object, as shown in Equation (2):

\hat{σ} = a r g min \sum_{i}^{N} L_{m a t c h} (y_{i}, {\hat{y}}_{σ_{i}})

(2)

where y denotes the ground truth target set,

{\hat{y}}_{σ_{i}}

denotes the N elements in the prediction set, and

L_{m a t c h} (y_{i}, {\hat{y}}_{σ_{i}})

denotes the loss value of the prediction and ground truth matching about element i of

σ_{i}

. The matching effect is shown in Figure 8.

Figure 8 contains the correspondence between two sets of prediction and ground truth, each set contains N target bodies, and each target body contains two elements, which are the confidence level of the class to which the target body belongs (

p r o b s_{i}

, where

i \in [1, N]

) and the location and size of the target body (

b b o x_{i}

, where

i \in [1, N]

, containing the center point of location coordinates and the width and height of the detection box). After matching the elements in the two sets, for example, (

p r o b s_{1}, b b o x_{1}

) in the prediction sets corresponds to (

p r o b s_{3}, b b o x_{3}

) in the ground truth, etc. (where ⌀ indicates that the class is empty), the Hungarian algorithm can be used to calculate its loss value, which abstracts the target detection task as a set prediction problem and then predicts all targets simultaneously and trains them end-to-end by set loss, thus simplifying the model training process and effectively avoiding problems such as anchor and NMS.

3.1.2. Stabilizing the Training Process Using Multi-Head Attention

The original ST-GCN network is designed with a layer of attention model to weigh the human torso before the graph convolution operation on the data. In contrast, M-ATT is introduced in the improved ST-GCN proposed in this paper, which can ensure that the DETR notices the information of different subspaces, which is beneficial for the network to capture richer feature information. It uses GCN and TCN alternately to transform the temporal and spatial dimensions, up-dimensionalize the human joint feature dimension, and down-dimensionalize the key frame dimension. Finally, the output result of the improved ST-GCN unit is output after the average pooling layer, the fully connected layer, and SoftMax, as shown in Figure 9, which is the M-ATT structure diagram.

Each head in M-ATT contains its parameters, and the integration of each M-ATT output is as follows: ST-GCN-1 to ST-GCN-8 are integrated by “splicing”, and ST-GCN-9 is integrated by “averaging”. When given

q \in R^{d_{q}}

,

k \in R^{d_{k}}

, and

v \in R^{d_{v}}

, each attention

h e a d_{i} (i = 1, 2 \dots h)

is calculated as shown in Equation (3):

h e a d_{i} = f (W_{i}^{(q)} q, W_{i}^{(k)} k, W_{i}^{(v s .)} v) \in R^{P_{v}}

(3)

where f denotes a function of convergent attention and

W_{i}^{(q)}

,

W_{i}^{(k)}

, and

W_{i}^{(v s .)}

are learnable parameters.

Next, the results of each attention are stitched together and subjected to a linear transformation to obtain the final result, as shown in Equation (4).

M u l t i H e a d (q, k, v) = C o n c a t (h e a d_{1}, h e a d_{2} \dots h e a d_{h}) W^{o}

(4)

The application of M-ATT in improved ST-GCN is shown in Figure 10 (Figure 10 shows the ST-GCN-1 unit).

3.1.3. Using the Residual Network

Each cell of the original ST-GCN network will have a residual module and use dropouts for feature processing to enhance the spatial temporal information. This method is based on the structure of the original ST-GCN network. It adds a residual network in the ninth cell, which is used to solve the phenomenon of network degradation in deep networks and improve the model’s accuracy, as shown in Equation (5), which is the primary representation of a residual block.

x_{l + 1} = h (x_{l}) + F (x_{l}, W_{l})

(5)

h (x_{l}) = W_{l}^{^{'}} x

(6)

where

h (x_{l})

denotes the direct mapping, which is defined in Equation (6).

W_{l}^{^{'}}

in Equation (6) denotes the 1*1 convolution operation and

F (x_{l}, W_{l})

is the residual part, as shown in Figure 11 (a structure diagram of the residual network designed in this method).

3.2. Model Configuration Strategies

3.2.1. Loss Function

In the selection of the loss function, this paper combines the characteristics of the pose estimation task, assuming that the set of real poses of a human body is

\{g_{1}, g_{2} \dots g_{n}\}

, where the parameter n is the batch size. The anchor pose is denoted as

\{a_{1}, a_{2} \dots a_{m}\}

, where the parameter m indicates the number of anchor poses. The set of anchor pose labels is

\{l_{1}, l_{2} \dots l_{n}\}

, the set of output results corresponding to the regression branch is

\{y_{1}, y_{2} \dots y_{n}\}

, the set of output results corresponding to the classification branch is

\{p_{1}^{i}, p_{2}^{i}, \dots p_{m}^{i} ∣ i \in [1, n]\}

, and the value corresponding to the regression amount is the difference between the true pose and the corresponding anchor pose

\{g_{i}^{^{'}} = g_{i} - a_{l_{i}} ∣ i \in [1, n]\}

. The loss function of the regression branch adopts the mean square error loss

L_{r e g r e s s i o n} = \frac{1}{n} \sum_{i = 1}^{n} {‖ y_{i} - g_{i}^{^{'}} ‖}_{2}^{2}

, and the classification loss adopts the cross entropy loss

L_{c r o s s e n t r o p y} = - \frac{1}{n} \sum_{i = 1}^{n} \sum_{j = 0}^{m} l_{i} l n (p_{j}^{i})

, where

y_{i}

denotes the output value corresponding to the ith data regression branch,

g_{i}^{^{'}}

denotes the regression volume of the ith data,

l_{i}

indicates the label value of the i th data anchor pose, and

p_{i}^{j}

indicates the probability that the output of the classification branch of the ith data belongs to the jth anchor pose, as shown in Equation (7), which is the loss function used in this model.

L = L_{r e g r e s s i o n} \times θ_{l o c} + L_{c r o s s e n t r o p y}

(7)

Among these metrics, the likelihood function of the regression branch obeys a Gaussian distribution with a mean value of the regression branch output and variance of

θ_{l o c}

, where

θ_{l o c}

is a learnable parameter. In the process of model inference, the output result of the regression branch is used as the final output result of

x_{i} = y_{i} + a_{l_{i}}

, where

y_{i}

denotes the output result of the regression branch and

a_{l_{i}}

denotes the anchor pose corresponding to the input data.

3.2.2. Optimizer

When choosing an optimizer, the momentum-based SGD [33] algorithm is used in this paper instead of the original SGD algorithm because the update direction of the original SGD algorithm entirely depends on the gradient calculated by the current batch, which in turn leads to a less stable model. In contrast, the momentum-based SGD algorithm simulates the inertia of an object in motion by borrowing the concept of momentum from physics, preserves the direction of the previous update to some extent during the update, and fine-tunes the final update direction using the gradient of the current batch, i.e., replaces the real gradient with the momentum before accumulation.

The SGD algorithm based on momentum observes the historical gradient. It reduces the gradient in the current direction if the current gradient is not consistent with the historical gradient direction; conversely, if the current gradient is consistent with the historical gradient direction, the gradient in the current direction is increased. In general, at the beginning of model iteration, the current gradient of the model is consistent with the direction of the historical gradient, so momentum will play an important role in helping the model reach the optimal point more quickly. In contrast, at the later stage of model iterations, the model’s current gradient is inconsistent with the direction of the historical gradient. It oscillates around the convergence value, so momentum will play a decelerating role in increasing the stability of the model and thus stop the model from falling into a local optimum.

Assuming that SGD depends on the gradient of the current batch, the current gradient,

m_{t}

, is calculated as shown in Equation (8), where

g_{t}

denotes the gradient and

η

denotes the learning rate.

m_{t} = μ * m_{t - 1} + g_{t}

(8)

where

μ

is the momentum factor. In addition to this, the warmup strategy is also used as the method of learning the rate decay. The variation in learning rate with epoch value is shown in Figure 12. It can be observed that the learning rate gradually increases at the early stage of network model training, i.e., when

e p o c h < 10

, when the model will correct the data distribution, which helps to slow down the overfitting phenomenon that may occur in the initial stage of the model. When

e p o c h > 10

, the learning rate gradually decreases, when the model will converge better, which helps to maintain the stability of the model at a deeper level.

3.3. Model Training Strategies

In the model training strategy, the values of the model-related parameters are as follows: the

M a x E p o c h

value is 300, the

W a r m u p E p o c h

value is 10, the

W a r m u p S t a r t l r

value is 0.005, the

C o s i n e B a s e l r

value is 0.05, the

a l p h a

value is 0.2, and the

l s E p s

value is 0.1. For the epoch and

B a t c h S i z e

values, in training with the validation set data, the model accuracy increases with the increase in epoch value and increases with the decrease in

B a t c h S i z e

value. After comparison, the optimal value of epoch is 300 and the optimal value of

B a t c h S i z e

is 8.

In addition, the graphics card used for model training is an NVIDIA GeForce RTX 3070 Ti with the following software configuration: Python version 3.8, torch version 1.7.1, CUDA version 11.0, cuDNN version 8.0.5.39, torchvision version 0.82, ptflops version 0.6.9, pytorchModelSummary version 0.1.2, numpy version 1.18.5, matplotlib version 3.3.2, and PyCharm version 2020.2.

4. Experiments

4.1. Dataset Details

This paper will test the MPII and FSD datasets to enhance the model’s generalization performance. The MPII dataset was collected from YouTube videos covering 410 human activities (such as rock climbing, ice skating, fishing, etc.), containing 25,000 images with annotations. The annotation points contain ankle, knee, hip, shoulder, elbow, wrist, and 16 other types of annotations. The FSD10 dataset was constructed from the 2017 to 2018 World Figure Skating Championships, and the video frame rate was normalized to 30 frames per second with 10 action categories, including ChComboSpin4, 3Axel, FlyCameSpin4, 3Flip ChoreoSequence1, 3Loop, StepSequence3, 3Lutz, 2Axel, and 3Lutz-3Toeloop, with a resolution of 1080 × 720.

4.2. Evaluation Indicators

For the algorithmic models mentioned in this paper, the following four evaluation metrics will be used to evaluate them:

P C K

, Confusion Matrix, model computation, and

M A C s

.

The

P C K

indicator is used to measure the proportion of correctly estimated key points. Taking the total number of key points to be evaluated as

k (k > 0)

, the

P C K

indicator for the

i (i \in (0, k))

th key point is calculated as shown in Equation (9):

P C K_{i}^{k} = \frac{\sum_{p}^{} δ (\frac{d_{p i}}{d_{p}^{d e f}} \leq T_{k})}{\sum_{p}^{} 1}

(9)

where the parameter p denotes the pth individual, the parameter

T_{k}

denotes the manually set threshold (

T_{k} \in [0 : 0.01 : 0.1]

), the parameter k denotes the kth threshold, the parameter

d_{p i}

denotes the Euclidean distance between the predicted value of the ith key point of the pth individual and the manually labeled value, and the parameter

d_{p}^{d e f}

denotes the scale factor of the pth individual. In the MPII dataset, this parameter denotes the Euclidean distance between the upper left point of the head and the right Euclidean distance of the lower left point of the head, and the parameter

δ

takes a value of 1 if the condition holds and 0 otherwise.

The confusion matrix indicator is used to more clearly and intuitively portray the error indicators in the specific classification results. In a binary classification problem, if the classifier judges a positive case as a positive case, a true positive case (TP) is generated; if the classifier judges a negative case as a negative one, it is considered a true negative case (TN). The other two cases are called False Negative (FN) and False Positive (FP). The corresponding confusion matrices are shown in Table 3.

The amount of model computation indicates the number of computations of the model on the hardware unit, generally expressed as

F L O P s

(floating point operations per second), which for a convolutional layer is shown in Equation (10):

F L O P s_{c o n v} = k_{H} \times k_{W} \times C_{i n} \times C_{o u t} \times H_{o u t} \times W_{o u t}

(10)

where

k_{W}

and

k_{H}

denote the width and height of the convolution kernel, respectively,

C_{i n}

and

C_{o u t}

denote the number of channels of the input feature map and the output feature map, respectively, and

W_{o u t}

and

H_{o u t}

denote the width and height of the output feature map, respectively. For a fully connected layer, the

F L O P s

calculation is shown in Equation (11).

F L O P s_{F C} = C_{i n} \times C_{o u t}

(11)

MACs (multiply–accumulate computations) represent the cumulative number of multiplications and additions of the model.

1 M A C s

contains one multiplication and one addition, which are calculated as shown in Equation (12).

M A C s = C_{o u t} \times H_{o u t} \times W_{o u t} \times C_{i n} \times K_{H} \times K_{W}

(12)

In this paper, the ptflops and

p y t o r c h_m o d e l_s u m m a r y

modules provided by the torch deep learning framework will be used to calculate the computation and MACs of the model.

4.3. Experimental Results Analysis

4.3.1. Comparison Experiments

As shown in Table 4 and Table 5, the PCK values of the four network models of AGCN [34], CTR-GCN [35], ST-GCN, and the improved ST-GCN (proposed model) on the FSD and MPII test set data, respectively, are shown, where the values in the columns (Ankle*, Knee*, etc.) indicate the average values of the corresponding left and right nodes for that node. It can be seen that the average PCK of the AGCN network is 91.8% and 91.5% on both FSD and MPII datasets, respectively, the average PCK of the CTR-GCN network is 91.7% and 91.1% on both FSD and MPII datasets, respectively, and the average PCK of the ST-GCN network is 91.3% and 91.0% on both FSD and MPII datasets, respectively. In addition to the targeted preprocessing of the original dataset to improve as much as possible the model performance due to the irregularity of the dataset, the method mentioned in this paper redesigned the network model structure, network model configuration, model training, etc., so that the performance of the improved network model was improved to a certain extent on the FSD and MPII datasets. The average PCKs on the FSD and MPII datasets are 93.2% and 92.7%, respectively, as shown in Figure 13, and it can be seen that the method has a high recognition accuracy.

As shown in Figure 14 and Figure 15, the confusion matrices of the method on the FSD and MPII datasets show that the method has the highest probability of misclassifying the 3Loop action as FlyCamelSpin4 action among the 10 classifications of the FSD dataset (16%) and the highest probability of misclassifying the R-wrist node as L-knee action among the 16 classifications of the MPII dataset (14%). Among the 16 classifications of the MPII dataset, this method has the highest probability of misclassifying R-wrist nodes as L-knee actions, at 14%. In addition, the confusion matrix produced by this method on these two datasets has the largest element values on the diagonal, so it can be considered that this method has very good classification results.

The performance of the ST-GCN, AGCN, and CTR-GCN networks and the proposed method in this paper, which contains two indexes, model computation, and MACs, is shown in Table 6. It can be seen that the model computation of the proposed method is about 1.7 G, which is similar to the model computation of the proposed method in the CTG-GCN network. The MACs index of the proposed method is about 6.4 G, which is smaller than that of the CTG-GCN network; that is, the proposed method in this paper has a better performance.

4.3.2. Ablation Experiments

Before conducting the ablation experiment, this paper will compare the average PCK values of different numbers of attention heads on both FSD and MPII datasets, as shown in Table 7. It can be seen that when the number of attention heads is six, the average PCK values are the largest in both datasets, which are 92.1% and 92.0%, respectively.

In order to evaluate the experimental effectiveness of the proposed method more accurately, the average PCK values of the proposed method on both FSD and MPII datasets with different experimental settings are shown in Table 8.

I_{o r i g i n a l}

denotes the original ST-GCN network model,

I_{o r i g i n a l} + I_{D E T R}

denotes the network model after adding the improved DETR structure,

I_{o r i g i n a l} + I_{M A T T} (h = 6)

denotes the network model after adding the MATT module,

I_{o r i g i n a l} + I_{R e s}

denotes the network model after adding the residual network, and

I_{o r i g i n a l} + I_{R e s}

denotes the network model after adding the DETR, MATT

(h = 6)

, and residual modules, i.e., the method proposed in this paper. It can be seen that although adding

I_{o r i g i n a l}

,

I_{M A T T} (h = 6)

, or

I_{R e s}

can all improve the average PCK values of the corresponding network models on both FSD and MPII datasets,

I_{D E T R}

is more effective compared to

I_{M A T T}

and

I_{R e s}

, while the corresponding average PCK values are the largest when the methods mentioned in this paper are used simultaneously.

5. Conclusions

This paper proposes a human pose estimation algorithm based on an improved ST-GCN network. From the network model structure, an improved DETR structure is introduced. The OpenPose algorithm module is used to complete the skeletal point extraction frame by frame and replace the previous CNN module, which effectively suppresses the generation of NMS and anchor points. M-ATT is introduced to ensure that the transformer DETR can capture richer feature information. Introducing a residual network in the 9th ST-GCN unit in the improved ST-GCN avoids possible network degradation during training, and thus improves the model performance. From the training strategy, the problem of unbalanced FSD data categories is solved by the up-sampling strategy in the data processing stage, and the problem of inconsistent FSD data sequences is solved by the segmented random sampling. In addition, the loss function of the regression branch adopts the mean square error loss, the classification loss adopts the cross-entropy loss, and the momentum-based SGD algorithm is used as the optimizer and configured with warmup, mixup, and regularization strategies. The experimental results show that although

I_{D E T R}

,

I_{M A T T} (h = 6)

, and

I_{R e s}

all help to improve the average PCK of the model on the datasets, our proposed method corresponds to the largest average PCK values, with 93.2% and 92.7% on the FSD and MPII datasets, respectively, which is higher than the average PCK of the original ST-GCN method on these two datasets by 1.9% and 1.7%, respectively. The values of the elements on the diagonal of the confusion matrix obtained for these two datasets are the largest, so this method has a very good recognition effect. In addition, the corresponding model computation of this method is about 1.7 GFLOPs and the corresponding MACs is about 6.4 GMACs, which has a better performance compared with the other three methods analyzed in the paper.

Author Contributions

Conceptualization, M.W. and P.S.; methodology, M.W. and P.S.; software, P.S.; validation, P.S.; formal analysis, M.W.; writing—review and editing, P.S. and M.W.; visualization, P.S.; supervision, M.W. and M.W.; funding acquisition, M.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (no. 61701388), the Interdisciplinary Foundation of Xi’an University of Architecture and Technology (no. X20220082), and the Natural Science Foundation of Xi’an University of Architecture and Technology (no. ZR21032).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The datasets presented in this study are openly available on the website. MPII available online: http://human-pose.mpi-inf.mpg.de/#overview accessed on 1 March 2023, FSD available online: https://github.com/PaddlePaddle/PaddleVideo/blob/develop/docs/en/dataset/fsd.md#Download accessed on 1 March 2023.

Conflicts of Interest

The authors declare no conflict of interest.

References

Huang, K.Q.; Chen, X.T.; Kang, Y.F.; Tan, T.N. Intelligent Visual Surveillance: A Review. Chin. J. Comput. 2015, 38, 1093–1118. [Google Scholar]
Zhang, B.R. Research on Human Posture Estimation Model and Method Based on Machine Version. Master’s Thesis, North China University of Water Resources and Electric Power, Zhengzhou, China, 2021. [Google Scholar] [CrossRef]
Hu, K.; Jin, J.; Zheng, F.; Weng, L.; Ding, Y. Overview of Human Behavior Recognition Based on Deep Learning. Comput. Eng. Appl. 2022, 58, 14–26. [Google Scholar]
Su, J.Y.; Song, X.N.; Wu, X.J.; Yu, D. Skeleton Based Action Recognition Algorithm on Multi-modal Lightweight Graph Convolutional Network. J. Front. Comput. Sci. Technol. 2021, 15, 733–742. [Google Scholar]
Newell, A.; Huang, Z.A.; Deng, J. Associative embedding: End-to-end learning for joint detection and grouping. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; MIT Press: Cambridge, UK, 2017; pp. 2277–2287. [Google Scholar]
Fang, H.S.; Xie, S.; Tai, Y.W.; Lu, C. RMPE: Regional multi-person pose estimation. In Proceedings of the IEEE Iternational Conference on Computer Vision, Venice, Italy, 22–29 October 2017; IEEE Computer Society Press: Los Alamitos, CA, USA, 2017; pp. 2353–2362. [Google Scholar]
Park, S.; Hwang, J.; Kwak, N. 3D human pose estimation us-ing convolutional neural networks with 2D pose information. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; Springer: Heidelberg, Germany, 2016; pp. 156–169. [Google Scholar]
Satybaldina, D.Z.; Glazyrina, N.S.; Kalymova, K.A.; Stepanov, V.S. Development of an algorithm for abnormal human behavior detection in intelligent video surveillance system. In Conference Series: Materials Science and Engineering; IOP Publishing: Bristol, UK, 2021; Volume 1069, p. 012046. [Google Scholar]
Wu, C.; Cheng, Z. A novel detection framework for detecting abnormal human behavior. Math. Probl. Eng. 2020, 2020, 1–9. [Google Scholar] [CrossRef]
Toshev, A.; Szegedy, C. Deeppose: Human pose estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1653–1660. [Google Scholar]
Yan, S.; Xiong, Y.; Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907. [Google Scholar]
Hewage, P.; Behera, A.; Trovati, M.; Pereira, E.; Ghahremani, M.; Palmieri, F.; Liu, Y. Temporal convolutional neural (TCN) network for an effective weather forecasting using time-series data from the local weather station. Soft Comput. 2020, 24, 16453–16482. [Google Scholar] [CrossRef] [Green Version]
Sofianos, T.; Sampieri, A.; Franco, L.; Galasso, F. Space-time-separable graph convolutional network for pose forecasting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 11209–11218. [Google Scholar]
Cai, Y.; Ge, L.; Liu, J.; Cai, J.; Cham, T.J.; Yuan, J.; Thalmann, N.M. Exploiting Spatial-Temporal Relationships for 3D Pose Estimation via Graph Convolutional Networks. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2272–2281. [Google Scholar] [CrossRef] [Green Version]
Chen, Y.; Wang, Z.; Peng, Y.; Zhang, Z.; Yu, G.; Sun, J. Cascaded pyramid network for multi-person pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7103–7112. [Google Scholar]
Andriluka, M.; Pishchulin, L.; Gehler, P.; Schiele, B. 2d human pose estimation: New benchmark and state of the art analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 3686–3693. [Google Scholar]
Liu, S.; Liu, X.; Huang, G.; Feng, L.; Hu, L.; Jiang, D.; Zhang, A.; Liu, Y.; Qiao, H. FSD-10: A dataset for competitive sports content analysis. arXiv 2020, arXiv:2002.03312. [Google Scholar]
Qin, Z.; Liu, Y.; Ji, P.; Kim, D.; Wang, L.; McKay, B.; Anwar, S.; Gedeon, T. Fusing higher-order features in graph neural networks for skeleton-based action recognition. arXiv 2021, arXiv:2105.01563. [Google Scholar] [CrossRef] [PubMed]
He, J.; Guo, Z.; Liu, L.; Su, Y. Human Activity Recogni-tion Technology Based on Sliding Window and Convo-lutional Neural Network. J. Electron. Inf. Technol. 2022, 44, 168–177. [Google Scholar]
Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar] [CrossRef]
Fang, N.; Fang, X.W.; Lu, K. Anomalous Behavior Detection Based on the Isolation Forest Model with Multiple Per-spective Business Processes. Electronics 2022, 11, 3640. [Google Scholar] [CrossRef]
Irwin, C.; Gary, R. Systematic Review of Fitbit Charge 2 Validation Studies for Exercise Tracking. Transl. J. Am. Coll. Sport. Med. 2022, 7, 1–7. [Google Scholar] [CrossRef] [PubMed]
Li, H.; Zhu, M.; Yang, Y.; Tu, L.; Zhou, X.; Du, S. Research and application of nuclear fuel embedded intelligent security monitoring technology. Energy Rep. 2022, 8, 73–86. [Google Scholar] [CrossRef]
Demidovskij, A.; Tugaryov, A.; Kashchikhin, A.; Suvorov, A.; Tarkan, Y.; Mikhail, F.; Yury, G. OpenVINO Deep Learning Workbench: Towards Analytical Platform for Neural Networks Inference Optimization. J. Phys. Conf. Ser. 2021, 1828, 012012. [Google Scholar] [CrossRef]
Serdar, Y. Nvidia’s New TensorRT Speeds Machine Learning Predictions. 2017. Available online: https://www.infoworld.com/article/3203938/nvidias-new-tensorrt-speeds-machine-learning-predictions.html (accessed on 27 June 2017).
Samaan, G.H.; Wadie, A.R.; Attia, A.K.; Asaad, A.M.; Kamel, A.E.; Slim, S.O.; Abdallah, M.S.; Cho, Y.-I. MediaPipe’s Landmarks with RNN for Dynamic Sign Language Recognition. Electronics 2022, 11, 3228. [Google Scholar] [CrossRef]
Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, Lille, France, 7–9 July 2015; pp. 448–456. [Google Scholar]
Glorot, X.; Bordes, A.; Bengio, Y. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, Fort Lauderdale, FL, USA, 11–13 April 2011; pp. 315–323. [Google Scholar]
Xiong, R.; Yang, Y.; He, D.; Zheng, K.; Zheng, S.; Xing, C.; Zhang, H.; Lan, Y.; Wang, L.; Liu, T. On layer normalization in the transformer architecture. In Proceedings of the International Conference on Machine Learning, Virtual Event, 13–18 July 2020; pp. 10524–10533. [Google Scholar]
Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv 2017, arXiv:1710.09412. [Google Scholar]
Ruder, S. An overview of gradient descent optimization algorithms. arXiv 2016, arXiv:1609.04747. [Google Scholar]
Li, R.; Wang, S.; Zhu, F.; Huang, J. Adaptive graph convolutional neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
Chen, Y.; Zhang, Z.; Yuan, C.; Li, B.; Deng, Y.; Hu, W. Channel-wise topology refinement graph convolution for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 13359–13368. [Google Scholar]

Figure 1. Manuscript structure.

Figure 2. Percentage of different classes in FSD.

Figure 3. Distribution of video length.

Figure 4. Overview of improved ST-GCN.

Figure 5. Flowchart of the improved algorithm.

Figure 6. Network structure of the improved ST-GCN.

Figure 7. Structure of the improved DETR.

Figure 8. Relations between prediction and ground truth sets.

Figure 9. Structure of multi-head attention.

Figure 10. M-ATT in improved ST-GCN.

Figure 11. ResNet in improved ST-GCN.

Figure 12. Trends in learning rate.

Figure 13. Experiment results on MPII of our proposed model.

Figure 14. Confusion matrix results on FSD of our proposed model.

Figure 15. Confusion matrix results on MPII of our proposed method.

Table 1. Effects of up-sampling on model accuracy.

Up-Sampling Strategy	Test Set Accuracy
Yes	98.35%
No	97.23%

Table 2. Effects of frame length on model accuracy.

Frame Length	Test Set Accuracy
250	95.74%
400	94.39%
800	94.77%

Table 3. Confusion Matrix.

Confusion Matrix		True Results
Confusion Matrix		Positive	Negative
Prediction Results	Positive	TP	FP
Prediction Results	Negative	FN	TN

Table 4. PCK of different methods on the FSD dataset.

Method	Ankle*	Knee*	Hip*	Shoulder*	Wrist*	Elbow*	Thorax	Head Top	Mean
AGCN	90.2%	95.3%	90.2%	98.1%	88.6%	87.5%	89.1%	95.6%	91.8%
CTR-GCN	98.1%	96.1%	89.5%	90.7%	87.4%	89.8%	90.0%	91.8%	91.7%
ST-GCN	94.3%	93.0%	87.9%	87.6%	86.2%	90.9%	96.7%	94.0%	91.3%
Proposed	99.7%	99.1%	99.8%	87.6%	89.0%	91.2%	83.2%	95.7%	93.2%

Table 5. PCK of different methods on the MPII dataset.

Method	Ankle*	Knee*	Hip*	Shoulder*	Wrist*	Elbow*	Thorax	Head Top	Mean
AGCN	98.7%	95.4%	87.6%	89.7%	97.3%	89.2%	83.0%	91.2%	91.5%
CTR-GCN	93.4%	94.1%	89.4%	90.3%	90.9%	91.3%	85.2%	93.9%	91.1%
ST-GCN	90.0%	89.4%	93.7%	97.5%	89.7%	90.3%	84.0%	94.2%	91.0%
Proposed	99.1%	96.4%	99.8%	83.8%	90.3%	91.9%	85.4%	95.0%	92.7%

Table 6. Performance of the different methods.

Methods	FLOPs (G)	MACs (G)
AGCN	~2.1	~7.0
CTR-GCN	~1.7	~6.8
ST-GCN	~1.8	~6.7
Proposed	~1.7	~6.4

Table 7. Mean PCK of different attention head counts in FSD and MPII datasets.

h (Multi-Head)	Mean PCK of FSD	Mean PCK of MPII
2	90.6%	91.2%
4	91.7%	91.8%
6	92.1%	92.0%
8	91.0%	90.7%

Table 8. Mean PCK of our methods with different settings in the MPII dataset.

Methods	Mean PCK of FSD	Mean PCK of MPII
I $_{o r i g i n a l}$	91.3%	91.0%
I $_{o r i g i n a l}$ + I $_{D E T R}$	92.8%	92.3%
I $_{o r i g i n a l}$ + I $_{M A T T}$ (h = 6)	92.1%	92.0%
I $_{o r i g i n a l}$ + I $_{R e s}$	91.7%	91.4%
Proposed (I $_{o r i g i n a l}$ + I $_{D E T R}$ + I $_{M A T T}$ + I $_{R e s}$ )	93.2%	92.7%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Wu, M.; Shi, P. Human Pose Estimation Based on a Spatial Temporal Graph Convolutional Network. Appl. Sci. 2023, 13, 3286. https://doi.org/10.3390/app13053286

AMA Style

Wu M, Shi P. Human Pose Estimation Based on a Spatial Temporal Graph Convolutional Network. Applied Sciences. 2023; 13(5):3286. https://doi.org/10.3390/app13053286

Chicago/Turabian Style

Wu, Meng, and Pudong Shi. 2023. "Human Pose Estimation Based on a Spatial Temporal Graph Convolutional Network" Applied Sciences 13, no. 5: 3286. https://doi.org/10.3390/app13053286

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Human Pose Estimation Based on a Spatial Temporal Graph Convolutional Network

Abstract

1. Introduction

2. Related Work

2.1. Data Preprocessing

2.2. Design Model Structure

3. Proposed Methods

3.1. Overview of Model Structure

3.1.1. End-to-End Detection with an Improved DETR

3.1.2. Stabilizing the Training Process Using Multi-Head Attention

3.1.3. Using the Residual Network

3.2. Model Configuration Strategies

3.2.1. Loss Function

3.2.2. Optimizer

3.3. Model Training Strategies

4. Experiments

4.1. Dataset Details

4.2. Evaluation Indicators

4.3. Experimental Results Analysis

4.3.1. Comparison Experiments

4.3.2. Ablation Experiments

5. Conclusions

Author Contributions

Funding

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI