Article

MSTPose: Learning-Enriched Visual Information with Multi-Scale Transformers for Human Pose Estimation

School of Information Science and Engineering, Zhejiang Sci-Tech University, Hangzhou 310018, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(15), 3244; https://doi.org/10.3390/electronics12153244
Submission received: 26 June 2023 / Revised: 26 July 2023 / Accepted: 26 July 2023 / Published: 27 July 2023
(This article belongs to the Section Computer Science & Engineering)

Abstract

Human pose estimation is a complex detection task in which the network needs to capture the rich information contained in the images. In this paper, we propose MSTPose (Multi-Scale Transformer for human Pose estimation). Specifically, MSTPose leverages a high-resolution convolutional neural network (CNN) to extract texture information from images. For the feature maps at three different scales produced by the backbone network, a coordinate attention operation is applied to each branch. The feature maps are then flattened spatially and channel-wise, combined with keypoint tokens generated through random initialization, and fed into a parallel Transformer structure to learn spatial dependencies between features. Since the Transformer outputs one-dimensional sequential features, the mainstream two-dimensional heatmap method is abandoned in favor of one-dimensional coordinate vector regression. Experiments show that MSTPose outperforms other CNN-based pose estimation models and demonstrates clear advantages over CNN + Transformer networks of similar types.

1. Introduction

Human pose estimation is a crucial component in the field of computer vision, aiming to predict anatomical keypoints of the human body in 2D images. With the advancement of deep convolutional neural networks, the performance of pose estimation models has made significant progress. These models have gradually been applied to more complex scenarios, such as motion analysis [1,2,3] and human–computer interaction [4,5,6].
Currently, mainstream pose estimation models predominantly rely on CNN as the encoder to extract texture features. The feature maps are then decoded to higher resolutions using methods such as heatmap-based approaches or direct keypoint regression, which has become a widely adopted paradigm. The Hourglass model [7], for instance, stacks multiple Hourglass modules, each of which uses symmetric up-sampling and down-sampling combined with intermediate supervision to generate high-resolution feature maps. HRNet [8] employs parallel branches for feature maps of different resolutions while consistently maintaining the highest-resolution branch. However, because convolutional kernels operate locally, CNN can only capture dependencies within a limited receptive field. Although CNN excels at extracting texture features from images, it often lacks the capacity to learn spatial features effectively, so the network fails to fully exploit the information contained in the image. These limitations greatly constrain the potential of CNN-based models.
In recent years, Transformer [9] has achieved remarkable success in the field of natural language processing (NLP), continuously breaking records and topping various leaderboards. As a sequence-to-sequence model, Transformer exhibits strong modeling capabilities for dependencies between sequences. Furthermore, in the field of computer vision, Transformer excels at capturing the spatial features of images. The introduction of Vision Transformer (ViT) [10] marked the first application of Transformer to computer vision: the authors divided images into smaller patches, flattened them into sequences, and trained the Transformer on these sequences. This simple yet effective approach quickly attracted the attention of many researchers. However, the high resolution of images poses computational challenges for pure Transformer methods, leading to the emergence of CNN + Transformer networks. One of the most representative models in this category is TFPose [11], whose authors employ CNN as an encoder, flatten the extracted features along the channel dimension, feed them into a Transformer, and finally use regression to predict keypoints. We believe that combining CNN and Transformer is a better solution that leverages the strengths of both networks, striking a balance between speed and accuracy. However, mainstream CNN + Transformer models are still in their early stages, leaving room for exploration in network integration and regression approaches, so the potential of both networks is not fully realized. Based on these considerations, this paper proposes a novel network architecture called MSTPose, which aims to address the limitations of existing models.
MSTPose utilizes HRNet [8] as the backbone network, and the output of the backbone undergoes coordinate attention operations [12]. Considering the semantic differences between branches, the feature maps from the three branches are then fed into the MST module, which consists of a parallel Transformer structure. In the output phase, the conventional heatmap method is discarded to avoid the drawbacks of repeated dimensionality changes and the destruction of spatial structure when combining heatmaps with Transformer. Instead, this paper adopts the one-dimensional vector representation (VeR) validated in SimCC [13] to represent the keypoints. In summary, MSTPose learns texture features through CNN, captures spatial features of the images through the MST module, and employs one-dimensional vector regression to preserve the position-sensitive spatial sequential mapping structure of the Transformer output. This work addresses two limitations of previous networks: insufficient image feature extraction and the loss of spatial information in the Transformer output sequences. The main contributions of this paper are:
  • We propose MSTPose in this study, which fully leverages the characteristics of CNN and Transformer to enable the network to learn rich visual representations, thereby significantly improving the network’s modeling ability in complex scenes.
  • The coordinate attention mechanism is introduced at the output location of the backbone network to obtain position-sensitive feature maps, which helps the Transformer extract spatial features from images.
  • Considering the semantic differences between different branches, we propose the MST module. By using a parallel structure, different-scale branches are separately fed into the Transformer for training. This allows the network to capture more complex semantic information and improve its detection ability for different instances.
  • Conventional heatmap methods are discarded to overcome the drawback of repetitive dimensionality changes that disrupt the spatial structure of feature maps when combined with Transformer. Furthermore, we successfully integrate the VeR method with Transformer for the first time, resulting in improved predictive accuracy.
  • In this study, we test MSTPose on the primary public benchmark datasets, COCO and MPII, and achieve better performance compared to CNN-based and CNN + Transformer networks.

2. Related Work

2.1. CNN-Based Human Pose Estimation

In the field of human pose estimation, CNN-based methods have achieved tremendous success. Many early works extract image features by using CNN as an encoder. DeepPose [14] is the first to introduce CNN to the problem of pose estimation, proposing a cascaded structure of deep neural networks. In SimpleBaseline [15], the authors utilize transposed convolutions in the output part of the backbone network to generate higher-resolution feature maps for better pose estimation.
Unlike simple detection tasks, pose estimation crucially depends on capturing the global dependencies between features. Varun Ramakrishna et al. [16] propose a sequential prediction algorithm that simulates the mechanism of message passing to predict the confidence of each variable (part), iteratively improving the estimation at each stage. Tompson et al. [17] utilize the structural relationships between human keypoints and incorporate the idea of Markov random fields to optimize the prediction results. Wei et al. [18] introduce the CPM (convolutional pose machines) network with VGG [19] as the backbone, employing a jointly trained multi-stage architecture with intermediate supervision to learn the dependencies between keypoints. George Papandreou et al. [20] propose a box-free system based on fully convolutional networks, learning the offsets of keypoints through a greedy decoding process and grouping keypoints into human pose instances.
However, due to the local convolutional nature of CNN, its ability to capture global dependencies is limited. Another approach is to enlarge the receptive field of the feature maps, which can be achieved in various ways, such as multi-scale fusion [21,22,23,24] and high-resolution representation [25]. Yilun Chen et al. [23] present a cascaded pyramid model to obtain multi-scale features, ultimately performing pose estimation by up-sampling to high-resolution feature maps. Bowen Cheng et al. [25] propose HigherHRNet, which utilizes transposed convolutions to obtain higher-resolution feature maps to perceive small-scale objects.
As networks become increasingly complex, better methods are needed to capture image information more comprehensively. Compared to previous works that rely solely on CNN, the emergence of Transformer opens up new possibilities for pose estimation.

2.2. Transformer-Based Human Pose Estimation

Transformer is a feed-forward network based on the self-attention mechanism, which has achieved significant success in the field of NLP [26,27,28,29,30,31]. In recent years, with its introduction into the visual domain, we have witnessed the rise of Transformer-based vision models [10,32,33].
In the field of image segmentation, W. Wang et al. [34] propose a method combining Attention-Guided Object Segmentation (AGOS) and Dynamic Visual Attention Prediction (DVAP) for unsupervised video object segmentation. T. Zhou et al. [35] introduce MATNet, which employs a two-stream encoder to transform appearance features into motion-attentive features at each stage of convolution; a bridge network fuses the multi-level feature maps, yielding better segmentation results.
In the domain of object detection, N. Carion et al. [32] present the DETR model, which achieves higher detection accuracy by incorporating Transformer and employing a unique set prediction loss. To address the slow convergence speed and limited feature spatial resolution issues in [32], X. Zhu et al. [33] propose Deformable DETR, where the attention module focuses only on a small group of key sampling points around the reference, leading to improved performance.
In the field of human pose estimation, S. Yang et al. [36] introduce TransPose, using CNN as the encoder and incorporating Transformer for precise localization of human keypoints, capturing both short- and long-range dependencies between keypoints. W. Mao et al. [11] propose TFPose, which builds upon [36] and employs direct regression of keypoints for pose estimation. K. Li et al. [37] develop the end-to-end PRTR model, which employs cascaded Transformer networks for direct regression of keypoints. B. Shan et al. [38] propose the MSRT network, which splits and superimposes feature maps at different scales using the FAM module and utilizes Transformer for keypoint decoding.

3. Proposed Method

The structure of MSTPose is illustrated in Figure 1. MSTPose employs CNN as the encoder to extract features from the images. For the feature maps of the three output branches, the coordinate attention operation [12] is applied to each branch. Subsequently, they are passed through the MST module. Finally, the keypoint coordinates are decoded using one-dimensional vector regression [13].

3.1. Backbone Network

We adopt the first three stages of HRNetW48 as the feature extractor for MSTPose, named HRNetW48-s. The network starts with a stem consisting of two 3 × 3 convolutions, reducing the resolution to 1/4 of the original image size. Assume the input image is $X \in \mathbb{R}^{3 \times H \times W}$. For the third stage of the backbone network, the three branches output $X_{1,1} \in \mathbb{R}^{C_1 \times H_1 \times W_1}$, $X_{1,2} \in \mathbb{R}^{C_2 \times H_2 \times W_2}$, and $X_{1,3} \in \mathbb{R}^{C_3 \times H_3 \times W_3}$, where $(H_1, W_1) = (H/4, W/4)$, $(H_2, W_2) = (H/8, W/8)$, and $(H_3, W_3) = (H/16, W/16)$. These feature maps contain rich texture information.
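To make the branch resolutions concrete, the following minimal PyTorch sketch reproduces the stem and lists the resulting branch sizes for a 256 × 192 input; the exact layer configuration (channel widths, use of BatchNorm/ReLU) is an assumption in line with the public HRNet design, not the authors' released code.

```python
import torch
import torch.nn as nn

# Hypothetical stem: two stride-2 3x3 convolutions reduce the input to 1/4 resolution.
stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
)

x = torch.randn(1, 3, 256, 192)            # B x 3 x H x W
print(stem(x).shape)                       # torch.Size([1, 64, 64, 48]), i.e. H/4 x W/4

# The three HRNetW48-s branches then carry feature maps of sizes
# C1 x H/4 x W/4, C2 x H/8 x W/8 and C3 x H/16 x W/16
# (C1, C2, C3 = 48, 96, 192 for HRNetW48), here 48x64x48, 96x32x24 and 192x16x12.
```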

3.2. ATTM

Subsequently, the outputs $X_{1,i} \in \mathbb{R}^{C_i \times H_i \times W_i}$, $i \in \{1, 2, 3\}$, of the backbone network are fed into the attention module (ATTM). The ATTM consists of three parallel coordinate attention mechanisms, as illustrated in Figure 2.
Specifically, channel attention commonly utilizes global pooling to encode global spatial information, compressing the global information into a scalar, which makes it difficult to preserve important spatial details. Therefore, the input feature maps are first subjected to two one-dimensional pooling operations, and the formula is as follows:
$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i)$
$z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w)$
where the feature maps are aggregated separately along the vertical and horizontal directions, resulting in two distinct direction-aware feature maps. The two outputs are then concatenated and transformed using a 1 × 1 convolution, batch normalization, and a non-linear activation:
$f = \delta(F_1([z^h, z^w]))$
The feature f is then split into two independent features, $f^h$ and $f^w$, each of which is transformed by an additional 1 × 1 convolution and a sigmoid function so that its dimensionality matches that of the input:
$g^h = \sigma(F_h(f^h))$
$g^w = \sigma(F_w(f^w))$
Each attention map captures long-range dependencies along the spatial direction of the input feature maps. The attention maps are then applied to the input feature maps through multiplication, resulting in direction-aware and position-sensitive feature maps.
$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j)$
These feature maps aid in more accurate object localization and perception of objects of interest by the network. We represent the ATTM output $y_c(i, j)$ as $X_{2,i} \in \mathbb{R}^{C_i \times H_i \times W_i}$, $i \in \{1, 2, 3\}$.
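For reference, the coordinate attention operation described by the equations above can be sketched in PyTorch as follows. This is a re-implementation in the spirit of [12]; the reduction ratio and the choice of ReLU as the non-linear activation δ are assumptions.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Sketch of the coordinate attention operation [12] used in ATTM."""
    def __init__(self, channels: int, reduction: int = 32):   # reduction ratio is assumed
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # average along width  -> B x C x H x 1
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # average along height -> B x C x 1 x W
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)                # non-linear activation (delta)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, _, h, w = x.shape
        z_h = self.pool_h(x)                            # 1-D pooling: one descriptor per row
        z_w = self.pool_w(x).permute(0, 1, 3, 2)        # 1-D pooling: one descriptor per column
        f = self.act(self.bn1(self.conv1(torch.cat([z_h, z_w], dim=2))))   # shared 1x1 transform
        f_h, f_w = torch.split(f, [h, w], dim=2)        # split back into the two directions
        g_h = torch.sigmoid(self.conv_h(f_h))                          # B x C x H x 1
        g_w = torch.sigmoid(self.conv_w(f_w.permute(0, 1, 3, 2)))      # B x C x 1 x W
        return x * g_h * g_w                            # direction-aware, position-sensitive output

# Example: the highest-resolution branch X_{1,1} of HRNetW48-s (48 x 64 x 48).
att = CoordinateAttention(48)
y = att(torch.randn(2, 48, 64, 48))                     # same shape as the input
```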

3.3. MST Module

Next, the outputs $X_{2,i}$ of the ATTM module are fed into the MST module, whose structure is illustrated in Figure 3. Since the Transformer processes sequential data, $X_{2,i}$ must be mapped into one-dimensional sequences. Because visual features encompass both texture and spatial information, we follow the method in CSIT [39] and feed $X_{2,i}$ into a channel encoder and a spatial encoder, respectively, to generate channel tokens and spatial tokens so that the network fully captures the visual information. At this point, their shapes are $X_{c,i} \in \mathbb{R}^{C_i \times L_c}$ and $X_{s,i} \in \mathbb{R}^{C_i \times L_s}$, where $i \in \{1, 2, 3\}$. Meanwhile, the network initializes learnable keypoint vectors as keypoint tokens with shape $X_k \in \mathbb{R}^{M \times L}$, where M represents the number of keypoints labeled for each human instance, and L denotes the length of each sequence, set to 192 in this paper.
For the channel tokens and spatial tokens, the network fixes the length of each token to L = 192 through a linear mapping, so the newly generated tokens have shapes $X_{c,i} \in \mathbb{R}^{C_i \times L}$ and $X_{s,i} \in \mathbb{R}^{C_i \times L}$, where $i \in \{1, 2, 3\}$. This significantly reduces the computational complexity of the Transformer, $O(C \times L^2)$, where C denotes the number of tokens and L their length, while preserving fine-grained information in each sequence. Since the self-attention mechanism lacks positional awareness, position encoding is then applied to the channel tokens and spatial tokens: each position j is assigned an encoding $pe_j$ [9], and every input sequence in {channel tokens, spatial tokens} is encoded element-wise as $X_{c,i,j} + pe_j$ and $X_{s,i,j} + pe_j$ for $j = 0, \ldots, L$ and $i \in \{1, 2, 3\}$. After encoding, the three types of tokens are concatenated and jointly fed into the Transformer for feature learning. The concatenated input sequence is denoted as $X_T \in \mathbb{R}^{C_T \times L}$, where $C_T = M + C_i + C_i$, $i \in \{1, 2, 3\}$.
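A simplified sketch of one MST branch is given below. The channel and spatial encoders of CSIT [39] are reduced here to plain flatten-plus-linear projections (so the spatial stream produces one token per location rather than the $C_i$ tokens reported above), and the encoder depth and head count are assumed values; the sketch only illustrates the token construction, positional encoding, and keypoint-token readout.

```python
import torch
import torch.nn as nn

class MSTBranch(nn.Module):
    """Simplified sketch of one MST branch: token construction + Transformer encoder."""
    def __init__(self, channels: int, spatial: int, num_keypoints: int = 17,
                 token_len: int = 192, depth: int = 4, heads: int = 8):
        super().__init__()
        self.ch_proj = nn.Linear(spatial, token_len)    # channel tokens: one per channel
        self.sp_proj = nn.Linear(channels, token_len)   # spatial tokens: one per location (simplified)
        self.kpt_tokens = nn.Parameter(torch.randn(1, num_keypoints, token_len))  # random init
        self.pos_embed = nn.Parameter(torch.zeros(1, channels + spatial, token_len))
        layer = nn.TransformerEncoderLayer(d_model=token_len, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.num_keypoints = num_keypoints

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b = x.size(0)
        ch_tok = self.ch_proj(x.flatten(2))                      # (B, C, L)
        sp_tok = self.sp_proj(x.flatten(2).transpose(1, 2))      # (B, H*W, L)
        feat = torch.cat([ch_tok, sp_tok], dim=1) + self.pos_embed   # position encoding
        kpt = self.kpt_tokens.expand(b, -1, -1)                  # (B, M, L) keypoint tokens
        out = self.encoder(torch.cat([kpt, feat], dim=1))        # joint feature learning
        return out[:, :self.num_keypoints]                       # keypoint-token outputs (B, M, L)

# Example on the 1/16-resolution branch of HRNetW48-s (192 channels, 16 x 12 map).
branch = MSTBranch(channels=192, spatial=16 * 12)
kpt_feats = branch(torch.randn(2, 192, 16, 12))                  # (2, 17, 192)
```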
In Transformer, the first step is to perform linear projections on the input sequences to generate Q (query), K (key), and V (value). The specific formula is shown as follows:
$Q = X_T \times W_Q, \quad K = X_T \times W_K, \quad V = X_T \times W_V$
where W Q , W K , and W V represent the corresponding weight matrices. Then, the spatial dependencies of the features are captured using the multi-head self-attention mechanism, with the following formula:
$MHSA(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{L}}\right) V$
where L represents the dimension of the keys. Each query is matched against all keys, and softmax is then employed to compute the attention scores, with each score determining how strongly the current query token attends to every other token.
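Written out explicitly, the scaled dot-product attention applied by each head corresponds to the following generic sketch (the standard formulation from [9], shown single-head for clarity; it is not code specific to MSTPose):

```python
import math
import torch

def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """softmax(Q K^T / sqrt(L)) V, with L the key dimension."""
    L = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(L)   # every query paired with all keys
    weights = torch.softmax(scores, dim=-1)           # attention score per query token
    return weights @ v

# Toy example: C_T tokens of length L = 192, single head.
x = torch.randn(2, 401, 192)                                # (batch, tokens, L)
w_q, w_k, w_v = (torch.randn(192, 192) for _ in range(3))   # weight matrices W_Q, W_K, W_V
q, k, v = x @ w_q, x @ w_k, x @ w_v
out = attention(q, k, v)                                    # (2, 401, 192)
```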

3.4. VeR Module

As shown in Figure 4, the VeR module directly processes the one-dimensional sequences output by the Transformer to preserve their fine-grained information to the maximum extent. We first sum the outputs of the three branches along the channel dimension to generate $X_V \in \mathbb{R}^{M \times L}$, where M is the number of sequences. They are then fed separately into the X and Y vector classifiers, generating $X_x \in \mathbb{R}^{M \times (W \cdot k)}$ and $X_y \in \mathbb{R}^{M \times (H \cdot k)}$. Owing to the scale factor k, the generated $X_x$ and $X_y$ vectors are longer than the original image's width and height, respectively, thereby achieving sub-pixel localization. Subsequently, $X_x$ and $X_y$ are decoded to obtain the predicted coordinates, as shown in the following formulas:
$O_{x_i} = \arg\max\,[x_0, x_1, \ldots, x_{W \cdot k - 1}] \in \mathbb{R}^{W \cdot k}, \quad x_i = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(i - x)^2}{2\sigma^2}\right)$
$O_{y_j} = \arg\max\,[y_0, y_1, \ldots, y_{H \cdot k - 1}] \in \mathbb{R}^{H \cdot k}, \quad y_j = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(j - y)^2}{2\sigma^2}\right)$
where $\sigma$ represents the standard deviation, $(x, y)$ refers to the ground-truth pixel position on the $(X_x, X_y)$ vectors, $(x_i, y_j)$ denotes the supervision signal generated through the one-dimensional Gaussian distribution, and $(O_{x_i}, O_{y_j})$ represents the predicted coordinates of the keypoint.
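A minimal sketch of the VeR head and its decoding step is shown below; the scale factor k = 2 and the use of a single linear layer per axis are assumptions following SimCC [13], and training would supervise the x/y vectors against the one-dimensional Gaussian targets defined above.

```python
import torch
import torch.nn as nn

class VeRHead(nn.Module):
    """1-D coordinate classification: one x-vector and one y-vector per keypoint."""
    def __init__(self, token_len: int = 192, img_w: int = 192, img_h: int = 256, k: int = 2):
        super().__init__()
        self.x_cls = nn.Linear(token_len, img_w * k)   # X vector classifier, W*k bins
        self.y_cls = nn.Linear(token_len, img_h * k)   # Y vector classifier, H*k bins
        self.k = k

    def forward(self, kpt_feats: torch.Tensor):
        # kpt_feats: (B, M, L), the summed keypoint outputs of the three branches (X_V).
        x_vec = self.x_cls(kpt_feats)                  # (B, M, W*k)
        y_vec = self.y_cls(kpt_feats)                  # (B, M, H*k)
        # Decoding: argmax over each vector, divided by k to map back to pixel coordinates.
        x_coord = x_vec.argmax(dim=-1).float() / self.k
        y_coord = y_vec.argmax(dim=-1).float() / self.k
        return x_vec, y_vec, torch.stack([x_coord, y_coord], dim=-1)

head = VeRHead()
x_vec, y_vec, coords = head(torch.randn(2, 17, 192))   # coords: (2, 17, 2)
```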

4. Experiments

4.1. Experimental Details

4.1.1. Datasets and Evaluation Indicators

Our experiments utilize the widely adopted benchmark datasets COCO [40] and MPII [41]. In order to verify the performance of MSTPose, we extensively train and validate the model on these two datasets. The following sections provide a detailed introduction to each dataset.
COCO dataset: COCO is a large-scale and versatile dataset widely used in the field of computer vision, which is proposed by Microsoft. It consists of 200k images and 250k annotated human instances, with each human instance labeled with 17 keypoints. In this paper, we train our model on the COCO train2017 set and perform validation and ablation experiments on the COCO Validation set, which contains 5k images. We also test our model on the COCO test-dev set, consisting of 20k images, and compare its performance with state-of-the-art models. The evaluation metrics used in the COCO dataset are average precision (AP) and average recall (AR), which are derived based on the object keypoint similarity (OKS). The formula for OKS is as follows:
$OKS = \frac{\sum_i \exp\left(-\frac{d_i^2}{2 s^2 k_i^2}\right) \delta(v_i > 0)}{\sum_i \delta(v_i > 0)}$
where $d_i^2$ represents the squared Euclidean distance between the ground truth and the prediction for the i-th keypoint, $s^2$ represents the area occupied by the human instance, $k_i$ is a per-keypoint constant that controls the attenuation, $v_i$ indicates the visibility of the keypoint, and $\delta$ is an indicator function. The formula for AP is shown below:
$AP_t = \frac{\sum_p \delta(OKS_p > t)}{\sum_p 1}$
When t is set to 0.5 and 0.75, the metrics are denoted as $AP_{50}$ and $AP_{75}$, respectively. When $32^2 < s < 96^2$ and when $s > 96^2$, they are denoted as $AP_M$ and $AP_L$; the same applies to AR.
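For illustration, the OKS between a predicted and a ground-truth pose can be computed as in the following sketch (a generic re-implementation of the formula above, not the official COCO evaluation code; the per-keypoint constants $k_i$ are passed in rather than hard-coded, and the values used in the example are arbitrary):

```python
import numpy as np

def compute_oks(pred, gt, vis, area, k):
    """OKS for one pose. pred, gt: (M, 2) coordinates; vis: (M,) visibility flags;
    area: instance area (s^2 in the text); k: (M,) per-keypoint constants."""
    d2 = np.sum((pred - gt) ** 2, axis=-1)            # squared Euclidean distances d_i^2
    e = np.exp(-d2 / (2.0 * area * k ** 2))           # per-keypoint similarity
    mask = vis > 0                                     # only labeled keypoints contribute
    return e[mask].sum() / max(mask.sum(), 1)

# Toy example: 17 keypoints, an assumed constant k_i = 0.1 for every joint.
M = 17
gt = np.random.rand(M, 2) * 100
pred = gt + np.random.randn(M, 2)
oks = compute_oks(pred, gt, vis=np.ones(M), area=96.0 ** 2, k=np.full(M, 0.1))
```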
MPII dataset: The MPII dataset is one of the most commonly used benchmarks in the field of human pose estimation. It consists of a total of 25k images and 40k human instances, with each instance annotated with 16 keypoints. The evaluation metric used in the MPII dataset is PCK (percentage of correct keypoints), which is calculated using the following formula:
$PCK_{\sigma}^{p}(d_0) = \frac{1}{|\tau|} \sum_{f \in \tau} \delta\left(\frac{\|x_p^f - y_p^f\|_2}{d_0} < \sigma\right)$
where $d_0$ denotes the normalization reference provided by the human detector, and $\sigma$ denotes a threshold that indicates the degree of match between the ground truth and predicted values.

4.1.2. Implementation Details

This paper follows a top-down paradigm: single-person instances are first detected in multi-person images using a human body detector [42], followed by single-person keypoint detection. During training, the total number of epochs is set to 210, with an initial learning rate of $1 \times 10^{-3}$, which is reduced to $1 \times 10^{-4}$ at the 90th epoch and further to $1 \times 10^{-5}$ at the 120th epoch. The experiments are conducted on a server equipped with four NVIDIA GeForce RTX 3090 24 GB GPUs. The deep learning framework is PyTorch 1.13.1 with CUDA 11.6 and cuDNN 8.3.2 on Ubuntu 20.04.5 LTS.
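A sketch of the corresponding optimizer and step schedule is shown below; the choice of the Adam optimizer and of MultiStepLR is an assumption, while the milestones and learning rates follow the text.

```python
import torch
from torch.optim.lr_scheduler import MultiStepLR

model = torch.nn.Linear(10, 10)            # placeholder for the MSTPose network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# 210 epochs in total; the rate drops to 1e-4 at epoch 90 and to 1e-5 at epoch 120.
scheduler = MultiStepLR(optimizer, milestones=[90, 120], gamma=0.1)

for epoch in range(210):
    # ... one training epoch over the COCO / MPII crops would run here ...
    scheduler.step()
```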

4.2. Experimental Results

4.2.1. Quantitative Experimental Results

The test results of MSTPose on the COCO Validation dataset are shown in Table 1. It can be observed that MSTPose achieves the best results on most metrics. Compared to the SOTA model ViTPose [44], our method uses 275.8 million fewer parameters with only a small sacrifice in accuracy (1.1% AP). Compared with TFPose [11], which requires 39.7% more GFLOPs, MSTPose improves AP by 4.8%. Furthermore, when compared to PRTR [37], MSTPose achieves a 3.9% increase in AP while using only 38.6% of PRTR's GFLOPs. Additionally, compared to MSRT [38], a network of the same type, AP increases by 5%, a significant improvement. The strong performance of MSTPose on the COCO Validation dataset confirms its feasibility.
The test results of MSTPose on COCO test-dev are shown in Table 2. MSTPose outperforms pure CNN-based networks. Furthermore, compared to human pose estimation models that utilize Transformer, MSTPose remains highly competitive. Compared with TFPose [11], MSTPose reduces GFLOPs by 5.8 while achieving a 2.5% improvement in AP. When compared to PRTR [37], MSTPose achieves a 2.6% increase in AP with 23.2 fewer GFLOPs. These results highlight the advantages of MSTPose in both speed and accuracy, whether compared to pure CNN-based networks or Transformer-based human pose estimation networks.
The test results of MSTPose on the MPII dataset are shown in Table 3. Except for Hip, MSTPose achieves the best performance on all metrics. Specifically, the mean score reaches 90.2%, an improvement of more than 2 percentage points over PRTR-R101 [37], which represents a significant enhancement on the MPII dataset.

4.2.2. Qualitative Experimental Results

From Figure 5, it can be observed that MSTPose accurately predicts human keypoints in various occlusion scenarios, including self-occlusion and mutual occlusion. Furthermore, in densely populated scenes with indistinct human features, MSTPose performs effectively by extracting the rich information embedded in the images, enabling accurate identification of each individual instance and its skeletal structure. In the top-left image, the person sitting on the far left is severely occluded by others, yet MSTPose is still able to reconstruct the overall target from the available local features.

4.3. Ablation Experiments

To enable the network to learn rich visual information, this paper proposes MSTPose, which is built from ATTM, the MST module, and the VeR module. To substantiate the contribution of each component, extensive ablation experiments are conducted in this study. In addition, since mainstream models operate only on the highest-resolution branch, we also validate the effectiveness of each module on each branch.

4.3.1. Ablation Experiment of ATTM

The mainstream pose estimation networks based on HRNet utilize the output of the highest-resolution branch as the final output of the entire backbone network, disregarding the other branches. However, MSTPose considers each branch by applying coordinate attention to each of them and then feeding them into the parallel Transformer module. To verify the effect of coordinate attention in each branch, ablation experiments are conducted, as shown in Table 4. Due to the presence of multiple network branches, there are a total of eight possible combinations. Among them, we select five representative combinations for the ablation experiments. While controlling other variables, all methods employ HRNetW48-s as the backbone network and utilize the MST module and VeR module.
It can be observed that the best performance is achieved when the coordinate attention mechanism is applied to the highest-resolution branch, contributing an improvement of 0.3% AP for the network. Branch2 contributes 0.3% AP, and Branch3 contributes 0.2% AP to the network. When all three branches adopt the coordinate attention mechanism, the entire network achieves a 0.5% AP improvement. Therefore, it can be concluded that the coordinate attention mechanism works better for branches with higher resolutions, which explains why previous works often focused on using only the highest-resolution branch. Although branches with lower resolutions make a smaller contribution, the overall performance improvement is significant when the coordinate attention mechanism is applied to the entire network, thus affirming the effectiveness of coordinate attention for each branch.

4.3.2. Ablation Experiment of MST Module

We follow the approach outlined in Section 4.3.1 to conduct ablation experiments on the parallel structure of Transformer modules. The results are shown in Table 5. It can be observed that when only Branch1 is trained with the Transformer, it contributes 0.6% AP to the network, Branch2 contributes 0.4% AP, and Branch3 contributes 0.4% AP. This further confirms previous findings that favored using the highest resolution as the network's output. Although the branch with the highest resolution contributes significantly to the network, when combined with the low-resolution branches, the network achieves a 0.9% AP improvement, demonstrating a remarkable enhancement. Hence, this validates the effectiveness of the parallel branch Transformer.

4.3.3. Ablation Experiment of VeR Module

In this study, the MSTPose adopts the one-dimensional vector regression approach to predict keypoints, abandoning the conventional heatmap method. To demonstrate the superiority of the one-dimensional vector regression method and affirm the suitability of the VeR approach for human pose estimation based on Transformer networks, the comparative experiment is conducted between the two methods, as shown in Table 6.
When using the heatmap method, the AP is 75.1%. However, when employing the VeR method, the AP increases to 77.2%, a noticeable improvement of 2.1 percentage points. This clear enhancement supports our earlier claim that representing keypoints with one-dimensional vectors, based on the one-dimensional sequences output by the Transformer, is a better choice than the heatmap method.

4.3.4. Ablation Experiment of MSTPose

The preceding sections involve ablation experiments conducted on each individual module. Subsequently, the focus shifts to the overall performance, and the results of ablation experiments on the entire MSTPose are presented in Table 7. Comparing method1 with method2, we observe that without the MST module, ATTM contributes an AP gain of 0.4%. Comparing method3 with method4, when the MST module is utilized, ATTM contributes an AP gain of 0.5%. Comparing method1 with method3, in the absence of ATTM, the MST module contributes an AP gain of 0.8%. Comparing method2 with method4, when ATTM is employed, the MST module contributes an AP gain of 0.9%.
From these findings, it is evident that ATTM, the MST module, and the VeR module significantly contribute to the network. The coordination among different components facilitates the enhancement of the network’s ability to extract complex features, thereby further improving its overall performance.

5. Conclusions

In this paper, we propose a human pose estimation network based on a multi-scale parallel structure. Compared with the SOTA method ViTPose [44], our method significantly reduces computational complexity and parameter size with slightly lower accuracy. We apply coordinate attention operations to three branches of the backbone network’s output. Subsequently, these branches are fed into Transformer modules. Finally, we discard the conventional heatmap-based approach and instead adopt coordinate vector regression to predict the final keypoints. Remarkably, our method achieves satisfactory results. We conduct extensive tests on mainstream datasets, validating the outstanding performance of MSTPose. Additionally, we perform numerous ablation experiments to verify the effectiveness of each module.

Author Contributions

Conceptualization, C.W., X.W. and S.L.; Data curation, X.W.; Formal analysis, X.W.; Funding acquisition, C.W.; Investigation, A.Z.; Methodology, X.W.; Project administration, C.W.; Software, S.L.; Supervision, A.Z.; Validation, C.W.; Visualization, S.L.; Writing—original draft, S.L.; Writing—review and editing, C.W. and A.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the First Batch of “Pioneer” and “Leading Goose” R&D Programs of Zhejiang Province in 2023 under grant 2023C01041.

Data Availability Statement

Publicly archived datasets used in the study are listed below. COCO: http://cocodataset.org (accessed on 15 November 2022); MPII: http://human-pose.mpi-inf.mpg.de (accessed on 10 March 2023).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Meng, Z.; Zhang, M.; Guo, C.; Fan, Q.; Zhang, H.; Gao, N.; Zhang, Z. Recent Progress in Sensing and Computing Techniques for Human Activity Recognition and Motion Analysis. Electronics 2020, 9, 1357. [Google Scholar] [CrossRef]
  2. Agostinelli, T.; Generosi, A.; Ceccacci, S.; Khamaisi, R.K.; Peruzzini, M.; Mengoni, M. Preliminary Validation of a Low-Cost Motion Analysis System Based on RGB Cameras to Support the Evaluation of Postural Risk Assessment. Appl. Sci. 2021, 11, 10645. [Google Scholar] [CrossRef]
  3. Maskeliūnas, R.; Damaševičius, R.; Blažauskas, T.; Canbulut, C.; Adomavičienė, A.; Griškevičius, J. BiomacVR: A Virtual Reality-Based System for Precise Human Posture and Motion Analysis in Rehabilitation Exercises Using Depth Sensors. Electronics 2023, 12, 339. [Google Scholar] [CrossRef]
  4. Liu, H.; Liu, T.; Zhang, Z.; Sangaiah, A.K.; Yang, B.; Li, Y. ARHPE: Asymmetric relation-aware representation learning for head pose estimation in industrial human–computer interaction. IEEE Trans. Ind. Inform. 2022, 18, 7107–7117. [Google Scholar] [CrossRef]
  5. Liu, H.; Li, D.; Wang, X.; Liu, L.; Zhang, Z.; Subramanian, S. Precise head pose estimation on HPD5A database for attention recognition based on convolutional neural network in human-computer interaction. Infrared Phys. Technol. 2021, 116, 103740. [Google Scholar] [CrossRef]
  6. Wang, K.; Zhao, R.; Ji, Q. Human computer interaction with head pose, eye gaze and body gestures. In Proceedings of the 13th IEEE International Conference on Automatic Face & Gesture Recognition, FG, Xi’an, China, 15–19 May 2018; p. 789. [Google Scholar]
  7. Newell, A.; Yang, K.; Deng, J. Stacked hourglass networks for human pose estimation. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Part VIII 14. pp. 483–499. [Google Scholar]
  8. Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 5693–5703. [Google Scholar]
  9. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4 December 2017; pp. 6000–6010. [Google Scholar]
  10. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16×16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  11. Mao, W.; Ge, Y.; Shen, C.; Tian, Z.; Wang, X.; Wang, Z. Tfpose: Direct human pose estimation with transformers. arXiv 2021, arXiv:2103.15320. [Google Scholar]
  12. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Online, 19–25 June 2021; pp. 13713–13722. [Google Scholar]
  13. Li, Y.; Yang, S.; Liu, P.; Zhang, S.; Wang, Y.; Wang, Z.; Yang, W.; Xia, S.T. SimCC: A Simple Coordinate Classification Perspective for Human Pose Estimation. In Proceedings of the Computer Vision-ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Part VI. pp. 89–106. [Google Scholar]
  14. Toshev, A.; Szegedy, C. Deeppose: Human pose estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1653–1660. [Google Scholar]
  15. Xiao, B.; Wu, H.; Wei, Y. Simple baselines for human pose estimation and tracking. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 466–481. [Google Scholar]
  16. Ramakrishna, V.; Munoz, D.; Hebert, M.; Andrew Bagnell, J.; Sheikh, Y. Pose machines: Articulated pose estimation via inference machines. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Part II 13. pp. 33–47. [Google Scholar]
  17. Tompson, J.J.; Jain, A.; LeCun, Y.; Bregler, C. Joint training of a convolutional network and a graphical model for human pose estimation. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4 December 2017. [Google Scholar]
  18. Wei, S.E.; Ramakrishna, V.; Kanade, T.; Sheikh, Y. Convolutional pose machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 4724–4732. [Google Scholar]
  19. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  20. Papandreou, G.; Zhu, T.; Chen, L.C.; Gidaris, S.; Tompson, J.; Murphy, K. Personlab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 269–286. [Google Scholar]
  21. Pfister, T.; Charles, J.; Zisserman, A. Flowing convnets for human pose estimation in videos. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1913–1921. [Google Scholar]
  22. Yang, W.; Li, S.; Ouyang, W.; Li, H.; Wang, X. Learning feature pyramids for human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1281–1290. [Google Scholar]
  23. Chen, Y.; Wang, Z.; Peng, Y.; Zhang, Z.; Yu, G.; Sun, J. Cascaded pyramid network for multi-person pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7103–7112. [Google Scholar]
  24. Chu, X.; Yang, W.; Ouyang, W.; Ma, C.; Yuille, A.L.; Wang, X. Multi-context attention for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1831–1840. [Google Scholar]
  25. Cheng, B.; Xiao, B.; Wang, J.; Shi, H.; Huang, T.S.; Zhang, L. Higherhrnet: Scale-aware representation learning for bottom-up human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 5386–5395. [Google Scholar]
  26. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  27. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  28. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training. 2018. Available online: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf (accessed on 14 October 2022).
  29. Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv 2019, arXiv:1910.13461. [Google Scholar]
  30. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 5485–5551. [Google Scholar]
  31. Zhang, S.; Roller, S.; Goyal, N.; Artetxe, M.; Chen, M.; Chen, S.; Dewan, C.; Diab, M.; Li, X.; Lin, X.V.; et al. Opt: Open pre-trained transformer language models. arXiv 2022, arXiv:2205.01068. [Google Scholar]
  32. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Part I 16. pp. 213–229. [Google Scholar]
  33. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
  34. Wang, W.; Song, H.; Zhao, S.; Shen, J.; Zhao, S.; Hoi, S.C.; Ling, H. Learning unsupervised video object segmentation through visual attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 3064–3074. [Google Scholar]
  35. Zhou, T.; Li, J.; Wang, S.; Tao, R.; Shen, J. Matnet: Motion-attentive transition network for zero-shot video object segmentation. IEEE Trans. Image Process. 2020, 29, 8326–8338. [Google Scholar] [CrossRef] [PubMed]
  36. Yang, S.; Quan, Z.; Nie, M.; Yang, W. Transpose: Keypoint localization via transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 11802–11812. [Google Scholar]
  37. Li, K.; Wang, S.; Zhang, X.; Xu, Y.; Xu, W.; Tu, Z. Pose recognition with cascade transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Online, 19–25 June 2021; pp. 1944–1953. [Google Scholar]
  38. Shan, B.; Shi, Q.; Yang, F. MSRT: Multi-scale representation transformer for regression-based human pose estimation. Pattern Anal. Appl. 2023, 26, 591–603. [Google Scholar] [CrossRef]
  39. Li, S.; Zhang, H.; Ma, H.; Feng, J.; Jiang, M. CSIT: Channel Spatial Integrated Transformer for human pose estimation. IET Image Process. 2023. [Google Scholar] [CrossRef]
  40. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
  41. Andriluka, M.; Pishchulin, L.; Gehler, P.; Schiele, B. 2d human pose estimation: New benchmark and state of the art analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 3686–3693. [Google Scholar]
  42. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as points. arXiv 2019, arXiv:1904.07850. [Google Scholar]
  43. Li, Y.; Zhang, S.; Wang, Z.; Yang, S.; Yang, W.; Xia, S.T.; Zhou, E. Tokenpose: Learning keypoint tokens for human pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 11313–11322. [Google Scholar]
  44. Xu, Y.; Zhang, J.; Zhang, Q.; Tao, D. Vitpose: Simple vision transformer baselines for human pose estimation. Adv. Neural Inf. Process. Syst. 2022, 35, 38571–38584. [Google Scholar]
  45. Tian, Z.; Chen, H.; Shen, C. Directpose: Direct end-to-end multi-person pose estimation. arXiv 2019, arXiv:1911.07451. [Google Scholar]
  46. Wei, F.; Sun, X.; Li, H.; Wang, J.; Lin, S. Point-set anchors for object detection, instance segmentation and pose estimation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Part X 16. pp. 527–544. [Google Scholar]
  47. Sun, X.; Xiao, B.; Wei, F.; Liang, S.; Wei, Y. Integral human pose regression. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 529–545. [Google Scholar]
Figure 1. The structure diagram of MSTPose.
Figure 2. The module schematic diagram of the coordinate attention mechanism.
Figure 3. Schematic diagram of an individual branch in the MST module.
Figure 4. Schematic diagram of the VeR module.
Figure 5. Visualization of the testing results of MSTPose on COCO Validation dataset.
Table 1. The testing results of MSTPose on COCO Validation dataset. The best result is bolded.

Method | Backbone | GFLOPs | #Params | Input Size | AP | AP50 | AP75 | APM | APL
SimCC [13] | ResNet50 | 3.8 | 25.7M | 256 × 192 | 70.8 | - | - | - | -
Simple Baseline [15] | ResNet50 | 8.9 | 34.0M | 256 × 192 | 70.4 | 88.6 | 78.3 | 67.1 | 77.2
Simple Baseline [15] | ResNet101 | 12.4 | 53.0M | 256 × 192 | 71.4 | 89.3 | 79.3 | 68.1 | 78.1
Simple Baseline [15] | ResNet152 | 15.7 | 68.6M | 256 × 192 | 72.0 | 89.3 | 79.8 | 68.7 | 78.9
TFPose [11] | ResNet50 | 20.4 | - | 384 × 288 | 72.4 | - | - | - | -
PRTR [37] | ResNet101 | 33.4 | 60.4M | 512 × 348 | 72.0 | 89.3 | 79.4 | 67.3 | 79.7
PRTR [37] | HRNetW32 | 21.6 | 57.2M | 384 × 288 | 73.1 | 89.3 | 79.4 | 67.3 | 79.8
PRTR [37] | HRNetW32 | 37.8 | 57.2M | 512 × 348 | 73.3 | 89.2 | 79.9 | 69.0 | 80.9
MSRT [38] | ResNet101 | - | - | 512 × 348 | 72.2 | 89.1 | 79.2 | 68.1 | 79.4
TransPose [36] | HRNetW32-s | - | 8.0M | 256 × 192 | 74.2 | - | - | - | -
TokenPose [43] | HRNetW32-s | 5.7 | 13.5M | 256 × 192 | 74.7 | 89.8 | 81.4 | 71.3 | 81.4
ViTPose [44] | ViT-L | - | 307M | 256 × 192 | 78.3 | - | - | - | -
MSTPose | HRNetW48-s | 14.6 | 31.2M | 256 × 192 | 77.2 | 92.9 | 84.1 | 73.9 | 81.7
Table 2. The testing results of MSTPose on COCO test-dev dataset, where T is Transformer. The best result is bolded.

Method | Backbone | GFLOPs | #Params | Input Size | AP | AP50 | AP75 | APM | APL | AR
DeepPose [14] | ResNet101 | 7.7 | - | 256 × 192 | 57.4 | 86.5 | 64.2 | 55.0 | 62.8 | -
DeepPose [14] | ResNet152 | 11.3 | - | 256 × 192 | 59.3 | 87.6 | 66.7 | 56.8 | 64.9 | -
CenterNet [42] | Hourglass | - | - | - | 63.0 | 86.8 | 69.6 | 58.9 | 70.4 | -
DirectPose [45] | ResNet50 | - | - | - | 62.2 | 86.4 | 68.2 | 56.7 | 69.8 | -
PointSetNet [46] | HRNetW48 | - | - | - | 68.7 | 89.9 | 76.3 | 64.8 | 75.3 | -
Integral Pose [47] | ResNet101 | 11.0 | - | 256 × 256 | 67.8 | 88.2 | 74.8 | 63.9 | 74.0 | -
TFPose [11] | ResNet50+T | 20.4 | - | 384 × 288 | 72.2 | 90.9 | 80.1 | 69.1 | 78.8 | 74.1
PRTR [37] | HRNetW48+T | - | - | - | 64.9 | 87.0 | 71.7 | 60.2 | 72.5 | 78.8
PRTR [37] | HRNetW32+T | 21.6 | 57.2M | 384 × 288 | 71.7 | 90.6 | 79.6 | 67.6 | 78.4 | 79.4
PRTR [37] | HRNetW32+T | 37.8 | 57.2M | 512 × 384 | 72.1 | 90.4 | 79.6 | 68.1 | 79.0 | -
SimCC [13] | ResNet50 | 20.2 | 25.7M | 384 × 288 | 72.7 | 91.2 | 80.1 | 69.2 | 79.0 | 78.0
TransPose [36] | HRNetW32+T | - | 8.0M | 256 × 192 | 73.4 | 91.6 | 81.1 | 70.1 | 79.3 | -
TokenPose [43] | HRNetW32+T | 5.7 | 13.5M | 256 × 192 | 74.0 | 91.9 | 81.5 | 70.6 | 79.8 | 79.1
MSTPose | HRNetW48+T | 14.6 | 31.2M | 256 × 192 | 74.7 | 91.9 | 81.7 | 71.4 | 80.1 | 79.8
Table 3. The testing results of MSTPose on MPII dataset. The best result is bolded.

Method | Backbone | Hea | Sho | Elb | Wri | Hip | Kne | Ank | Mean
Simple Baseline [15] | ResNet50 | 96.4 | 95.3 | 89.0 | 83.2 | 88.4 | 84.0 | 79.6 | 88.5
Simple Baseline [15] | ResNet101 | 96.9 | 95.9 | 89.5 | 84.4 | 88.4 | 84.5 | 80.7 | 89.1
Simple Baseline [15] | ResNet152 | 97.0 | 95.9 | 90.0 | 85.0 | 89.2 | 85.3 | 81.3 | 89.6
HRNet [8] | HRNetW32 | 96.9 | 96.0 | 90.6 | 85.8 | 88.7 | 86.6 | 82.6 | 90.1
MSRT [38] | ResNet101 | 97.0 | 94.9 | 89.0 | 84.0 | 89.6 | 85.7 | 80.3 | 89.1
PRTR-R101 [37] | ResNet101 | 96.3 | 95.0 | 88.3 | 82.4 | 88.1 | 83.6 | 77.4 | 87.9
PRTR-R152 [37] | ResNet152 | 96.4 | 94.9 | 88.4 | 82.6 | 88.6 | 84.1 | 78.4 | 88.2
MSTPose | HRNetW48 | 97.1 | 96.0 | 90.8 | 86.8 | 89.5 | 86.8 | 82.8 | 90.2
Table 4. The ablation experiment of the coordinate attention mechanism, where CA is the coordinate attention.

Method | Branch1 | Branch2 | Branch3 | AP
CA | | | | 76.7
CA | ✓ | | | 77.0
CA | | ✓ | | 77.0
CA | | | ✓ | 76.9
CA | ✓ | ✓ | ✓ | 77.2
Table 5. The ablation experiment of the MST module.

Method | Branch1 | Branch2 | Branch3 | AP
Transformer | | | | 76.3
Transformer | ✓ | | | 76.9
Transformer | | ✓ | | 76.7
Transformer | | | ✓ | 76.7
Transformer | ✓ | ✓ | ✓ | 77.2
Table 6. The ablation experiment of the VeR module.

Method | VeR | Heatmap | AP
method1 | ✓ | | 77.2
method2 | | ✓ | 75.1
Table 7. The ablation experiment of MSTPose.

Method | ATTM | MST Module | VeR | AP
method1 | | | ✓ | 75.9
method2 | ✓ | | ✓ | 76.3
method3 | | ✓ | ✓ | 76.7
method4 | ✓ | ✓ | ✓ | 77.2
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
