Article

AiPE: A Novel Transformer-Based Pose Estimation Method

Department of Computer Science and Engineering, College of Engineering, Konkuk University, Seoul 05029, Republic of Korea
* Author to whom correspondence should be addressed.
Electronics 2024, 13(5), 967; https://doi.org/10.3390/electronics13050967
Submission received: 26 December 2023 / Revised: 8 February 2024 / Accepted: 27 February 2024 / Published: 2 March 2024
(This article belongs to the Special Issue Modern Computer Vision and Image Analysis)

Abstract

Human pose estimation is an important problem in computer vision because it is the foundation for many advanced semantic tasks and downstream applications. Although some convolutional neural network-based pose estimation methods have achieved good results, these networks are still limited by restricted receptive fields and weak robustness, leading to poor detection performance in scenarios with blur or low resolution. Additionally, their highly parallelized strategy is likely to incur significant computational demands, requiring high computing power. Compared with convolutional neural networks, transformer-based methods offer advantages such as flexible stacking, a global perspective, and parallel computation. Building on these benefits, a novel transformer-based human pose estimation method is developed, which employs multi-head self-attention mechanisms and shifted windows to effectively suppress the rapid growth of computational complexity near human keypoints. Experimental results from detailed visual comparison and quantitative analysis demonstrate that the proposed method can efficiently handle pose estimation in challenging scenarios, such as blurry or occluded scenes. Furthermore, errors in human skeleton mapping caused by keypoint occlusion or omission can be effectively corrected, so the accuracy of the pose estimation results is greatly improved.

1. Introduction

Human pose estimation has diverse applications, including systems designed to provide assistance in the daily lives of elderly individuals living alone. Such systems can assist caregivers in promptly detecting whether a senior citizen living alone has experienced an emergency or any other situation that requires attention [1,2,3]. In water parks, a drowning detection system utilizing human pose estimation can aid lifeguards in timely identification of individuals in distress [4,5,6,7]. Furthermore, in healthcare and fitness applications, it can score and evaluate users’ movements to determine if they align with established standards [8,9].
Currently, pose estimation is primarily conducted using convolutional neural networks, with the dominant approach being the bottom-up method [10,11,12,13]. A representative of the bottom-up approach is the OpenPose algorithm proposed by Cao et al. [1]. In this method, all limbs in the input image are first detected, and then they are assembled to obtain the human skeleton. While this approach exhibits strong adaptability to complex scenes, it suffers from low accuracy and slow estimation speed, making real-time human pose estimation impractical.
Transformers have achieved significant success in the field of computer vision, and scholars have applied them to specific tasks such as human pose estimation. Li et al. [14], drawing inspiration from the architecture of the Vision Transformer (ViT) [15], introduced a transformer-based 2D human pose estimation model that incorporates learnable keypoint tokens. Zheng et al. [8] proposed the PE-former model, a transformer-based 3D human pose estimation network that employs a 2D-to-3D lifting approach. Fang et al. [16] presented a transformer-based top-down approach called AlphaPose. In this approach, the positions of human bodies in the input image are first detected and then framed using anchor boxes. These methods exhibit fast speeds and support multi-person pose estimation, but they have limited adaptability to complex scenes and impose high hardware requirements.
To address the challenges of mapping errors in the human skeleton caused by keypoint occlusion or omission, as well as the computational burden in the pose estimation process, we adopt the transformer as the foundational framework. Leveraging the widely used and mature OpenPose method as the basis of the branch network, we propose a novel algorithm. To reduce redundancy in the computations of the transformer's multi-head self-attention mechanism, we employ swin transformer blocks, which introduce W-MSA (windowed multi-head self-attention) and SW-MSA (shifted windowed multi-head self-attention) [21]. After feature extraction by the swin transformer, the features are fed into the branch convolutional neural network of OpenPose. One pathway is dedicated to detecting human keypoints, while the other pathway focuses on learning the Part Affinity Fields (PAFs) [18] for the connections between keypoints. The results from each stage are concatenated and then forwarded to the next stage for further learning. The final output is a feature map annotated with human keypoints and limb connections. This method demonstrates more effective pose estimation, particularly in challenging scenarios such as blurry or occluded scenes.
In Section 2, we provide a detailed exposition of the structure of the proposed method, encompassing the backbone network utilized for feature extraction and the branch network employed for training. Moving on to Section 3, experimental validations of our proposed approach are conducted, wherein a thorough comparison with convolutional neural network methods is presented. Finally, in Section 4, we encapsulate the essence of this article through a comprehensive summary and outline prospects for future work.
In summary, the main contributions of this paper are as follows: We present a new pose estimation method using the transformer technique to improve accuracy. We introduce a shifted window and self-attention mechanism, enhancing feature extraction and keypoint heatmap generation for more precise human skeleton mapping. Our method shows superior performance in various challenging scenarios, including blurry and occluded scenes, limb intersections, and body shape variations.

2. Related Works

Vision Transformer. Ashish Vaswani et al. proposed a seq2seq model called the transformer [17] for NLP (natural language processing) tasks; it not only introduces a residual network structure but also proposes a multi-head self-attention mechanism with excellent results. In 2020, Alexey Dosovitskiy et al. applied the transformer to the field of computer vision and proposed the ViT (Vision Transformer) model [15]. Unlike convolutional neural networks, the ViT model first slices the image into several patches, then turns the two-dimensional tensor into a one-dimensional sequence via a linear transformation, and feeds the processed image data into the network for learning in the same way as a language sequence. After pre-training on Google's private JFT dataset, its best model achieved 88.55% accuracy on the ImageNet-1K dataset. This demonstrates not only that the transformer works in the CV domain, but also that it works very well. Wang et al. introduced the pyramid structure into the transformer and proposed the PVT (Pyramid Vision Transformer) model [19], which combines the advantages of convolutional neural networks and transformers and becomes a universal convolution-free backbone. It also successfully resolves the difficulties ViT experiences with pixel-level tasks such as object detection and segmentation. Liu et al. followed the scheme of introducing a pyramidal structure into the transformer and introduced the concept of W-MSA (windowed multi-head self-attention) [21]. The swin transformer partitions the feature map into multiple non-intersecting regions (windows) and performs the self-attention calculation within each window to significantly reduce the computational effort. SW-MSA (shifted windowed multi-head self-attention) then completes the information exchange between windows to improve accuracy. The swin transformer ranked first in performance in the tasks of object detection and instance segmentation on the COCO dataset.
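To make the patch-slicing step concrete, the sketch below shows a ViT-style patch embedding in PyTorch, using a strided convolution as the per-patch linear projection. The sizes (224 × 224 input, 16 × 16 patches, 768-dimensional tokens) are the common ViT-Base defaults, used here purely for illustration and not taken from this paper.

```python
import torch
import torch.nn as nn

# ViT-style patch embedding: slice the image into fixed-size patches and linearly
# project each patch into a token, turning the 2D image into a 1D token sequence.
patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)   # one linear projection per 16x16 patch
img = torch.randn(1, 3, 224, 224)
tokens = patch_embed(img).flatten(2).transpose(1, 2)          # (1, 196, 768): 14 x 14 patches
print(tokens.shape)
```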
Pose Estimation. The transformer has been very successful in the field of computer vision, and we wish to apply it to specific vision tasks such as human pose estimation. For human pose recognition, the two mainstream approaches are top-down and bottom-up. The representative algorithm of the top-down approach is AlphaPose, proposed by Fang et al. [16]. In this type of approach, the locations of the human bodies in the input image are first detected and then framed using anchor boxes; the scheme is therefore highly dependent on the accuracy of the human detection boxes. The bottom-up method is represented by the OpenPose algorithm proposed by Cao et al. [1]. In this type of method, all limbs in the input image are first detected and then stitched together to acquire the human skeleton. This method is much less computationally intensive than the top-down method when there are multiple people, but if two or more people are close to each other in the image, limb splicing easily becomes problematic. Since including the transformer in the backbone has been shown to yield higher accuracy, introducing it into an existing human pose estimation method may greatly improve accuracy.
Transformer in Pose Estimation. Li et al. proposed TokenPose [14], which incorporates learnable keypoint tokens and largely follows the ViT architecture; it is the first purely transformer-based 2D human pose estimation model. Zheng et al. proposed the PE-former model for 3D human pose estimation with a 2D-to-3D lifting approach [8], the first purely transformer-based 3D human pose estimation network. Mao et al. proposed TFPose [4], which mainly refers to deformable DETR, for single-person pose estimation. Xiong et al. proposed SwinPose [20], an attempt to use the swin transformer for pose estimation.
In this paper, a transformer-based pose estimation framework is proposed. The framework uses the swin transformer block, which has shown excellent results, and stacks it with reference to the swin-B and swin-L specifications. After the input image is fed into the backbone network, a branch network independently processes the keypoints and PAF connections. Compared to swin-pose, which also draws inspiration from the swin transformer, our work significantly modifies the swin transformer: we alter the stacking format of the various stages to introduce the self-attention mechanism into the task of human pose estimation more effectively. In tandem, we incorporate a branch network that utilizes Part Affinity Fields (PAFs) and Gaussian keypoint heatmaps to improve human pose estimation. In contrast to the pyramid structure of feature fusion in swin-pose, we maintain the original resolution during the feature extraction phase, which enables us to capture more details when calculating the self-attention mechanism in different regions. Our method not only achieves favorable results under numerous adverse conditions but also places a greater emphasis on applications limited by computational resources.

3. The Proposed Method

3.1. The Overall Framework of AiPE

In this paper, a transformer-based pose recognition method is proposed. Figure 1 illustrates the schematic diagram of the architecture. The architecture includes a patch partition block, which divides the input image into multiple patches, and a linear embedding block, which transforms the 2D tensors into a 1D sequence; the network is then partitioned into three stages. Each stage consists of several sets of consecutive swin transformer blocks and a patch merging block. Through the patch merging block, the model can perform feature extraction and learning at multiple scales. Finally, these features are fed into a branch network for learning human keypoints and Part Affinity Fields (PAFs).
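The data flow just described can be summarized in the skeleton below. It is a minimal sketch with illustrative module and size choices of our own: the swin transformer blocks are reduced to placeholders, the stage transitions to strided convolutions, and the branch network to single 1 × 1 heads. It shows how the pieces connect rather than reproducing the actual implementation.

```python
import torch
import torch.nn as nn

class StageSketch(nn.Module):
    """Placeholder for a stack of swin transformer blocks followed by patch merging."""
    def __init__(self, in_dim, out_dim, downsample):
        super().__init__()
        self.blocks = nn.Identity()                 # stand-in for paired W-MSA / SW-MSA blocks
        stride = 2 if downsample else 1
        self.merge = nn.Conv2d(in_dim, out_dim, kernel_size=stride, stride=stride)

    def forward(self, x):
        return self.merge(self.blocks(x))

class AiPESketch(nn.Module):
    """High-level data flow of Figure 1 (names and sizes are illustrative)."""
    def __init__(self, embed_dim=128):
        super().__init__()
        # Patch partition + linear embedding as a single strided projection.
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=4, stride=4)
        self.stage1 = StageSketch(embed_dim, embed_dim * 2, downsample=True)
        self.stage2 = StageSketch(embed_dim * 2, embed_dim * 4, downsample=True)
        self.stage3 = StageSketch(embed_dim * 4, embed_dim * 4, downsample=False)
        # Simplified branch heads: 19 keypoint heatmaps and 38 PAF maps (OpenPose-style).
        self.keypoint_head = nn.Conv2d(embed_dim * 4, 19, kernel_size=1)
        self.paf_head = nn.Conv2d(embed_dim * 4, 38, kernel_size=1)

    def forward(self, img):
        f = self.stage3(self.stage2(self.stage1(self.patch_embed(img))))
        return self.keypoint_head(f), self.paf_head(f)

heatmaps, pafs = AiPESketch()(torch.randn(1, 3, 256, 256))
print(heatmaps.shape, pafs.shape)   # (1, 19, 16, 16) and (1, 38, 16, 16)
```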

3.2. The Swin Transformer Block

To capture smaller image features without significantly increasing the computational demands on the device, we chose swin transformer blocks for feature extraction [3,21]. The multi-head attention component of the swin transformer introduces the W-MSA and SW-MSA modules, and the input features always pass through a transformer module containing W-MSA paired with one containing SW-MSA. Therefore, as shown in Figure 2, the swin transformer block modules in the architecture always appear in pairs, and the number of blocks within each stage is always even.
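As a rough sketch of how such a pair of blocks might operate, the code below applies attention inside fixed windows and then inside windows shifted by half the window size via torch.roll. It deliberately omits the MLP sub-layers, relative position bias, and the attention masking of wrapped-around pixels, so it illustrates the pairing idea rather than the exact swin transformer block; all dimensions are illustrative.

```python
import torch
import torch.nn as nn

def window_attention(x, attn, ws):
    """Run multi-head self-attention independently inside each ws x ws window."""
    B, H, W, C = x.shape
    w = x.view(B, H // ws, ws, W // ws, ws, C).permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)
    out, _ = attn(w, w, w)                                        # attention within one window only
    out = out.view(B, H // ws, W // ws, ws, ws, C).permute(0, 1, 3, 2, 4, 5)
    return out.reshape(B, H, W, C)

class PairedSwinBlocksSketch(nn.Module):
    """One W-MSA block followed by one SW-MSA block (minimal sketch, MLPs omitted)."""
    def __init__(self, dim, num_heads, ws=7):
        super().__init__()
        self.ws = ws
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.attn1 = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn2 = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                                          # x: (B, H, W, C)
        # Block 1: attention within fixed, non-overlapping windows.
        x = x + window_attention(self.norm1(x), self.attn1, self.ws)
        # Block 2: cyclically shift so window borders move, attend, then shift back.
        s = self.ws // 2
        shifted = torch.roll(x, shifts=(-s, -s), dims=(1, 2))
        shifted = shifted + window_attention(self.norm2(shifted), self.attn2, self.ws)
        return torch.roll(shifted, shifts=(s, s), dims=(1, 2))

print(PairedSwinBlocksSketch(96, 4)(torch.randn(1, 56, 56, 96)).shape)  # (1, 56, 56, 96)
```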

3.3. The Backbone Network

In the model, a patch partition layer and a linear embedding layer are used to transform the two-dimensional three-channel input image into a one-dimensional sequence. To enable the model to capture smaller features without incurring redundant computational costs, swin transformer blocks are selected for feature extraction. Subsequently, an independent branch network, primarily based on OpenPose, processes keypoints and Part Affinity Field (PAF) connections.
The backbone network is composed of multiple consecutive swin transformer blocks. To improve accuracy and reduce resource utilization, two approaches are employed for composing the blocks: parameter stacking and depth stacking. The PE-B and PE-L models use only the first three stages of the swin transformer for block stacking, reducing the number of resolution-reduction stages and patch merging layers. This reduces the computational workload and better aligns the output of the backbone with the input of the branch network. After each of the first two stages of swin transformer blocks, a patch merging layer is added to downsample the feature map and fuse features. A large number of swin transformer blocks are heavily stacked in the third stage of the model, enabling better feature learning and reducing the occurrence of network degradation. At the end of the third stage, a connecting layer is added to adjust the features extracted through the backbone network and send them to the branch network. The structure of the backbone network is illustrated in Figure 3. In PE-B, the depth of the feature map is 128, with 4, 8, and 16 attention heads in the multi-head self-attention modules of the three stages; in PE-L, the feature map depth is 192, with 6, 12, and 22 attention heads in the three stages.
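A configuration sketch of these two variants is given below. The embedding depths (128 for PE-B, 192 for PE-L), the three-stage layout, and the placement of patch merging after the first two stages follow the description above; the per-stage block counts are illustrative placeholders, not values from the paper.

```python
# Stage configurations implied by the text; "depths" (blocks per stage) are placeholders.
PE_CONFIGS = {
    "PE-B": {"embed_dim": 128, "depths": (2, 2, 18), "downsample": (True, True, False)},
    "PE-L": {"embed_dim": 192, "depths": (2, 2, 18), "downsample": (True, True, False)},
}

def stage_dims(cfg):
    """Channel width at each stage: patch merging doubles the depth when it downsamples."""
    dims, d = [], cfg["embed_dim"]
    for down in cfg["downsample"]:
        dims.append(d)
        if down:
            d *= 2
    return dims

print(stage_dims(PE_CONFIGS["PE-B"]))   # [128, 256, 512]
print(stage_dims(PE_CONFIGS["PE-L"]))   # [192, 384, 768]
```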
After incorporating the swin transformer, the multi-head attention calculations for patches are not performed across the entire image but within a designated window. Following multi-head attention computations within each window, the next layer conducts attention calculations with a shifted window. The shifted window in the self-attention calculation (SAC) enables information exchange between different adjacent windows. Despite multi-head self-attention calculations not directly occurring across the entire image scale, this information exchange allows each patch to gather information as comprehensively as possible from the entire image.
Additionally, following each stage of swin transformer blocks, there is a patch merging block. This module halves the height and width of the feature map while doubling its depth. The operation enlarges each patch's perceptual range, helping the model perceive targets of different sizes.
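A minimal patch merging layer in the style of the standard swin transformer is sketched below: it gathers each 2 × 2 neighbourhood into a 4C-dimensional vector and projects it to 2C, which halves the spatial size and doubles the depth as described. The sizes in the example are illustrative, and this is a sketch rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class PatchMergingSketch(nn.Module):
    """Halve H and W and double C by grouping 2x2 neighbourhoods and projecting 4C -> 2C."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                          # x: (B, H, W, C), H and W even
        x0 = x[:, 0::2, 0::2, :]                   # the four interleaved sub-grids
        x1 = x[:, 1::2, 0::2, :]
        x2 = x[:, 0::2, 1::2, :]
        x3 = x[:, 1::2, 1::2, :]
        x = torch.cat([x0, x1, x2, x3], dim=-1)    # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))        # (B, H/2, W/2, 2C)

print(PatchMergingSketch(96)(torch.randn(1, 56, 56, 96)).shape)  # (1, 28, 28, 192)
```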
We stack the swin transformer blocks heavily in the third stage, following the common practice used in residual network backbones. Extensive experiments have shown that stacking in the third stage allows the model to learn features better and reduces the occurrence of network degradation. After stages 1 and 2 of the swin transformer blocks, we add a patch merging layer to downsample the feature map and fuse the features. At the end of stage 3, we add a connecting layer to adjust the features extracted by AiPE through the backbone and send them to the branch network.

3.4. The Branch Network

For the branch network, the model adopts the structure of OpenPose. As illustrated in Figure 4, F represents the feature map extracted by the swin transformer portion. After the connecting layer, the feature map undergoes size adjustment and then enters the first stage of the branch network. S1 is the set of feature maps for the first stage, consisting of 19 human keypoint maps: feature maps for 18 keypoints plus 1 for the background. S1 is then sent to the PAF branch to learn the PAFs, yielding L1 for the first stage, which includes 38 feature maps. At the end of the first stage, S1 and L1 are concatenated and sent to the next stage. As shown in Figure 5, through six consecutive refinement stages, a set of coordinates for the human keypoints and 19 sets of limb connections with high confidence are ultimately obtained.
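The stage-wise refinement described above might look like the following sketch. Each stage outputs 19 keypoint maps and 38 PAF maps, and its outputs are concatenated with the backbone features F before the next stage; the convolution depths and widths here are illustrative stand-ins, not the exact OpenPose layer configuration.

```python
import torch
import torch.nn as nn

def conv_branch(in_ch, out_ch):
    """A small OpenPose-style convolutional pathway (illustrative depth and widths)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 128, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(128, out_ch, 1),
    )

class BranchNetworkSketch(nn.Module):
    """Six refinement stages; each predicts 19 keypoint maps (S) and 38 PAF maps (L)
    and passes cat(F, S, L) to the next stage."""
    def __init__(self, feat_ch, num_stages=6):
        super().__init__()
        self.kpt_stages = nn.ModuleList()
        self.paf_stages = nn.ModuleList()
        in_ch = feat_ch
        for _ in range(num_stages):
            self.kpt_stages.append(conv_branch(in_ch, 19))
            self.paf_stages.append(conv_branch(in_ch, 38))
            in_ch = feat_ch + 19 + 38          # later stages see F plus the previous S and L

    def forward(self, F):
        x = F
        for kpt, paf in zip(self.kpt_stages, self.paf_stages):
            S, L = kpt(x), paf(x)
            x = torch.cat([F, S, L], dim=1)
        return S, L

S, L = BranchNetworkSketch(feat_ch=256)(torch.randn(1, 256, 32, 32))
print(S.shape, L.shape)                         # (1, 19, 32, 32) and (1, 38, 32, 32)
```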

3.5. The Loss Function

The loss function employs the Mean Square Error (MSE) data term [23,24], where $Y_i$ represents the target value, $f(x_i)$ represents the estimated value, $x_i$ denotes the corresponding keypoint, and $N$ is the total number of keypoints. The Mean Square Error data term can be expressed as:

$$\mathrm{loss} = \frac{1}{N}\sum_{i=1}^{N}\left(Y_i - f(x_i)\right)^2$$
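For reference, this loss has a direct PyTorch counterpart, shown below; when applied to heatmaps it averages over pixels as well as keypoints, and the tensor shapes are illustrative.

```python
import torch

def keypoint_mse_loss(pred, target):
    """Mean squared error between predicted and target keypoint heatmaps:
    the mean of (Y_i - f(x_i))^2 over all keypoint entries."""
    return torch.mean((target - pred) ** 2)

# Example with 19 heatmaps of size 64x64 (illustrative shapes).
pred = torch.rand(1, 19, 64, 64)
target = torch.rand(1, 19, 64, 64)
print(keypoint_mse_loss(pred, target).item())
```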

4. The Experimental Results and Discussion

4.1. Experiment Setting and Details

All the experiments are conducted on Ubuntu 18.04 using two Nvidia GeForce RTX 3090 Ti GPUs (Nvidia, Santa Clara, CA, USA) for training on the COCO2017 dataset [25,26]. The training was configured with 1000 epochs, a batch size of 16, and a learning rate of 0.0001. SGD optimizer was employed for training.
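A minimal sketch of this training setup is shown below. The model and data loader are placeholders, and momentum and weight decay are not reported in the text, so PyTorch defaults are used.

```python
import torch

# Training setup stated above: SGD, learning rate 1e-4, batch size 16, 1000 epochs.
model = torch.nn.Conv2d(3, 19, 1)                        # stand-in for the AiPE model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
criterion = torch.nn.MSELoss()

def train_one_epoch(train_loader):
    for images, target_heatmaps in train_loader:         # batches of size 16 in the paper's setup
        optimizer.zero_grad()
        loss = criterion(model(images), target_heatmaps)
        loss.backward()
        optimizer.step()
```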
The COCO dataset is a large-scale dataset containing over 330,000 images (with annotations for 220,000 images), encompassing 1.5 million objects across 80 object categories (e.g., pedestrians, cars, elephants) and 91 stuff (material) categories (e.g., grass, walls, sky). Each image has five captions, and there are 250,000 person instances annotated with keypoints. The COCO dataset supports multiple tasks, and for each task there is a different annotation type stored in JSON format. For our human keypoint task, the annotations include the information shown in Figure 6. ‘num_keypoints’ represents the total number of annotated keypoints in the image, ‘area’ denotes the area of the region containing the keypoints, and ‘iscrowd’ has a value of 0 for an individual person and 1 for a group annotation. The keypoints consist of a total of 51 items, with each triplet recording the x and y coordinates of a keypoint as well as its visibility. Specifically, a visibility value of ‘2’ indicates a visible keypoint, ‘1’ indicates an occluded keypoint, and ‘0’ indicates an unmarked keypoint.
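The keypoint triplets described here can be read with a few lines of Python. The field names follow the standard COCO person-keypoints JSON format, and the annotation file path below is illustrative.

```python
import json

# Each of the 17 COCO keypoints is stored as a (x, y, v) triplet: v=0 unlabeled,
# v=1 labeled but occluded, v=2 labeled and visible (51 numbers in total).
with open("annotations/person_keypoints_val2017.json") as f:
    coco = json.load(f)

ann = coco["annotations"][0]
kps = ann["keypoints"]                                   # flat list of 51 numbers = 17 * (x, y, v)
triplets = [tuple(kps[i:i + 3]) for i in range(0, len(kps), 3)]
visible = [(x, y) for x, y, v in triplets if v == 2]
print(ann["num_keypoints"], ann["iscrowd"], len(visible))
```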
In the evaluation of object detection, IoU (Intersection over Union) [22] serves as a metric for measuring the similarity between predicted results and the ground truth targets. With this value, thresholds can be set to calculate the AP (average precision) metric [27]. In this study, we have set the thresholds at 0.5 and 0.75, respectively, for comparison with other experimental methods.
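For clarity, a plain-Python IoU computation for two axis-aligned boxes is sketched below; a prediction whose IoU with the ground truth exceeds the chosen threshold (0.5 or 0.75 here) counts as correct when computing AP. This is a generic illustration of the metric, not the evaluation code used in the experiments.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 0.142857..., below both 0.5 and 0.75 thresholds
```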

4.2. Results and Discussion

Figure 7 displays the experimental results. It can be observed that the proposed method performs exceptionally well in single-person pose estimation. Even in the presence of props, animals, and occlusion, the method is capable of achieving single-person pose estimation and can even recover some occluded keypoints, connecting them into limbs. In the case of two-person and three-person pose estimation, if the individuals in the image are at a considerable distance and have minimal overlap, the proposed method can still accomplish pose estimation, performing comparably to single-person pose estimation. Even when two individuals are in close proximity, as long as the target individuals have sufficient pixels, this method can still perform well in pose estimation.
Based on the experimental results of the proposed method and VGG [28], Table 1 has been compiled for comparing the human pose estimation outcomes of these two different methods.
Figure 8 illustrates comparisons of single-person pose estimation in five scenarios. It can be observed that the proposed method achieves more accurate human pose estimation than the original VGG method. In the first comparison group, because the right hand keypoints lie outside the image boundaries, VGG fails to correctly identify the right wrist and elbow keypoints. In the second comparison group, with an overall grayscale tone in the image, the VGG method only reconstructs some keypoints of the upper body and once again incorrectly connects the wrist keypoints. In contrast, the proposed method not only delineates keypoints on the lower right side of the body but also infers keypoints for the completely occluded left hip. In the third comparison group, where the human body occupies less than a quarter of the entire image, both our method and the VGG method fail to correctly connect all keypoints. However, compared to VGG, the proposed method loses far fewer keypoints: it finds all the human keypoints, with only a mismatched connection between the left foot and the right foot. In the fourth comparison group, the left forearm of the target girl is entirely occluded by the left upper arm, and the left shoulder keypoint is fully obscured. Even in this scenario, the proposed method successfully performs human pose estimation, recovering all the keypoints in the upper body. In contrast, VGG's inference results noticeably miss the left wrist keypoint and left forearm. In the fifth comparison group, the target is a person on a screen. Pixelation and placement of the screen may introduce slight discrepancies between the human on the screen and a real person. In this set of photos, unlike the VGG method, the proposed method identifies the right hip keypoint and right hand keypoint. As the right hip and left hip keypoints are not on the screen, the proposed method recovers them as occluded to ensure the integrity of the upper body skeleton of the target person, a feature not captured by the VGG method.
As shown in Figure 9, there are four sets of multi-person pose estimations, with three or four individuals in each set. In the sixth comparison group, despite all individuals in the photo standing neatly together with no variation in body size, due to their close proximity, many joints are still occluded. The proposed method does not make hasty judgments on occluded arm keypoints during human pose estimation, while VGG results in many keypoints on the arms being incorrectly annotated in this situation. In the seventh comparison group, the proposed method struggles to execute perfect pose estimation amid complex clothing, accessories, and interference from safety nets. However, in the non-crossed limbs, the proposed method correctly identifies the ownership of keypoints. In the eighth comparison group, two animals occlude half of the bodies of the two targets, yet the proposed method accurately annotates all visible keypoints, correctly identifying the presence of distant human bodies and performing human pose estimation, which VGG fails to achieve. In the ninth set of photos, three targets are facing the camera from the side, front, and back, respectively. The proposed method accurately identifies all keypoints, especially around the vibrant-colored long skirt, correctly determining the keypoints of the lower body. In contrast, VGG only identifies knee keypoints and misjudges one foot keypoint.
Figure 10 displays input images for the second set of comparisons during model training. After 50 rounds of training, the proposed method is able to discern the occluded keypoints and accurately draw the skeleton annotations on the human body after 110 rounds of training. In contrast, the VGG method continues to hesitate on the position of the right hand keypoint between rounds 80 and 130.
As shown in the input images for the eighth set (Figure 11), estimating poses disturbed by animals is a more challenging task due to varying distances between human targets and different lighting conditions. It can be observed that the proposed method, after 50 rounds, identifies the human targets in the distance under sunlight and begins attempting human pose estimation. It determines an accurate drawing scheme for the human skeleton annotations within 100 rounds. In contrast, in the original method with VGG as the backbone, the network only becomes aware of the targets behind the animals at around 100 rounds. It is only able to draw the complete skeleton of the person after obtaining the best model. Thus, compared to the convolutional neural network of VGG, the transformer’s global perspective and global self-attention mechanism contribute to the network’s understanding of input images and efficient human pose estimation.
Table 2 compares the number of parameters and floating-point operations (FLOPs) of pose estimation models trained with the VGG method and with our method. In terms of the number of parameters, the model trained with the VGG method has approximately 52,311,446 parameters (occupying about 199.55 MB of memory), while our method results in a model with approximately 2,867,966,656 parameters (occupying about 681.23 MB of memory). This indicates that our model is more complex and has greater flexibility during training to fit the training data. In terms of FLOPs, our model requires 11.73 billion floating-point operations to process a single image, compared to 14.80 billion for the VGG method. This means our model requires less computation time per prediction, offering advantages when deploying on hardware with limited computational capabilities. Additionally, the reduced number of floating-point operations makes our model more suitable for real-time human pose estimation tasks. Compared with swin-pose, we can draw the same conclusion: our model has more parameters and is thus more complex and versatile, while its GFLOPs are 57% of those of swin-pose, which also shows that our model is better suited to practical applications and real-time computation.
In comparison with six other methods, namely the VGG method, CMU-pose [29], DeepPose [30], Dense-CNN [31], AE [32], and G-RMI [33], under the two threshold settings of 0.5 and 0.75, our method demonstrates superior performance, as shown by the quantitative evaluation results in Table 3. When compared with the VGG method, CMU-pose, DeepPose, Dense-CNN, and AE, our method consistently achieves the best performance under both threshold conditions, and it greatly outperforms Dense-CNN and DeepPose. Compared to the VGG method, our method improves performance by 3.9% at both the 0.50 and 0.75 thresholds. Compared to CMU-pose, our method improves performance by 6.3% at the 0.50 threshold and 6.2% at the 0.75 threshold. Compared to AE, our method improves performance by 4.5% at the 0.50 threshold and 7.1% at the 0.75 threshold. While our method exhibits lower average precision than G-RMI at the 0.75 threshold, its performance surpasses G-RMI at the 0.5 threshold.
Through qualitative and quantitative comparisons, it can be concluded that the proposed method outperforms other methods, demonstrating better performance in locating human keypoints and connecting body limbs, especially in challenging scenarios such as image blurriness and low resolution.

5. Discussion

We propose a bottom-up human pose estimation method built on the transformer, taking advantage of the latest achievements in computer vision. Compared with the original version using VGG, our method uses a multi-head self-attention mechanism to better localize the keypoints of the human body and connect the limbs. Even for blurred, low-resolution images, we can still perform human pose estimation, and our method achieves very good results in single- and few-person situations. As observed in Figure 12, even in occluded and blurred scenes, our method still manages to recognize the correlations between the various keypoints in the image and accurately perform human pose estimation. This makes our method widely applicable in a variety of adverse conditions. For example, a healthcare robot that monitors the physical condition of elderly people living alone could be susceptible to hackers stealing the digital data produced by its cameras. However, if the camera uses a lens that is inherently blurred, performing human pose estimation with our method not only yields accurate results but also maximally protects the privacy of users. This represents a unique advantage of our method. However, in dense crowd conditions, our method does not perform human pose estimation as well as VGG, which stems from the weakness of the transformer in analyzing small targets. It may also be due to the fact that, in order to reduce the computational effort and control downsampling, we did not use the full [2, 2, 18, 2] block-stacking format of the original swin transformer.
In future work, we may use the original swin transformer as the backbone of AiPE. However, considering that the swin transformer can also serve as an excellent backbone for human detection, perhaps we should first detect the number of people and then choose a model accordingly. If the number of people is high, we can use the VGG model to perform human pose estimation; if the number of people is low, we can use the transformer and other methods containing a multi-head self-attention mechanism, which may also greatly improve the accuracy of AiPE.
In addition, we found that many human keypoint determination errors and connection errors arise because the length ratios of human limbs are not taken into account. If we can determine an approximate limb length ratio, the accuracy may in turn be improved by using a greedy algorithm to search for keypoints within the expected limb-length range around already determined keypoints. This will, however, bring about significant computational effort, and whether the increased accuracy is worth the large computational cost needs to be considered.
There is no doubt that AiPE can perform human pose estimation very effectively, but there is also much room for improvement, and we will continue to invest in this method to obtain further research results.
Moreover, introducing the transformer into pose estimation means that unified image-and-language models can also be implemented in the field of pose estimation, and we can aim to introduce more accurate pose estimation into multimodal studies. Furthermore, after generating 2D coordinates, we can integrate depth information obtained from depth cameras. Through additional training and learning, we are committed to creating three-dimensional pose estimation diagrams. This will enhance the development of assisted living systems for the elderly. We plan to research and test these ideas in our future work.

6. Conclusions

To address the challenges of processing blurry and low-resolution images, better localize keypoints, and connect body limbs, a human pose estimation model is proposed, with transformer as the backbone network and OpenPose as the branch network. It can perform pose estimation more effectively in scenarios like blurry and occluded scenes, effectively addressing issues such as errors in human skeleton mapping due to occluded or missing keypoints, thereby enhancing the accuracy of pose estimation results. The method proposed in this paper is expected to find good applications in areas such as healthcare, fall detection, gait recognition, and dance instruction. Furthermore, compared to other methods, the proposed approach demonstrates stronger resilience in adverse conditions like blurred camera footage and obstructions.
In future work, we aim to refine the algorithm and apply it to group photos with more than five individuals, improving its accuracy in multi-person pose estimation. Additionally, we will address the observed challenge in practical scenarios where the proposed method sometimes falls short of achieving the expected results in scenes with more than five individuals. Furthermore, we will dedicate efforts to developing real-time processing capabilities and integrating the latest point cloud technologies to explore 3D pose estimation. This will enable our method to respond more quickly to real-time contingencies and conduct a more comprehensive assessment by leveraging depth information.

Author Contributions

K.L.: conceptualization, methodology, validation, data curation, writing, and original draft preparation. D.M.: supervision and project administration. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (Ministry of Science and ICT (MSIT)) (No. 2021R1A2C209494311). This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (No. 2020R1A6A1A03046811). This paper was supported by the Korea Institute for Advancement of Technology (KIAT) grant funded by the Korea Government (MOTIE) (P0020536, HRD Program for Industrial Innovation).

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Song, L.; Yu, G.; Yuan, J.; Liu, Z. Human pose estimation and its application to action recognition: A survey. J. Vis. Commun. Image Represent. 2021, 76, 103055. [Google Scholar] [CrossRef]
  2. Solomon, E.; Cios, K.J. FASS: Face anti-spoofing system using image quality features and deep learning. Electronics 2023, 12, 2199. [Google Scholar] [CrossRef]
  3. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  4. Mao, W.; Ge, Y.; Shen, C.; Tian, Z.; Wang, X.; Wang, Z. Tfpose: Direct human pose estimation with transformers. arXiv 2021, arXiv:2103.15320. [Google Scholar]
  5. Wang, Z.; Yu, Z.; Zhao, C.; Zhu, X.; Qin, Y.; Zhou, Q.; Zhou, F.; Lei, Z. Deep spatial gradient and temporal depth learning for face anti-spoofing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 5042–5051. [Google Scholar]
  6. Bafana, S.; Raghuraman, R.; Hussaini, S.A. Exploring Novel Object Recognition and Spontaneous Location Recognition Machine Learning Analysis Techniques in Alzheimer’s Mice. arXiv 2023, arXiv:2312.06914. [Google Scholar]
  7. Nguyen, V.N.; Groueix, T.; Salzmann, M.; Lepetit, V. GigaPose: Fast and Robust Novel Object Pose Estimation via One Correspondence. arXiv 2023, arXiv:2311.14155. [Google Scholar]
  8. Panteleris, P.; Argyros, A. Pe-former: Pose estimation transformer. In Proceedings of the International Conference on Pattern Recognition and Artificial Intelligence, Chengdu, China, 19–21 August 2022; Springer: Berlin/Heidelberg, Germany; pp. 3–14. [Google Scholar]
  9. Zhang, H.; Zhang, L.; Qi, X.; Li, H.; Torr, P.H.; Koniusz, P. Few-shot action recognition with permutation-invariant attention. In Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16; Springer: Berlin/Heidelberg, Germany, 2020; pp. 525–542. [Google Scholar]
  10. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25. [Google Scholar] [CrossRef]
  11. Yu, X.; Yu, Z.; Ramalingam, S. Learning strict identity mappings in deep residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4432–4440. [Google Scholar]
  12. Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5693–5703. [Google Scholar]
  13. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems 28 (NIPS 2015), Montreal, QC, Canada, 7–12 December 2015; Volume 28. [Google Scholar]
  14. Li, Y.; Zhang, S.; Wang, Z.; Yang, S.; Yang, W.; Xia, S.-T.; Zhou, E. Tokenpose: Learning keypoint tokens for human pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 11313–11322. [Google Scholar]
  15. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A survey on vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 87–110. [Google Scholar] [CrossRef] [PubMed]
  16. Fang, H.-S.; Xie, S.; Tai, Y.-W.; Lu, C. Rmpe: Regional multi-person pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2334–2343. [Google Scholar]
  17. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar] [CrossRef]
  18. Zappel, M.; Bultmann, S.; Behnke, S. 6D object pose estimation using keypoints and part affinity fields. In Robot World Cup; Springer: Berlin/Heidelberg, Germany, 2021; pp. 78–90. [Google Scholar]
  19. Wang, W.; Xie, E.; Li, X.; Fan, D.-P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 568–578. [Google Scholar] [CrossRef]
  20. Xiong, Z.; Wang, C.; Li, Y.; Luo, Y.; Cao, Y. Swin-Pose: Swin Transformer Based Human Pose Estimation. arXiv 2022, arXiv:2201.07384. [Google Scholar] [CrossRef]
  21. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  22. Zhang, H.; Wang, Y.; Dayoub, F.; Sunderhauf, N. Varifocalnet: An iou-aware dense object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8514–8523. [Google Scholar]
  23. Ren, J.; Zhang, M.; Yu, C.; Liu, Z. Balanced mse for imbalanced visual regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7926–7935. [Google Scholar]
  24. Han, X.; Papyan, V.; Donoho, D.L. Neural collapse under mse loss: Proximity to and dynamics on the central path. arXiv 2021, arXiv:2106.02073. [Google Scholar]
  25. Lin, T.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft coco: Common objects in context. In European Conference on Computer Vision 2014; Springer: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
  26. Jin, S.; Xu, L.; Xu, J.; Wang, C.; Liu, W.; Qian, C.; Ouyang, W.; Luo, P. Whole-body human pose estimation in the wild. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX 16; Springer: Berlin/Heidelberg, Germany, 2020; pp. 196–214. [Google Scholar]
  27. Hung, W.-C.; Kretzschmar, H.; Casser, V.; Hwang, J.-J.; Anguelov, D. LET-3D-AP: Longitudinal error tolerant 3D average precision for camera-only 3D detection. arXiv 2022, arXiv:2206.07705. [Google Scholar]
  28. Hsu, G.-S.; Shie, H.-C.; Hsieh, C.-H.; Chan, J.-S. Fast landmark localization with 3D component reconstruction and CNN for cross-pose recognition. IEEE Trans. Circuits Syst. Video Technol. 2017, 28, 3194–3207. [Google Scholar] [CrossRef]
  29. Lanusse, F.; Ma, Q.; Li, N.; Collett, T.E.; Li, C.-L.; Ravanbakhsh, S.; Mandelbaum, R.; Póczos, B. CMU DeepLens: Deep learning for automatic image-based galaxy–galaxy strong lens finding. Mon. Not. R. Astron. Soc. 2018, 473, 3895–3906. [Google Scholar] [CrossRef]
  30. Toshev, A.; Szegedy, C. Deeppose: Human pose estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1653–1660. [Google Scholar]
  31. Pishchulin, L.; Insafutdinov, E.; Tang, S.; Andres, B.; Andriluka, M.; Gehler, P.; Schiele, B. Deepcut: Joint subset partition and labeling for multi person pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4929–4937. [Google Scholar]
  32. Newell, A.; Huang, Z.; Deng, J. Associative embedding: End-to-end learning for joint detection and grouping. Adv. Neural Inf. Process. Syst. 2017, 30, 2278–2288. [Google Scholar]
  33. Papandreou, G.; Zhu, T.; Kanazawa, N.; Toshev, A.; Tompson, J.; Bregler, C.; Murphy, K. Towards accurate multi-person pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4903–4911. [Google Scholar]
Figure 1. The framework of the proposed method (AiPE).
Figure 2. Two successive swin transformer blocks.
Figure 3. The structure of the backbone blocks.
Figure 4. The architecture of the branch network.
Figure 5. The branch network performs human keypoint analysis and PAF analysis on the input images. (a) Input image. (b) Generated heat map of human keypoints with determination of their most likely positions. (c) Analysis of the PAF between the “neck” keypoint and the “right hip” keypoint of the torso and connection of these two keypoints to complete the drawing of the right torso skeleton.
Figure 6. The annotations of human keypoints.
Figure 7. The experimental results of the proposed method.
Figure 8. The comparison of the experimental results of the proposed method and VGG (single person).
Figure 9. The comparison of the experimental results of the proposed method and VGG (multiple people).
Figure 10. The training process of the proposed method and VGG (1).
Figure 11. The training process of the proposed method and VGG (2).
Figure 12. Results in occluded and blurry scenes (left: our method; right: VGG method).
Table 1. The scenarios represented by each group.

The Proposed Method | VGG | Scenario
1-a | 1-b | Single-person human pose estimation
2-a | 2-b | Single-person human pose estimation {key points of the human body are obscured}
3-a | 3-b | Single-person human pose estimation {limbs intersecting}
4-a | 4-b | Single-person human pose estimation {limbs overlapping}
5-a | 5-b | Single-person human pose estimation {the target is the person on the screen}
6-a | 6-b | Multi-person human pose estimation
7-a | 7-b | Multi-person human pose estimation {complex clothing and close target}
8-a | 8-b | Multi-person human pose estimation {with animals and different distances between the human and the camera}
9-a | 9-b | Multi-person human pose estimation {including front body, back body and side body}
Table 2. Comparison of parameters and GFLOPs across various models.

Method | Params (M) | GFLOPs
Swin-pose | 197.9 | 204.5
VGG method | 199.5 | 148.0
The proposed method | 681.2 | 117.3
Table 3. The average precision of the proposed method and other methods.

Method | Average Precision (IoU = 0.50) | Average Precision (IoU = 0.75)
VGG method | 82.4 | 65.0
CMU-pose | 79.2 | 65.1
DeepPose | 69 | 61
Dense-CNN | 60.2 | 54.1
AE | 81.8 | 61.8
G-RMI | 85.5 | 71.3
The proposed method | 86.3 | 68.9


