Efficient Sampling of Two-Stage Multi-Person Pose Estimation and Tracking from Spatiotemporal
Abstract
1. Introduction
- This paper introduces a two-stage joint framework for human pose estimation and tracking. In the first stage, a precise target detector localizes human bodies in video frames following a top-down approach, and human keypoint coordinates are then obtained by combining heatmaps with offset regression. The second stage uses a bipartite graph matching algorithm to track keypoints between consecutive video frames (a matching sketch is given after this list).
- This study pioneers a unified framework for human pose estimation and tracking, achieving commendable performance on widely adopted benchmarks.
- The framework exhibits remarkable scalability, allowing seamless integration into diverse human visual applications, such as 3D human pose estimation, human parsing, action reconstruction, and other related domains.
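As a concrete illustration of the second-stage association, the sketch below matches detections between two consecutive frames with the Hungarian algorithm on a bounding-box-overlap cost. It is a minimal, assumption-based example: `iou`, `match_poses`, the threshold value, and the track-ID bookkeeping are illustrative rather than taken from the paper, which could equally build the matching cost from keypoint similarity.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def match_poses(prev_boxes, prev_ids, cur_boxes, next_id, iou_thresh=0.3):
    """Bipartite matching of current detections to previous tracks.
    Unmatched detections start new track IDs (illustrative bookkeeping)."""
    cur_ids = [-1] * len(cur_boxes)
    if prev_boxes and cur_boxes:
        cost = np.array([[1.0 - iou(p, c) for c in cur_boxes] for p in prev_boxes])
        rows, cols = linear_sum_assignment(cost)        # Hungarian algorithm
        for r, c in zip(rows, cols):
            if 1.0 - cost[r, c] >= iou_thresh:          # keep only sufficiently overlapping pairs
                cur_ids[c] = prev_ids[r]
    for i, tid in enumerate(cur_ids):                   # unmatched detections start new tracks
        if tid == -1:
            cur_ids[i], next_id = next_id, next_id + 1
    return cur_ids, next_id
```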
2. Related Work
2.1. Multi-Person Pose Estimation in Images
2.1.1. Two-Stage Approaches
- Bottom-Up Approaches: These approaches start by detecting all possible keypoints in an image and then group them into individual skeletons. Several improvements have been made, such as optimizing keypoint regression [8] and improving heatmap prediction [9]. However, because these approaches rely solely on keypoint detection, they often ignore global information from other body parts and the overall target, and they struggle with scale variation, particularly for small individuals in the image.
- Top-Down Approaches: In contrast to the bottom-up approach, top-down approaches first detect the bounding boxes for all persons in the image and then perform single-person pose estimation within each detected bounding box. However, when the detection target is severely occluded, the deviation caused by bounding box detection can lead to failure. To address this issue, subsequent work has explored techniques based on Convolutional Neural Networks (CNNs) or transformer frameworks. For example, an Image Guided Progressive Graph Convolutional Networks (GCN) module [10] was proposed to infer the position of invisible joints based on action types and image context. However, top-down approaches typically involve two stages: initial pose estimation and pose correction, which affect their efficiency. Transformer-based methods, such as ViT-Pose [11], offer simplicity, scalability, and flexibility, but their larger model size and input resolution hinder their effective application to video-based pose estimation. A common limitation of top-down approaches is that their inference time increases with the number of people in the image.
- A hybrid solution [6] that combines the two approaches has been explored, where bottom-up pose estimation models are used as detectors instead of classic object detectors. These models provide detection boxes and pose proposals as conditions for subsequent attention-based top-down models. However, the performance of different sampling strategies for these conditions, such as empirical sampling and generative sampling, requires further exploration.
2.1.2. One-Stage Approaches
2.2. Multi-Person Pose Estimation and Tracking in Videos
2.2.1. Multi-Person Pose Estimation
- Bottom-Up Approaches: These approaches estimate poses by performing keypoint detection and grouping them on a per-frame basis [14,15,24,25]. However, most approaches struggle with body part association in crowded scenes with severe occlusion or complex movements. For instance, while DeciWatch [15] employs DenoiseNet to diminish the noise attributed to motion blur, its efficacy may be compromised in cases of non-periodic and intricate human motion.
- Top-Down Approaches: These approaches begin by detecting individuals in all frames and then conduct single-person pose estimation on each frame based on image-based techniques. Numerous CNN-based approaches [16,26,27] have been introduced to address occlusion challenges in videos. Recently, transformer-based approaches have demonstrated superior performance on well-established keypoint estimation benchmarks [17,28].
2.2.2. Multi-Person Pose Tracking
- Bottom-Up Approaches: These approaches start by detecting human body parts and then group them to form individuals; the grouped parts are connected and associated across frames to generate complete poses. However, bottom-up approaches may struggle with body part association in occluded scenes. In [4,5], the multi-person pose tracking challenge is introduced, and a spatial graph is extended to a spatiotemporal graph based on bottom-up methods [29]. While [4] achieves plausible results in complex videos by solving a minimum-cost multicut problem, the handcrafted features in its probabilistic graphical model are not necessarily optimal for long video clips, and the resulting integer program is NP-hard and remains computationally infeasible in practice, even with state-of-the-art solvers.
- Top-Down Approaches: Similar to image-based top-down approaches, these methods face the challenge of reliably detecting individual persons, especially under significant occlusion [7,26]. To recover missed occluded joints, ref. [30] proposes a Graph Neural Network (GNN)-based network that predicts poses and aggregates them with the detected poses. The pose-prediction learning captures two types of human motion features through different types of edges: the relative motion and spatial structure of human joints within a frame, and the temporal human dynamics between consecutive frames. Additionally, ref. [18] proposes a gated attention transformer for multi-person pose tracking that considers both pose-based and appearance-based similarity; it automatically adjusts the impact of the two similarities on tracking accuracy and shows significant improvements under motion blur, crowding, and occlusion, especially when assigning heavily occluded persons in unusual poses. These algorithms primarily focus on improving tracking accuracy without addressing lightweight deployment. In contrast, the approach proposed in this paper facilitates seamless model deployment and achieves commendable performance in human keypoint detection and tracking while maintaining low algorithmic complexity.
3. Background
3.1. Upper Bounds of Analytical Framework
- Perfect detection bounding boxes: We use the ground truth bounding boxes as the predicted results of the detector and then estimate the body joint landmarks within these ground truth boxes. The predicted landmarks are matched using a bounding-box-overlap cost criterion. As illustrated in Table 1, we observe a notable 5.1% improvement in MOTA (57.8 → 62.9). Because the image-level detector does not fuse information from adjacent frames, its performance is insufficient when frames degenerate, and this influence cannot be overlooked.
- Perfect body joint landmarks: In this scenario, we first employ the detector to predict a series of bounding boxes. We then calculate the Intersection-over-Union (IoU) between the predicted boxes and the ground truth boxes and assign ground truth keypoints to the predicted boxes whose IoU exceeds 0.7 (this assignment step is sketched after this list). As depicted in Table 1, we achieve a significant 10.8% improvement in MOTA (57.8 → 68.6). This highlights the crucial role played by the quality of pose estimation in the evaluation of tracking performance.
- Perfect track ID: Similar to the approach described in (2), we obtain boxes (IoU > 0.7) and assign the ground truth track IDs to the final boxes. As outlined in Table 1, we observe a modest 1.8% improvement in MOTA (57.8 → 59.6). This suggests that a simple and commonly used greedy matching algorithm, based on the bounding box overlap cost criterion, is already approaching the upper bound performance.
- Perfect bounding boxes and body joint landmarks: Finally, we combine the perfect bounding boxes and body joint landmarks. As indicated in Table 1, this results in a substantial boost in MOTA (57.8 → 74.9). This emphasizes that the most critical challenge in the articulated pose tracking task for the top-down pipeline lies in establishing a robust and reliable person detector and pose estimator.
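The IoU-based assignment used in the "perfect landmarks" and "perfect track ID" settings can be sketched as follows. This is an assumption-based illustration: `box_iou` and `assign_gt_keypoints` are hypothetical helpers, and only the 0.7 threshold comes from the description above.

```python
def box_iou(a, b):
    # Intersection-over-Union for boxes given as (x1, y1, x2, y2).
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def assign_gt_keypoints(pred_boxes, gt_boxes, gt_keypoints, iou_thresh=0.7):
    """Oracle assignment from the upper-bound study: a predicted box inherits the
    keypoints of its best-overlapping ground-truth box if IoU > 0.7; otherwise it
    keeps its own prediction (represented by None here)."""
    assigned = []
    for pb in pred_boxes:
        if not gt_boxes:
            assigned.append(None)
            continue
        ious = [box_iou(pb, gb) for gb in gt_boxes]
        best = max(range(len(gt_boxes)), key=ious.__getitem__)
        assigned.append(gt_keypoints[best] if ious[best] > iou_thresh else None)
    return assigned
```

The "perfect track ID" setting follows the same pattern, substituting ground-truth track IDs rather than keypoints for the matched boxes.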
3.2. Motivation
4. Method
4.1. Evaluation and Datasets
4.2. Network Architecture
4.2.1. Feature Extraction Network
4.2.2. Feature Sampling
4.2.3. Feature Aggregation
4.2.4. Person Detection Network
4.3. Training
5. Experiments and Results
5.1. Multi-Person Pose Estimation and Tracking
5.1.1. Training
5.1.2. Inference
Algorithm 1 Inference algorithm of the spatiotemporal sampling module
Require:
  input: video frames, a specific sampling and aggregation range S
 1: for t = 0 to S do
 2:     …
 3: end for
 4: for t = 0 to ∞ do
 5:     for t + k = max(0, t − S) to t + S do
 6:         …
 7:         …
 8:     end for
 9:     …
10:     …
11:     …
12: end for
Ensure:
  output: person detection results
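A minimal Python sketch of how the inference loop of Algorithm 1 can be organized is given below, assuming hypothetical `extract_feat`, `aggregate`, and `detect` callables that stand in for the feature extraction, feature sampling/aggregation, and person detection sub-networks; none of these names are the paper's actual API.

```python
from collections import deque

def spatiotemporal_inference(frame_stream, S, extract_feat, aggregate, detect):
    """Sliding-window inference sketch: detect persons in frame `cur` from
    features aggregated over the window [max(0, cur - S), cur + S]."""
    feats = deque()                                   # (frame index, feature map) inside the window
    for t, frame in enumerate(frame_stream):
        feats.append((t, extract_feat(frame)))        # buffer per-frame backbone features
        cur = t - S                                   # frame whose window [cur-S, cur+S] is now complete
        if cur < 0:
            continue                                  # still pre-filling the first S frames
        while feats[0][0] < cur - S:                  # discard features outside the window
            feats.popleft()
        support = [f for _, f in feats]               # features of frames max(0, cur-S) .. cur+S
        center = cur - feats[0][0]                    # position of the current frame inside `support`
        aggregated = aggregate(support, center)       # adaptive-weighted feature aggregation
        yield cur, detect(aggregated)                 # person detection on the aggregated features
    # (the last S frames of a finite stream would need a flush pass, omitted here)
```

The deque keeps at most 2S + 1 per-frame feature maps in memory, consistent with extracting features frame by frame and holding only the current window before sampling and aggregation.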
5.1.3. Results on the PoseTrack Benchmark
5.2. Ablation Studies
5.2.1. Feature Sampling and Aggregation Module
- Method (a) represents the single-frame baseline, which achieves a 73.1 mAP for multi-person pose estimation and a 58.3 MOTA for multi-person pose tracking. This baseline already demonstrates competitiveness compared to the state-of-the-art results on the PoseTrack2017 validation dataset.
- Method (b) is a degraded version that excludes the feature sampling module and sets all adaptive weights to 1/2S (as described in Equation (2)). This results in a decrease in mAP to 71.4 and MOTA to 57.3, indicating the importance of motion information on the feature level for video object detection.
- Method (c) enhances (b) by incorporating adaptive weights calculated using Equation (2) (one possible form of this weighting is sketched after this list). This increases the mAP to 73.0 and MOTA to 58.1, surpassing (b) and highlighting the importance of the adaptive-weighted sub-network in the feature aggregation module.
- Method (d) is our proposed feature sampling and feature aggregation method, which adds the feature sampling module to (c). This further improves the mAP to 73.5 and MOTA to 62.8. These results demonstrate the effectiveness of our feature sampling and aggregation modules in leveraging motion information from adjacent frames to address challenges like motion blur and occlusion.
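Equation (2) is not reproduced here, so the following is only an assumed instantiation of the adaptive weighting: the current-frame and support-frame features are projected by the embedding sub-network, compared with cosine similarity, and normalized with a softmax over the support frames; setting every weight to 1/(2S) recovers the uniform variant of method (b). The function names and the use of globally pooled embedding vectors (rather than per-pixel weights) are simplifications.

```python
import numpy as np

def adaptive_weights(embed_cur, embed_support):
    """Assumed adaptive weighting: cosine similarity between the embedded
    current-frame feature and each embedded support-frame feature,
    normalized with a softmax over the support frames."""
    cur = embed_cur / (np.linalg.norm(embed_cur) + 1e-9)
    sims = np.array([float(cur @ (e / (np.linalg.norm(e) + 1e-9))) for e in embed_support])
    w = np.exp(sims - sims.max())            # numerically stable softmax
    return w / w.sum()

def aggregate_features(support_feats, weights):
    """Weighted sum of support-frame feature maps; uniform weights 1/(2S)
    correspond to the degraded method (b)."""
    return sum(w * f for w, f in zip(weights, support_feats))
```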
5.2.2. The Design Choices of Embedding Network
- We explore different structures for the embedding sub-network and assess their impact on performance. We use four different structures, and the results are presented in Table 8.
- The findings indicate that using a fully convolutional network to project features into a new high-dimensional embedding space for similarity measurement does not significantly impact performance.
- Based on these results, we select Design (C) as our embedding sub-network, as it offers the best performance with minimal computational requirements (a sketch of this design is given below).
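Reading the conv{k}-{c} entries in Table 8 as k×k convolutions with c output channels, Design (C) could be written as the PyTorch sketch below; the 2048-channel input (e.g., a ResNet C5 feature map) and the ReLU activations are assumptions, since the table only lists the convolution layers.

```python
import torch.nn as nn

def embedding_design_c(in_channels=2048):
    """Design (C) from Table 8 under the assumed conv{k}-{c} notation:
    conv1-512 -> conv3-512 -> conv3-1024 -> conv1-2048."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 512, kernel_size=1), nn.ReLU(inplace=True),
        nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(512, 1024, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(1024, 2048, kernel_size=1),
    )
```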
5.2.3. The Number of Supporting Frames
- Due to GPU memory limitations, we extract the features of each image individually and store them in memory before feeding them to the sampling and aggregation modules.
- We experiment with different numbers of supporting frames (5, 7, 9, 11, 13, and 15) during inference and 2 or 4 supporting frames per mini-batch during training.
- The results in Table 9 indicate that using more supporting frames during training does not lead to higher accuracy, and the improvement saturates at 13 frames during the inference stage.
- Consequently, we default to sampling 2 supporting frames during training and aggregating features from 13 supporting frames during inference.
5.3. Qualitative Results
6. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Zhou, L.; Meng, X.; Liu, Z.; Wu, M.; Gao, Z.; Wang, P. Human Pose-based Estimation, Tracking and Action Recognition with Deep Learning: A Survey. arXiv 2023, arXiv:2310.13039.
- Doering, A.; Chen, D.; Zhang, S.; Schiele, B.; Gall, J. PoseTrack21: A Dataset for Person Search, Multi-Object Tracking and Multi-Person Pose Tracking. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 20931–20940.
- Chen, H.; Feng, R.; Wu, S.; Xu, H.; Zhou, F.; Liu, Z. 2D Human pose estimation: A survey. Multimed. Syst. 2023, 29, 3115–3138.
- Insafutdinov, E.; Andriluka, M.; Pishchulin, L.; Tang, S.; Levinkov, E.; Andres, B.; Schiele, B. ArtTrack: Articulated multi-person tracking in the wild. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1293–1301.
- Andriluka, M.; Iqbal, U.; Insafutdinov, E.; Pishchulin, L.; Milan, A.; Gall, J.; Schiele, B. PoseTrack: A Benchmark for Human Pose Estimation and Tracking. In Proceedings of the 31st Meeting of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5167–5176.
- Zhou, M.; Stoffl, L.; Mathis, M.W.; Mathis, A. Rethinking pose estimation in crowds: Overcoming the detection information bottleneck and ambiguity. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 14643–14653.
- Girdhar, R.; Gkioxari, G.; Torresani, L.; Paluri, M.; Tran, D. Detect-and-Track: Efficient Pose Estimation in Videos. In Proceedings of the 31st Meeting of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, Salt Lake City, UT, USA, 18–22 June 2018; pp. 350–359.
- Li, J.; Wang, Y.; Zhang, S. PolarPose: Single-Stage Multi-Person Pose Estimation in Polar Coordinates. IEEE Trans. Image Process. 2023, 32, 1108–1119.
- Cheng, Y.; Ai, Y.; Wang, B.; Wang, X.; Tan, R.T. Bottom-up 2D pose estimation via dual anatomical centers for small-scale persons. Pattern Recognit. 2023, 139, 109403.
- Qiu, L.; Zhang, X.; Li, Y.; Li, G.; Wu, X.; Xiong, Z.; Han, X.; Cui, S. Peeking into Occluded Joints: A Novel Framework for Crowd Pose Estimation. In Proceedings of the 16th European Conference on Computer Vision, ECCV 2020, Glasgow, UK, 23–28 August 2020; Springer Science and Business Media Deutschland GmbH: Cham, Switzerland, 2020; Volume 12364 LNCS, pp. 488–504.
- Xu, Y.; Zhang, J.; Zhang, Q.; Tao, D. ViTPose++: Vision Transformer for Generic Body Pose Estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 1212–1230.
- Maji, D.; Nagori, S.; Mathew, M.; Poddar, D. YOLO-Pose: Enhancing YOLO for Multi Person Pose Estimation Using Object Keypoint Similarity Loss. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2022, New Orleans, LA, USA, 19–20 June 2022; pp. 2636–2645.
- Jiang, T.; Lu, P.; Zhang, L.; Ma, N.; Han, R.; Lyu, C.; Li, Y.; Chen, K. RTMPose: Real-Time Multi-Person Pose Estimation based on MMPose. arXiv 2023, arXiv:2303.07399.
- Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.-E.; Sheikh, Y. OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 43, 172–186.
- Zeng, A.; Ju, X.; Yang, L.; Gao, R.; Zhu, X.; Dai, B.; Xu, Q. DeciWatch: A Simple Baseline for 10× Efficient 2D and 3D Pose Estimation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 607–624.
- Xiu, Y.; Li, J.; Wang, H.; Fang, Y.; Lu, C. Pose Flow: Efficient Online Pose Tracking. In Proceedings of the British Machine Vision Conference, Newcastle, UK, 3–6 September 2018.
- Yuan, Y.; Fu, R.; Huang, L.; Lin, W.; Zhang, C.; Chen, X.; Wang, J. HRFormer: High-Resolution Transformer for Dense Prediction. arXiv 2021, arXiv:2110.09408.
- Doering, A.; Gall, J. A Gated Attention Transformer for Multi-Person Pose Tracking. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW) 2023, Paris, France, 2–6 October 2023; pp. 3181–3190.
- Mao, W.; Tian, Z.; Wang, X.; Shen, C. FCPose: Fully Convolutional Multi-Person Pose Estimation with Dynamic Instance-Aware Convolutions. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2021, Nashville, TN, USA, 19–25 June 2021; pp. 9030–9039.
- Miao, H.; Lin, J.; Cao, J.; He, X.; Su, Z.; Liu, R. SMPR: Single-stage multi-person pose regression. Pattern Recognit. 2023, 143, 109743.
- Yang, J.; Zeng, A.; Liu, S.; Li, F.; Zhang, R.; Zhang, L. Explicit Box Detection Unifies End-to-End Multi-Person Pose Estimation. arXiv 2023, arXiv:2302.01593.
- Shi, D.; Wei, X.; Yu, X.; Tan, W.; Ren, Y.; Pu, S. InsPose: Instance-Aware Networks for Single-Stage Multi-Person Pose Estimation. In Proceedings of the 29th ACM International Conference on Multimedia, MM 2021, Virtual, 20–24 October 2021; Association for Computing Machinery, Inc.: New York, NY, USA; pp. 3079–3087.
- Liu, H.; Chen, Q.; Tan, Z.; Liu, J.-J.; Wang, J.; Su, X.; Li, X.; Yao, K.; Han, J.; Ding, E.; et al. Group Pose: A Simple Baseline for End-to-End Multi-person Pose Estimation. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV) 2023, Paris, France, 1–6 October 2023; pp. 14983–14992.
- Jin, S.; Liu, W.; Ouyang, W.; Qian, C. Multi-Person Articulated Tracking with Spatial and Temporal Embeddings. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, 16–20 June 2019; pp. 5657–5666.
- Geng, Z.; Sun, K.; Xiao, B.; Zhang, Z.; Wang, J. Bottom-up human pose estimation via disentangled keypoint regression. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2021, Nashville, TN, USA, 19–25 June 2021; pp. 14671–14681.
- Xiao, B.; Wu, H.; Wei, Y. Simple Baselines for Human Pose Estimation and Tracking. arXiv 2018, arXiv:1804.06208.
- Wang, M.; Tighe, J.; Modolo, D. Combining Detection and Tracking for Human Pose Estimation in Videos. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2020, Seattle, WA, USA, 13–19 June 2020; pp. 11085–11093.
- Xu, Y.; Zhang, J.; Zhang, Q.; Tao, D. ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation. arXiv 2022, arXiv:2204.12484.
- Insafutdinov, E.; Pishchulin, L.; Andres, B.; Andriluka, M.; Schiele, B. DeeperCut: A Deeper, Stronger, and Faster Multi-person Pose Estimation Model. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016.
- Yang, Y.; Ren, Z.; Li, H.; Zhou, C.; Wang, X.; Hua, G. Learning Dynamics via Graph Neural Networks for Human Pose Estimation and Tracking. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021, Nashville, TN, USA, 20–25 June 2021; pp. 8070–8080.
- Milan, A.; Leal-Taixé, L.; Reid, I.D.; Roth, S.; Schindler, K. MOT16: A Benchmark for Multi-Object Tracking. arXiv 2016, arXiv:1603.00831.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object Detection via Region-based Fully Convolutional Networks. In Proceedings of the Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 379–387.
- Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV) 2017, Venice, Italy, 22–29 October 2017; pp. 764–773.
- Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.S.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2014, 115, 211–252.
- Rush, A.M.; Chopra, S.; Weston, J. A Neural Attention Model for Abstractive Sentence Summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015.
- Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into High Quality Object Detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162.
- Lin, T.-Y.; Maire, M.; Belongie, S.J.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014.
- Ruder, S. An overview of gradient descent optimization algorithms. arXiv 2016, arXiv:1609.04747.
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980.
- Pishchulin, L.; Insafutdinov, E.; Tang, S.; Andres, B.; Andriluka, M.; Gehler, P.; Schiele, B. DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4929–4937.
- Lu, P.; Jiang, T.; Li, Y.; Li, X.; Chen, K.; Yang, W. RTMO: Towards High-Performance One-Stage Real-Time Multi-Person Pose Estimation. arXiv 2023, arXiv:2312.07526.
- Li, Y.; Yang, S.; Liu, P.; Zhang, S.; Wang, Y.; Wang, Z.; Yang, W.; Xia, S.T. SimCC: A Simple Coordinate Classification Perspective for Human Pose Estimation. In Proceedings of the 17th European Conference on Computer Vision, ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; Springer Science and Business Media Deutschland GmbH: Berlin, Germany, 2022; Volume 13666 LNCS, pp. 89–106.
- Gu, K.; Yang, L.; Yao, A. Removing the Bias of Integral Pose Regression. In Proceedings of the 18th IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, 11–17 October 2021; pp. 11047–11056.
Metric | Ours (baseline) | Perfect ID | Perfect boxes | Perfect landmarks | Perfect boxes + landmarks |
---|---|---|---|---|---|
MOTA | 57.8 | 59.6 | 62.9 | 68.6 | 74.9 |
Method | Dataset | AP | AP (50) | AP (75) | AR | AR (50) | Latency (ms) |
---|---|---|---|---|---|---|---|
RTMO-s [42] | COCO | 67.7 | 87.8 | 73.7 | 71.5 | 90.8 | 18.5 |
RTMO-m [42] | COCO | 70.9 | 89.1 | 77.8 | 74.7 | 92.1 | 22.6 |
RTMO-l [42] | COCO | 72.4 | 89.5 | 78.8 | 76.2 | 92.6 | 37.5 |
RTMO-t [42] | mix | 56.7 | 80.1 | 61.2 | 61.1 | 83.6 | 14.2 |
RTMO-s [42] | mix | 68.7 | 87.8 | 74.6 | 72.5 | 90.6 | 18.5 |
RTMO-m [42] | mix | 72.4 | 89.5 | 78.9 | 76.4 | 92.9 | 22.6 |
RTMO-l [42] | mix | 74.8 | 91.7 | 81.9 | 78.5 | 92.8 | 37.5 |
YOLO-POSE-t [12] | COCO | 51.9 | 78.3 | 55.6 | 57.1 | 83.5 | 32.2 |
YOLO-POSE-s [12] | COCO | 64.4 | 87.2 | 71.2 | 68.1 | 90.5 | 45.6 |
YOLO-POSE-m [12] | COCO | 69.5 | 89.7 | 76.4 | 73.7 | 92.6 | 51.3 |
YOLO-POSE-l [12] | COCO | 71.3 | 90.8 | 78.4 | 74.9 | 92.8 | 60.9 |
debias-ipr-resnet-50 [43] | COCO | 67.5 | 87.2 | 71.4 | 76.5 | 91.4 | 89.5 |
Simcc-s-vipnas-mobilenetv3 [44] | COCO | 69.8 | 88.2 | 72.7 | 75.6 | 92.7 | 39.5 |
Ours: ResNet-50 | COCO | 76.7 | 92.5 | 83.2 | 79.9 | 94.1 | 59.2 |
Ours: ResNet-50 | mix | 78.5 | 93.1 | 84.9 | 80.2 | 95.1 | 59.2 |
Method | Dataset | Head mAP | Shou mAP | Elb mAP | Wri mAP | Hip mAP | Knee mAP | Ankl mAP | Total mAP |
---|---|---|---|---|---|---|---|---|---|
Girdhar et al. [7] | validation | 67.5 | 70.2 | 62.0 | 51.7 | 60.7 | 58.7 | 49.8 | 60.6 |
Xiu et al. [16] | validation | 66.7 | 73.3 | 68.3 | 61.1 | 67.5 | 67.0 | 61.3 | 66.5 |
Xiao et al. [26]: ResNet-50 | validation | 79.1 | 80.5 | 75.5 | 66.0 | 70.8 | 70.0 | 61.7 | 72.4 |
Xiao et al. [26]: ResNet-152 | validation | 81.7 | 83.4 | 80.0 | 72.4 | 75.3 | 74.8 | 67.1 | 76.7 |
Ours: ResNet-50 | validation | 78.7 | 80.8 | 76.8 | 68.5 | 70.6 | 70.6 | 62.8 | 73.1 |
Ours: ResNet-50 | validation | 77.4 | 79.5 | 76.9 | 70.4 | 72.9 | 73.4 | 65.7 | 73.9 |
Ours: ResNet-152 | validation | 81.1 | 83.7 | 79.9 | 72.5 | 75.8 | 75.6 | 67.6 | 77.0 |
Xiu et al. [16] | testing | 64.9 | 67.5 | 65.0 | 59.0 | 62.5 | 62.8 | 57.9 | 63.0 |
Xiao et al. [26]: ResNet-50 | testing | 76.4 | 77.2 | 72.2 | 65.1 | 68.5 | 66.9 | 60.3 | 70.0 |
Xiao et al. [26]: ResNet-152 | testing | 79.5 | 79.7 | 76.4 | 70.7 | 71.6 | 71.3 | 64.9 | 73.9 |
Ours: ResNet-152 | testing | 79.8 | 80.0 | 82.0 | 76.6 | 71.7 | 78.0 | 65.6 | 74.2 |
Method | Dataset | MOTA Head | MOTA Shou | MOTA Elb | MOTA Wri | MOTA Hip | MOTA Knee | MOTA Ankl | MOTA Total |
---|---|---|---|---|---|---|---|---|---|
Girdhar et al. [7] | validation | 61.7 | 65.5 | 57.3 | 45.7 | 54.3 | 53.1 | 45.7 | 55.2 |
Xiu et al. [16] | validation | 59.8 | 67.0 | 59.8 | 51.6 | 60.0 | 58.4 | 50.5 | 58.3 |
Xiao et al. [26]: ResNet-50 | validation | 72.1 | 74.0 | 61.2 | 53.4 | 62.4 | 61.6 | 50.7 | 62.9 |
Xiao et al. [26]: ResNet-152 | validation | 73.9 | 75.9 | 63.7 | 56.1 | 65.5 | 65.1 | 53.5 | 65.4 |
Ours: ResNet-50 | validation | 71.1 | 73.6 | 62.4 | 54.0 | 62.5 | 60.1 | 48.9 | 62.8 |
Ours: ResNet-50 | validation | 72.8 | 74.4 | 65.6 | 55.2 | 63.9 | 63.6 | 52.8 | 63.9 |
Ours: ResNet-152 | validation | 73.3 | 74.2 | 65.2 | 57.9 | 64.8 | 68.8 | 54.7 | 65.8 |
Xiu et al. [16] | testing | 52.0 | 57.4 | 52.8 | 46.6 | 51.0 | 51.2 | 45.3 | 51.0 |
Xiao et al. [26]: ResNet-50 | testing | 65.9 | 67.0 | 51.5 | 48.0 | 56.2 | 54.6 | 46.9 | 56.4 |
Xiao et al. [26]: ResNet-152 | testing | 67.1 | 68.4 | 52.2 | 48.9 | 56.1 | 56.6 | 48.8 | 57.6 |
Ours: ResNet-152 | testing | 66.6 | 68.1 | 56.1 | 50.2 | 57.3 | 55.6 | 43.7 | 57.3 |
Detector | Backbone | Sampling? | mAP | MOTA |
---|---|---|---|---|
R-FCN | ResNet-50 | × | 67.2 | 57.8 |
R-FCN | ResNet-50 | √ | 71.1 | 61.8 |
Cascade R-CNN | ResNet-101 | × | 73.1 | 58.3 |
Cascade R-CNN | ResNet-101 | √ | 73.5 | 62.8 |
Faster R-CNN | ResNet-50 | × | 69.8 | 58.9 |
Faster R-CNN | ResNet-50 | √ | 72.7 | 60.4 |
Faster R-CNN | ResNet-50 DCN + FPN | × | 71.8 | 60.3 |
Faster R-CNN | ResNet-50 DCN + FPN | √ | 73.7 | 63.1 |
Latency (batch size) | 2 | 4 | 8 | 16 |
---|---|---|---|---|
ST-baseline | 1× | 1× | 1× | 1× |
ST (ours) | 1.46× | 1.39× | 1.28× | 1.17× |
Lite-Pose | 1.01× | 0.98× | 0.96× | 0.95× |
VoxelTrack | 1.24× | 1.19× | 1.15× | 1.09× |
AOP | 1.19× | 1.13× | 1.11× | 1.05× |
Location-free | 1.12× | 1.08× | 1.05× | 1.01× |
Location-Global | 1.15× | 1.13× | 1.11× | 1.07× |
Method | (a) | (b) | (c) | (d) |
---|---|---|---|---|
Nfeat | ResNet-101 | ResNet-101 | ResNet-101 | ResNet-101 |
Ndet | Cascade R-CNN | Cascade R-CNN | Cascade R-CNN | Cascade R-CNN |
multi-frame feature aggregation? | | √ | √ | √ |
adaptive weights? | | | √ | √ |
spatiotemporal sampling? | | | | √ |
mAP | 73.1 | 71.4 | 73.0 | 73.5 |
MOTA | 58.3 | 57.3 | 58.1 | 62.8 |
Setting | Design (A) | Design (B) | Design (C) | Design (D) |
---|---|---|---|---|
Layer # 1 | conv1-512 | conv1-512 | conv1-512 | conv1-512 |
Layer # 2 | conv3-512 | conv3-512 | conv3-512 | conv3-512 |
Layer # 3 | conv1-2048 | conv1-4096 | conv3-1024 | conv3-1024 |
Layer # 4 | - | - | conv1-2048 | conv3-1024 |
Layer # 5 | - | - | - | conv1-2048 |
mAP | 72.9 | 73.0 | 73.1 | 73.1 |
MOTA | 58.2 | 58.2 | 58.3 | 58.3 |
# training frames | 2 | 2 | 2 | 2 | 2 | 4 | 4 | 4 | 4 | 4 |
---|---|---|---|---|---|---|---|---|---|---|
# inference frames | 7 | 9 | 11 | 13 | 15 | 7 | 9 | 11 | 13 | 15 |
mAP | 72.0 | 72.5 | 72.8 | 73.1 | 73.1 | 72.1 | 72.5 | 72.9 | 73.1 | 73.1 |
MOTA | 57.8 | 58.1 | 58.2 | 58.3 | 58.3 | 57.9 | 58.1 | 58.2 | 58.3 | 58.3 |