Review

An In-Depth Analysis of 2D and 3D Pose Estimation Techniques in Deep Learning: Methodologies and Advances

1 Institute of Martial Arts, Harbin Sport University, Harbin 150008, China
2 Heilongjiang Province Key Laboratory of Laser Spectroscopy Technology and Application, Harbin University of Science and Technology, Harbin 150080, China
3 Institute of Kinesiology and Health, Harbin Sport University, Harbin 150008, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(7), 1307; https://doi.org/10.3390/electronics14071307
Submission received: 8 February 2025 / Revised: 4 March 2025 / Accepted: 25 March 2025 / Published: 26 March 2025
(This article belongs to the Special Issue New Insights in 2D and 3D Object Detection and Semantic Segmentation)

Abstract

Pose estimation (PE) is a cutting-edge technology in computer vision, essential for AI-driven sport analysis, advancing technological applications, enhancing security, and improving the quality of life. Deep learning has markedly advanced accuracy and efficiency in the field while propelling algorithmic frameworks and model architectures to greater complexity, yet rendering their underlying interrelations increasingly opaque. This review examines deep learning-based PE techniques, classifying them from two perspectives: two-dimensional (2D) and three-dimensional (3D), based on methodological principles and output formats. Within each category, advanced techniques for single-person, multi-person, and video-based PE are explored according to their applicable conditions, highlighting key differences and intrinsic connections while comparing performance metrics. We also analyze datasets across 2D, 3D, and video domains, with comparisons presented in tables. The practical applications of PE in daily life are also summarized alongside an exploration of the challenges facing the field and the proposal of innovative, forward-looking research directions. This review aims to be a valuable resource for researchers advancing deep learning-driven PE.

1. Introduction

PE represents a pivotal research direction in computer vision, concentrating on the identification of keypoints (e.g., joint positions) in images or videos to enable the analysis and comprehension of human posture and movement, serving as an indispensable technology in sport analytics [1,2], action recognition [3], video processing [4,5], gaming [6,7], and beyond. For instance, advanced judgment support systems in sport analytics rely on PE as a core technology to capture precise 3D poses, facilitating nuanced comparisons and accurate scoring. In entertainment, the cornerstone lies in a machine’s ability to precisely interpret human body postures, enabling sophisticated imitation and interactive responses. Early technologies relied predominantly on traditional approaches, which involved manual feature design and structural modeling of the human body, followed by optimization-based fitting. These methods were not only labor intensive but also exhibited limited performance. With the advent of large-scale datasets, ample data and labeled information became available, prompting a shift toward deep learning approaches. Advances in hardware have further accelerated the development and computation of deep networks, yielding significant improvements in PE that surpass traditional methods. Notably, deep learning-based PE techniques also excel in related tasks such as object detection [8], instance segmentation [9], and human tracking [10]. In recent years, publications on human pose estimation have surged, with a pronounced increase in the past two years, as depicted in Figure 1. This trend underscores its escalating prominence as a research focal point and the pressing societal demand for PE technology, spurring continuous methodological innovation and marked gains in accuracy. Nonetheless, challenges persist due to complex external environments, diverse poses, unusual clothing, and large-scale action scenes. PE, as a challenging yet impactful field, requires further innovation to address these complexities and advance the state of the art.
Early work in PE concentrated on 2D information, representing human actions and postures by detecting the 2D coordinates of human joints (e.g., head, shoulders, elbows, and knees) in images or video frames. The emergence of convolutional neural networks (CNNs), particularly after AlexNet’s groundbreaking performance in the ImageNet competition [11], propelled PE forward. DeepPose [12] introduced the first deep neural network for end-to-end PE, employing a multi-stage regression approach to iteratively refine joint positions and initiating a surge of deep learning applications in the field. The development of multi-modal data has expanded PE into the 3D domain [13], where the objective is to reconstruct human joint positions in 3D space from 2D images or video. While 2D PE emphasizes accurate joint detection (robust to occlusions, multiple people, and complex backgrounds), 3D PE confronts the additional challenge of recovering depth information to infer 3D poses, particularly in single-view or limited-label scenarios. Furthermore, 3D PE must address issues like depth ambiguity, occlusions, and complex joint motions while balancing real-time processing requirements and model efficiency. This review provides a structured overview of these two dimensions, analyzing the advantages, limitations, and performance of various network architectures while providing a clear comparison of 2D and 3D PE approaches, highlighting their conceptual and technical differences. The framework of this review is presented in Figure 2.
Several related reviews and surveys, including [14,15,16] and others, focus on deep learning for human action analysis, with some only briefly addressing PE. Although works like [17,18] target PE, they primarily examine traditional approaches with limited discussion of deep learning methods. Studies such as [19,20] explore deep learning but are limited to 2D PE, while [21,22,23] concentrate solely on 3D PE, and [24] provides an overview of monocular human pose estimation. Despite thorough discussions of their respective topics, these studies lack an integrated perspective on the relationships and distinctions between 2D and 3D methodologies, generally focusing on isolated aspects of human pose estimation. Additionally, although [25] summarizes notable 2D and 3D PE methods, it lacks analysis from a 2D-to-3D conversion viewpoint. Therefore, an updated and comprehensive review is urgently needed. This review conducts a rigorous analysis of PE, addressing key research gaps. Unlike previous studies, it covers both 2D and 3D PE, examining methods for single-person, multi-person, and video-based PE. The manuscript explores the nuanced relationships between 2D and 3D approaches, the interdependence of single-person and multi-person methods, and the crucial role of multi-view techniques. Additionally, this review systematically classifies PE applications, providing a clear and structured overview. This comprehensive and systematic analysis lends the review greater depth and breadth. The distinctive contributions of this review are outlined as follows:
  • A comprehensive review of PE techniques, encompassing both 2D and 3D estimation, including single-view, multi-view, and dynamic video-based PE.
  • An exhaustive summary of 2D and 3D datasets and evaluation metrics, offering detailed performance comparisons of prominent algorithms across standard datasets.
  • An overview of practical applications in motion analysis, sports and training, human–computer interaction, entertainment and the arts, and health monitoring and rehabilitation.
  • An in-depth analysis of various algorithms, examining their internal structures and application domains. This analysis highlights key challenges in PE and explores potential future developments based on practical application demands.
In the subsequent sections, human pose estimation techniques are systematically classified into 2D and 3D domains, with additional subcategories that delineate their respective effects and intended applications. Algorithm performance is evaluated through datasets and key performance metrics. Section 2 presents 2D human PE, beginning with 2D single-person PE techniques based on regression and heatmaps, followed by multi-person 2D PE and then 2D PE in video contexts. Section 3 explores 3D PE, covering single-person, single-view multi-person, multi-view multi-person, and video-based scenarios. The section concludes with a comprehensive synthesis of 3D PE techniques. Section 4 provides an overview of evaluation metrics, datasets, and comparative performance metrics of these algorithms. Section 5 addresses practical applications. Section 6 concludes the paper, analyzes key challenges, and outlines potential future directions.

2. Two-Dimensional Pose Estimation

Two-dimensional PE seeks to detect and localize keypoints (e.g., head, shoulders, knees) on the human body within input images or videos, deriving the global 2D coordinates of each joint to analyze the subject’s pose and structure. Early 2D PE methods relied on probabilistic models and hand-crafted features, optimized through resource-intensive algorithms with limited efficiency and robustness. The emergence of deep learning has revolutionized human pose estimation, enabling remarkable performance improvements. This section reviews deep learning-based techniques for 2D single-person and multi-person PE in images and videos.

2.1. Two-Dimensional Single-Person Pose Estimation

Two-dimensional single-person PE focuses on localizing human joints in images containing a single subject, as shown in Figure 3. In deep learning, Toshev et al. [12] introduce a cascaded structure based on deep neural networks (DNNs) that directly recovers keypoint locations from images. Despite employing the same network in the cascade structure, the model captures feature information progressively, layer by layer, marking a pioneering use of DNNs in PE. Figure 3 illustrates two core single-person PE methods: (a) regression-based and (b) heatmap-based approaches.

2.1.1. Regression-Based Single-Person Pose Estimation

The regression approach introduced by Toshev et al. [12] is straightforward, easy to implement, and has thus garnered significant scholarly interest. However, since these methods directly infer pose information upon image input, error control emerges as a critical challenge. To address this, Xiao Sun et al. [26] enhance the regression method by proposing human pose integral regression, which integrates well with heatmap-based techniques and utilizes ResNet50 [27] as the backbone network. Recognizing issues in these regression methods, such as network degradation, slow convergence, and gradient loss, Gu et al. [28] and Su et al. [29] further optimize the approach with integral compensation techniques. Specifically, Gu et al. [28] implement bias compensation, while Su et al. [29] propose a regression correction module that integrates an additional one-dimensional bias to correct non-global integral regressions. This module can seamlessly embed into end-to-end networks, improving keypoint localization accuracy. Recently, Pranjal Kumar et al. [30] analyze the general regression problem and introduce an optimized standard for backbone networks and loss functions. Using DenseNet [31] combined with a KLD module, they achieve superior results, improving performance by 1.1% over the ResNet50 + KLD configuration [27]. This advancement offers a promising new direction for further research in the field.
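To make the integral idea concrete, the following minimal sketch computes joint coordinates as the expectation over a softmax-normalized heatmap, keeping the output differentiable so it can be trained with a coordinate regression loss. The tensor shapes and example input are illustrative assumptions, not the exact configuration of [26]:

```python
# A minimal sketch of integral (soft-argmax) regression over predicted
# heatmaps, in the spirit of integral pose regression [26].
import torch

def soft_argmax_2d(heatmaps: torch.Tensor) -> torch.Tensor:
    """heatmaps: (B, K, H, W) raw network scores -> (B, K, 2) (x, y) coords."""
    b, k, h, w = heatmaps.shape
    probs = torch.softmax(heatmaps.reshape(b, k, -1), dim=-1).reshape(b, k, h, w)
    xs = torch.arange(w, dtype=probs.dtype)      # pixel grid along x
    ys = torch.arange(h, dtype=probs.dtype)      # pixel grid along y
    x = (probs.sum(dim=2) * xs).sum(dim=-1)      # expectation over columns
    y = (probs.sum(dim=3) * ys).sum(dim=-1)      # expectation over rows
    return torch.stack([x, y], dim=-1)           # differentiable coordinates

# The expected coordinates are differentiable, so an L1 loss against
# ground-truth joint locations can train the network end to end.
coords = soft_argmax_2d(torch.randn(2, 17, 64, 48))   # (2, 17, 2)
```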

2.1.2. Heatmap-Based Single-Person Pose Estimation

Heatmap-based methods generate a heatmap for each joint, with pixel values representing the probability of the joint’s location, as shown in Figure 3b. These pixel values are modeled by a 2D Gaussian kernel [32] centered on the joint’s actual location. The peak of this Gaussian distribution aligns with the true joint position, while the standard deviation controls the spread, indicating the prediction’s uncertainty. Although more complex than regression-based methods, heatmap-based approaches better capture input information, significantly improving estimation accuracy.
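As a brief illustration of this target representation, the sketch below builds one joint’s Gaussian training target; the heatmap resolution and standard deviation are illustrative assumptions:

```python
# A minimal sketch of the 2D Gaussian training target described above.
import numpy as np

def gaussian_heatmap(h, w, cx, cy, sigma=2.0):
    """Heatmap of shape (h, w) peaking at the joint location (cx, cy)."""
    ys, xs = np.mgrid[0:h, 0:w]
    d2 = (xs - cx) ** 2 + (ys - cy) ** 2       # squared pixel distance
    return np.exp(-d2 / (2.0 * sigma ** 2))    # peak value 1 at the joint

target = gaussian_heatmap(64, 48, cx=20.5, cy=33.0)   # one joint's target
```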
PE has progressed from early CNNs to more advanced and efficient frameworks, such as the stacked hourglass (SHG) network, generative adversarial networks (GANs), and ResNet. Newell et al. [33] introduce the stacked hourglass network, which iteratively pools and up-samples to generate intermediate heatmaps, enabling progressive supervision. By linking multiple hourglass modules end-to-end, the network achieves multi-scale reasoning, enabling refined predictions across stages. This distinctive architecture has inspired various adaptations. Zou et al. [34] incorporate global and local content-aware features to enhance the hourglass network, while Dong et al. [35] add global and local attention modules to better handle background and fine details, improving joint detection. Kamel et al. [36] integrate pose refinement and correction networks, fusing heatmaps horizontally and vertically to reduce false detections. However, the hourglass structure’s complexity requires intermediate supervision at each stage, leading to substantial parameter growth and feature stacking at higher levels. This intensive computation has made lightweight network design an ongoing challenge. Kim [37] addresses this by incorporating multi-dilated light residual blocks within the hourglass framework, expanding the receptive field while reducing redundancy. Qin et al. [38] replace the residual blocks with Res2Net_depth blocks, integrating up-sampling and attention mechanisms to improve portability. By substituting the multi-branch pyramid residual module [39], they further enhance CNN scale invariance, creating a more efficient architecture for PE.
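The following compact sketch illustrates the recursive pool–process–upsample–add pattern of a single hourglass module in the spirit of [33]. Channel counts, depth, and block design are simplified assumptions; the full network stacks several such modules with intermediate heatmap supervision after each one:

```python
# An illustrative, compact hourglass module: pool down, recurse to lower
# scales, up-sample, and add the full-resolution skip branch.
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(c):
    return nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c), nn.ReLU())

class Hourglass(nn.Module):
    def __init__(self, depth: int, channels: int):
        super().__init__()
        self.skip = conv_block(channels)               # full-resolution branch
        self.down = conv_block(channels)               # processed after pooling
        self.inner = (Hourglass(depth - 1, channels)   # recurse to lower scales
                      if depth > 1 else conv_block(channels))
        self.up = conv_block(channels)

    def forward(self, x):
        skip = self.skip(x)
        y = self.down(F.max_pool2d(x, 2))
        y = self.up(self.inner(y))
        y = F.interpolate(y, scale_factor=2, mode="nearest")
        return skip + y                                # multi-scale fusion

# A 1x1 head turns fused features into per-joint heatmaps.
heatmaps = nn.Conv2d(64, 17, 1)(Hourglass(4, 64)(torch.randn(1, 64, 64, 64)))
```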
The advent of GANs [40] has significantly improved network output quality and realism through adversarial interactions between generator and discriminator models. This architecture enhances authenticity and effectively handles partial occlusions by inferring missing details and assigning appropriate confidence levels. Chen et al. [41] design a dual-discriminator GAN integrated with an hourglass network to create a pose-aware model that distinguishes compliant from non-compliant poses. Similarly, Zhu et al. [42] combine GANs with heatmap generation and hard joint mining to adaptively locate challenging joints, improving robustness against occlusions and overlapping limbs. Additionally, Tian et al. [43] incorporate a graph convolutional structure within a GAN framework, leveraging skeletal relationships and cascade features to iteratively refine pose predictions. This graph-based approach captures complex joint dependencies more effectively while maintaining computational efficiency. Expanding beyond general PE, GANs have also been tailored to estimate specific body parts for specialized applications. For example, Sahar Rahimi et al. [44] employ a Multi-Stage Generative Adversarial Network (MSGAN) to recover head poses in low-resolution facial images, enhancing accuracy with pose-aware adversarial loss and feedback. Overall, GANs effectively mitigate occlusions, missing data, and low image quality, demonstrating their value in advanced PE tasks.
By learning residual mappings instead of direct input-to-output mappings, ResNet effectively addresses vanishing gradient issues in deep networks. Its cross-layer connectivity allows access to raw feature information, making it a preferred backbone for PE frameworks. Many baseline models leverage ResNet with targeted architectural modifications to enhance performance. For instance, Xiao et al. [45] develop a streamlined approach by estimating heatmaps within feature maps, employing ResNet-50 with added deconvolution layers to predict final poses, demonstrating ResNet’s flexibility as a backbone for integration with other networks. Chen et al. [46] further extend ResNet by combining it with PointNet for head PE, leveraging multi-modal information for enhanced feature classification and inference, thereby increasing PE accuracy and robustness. To address challenges in multi-scale feature fusion, Wang et al. [47] propose an attention refinement network that enhances multi-scale information integration, yielding a more refined approach to semantic comprehension in contexts involving overlapping or inter-layer feature fusion.
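A hedged sketch of the simple-baseline recipe of [45] is shown below: a ResNet-50 trunk followed by deconvolution layers and a 1x1 heatmap head. The layer sizes follow the commonly reported three-deconvolution setup but should be read as assumptions here:

```python
# A sketch of a ResNet-50 trunk with a deconvolutional heatmap head,
# in the spirit of the simple baseline [45].
import torch
import torch.nn as nn
from torchvision.models import resnet50

class SimpleBaseline(nn.Module):
    def __init__(self, num_joints: int = 17):
        super().__init__()
        trunk = resnet50(weights=None)
        self.backbone = nn.Sequential(*list(trunk.children())[:-2])  # drop pool/fc
        layers, in_c = [], 2048
        for _ in range(3):                     # 3 x (deconv + BN + ReLU)
            layers += [nn.ConvTranspose2d(in_c, 256, 4, stride=2, padding=1),
                       nn.BatchNorm2d(256), nn.ReLU()]
            in_c = 256
        self.deconv = nn.Sequential(*layers)
        self.head = nn.Conv2d(256, num_joints, kernel_size=1)

    def forward(self, x):                      # (B, 3, 256, 192) -> (B, K, 64, 48)
        return self.head(self.deconv(self.backbone(x)))

out = SimpleBaseline()(torch.randn(1, 3, 256, 192))
```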

2.2. Two-Dimensional Multi-Person Pose Estimation

Multi-person PE builds on single-person estimation but introduces challenges such as keypoint grouping, occlusion, and individual overlap. Some approaches in multi-person estimation first segment the image and then rely on robust single-person PE algorithms for individual recognition. Multi-person PE is principally categorized into top-down and bottom-up approaches. Top-down methods, as shown in Figure 4a, involve person detection and segmentation, while bottom-up methods, illustrated in Figure 4b, focus on keypoint grouping.

2.2.1. Top-Down Approach of Two-Dimensional Multi-Person Pose Estimation

Top-down PE methods generally initiate by detecting individuals within the input image, first localizing each person and then sequentially estimating poses within respective bounding boxes, as illustrated in Figure 4a. The core of top-down methods lies in progressively transitioning from broad localization to precise keypoint identification. This approach separates the task into detection and prediction networks, with numerous studies enhancing both components for improved localization and estimation. For instance, Wang et al. [48] combine CNN and PCNN to refine poses by correcting heatmaps, identifying guide points, and adjusting poses. Cai et al. [49], winners of a keypoint competition, utilize the Residual Step Network (RSN) for efficient spatial feature aggregation and the Pose Refinement Machine (PRM) to balance global and local features for refined keypoint outputs. Xu et al. [50] introduce the ZoomNet, a single-network approach, which leverages ZoomNAS to optimize architecture and resource allocation for handling body hierarchy and scale variations. Furthermore, Artacho et al. [51] introduce UniPose+, which combines multi-scale features with a spatial pyramid configuration to enhance joint prediction accuracy. Wang et al. [52] enhance pose prediction quality by linking high-resolution and low-resolution convolutional streams in parallel, enabling continuous information exchange that enriches semantic details and spatial precision. Xiao et al. [53] enhance localization through a fine-grained representation framework that differentiates body parts and integrates adaptive keypoint encoding, adeptly capturing nuanced interrelations between human instances and their associated keypoints.
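Schematically, the two-stage top-down pipeline reduces to the control flow below. `detector` and `pose_net` are placeholders for any person detector and single-person estimator; the dummy models exist only to make the sketch runnable:

```python
# A schematic sketch of the top-down pipeline in Figure 4a: detect people,
# crop each box, run a single-person estimator, and map keypoints back.
import numpy as np

def top_down_pose(image: np.ndarray, detector, pose_net):
    poses = []
    for (x0, y0, x1, y1) in detector(image):          # stage 1: person boxes
        crop = image[int(y0):int(y1), int(x0):int(x1)]
        kps = pose_net(crop)                          # (K, 2) in crop pixels
        poses.append(kps + np.array([x0, y0]))        # back to image coords
    return poses

# Dummy detector and estimator make the control flow runnable end to end.
img = np.zeros((480, 640, 3))
detector = lambda im: [(100, 50, 300, 400)]
pose_net = lambda crop: np.random.rand(17, 2) * np.array(crop.shape[1::-1])
print(len(top_down_pose(img, detector, pose_net)))    # -> 1 person
```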

2.2.2. Bottom-Up Approach of Two-Dimensional Multi-Person Pose Estimation

In contrast to top-down approaches, bottom-up approaches detect all keypoints across the input image and group them to construct poses, as shown in Figure 4b. This process combines detection and grouping. Yu et al. [54] first utilize heatmaps to globally detect keypoints, assigning probabilities to each point’s location. They introduce dual anatomical center estimation (head and body centers) to merge poses based on spatial configurations and visual similarity, selecting the most plausible representation without additional forward passes, which is particularly effective for smaller scales. Newell et al. [55] instead predict a per-keypoint embedding tag alongside each detection heatmap, grouping joints with similar tags into the same person in a single pass without a separate grouping stage. Further advancing these methods, studies [56,57] employ graph convolutional networks (GCNs), representing the human skeleton as a graph to streamline keypoint grouping and improve PE. Addressing grouping misclassification issues, Jin Lei et al. [58] standardize body centers and calculate statistical relationships among joints, centers, and multi-scale features, enabling dynamic grouping strategies that adapt to varying instance scales with adjustable thresholds. Additionally, Xiao et al. [59] enhance PE accuracy by extracting features through a residual network, feeding keypoints bottom-up, and refining results top-down with bounding box constraints.
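The geometric core of center-based grouping can be sketched as follows: each detected keypoint is assigned to the nearest predicted person center. Real systems such as [54,55,58] layer learned embeddings, visual similarity, and scale-adaptive thresholds on top of this simple rule:

```python
# An illustrative nearest-center grouping step for the bottom-up setting.
import numpy as np

def group_by_center(keypoints, centers):
    """keypoints: list of (joint_id, x, y); centers: (P, 2) person centers."""
    people = [{} for _ in centers]
    for joint_id, x, y in keypoints:
        d = np.linalg.norm(centers - np.array([x, y]), axis=1)
        people[int(np.argmin(d))][joint_id] = (x, y)   # nearest-center rule
    return people

people = group_by_center([(0, 10, 12), (0, 100, 110), (5, 95, 130)],
                         np.array([[12.0, 15.0], [98.0, 120.0]]))
```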

2.3. Two-Dimensional Pose Estimation in Videos

The distinction between static images and dynamic videos underscores the complexity of video-based 2D human pose estimation, where the temporal information across frames becomes crucial. Video processing for PE must account for issues like inter-frame degradation, blurring, and even missing frames, which can impede PE accuracy. Nonetheless, leveraging frame-to-frame correlations can enhance supervisory tasks and boost estimation precision. The Long Short-Term Memory (LSTM) network [60] effectively captures temporal dependencies across frames, outputting stable sequential information through a multi-stage CNN with weight-sharing, making it highly beneficial for PE. Nie et al. [61] develop a dynamic kernel distillation model to capture temporal cues, propagating learned pose features across frames in a feed-forward manner and employing a temporal discriminator during training to minimize loss, ultimately yielding stable pose predictions. Lee et al. [62] combine OpenPose with a ResNet-50 backbone for real-time estimation, incorporating confidence scores and correlation domains for improved accuracy. In cases requiring dense labeling but with insufficient frame annotations, Bertasius et al. [63] introduce PoseWarper, which extracts accurate feature information from sparsely labeled video frames to predict and refine the human pose. Similarly, Dong et al. [64] design a temporal-path module to encode data from non-contiguous frames and integrate them with spatial path data for holistic keypoint predictions. In a comparable approach, [65] compresses video frame correlations, replicating results across frames and using a top-down model with an instance detector and estimator to accelerate inter-frame PE, thereby reducing inference time for closely related frames. However, these methods underutilize adjacent frame information. Liu et al. [66] address this by introducing a hierarchical alignment framework to mitigate aggregation issues from unaligned contexts, creating a more robust spatiotemporal PE model.

2.4. Summary and Analysis of 2D Pose Estimation Methods

With advancements in deep learning, the accuracy and performance of 2D PE have significantly improved. Recent progress includes the enhanced hourglass network [34,35] in single-person settings, fusion networks [46], and multi-person frameworks like the dual anatomical approach [54] and UniPose+ [51], all of which have shown promising results. Single-person PE, fundamental for detecting an individual’s pose, serves as the technical foundation for more complex applications. Innovations in network structures, such as DeepPose, Stacked Hourglass, CPN, and HRNet, have boosted both accuracy and processing speed. Building on this, multi-person PE involves detecting multiple individuals in one image while handling occlusion and truncation. The top-down approach optimizes pose detection by first identifying bounding boxes and then applying single-person PE. The accuracy of object detection, particularly in crowded scenes, is crucial for the overall task. Thus, detection algorithms should prioritize precision and robustness in dense environments. For example, the object detection methods in the literature [67] meet these demands and can be integrated into real-time systems to boost performance. Additionally, the SAM2 [68] technique uses 2D head keypoints for segmentation and tracking in videos. This interdependence between single- and multi-person PE methodologies drives progress in both areas, as advancements in multi-person PE strategies reciprocally enhance single-person techniques. Video-based PE adds complexity through a temporal layer, integrating single- and multi-person estimation across frames. Methods such as two-stream networks, 3D convolutional networks, and recurrent neural networks are employed to capture the dynamic temporal shifts within videos. PE in video sequences must address challenges inherent to dynamic imagery, such as individual pose detection under heavy occlusion [69], which can compromise boundary segmentation in top-down methods. The bottom-up approach similarly struggles with keypoint association in occluded scenarios. In summary, single-person, multi-person, and video-based PE drive each other’s development.

3. Three-Dimensional Pose Estimation

The primary goal of 3D human pose estimation is to accurately deduce joint and limb positions in 3D space from 2D images or video sequences, extracting pose features that reflect the spatial configuration of the human body. Three-dimensional PE facilitates precise spatial localization of joints (3D coordinates), effectively addressing limitations of 2D methods, such as viewpoint variability, complex motion, and occlusion, thereby enhancing robustness and precision. An in-depth examination of 3D human pose estimation techniques will be presented next.

3.1. 3D Single-Person Pose Estimation

In 3D single-person PE, beyond processing joint and limb data, a mesh can be constructed using a human body model to enhance 3D PE accuracy. Methods for achieving this are categorized into direct estimation approaches, 2D-to-3D lifting techniques, and mesh model-based methodologies.

3.1.1. Direct Estimation

The direct estimation approach entails feeding images directly into the network to produce 3D PE outputs, bypassing the generation of intermediate 2D information. This method, as illustrated in Figure 5a, offers a relatively straightforward solution. Early networks incorporated body part dependencies, combining pose regression with part detectors [70] to achieve end-to-end 3D estimation. For instance, extending ideas from 2D methods, Liang et al. [71] propose a structure-aware regression model based on skeletal structure loss, effectively addressing long-distance encoding and simplifying both 2D and 3D estimations. Luvizon et al. [72] employ a multi-task framework for 3D PE on static images, diverging from prior methods by directly predicting 3D data. Some studies introduced heatmap-based direct 3D predictions; for example, Nibali et al. [73] address the non-differentiability of 2D coordinates by proposing the MargiPose model, which generates continuous heatmaps to ensure differentiability while performing 3D predictions. Pavlakos et al. [74,75] reframe the highly nonlinear 3D coordinate regression problem using a discrete spatial representation of human anatomy. Their coarse-to-fine prediction framework within a convolutional network produces joint probabilities and uses relative depth information as supervision, reducing dependency on precise 3D ground truth. While directly generating 3D information streamlines network architecture and avoids multi-cascade structures, its accuracy can be limited by the simplicity of the method. Consequently, recent research trends are shifting towards multi-stage fusion approaches to enhance estimation accuracy.

3.1.2. Two-Dimensional-to-Three-Dimensional Conversion

The 2D-to-3D conversion methodology positions 2D PE as an intermediary phase. Supported by advanced techniques, comprehensive 2D datasets, and enhancement strategies, this approach has significantly elevated 3D estimation precision. These robust 2D frameworks lay a substantial foundation for transforming 2D data into 3D coordinates, as illustrated in Figure 5b. Martinez et al. [76] predict 3D joint positions directly from 2D coordinates with a simple fully connected network, setting a standard baseline for early network models. This approach subsequently fostered the mainstream adoption of 3D pose inference grounded in 2D data. Chen et al. [77] emphasize 3D keypoint lifting and 3D-to-2D projection, constructing extensive 2D–3D pose pair datasets. Taemin et al. [78] utilize CNNs on edge devices to process 2D poses, which are then transmitted to a server for 3D reconstruction. Ji et al. [79] propose a Lie algebra-based representation of poses, implementing a self-projection mechanism within PoseMoNet to enhance 3D pose accuracy while maintaining the integrity of human kinematic structures. Ongoing advances in CNN technology further bolster network capabilities, necessitating increasingly robust architectures. Mehwish et al. [80] investigate CNN-based solutions for guiding 3D PE from incomplete keypoint data. Kim et al. [81] employ CNNs to generate 2D keypoint heatmaps through multi-view projection of depth and ridge data, combining this with a fully connected layer to derive 3D joint positions. In contrast to heatmap-centered methodologies, Yan et al. [82] develop STM-CNN, which infers a 2D coefficient matrix and reconstructs the 3D pose via preprocessed shapes and weights within the same network, thereby reducing the computational demands associated with generating heatmaps.
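A minimal lifting network in the spirit of the fully connected baseline [76] is sketched below; the width, depth, and dropout rate follow the commonly reported configuration but are assumptions here:

```python
# A sketch of 2D-to-3D lifting with residual fully connected blocks.
import torch
import torch.nn as nn

class LiftingBlock(nn.Module):
    def __init__(self, width=1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(width, width), nn.BatchNorm1d(width),
                                 nn.ReLU(), nn.Dropout(0.5),
                                 nn.Linear(width, width), nn.BatchNorm1d(width),
                                 nn.ReLU(), nn.Dropout(0.5))

    def forward(self, x):
        return x + self.net(x)                      # residual connection

class Lifter(nn.Module):
    def __init__(self, joints=17, width=1024):
        super().__init__()
        self.inp = nn.Linear(joints * 2, width)     # flattened 2D pose in
        self.blocks = nn.Sequential(LiftingBlock(width), LiftingBlock(width))
        self.out = nn.Linear(width, joints * 3)     # flattened 3D pose out

    def forward(self, pose2d):                      # (B, J*2) -> (B, J, 3)
        b = pose2d.shape[0]
        return self.out(self.blocks(self.inp(pose2d))).reshape(b, -1, 3)

pose3d = Lifter()(torch.randn(4, 34))
```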
Owing to the intrinsic graph-like structure of human poses, where joints function as nodes and bones as edges, GCNs have emerged as a highly effective approach for PE. GCNs adeptly capture spatial relationships between nodes, proving well suited for exploring joint dependencies and modeling the human body’s geometric structure. Lu Zou et al. [83] conceptualize 2D poses as graphs, redefining 3D estimation as a graph regression problem, where GCNs infer latent structural relationships within the human body. Bing Yu et al. [84] develop a Perceptual U-shaped Graph Convolutional Network (M-UGCN) using a U-shaped network with map-aware local enhancement, extending the receptive field and intensifying local node interactions across multiple scales to improve 2D-to-3D estimation. Building on this, Hua et al. [85] combine 2D pose estimates from dual views with triangulation to produce an initial 3D pose, subsequently refining it through a Cross-view U-shape Graph Convolutional Network (CV-UGCN) under weak supervision, applicable to any preceding 2D method. Wang [86] introduces a multi-constrained extended convolutional network, employing GCNs to impose local constraints via alternating spatial and temporal connections and fully connected layers to enforce global constraints, thereby advancing robust 3D PE. Typically, GCNs rely on single-core modeling to capture information, which can limit model diversity. Addressing the trade-off between generalizability and specificity, Chen [87] proposes a Relation-balanced Graph Convolutional Network (RbGC-Net) with a kernel-sharing strategy. This model allocates kernels according to semantic relations, combining local and global features to boost joint interactivity. Wu et al. [88] introduce the Hierarchical Poselet Guided Graph Convolutional Network (HPGCN), integrating a diagonal-dominant graph convolutional layer with a non-local layer to fully capture pose features, refining 3D pose regression through length-based and direction-based constraints. In the context of 2D-to-3D pose lifting, GCNs sequentially transfer and contextualize node information from 2D joint inputs, yielding highly precise 3D joint position predictions.
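The sketch below shows the basic operation these GCN-based lifters share: node features are mixed along skeleton edges through a normalized adjacency matrix before a learned linear map. The toy five-joint skeleton is an assumption for illustration:

```python
# A hedged sketch of one graph convolution over a skeleton graph, as used
# by GCN-based lifters such as [83,84].
import torch
import torch.nn as nn

class SkeletonGConv(nn.Module):
    def __init__(self, in_dim, out_dim, adjacency: torch.Tensor):
        super().__init__()
        a = adjacency + torch.eye(adjacency.shape[0])      # add self-loops
        self.register_buffer("norm_adj", a / a.sum(dim=1, keepdim=True))
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x):                                  # x: (B, J, C)
        return torch.relu(self.lin(self.norm_adj @ x))     # aggregate neighbors

edges = [(0, 1), (1, 2), (2, 3), (1, 4)]                   # toy 5-joint skeleton
A = torch.zeros(5, 5)
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
feat = SkeletonGConv(2, 64, A)(torch.randn(8, 5, 2))       # lift 2D inputs
```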
These methods typically involve numerous parameters, with [84,87] being relatively lightweight. GCNs’ computational complexity is shaped by the graph topology: fixed topologies, using predefined skeletons, offer low cost but limited adaptability, whereas adaptive topologies, though more flexible, increase complexity. Local topologies reduce the computational burden but may omit long-range dependencies, while global topologies improve pose modeling at the cost of a higher computational load. Thus, the topological prior governs both information propagation and efficiency, requiring a balance between complexity and generalization. In contrast to other network-based methods, GCNs exploit the topological structure of the skeleton to reduce noise interference, thereby greatly improving the robustness and accuracy of PE.
The kinematic model treats the human body as an articulated structure, accounting for the connectivity between bones and joints, rotation constraints, and segment length ratios. By imposing physical constraints, the kinematic model reduces ambiguity and aligns estimation outcomes more closely with real-world poses, making it a growing area of interest in 3D human pose estimation. Andrei et al. [89] introduce semi- and self-supervised models rooted in kinematic latent normalizing flows and dynamics, augmented by a differentiable semantic body-part alignment loss for self-supervised learning. Xu et al. [90] advance a deep kinematic analysis pipeline that optimizes noisy 2D inputs, decomposes joint motions according to human topology, and applies temporal refinement to enhance 3D PE, fully integrating the kinematic model for accuracy. Building on this, Jiang et al. [91] embed geometric priors into the kinematic model, constraining estimated poses to general kinematic principles. Other strategies fuse kinematic topology with temporal features using a multi-level encoder–decoder architecture [92], where a temporal convolutional encoder first extracts temporal information, followed by a kinematic regression decoder to predict 3D poses. Zhang et al. [93] implement an end-to-end P2P-MeshNet that reconstructs 3D joint positions from 2D inputs, integrating joint rotation estimation via an inverse kinematics network (IKNet-body) and a self-correcting network (IEF), embedding the kinematic model to ensure accurate, consistent joint rotations. Recently, a kinematics-informed filter was developed [94], combining traditional filtering with a recurrent neural network to achieve real-time optimization of 3D motion data, applying biomechanical constraints to effectively reduce jitter and frame loss. By integrating deep learning with kinematic priors, these approaches significantly boost the robustness and precision of 3D PE, particularly in handling complex motion and occlusion, thereby enhancing predictive performance.
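As a small, hedged example of such a kinematic constraint, the loss below penalizes deviations of predicted bone lengths from reference segment lengths so that estimates respect segment-length ratios; the bone list and reference values are toy assumptions:

```python
# An illustrative soft kinematic constraint: a bone-length consistency loss.
import torch

BONES = [(0, 1), (1, 2), (2, 3)]                     # parent-child joint pairs

def bone_length_loss(pred3d, ref_lengths):
    """pred3d: (B, J, 3); ref_lengths: (len(BONES),) target bone lengths."""
    lengths = torch.stack(
        [(pred3d[:, j] - pred3d[:, i]).norm(dim=-1) for i, j in BONES], dim=1)
    return ((lengths - ref_lengths) ** 2).mean()     # soft physical constraint

loss = bone_length_loss(torch.randn(4, 4, 3), torch.tensor([0.3, 0.25, 0.25]))
```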

3.1.3. Three-Dimensional Estimation Based on Human Body Model

Human body mesh models are constructed from thousands of vertices that form a dense 3D surface, effectively representing various anatomical regions. In 3D PE, these mesh models play an essential role by enabling the generation of a comprehensive 3D shape representation, achieved by learning the intricate spatial relationships between vertices. Unlike point-based models that rely on discrete joint markers, mesh models encapsulate continuous morphological details, allowing for more precise joint localization and a richer portrayal of complex motion and postural dynamics. As shown in Figure 6, mesh models go beyond merely pinpointing key positions by providing an integrated view of the human shape, which substantially improves both the resolution and accuracy of PE.
Volumetric models are central to constructing high-fidelity human body meshes, offering rich shape information, with the SMPL model [95] being the most extensively utilized. Zhou et al. [96] introduce part-centric heatmap triples, or HEMlets, to bridge the disparity between 2D joint detection and 3D PE. Leveraging a convolutional network, they predict HEMlets to capture relative depth data, facilitating joint position estimation through volumetric heatmap regression. This method integrates end-to-end learning with SMPL parameter regression, resulting in accurate 3D pose and shape reconstruction. Andrew et al. [97] propose combining multi-view video (MVV) and Inertial Measurement Unit (IMU) data to enhance PE accuracy. By using a 3D CNN to derive pose embeddings from volumetric models and coupling it with a Long Short-Term Memory (LSTM) network to capture spatio-temporal dependencies, they achieved enhanced pose prediction through dual-stream data fusion. Zheng et al. [98] introduce a lightweight graph transformation network to reconstruct 3D human meshes from 2D poses, effectively integrating GCNs and transformer architectures. This method models joint relationships efficiently, providing a compact yet effective alternative to Pose2Mesh [99]. Zhang et al. [100] develop a multi-branch network to directly estimate and track multi-person 3D poses and Re-ID features using 3D voxel representations constructed from multi-view images. By eliminating single-view reliance, this approach utilizes voxel features to integrate information across views, ensuring robust 3D pose tracking even under partial occlusions. Zhang et al. [101] propose a convolutional network with a shape alignment mechanism, progressively refining 3D pose and shape estimation via a feedback loop within a multi-level pyramid structure. Xu and Xiangyu [102] leverage self-supervised learning for 3D morphology and posture from low-resolution images. By training a convolutional network on unlabeled data, along with the SMPL volumetric model and a projection consistency strategy, they achieved effective recovery of 3D shape, even from low-resolution inputs. Jiefeng Li et al. [103] employed the SMPL parameter model to represent the human body and propose a hybrid inverse dynamics approach that integrates 3D keypoint estimation with human mesh reconstruction. By incorporating torsion-swing rotation decomposition and end-to-end differentiable training, they co-optimize the PE task, significantly enhancing its accuracy and consistency.
In 3D PE, efficient computation is paramount. Both the DUSt3R [104] technique, which reconstructs 2D to 3D from limited views or even a single image to extract geometric data in seconds, and the DeForHMR [105] framework, which combines SMPL parameters, pre-trained ViT features, deformable cross-attention, and iterative error feedback, effectively address diverse task requirements by enhancing both accuracy and computational efficiency. Anastasis Stathopoulos et al. [106] enhance 3D pose and shape reconstruction with a score-guided diffusion model in latent space, ensuring consistency with image data through SMPL parameters from the initial regression network. This method excels in single-frame fitting, multi-view, and video sequence reconstruction.
In local PE, Pengfei Xie et al. [107] propose a novel framework integrating musculoskeletal modeling, where constraints are applied through their interaction. An MLP, combined with the reference pose, progressively refines estimations, yielding more physiologically accurate and natural hand poses. Meanwhile, Istvan Sarandi et al. [108] introduce a point locator network and neural locator field that dynamically predict points on the human body’s surface and volume in 3D space. By integrating mesh, skeleton, and dense pose data, this approach overcomes data heterogeneity and eliminates reliance on specific annotation formats. Zhongang Cai et al. [109] utilize ViT-Huge as the backbone, training on 4.5 million instances. Through data augmentation, parameter optimization, and efficient framework design, it achieves precise human, hand, and face PE with remarkable generalization and transferability.
Several advanced models have been developed based on the SMPL framework, including SMPL-X [110], SMPLify [111], SPIN [112], SMPLR [113], and SMPLer [114], to overcome limitations of the original SMPL model, such as high computational demands and missing keypoints for certain body parts. Nikos et al. [115] alleviate the model’s heavy dependence on parameter space while maintaining the topological structure of the SMPL template mesh. Their approach utilizes a network to directly regress mesh vertex coordinates, integrating shape constraints with the volumetric model to enhance reconstruction accuracy and consistency. Furthermore, SPIN [112] combines optimization- and regression-based methods, employing a CNN to iteratively update mesh shapes. SMPLify [111] is an optimization-based method that aligns the SMPL model with the detected 2D joints, minimizing the reprojection error in the process. Choutas et al. [110] achieve expressive human body reconstruction from monocular images through a volume-driven attention mechanism, optimizing SMPL parameters to capture features that simultaneously represent body, face, and hand shapes and postures, resulting in the development of SMPL-X. Building on this concept, SMPLify-X was developed to enhance learning on the AMASS dataset [116]. The Multi-HMR [117] network predicts SMPL-X parameters and 3D positions (including hands and faces) using internal camera parameters, marking the first single-shot 3D mesh recovery method for multiple people. Leveraging a ViT backbone for feature extraction, the model integrates the Human Perceptual Head to predict pose, shape, and depth via a cross-attention mechanism, particularly excelling at high resolution. SMPLR [113] leverages a deep learning-based inverse SMPL approach to recover 3D human pose and shape from 2D images, optimizing SMPL parameters via a convolutional network and backpropagation. Additionally, SMPLer [114], proposed by Xu et al., integrates a transformer architecture with the SMPL model, employing a self-attention mechanism to optimize the shape and pose parameters while estimating keypoints through a volumetric model.
Beyond the SMPL framework and its derivatives, other models have emerged for PE. Wang et al. [118] introduce a skeleton-level skin model that decouples bone structures from shapes, synthesizing meshes by configuring bone proportions. Cheng et al. [119] employ a “cylindrical man” model to create occlusion training data and impose pose regularization constraints, addressing occlusion in monocular videos using 2D confidence heatmaps, optical flow consistency constraints, and temporal convolutional networks for both 2D and 3D data. Differing from [110], Xiang et al. [120] recover the 3D orientation of body parts, face, and hands by utilizing a 3D part orientation field (POF) in conjunction with fully convolutional networks to encode and reconstruct pose and shape via a 3D deformable mesh model. Compared to other joint point detection methods, mesh models not only estimate joint positions but also effectively capture morphological changes across body parts.
Mesh-based approaches center on feature learning against accurate ground truth. However, the lack of real 3D annotations compels reliance on approximate methods for generating pseudo-ground truth, which compromises PE accuracy. To address this, several studies have been conducted. Sai Kumar Dwivedi et al. [121] propose threshold-adaptive loss scaling and tokenized pose encoding to reduce the impact of erroneous pseudo-ground truth. Yuanyuan Song et al. [122] optimize shape prediction by incorporating gradient-optimized model parameters, contour information, and silhouette and vertex loss functions, minimizing dependence on 3D annotations. More recently, Priyanka Patel et al. [123] enhance pseudo-ground truth quality by integrating a full perspective model and dense surface keypoint techniques, driving significant progress in the field. Additionally, PE for similar actions remains a challenge, but Yidan Zhang and Lei Nie [124] effectively mitigate the effects of perspective and skeletal variations using deep metric learning, dynamic time warping, and a triplet loss function.

3.2. Three-Dimensional Single-View Multi-Person Pose Estimation

Three-dimensional multi-person estimation and two-dimensional multi-person PE exhibit similarities in their processing methodologies and challenges; however, the intricacy of the processing pipeline and network architecture in three-dimensional estimation is markedly more pronounced. The primary objective of multi-person PE is to detect and localize keypoints for all individuals within an image, where the number of subjects remains unknown a priori. Existing methods can be systematically categorized into top-down and bottom-up strategies based on their processing approaches.

3.2.1. Top-Down Approach of Three-Dimensional Multi-Person Pose Estimation

The top-down approach employs a two-stage process whereby individuals are initially identified within the input image, followed by the precise localization of keypoints for each subject within their respective detected bounding boxes. This methodology effectively aligns all poses with world coordinates, utilizing each individual’s absolute root coordinates and their root-relative poses. Rogez et al. [125] first design LCR-Net, a framework intended for controlled settings, and later improve the architecture into LCR-Net++ [126], which systematically defines poses across various positions within the input. This model aggregates adjacent assumed poses and employs a regressor to facilitate collective enhancement, thereby broadening its applicability in real-world scenarios. Building on this concept, Seo et al. [127] develop a two-step local 3D estimation process that integrates hand detection and PE through an attention-based architecture known as StereoNet. This approach incorporates a geometric loss function, termed StereoLoss, and a novel 2D disparity map (StereoDMap) to facilitate absolute 3D hand PE using stereo sensors. Additionally, depth information plays a crucial role in enhancing PE accuracy. For example, Moon et al. [128] propose a top-down methodology that leverages camera distance perception, effectively integrating depth information with camera projection parameters to predict the relative distances of individuals, thereby improving depth accuracy. Their framework employs RootNet to ascertain the position of the human root within the camera coordinate system, subsequently utilizing PoseNet to estimate root-relative 3D poses for cropped regions, thus enabling precise 3D multi-person PE. El Kaid et al. [129] recover absolute depth information by employing a human detector, a 3D root-relative pose reconstructor, and a root depth estimator, while adaptively refining the GAST-Net and RootNet architectures for top-down PE. Similarly, Shen et al. [130] employ depth information fusion to mitigate prediction errors in keypoints, particularly in the context of occlusion challenges, refining pose predictions iteratively from top to bottom through a multi-scale “waterfall” structure. Furthermore, Paudel et al. [131] harness a temporal convolutional network (TCN) to dynamically estimate worker postures over time, ensuring that the 3D posture recognition process not only analyzes single-frame images but also effectively tracks continuous posture changes. This approach incorporates multi-level risk assessments to evaluate 3D human postures, adeptly addressing dynamic challenges while enhancing real-time performance.
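The camera-aware step shared by methods such as [128,129] reduces to a standard back-projection: given a root joint’s pixel location, an estimated absolute depth, and the camera intrinsics, the root is placed in camera coordinates and the root-relative pose is offset by it. The intrinsic values below are illustrative assumptions:

```python
# A small sketch of root back-projection from pixel + depth to camera space.
import numpy as np

def back_project(u, v, depth, fx, fy, cx, cy):
    """Pixel (u, v) at metric depth -> 3D point in camera coordinates."""
    return np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth])

root_cam = back_project(u=320, v=240, depth=3.2, fx=1145.0, fy=1145.0,
                        cx=320.0, cy=240.0)
relative_pose = np.zeros((17, 3))            # root-relative 3D pose (toy)
absolute_pose = relative_pose + root_cam     # pose aligned in camera space
```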

3.2.2. Bottom-Up Approach of Three-Dimensional Multi-Person Pose Estimation

In contrast, bottom-up methods concurrently predict all keypoints and subsequently allocate these keypoints to individual subjects. However, occlusion remains a fundamental challenge inherent to any PE approach. Zhen et al. [132] develop a depth-aware partial association algorithm to mitigate this issue, enabling the accurate assignment of joints to individuals while considering occlusion and bone length constraints. Similarly, Mehta et al. [133] utilize a dense 3D convolutional skeleton representation that infers occluded joints by merging multi-person detection with local pose reconstruction. The proposed method integrates learned pose priors with global context, enabling the system to effectively adapt to complex scenarios, including multi-person interactions and occlusions. Simultaneously, it enhances 3D pose reconstruction by improving temporal consistency and maintaining structural integrity. Vasileiadis et al. [134] extract spatial features from point clouds through 3D convolutional layers, effectively capturing both local and global geometric information of the human body while addressing the limitations posed by depth information in 2D methodologies. Nonetheless, achieving a precise assignment of keypoints without overlap or redundancy remains a significant challenge for bottom-up techniques, underscoring a crucial area for further improvement. Zanfir et al. [135] formulate person grouping as an optimization problem using binary integer programming. Initially, they estimate candidate limb connections between detected joint points via a limb scoring module and then reconstruct full skeletons by resolving the binary integer programming optimization problem. Similarly, Li et al. [136] employ a cross-view and clustering strategy to address the classification challenge. They first detect joints independently to generate 3D hypotheses, followed by clustering based on distance similarity in 3D space, ultimately facilitating the accurate grouping and matching of joints to individual skeletons through cross-view clustering.
Some studies have integrated both top-down and bottom-up networks to achieve enhanced results. For instance, Mohamed et al. [137] employ a reliable bottom-up semantic body part segmentation combined with robust top-down body model constraints to capture global pose features, followed by the use of bottom-up local features to fine-tune details, ultimately improving pose and shape estimation accuracy. Yan et al. [138] incorporate a densely connected attention pyramid residual module during the bottom-up phase, along with an isometric regularization term in the top-down phase to penalize misalignment, significantly enhancing inference accuracy. To address the limitation of absolute coordinates in target-centered PE, Zhang et al. [139] propose a dual-network framework merging top-down and bottom-up methods, combined with standardized heatmaps, which bolstered the model’s robustness against scale variations. This approach also tackled the scarcity of 3D live data through a semi-supervised framework. Given the high computational demand of dual-stage networks, Xiao et al. [53] introduce AdaptivePose++, which employs a fine-grained human body representation to estimate all keypoints in a single forward pass. This method is effective for both 2D and 3D estimation, eliminating reliance on traditional global detection and thus enhancing both the robustness and adaptability of 3D PE in complex environments.

3.3. Three-Dimensional Multi-View Multi-Person Pose Estimation

Beyond depth information and other algorithmic remedies, hardware solutions can address occlusion by supplementing the limited field of view of a single camera. The most prevalent method involves estimating 3D human poses from multiple perspectives, using multiple cameras or alternative sensors to capture the subject’s movements from diverse angles. The combined information from each perspective is then processed within a network, as illustrated in Figure 7, to enable accurate PE. However, this approach demands considerable computational resources and extended processing time, making it most suitable for multi-person PE where a balance between accuracy and efficiency is paramount. Helge et al. [140] propose combining camera poses with human poses, dynamically adjusting camera positioning to compensate for occlusions, and predicting consistency constraints through multi-view lenses. To mitigate annotation overhead and reduce redundancy during training, an encoder–decoder model [141] is introduced to extract geometry-aware 3D latent representations from multi-view images and segmented backgrounds. Notably, without periodic consistency constraints, individual view pairings may result in inaccurate 3D pose reconstructions [142]. To address this, additional motion information across views is incorporated by Tian et al. [143], enabling unified 3D reconstruction under multiple constraints. Likewise, paired annotations from thermal and visible images have been utilized for bounding boxes and PE [144], while an adversarial learning framework [145] derives multi-view correlations by measuring spatial distance and angular differences between 2D and 3D spaces. Considering the high cost of multi-view training, Wang et al. [146] introduce a static pose sample simplification approach to reduce storage costs and design a fast-start, lightweight network. Zhang et al. [147], recognizing the limitations of current multi-view methods in merging certain elements, propose a multi-view framework that leverages volume aggregation for multi-scale information integration. Their method further eliminates redundant background through grid-aligned voxel selection, ultimately predicting poses by fusing human body models with keypoints for enhanced accuracy and efficiency.
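At the heart of multi-view fusion lies triangulation. The sketch below shows linear (DLT) triangulation of a single joint from two calibrated views with toy projection matrices; real pipelines add learned confidences, many views, and volumetric aggregation on top:

```python
# A minimal sketch of two-view DLT triangulation of one joint.
import numpy as np

def triangulate(P1, P2, x1, x2):
    """P1, P2: (3, 4) projection matrices; x1, x2: (2,) pixel observations."""
    A = np.stack([x1[0] * P1[2] - P1[0], x1[1] * P1[2] - P1[1],
                  x2[0] * P2[2] - P2[0], x2[1] * P2[2] - P2[1]])
    _, _, vt = np.linalg.svd(A)                 # least-squares null vector
    X = vt[-1]
    return X[:3] / X[3]                         # homogeneous -> Euclidean

P1 = np.hstack([np.eye(3), np.zeros((3, 1))])              # reference camera
P2 = np.hstack([np.eye(3), np.array([[-0.5], [0], [0]])])  # translated camera
point = triangulate(P1, P2, np.array([0.1, 0.2]), np.array([0.0, 0.2]))
# recovers approximately (0.5, 1.0, 5.0)
```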

3.4. Three-Dimensional Pose Estimation in Videos

Currently, PE research has extended from images to videos, with methods such as [129,148,149] capable of operating within video contexts. Although video processing introduces additional complexity, videos provide substantial contextual information, which significantly aids in accurately determining 3D poses. Building on image-based estimation algorithms, videos can be divided into sequential frames [150,151], allowing for frame-by-frame processing followed by consistency analysis to derive the most coherent pose. Temporal information is crucial for video-based approaches; leveraging temporal continuity is essential to retain distinctive characteristics and prevent information loss. Dario Pavllo et al. [152] propose a 3D PE model leveraging TCN and semi-supervised learning, incorporating dilated temporal convolutions and a backward prediction mechanism. Relying solely on intrinsic camera parameters and minimal additional inputs, the model achieves high accuracy with limited labeled data, demonstrating its superiority. Hong et al. [153] introduce a model-independent approach, Temporal Procrustes Alignment Regularization (TPAR), to enable group-level sequence learning of joint motion trajectories, enhancing sequence-level accuracy by resolving geometric misalignments between predicted and true joint paths. Hossain et al. [154] address the estimation of 3D poses from sequences of 2D poses by utilizing temporal information through LSTM units with shortcut connections. However, this method primarily emphasizes the temporal dimension, with limited spatial considerations. Inspired by skeletal anatomy, Chen et al. [155] decompose the task into bone direction and length prediction, proposing a fully convolutional architecture with extended skip connections to predict direction while circumventing LSTM memory constraints. This method integrates an implicit attention mechanism and 2D keypoint visibility to reduce depth ambiguity. Zhang et al. [156] introduce PoseAug, which diversifies training poses through a novel differentiable pose augmentation module, jointly optimized with a 3D estimator to generate more challenging and varied poses online. This approach enhances both initial and intermediate pose generation in video-based estimation, contributing to more robust and nuanced pose interpretation across frames.
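The dilated temporal convolution idea behind TCN-based video lifters such as [152] can be sketched as follows; channel counts and the dilation schedule are illustrative assumptions, chosen so that three layers cover a 27-frame receptive field:

```python
# A brief sketch of dilated 1D convolutions over a 2D-pose sequence:
# stacked layers with growing dilation aggregate an expanding temporal
# receptive field before regressing the 3D pose of the center frame.
import torch
import torch.nn as nn

class TemporalLifter(nn.Module):
    def __init__(self, joints=17, channels=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(joints * 2, channels, kernel_size=3, dilation=1),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, dilation=3),
            nn.ReLU(),
            nn.Conv1d(channels, joints * 3, kernel_size=3, dilation=9),
        )

    def forward(self, seq2d):        # (B, T, J*2) -> (B, T', J*3), T shrinks
        return self.net(seq2d.transpose(1, 2)).transpose(1, 2)

out = TemporalLifter()(torch.randn(2, 27, 34))   # 27 frames -> 1 output pose
```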

3.5. Summary and Analysis of 3D Pose Estimation Methods

Three-dimensional human PE has progressed from traditional geometric reasoning techniques to end-to-end deep learning models, which autonomously learn effective features from extensive datasets, minimizing the complexity of manually crafted features. Successive advancements in model architectures and multi-layer networks have emphasized the powerful integration of depth information [96,129]. Notably, single-person PE has achieved considerable strides with 2D keypoint regression and direct prediction through deep learning models, marking a strong trend from 2D [41] to 3D methodologies. Techniques from 2D PE are often adapted to 3D scenarios, as 2D pose behavior underpins many 3D innovations, including heatmap-based approaches. For instance, the development of 3D heatmaps using 2D pose heatmaps as intermediate steps has advanced this area [81]. High-precision models like HRNet [66] employ 2D keypoint regression models for 3D PE and, in some cases, integrate a lightweight design to balance accuracy with efficiency. Multi-person PE extends the complexity of single-person estimation by addressing individual differentiation and joint association through association mechanisms and graph structures. This is particularly challenging in crowded or occluded environments, where errors are more prevalent. GCNs are especially effective in multi-person scenarios due to their ability to navigate complex human structure topologies. PE in video sequences inherits all challenges from single-frame estimation but leverages temporal information to strengthen the continuity and robustness of predictions [151]. Temporal consistency across multiple frames can mitigate errors from single-frame predictions and address joint occlusions and complex movements.
Although each task encounters unique challenges and diverse application scenarios, their primary objective converges on inferring 3D joint positions from 2D data, and they share technical commonalities and interdependencies, particularly in the use of CNNs, GCNs, and time-series models. The robustness of PE algorithms under varying illumination and low resolution depends on their adaptability to input quality: illumination changes degrade both 2D keypoint detection and 3D estimation, and low resolution can cause loss of joint information, significantly affecting image-processing networks such as CNNs. GCNs, which operate on skeleton graphs rather than raw pixels, are more resilient, but they still rely on accurate 2D inputs, so errors may accumulate. To improve robustness, strategies such as data augmentation, super-resolution, temporal fusion, and multi-modal learning can enhance model generalization across diverse environments; a simple augmentation sketch is given below. Single-person PE prioritizes individual precision, multi-person estimation emphasizes person differentiation and joint association, and video estimation focuses on maintaining temporal coherence. Fundamentally, however, these tasks share feature extraction and inference mechanisms, building on interconnected technical foundations and theoretical frameworks.
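As one concrete example of such strategies, the following sketch, ours and not tied to any specific method in this review, applies keypoint-consistent augmentation, a random in-plane rotation plus a horizontal flip that also swaps left/right joints; the joint-pair indices are hypothetical:

```python
import numpy as np

FLIP_PAIRS = [(1, 2), (3, 4), (5, 6)]  # hypothetical left/right joint indices

def augment_pose(keypoints, max_rot_deg=30.0):
    """Keypoint-consistent augmentation: random rotation about the pose
    centre plus a random horizontal flip with left/right joint swapping.
    keypoints: (J, 2) array in image coordinates."""
    theta = np.deg2rad(np.random.uniform(-max_rot_deg, max_rot_deg))
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    centre = keypoints.mean(axis=0)
    kps = (keypoints - centre) @ R.T + centre   # rotate around the pose centre
    if np.random.rand() < 0.5:                  # horizontal flip
        kps[:, 0] = 2 * centre[0] - kps[:, 0]
        for l, r in FLIP_PAIRS:                 # preserve left/right semantics
            kps[[l, r]] = kps[[r, l]]
    return kps
```

The same transform must of course be applied to the input image (or the 2D detections feeding a lifting network) so that inputs and labels stay consistent.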

4. Evaluation Metrics, Datasets, and Comparative Analysis

4.1. Evaluation Metrics

4.1.1. Two-Dimensional Pose Evaluation Metrics

Assessing model performance in 2D human pose estimation is inherently complex due to diverse features and requirements, including variations between upper-body and whole-body estimation, single-person and multi-person contexts, and differences in human scale. To effectively navigate these evaluation challenges, researchers have introduced a comprehensive suite of metrics, among which several are commonly utilized as standard benchmarks:
1. Percentage of correct parts (PCP).
The PCP [157] is an early metric in human pose estimation, designed to evaluate the accuracy of limb predictions. It links key joint points into “limbs” (e.g., upper arm, lower arm) and assesses prediction accuracy by measuring the distance between predicted and actual limb positions. For each limb, the PCP considers the segment between two connected joints, such as the shoulder and elbow, and deems the prediction correct if the distance between the endpoints of the predicted and ground-truth segments falls below a threshold, typically 50% of the true limb length (PCP@0.5). However, the PCP is less commonly used in recent work because of its sensitivity to shorter limbs, which are intrinsically harder to localize.
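Using notation introduced here for illustration, for a set of limbs $\mathcal{L}$ with ground-truth endpoints $p_{l,1}, p_{l,2}$, predicted endpoints $\hat{p}_{l,1}, \hat{p}_{l,2}$, and true limb length $L_l$, the metric can be written as

$$ \mathrm{PCP}@\alpha = \frac{1}{|\mathcal{L}|} \sum_{l \in \mathcal{L}} \mathbb{1}\!\left[ \left\| \hat{p}_{l,1} - p_{l,1} \right\|_2 \le \alpha L_l \ \wedge\ \left\| \hat{p}_{l,2} - p_{l,2} \right\|_2 \le \alpha L_l \right], $$

with $\alpha = 0.5$ giving PCP@0.5.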
2. Percentage of correct keypoints (PCK).
The PCK [158] evaluates keypoint localization accuracy for individual joints, such as the head, shoulder, or knee, focusing on precise joint positions rather than entire limb accuracy. This metric calculates accuracy by measuring the distance between each predicted keypoint and its ground-truth counterpart. The most commonly applied threshold, PCKh@0.5, sets the acceptable distance to within 50% of the head segment length in each test image.
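In the same notation, with $d$ a per-image reference length (the head segment length in the case of PCKh), over $N$ keypoints,

$$ \mathrm{PCK}@\alpha = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\!\left[ \left\| \hat{p}_i - p_i \right\|_2 \le \alpha\, d \right], $$

so PCKh@0.5 corresponds to $\alpha = 0.5$.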
3. Average precision (AP) and average recall (AR).
AP is a standard metric for evaluating model performance in classification and detection, effectively balancing precision and recall, especially in imbalanced or multi-class datasets. Mean Average Precision (mAP), representing average precision across classes, is widely used in human pose datasets like MPII and PoseTrack. In the COCO dataset, average recall (AR) [159] complements mAP by focusing on recall. Object Keypoint Similarity (OKS), similar to IoU in object detection, evaluates keypoint proximity to true locations. COCO uses mAP across OKS thresholds for a comprehensive assessment of model accuracy across scales.
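For reference, COCO defines the OKS between a predicted pose and a ground-truth annotation as

$$ \mathrm{OKS} = \frac{\sum_i \exp\!\left( -d_i^2 / \left( 2 s^2 k_i^2 \right) \right) \, \delta(v_i > 0)}{\sum_i \delta(v_i > 0)}, $$

where $d_i$ is the Euclidean distance between the $i$-th predicted and ground-truth keypoints, $s$ is the object scale (the square root of the annotated segment area), $k_i$ is a per-keypoint falloff constant, and $v_i$ is the visibility flag; AP is then averaged over OKS thresholds (0.50 to 0.95 in steps of 0.05).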

4.1.2. Three-Dimensional Pose Evaluation Metrics

1. MPJPE (Mean Per Joint Position Error)
In 3D human pose estimation, while the percentage of correct keypoints (PCK) can be adapted, the MPJPE is more widely utilized. The MPJPE measures the mean Euclidean distance between estimated and true 3D joint positions. Variants include PA-MPJPE, which first aligns the predicted pose to the true pose through scaling, rotation, and translation, thereby removing global discrepancies such as perspective and scale, and the Normalized Mean Per Joint Position Error (NMPJPE), which normalizes the predicted pose relative to the reference before computing the error.
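Concretely, over $N$ joints with estimated positions $\hat{J}_i$ and ground-truth positions $J_i$,

$$ \mathrm{MPJPE} = \frac{1}{N} \sum_{i=1}^{N} \left\| \hat{J}_i - J_i \right\|_2 , $$

and PA-MPJPE evaluates the same quantity after a rigid (Procrustes) alignment of the prediction to the ground truth.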
2. MPVE (Mean Per Vertex Error)
The MPVE measures the average Euclidean error between the estimated and ground-truth positions of each vertex of a 3D human mesh model. Unlike the MPJPE and NMPJPE, it evaluates vertex positions across the entire body surface rather than joint positions alone and is therefore widely used in 3D human mesh reconstruction tasks.
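For concreteness, the following NumPy sketch, ours, computes the MPJPE and PA-MPJPE (via a standard Procrustes alignment); applying the same distance to mesh vertices instead of joints yields the MPVE:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth joints;
    applied to mesh vertices this is the MPVE. Shapes: (N, 3)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def procrustes_align(pred, gt):
    """Similarity (Procrustes) alignment of pred onto gt:
    optimal scale, rotation, and translation via SVD."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(p.T @ g)   # cross-covariance of centred points
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:            # avoid an improper rotation (reflection)
        Vt[-1] *= -1
        S[-1] *= -1
        R = Vt.T @ U.T
    scale = S.sum() / (p ** 2).sum()
    return scale * p @ R.T + mu_g

def pa_mpjpe(pred, gt):
    """MPJPE after rigid alignment (PA-MPJPE)."""
    return mpjpe(procrustes_align(pred, gt), gt)
```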

4.2. Dataset and Comparative Analysis

In deep learning, datasets are essential, forming the foundation for model training and directly impacting generalization, performance, and applicability. Effective learning relies on extracting features from extensive labeled data; hence, the dataset’s quality, quantity, and diversity are crucial for robust representation development. By providing input–output pairs, datasets enable models to optimize weights and biases, thereby enhancing prediction accuracy. A dataset that includes diverse environments, angles, lighting, object shapes, and classes promotes the learning of adaptive features, allowing the model to navigate real-world variability. Additionally, the choice of dataset influences the convergence speed of algorithms and the adaptability of the model as well as the information extraction capabilities of the network architecture.

4.2.1. Two-Dimensional Datasets and Performance Comparison

Although numerous 2D PE datasets exist, some small-scale datasets were used only in early work; as network depth increased, they gradually exposed shortcomings such as limited subject motion diversity and insufficient data volume. We compare the following datasets: Penn Action [160]; FLIC [161]; J-HMDB [162]; CrowdPose [163]; HiEve [164]; the MPII dataset [165]; the Leeds Sports Pose (LSP) dataset [166]; the Common Objects in Context (COCO) dataset [159], introduced by Microsoft; and the PoseTrack dataset [167], detailed in Table 1. Furthermore, a comparative analysis of performance metrics across methods using the MPII, LSP, COCO, and PoseTrack datasets is conducted.
Table 2 shows the comparison results of different 2D single-person PE methods on the MPII dataset under the PCKh@0.5 metric. Between regression-based and heatmap-based methods, regression estimates joint coordinates directly, but the regression formulation of PE is highly non-linear and rarely admits an optimal solution, so its accuracy is generally lower than that of heatmap-based methods. This gap is evident in the single-person results on MPII (Table 2) and LSP (Table 3). Heatmap methods predict a heatmap for each joint, which better preserves the original spatial information, and the local predictions provide effective supervision for the global pose, improving accuracy. High-performance methodologies enhance estimation accuracy but rely on cascaded network structures: whether through hourglass architectures, residual mechanisms, or series–parallel layer connections, these methodologies improve feature learning and joint interaction, optimizing performance. Notably, methods [42,43] excel on the MPII and LSP datasets; both incorporate GANs. Method [42] integrates the hourglass structure into its backbone, and both leverage the generator–discriminator interaction to enhance sample discrimination and feature learning, improving accuracy. Additionally, method [39], combining the hourglass with a pyramid residual mechanism, achieves excellent results. However, these complex structures demand substantial computational resources due to their high parameter counts and computation requirements.
Table 4 shows the experimental outcomes of various 2D PE methodologies on the COCO test-dev set. The AP and AR values also highlight the distinct behavior of bottom-up and top-down methods: the comparison shows that top-down methods achieve higher accuracy than bottom-up ones. The top-down approach attains higher accuracy by detecting and cropping individuals before estimating keypoints, making it well suited to complex scenes and high-precision tasks; however, it requires high-quality labeled data and capable hardware (e.g., GPUs) and incurs high computational costs, limiting real-time performance. In contrast, the bottom-up approach extracts keypoints of all persons directly, offering better computational efficiency and real-time performance but lower accuracy and a more complex grouping step. The performance difference stems from variations in network structure and processing flow: top-down methods reduce scene interference through per-person cropping, improving accuracy and simplifying grouping, while the faster bottom-up methods face a harder association problem. Thus, selecting the optimal method depends on task requirements, data quality, and hardware resources. In Table 5, we mainly compare the performance of three methods on the PoseTrack2017 and PoseTrack2018 video datasets; most video-stream processing likewise builds on single- or multi-person estimation across frames.
As shown in Table 2, Table 3 and Table 4, model performance on the MPII dataset is generally similar across methods, with higher accuracy for the shoulder, head, and hip but lower accuracy for the wrist and ankle. Because MPII focuses on daily activities, it reflects realistic pose variation well but is less suited to complex motions. In contrast, the LSP dataset, which focuses on athletic movements, yields superior performance, particularly in shoulder recognition, owing to precise annotations and relatively consistent pose variations that provide a clear training target. The COCO dataset, however, produces lower scores than MPII and LSP, primarily because its multi-person scenes and complex backgrounds introduce challenges such as interaction and occlusion, reducing model accuracy. Dataset complexity and diversity thus significantly affect model performance: MPII and LSP, being simpler and suited to standard single-person actions, yield higher accuracy, whereas COCO, with its intricate environments and multi-person interactions, presents greater challenges.

4.2.2. Three-Dimensional Datasets and Performance Comparison

Three-dimensional annotation relies on advanced motion capture systems to accurately record human motion trajectories in 3D space, yielding precise 3D joint positions. Most 3D human pose estimation datasets are captured in controlled indoor or simulated environments using motion capture (MoCap) systems (e.g., CMU Panoptic), while some, such as 3DPW, incorporate Inertial Measurement Units (IMUs), whose accuracy and resolution often remain limited. Table 6 outlines several prominent datasets, including Human3.6M [168], MuPoTS-3D [169], the MPI-INF-3DHP dataset [169], the 3DPW dataset [170], the AMASS dataset [116], the NBA2K dataset [171], the GTA-IM dataset [172], and Occlusion-Person [173], covering information such as release year, data size, and applicable scenarios. Human3.6M and MuPoTS-3D are the most widely used, and their performance metrics across different PE methods are analyzed and compared in detail.
In Table 7, most 3D single-person PE methods demonstrate high accuracy on the Human3.6M dataset, a result attributable in part to the dataset’s intrinsic characteristics. Comparing the experimental results, direct estimation methodologies show larger errors, primarily due to their reliance on shallow networks and regression models with limited feature extraction capabilities. In contrast, 2D-to-3D conversion methods use cascaded networks, combining a high-performance 2D estimator with a 3D lifting module to enhance accuracy. However, increasing network depth or fusing tasks does not always improve results: the methodologies in the table optimize network design for specific tasks, ensuring high performance, whereas excessive network complexity or feature processing can cause information overload or computational inefficiency. Despite strong dataset performance, these methods face significant limitations in real-world applications, where their effectiveness may decline sharply.
The complexity of dynamic scenes challenges their robustness in 3D PE. Notably, estimating 3D poses from video data yields superior performance compared with single-image approaches, as temporal consistency enhances stability and accuracy across frames. While the Human3.6M dataset supports high accuracy under controlled conditions, researchers are actively exploring ways to bridge the gap between laboratory precision and real-world applicability, with promising directions emerging in multi-frame and action-conditioned models. A scrutiny of Table 8 reveals that multi-view 3D PE techniques outperform single-view methods under the same datasets and performance metrics: the multi-view setting effectively mitigates occlusion and depth ambiguity, thereby enhancing estimation accuracy.

5. Practical Applications of Pose Estimation

The maturation of PE technology has significantly advanced the field of computer vision, leading to its widespread application across various domains. In the following sections, we will provide a detailed review of several key areas where this technology is extensively utilized.
1. Motion analysis.
By extracting human keypoints to derive semantic information, PE facilitates the recognition and analysis of various human actions. For instance, Ref. [174] stacks 2D heatmaps of human pose sequences into 3D heatmap volumes, enabling the use of ResNet layers to predict human actions from these volumetric representations. This approach also enhances privacy by not displaying individuals’ faces in images or videos. Applications of PE span multiple contexts; for example, it has been employed for behavioral monitoring through posture recognition [175]. The generation of precise skeletal information through this technology can be leveraged in diverse scenarios, such as identifying and mitigating hazardous activities among the elderly [176] as well as monitoring unsafe behaviors of workers in construction environments [177].
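To illustrate the representation described in [174], the following sketch, ours rather than the authors’ code, renders a keypoint sequence into a stack of per-frame Gaussian heatmaps that a 3D CNN (e.g., ResNet-style layers) can consume; the resolution and Gaussian width are assumptions:

```python
import numpy as np

def keypoints_to_heatmap_volume(kps_seq, hw=(64, 64), sigma=2.0):
    """Render each frame's keypoints as Gaussian heatmaps and stack them
    over time, giving a (T, J, H, W) volume for a 3D CNN.
    kps_seq: (T, J, 2) keypoints already scaled to heatmap pixel coords."""
    T, J, _ = kps_seq.shape
    H, W = hw
    ys, xs = np.mgrid[0:H, 0:W]
    vol = np.zeros((T, J, H, W), dtype=np.float32)
    for t in range(T):
        for j in range(J):
            x, y = kps_seq[t, j]
            vol[t, j] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return vol
```

Because the volume encodes only joint locations, not appearance, this representation naturally avoids exposing faces or other identifying imagery, which is the privacy benefit noted above.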
2. Sport analysis and training.
In sporting events, PE is particularly focused on the motion capture and analysis of athletes, significantly contributing to the scientific advancement of competitive sports. This technology plays a pivotal role in sport science, athlete training, and performance evaluation while also paving new avenues for research in the field [178]. For instance, in yoga posture monitoring applications [179], data collection and model training are integrated with real-time feedback, enabling deep learning models to analyze posture data. This functionality allows users to track their practice data and monitor progress effectively. Moreover, video surveillance systems or cameras are employed to record players’ movements during table tennis matches and training sessions [180], capturing their striking techniques while PE algorithms extract keypoints from the video frames. PE finds applicability across various sports, including underwater athlete PE [181], where swimming posture recognition is emphasized. In track and field events, PE offers a novel perspective for motion analysis and performance enhancement [182]. Similarly, research on 3D PE and tracking in handball further exemplifies the versatility of this technology in diverse athletic contexts [183]. Additionally, the emerging Judging Support System leverages PE to accurately capture and assess complex athletic movements, comparing them to established standards to ensure precise scoring. In contrast to manual refereeing, this system enhances both the efficiency and fairness of judgments, significantly reducing or even eliminating errors from misjudgments or missed calls.
3. Human–computer interaction.
Unlike conventional human–computer interaction methods, PE offers a natural, non-contact approach for controlling and interacting with virtual objects through motion, skeletal structures, and joint points. Within VR/AR environments, PE facilitates user actions to engage dynamically with virtual elements. For example, Ref. [184] leverages pose data to capture hand interactions with objects in virtual settings. Ref. [185] employs upper-body PE to enhance avatar generation and control, thereby improving interaction quality and expanding the potential for immersive user experiences in virtual worlds. In addition, in [186], pose data are combined with a model of the virtual patient to create a virtual environment that responds in real time to the actions of the trainer. This includes simulating haptic feedback that enables the trainer to sense and manipulate the responses of the virtual patient.
4. Entertainment and art.
Gesture recognition technology has become increasingly prevalent in entertainment, animation, and film production, enriching our visual experiences. For instance, Ref. [7] employs 3D PE and human mesh recovery techniques to analyze keypoints in images, facilitating the generation of corresponding 3D character models. This technology enables the real-time manipulation of these 3D characters, achieving natural and fluid animation effects that enhance our visual engagement and leisure experiences. Moreover, Ref. [187] harnesses specific poses to inform character animation creation, providing robust support for animators. Similarly, the pose-driven technology discussed in [188] synthesizes realistic 2D motion, offering greater flexibility and creative freedom. This approach can be applied across various 2D animation projects, accommodating diverse styles and requirements, thereby expanding the horizons of artistic expression.
5. Health monitoring and rehabilitation.
By analyzing posture data, medical professionals can assess individuals’ movement statuses in real time, facilitating the identification of abnormal or undesirable movement patterns. This capability aids in diagnosing potential health issues and enhances healthcare delivery. For instance, Ref. [189] employs real-time 3D PE through sensor data to capture user postures during daily activities. By monitoring and correcting poor postures, the system effectively prevents and alleviates lower back pain. In another application, Ref. [190] utilizes wireless sensors to collect real-time posture data from athletes during training and competition, enabling the identification and correction of poor postural habits, thereby reducing injury risks and enhancing training efficacy. Furthermore, Ref. [191] presents a real-time monitoring system based on human PE that guides the daily exercise routines of the elderly, significantly improving health management outcomes.
6. Performing arts.
Human pose estimation plays a vital role in the performing arts, particularly in dance. With the increasing diversity of dance forms, such as street and national dance, accurately distinguishing dance postures has become essential. However, the subtle variations between different movements make PE indispensable. As noted in Reference [192], employing 17 measurement methods to capture full-body motion characteristics effectively differentiates dance types. This underscores the need for high-performance techniques capable of capturing subtle differences, highlighting the growing significance of PE in artistic performance.

6. Challenges, Outlook, and Conclusions

6.1. Challenges

Although numerous efficient PE networks have been developed, PE still faces considerable challenges. In complex, crowded scenes, cascaded algorithms often lack reliability in detecting individuals, hindering multi-person PE. Shooting conditions can additionally be categorized into static and dynamic scenarios: while PE is comparatively straightforward in static environments, estimation becomes markedly harder when the camera itself is moving, and methods such as MASt3R [193] have been employed to estimate camera motion in such settings. Top-down methods struggle to distinguish overlapping person boundaries, while bottom-up approaches find keypoint association more difficult under occlusion. Moreover, some algorithms perform well in controlled environments but generalize poorly to diverse real-world scenarios. Beyond the complexity and variability inherent in multi-person scenarios, PE is also limited by inconsistent evaluation metrics, difficulty in capturing subtle pose variations under external disturbances, and excessive resource consumption. Ultimately, these issues converge into three core challenges: occlusion, limited data, and computational efficiency.
1. The problem of occlusion.
The challenge of occlusion arises from the wide variety of potential obstructions, including other individuals, surrounding objects, or environmental elements. This diversity makes it difficult for models to account for all possible occlusion scenarios during training, ultimately limiting their ability to generalize effectively. When specific body parts are occluded, the model often fails to retrieve the positional information of the hidden joints. This absence disrupts the model’s ability to extrapolate based on the information from visible joints, increasing the likelihood of erroneous predictions. Data clarity presents a challenge, as improper lighting (e.g., overexposure or insufficient illumination) can result in information loss. In video processing and real-time surveillance, person aliasing is prevalent, where individuals in one frame are occluded by objects in the next or suddenly enter the scene due to changing lighting, leading to information omission. Fusion-based video tracking and positioning can mitigate these issues. Consequently, inaccurate occlusion estimation and failure to track objects in real time compromise performance and reduce model robustness.
2. Insufficient available data.
PE algorithms typically demand vast amounts of annotated data for effective training. However, data collection and labeling are both time-intensive and labor-intensive processes, often prone to human error, resulting in inconsistent or inaccurate annotations that directly impair model training efficacy. Moreover, many datasets lack sufficient diversity in critical aspects such as age, gender, body shape, and attire, which limits model performance in varying demographics and contexts. In highly challenging scenarios, such as occlusions, intricate backgrounds, or dynamic environments, the scarcity of data becomes increasingly evident, hindering models from adequately learning to accurately recognize these situations in real-world applications.
3. Computational efficiency and real-time performance.
Advanced PE algorithms commonly employ deep learning models with numerous parameters, resulting in high computational demands. In real-time systems, continuous input, such as video streams, must be processed swiftly, with high-resolution images and complex scenes requiring particularly rapid handling to avoid latency. While approaches like parallel processing and lightweight models can enhance efficiency, they frequently compromise accuracy. Thus, optimization is essential not only for hardware but also within the algorithms themselves, which must balance precision and computational efficiency while enabling quick learning and adaptability. This highlights the fundamental challenge in PE: achieving a balance between computational efficiency and resource consumption.

6.2. Outlook

As we delve into technical algorithms and address persistent challenges, we continually explore ways to overcome these obstacles and expand the societal impact of PE technology, aiming to deliver more efficient solutions and applications. Currently, many algorithms rely primarily on the Human3.6M or COCO datasets, which impose certain demographic limitations; extending the scope of application and diversifying datasets is therefore essential. For instance, building specialized datasets for specific populations such as infants, young children, and the elderly, by collecting data in health-sensitive environments or capturing distinctive behaviors, would enable algorithms tailored to specific user groups. Limited training data hinder comprehensive feature learning; expanding data types can enhance applicability, while fine-tuning the network minimizes computational surges, yielding lighter, more generalizable models. Furthermore, PE algorithms, reliant on image and video inputs, are sensitive to image quality, with blur or shadows compromising accuracy. Integrating image processing and person tracking at the network’s front-end can optimize input, particularly in multi-person PE, improving individual differentiation, localization, and pose prediction. Such advancements promise superior generalization, computational efficiency, and accuracy over existing methodologies. While existing human body models serve as a foundation for research, they often lack comprehensive alignment with diverse audience groups; addressing model bias and ensuring stable performance in noisy environments are pressing areas for future exploration. One promising approach is the development of multi-model cascades, with a primary model for initial PE and auxiliary models for verification and refinement. Additionally, implementing two-stream networks to parallelize PE and exploring further cascading techniques could significantly enhance accuracy; although such strategies may increase computational demands, the potential accuracy gains are substantial. For real-time applications, achieving lower computation times by allowing slight precision trade-offs remains a priority. While real-time optimization has gained attention in recent years, there remains ample space for innovation, particularly in developing lighter, more portable network architectures. Algorithm evaluation should not be limited to accuracy metrics alone; incorporating a time–accuracy balance index within a lightweight framework would facilitate more holistic assessments. For both 2D and 3D algorithms, future research should focus on extracting richer semantic information with minimal annotation requirements while filtering out redundant noise in occluded or cluttered environments. Where training data suffer from limited generalization and labeling difficulties, transfer learning can help bridge the gaps. Additionally, multi-modal data fusion, integrating frame images, infrared data, radar signals, and more, offers a pathway to further enhancing the accuracy and robustness of PE models.

6.3. Conclusions

This review offers a comprehensive introduction to PE technology within the deep learning domain, categorizing PE into 2D and 3D approaches. We systematically discuss various advanced technologies and their applicable conditions in single-person, multi-person, and video methods and expound on how to improve the estimation effect in multi-person and video methods through single-person PE technology. Additionally, we summarize the evaluation metrics and datasets relevant to these categories, conduct a comparative analysis of datasets, and provide detailed statistical performance evaluations of various algorithms. This paper further elucidates the connections between 2D and 3D methodologies, enabling readers to gain a deeper understanding of 3D algorithms through their 2D counterparts. Lastly, we analyze existing challenges in PE, offering valuable insights on emerging trends and potential directions informed by current needs and technological advancements. Overall, this paper serves as a robust reference for comprehensively understanding PE technology and provides critical guidance for future research.

Author Contributions

Conceptualization, L.Z., Z.L. and A.W.; methodology, Z.L., R.S. and S.L.; validation, L.Z., S.L. and R.S.; writing—review and editing, L.Z. and Z.L.; visualization, R.S. and A.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Social Science Foundation of Heilongjiang Province (No. 22RKC306) and the Basic Scientific Research Foundation Project of Provincial Colleges and Universities in Heilongjiang Province (2022KYYWF-FC05).

Data Availability Statement

No new data were created.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this manuscript.

References

  1. Gong, X.; Geng, X.; Nie, G.; Wang, T.; Zhang, J.; You, J. Normative Evaluation Method of Long Jump Action Based on Human Pose Estimation. IEEE Access 2023, 11, 125452–125459. [Google Scholar] [CrossRef]
  2. Li, H.; Guo, H.; Huang, H. Analytical Model of Action Fusion in Sports Tennis Teaching by Convolutional Neural Networks. Comput. Intell. Neurosci. 2022, 2022, 7835241. [Google Scholar] [PubMed]
  3. Du, W.; Wang, Y.; Qiao, Y. Rpan: An end-to-end recurrent pose-attention network for action recognition in videos. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
  4. Wei, L.; Yu, X.; Liu, Z. Human pose estimation in crowded scenes using Keypoint Likelihood Variance Reduction. Displays 2024, 83, 102675. [Google Scholar]
  5. Juraev, S.; Ghimire, A.; Alikhanov, J.; Kakani, V.; Kim, H. Exploring Human Pose Estimation and the Usage of Synthetic Data for Elderly Fall Detection in Real-World Surveillance. IEEE Access 2022, 10, 94249–94261. [Google Scholar]
  6. Yu, X.; Zhang, X.; Xu, C.; Ou, L. Human-robot collaborative interaction with human perception and action recognition. Neurocomputing 2024, 563, 126827. [Google Scholar]
  7. Weng, C.-Y.; Curless, B.; Kemelmacher-Shlizerman, I. Photo wake-up: 3d character animation from a single photo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  8. Lan, G.; De Vries, L.; Wang, S. Evolving efficient deep neural networks for real-time object recognition. In Proceedings of the 2019 IEEE Symposium Series on Computational Intelligence (SSCI), Xiamen, China, 6–9 December 2019; IEEE: Piscataway, NJ, USA, 2019. [Google Scholar]
  9. Zhou, D.; He, Q. PoSeg: Pose-Aware Refinement Network for Human Instance Segmentation. IEEE Access 2020, 8, 15007–15016. [Google Scholar]
  10. Niu, Z.; Lu, K.; Xue, J.; Wang, J. Skeleton Cluster Tracking for robust multi-view multi-person 3D human pose estimation. Comput. Vis. Image Underst. 2024, 246, 104059. [Google Scholar]
  11. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems; Springer: Berlin/Heidelberg, Germany, 2012; p. 25. [Google Scholar]
  12. Toshev, A.; Szegedy, C. DeepPose: Human Pose Estimation via Deep Neural Networks. In Proceedings of the 27th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
  13. Mehta, D.; Sridhar, S.; Sotnychenko, O.; Rhodin, H.; Shafiei, M.; Seidel, H.P.; Theobalt, C. VNect: Real-time 3D Human Pose Estimation with a Single RGB Camera. ACM Trans. Graph. 2017, 36, 1–14. [Google Scholar]
  14. Poppe, R. Vision-based human motion analysis: An overview. Comput. Vis. Image Underst. 2007, 108, 4–18. [Google Scholar]
  15. Yue, R.; Tian, Z.; Du, S. Action recognition based on RGB and skeleton data sets: A survey. Neurocomputing 2022, 512, 287–306. [Google Scholar]
  16. Wang, C.; Yan, J. A Comprehensive Survey of RGB-Based and Skeleton-Based Human Action Recognition. IEEE Access 2023, 11, 53880–53898. [Google Scholar]
  17. Liu, Z.; Zhu, J.; Bu, J.; Chen, C. A survey of human pose estimation: The body parts parsing based methods. J. Vis. Commun. Image Represent. 2015, 32, 10–19. [Google Scholar] [CrossRef]
  18. Gong, W.; Zhang, X.; Gonzàlez, J.; Sobral, A.; Bouwmans, T.; Tu, C.; Zahzah, E.-H. Human Pose Estimation from Monocular Images: A Comprehensive Survey. Sensors 2016, 16, 1966. [Google Scholar] [CrossRef] [PubMed]
  19. Munea, T.L.; Jembre, Y.Z.; Weldegebriel, H.T.; Chen, L.; Huang, C.; Yang, C. The Progress of Human Pose Estimation: A Survey and Taxonomy of Models Applied in 2D Human Pose Estimation. IEEE Access 2020, 8, 133330–133348. [Google Scholar]
  20. Dang, Q.; Yin, J.; Wang, B.; Zheng, W. Deep Learning Based 2D Human Pose Estimation: A Survey. Tsinghua Sci. Technol. 2019, 24, 663–676. [Google Scholar]
  21. El Kaid, A.; Baina, K. A Systematic Review of Recent Deep Learning Approaches for 3D Human Pose Estimation. J. Imaging 2023, 9, 275. [Google Scholar] [CrossRef]
  22. Wang, J.; Tan, S.; Zhen, X.; Xu, S.; Zheng, F.; He, Z.; Shao, L. Deep 3D human pose estimation: A review. Comput. Vis. Image Underst. 2021, 210, 103225. [Google Scholar]
  23. Sarafianos, N.; Boteanu, B.; Ionescu, B.; Kakadiaris, I.A. 3D Human pose estimation: A review of the literature and analysis of covariates. Comput. Vis. Image Underst. 2016, 152, 1–20. [Google Scholar]
  24. Liu, W.; Bao, Q.; Sun, Y.; Mei, T. Recent Advances of Monocular 2D and 3D Human Pose Estimation: A Deep Learning Perspective. Acm Comput. Surv. 2023, 55, 1–41. [Google Scholar]
  25. Chen, Y.; Tian, Y.; He, M. Monocular human pose estimation: A survey of deep learning-based methods. Comput. Vis. Image Underst. 2020, 192, 102897. [Google Scholar]
  26. Sun, X.; Xiao, B.; Wei, F.; Liang, S.; Wei, Y. Integral Human Pose Regression. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  27. He, K.M.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  28. Gu, K.; Yang, L.; Mi, M.B.; Yao, A. Bias-Compensated Integral Regression for Human Pose Estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10687–10702. [Google Scholar] [PubMed]
  29. Su, S.; She, B.; Zhu, Y.; Fang, X.; Xu, Y. RCENet: An efficient pose estimation network based on regression correction. Multimed. Syst. 2024, 30, 1–13. [Google Scholar]
  30. Kumar, P.; Chauhan, S. Towards improvement of baseline performance for regression based human pose estimation. Evol. Syst. 2024, 15, 659–667. [Google Scholar]
  31. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the 30th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  32. Tompson, J.; Goroshin, R.; Jain, A.; LeCun, Y.; Bregler, C. Efficient Object Localization Using Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  33. Newell, A.; Yang, K.; Deng, J. Stacked Hourglass Networks for Human Pose Estimation. In Proceedings of the 14th European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016. [Google Scholar]
  34. Zou, X.; Bi, X.; Yu, C. Improving Human Pose Estimation Based on Stacked Hourglass Network. Neural Process. Lett. 2023, 55, 9521–9544. [Google Scholar]
  35. Dong, X.; Yu, J.; Zhang, J. Joint usage of global and local attentions in hourglass network for human pose estimation. Neurocomputing 2022, 472, 95–102. [Google Scholar]
  36. Kamel, A.; Sheng, B.; Li, P.; Kim, J.; Feng, D.D. Hybrid Refinement-Correction Heatmaps for Human Pose Estimation. IEEE Trans. Multimed. 2021, 23, 1330–1342. [Google Scholar]
  37. Kim, S.-T.; Lee, H.J. Lightweight Stacked Hourglass Network for Human Pose Estimation. Appl. Sci. 2020, 10, 6497. [Google Scholar] [CrossRef]
  38. Qin, X.; Guo, H.; He, C.; Zhang, X. Lightweight human pose estimation: CVC-net. Multimed. Tools Appl. 2022, 81, 17615–17637. [Google Scholar]
  39. Yang, W.; Li, S.; Ouyang, W.; Li, H.; Wang, X. Learning Feature Pyramids for Human Pose Estimation. In Proceedings of the 16th IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  40. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Bengio, Y. Generative Adversarial Nets. In Proceedings of the 28th Conference on Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 8–13 December 2014. [Google Scholar]
  41. Chen, Y.; Shen, C.; Wei, X.; Liu, L.; Yang, J. Adversarial PoseNet: A Structure-aware Convolutional Network for Human Pose Estimation. In Proceedings of the 16th IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  42. Zhu, A.; Zhang, S.; Huang, Y.; Hu, F.; Cui, R.; Hua, G. Exploring hard joints mining via hourglass-based generative adversarial network for human pose estimation. AIP Adv. 2019, 9, 035321. [Google Scholar]
  43. Tian, L.; Wang, P.; Liang, G.; Shen, C. An adversarial human pose estimation network injected with graph structure. Pattern Recognit. 2021, 115, 107863. [Google Scholar]
  44. Malakshan, S.R.; Saadabadi, M.S.E.; Mostofa, M.; Soleymani, S.; Nasrabadi, N.M. Joint Super-Resolution and Head Pose Estimation for Extreme Low-Resolution Faces. IEEE Access 2023, 11, 11238–11253. [Google Scholar] [CrossRef]
  45. Xiao, B.; Wu, H.; Wei, Y. Simple Baselines for Human Pose Estimation and Tracking. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  46. Chen, K.; Wu, Z.; Huang, J.; Su, Y. Self-Attention Mechanism-Based Head Pose Estimation Network with Fusion of Point Cloud and Image Features. Sensors 2023, 23, 9894. [Google Scholar] [CrossRef] [PubMed]
  47. Wang, X.; Tong, J.; Wang, R. Attention refined network for human pose estimation. Neural Process. Lett. 2021, 53, 2853–2872. [Google Scholar] [CrossRef]
  48. Wang, J.; Long, X.; Gao, Y.; Ding, E.; Wen, S. Graph-pcnn: Two stage human pose estimation with graph pose refinement. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XI 16. Springer: Berlin/Heidelberg, Germany, 2020. [Google Scholar]
  49. Cai, Y.; Wang, Z.; Luo, Z.; Yin, B.; Du, A.; Wang, H.; Sun, J. Learning delicate local representations for multi-person pose estimation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part III 16. Springer: Berlin/Heidelberg, Germany, 2020. [Google Scholar]
  50. Xu, L.; Jin, S.; Liu, W.; Qian, C.; Ouyang, W.; Luo, P.; Wang, X. ZoomNAS: Searching for Whole-Body Human Pose Estimation in the Wild. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 5296–5313. [Google Scholar]
  51. Artacho, B.; Savakis, A. UniPose+: A Unified Framework for 2D and 3D Human Pose Estimation in Images and Videos. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 9641–9653. [Google Scholar] [CrossRef]
  52. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep High-Resolution Representation Learning for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 3349–3364. [Google Scholar] [CrossRef]
  53. Xiao, Y.; Wang, X.; He, M.; Jin, L.; Song, M.; Zhao, J. A Compact and Powerful Single-Stage Network for Multi-Person Pose Estimation. Electronics 2023, 12, 857. [Google Scholar] [CrossRef]
  54. Cheng, Y.; Ai, Y.; Wang, B.; Wang, X.; Tan, R.T. Bottom-up 2D pose estimation via dual anatomical centers for small-scale persons. Pattern Recognit. 2023, 139, 109403. [Google Scholar] [CrossRef]
  55. Newell, A.; Huang, Z.; Deng, J. Associative embedding: End-to-end learning for joint detection and grouping. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  56. Jin, S.; Liu, W.; Xie, E.; Wang, W.; Qian, C.; Ouyang, W.; Luo, P. Differentiable hierarchical graph grouping for multi-person pose estimation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part VII 16. Springer: Berlin/Heidelberg, Germany, 2020. [Google Scholar]
  57. Zeng, Q.; Hu, Y.; Li, D.; Sun, D. Multi-person pose estimation based on graph grouping optimization. Multimed. Tools Appl. 2023, 82, 7039–7053. [Google Scholar] [CrossRef]
  58. Jin, L.; Wang, X.; Nie, X.; Liu, L.; Guo, Y.; Zhao, J. Grouping by Center: Predicting Centripetal Offsets for the Bottom-up Human Pose Estimation. IEEE Trans. Multimed. 2023, 25, 3364–3374. [Google Scholar]
  59. Chen, X.; Yang, G. Multi-person pose estimation with limb detection heatmaps. In Proceedings of the 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018. [Google Scholar]
  60. Luo, Y.; Ren, J.; Wang, Z.; Sun, W.; Pan, J.; Liu, J.; Lin, L. LSTM Pose Machines. In Proceedings of the 31st IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  61. Nie, X.; Li, Y.; Luo, L.; Zhang, N.; Feng, J. Dynamic kernel distillation for efficient pose estimation in videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  62. Lee, J.; Kim, T.-Y.; Beak, S.; Moon, Y.; Jeong, J. Real-Time Pose Estimation Based on ResNet-50 for Rapid Safety Prevention and Accident Detection for Field Workers. Electronics 2023, 12, 3513. [Google Scholar] [CrossRef]
  63. Bertasius, G.; Feichtenhofer, C.; Tran, D.; Shi, J.; Torresani, L. Learning temporal pose estimation from sparsely-labeled videos. Adv. Neural Inf. Process. Syst. 2019, 32, 3027–3038. [Google Scholar]
  64. Dong, X.; Wang, X.; Li, B.; Wang, H.; Chen, G.; Cai, M. YH-Pose: Human pose estimation in complex coal mine scenarios. Eng. Appl. Artif. Intell. 2024, 127, 107338. [Google Scholar]
  65. Liu, H.; Liu, W.; Chi, Z.; Wang, Y.; Yu, Y.; Chen, J.; Tang, J. Fast Human Pose Estimation in Compressed Videos. IEEE Trans. Multimed. 2023, 25, 1390–1400. [Google Scholar]
  66. Liu, Z.; Feng, R.; Chen, H.; Wu, S.; Gao, Y.; Gao, Y.; Wang, X. Temporal feature alignment and mutual information maximization for video-based human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21–24 June 2022. [Google Scholar]
  67. Zhou, Q.; Shi, H.; Xiang, W.; Kang, B.; Latecki, L.J. DPNet: Dual-Path Network for Real-Time Object Detection with Lightweight Attention. IEEE Trans. Neural Netw. Learn. Syst. 2024, 25, 1390–1400. [Google Scholar]
  68. Ravi, N.; Gabeur, V.; Hu, Y.T.; Hu, R.; Ryali, C.; Ma, T.; Feichtenhofer, C. Sam 2: Segment anything in images and videos. arXiv 2024, arXiv:2408.00714. [Google Scholar]
  69. Chen, Y.; Wang, Z.; Peng, Y.; Zhang, Z.; Yu, G.; Sun, J. Cascaded Pyramid Network for Multi-Person Pose Estimation. In Proceedings of the 31st IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  70. Li, S.; Chan, A.B. 3D Human Pose Estimation from Monocular Images with Deep Convolutional Neural Network. In Proceedings of the 12th Asian Conference on Computer Vision (ACCV), Singapore, 1–5 November 2014. [Google Scholar]
  71. Liang, S.; Sun, X.; Wei, Y. Compositional Human Pose Regression. Comput. Vis. Image Underst. 2018, 176, 1–8. [Google Scholar]
  72. Luvizon, D.C.; Picard, D.; Tabia, H. 2D/3D Pose Estimation and Action Recognition using Multitask Deep Learning. In Proceedings of the 31st IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  73. Nibali, A.; He, Z.; Morgan, S.; Prendergast, L. 3d human pose estimation with 2d marginal heatmaps. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, 7–11 January 2019; IEEE: Piscataway, NJ, USA, 2019. [Google Scholar]
  74. Pavlakos, G.; Zhou, X.; Derpanis, K.G.; Daniilidis, K. Coarse-to-Fine Volumetric Prediction for Single-Image 3D Human Pose. In Proceedings of the 30th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 22–25 July 2017. [Google Scholar]
  75. Pavlakos, G.; Zhou, X.; Daniilidis, K. Ordinal Depth Supervision for 3D Human Pose Estimation. In Proceedings of the 31st IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  76. Martinez, J.; Hossain, R.; Romero, J.; Little, J.J. A simple yet effective baseline for 3d human pose estimation. In Proceedings of the 16th IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  77. Chen, C.-H.; Ramanan, D. 3D Human Pose Estimation = 2D Pose Estimation + Matching. In Proceedings of the 30th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 22–25 July 2017. [Google Scholar]
  78. Hwang, T.; Kim, J.; Kim, M. A Distributed Real-time 3D Pose Estimation Framework based on Asynchronous Multiviews. KSII Trans. Internet Inf. Syst. 2023, 17, 559–575. [Google Scholar]
  79. Yang, J.; Ma, Y.; Zuo, X.; Wang, S.; Gong, M.; Cheng, L. 3D pose estimation and future motion prediction from 2D images. Pattern Recognit. 2022, 124, 108439. [Google Scholar]
  80. Ghafoor, M.; Mahmood, A. Quantification of Occlusion Handling Capability of a 3D Human Pose Estimation Framework. IEEE Trans. Multimed. 2023, 25, 3311–3318. [Google Scholar] [CrossRef]
  81. Kim, Y.; Kim, D. A CNN-based 3D human pose estimation based on projection of depth and ridge data. Pattern Recognit. 2020, 106, 107462. [Google Scholar] [CrossRef]
  82. Yan, J.; Zhou, M.L.; Fang, B.; Xu, K. 3D Human Pose Estimation via Spatio-Temporal Matching from Monocular RGB Images. Int. J. Pattern Recognit. Artif. Intell. 2022, 36, 2255017. [Google Scholar] [CrossRef]
  83. Zou, L.; Huang, Z.; Gu, N.; Wang, F.; Yang, Z.; Wang, G. GMDN: A lightweight graph-based mixture density network for 3D human pose regression. Comput. Graph. 2021, 95, 115–122. [Google Scholar] [CrossRef]
  84. Yu, B.; Huang, Y.; Cheng, G.; Huang, D.; Ding, Y. Graph U-Shaped Network with Mapping-Aware Local Enhancement for Single-Frame 3D Human Pose Estimation. Electronics 2023, 12, 4120. [Google Scholar] [CrossRef]
  85. Hua, G.; Liu, H.; Li, W.; Zhang, Q.; Ding, R.; Xu, X. Weakly-Supervised 3D Human Pose Estimation with Cross-View U-Shaped Graph Convolutional Network. IEEE Trans. Multimed. 2023, 25, 1832–1843. [Google Scholar] [CrossRef]
  86. Wang, H.; Bai, B.; Li, J.; Ke, H.; Xiang, W. 3D human pose estimation method based on multi-constrained dilated convolutions. Multimed. Syst. 2024, 30, 1–17. [Google Scholar] [CrossRef]
  87. Chen, L.; Liu, Q. Relation-balanced graph convolutional network for 3D human pose estimation. Image Vis. Comput. 2023, 140, 104841. [Google Scholar] [CrossRef]
  88. Wu, Y.; Kong, D.; Wang, S.; Li, J.; Yin, B. HPGCN: Hierarchical poselet-guided graph convolutional network for 3D pose estimation. Neurocomputing 2022, 487, 243–256. [Google Scholar] [CrossRef]
  89. Zanfir, A.; Bazavan, E.G.; Xu, H.; Freeman, W.T.; Sukthankar, R.; Sminchisescu, C. Weakly supervised 3d human pose and shape reconstruction with normalizing flows. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part VI 16. Springer: Berlin/Heidelberg, Germany, 2020. [Google Scholar]
  90. Xu, J.; Yu, Z.; Ni, B.; Yang, J.; Yang, X.; Zhang, W. Deep kinematics analysis for monocular 3d human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  91. Jiang, L.; Wang, Y.; Li, W. Regress 3D human pose from 2D skeleton with kinematics knowledge. Electron. Res. Arch. 2023, 31, 1485–1497. [Google Scholar] [CrossRef]
  92. Liao, X.; Dong, J.; Song, K.; Xiao, J. Three-Dimensional Human Pose Estimation from Sparse IMUs through Temporal Encoder and Regression Decoder. Sensors 2023, 23, 3547. [Google Scholar] [CrossRef] [PubMed]
  93. Zhang, X.; Zhou, Z.; Han, Y.; Meng, H.; Yang, M.; Rajasegarar, S. Deep learning-based real-time 3D human pose estimation. Eng. Appl. Artif. Intell. 2023, 119, 105813. [Google Scholar]
  94. Martini, E.; Boldo, M.; Bombieri, N. FLK: A filter with learned kinematics for real-time 3D human pose estimation. Signal Process. 2024, 224, 109598. [Google Scholar] [CrossRef]
  95. Loper, M.; Mahmood, N.; Romero, J.; Pons-Moll, G.; Black, M.J. SMPL: A Skinned Multi-Person Linear Model. ACM Trans. Graph. 2015, 34, 851–866. [Google Scholar]
  96. Zhou, K.; Han, X.; Jiang, N.; Jia, K.; Lu, J. HEMlets PoSh: Learning Part-Centric Heatmap Triplets for 3D Human Pose and Shape Estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 3000–3014. [Google Scholar]
  97. Gilbert, A.; Trumble, M.; Malleson, C.; Hilton, A.; Collomosse, J. Fusing Visual and Inertial Sensors with Semantics for 3D Human Pose Estimation. Int. J. Comput. Vis. 2019, 127, 381–397. [Google Scholar] [CrossRef]
  98. Zheng, C.; Mendieta, M.; Wang, P.; Lu, A.; Chen, C. A lightweight graph transformer network for human mesh reconstruction from 2d human pose. In Proceedings of the 30th ACM international Conference on Multimedia, Lisbon, Portugal, 10–14 October 2022. [Google Scholar]
  99. Choi, H.; Moon, G.; Lee, K.M. Pose2mesh: Graph convolutional network for 3d human pose and mesh recovery from a 2d human pose. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part VII 16. Springer: Berlin/Heidelberg, Germany, 2020. [Google Scholar]
  100. Zhang, Y.; Wang, C.; Wang, X.; Liu, W.; Zeng, W. VoxelTrack: Multi-Person 3D Human Pose Estimation and Tracking in the Wild. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 2613–2626. [Google Scholar]
  101. Zhang, H.; Tian, Y.; Zhou, X.; Ouyang, W.; Liu, Y.; Wang, L.; Sun, Z. Pymaf: 3d human pose and shape regression with pyramidal mesh alignment feedback loop. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021. [Google Scholar]
  102. Xu, X.; Chen, H.; Moreno-Noguer, F.; Jeni, L.A.; De la Torre, F. 3d human shape and pose from a single low-resolution image with self-supervised learning. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part IX 16. Springer: Berlin/Heidelberg, Germany, 2020. [Google Scholar]
  103. Li, J.; Xu, C.; Chen, Z.; Bian, S.; Yang, L.; Lu, C. Hybrik: A hybrid analytical-neural inverse kinematics solution for 3d human pose and shape estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  104. Wang, S.; Leroy, V.; Cabon, Y.; Chidlovskii, B.; Revaud, J. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 11–15 June 2024. [Google Scholar]
  105. Heo, J.; Hu, G.; Wang, Z.; Yeung-Levy, S. DeforHMR: Vision Transformer with Deformable Cross-Attention for 3D Human Mesh Recovery. arXiv 2024, arXiv:2411.11214. [Google Scholar]
  106. Stathopoulos, A.; Han, L.; Metaxas, D. Score-guided diffusion for 3d human recovery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024. [Google Scholar]
  107. Xie, P.; Xu, W.; Tang, T.; Yu, Z.; Lu, C. MS-MANO: Enabling hand pose tracking with biomechanical constraints. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024. [Google Scholar]
  108. Sárándi, I.; Pons-Moll, G. Neural Localizer Fields for Continuous 3D Human Pose and Shape Estimation. Adv. Neural Inf. Process. Syst. 2025, 37, 140032–140065. [Google Scholar]
  109. Cai, Z.; Yin, W.; Zeng, A.; Wei, C.; Sun, Q.; Yanjun, W.; Liu, Z. Smpler-x: Scaling up expressive human pose and shape estimation. Adv. Neural Inf. Process. Syst. 2023, 36, 11454–11468. [Google Scholar]
  110. Choutas, V.; Pavlakos, G.; Bolkart, T.; Tzionas, D.; Black, M.J. Monocular expressive body regression through body-driven attention. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part X 16. Springer: Berlin/Heidelberg, Germany, 2020. [Google Scholar]
  111. Lassner, C.; Romero, J.; Kiefel, M.; Bogo, F.; Black, M.J.; Gehler, P.V. Unite the People: Closing the Loop Between 3D and 2D Human Representations. In Proceedings of the 30th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  112. Kolotouros, N.; Pavlakos, G.; Black, M.; Daniilidis, K. Learning to reconstruct 3D human pose and shape via model-fitting in the loop. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  113. Madadi, M.; Bertiche, H.; Escalera, S. SMPLR: Deep learning based SMPL reverse for 3D human pose and shape recovery. Pattern Recognit. 2020, 106, 107472. [Google Scholar]
  114. Xu, X.; Liu, L.; Yan, S. SMPLer: Taming Transformers for Monocular 3D Human Shape and Pose Estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 3275–3289. [Google Scholar] [PubMed]
  115. Kolotouros, N.; Pavlakos, G.; Daniilidis, K. Convolutional mesh regression for single-image human shape reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  116. Mahmood, N.; Ghorbani, N.; Troje, N.F.; Pons-Moll, G.; Black, M.J. AMASS: Archive of motion capture as surface shapes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  117. Baradel, F.; Armando, M.; Galaaoui, S.; Brégier, R.; Weinzaepfel, P.; Rogez, G.; Lucas, T. Multi-hmr: Multi-person whole-body human mesh recovery in a single shot. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024. [Google Scholar]
  118. Wang, H.; Güler, R.A.; Kokkinos, I.; Papandreou, G.; Zafeiriou, S. BLSM: A bone-level skinned model of the human mesh. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part V 16. Springer: Berlin/Heidelberg, Germany, 2020. [Google Scholar]
  119. Cheng, Y.; Yang, B.; Wang, B.; Yan, W.; Tan, R.T. Occlusion-aware networks for 3d human pose estimation in video. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  120. Xiang, D.; Joo, H.; Sheikh, Y. Monocular total capture: Posing face, body, and hands in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  121. Dwivedi, S.K.; Sun, Y.; Patel, P.; Feng, Y.; Black, M.J. Tokenhmr: Advancing human mesh recovery with a tokenized pose representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 17–21 June 2024. [Google Scholar]
  122. Song, Y.; Zhou, H. 3D Human Mesh Recovery with Learned Gradient. 2024; in press. [Google Scholar]
  123. Patel, P.; Black, M.J. CameraHMR: Aligning People with Perspective. arXiv 2024, arXiv:2411.08128. [Google Scholar]
  124. Zhang, Y.; Nie, L. Human motion similarity evaluation based on deep metric learning. Sci. Rep. 2024, 14, 30908. [Google Scholar]
  125. Rogez, G.; Weinzaepfel, P.; Schmid, C. LCR-Net: Localization-Classification-Regression for Human Pose. In Proceedings of the 30th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 22–25 July 2017. [Google Scholar]
  126. Rogez, G.; Weinzaepfel, P.; Schmid, C. LCR-Net++: Multi-Person 2D and 3D Pose Detection in Natural Images. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 1146–1161. [Google Scholar]
  127. Seo, K.; Cho, H.; Choi, D.; Heo, T. Stereo Feature Learning Based on Attention and Geometry for Absolute Hand Pose Estimation in Egocentric Stereo Views. IEEE Access 2021, 9, 116083–116093. [Google Scholar]
  128. Moon, G.; Chang, J.Y.; Lee, K.M. Camera distance-aware top-down approach for 3d multi-person pose estimation from a single rgb image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  129. El Kaid, A.; Brazey, D.; Barra, V.; Baïna, K. Top-Down System for Multi-Person 3D Absolute Pose Estimation from Monocular Videos. Sensors 2022, 22, 4109. [Google Scholar] [CrossRef]
  130. Shen, T.; Li, D.; Wang, F.-Y.; Huang, H. Depth-Aware Multi-Person 3D Pose Estimation with Multi-Scale Waterfall Representations. IEEE Trans. Multimed. 2023, 25, 1439–1451. [Google Scholar]
  131. Paudel, P.; Kwon, Y.-J.; Kim, D.-H.; Choi, K.-H. Industrial Ergonomics Risk Analysis Based on 3D-Human Pose Estimation. Electronics 2022, 11, 3403. [Google Scholar] [CrossRef]
  132. Zhen, J.; Fang, Q.; Sun, J.; Liu, W.; Jiang, W.; Bao, H.; Zhou, X. Smap: Single-shot multi-person absolute 3d pose estimation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XV 16. Springer: Berlin/Heidelberg, Germany, 2020. [Google Scholar]
  133. Mehta, D.; Sotnychenko, O.; Mueller, F.; Xu, W.; Elgharib, M.; Fua, P.; Theobalt, C. XNect: Real-time Multi-Person 3D Motion Capture with a Single RGB Camera. ACM Trans. Graph. 2020, 39, 17. [Google Scholar]
  134. Vasileiadis, M.; Bouganis, C.-S.; Tzovaras, D. Multi-person 3D pose estimation from 3D cloud data using 3D convolutional neural networks. Comput. Vis. Image Underst. 2019, 185, 12–23. [Google Scholar]
  135. Fabbri, M.; Lanzi, F.; Calderara, S.; Alletto, S.; Cucchiara, R. Compressed volumetric heatmaps for multi-person 3d pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  136. Li, M.; Zhou, Z.; Liu, X. 3D hypothesis clustering for cross-view matching in multi-person motion capture. Comput. Vis. Media 2020, 6, 147–156. [Google Scholar] [CrossRef]
  137. Omran, M.; Lassner, C.; Pons-Moll, G.; Gehler, P.; Schiele, B. Neural Body Fitting: Unifying Deep Learning and Model Based Human Pose and Shape Estimation. In Proceedings of the 6th International Conference on 3D Vision (3DV), Verona, Italy, 5–8 September 2018. [Google Scholar]
  138. Tian, Y.; Hu, W.; Jiang, H.; Wu, J. Densely connected attentional pyramid residual network for human pose estimation. Neurocomputing 2019, 347, 13–23. [Google Scholar]
139. Cheng, Y.; Wang, B.; Tan, R.T. Dual Networks Based 3D Multi-Person Pose Estimation from Monocular Video. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 1636–1651. [Google Scholar] [CrossRef]
  140. Rhodin, H.; Spörri, J.; Katircioglu, I.; Constantin, V.; Meyer, F.; Müller, E.; Fua, P. Learning Monocular 3D Human Pose Estimation from Multi-view Images. In Proceedings of the 31st IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
141. Rhodin, H.; Salzmann, M.; Fua, P. Unsupervised Geometry-Aware Representation for 3D Human Pose Estimation. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  142. Dong, J.; Jiang, W.; Huang, Q.; Bao, H.; Zhou, X. Fast and robust multi-person 3d pose estimation from multiple views. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  143. Tian, L.; Cheng, X.; Honda, M.; Ikenaga, T. Multi-view 3D human pose reconstruction based on spatial confidence point group for jump analysis in figure skating. Complex Intell. Syst. 2023, 9, 865–879. [Google Scholar]
  144. Lupion, M.; Polo-Rodríguez, A.; Medina-Quero, J.; Sanjuan, J.F.; Ortigosa, P.M. 3D Human Pose Estimation from multi-view thermal vision sensors. Inf. Fusion 2024, 104, 102154. [Google Scholar]
  145. Ershadi-Nasab, S.; Kasaei, S.; Sanaei, E. Uncalibrated multi-view multiple humans association and 3D pose estimation by adversarial learning. Multimed. Tools Appl. 2021, 80, 2461–2488. [Google Scholar]
  146. Wang, H.; Sun, M.-H.; Zhang, H.; Dong, L.-Y. LHPE-nets: A lightweight 2D and 3D human pose estimation model with well-structural deep networks and multi-view pose sample simplification method. PLoS ONE 2022, 17, e0264302. [Google Scholar]
  147. Zhang, Y.; Zhang, J.; Xu, S.; Xiao, J. Multi-view human pose and shape estimation via mesh-aligned voxel interpolation. Inf. Fusion 2025, 114, 102651. [Google Scholar]
  148. Wang, J.; Yan, S.; Xiong, Y.; Lin, D. Motion guided 3d pose estimation from videos. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020. [Google Scholar]
  149. Liu, R.; Shen, J.; Wang, H.; Chen, C.; Cheung, S.-C.; Asari, V. Attention mechanism exploits temporal contexts: Real-time 3d human pose reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  150. Dabral, R.; Mundhada, A.; Kusupati, U.; Afaque, S.; Sharma, A.; Jain, A. Learning 3D Human Pose from Structure and Motion. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  151. Li, Z.; Wang, X.; Wang, F.; Jiang, P. On boosting single-frame 3d human pose estimation via monocular videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  152. Pavllo, D.; Feichtenhofer, C.; Grangier, D.; Auli, M. 3d human pose estimation in video with temporal convolutions and semi-supervised training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  153. Hong, J.W.; Yoon, S.; Kim, J.; Yoo, C.D. Joint Path Alignment Framework for 3D Human Pose and Shape Estimation from Video. IEEE Access 2023, 11, 43267–43275. [Google Scholar]
  154. Hossain, M.R.I.; Little, J.J. Exploiting Temporal Information for 3D Human Pose Estimation. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  155. Chen, T.; Fang, C.; Shen, X.; Zhu, Y.; Chen, Z.; Luo, J. Anatomy-Aware 3D Human Pose Estimation with Bone-Based Pose Decomposition. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 198–209. [Google Scholar] [CrossRef]
  156. Zhang, J.; Gong, K.; Wang, X.; Feng, J. Learning to Augment Poses for 3D Human Pose Estimation in Images and Videos. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10012–10026. [Google Scholar] [CrossRef] [PubMed]
  157. Eichner, M.; Marin-Jimenez, M.; Zisserman, A.; Ferrari, V. 2D Articulated Human Pose Estimation and Retrieval in (Almost) Unconstrained Still Images. Int. J. Comput. Vis. 2012, 99, 190–214. [Google Scholar] [CrossRef]
  158. Yang, Y.; Ramanan, D. Articulated Human Detection with Flexible Mixtures of Parts. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 2878–2890. [Google Scholar] [CrossRef]
159. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the 13th European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014. [Google Scholar]
  160. Zhang, W.; Zhu, M.; Derpanis, K.G. From Actemes to Action: A Strongly-supervised Representation for Detailed Action Understanding. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Sydney, Australia, 1–8 December 2013. [Google Scholar]
  161. Sapp, B.; Taskar, B. MODEC: Multimodal Decomposable Models for Human Pose Estimation. In Proceedings of the 26th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 23–28 June 2013. [Google Scholar]
  162. Jhuang, H.; Gall, J.; Zuffi, S.; Schmid, C.; Black, M.J. Towards understanding action recognition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Sydney, Australia, 1–8 December 2013. [Google Scholar]
  163. Li, J.; Wang, C.; Zhu, H.; Mao, Y.; Fang, H.S.; Lu, C. Crowdpose: Efficient crowded scenes pose estimation and a new benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  164. Lin, W.; Liu, H.; Liu, S.; Li, Y.; Xiong, H.; Qi, G.; Sebe, N. HiEve: A Large-Scale Benchmark for Human-Centric Video Analysis in Complex Events. Int. J. Comput. Vis. 2023, 131, 2994–3018. [Google Scholar] [CrossRef]
  165. Andriluka, M.; Pishchulin, L.; Gehler, P.; Schiele, B. 2D Human Pose Estimation: New Benchmark and State of the Art Analysis. In Proceedings of the 27th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 24–27 June 2014. [Google Scholar]
166. Johnson, S.; Everingham, M. Clustered Pose and Nonlinear Appearance Models for Human Pose Estimation. In Proceedings of the British Machine Vision Conference (BMVC), Aberystwyth, UK, 31 August–3 September 2010. [Google Scholar]
  167. Andriluka, M.; Iqbal, U.; Insafutdinov, E.; Pishchulin, L.; Milan, A.; Gall, J.; Schiele, B. PoseTrack: A Benchmark for Human Pose Estimation and Tracking. In Proceedings of the 31st IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  168. Ionescu, C.; Papava, D.; Olaru, V.; Sminchisescu, C. Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 1325–1339. [Google Scholar] [CrossRef]
  169. Mehta, D.; Sotnychenko, O.; Mueller, F.; Xu, W.; Sridhar, S.; Pons-Moll, G.; Theobalt, C. Single-Shot Multi-Person 3D Pose Estimation from Monocular RGB. In Proceedings of the 6th International Conference on 3D Vision (3DV), Verona, Italy, 5–8 September 2018. [Google Scholar]
  170. von Marcard, T.; Henschel, R.; Black, M.J.; Rosenhahn, B.; Pons-Moll, G. Recovering Accurate 3D Human Pose in the Wild Using IMUs and a Moving Camera. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  171. Zhu, L.; Rematas, K.; Curless, B.; Seitz, S.M.; Kemelmacher-Shlizerman, I. Reconstructing nba players. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part V 16. Springer: Berlin/Heidelberg, Germany, 2020. [Google Scholar]
  172. Cao, Z.; Gao, H.; Mangalam, K.; Cai, Q.Z.; Vo, M.; Malik, J. Long-term human motion prediction with scene context. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part I 16. Springer: Berlin/Heidelberg, Germany, 2020. [Google Scholar]
  173. Zhang, Z.; Wang, C.; Qiu, W.; Qin, W.; Zeng, W. AdaFuse: Adaptive Multiview Fusion for Accurate Human Pose Estimation in the Wild. Int. J. Comput. Vis. 2021, 129, 703–718. [Google Scholar]
  174. Duan, H.; Zhao, Y.; Chen, K.; Lin, D.; Dai, B. Revisiting skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21–24 June 2022. [Google Scholar]
  175. Das, S.; Sharma, S.; Dai, R.; Bremond, F.; Thonnat, M. Vpn: Learning video-pose embedding for activities of daily living. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part IX 16. Springer: Berlin/Heidelberg, Germany, 2020. [Google Scholar]
  176. Hbali, Y.; Hbali, S.; Ballihi, L.; Sadgal, M. Skeleton-based human activity recognition for elderly monitoring systems. IET Comput. Vis. 2018, 12, 16–26. [Google Scholar] [CrossRef]
  177. Guo, H.; Yu, Y.; Ding, Q.; Skitmore, M. Image-and-skeleton-based parameterized approach to real-time identification of construction workers’ unsafe behaviors. J. Constr. Eng. Manag. 2018, 144, 04018042. [Google Scholar]
  178. Duan, C.; Hu, B.; Liu, W.; Song, J. Motion Capture for Sporting Events Based on Graph Convolutional Neural Networks and Single Target Pose Estimation Algorithms. Appl. Sci. 2023, 13, 7611. [Google Scholar] [CrossRef]
  179. Swain, D.; Satapathy, S.; Acharya, B.; Shukla, M.; Gerogiannis, V.C.; Kanavos, A.; Giakovis, D. Deep learning models for yoga pose monitoring. Algorithms 2022, 15, 403. [Google Scholar] [CrossRef]
  180. Kulkarni, K.M.; Shenoy, S. Table tennis stroke recognition using two-dimensional human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  181. Giulietti, N.; Caputo, A.; Chiariotti, P.; Castellini, P. SwimmerNET: Underwater 2D Swimmer Pose Estimation Exploiting Fully Convolutional Neural Networks. Sensors 2023, 23, 2364. [Google Scholar] [CrossRef] [PubMed]
  182. Baumgartner, T.; Paassen, B.; Klatt, S. Extracting spatial knowledge from track and field broadcasts for monocular 3D human pose estimation. Sci. Rep. 2023, 13, 1–11. [Google Scholar]
  183. Sajina, R.; Ivasic-Kos, M. 3D Pose Estimation and Tracking in Handball Actions Using a Monocular Camera. J. Imaging 2022, 8, 308. [Google Scholar] [CrossRef]
  184. Qian, X.; He, F.; Hu, X.; Wang, T.; Ramani, K. Arnnotate: An augmented reality interface for collecting custom dataset of 3d hand-object interaction pose estimation. In Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology, Bend, OR, USA, 29 October–2 November 2022. [Google Scholar]
185. Anvari, T.; Park, K.; Kim, G.J. Upper body pose estimation using deep learning for a virtual reality avatar. Appl. Sci. 2023, 13, 2460. [Google Scholar] [CrossRef]
  186. Scherfgen, D.; Schild, J. Estimating the pose of a medical manikin for haptic augmentation of a virtual patient in mixed reality training. In Proceedings of the 23rd Symposium on Virtual and Augmented Reality, Virtual, Brazil, 18–21 October 2021. [Google Scholar]
  187. Willett, N.S.; Shin, H.V.; Jin, Z.; Li, W.; Finkelstein, A. Pose2Pose: Pose selection and transfer for 2D character animation. In Proceedings of the 25th International Conference on Intelligent User Interfaces, Greenville, SC, USA, 18–21 March 2020. [Google Scholar]
  188. Xia, G.; Ma, F.; Liu, Q.; Zhang, D. Pose-Driven Realistic 2-D Motion Synthesis. IEEE Trans. Cybern. 2023, 53, 2412–2425. [Google Scholar]
  189. Seth, A.; James, A.; Mukhopadhyay, S. Wearable Sensing System to perform Realtime 3D posture estimation for lower back healthcare. In Proceedings of the 2021 IEEE International Symposium on Robotic and Sensors Environments (ROSE), Virtual, 28–29 October 2021; IEEE: Piscataway, NJ, USA, 2021. [Google Scholar]
  190. Guo, H.; Liu, X.; Liu, H. Research on Athlete Posture Monitoring and Correction Technology Based on Wireless Sensing and Computer Vision Algorithms. Mob. Netw. Appl. 2024, 2024, 1–12. [Google Scholar]
  191. Chen, W.; Jiang, Z.; Guo, H.; Ni, X. Fall Detection Based on Key Points of Human-Skeleton Using OpenPose. Symmetry 2020, 12, 744. [Google Scholar] [CrossRef]
  192. Baker, B.; Liu, T.; Matelsky, J.; Parodi, F.; Mensh, B.; Krakauer, J.W.; Kording, K. Computational kinematics of dance: Distinguishing hip hop genres. Front. Robot. AI 2024, 11, 1295308. [Google Scholar]
  193. Leroy, V.; Cabon, Y.; Revaud, J. Grounding image matching in 3d with mast3r. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024. [Google Scholar]
Figure 1. The number of articles on human pose estimation in the Web of Science Core Collection since 2016.
Figure 2. Content of this review.
Figure 3. Two-dimensional single-person PE. (a) Regression-based method; (b) heatmap-based method.
Figure 4. Overview of 2D multi-person PE diagram. (a) Top-down approach; (b) bottom-up approach.
Figure 5. Three-dimensional PE methodological framework. (a) Direct estimation; (b) 2D-to-3D conversion.
Figure 6. Commonly used human body models. (a) Skeletal point-based models; (b) contour-based models; (c) volume-based model; (d) hierarchical bone representation; (e) SMPL model.
Figure 7. Overview diagram of 3D multi-view PE.
Table 1. Summary of 2D dataset information: S denotes single person, M denotes multiple persons, and A denotes both.

| Dataset | Year | Size | Single/Multi-Person | Joints | Metrics |
|---|---|---|---|---|---|
| MPII | 2014 | 25K images | A | 16 | PCK/mAP |
| LSP | 2010 | 2K images | S | 14 | PCP/PCK |
| COCO | 2017 | 330K images | M | 17 | AP/AR |
| PoseTrack | 2017 | 46K frames | M | 15 | mAP |
| PoseTrack | 2018 | 46K frames | M | 15 | mAP |
| Penn Action | 2013 | 2326 video clips | S | 13 | PCK |
| FLIC | 2013 | 5K images | S | 10 | PCP/PCK |
| J-HMDB | 2013 | 5K images | S | 15 | PCK |
| CrowdPose | 2017 | 20K images | M | 14 | mAP |
| HiEve | 2017 | 50K frames | M | 14 | mAP |
Table 2. Performance comparison of different methods based on 2D single-person PE on the MPII dataset (PCKh@0.5).

| Method | Head | Shoulder | Elbow | Wrist | Hip | Knee | Ankle | Total |
|---|---|---|---|---|---|---|---|---|
| [36] | 97.5 | 96.2 | 90.8 | 86.6 | 89.3 | 87.1 | 83.4 | 90.4 |
| [33] | 98.2 | 96.3 | 91.2 | 87.1 | 90.1 | 87.4 | 83.6 | 90.9 |
| [38] | 98.3 | 96.5 | 91.8 | 82.9 | 90.7 | 88.4 | 85.2 | 91.6 |
| [39] | 98.5 | 96.7 | 92.5 | 88.7 | 91.1 | 88.6 | 86.0 | 92.0 |
| [37] | 98.1 | 96.2 | 90.9 | 87.2 | 89.8 | 87.3 | 83.5 | 90.8 |
| [42] | 98.1 | 96.7 | 92.5 | 88.4 | 90.8 | 88.8 | 85.3 | 91.8 |
| [34] | 97.1 | 96.2 | 91.6 | 86.1 | 90.4 | 87.7 | 83.9 | 90.9 |
| [43] | 97.1 | 95.9 | 90.4 | 85.1 | 89.1 | 85.8 | 81.5 | 89.8 |
| [49] | 98.5 | 97.3 | 93.9 | 89.9 | 90.2 | 90.6 | 86.8 | 93.0 |
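For reference, the PCKh@0.5 scores above count a predicted joint as correct when its distance to the ground truth is at most half the annotated head-segment length. The following minimal NumPy sketch illustrates the computation; the function name and array shapes are our own illustrative assumptions, not taken from any cited implementation:

```python
import numpy as np

def pckh(pred, gt, head_sizes, alpha=0.5):
    """PCKh: fraction of joints predicted within alpha * head-segment
    length of the ground truth.

    pred, gt:    (N, J, 2) arrays of 2D joint coordinates
    head_sizes:  (N,) per-image head-segment lengths in pixels
    """
    dists = np.linalg.norm(pred - gt, axis=-1)  # (N, J) pixel errors
    thresh = alpha * head_sizes[:, None]        # per-image threshold
    return float((dists <= thresh).mean())
```

The PCK@0.2 scores in Table 3 follow the same recipe, with the threshold taken as 0.2 of a person-scale reference (e.g., torso size) rather than the head segment.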
Table 3. Performance comparison of different methods based on 2D single-person PE on the LSP dataset (PCK@0.2).

| Method | Head | Shoulder | Elbow | Wrist | Hip | Knee | Ankle | Total |
|---|---|---|---|---|---|---|---|---|
| [43] | 97.8 | 95.5 | 94.4 | 92.9 | 94.7 | 95.6 | 94.3 | 95.1 |
| [37] | 98.1 | 93.1 | 89.2 | 86.1 | 92.7 | 92.8 | 90.4 | 91.7 |
| [38] | 98.2 | 93.8 | 90.6 | 89.3 | 93.6 | 94.4 | 93.8 | 88.6 |
| [39] | 98.3 | 94.5 | 92.2 | 88.9 | 94.4 | 95.0 | 93.7 | 93.9 |
| [42] | 98.3 | 94.7 | 92.3 | 89.7 | 94.3 | 95.4 | 94.1 | 92.9 |
Table 4. The performance comparison of different methods based on 2D multi-person PE on the COCO dataset. APM and APL denote the average precision for medium and large targets, respectively.

| Method | AP | AP50 | AP75 | APM | APL | AR |
|---|---|---|---|---|---|---|
| [53] | 70.5 | 88.5 | 76.7 | 64.5 | 79.4 | - |
| [52] | 77.0 | 92.7 | 84.5 | 73.4 | 83.1 | 82.0 |
| [48] | 76.8 | 92.6 | 84.3 | 73.3 | 82.7 | 81.6 |
| [49] | 78.6 | 94.3 | 86.6 | 75.5 | 83.3 | 83.8 |
| [55] | 65.5 | 86.8 | 72.3 | 60.6 | 82.6 | - |
| [56] | 67.6 | 85.1 | 73.7 | 62.7 | 74.6 | - |
| [54] | 71.5 | 89.1 | 78.5 | 67.2 | 18.1 | - |
| [58] | 70.6 | 89.9 | 77.2 | 69.3 | 75.3 | 76.2 |
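The COCO AP/AR values above derive from Object Keypoint Similarity (OKS), which maps keypoint distances to a [0, 1] similarity via a Gaussian falloff normalized by object scale and per-keypoint constants. A simplified sketch follows; it omits the visibility and crowd handling of the official cocoeval, and variable names are our own:

```python
import numpy as np

def oks(pred, gt, visible, area, k):
    """Object Keypoint Similarity for one person instance.

    pred, gt: (J, 2) predicted / ground-truth keypoints
    visible:  (J,) boolean mask of labeled joints
    area:     object segment area (scale squared)
    k:        (J,) per-keypoint falloff constants
    """
    d2 = np.sum((pred - gt) ** 2, axis=-1)          # squared distances
    e = d2 / (2.0 * area * k**2 + np.spacing(1))    # normalized error
    return float(np.exp(-e)[visible].mean()) if visible.any() else 0.0
```

AP is then obtained by matching predictions to ground truth, thresholding OKS from 0.50 to 0.95 in steps of 0.05, and averaging the resulting precisions; APM and APL restrict this evaluation to medium and large instances.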
Table 5. Comparison of different PE methodologies based on 2D videos on the PoseTrack2017 and PoseTrack2018 datasets.

PoseTrack2017

| Method | Head | Shoulder | Elbow | Wrist | Hip | Knee | Ankle | Total |
|---|---|---|---|---|---|---|---|---|
| [63] | 79.5 | 84.3 | 80.1 | 75.8 | 77.6 | 76.8 | 70.8 | 77.9 |
| [65] | 79.9 | 87.6 | 82.8 | 76.7 | 80.7 | 79.4 | 72.8 | 80.0 |
| [66] | 86.1 | 86.1 | 81.8 | 77.4 | 79.5 | 79.1 | 73.6 | 80.9 |

PoseTrack2018

| Method | Head | Shoulder | Elbow | Wrist | Hip | Knee | Ankle | Total |
|---|---|---|---|---|---|---|---|---|
| [63] | 78.9 | 84.4 | 80.9 | 76.8 | 75.6 | 77.5 | 71.8 | 78.0 |
| [65] | 78.8 | 84.8 | 79.8 | 73.2 | 76.2 | 75.6 | 69.9 | 77.2 |
| [66] | 83.6 | 84.5 | 81.4 | 77.9 | 76.8 | 78.3 | 72.9 | 79.6 |
Table 6. Summary of comparison of 3D datasets. S stands for single person; M stands for multiple people. I stands for indoor, O stands for outdoor, and A stands for both.

| Dataset | Year | Size | Environment | Single/Multi-Person | Subjects |
|---|---|---|---|---|---|
| Human3.6M | 2014 | 3.6M frames | I | M | 11 |
| MuCo-3DHP | 2017 | n/a | A | S | 8 |
| 3DPW | 2018 | 60 sequences | O | M | 7 |
| MuPoTs-3D | 2018 | 8K frames | A | A | 8 |
| AMASS | 2019 | 9M frames | A | S | 300 |
| GTA-IM | 2020 | 1M frames | I | S | n/a |
| NBA2K | 2020 | 27K poses | O | S | 27 |
| Occlusion-Person | 2020 | 7.3M frames | I | A | 13 |
Table 7. Single-person PE performance on the Human3.6M dataset (Protocol 1). MPJPE and PA-MPJPE are given in millimeters.

| Method | MPJPE | PA-MPJPE |
|---|---|---|
| [74] | 71.9 | 51.9 |
| [75] | 56.2 | 41.8 |
| [79] | 44.3 | 34.7 |
| [84] | 49.7 | - |
| [86] | 44.1 | 34.7 |
| [87] | 50.1 | 39.3 |
| [90] | 45.6 | 36.2 |
| [148] | 42.6 | 32.7 |
| [149] | 45.1 | 35.6 |
| [155] | 44.1 | 35.0 |
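MPJPE averages per-joint Euclidean distances between predicted and ground-truth 3D joints, while PA-MPJPE first removes a rigid similarity transform via Procrustes analysis so that only articulated pose errors remain. A compact sketch of both metrics, under the assumption of (J, 3) joint arrays in millimeters:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error for (J, 3) arrays (mm)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def pa_mpjpe(pred, gt):
    """MPJPE after Procrustes (similarity) alignment of pred to gt."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(p.T @ g)   # SVD of cross-covariance
    if np.linalg.det(Vt.T @ U.T) < 0:   # guard against reflections
        Vt[-1] *= -1
        S[-1] *= -1
    R = Vt.T @ U.T                      # optimal rotation
    scale = S.sum() / (p**2).sum()      # optimal isotropic scale
    return mpjpe(scale * p @ R.T + mu_g, gt)
```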
Table 8. A comparison of 3D multi-person PE. 3DPCKabs represents the accuracy of absolute 3D keypoints, while 3DPCKrel evaluates the accuracy of relative (root-aligned) 3D keypoints; the 3DPCK columns are evaluated on MuPoTS-3D for all people and for matched people, and the MPJPE columns (Protocols 1 and 2) on Human3.6M.

| Method | 3DPCKrel (All) | 3DPCKrel (Matched) | 3DPCKabs (All) | 3DPCKabs (Matched) | MPJPE (P1) | MPJPE (P2) |
|---|---|---|---|---|---|---|
| [128] | 81.8 | 82.5 | - | - | - | 54.4 |
| [129] | - | - | - | 56.8 | - | - |
| [130] | 75.2 | 83.2 | 39.2 | 39.7 | - | - |
| [133] | 72.1 | 78.0 | - | - | 63.6 | - |
| [135] | - | - | - | - | 43.4 | 49.1 |
| [132] | 73.5 | 80.5 | 35.4 | 38.7 | - | - |
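3DPCKrel counts the fraction of joints that fall within a fixed threshold (150 mm on MuPoTS-3D) after root-relative alignment, whereas 3DPCKabs applies the same threshold to absolute camera-space coordinates. A minimal sketch covering both variants, assuming millimeter units and an illustrative root-joint index:

```python
import numpy as np

def pck3d(pred, gt, thresh=150.0, root=None):
    """3DPCK for (J, 3) poses in mm. With a root index, both poses
    are root-aligned first (3DPCKrel); without, coordinates are kept
    absolute (3DPCKabs)."""
    if root is not None:
        pred, gt = pred - pred[root], gt - gt[root]
    return float((np.linalg.norm(pred - gt, axis=-1) <= thresh).mean())
```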