**1. Introduction**

Depth cameras provide ranging information, in the form of a single depth image or a point cloud, for a variety of applications such as gaming, three-dimensional (3D) reconstruction, and object recognition. Many human-centered tasks based on depth cameras have been investigated in recent years, as shown in Figure 1. For example, 3D human reconstruction recovers a 3D human surface model by finding accurate correspondences between frames [1,2]. 3D segmentation of the human body is the most critical technology in applications such as digital clothing and computer animation [3]. Health monitoring systems use depth information to examine diseased parts of the human body and guide rehabilitation training [4]. Human body size measurement based on depth cameras is a safe, non-contact, and fast measurement method, which overcomes the high cost and bulkiness of electronic scanners [5]. Human behavior recognition, as a fundamental research problem, is an extremely significant and extensively studied subject in computer vision [6]. The ultimate objective of encoding the human body is to extract the various joints of a predefined skeleton in a simplified manner.

Conventional methods for detecting human body joints rely on two-dimensional (2D) images or video taken by traditional cameras. Significant progress has been made with these methods in recent years by leveraging powerful deep learning techniques. However, human pose estimation using only 2D images still faces limitations due to complex backgrounds, variable viewpoints, highly flexible poses, etc. Additional depth information provides enriched 3D data to overcome the limitations of 2D data.

**Citation:** Xu, T.; An, D.; Jia, Y.; Yue, Y. A Review: Point Cloud-Based 3D Human Joints Estimation. *Sensors* **2021**, *21*, 1684. https://doi.org/10.3390/s21051684

Academic Editor: Roberto Vezzani

Received: 28 January 2021; Accepted: 23 February 2021; Published: 1 March 2021



**Figure 1.** Human-centered applications of depth sensors.

The purpose of depth camera-based 3D human pose estimation is to locate the (*x*, *y*, *z*) coordinates of joints in 3D space. Ideally, even when the captured human pose changes, the joints can still be reliably estimated. Figure 2 gives an overview of 3D joints extraction. The first step is to capture human poses with a depth sensor. Since the obtained body data contain redundant information, they must be pre-processed. Next, the pre-processed data are used to calculate the 3D coordinates of the joints with dedicated methods, such as template-based, feature-based, and machine learning-based methods. Finally, error analysis is performed.

Common 3D data formats include the depth map, point cloud, mesh, and voxel grid. A point cloud is a collection of points in 3D space; in addition to the commonly used 3D coordinates, each point can carry other attributes such as color and normal vector. The point cloud can be obtained directly from depth sensors. Compared with the voxel grid, the point cloud requires less storage space and still expresses geometric information well after rotation. Compared with the mesh, the point cloud is easy to acquire, whereas there is no direct way to capture mesh data. Compared with the depth map, the point cloud represents 3D objects more intuitively. Moreover, conversion between the point cloud and the other 3D formats is quite straightforward. The widely used open-source libraries for 3D point cloud processing are mainly the Point Cloud Library (PCL) [7] and Open3D [8]. PCL is a cross-platform C++ library that implements a large number of point cloud-related algorithms and efficient data structures, covering point cloud acquisition, filtering, segmentation, registration, retrieval, feature extraction, recognition, tracking, surface reconstruction, visualization, etc. It supports multiple operating systems, including Windows, Linux, Android, Mac OS X, and some embedded real-time systems. Open3D is a modern library that supports rapid software development for 3D data processing. Its data structures and algorithms are exposed in both C++ and Python; its core features include 3D data structures, 3D data processing algorithms, scene reconstruction, and 3D visualization. PCL is more mature, with a large number of data structures and algorithms for 3D data processing; in contrast, Open3D can be installed and used in a Python environment, which makes programming faster and simpler.
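As an illustration of how straightforward the format conversion mentioned above is, the following minimal sketch back-projects a depth map into an *N* × 3 point cloud with the standard pinhole camera model. The function name and the intrinsic parameters (`fx`, `fy`, `cx`, `cy`) are illustrative placeholders, not part of any specific library API; Open3D and PCL provide their own conversion utilities.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map (H x W, in metres) to an N x 3 point cloud.

    Pinhole model: X = (u - cx) * Z / fx,  Y = (v - cy) * Z / fy,  Z = depth.
    Pixels with zero (invalid) depth are dropped.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]
```

The resulting array can be wrapped directly in a library point cloud object (e.g., via Open3D's `o3d.utility.Vector3dVector`) for filtering, registration, or visualization.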

**Figure 2.** The flow diagram of point cloud-based 3D joints extraction.

The aim of this survey is to provide a comprehensive overview of point cloud-based 3D human joints extraction. It mainly focuses on publications on point cloud-based joint estimation for the human body in computer vision, where the point cloud of the human body is captured directly by 3D ranging devices. The covered content includes the relevant pose datasets, the methods of point cloud-based joint extraction, and the applications of point cloud-based joint extraction. The summarized methods are shown in Figure 3, including template-based methods, feature-based methods, and machine learning-based methods. Compared with the existing surveys, the main contributions of this review include:


**Figure 3.** Categorization of the methods for 3D joints extraction based on point cloud. SoG: sum of spatial Gaussians; SMPL: skinned multi-person linear; CNN: convolutional neural network; RF: random forest; RTW: random tree walk.

The remainder of this review is organized as follows: Section 2 introduces devices for range detection; Section 3 reviews the existing point cloud-based methods of human joints extraction; Section 4 enumerates the 3D human datasets; Section 5 discusses various applications of point cloud-based joint extraction; Section 6 concludes this survey with potential future research directions.

#### **2. Depth Sensors for 3D Data Acquisition**

According to the ranging principle, we divide depth sensors into three categories, as shown in Figure 4: binocular stereo vision, time-of-flight (ToF), and structured light. Early research focused on passive methods, such as binocular stereo vision, to calculate depth information. Typically, two cameras photograph the object from different perspectives, a mechanism similar to human binocular vision. However, the result is easily affected by the texture of the object, and the stereo matching (registration) process is time-consuming. Compared with passive ranging, rapid active ranging shows obvious advantages. On one hand, radar and Lidar are commonly used by the military to reconnoiter battlefields in various environments, and they are also employed for obstacle detection in autonomous driving. On the other hand, with the commercial availability of low-cost depth cameras, research based on depth cameras has gradually unlocked new applications in mobile and intelligent terminal devices.
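The binocular stereo principle above reduces, in the ideal rectified case, to the classic triangulation relation *Z = f · B / d*, where *f* is the focal length in pixels, *B* the baseline between the two cameras, and *d* the disparity. The sketch below is a minimal illustration with made-up camera parameters, not code from any stereo library.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Ideal rectified stereo triangulation: Z = f * B / d.

    focal_px:     focal length in pixels
    baseline_m:   distance between the two camera centres, in metres
    disparity_px: horizontal shift of the matched pixel, in pixels
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px
```

The formula also shows why stereo depth degrades with range: depth error grows quadratically with *Z* because distant points produce very small disparities.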

**Figure 4.** Classification of depth sensors for 3D data acquisition.

The ToF depth camera first continuously sends light pulses toward the detected object, and a sensor then receives the light returned from the object. The distance is calculated from the round-trip flight time of the detected light pulse. According to the modulation method, ToF sensors are divided into two types: direct time-of-flight (dToF) and indirect time-of-flight (iToF). The lighting units of dToF generally use LEDs or lasers, including laser diodes and vertical cavity surface emitting lasers (VCSELs), to emit high-performance pulsed light; the sensor directly measures the time difference between emission and reception and multiplies it by the speed of light to obtain the relative distance of the object. The receiver must be a special sensor with very high timing accuracy, which makes cost reduction and miniaturization difficult. The light emitted by an iToF sensor is modulated as a continuous wave whose intensity changes periodically; depending on the detection range, the modulation can use pulsed light or continuous light. The iToF depth camera compares the phase difference between the emitted signal and the reflected signal, and then multiplies it by the speed of light to get the relative distance. The response speed of the iToF receiver is not as fast as that of dToF, and it cannot accurately resolve sub-nanosecond time differences. Fundamentally, radar and Lidar are also ToF sensors: the former uses millimeter waves and has stronger anti-interference ability, while the latter emits laser signals and achieves higher detection accuracy. Structured light cameras project invisible pseudo-random light spots onto the detected object through an infrared (IR) emitter. According to the type of light pattern produced, they can be divided into speckle structured light, fringe structured light, and coded structured light cameras.
The projected light spots have a unique, known spatial distribution, which is pre-stored in the structured light system's memory. The size and shape of the speckles projected on the observed object vary with the distance and orientation of the object relative to the camera. The captured spots are compared with the known pattern, and the depth information is then obtained. Different depth cameras can be selected according to specific parameters; a detailed comparison is shown in Table 1.
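The two ToF principles described above can be summarized in two one-line formulas: dToF measures the round-trip time directly (distance = *c·t*/2), while continuous-wave iToF recovers distance from the phase shift of the modulated signal (distance = *c·Δφ* / (4π·*f*<sub>mod</sub>)). The sketch below is a numerical illustration of these textbook relations, not vendor code; the modulation frequency used in the usage note is a typical but assumed value.

```python
import math

C = 299_792_458.0  # speed of light in vacuum, m/s

def dtof_distance(round_trip_s):
    """dToF: distance = (round-trip time * speed of light) / 2."""
    return C * round_trip_s / 2.0

def itof_distance(phase_rad, mod_freq_hz):
    """Continuous-wave iToF: distance = c * phase_shift / (4 * pi * f_mod).

    Unambiguous only within half a modulation wavelength (phase wraps at 2*pi).
    """
    return C * phase_rad / (4.0 * math.pi * mod_freq_hz)
```

For example, a 20 ns round trip corresponds to roughly 3 m in dToF, and the iToF formula makes the trade-off explicit: raising the modulation frequency improves resolution but shortens the unambiguous range *c* / (2·*f*<sub>mod</sub>).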



#### **3. Methods of Point Cloud-Based Joint Estimation**

This section describes several methods for human joint extraction. Existing surveys have made qualitative comparisons of joint extraction techniques from the perspective of RGB images or depth maps. We limit our attention to point cloud-based methods, which are mainly divided into feature-based methods, template-based methods, and machine learning-based methods. Each category is discussed in its own subsection.
