## *3.1. Template-Based Methods*

The human body is a flexible and complex object with many specific attributes, such as articulated structure, body shape, surface texture, body parts, and joint positions. A mature human model need not contain all of these attributes; rather, it should suit the specific task of composing and describing human poses. The template-based method is intuitive and simple: it judges the motion category by comparing the similarity between the detected object and a constructed template. Existing template-based algorithms can be roughly divided into three categories according to their principles: geometric models, mathematical models, and mesh models. When the selected model is matched with the observed point cloud of the human body, the joints are regarded as the connections between rigid parts to achieve pose estimation. A more complex model usually has more characteristic parameters, so it can approximate the human body better and achieve improved realism and accuracy.

#### 3.1.1. Geometric Model

In general, a geometric model roughly divides the human body into several parts. Each part can be regarded as rigid and then fitted by 3D geometric shapes, including generalized cylinders, ellipses, and rectangles. Knoop et al. [9] proposed a new method of fusing different input signals for human tracking. The algorithm can process 2D and 3D input data from different sensors (such as ToF cameras, stereo, or monocular images). For the tracking system, a 3D human model was built from several parts, each represented by a degenerate cylinder. The top and bottom of each cylinder can be regarded as ellipses; the two ellipses cannot rotate relative to each other, and their planes remain parallel. Therefore, a cylinder was described by five parameters: the major and minor axes of the two ellipses, together with the length of the cylinder. The entire 3D human model was composed of 10 cylinders, with the torso as the root node from which the other parts extend. Each child node was described by a degenerate cylinder and the corresponding transformation relative to its parent node.
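The kinematic tree of degenerate cylinders can be sketched as a small data structure; the five geometry fields per part follow the description above, while the class design, attachment API, and numeric dimensions are illustrative assumptions rather than details of [9].

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DegenerateCylinder:
    """One body part: two parallel, mutually non-rotating ellipses joined over a length."""
    top_major: float       # major axis of the top ellipse
    top_minor: float       # minor axis of the top ellipse
    bottom_major: float    # major axis of the bottom ellipse
    bottom_minor: float    # minor axis of the bottom ellipse
    length: float          # distance between the two ellipse planes
    parent: Optional["DegenerateCylinder"] = None
    children: List["DegenerateCylinder"] = field(default_factory=list)

    def attach(self, child: "DegenerateCylinder") -> "DegenerateCylinder":
        """Record the parent-child transformation relationship of the kinematic tree."""
        child.parent = self
        self.children.append(child)
        return child


# Torso as the root node; head and limbs extend as child cylinders
# (all dimensions are placeholder values in meters).
torso = DegenerateCylinder(0.20, 0.12, 0.18, 0.11, 0.55)
head = torso.attach(DegenerateCylinder(0.09, 0.08, 0.09, 0.08, 0.22))
upper_arm_l = torso.attach(DegenerateCylinder(0.05, 0.04, 0.04, 0.035, 0.30))
```

Each part thus carries exactly the five shape parameters named above, and pose is propagated from the torso outward through the parent links.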

The Head–Neck–Trunk (HNT) deformable template, represented by circles, trapezoids, and rectangles, was proposed in 2011 [10]. Once the HNT template matched, the limbs (i.e., two arms and two legs) were detected and fitted with rectangles. The end points of the rectangles were regarded as the joints of the human body, and the depth information was used to determine whether the human body was in a self-occlusion state. When self-occlusion occurred, part segmentation of the human body was triggered, and the segmented limbs were then fitted separately. Otherwise, the joints of the human body were directly obtained from the contact points between the geometric shapes and the end points. Suau et al. [11] proposed a fast method to localize five joints of the human body on the point cloud. In this method, a geometric deformation model established by basic curve-evolution theory and the level set method [12] was adopted to propagate the topological structure of the human body. Additionally, the Narrow Band Level Set (NBLS) method [13] was extended to filter the 2.5D data according to its physical area. To maintain connectivity on the depth surface and facilitate the extraction of topological features, the calculated NBLS map was filled, and finally the geodesic distance was used to quickly locate the five end points corresponding to the five extreme joints. Lehment et al. [14] used an ellipsoid model for the upper body with nine basic body modules, including the left/right upper arms, lower arms, hands, head, torso, and neck. Since the ellipsoid is the 3D analogue of the ellipse, it is easy to generate and control. Even with no clue about color or texture, it can still find the nearest neighbor points in the input point cloud to calculate the similarity of the likelihood function.

Unlike the above methods, Sigalas et al. [15] used 2D information to estimate the pose of the 3D torso. Initial face identification in the 2D image was used to segment the area of the human body from the background. Based on illumination-, scale-, and pose-invariant features, the 2D silhouette of the body was extracted, and curve analysis was then performed to initially hypothesize the region of the shoulders. Meanwhile, the 3D information of the point cloud was meshed to estimate the 3D shoulder coordinates. The ellipsoid model was finally fitted to the torso area of the human body by least-squares optimization; a set of anthropometric standards was also applied to further refine the 3D torso pose.

In order to increase the robustness of the algorithm, Sigalas et al. [16] further demonstrated a multi-person tracking system, as shown in Figure 5, comprising human segmentation and pose tracking. Human segmentation detected multiple human bodies by face detection in the depth map, each human body was segmented individually, and the length information of each part was calculated at the same time. Pose tracking first defined a body model with a head, upper and lower torsos, arms, and legs. Ellipses and circles were used to represent the upper and lower torsos, while cylinders were used to fit the remaining parts. Additionally, each part had a length limit. The point cloud of the human body obtained by the depth camera was rotated to the top view, and the reprojection ratio *freproj* in Equation (1) was introduced as a matching index.

$$f\_{reproj} \;= \frac{N\_{Pr}}{N\_{3D}} \tag{1}$$

**Figure 5.** A geometric model can be applied to human joint estimation. (**a**) includes the human body segmentation and depth-based ordering and (**b**) includes the pose recovery and tracking. Figure from [16].

Multiple views, including occluded and non-occluded ones, can be generated by rotating the cylinder model around the *x*-axis. For each view of the cylinder, the corresponding reprojection ratio of visible points *NPr* to the total number of 3D points *N3D* in the point cloud was calculated. Its value varied with the view and reached its minimum in the top view of the cylinder, when the view axis of the camera was aligned with the long axis of the cylinder.
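A minimal sketch of this view sweep, assuming a simple front-facing visibility proxy for deciding which points reproject as visible (the real system in [16] derives visibility from the rendered cylinder model):

```python
import numpy as np


def reprojection_ratio(points_3d: np.ndarray, visible_mask: np.ndarray) -> float:
    """f_reproj = N_Pr / N_3D: visible reprojected points over all 3D points."""
    return int(visible_mask.sum()) / len(points_3d)


def rotate_x(points: np.ndarray, angle: float) -> np.ndarray:
    """Rotate a point cloud about the x-axis by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])
    return points @ R.T


# Sweep candidate views: the best cylinder alignment minimizes f_reproj.
points = np.random.default_rng(0).normal(size=(500, 3))
ratios = []
for angle in np.linspace(0.0, np.pi, 8):
    view = rotate_x(points, angle)
    visible = view[:, 2] > 0.0        # crude proxy: points facing the camera
    ratios.append(reprojection_ratio(view, visible))
best_view = int(np.argmin(ratios))    # view whose axis best aligns with the cylinder
```

The index of the minimum ratio identifies the rotation that brings the camera's view axis closest to the long axis of the cylinder.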

Based on the T-pose, Wu et al. [17] created a simplified human skeleton model with customized parameters to adapt to different body types. The depth image and the corresponding 3D point cloud, as a pair of inputs, were first pre-processed and initialized to obtain personalized parameters. The torso part was detected on the binarized image, its centroid was calculated as the root node, and the root node was then used as the parent node to iteratively find the other child nodes. After obtaining the lengths between nodes, the human skeleton information was obtained by matching against the skeleton model and further optimizing the joint angles. Besides, this method used threshold segmentation to solve the self-occlusion problem.

#### 3.1.2. Mathematical Model

Mathematical models mainly transfer conceptual knowledge commonly used in mathematics to model construction. The basic idea is to build a model as a probability distribution, listing each possible result and giving its probability. A significant amount of work has been accomplished using the Gaussian Mixture Model (GMM) in recent years. A GMM establishes a mixture of multiple Gaussian distributions for each pixel in the image. The parameters of the model are continuously updated according to the observed images, and background estimation is performed at the same time. Based on the GMM, an algorithm [18] using a single depth camera was proposed to estimate the pose and shape of the human body in real time. Due to its probabilistic measurement, it did not require explicit point correspondences. The articulated deformation model, based on exponential maps, can be directly embedded into the GMM. However, this algorithm simply used the first few frames to acquire the human pose in dynamic scenes, which usually did not provide complete information. To cope with the time-varying articulated human body shape, Xu et al. [19] applied a GMM to model the pose and shape of the observed user. This method obtained the correspondence between the model and the user, and realized human shape estimation based on multiple RGB-D sensors without any prior information. Compared with the single-view case, depth data from multiple RGB-D sensors can not only handle more complex poses, especially occlusion situations, but can also be used to achieve different types of shape estimation by changing body attributes such as height, weight, or other physical characteristics. Ge et al. [20] constructed a new non-rigid joint registration framework for human pose estimation by improving two recent registration techniques: Coherent Point Drift (CPD) and Articulated Iterative Closest Point (AICP).
The GMM model was applied to initialize the standard pose of the human body through the CPD, and then AICP was employed with other pose point clouds to complete the pose estimation task. In the follow-up work, for incomplete data caused by self-occlusion and view changes, an effective pose tracking strategy was introduced to process continuous depth data [21,22]. Each new frame initialized a new template, which effectively reduced the ambiguity and uncertainty in the process of visible point extraction.
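As an illustration of the core operation behind these GMM-based methods, a minimal EM loop fitting an isotropic mixture to a 3D point cloud might look as follows; the component count, deterministic initialization, and isotropic covariance are simplifying assumptions, not the cited algorithms.

```python
import numpy as np


def fit_gmm(points: np.ndarray, k: int = 2, iters: int = 50):
    """Minimal EM for an isotropic GMM over a 3D point cloud (illustrative only)."""
    n, d = points.shape
    mu = points[np.linspace(0, n - 1, k).astype(int)].copy()  # means spread over the data
    var = np.full(k, points.var())                            # one variance per component
    pi = np.full(k, 1.0 / k)                                  # mixing weights
    for _ in range(iters):
        # E-step: responsibilities under isotropic Gaussians (log-domain for stability)
        sq = ((points[:, None, :] - mu[None]) ** 2).sum(-1)           # (n, k)
        log_p = -0.5 * sq / var - 0.5 * d * np.log(2 * np.pi * var) + np.log(pi)
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and per-component variances
        nk = resp.sum(0)
        pi = nk / n
        mu = (resp.T @ points) / nk[:, None]
        sq_new = ((points[:, None, :] - mu[None]) ** 2).sum(-1)
        var = (resp * sq_new).sum(0) / (d * nk)
    return pi, mu, var


# Two well-separated synthetic blobs should recover their centers.
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0.0, 0.1, (200, 3)), rng.normal(3.0, 0.1, (200, 3))])
pi, mu, var = fit_gmm(pts)
```

The soft responsibilities computed in the E-step are what make explicit point correspondences unnecessary, which is the property [18] exploits for registration.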

Stoll et al. [23] proposed Sums of spatial Gaussians (SoG), which used a quadtree to gather image pixels with similar color values into larger squares, and demonstrated remarkable performance on 2D data. Each square was represented by a Gaussian function, and a set of such isotropic Gaussian components constituted the SoG. Inspired by SoG, Ding et al. [24] presented the Generalized SoG (G-SoG), which used anisotropic Gaussian functions with less computation to represent the entire human body. On this basis, they extended SoG to a 3D representation by grouping 3D parts of the point cloud with similar depth into voxels. The 3D Gaussian model only contained spatial statistics, not color information.

Both SoG and G-SoG involve pose tracking of different characters. The former represents observed point cloud through effective octree division, and the latter embeds a quaternion-based articulated skeleton to create a standard human template model. A single un-normalized 3D Gaussian *G* can be expressed as Equation (2):

$$G(\mathbf{x}) \;= \exp\left(-\frac{||\mathbf{x} - \boldsymbol{\mu}||^2}{2\sigma^2}\right),\tag{2}$$

where *x* is 3D coordinates, *μ* and *σ*<sup>2</sup> are the mean and the variance, respectively. SoG has the form as Equation (3),

$$K(\mathbf{x}) \;= \sum\_{i=1}^{n} G\_{i}(\mathbf{x}) \tag{3}$$

When a 3 × 3 covariance matrix is introduced into Equation (2) to replace the variance *σ*<sup>2</sup>, the anisotropic Gaussian in Equation (4) is obtained:

$$\mathbf{G}(\mathbf{x}) \;= \exp(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \begin{bmatrix} \mathbf{C}\_{11} & \mathbf{C}\_{12} & \mathbf{C}\_{13} \\ \mathbf{C}\_{12} & \mathbf{C}\_{22} & \mathbf{C}\_{23} \\ \mathbf{C}\_{13} & \mathbf{C}\_{23} & \mathbf{C}\_{33} \end{bmatrix} (\mathbf{x} - \boldsymbol{\mu})) \tag{4}$$

SoG was used to represent the point cloud of the human body, which was then registered with the G-SoG body template for human tracking. An energy function in Equation (5), including a similarity term, a continuity term, and a visibility term, describes the similarity between G-SoG and SoG [25]:

$$\hat{\theta} = \arg\min\_{\theta} \sum\_{i \in K\_m} -E\_{\text{sim}}^{(i)}(\theta) \cdot \text{Vis}(i) + \lambda\_{\text{con}} E\_{\text{con}}(\theta), \tag{5}$$

where the first term emphasizes the similarity of the two models, *Vis* gives the visible state of each Gaussian function, and the second term is added to smooth the pose estimation. In addition, the overlaps of the Gaussian functions on the 2D projection plane were used to judge whether occlusion is present.
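Equations (2)–(4) can be checked numerically with a short sketch. Note that when the 3 × 3 matrix in the quadratic form of Equation (4) is chosen as (1/σ²)I, the anisotropic form reduces exactly to the isotropic Gaussian of Equation (2); the sample values below are illustrative.

```python
import numpy as np


def gaussian_iso(x: np.ndarray, mu: np.ndarray, sigma2: float) -> float:
    """Un-normalized isotropic 3D Gaussian, Eq. (2)."""
    return float(np.exp(-np.sum((x - mu) ** 2) / (2.0 * sigma2)))


def sog(x: np.ndarray, mus, sigma2s) -> float:
    """Sum of spatial Gaussians, Eq. (3)."""
    return sum(gaussian_iso(x, m, s) for m, s in zip(mus, sigma2s))


def gaussian_aniso(x: np.ndarray, mu: np.ndarray, C: np.ndarray) -> float:
    """Anisotropic Gaussian with a 3x3 matrix C in the quadratic form, Eq. (4)."""
    d = x - mu
    return float(np.exp(-0.5 * d @ C @ d))


x = np.array([0.1, 0.0, 0.0])
mu = np.zeros(3)
iso = gaussian_iso(x, mu, sigma2=0.04)
# C = (1/sigma^2) * I reproduces the isotropic case of Eq. (2)
aniso = gaussian_aniso(x, mu, C=np.diag([25.0, 25.0, 25.0]))
mixture = sog(x, [mu, np.ones(3)], [0.04, 0.04])
```

Off-diagonal entries of C stretch the Gaussian along arbitrary directions, which is what lets G-SoG cover elongated body parts with fewer components.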

Based on the previous work, the authors expanded their framework and proposed an articulated and generalized Gaussian kernel correlation (GKC)-based system [26], as shown in Figure 6, which supported subject-specific shape modeling and articulated pose estimation for the whole body and hands.

**Figure 6.** A conceptual scheme of mathematical model-based human joint estimation. Articulated pose estimation for the full body (**a**) and hand (**b**). The 1st row shows the Sums of spatial Gaussians (SoG) -based template models and an observed point cloud. Their corresponding Gaussian kernel density maps are depicted in the 2nd row, followed by the pose estimation results in the 3rd row. Figure from [26].

Apart from the methods based on the Gaussian distribution, Ganapathi et al. advanced a real-time tracking algorithm based on Maximum A Posteriori (MAP) inference in a probabilistic temporal model, in which the human pose of each part was updated with the Iterative Closest Point (ICP) algorithm. A two-stage method was proposed to solve the problem of recovering the human pose from a single depth map [27]. In the first stage, a coarse template found in a large model dataset was used for skeleton deformation; in the second stage, the detailed parts of the human shape were restored by the Stitched Puppet model [28] to fit the deformed model.

#### 3.1.3. Mesh Model

The mesh model represents real-world surfaces with many small polygonal patches. Through a parameterized human body model, the structure has a specific outer surface in addition to the skeleton, which reflects the 3D appearance of the human body. Such surface details make it convenient to judge whether self-occlusion occurs.

Ye et al. [29] built a fast pose detection system. After segmentation and denoising, the human point cloud was aligned with a series of mesh models, and the invisible parts were filled in during the alignment process. Next, the shape and pose were deformed for fine-tuning. For point clouds that failed to register, the best alignment model was searched again to complete the joint extraction process. Grest et al. [30] used the ICP algorithm with nonlinear optimization techniques to align the mesh model and the human point cloud. Also using the ICP algorithm, Park et al. [31] recorded and processed multiple depth point clouds of a single person from different perspectives to capture the shape of the entire body. Using template matching and Principal Component Analysis (PCA), a statistical body model representing a variety of human shapes and poses can be generated. PCA shortened the search time by projecting the data into a low-dimensional Principal Component (PC) space. The ICP algorithm was adopted to fit the subject-specific human body model to the depth data frame by frame, so that the accuracy of the original joint positions estimated by the Software Development Kit (SDK) was improved.

Hesse et al. [32] used a combination of a texture model and a random forest to classify body parts, and the positions of the human joints were estimated from these parts. Vasileiadis et al. [33] used 3D Signed Distance Function (SDF) data to represent the model, extended by a supplementary mechanism to track the pose of the human body in depth sequences. In an actual multi-person interaction scene, depth data of the human body from different perspectives were collected [34], and the mesh model was used for fitting to eliminate contact joint errors. A new unsupervised framework was proposed to eliminate the influence of noise [35]. The method consisted of three steps: first, the deformed model and the human point cloud were registered with a non-rigid point method to establish point correspondences; then, the skeleton structure was extracted from the new point-set sequence based on clustering; finally, Linear Blend Skinning (LBS)-based joint learning refined the positions. Huang et al. [36] estimated the joint positions by fitting a reference surface model, which included a reference triangle mesh surface and an inherent tree-shaped skeleton. Walsman et al. [37] utilized mesh templates to track human pose in real time and reconstructed high-resolution surface silhouettes, facilitating gesture recognition and motion prediction using commercial depth sensors and GPU hardware.

Among all the mesh models, the Skinned Multi-Person Linear (SMPL) model is a prominent parametric human body model that supports arbitrary shape modeling and animation. It can simulate the bulges and depressions of human muscles during limb movement; therefore, surface distortion of the human body during motion can be avoided, and the stretching and contraction of human muscles can be accurately described. Zhou et al. [38] employed MobileNet to build a 2D human skeleton model, which facilitated the initialization of the point cloud. The customized SMPL model was then fitted to the observed point cloud, and the error between the SMPL model and the observed point cloud was gradually reduced by minimizing a loss function.

In order to enhance the generalization ability of the model, Joo et al. [39] designed a unified deformable model, "Frank", to capture human motion at multiple scales without markers, including facial expressions, body motions, and gestures. Figure 7 illustrates the main components of Frank. Each part is fitted with FaceWarehouse [40], the SMPL model, and an artist-defined hand model, respectively. Finally, the three models are spliced together to capture human body motion and subtle facial expressions. The seamless mesh *V<sup>U</sup>* in Equation (6) denotes the motion and shape of the target subject. The main components of the Frank model *M<sup>U</sup>* are the motion parameters *θ<sup>U</sup>*, the shape parameters *ϕ<sup>U</sup>*, and the global transformation *t<sup>U</sup>*.

$$V^{U} \;= M^{U}\left( \theta^{U}, \phi^{U}, t^{U} \right) \tag{6}$$

**Figure 7.** A mesh model is used to fit the point cloud of the human body. (**a**) SMPL model; (**b**) FaceWarehouse model; and (**c**) artist-defined hand model. In (**a**–**c**), the red dots represent 3D positions of the corresponding keypoints reconstructed by detectors; (**d**) Body model; (**e**) Face and hand models are aligned with the corresponding parts of the body model; and (**f**) The whole Frank model. Figure from [39].

The motion parameters *θ<sup>U</sup>* in Equation (7) and shape parameters *ϕ<sup>U</sup>* in Equation (8) of the Frank model are the combination of each sub-model parameters,

$$\theta^{U} = \left\{ \theta^{B}, \,\theta^{F}, \,\theta^{LH}, \,\theta^{RH} \right\}, \tag{7}$$

$$\phi^{U} = \left\{ \phi^{B}, \,\phi^{F}, \,\phi^{LH}, \,\phi^{RH} \right\}, \tag{8}$$

where *B* denotes the SMPL body model, *F* the face model, and *LH* and *RH* the left- and right-hand models, respectively. The motion parameters *θ<sup>U</sup>* mainly express the overall motion pose of the human body, including the relative angle information of the joints, and the shape parameters *ϕ<sup>U</sup>* are defined as the ratios of length, width, and height.

The next step is to merge the model with the point cloud. There are two cases: one where corresponding points can be found between the model and the point cloud, and one where the correspondence is not obvious. In the first case, 2D detection was first performed to find the corresponding keypoints in each sub-region, which were then converted to 3D space. In the second case, the ICP algorithm was used to register the point cloud with the model. The final objective function can be written as Equation (9):

$$E\left(\theta^{U}, \phi^{U}, t^{U}\right) \;= E\_{keypoints} + E\_{icp} + E\_{seam} + E\_{prior} \tag{9}$$

where *Ekeypoints* accounts for the 3D keypoint detections and *Eicp* expresses the cost of the ICP algorithm. The skeleton hierarchy of the Frank model was closely connected; however, the independent surface parameterizations of the sub-models may introduce discontinuities at the boundaries. To avoid this artifact, the difference *Eseam* between the vertices of the last two rings around each seam was minimized. Because the SMPL and FaceWarehouse models did not capture hair and clothes, the full body could not be explained well by the model, which resulted in incorrect registration during ICP. Hence, *Eprior* was placed on the model parameters to avoid overfitting the model to these noise sources. Furthermore, a new model, Adam, was derived to better capture the rough geometry of the human body with clothes and hair, matching the geometry of the source data more accurately.

The method above showed that markerless motion capture has the potential to eventually surpass marker-based capture. The marker-based method is very susceptible to occlusion, which makes it difficult to capture the details of the body and hands at the same time. This work can not only handle the occlusion problem but also achieve higher-precision model fitting.

## *3.2. Feature-Based Methods*

The global feature is a common basis of feature-based methods and refers to the overall attributes of an object; common examples include color, texture, and shape features. Because it is a low-level visual feature at the pixel level, the global feature offers low variance, simple calculation, and intuitive representation. Among global features, the geodesic distance and geometric features are commonly used in point cloud-based applications.

#### 3.2.1. Geodesic Distance

Geodesic distance is literally the shortest path distance between two points along a surface, which differs from the Euclidean distance usually used in geometric space: the Euclidean distance is the shortest distance between two points in space, whereas the geodesic distance is the length of the shortest path between two points along the surface of the object. To solve the shortest-path problem, Dijkstra's algorithm [41], a greedy algorithm, is commonly used. After specifying the starting and ending points, the algorithm always visits the unvisited node closest to the starting point in each step of the loop, thereby gradually obtaining the shortest distance between the two points.
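A compact sketch of this procedure on a point cloud: build a k-nearest-neighbor graph as a surface approximation, then run Dijkstra's algorithm from a seed point. The graph construction and k value are assumptions for illustration; real systems build more careful surface graphs.

```python
import heapq

import numpy as np


def knn_graph(points: np.ndarray, k: int = 8):
    """Adjacency list connecting each point to its k nearest neighbors."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    order = np.argsort(d2, axis=1)
    # order[i, 0] is the point itself, so neighbors start at index 1
    return [[(int(j), float(np.sqrt(d2[i, j]))) for j in order[i, 1:k + 1]]
            for i in range(len(points))]


def dijkstra(adj, source: int) -> np.ndarray:
    """Geodesic (shortest-path) distances from `source` to every node."""
    dist = np.full(len(adj), np.inf)
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                      # stale heap entry
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


# A geodesic extreme is simply the farthest reachable node from a seed point.
rng = np.random.default_rng(0)
pts = rng.normal(size=(200, 3))
adj = knn_graph(pts)
geo = dijkstra(adj, source=0)
extreme = int(np.nanargmax(np.where(np.isinf(geo), np.nan, geo)))
```

Seeding from the body centroid and repeatedly taking the farthest node is the basic recipe the methods below use to find the head, hands, and feet.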

Krejov et al. [42] located and separated the left and right hands in the image domain, and then processed each hand in parallel to build a weighted graph on the surface. An efficient Dijkstra's algorithm was utilized to traverse the entire graph to find *N* candidate fingertips. With this shortest-path algorithm, multi-touch interaction among multiple users was realized. Phan et al. [43] proposed an online multi-view voting scheme (MVS) running at an interactive rate. It combined the measurement results from multiple sources to generate a fine geodesic distance graph (GDG), and five geodesic extremes in the GDG were then marked as the head, hands, and feet. Assuming that the length of each bone is determined in advance, additional landmarks were obtained by calculating the centroid of each region, corresponding to the secondary joints of the wrist, elbow, knee, ankle, and neck. To overcome the errors caused by misdetection and occlusion, an improved method using feature-point trajectories to correct erroneous detections was designed [44]. Five extreme points were detected by the geodesic distance method, and a shoulder template was applied to search for the positions of the shoulders. Once the shoulder joints were determined, their geometric midpoint was initially regarded as the elbow position, and an iterative search was then used to refine the elbow point by minimizing the total geodesic distance from the shoulder point to the hand point through the elbow point. Besides, a minimum distance constraint was imposed afterward in the correspondence recognition to predict the most likely spatial position of each joint in the next frame, so that the trajectory of each joint could be tracked. To detect and identify body parts in the depth data at the video frame rate, a new interest-point detector on point cloud data was proposed [45].
First, the extreme points were detected using the geodesic distance and were further classified into hands, feet, or head using local shape descriptors, and the 3D direction vector of each point was given. To speed up the search for candidate points on the human body, a quadtree-based method was utilized to effectively group adjacent data, and Dijkstra's algorithm was then applied on this basis to obtain the feature points. In the tracking process, a noise removal and restoration method based on the Kalman filter was used to correct and predict the extreme positions [46].

Combining multiple methods can also yield better accuracy in estimating the positions and directions of the joints. Handrich et al. [47] replaced raw depth information with more complex features describing local geodesic neighborhoods, and a random forest classifier was then used to learn the correct body part from these descriptors. Baak et al. [48] employed the geodesic distance, extracted from the input data as a sparse feature, to retrieve the pose from a large 3D pose dataset, and merged it with the previous pose to achieve pose tracking. Mohsin et al. [49] described a system for successfully locating specific body parts. Multiple depth sensors were used to collect point clouds from different perspectives to help solve the occlusion problem. In order to locate prominent human limbs, a triangular mesh model was applied to the 3D point cloud, and the ends of the limbs were marked using the geodesic distance.

In general, the geodesic distance can only be used to detect the five extreme points of the human body, namely, the head, hands, and feet. A hybrid framework using a depth camera to automatically detect joints was proposed [50]. This method divided the joints into two types: implicit joints and dominant joints. Dominant joints include the extreme points, elbows, and knees. Implicit joints are points on the trunk, such as the neck and shoulders. The specific extraction process is shown in Figure 8a. Due to the rigidity of the human limbs, the dominant joints are easier to detect than the implicit ones. First, the geodesic features of the human body are used to establish the extreme points.

$$D\_{g}(p\_0, P(x\_p, y\_p)) \;= \sum D\_{g}\big(P(x\_p, y\_p), P(x\_q, y\_q)\big) \tag{10}$$

**Figure 8.** Geodesic distance is used to locate the positions of extreme points. (**a**) describes the overview of the workflow of the method. (**b**) shows the skeleton model used in the method. The green dots represent the extreme points. Blue dots represent implicit joints (neck, waist, shoulders, and hips). Red dots represent dominant joints (elbows and knees). Figure from [50].

In Equation (10), *P* denotes the point cloud, *p0* is the starting point, and *Dg*(·) represents the geodesic distance between two points. If the correspondence between the extreme points and the skeleton model is not given, it is difficult to detect the positions of the joints. Therefore, starting from mapping an extreme point to the head, the features of the area around each extreme point are compared with the head model, and each extreme point is gradually mapped to the corresponding part of the human body model. In the skeleton model described in Figure 8b, the geodesic distance between the head and a hand is smaller than the geodesic distance between the head and a foot, which is the criterion used to separate the hand and foot joints.

With the above restrictions, the extreme points are found, and the human skeleton model is then used to define the implicit joints. It is assumed that the geodesic distance between the left hand and the left shoulder is shorter than the geodesic distance between the left hand and the right shoulder. The relationships for the left and right hands can be described as Equations (11) and (12):

$$D\_{g}(p\_{Lh}, j\_{Ls}) \;<\; D\_{g}(p\_{Lh}, j\_{Rs}), \tag{11}$$

$$D\_{g}(p\_{Rh}, j\_{Rs}) \;<\; D\_{g}(p\_{Rh}, j\_{Ls}) \tag{12}$$

By adding constraints such as Euler angles and the geodesic distance ratio, joint candidates were constrained to reflect the curvature of the path. A strategy based on the global shortest path was adopted to detect the dominant joint candidates, such as the elbow and knee joints, and shortest paths for specific detections were further used to locate these joints. Furthermore, to deal with self-occlusion, the difference in depth values between adjacent points is calculated when the distance map is updated. If the difference is less than a threshold, the two points lie on the same surface of the human body; otherwise, they belong to different parts of the human body.
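The two decision rules above, the left/right disambiguation of Equations (11) and (12) and the depth-difference surface test, can be sketched as follows; the function names and the threshold value are illustrative assumptions.

```python
import numpy as np


def label_hands(geo_from_ls: np.ndarray, geo_from_rs: np.ndarray,
                hand_a: int, hand_b: int):
    """Assign left/right hands using Eqs. (11)-(12): a hand is geodesically
    closer to the shoulder on its own side.

    geo_from_ls / geo_from_rs: geodesic distances from the left / right shoulder.
    """
    if geo_from_ls[hand_a] < geo_from_rs[hand_a]:
        return hand_a, hand_b     # (left hand, right hand)
    return hand_b, hand_a


def same_surface(depth_p: float, depth_q: float, thresh: float = 0.05) -> bool:
    """Self-occlusion test: adjacent points whose depth difference exceeds
    the threshold are assigned to different body parts."""
    return abs(depth_p - depth_q) < thresh


geo_ls = np.array([1.0, 5.0])     # geodesic distances of two hand candidates
geo_rs = np.array([5.0, 1.0])
left, right = label_hands(geo_ls, geo_rs, 0, 1)
```

Both rules are cheap per-point tests, which is why the framework can apply them while updating the distance map rather than in a separate pass.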

#### 3.2.2. Geometric Feature

Geometric features refer to overall attributes; common examples include the texture and shape features of the human body. The influence of complex poses can be eliminated by constructing and merging 3D point clouds from multiple views. A part detector was used to detect the body parts [51], and the centroid of each part was then taken as the joint position. Based on shape segmentation and skeleton sequences, Zhang et al. [52] designed an extraction method for the human skeleton; in the preliminary step, the centroid of every part was also used to generate a pseudo-skeleton. Multiple depth sensors were also utilized for motion capture [53]. First, multi-frame depth data from the depth sensors were converted into multiple point clouds, which were then combined into a merged point cloud, on which the skeleton line was acquired by the Reeb graph. Finally, the joint positions were calculated from the skeleton line according to the joint structure of the human body. A curve-skeleton representation based on the set of cross-section centroids was presented [54]. Patil et al. [55] applied multiple inertial measurement unit (IMU) sensors placed at the human joints to estimate the 3D positions of the joints, while Lidar data compensated for displacement drift during the initial calibration of the skeleton structure. A 2.5D thinning algorithm was applied [56], including segmentation of the occlusion region and thinning-line extraction. The obtained thinning-line skeleton cannot determine the exact positions of all body joints, but the end joints of the body parts can be detected. Finally, it was registered with a constructed human model containing 16 bone joints, and the human joints were extracted.
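The centroid-as-joint idea used in [51,52] can be sketched in a few lines, assuming the per-part segmentation labels are already given (the label scheme and sample coordinates are illustrative):

```python
import numpy as np


def part_centroids(points: np.ndarray, labels: np.ndarray) -> dict:
    """Joint estimate per body part: the centroid of its segmented points."""
    return {int(l): points[labels == l].mean(axis=0) for l in np.unique(labels)}


pts = np.array([[0.0, 0.0, 0.0], [0.0, 2.0, 0.0],      # part 0 (e.g., upper arm)
                [4.0, 0.0, 0.0], [6.0, 0.0, 0.0]])     # part 1 (e.g., forearm)
labels = np.array([0, 0, 1, 1])
joints = part_centroids(pts, labels)
```

Because the centroid is a simple average, it is robust to moderate segmentation noise but drifts toward the interior of thick parts, which is why later steps in these pipelines refine it against a skeleton model.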

Xu et al. [57] detected the human joints in a single-frame point cloud using a ToF depth camera. The process was divided into three stages, as shown in Figure 9. An in-house captured 3D dataset containing 1200 frames of depth images was first collected, which can be categorized into four different poses (upright, raising hands, parallel arms, and akimbo). To eliminate the influence of background and noise points on the algorithm, the point cloud was separated from the background by conditional filtering in the data pre-processing stage. To avoid self-occlusion, the point cloud was projected onto the 2D top view and then rotated by the angle formed between the farthest points on the *x*-axis and the horizontal axis, so that the viewpoint of the camera became parallel to the direction the human body was facing. Finally, the 3D silhouette of the human body was extracted as a global feature of the point cloud by adopting a public algorithm from the PCL [7].

**Figure 9.** Geometric feature is used for human joint estimation. The approach consists of three stages: data acquisition, data pre-processing and joint estimation. (**a**) The point clouds are directly obtained from the depth camera. (**b**) Data pre-processing mainly involves three parts: firstly, the irrelevant points are filtered, then the orientation of the human point cloud is adjusted, and finally the 3D silhouette is extracted. (**c**) Fourteen joints of human body are extracted by using the geometric feature of human silhouette. Figure from [57].

Before extracting the joints, the different poses were classified according to the angle and aspect ratio of the silhouette point cloud. First, the four poses were divided into two categories according to the angle formed by the farthest point on the *x*-axis and the point with the minimum value on the *y*-axis: one category includes the upright and akimbo poses, and the other contains the remaining two. The two poses in each category were then further distinguished by the aspect ratio of the silhouette. The extraction of the 14 joints differed slightly between poses. In outline, the head and foot joints were taken as the centroids of the corresponding segmented parts according to body proportions. The waist joint, serving as the base point, was obtained next using prior information, after which the shoulder and hand joints could be acquired. The elbow and knee joints were calculated by judging whether the limb was bent. In the straight state, the elbow joint was taken as the midpoint between the hand joint and the shoulder joint, while the knee joint was located on the line between the foot point and the midpoint of the left and right shoulders. In the bent state, the elbow joint was defined as the arm point farthest from the straight line formed by the hand and shoulder joints, and the knee joint was the point with the minimum value in the *z* direction.
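The elbow rule above (midpoint when straight, farthest-from-line when bent) reduces to a few lines of geometry. A hedged NumPy sketch, with illustrative function and parameter names rather than the authors' implementation:

```python
import numpy as np

def elbow_joint(arm_points, hand, shoulder, bent):
    """Locate the elbow per the geometric rules above (names illustrative).

    Straight arm: the elbow is the midpoint of the hand-shoulder segment.
    Bent arm: the elbow is the arm point farthest from the hand-shoulder line.
    """
    hand, shoulder = np.asarray(hand, float), np.asarray(shoulder, float)
    if not bent:
        return (hand + shoulder) / 2.0
    axis = shoulder - hand
    axis /= np.linalg.norm(axis)
    pts = np.asarray(arm_points, float)
    rel = pts - hand
    # perpendicular distance of each arm point from the hand-shoulder line
    proj = rel @ axis
    dists = np.linalg.norm(rel - np.outer(proj, axis), axis=1)
    return pts[np.argmax(dists)]
```

The same farthest-point-from-a-line construction applies to the knee in the bent case, with the minimum-*z* criterion replacing the distance maximization.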

Compared with the other methods, the accuracy of the joints was greatly improved. The average joint error was less than 5.8 cm on both the in-house and public datasets, but the result was also affected by clothing, which led to a larger error in the waist joint.

#### *3.3. Machine Learning-Based Methods*

Given the rapid development of machine learning technology in computer vision, some of the latest deep learning networks, such as PointNet [58], VoxelNet [59], PointCNN [60] and PointConv [61], have also been applied to 3D point clouds. These algorithms have further pushed the development of deep learning on 3D point clouds to address various problems [62,63]. This review attempts to track and summarize the progress of point cloud-based networks for human tracking in recent years, so as to provide a clear prospect for current point cloud-based joint extraction of the human body. We summarize the work in two categories: neural networks and classification trees.

#### 3.3.1. Neural Network

Neural networks form one of the most important fields of machine learning. In particular, the convolutional neural network (CNN) is a powerful tool that achieves excellent results in many computer vision tasks. A 2D CNN can be used to locate 2D human joints, which are then lifted to 3D through a depth transformation to reduce the computational cost. Biswas et al. [64] designed an end-to-end system that combines RGB images and point cloud information to recover the 3D human pose. Özbay et al. [65] used a simplified "Conditional Random Field" extraction method to classify 3D human point clouds; corresponding images and poses were fed to a CNN that maps them into a shared space, so that the dot product is high when an image-pose pair matches and low otherwise. Without making any assumption about the appearance or initial pose of the human, the system proposed in [66] can be applied to multi-human interaction scenarios. Schnürer et al. [67] used networks to generate 2D belief maps, combined with depth information, for pose detection of the upper body, which required fewer resources while achieving a high frame rate. However, the depth mapping of a 2D single-channel image is not a true 3D representation. To overcome this limitation, a 3D CNN architecture was proposed to provide a likelihood map for each joint [68], and the detection structure was extended to multi-person pose estimation. Millimeter wave (mmWave) offers high bandwidth and fast transmission, which is why it serves as a carrier for 5G technology. A method for real-time detection and tracking of human joints using mmWave radar, named mm-Pose, was proposed [69]. It was the first method to detect different joints from mmWave radar reflections, and the 77 GHz emission allowed it to capture small differences between reflective surfaces. The algorithm structure is shown in Figure 10.
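The image-pose matching criterion can be made concrete with a small sketch. This is only an illustration of the dot-product scoring idea, not the network in [65]; the embeddings here are toy values standing in for the outputs of the two CNN branches:

```python
import numpy as np

def match_score(image_emb, pose_emb):
    """Compatibility score between an image embedding and a pose embedding
    that the two CNN branches have mapped into a common space: the dot
    product is high for a matching image-pose pair and low for a mismatch."""
    return float(np.dot(image_emb, pose_emb))

# toy embeddings: a matching pair points in similar directions
image_emb = np.array([0.9, 0.1, 0.0])
matching_pose = np.array([0.8, 0.2, 0.1])
other_pose = np.array([-0.1, 0.1, 0.9])
```

At test time, the pose whose embedding scores highest against the query image embedding is returned.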

**Figure 10.** In automatic/semi-autonomous vehicles and traffic monitoring systems, mm-Pose can be used to perform robust skeleton pose estimation of pedestrians. Figure from [69].

The objects reflect the radar signal within a coherent processing interval (CPI), yielding a 3D radar cube with fast-time, slow-time, and channel dimensions. To overcome the sparseness of the voxel grids and significantly reduce the size of the subsequent machine learning structure, the depth, the elevation/azimuth ratio, and the normalized power of the reflected signal were assigned to the RGB channels to generate a 3D heat map. This heat map serves as the input to a CNN whose outputs are the positions of the different human joints in 3D space.
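The channel-assignment idea can be sketched as follows. This is a simplified NumPy illustration under assumed conventions (the exact field layout and normalization in mm-Pose differ): sparse reflection points are scattered onto an image grid, and three per-point quantities fill the R, G, B channels.

```python
import numpy as np

def encode_radar_points(points, size=64):
    """Encode sparse radar reflection points as a 3-channel heat map
    (sketch of an mm-Pose-style input encoding; field layout is assumed).

    points: rows of (x, y, depth, elev_azim_ratio, power), where (x, y)
    index the image plane and the last three values fill the R, G, B
    channels after per-channel normalization.
    """
    img = np.zeros((size, size, 3))
    xy = points[:, :2]
    # map spatial coordinates onto pixel indices
    lo, hi = xy.min(axis=0), xy.max(axis=0)
    idx = ((xy - lo) / np.maximum(hi - lo, 1e-9) * (size - 1)).astype(int)
    feats = points[:, 2:5].copy()
    feats /= np.maximum(np.abs(feats).max(axis=0), 1e-9)   # per-channel normalization
    img[idx[:, 1], idx[:, 0]] = feats
    return img
```

The resulting dense image lets an ordinary 2D CNN consume what is originally a very sparse 3D reflection set.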

In addition to CNNs, other neural networks are also commonly used in point cloud-based pose recognition. A fully connected network (FCN) was introduced to accurately simulate the restriction of the human joints [70]; it effectively implements realistic constraints by transforming the constraint forces of a physics engine into an optimization problem. Li et al. [71] proposed a multi-layer residual network to obtain hand features for tracking and segmentation. Zhang et al. [63] adopted an adversarial learning method to ensure the validity of the restored human pose and alleviate the ambiguity caused by weak supervision. Their deep learning-based weakly supervised network, shown in Figure 11, uses not only weakly supervised annotations of 2D joints but also fully supervised annotations of 3D joints. It is worth noting that the 2D joints of the human body help select effective sampling points, reducing the computational cost of the point cloud-based network.

**Figure 11.** A schematic diagram of human joint estimation using Neural Network. The network consists of two modules, the point clouds proposal module and the 3D pose regression module. Using the input depth map, we first estimate the 2D human pose, and use it to sample and normalize the extracted point clouds from depth. Then we use the initial 3D pose converted from the estimated 2D pose and the normalized point clouds to predict the final 3D human pose. Figure from [63].

Point cloud-based networks are also applied to this task directly. Qi et al. [58] first proposed the PointNet network, which extracts per-point features from unordered point clouds. PointNet uses traditional multilayer perceptrons (MLPs) as its core learning layers and is commonly applied to 3D object classification and point-level semantic tasks. In subsequent work, PointNet++ [72] added local structures at different scales to enhance PointNet. Owing to the effectiveness of this method, the authors used the PointNet++ network for point segmentation. Compared with existing human pose estimation methods that require foreground detection of the human body, this method performs accurate pose estimation without such a requirement.
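The property that lets PointNet consume unordered point clouds is its symmetric aggregation: the same MLP is applied to every point, and a max pooling over points produces a permutation-invariant global feature. A minimal NumPy sketch of that core idea (random untrained weights, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 16)), np.zeros(16)   # shared per-point MLP, layer 1
W2, b2 = rng.normal(size=(16, 32)), np.zeros(32)  # shared per-point MLP, layer 2

def pointnet_global_feature(points):
    """PointNet-style global feature: apply the same MLP to every point,
    then max-pool over points so the result is permutation invariant."""
    h = np.maximum(points @ W1 + b1, 0.0)          # per-point layer 1 (ReLU)
    h = np.maximum(h @ W2 + b2, 0.0)               # per-point layer 2 (ReLU)
    return h.max(axis=0)                           # symmetric max pooling

pts = rng.normal(size=(128, 3))
f1 = pointnet_global_feature(pts)
f2 = pointnet_global_feature(pts[::-1])            # same cloud, reordered
assert np.allclose(f1, f2)                         # point order does not matter
```

PointNet++ keeps this building block but applies it hierarchically within local neighborhoods at multiple scales.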

Joint extraction of part structures from the human body has attracted much attention in further research. A self-organizing network was proposed to obtain accurate 3D hand pose estimation from unannotated data [73]. A heat map output by a 3D CNN reflected the probability distribution of the joints. In [74], the heat map was used as intermediate supervision in a 3D hourglass network to incorporate skeletal constraints for hand tracking. In addition to the heat map representing distance, a unit vector field was introduced, and the joint positions were inferred by weighted fusion [75]. To further improve the accuracy of the fingertips, a fingertip refinement network was designed to model the visible surface of the hand and perform pose regression [76]. Different from the original PointNet, Local Continuous PointNet (LCPN) [77] was proposed to extract local features from the neighbor index in the unorganized point cloud to estimate facial joints. In [78], the input of a 3D CNN was encoded through the projection of the point cloud; after the convolution and pooling layers, 3D features were extracted from the volume representation and used to regress the relative positions of the hand joints within the 3D volume. An end-to-end multi-person 3D network, Point R-CNN, was proposed for pose estimation [79]; it uses panoramic point clouds from multiple cameras to solve the occlusion problem. The whole network can be regarded as a combination of two parts: the first segment performs instance detection with VoxelNet, and the second processes each instance with PointNet to acquire the joint information.

#### 3.3.2. Classification Tree

The classification tree is one of the prevailing methods for human body segmentation. As a newly emerging and highly flexible machine learning algorithm, the random forest (RF) is a classifier that uses multiple trees to train on and predict samples, and it has a wide range of applications.
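The multi-tree prediction scheme reduces to majority voting over the ensemble. A toy sketch (the trees here are simple threshold callables standing in for trained decision trees; the interface is assumed, not taken from any cited work):

```python
import numpy as np

def forest_predict(trees, x):
    """Random forest prediction as described above: each tree votes for a
    class label and the majority wins. `trees` is a list of callables
    standing in for trained decision trees (assumed interface)."""
    votes = np.array([tree(x) for tree in trees])
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]

# toy ensemble: three "trees" that disagree on some inputs
trees = [lambda x: int(x > 0.3), lambda x: int(x > 0.5), lambda x: int(x > 0.7)]
label = forest_predict(trees, 0.6)   # two of the three trees vote for class 1
```

In body-part segmentation, the same vote is cast per pixel or per point, producing the per-part probability maps used in the works below.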

Inspired by decision forests, each point in the point cloud of the human body casts votes to evaluate the contribution of each body part, and a collaborative method was proposed on this basis to learn 3D features of the human body [80]. Xia et al. [81] trained a cascade regression network on a pre-recorded human motion dataset. In addition, a hierarchical kinematic model of the human pose was introduced into the learning process, allowing the accurate 3D joint angles to be estimated directly.

Random forests are often used to segment different parts of the human body in the following literature. Different regions of the upper body were first detected, and then a probability map for each region was calculated [82]. The highest part of each probability map defined the external joints, while the internal joints, such as the elbow, were obtained by fitting an ellipse model. In the 3D point cloud, Principal Direction Analysis (PDA) was used to estimate the main direction of each body part, which was then mapped to the corresponding part of a 3D model to estimate the human pose [83]. For a prescribed action set, pose estimation using multiple random forests was proposed to enhance the results of motion analysis [84]: a group of random verification forests checked the classification results of the initial random regression forest for precise joint positioning. Geodesic-based feature descriptors played a significant role in a random forest classifier, producing more exact spatial predictions for body parts and bone joints [85]. A random forest was also applied to infer the consistency between the input data and a constructed template [86]; the method successfully restores the shape of the human body and extracts the joints.

Pixel-wise inference with random decision trees is usually computationally heavy. In particular, when the number of trees is increased to improve generalization and accuracy, the burden of evaluating multiple trees may force a trade-off between speed and accuracy; here the random tree walk (RTW) method offers a substantial gain. One method combined RTW with optimization techniques such as ICP and random search, improving the extensibility of the classification tree [87]: RTW initialized multiple hypotheses in different ways and then passed them to the optimization stage.

Yub et al. [88] no longer trained the tree for pixel-level classification; instead, they used a regression tree to estimate the probability distribution of the direction toward a specific joint relative to the current position. During testing, the direction of the random walk is drawn from a set of representative directions, and a new position is found by taking a constant step in that direction.

For any localization problem, as long as the direction from any point on the object toward the target position is known, the correct position can be found. Ideally, the directions should be trained from all possible positions over the whole body; in practice, because the random tree walk reaches the joint position faster from a nearby start, a starting point close to the target joint is required. When the skeleton topology is used, a nearby initial point must therefore be provided for the RTW, as shown in Figure 12.

**Figure 12.** Example of classification tree for human joint estimation. (**a**) illustrates the kinematic tree implemented along with random tree walk (RTW). First, the random walk toward belly positions starts from body center. The belly positions (red dot in (**a**)) become starting point for hips and chest, and so forth. (**b**) shows the RTW path examples. (**c**) illustrates offset sample range spheres in green. In (**d**), the green dots represent offset samples. Figure from [88].

RTW can be described as training a regression tree for each joint in the human skeleton: the trained tree predicts the direction from a point toward that specific joint. A training set is therefore first constructed from the position of each joint and the depth values of the input points.

The unit direction vector *û* from the offset point to the joint is defined in Equation (13):

$$\hat{u} = (p\_j - q) / ||p\_j - q||,\tag{13}$$

where *pj* is the position of the specific joint and *q* is the coordinate of a random offset point. The training sample *S* is expressed as Equation (14):

$$S = (I, q, \hat{u}),\tag{14}$$

*I* represents the depth value. The goal is to find a partitioned binary tree that minimizes the sum of squared differences. The directions are stored in each leaf node in the form of clusters, so that several representative directions and their probability weights form the output of the tree. When estimating the pose, the walk starts from an initial point; at each step, the regression tree is traversed to a leaf node, which yields a set of candidate directions for the current point, and the step direction is randomly selected from the k-means cluster unit vectors at that leaf.
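The inference loop above can be sketched in a few lines. This is a hedged illustration, not the implementation of [88]: `leaf_directions` is an assumed interface standing in for the trained regression tree, and the final aggregation here is a simple mean of the visited positions.

```python
import numpy as np

def unit_direction(p_j, q):
    """Equation (13): unit vector from offset point q toward joint p_j."""
    d = np.asarray(p_j, float) - np.asarray(q, float)
    return d / np.linalg.norm(d)

def random_tree_walk(start, leaf_directions, step=0.05, n_steps=64, seed=0):
    """Sketch of RTW inference: from a start point, repeatedly sample a
    direction and move a constant step. `leaf_directions(q)` stands in for
    the trained tree: it returns representative unit directions and their
    probability weights at point q (assumed interface).
    """
    rng = np.random.default_rng(seed)
    q = np.asarray(start, float)
    visited = [q.copy()]
    for _ in range(n_steps):
        dirs, weights = leaf_directions(q)
        k = rng.choice(len(dirs), p=weights)       # sample a cluster direction
        q = q + step * np.asarray(dirs[k])
        visited.append(q.copy())
    return np.mean(visited, axis=0)                # aggregate visited positions

# toy "tree": always points straight at a known joint
joint = np.array([0.3, 0.2, 0.1])
walk_to_joint = lambda q: ([unit_direction(joint, q)], [1.0])
estimate = random_tree_walk(np.zeros(3), walk_to_joint)
```

With a perfect direction oracle the walk converges to a small oscillation around the joint; with a real tree, the sampled directions and the aggregation over visited positions absorb the prediction noise.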
