*3.4. Summary*

The template-based method first establishes a template library or a parameterized template, and then compares the similarity between the point cloud of the human body and the samples in the template library or the target model. This approach is relatively coarse and time-consuming. Given the diversity and multi-scale structure of the sample data, the same human pose may appear very different in space. Therefore, the accuracy of template-based methods is quite limited.

The feature-based method extracts global or local features of the point cloud, which are combined with prior knowledge to obtain the 3D joints of the human body. Because this approach relies on the selection of feature points, it handles self-occlusion and changing poses poorly. Therefore, the robustness of such algorithms needs to be further improved to cover as many human poses as possible.

The machine learning-based method mainly uses a network to automatically learn the required features from the point cloud; the learned features then serve as the criterion for extracting the human joints. Compared with the above two approaches, it brings substantial improvements: on the one hand, the estimated joints can reach higher accuracy by learning sample features from a large training set; on the other hand, it is also robust to scale variations. The machine learning-based method can compensate for the shortcomings of the other two approaches, but it is restricted by the richness of the training samples, so the construction of the training set is crucial.

Moreover, we summarize some works on point cloud-based joint estimation for the human body in Table 2. For 3D pose estimation, two different error metrics can assess the accuracy of a method. One is the direct measurement of the Euclidean distance between the estimated and ground-truth joints; the other is the average precision (AP), defined as the ratio of correctly estimated joints within a specific threshold. Several works have adopted the AP metric, and Table 2 reports the datasets together with some key parameters, in particular the threshold value *δ*. When *δ* changes, the AP of each joint changes accordingly. The 3D coordinate of each classified joint, obtained by the algorithm proposed in the cited work, is compared with the corresponding ground truth of that joint within the same dataset. When the difference between these two values is less than the given threshold *δ* in Table 2, the joint is considered correctly located. To obtain the tracking accuracy of each classified joint in the referenced works, the AP in Table 3 is finally calculated as the ratio between the number of correctly located joints and the total number of joints.
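As a minimal sketch of the AP metric described above, the following function computes both the overall AP and the per-joint accuracy of the kind reported in Table 3. The array shapes and the function name are our own illustrative assumptions, not taken from any referenced work.

```python
import numpy as np

def average_precision(pred_joints, gt_joints, delta):
    """AP: fraction of estimated joints within distance delta of ground truth.

    pred_joints, gt_joints: (num_frames, num_joints, 3) arrays of 3D positions.
    delta: distance threshold in the same unit as the coordinates (e.g., meters).
    """
    # Per-joint Euclidean error between estimate and ground truth.
    errors = np.linalg.norm(pred_joints - gt_joints, axis=-1)  # (frames, joints)
    correct = errors < delta              # joints counted as correctly located
    per_joint_ap = correct.mean(axis=0)   # per-joint accuracy, as in Table 3
    overall_ap = correct.mean()           # correct joints / all joints
    return overall_ap, per_joint_ap
```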


**Table 2.** Summary of the referenced works for human pose estimation with depth inputs.

"-" represents that the value is not given; *δ* represents threshold value; "Y" means running on a desktop with GPU; and "N" means running on a desktop without GPU.

**Table 3.** Tracking accuracy of the human body joints in the referenced works.


"-" represents that the value is not given.

#### **4. Public Depth Dataset of the Human Body**

Public datasets play an important role in testing the robustness of an algorithm and provide a platform to compare different algorithms in a fair manner. In the past few years, many 3D benchmark datasets for different applications have been collected and made publicly available to the research community. These datasets mainly include RGB images, depth maps, or point clouds acquired from structured-light or ToF depth cameras. In this paper, we focus only on point cloud-based datasets. This section provides a detailed review of the datasets listed in Table 4. Since point clouds occupy fairly large storage space, most datasets provide depth maps together with the internal parameters of the camera, from which point clouds can easily be recovered.
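As an illustration of this conversion, the sketch below back-projects a depth map with the standard pinhole model. The intrinsic parameters fx, fy, cx, cy stand for the camera internals that such datasets provide; treating zero depth as invalid is our own assumption.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map into a 3D point cloud (pinhole camera model).

    depth: (H, W) array of metric depth values; zeros mark invalid pixels.
    fx, fy: focal lengths in pixels; cx, cy: principal point in pixels.
    Returns an (N, 3) array of points in the camera coordinate frame.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinate grid
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=-1)
```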


**Table 4.** Previously developed depth datasets for human bodies.

S denotes single person, M denotes multi-person interaction. "-" represents that the value is not given.

A widely used depth dataset, named SMMC-10, was constructed as a benchmark for algorithm testing [90]. To generate this dataset, a probabilistic model containing 15 rigid parts of the human body was first defined. These rigid parts were spatially constrained by joints with 48 degrees of freedom. The dataset was recorded by a Motion Capture (MoCap) system and a ToF camera (Swissranger SR4000) at 100–250 ms per frame. It includes 28 real actions, such as fast kicking, swinging, self-focusing, and whole-body rotation.

Another dataset, constructed by Li et al. [91] and named MSR-Action3D (https://documents.uow.edu.au/~wanqing/#MSRAction3DDatasets (accessed on 12 June 2020)), comprises 20 gaming actions performed by seven subjects facing the depth camera: high arm wave, horizontal arm wave, hammer, hand catch, forward punch, high throw, draw x, draw tick, draw circle, hand clap, two hand wave, side-boxing, bend, forward kick, side kick, jogging, tennis swing, tennis serve, golf swing, and pickup and throw. Each action was captured three times by Kinect v1 (Microsoft Corp., Redmond, WA, USA) at 15 frames per second. In total, the dataset reasonably covers various movements of the arms, legs, and torso, and stores 4020 motion samples with 23,797 depth maps. Note that, if an action was performed with only one arm or one leg, subjects were advised to use their right arm or right leg.

A large-scale RGB+D human action recognition dataset, named NTU RGB+D (http://rose1.ntu.edu.sg/Datasets/actionRecognition.asp (accessed on 12 December 2020)), was introduced together with a recurrent neural network (RNN) that models the long-term temporal correlation of the various parts of the human body to better classify human poses [92]. The dataset collects more than 56,000 video samples with a total of four million frames from 40 different subjects and 60 different action classes, including daily actions, mutual actions, and health-related actions.

Nguyen et al. [93] explored skeleton extraction during human walking. Their walking gait dataset is about 18.6 GB and is divided into a test set and a training set: the test set includes samples of five subjects, and the training set contains the walking gait data of four individuals. The data collected for each person include skeleton and silhouette information together with the point cloud.

The Berkeley Multimodal Human Action Database (MHAD) recruited seven male and five female subjects aged between 23 and 30 (http://tele-immersion.citris-uc.org/berkeley\_mhad (accessed on 2 November 2020)) [94]. Each subject performed 11 actions in succession, such as jumping, throwing, waving, and sitting down. To ensure the accuracy of action acquisition, each subject repeated each action five times, resulting in a total of approximately 660 action sequences. In addition, a T-pose model was created for each subject to extract the corresponding skeleton.

The G3D dataset (http://dipersec.king.ac.uk/G3D/G3D.html (accessed on 29 June 2020)) mainly includes different gaming actions [95]. Given the internal parameters of the depth camera, the captured depth maps can be converted into point clouds. The dataset contains 10 subjects, each required to complete seven action sequences consisting of 20 gaming actions: punch right, punch left, kick right, kick left, defend, golf swing, tennis swing forehand, tennis swing backhand, tennis serve, throw bowling ball, aim and fire gun, walk, run, jump, climb, crouch, steer a car, wave, flap, and clap. On this basis, G3Di (http://dipersec.king.ac.uk/G3D/G3Di.html (accessed on 29 June 2020)) [96] was constructed to capture human interactions in multiplayer games. It contains the interactive motions of six pairs of subjects, covering activities such as boxing, volleyball, football, table tennis, sprinting, and hurdling. Each action is stored separately as RGB, depth, and skeleton data.

A complex human activity dataset, called SBU-Kinect-Interaction (https://www3.cs.stonybrook.edu/~kyun/research/kinect\_interaction/index.html (accessed on 19 December 2020)), was created to describe the interaction between two people [97]; it includes synchronized video, depth, and motion capture data. All videos were recorded in the same laboratory environment. Seven participants performed activities organized into 21 groups, where each group contains a pair of different people performing all eight interactions. Note that in most interactions one person is acting and the other is reacting. Each action category contains one or two sequences, and there are approximately 300 interactions in the entire dataset.

The CDC4CV pose dataset [98] provides upper-body depth data captured with Kinect v1 for comparing static pose estimation techniques, covering nine joints of three subjects. During acquisition, the upper body of each subject was kept within the 640 × 480 window. Nearly 700 depth frames of the three subjects were collected and labelled, of which 345 were chosen as the training set and the rest were used as the test set.

The EVAL dataset was built in 2012 (http://ai.stanford.edu/~varung/eccv12 (accessed on 16 September 2020)) [89] and includes 24 action sequences of three different subjects. Each subject performed actions of gradually increasing complexity at a distance of approximately 3 m from a Kinect camera. The ground truth of the 12 joints was captured using a Vicon motion capture system and stored in the EVAL dataset together with the corresponding 3D point clouds.

CMU MoCap (http://mocap.cs.cmu.edu (accessed on 14 October 2020)) [99] used 12 Vicon MX-40 infrared cameras to record the motions of human bodies wearing black jumpsuits, covering six major categories: human interaction, interaction with the environment, locomotion, physical activities and sports, situations and scenarios, and test motions. Each category was further divided into 23 sub-categories. Forty-one markers were attached to the human body so that the cameras could capture the ground truth of the joints during motion. The images captured by the various cameras were then triangulated to obtain 3D data.
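The triangulation step mentioned above can be sketched, for the minimal two-camera case, with the classical linear (DLT) method; the function and variable names below are illustrative, and a real MoCap pipeline fuses many calibrated views with outlier handling.

```python
import numpy as np

def triangulate_marker(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one marker seen by two calibrated cameras.

    P1, P2: (3, 4) camera projection matrices.
    x1, x2: (2,) pixel coordinates of the same marker in each view.
    Returns the marker's 3D position in world coordinates.
    """
    # Each view contributes two linear constraints on the homogeneous point X.
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # Least-squares solution: the right singular vector of A with the
    # smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]  # dehomogenize
```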

In summary, MoCap is a motion capture system that attaches markers to the joints of the human body and tracks them with multiple cameras from different views, acquiring accurate 3D skeleton information at a very high frame rate. However, the system is usually expensive and only available in an indoor environment. At present, many methods instead use a single depth camera for data acquisition and processing, so the subjects do not need to wear any constraining equipment. Among the single-person datasets, MSR-Action3D and G3D target gaming actions as the main application; both are single-view and contain similar action sequences. In addition to depth data, MSR-Action3D also collected video data, and G3D provides the corresponding RGB images at a relatively high frame rate. SMMC-10, Walking gait, MHAD, CDC4CV, and EVAL mainly contain basic behaviors of the human body. In SMMC-10, only a single individual completed a series of complex actions. MHAD contains the depth information of 12 subjects from four different views, and the gender and age of the subjects are also given. EVAL provides the ground truth of 12 joints and the corresponding 3D point clouds. CDC4CV tracks the nine joints of the upper body, and Walking gait was built to analyze human gait; the application scenarios of both are therefore somewhat limited. For the multi-person datasets, the MoCap system was used to obtain the ground truths of 41 joints in the CMU MoCap dataset, which covers multi-person gaming, sports, and other behaviors. SBU-Kinect-Interaction provides eight classes of interaction sequences. After that, G3Di provided common interaction activities for multi-person gaming, with 20 human joints given for detailed analysis. NTU RGB+D used Kinect v2 to acquire the ground truths of 25 joints with more human interaction activities; in addition to the depth information, the dataset also includes RGB images and IR videos.

#### **5. Application of Point Cloud-Based Joint Estimation**

Human joint recognition is one of the important directions of artificial intelligence applications. As the technology matures, human-related research can use joint information to solve a range of problems. According to the application scenario, the approaches can be divided into the following categories: virtual try-on technology, 3D human reconstruction, action recognition, human–computer interaction, and many others; some examples of these applications are shown in Figure 13. The related literature is summarized below according to these application scenarios.

**Figure 13.** Applications of point cloud-based joint estimation.

Human body shape estimation is essential for virtual try-on technology. Estimating the 3D human shape in motion from a set of unstructured 3D point clouds is a very challenging task, and human joints can serve as an important prior for it. Yang et al. [100] proposed an automatic method for estimating the shape of a human body in motion: under the premise of loose clothing, model reconstruction was formulated as an optimization problem over the body shape. Based on the automatic detection of human joints, a pose fitting scheme was optimized [101]; 3D scans from multiple viewpoints were projected onto 2D images, and deep learning algorithms were then used to mark the joints, which helped find the best pose parameters. By registering the joints of the SMPL model with manually marked joints in the point cloud, a coarse registration was obtained, and the heat kernel feature was extracted between two frames of changing pose for non-rigid registration; the results of the coarse and non-rigid registrations were fitted to each other to get the final 3D human body model [102]. The joints of the frontal point cloud of the human body, generated directly with a Kinect device, helped initialize a personalized SMPL model, and the model was registered with the input point cloud to find corresponding points and obtain a 3D human body model [103]. Yao et al. [104] further projected the obtained model onto the corresponding RGB image.
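As a hedged sketch of the joint-based coarse registration step described above, the following Kabsch alignment estimates a rigid transform from matched joint pairs. It is a generic least-squares solution that assumes known correspondences, not the exact procedure of [102,103].

```python
import numpy as np

def rigid_align(source, target):
    """Least-squares rigid alignment (Kabsch) between corresponding joints.

    source, target: (J, 3) arrays of matched joint positions, e.g., SMPL
    model joints and joints marked in the scanned point cloud.
    Returns (R, t) minimizing ||(source @ R.T + t) - target||.
    """
    mu_s, mu_t = source.mean(axis=0), target.mean(axis=0)
    H = (source - mu_s).T @ (target - mu_t)  # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_t - R @ mu_s
    return R, t
```

The resulting rigid transform only roughly poses the template; a non-rigid refinement stage then handles pose- and clothing-dependent deformation.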

The joints can also assist the 3D reconstruction of the human body. Matteo et al. [1] utilized the information provided by a skeletal tracking algorithm to transform each point cloud into a standard pose in real time, and then registered the transformed point clouds to achieve 3D human body reconstruction. In order to extract more point features, a graph aggregation module was used to enhance PointNet++ [2], an attention module was used to better map unordered point features to ordered skeleton joints, and a skeleton graph module regressed the skeleton joints through SMPL parameters. A dataset containing 2D scenes and 3D human body models was constructed; after the joints of the human body were marked on the 2D images, they were converted to the 3D coordinate system generated by the radar, and the SMPL model was used to fit the pose of the human body [105].

To improve the accuracy and real-time performance of action recognition, the skeleton-based method has been studied in various research fields as an effective technology. Instead of using the entire skeleton as the input to a hierarchical RNN, the human skeleton was divided into five parts according to the physical structure of the human body, which were fed into five subnets, respectively; as the number of layers increases, the representations extracted by the subnets are merged into higher-layer outputs [106]. From the distances between joints and the occupancy information of the skeleton, temporal information was also extracted with a temporal pyramid to characterize each action [107,108]. To recognize human actions, the differences between consecutive frames were used to calculate the new positions and angles of all joints; the input of a structured tree neural network was the human joints, and the output was the action classification [109]. Zhang et al. [110] extracted the local surface geometric feature (LSGF) of each joint in the point cloud and introduced a global feature that encodes the video sequence as a vector; finally, an SVM classifier was applied to obtain the action classification. Khokhova et al. [111] utilized a regular grid to divide the 3D space, and then used descriptors based on space occupancy information to identify the pose of each static frame. An enhanced skeleton visualization method was presented [112], in which a CNN was implemented as the main structure to recognize view-invariant human actions.
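To make the joint-distance and frame-difference features referred to above concrete, the sketch below computes two common skeleton descriptors. The shapes and function names are illustrative assumptions, not the exact features of [107–109].

```python
import numpy as np

def joint_distance_descriptor(joints):
    """Pairwise joint distances: a translation-invariant static pose feature.

    joints: (J, 3) array of 3D joint positions for one frame.
    Returns the J*(J-1)/2 unique pairwise Euclidean distances.
    """
    diffs = joints[:, None, :] - joints[None, :, :]  # (J, J, 3)
    dists = np.linalg.norm(diffs, axis=-1)           # (J, J) distance matrix
    iu = np.triu_indices(joints.shape[0], k=1)       # upper triangle only
    return dists[iu]

def joint_motion(joints_seq):
    """Frame-to-frame joint displacements: a simple temporal motion cue.

    joints_seq: (T, J, 3) array of joints over T frames.
    Returns (T-1, J, 3) displacement vectors.
    """
    return np.diff(joints_seq, axis=0)
```

Descriptors of this kind are typically pooled over time (e.g., with a temporal pyramid) before being passed to a classifier such as an SVM.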

Human–robot motion retargeting is one of the interesting research topics in human–computer interaction technology; its goal is to make a robot follow the movement of the human body. Wang et al. [113,114] established a model as a bridge between the input point cloud of the human body and the robot, so as to achieve human–robot motion retargeting. In another work, an activity was decomposed into multiple unit sequences, each related to an important factor of the behavior [115]; these units were then input into a dynamic Bayesian network to analyze human behavioral intentions and realize human–computer interaction.

In addition to the above applications, Kim et al. [116] used high-speed RGB and depth sensors to generate movement data of an expert dancer; all skeletons could be reorganized to generate desired dance movements. Given visual input, a robot was made to reason about and choose the best container and human pose to perform a transfer task [117]. Soft biometrics can address the problem of person re-identification: for each measured subject, the 3D skeleton information was applied to adjust the human pose and create the standard pose (SSP) of the skeleton, and the SSP was divided into grids to obtain individual characteristics for identification [118]. In archaeology, skeleton joints are helpful for generating a model that can represent any biological shape [119]. The length and position of the joints are also useful for judging whether a human body belongs to a child or an adult [120]. Desai et al. [121] used the direction of the feet or torso to judge the orientation of the human, and then combined and optimized the skeletons collected by multiple cameras to obtain the final skeleton, even in the case of occlusion. In terms of rehabilitation treatment, a method was proposed to improve the evaluation of upper-limb rehabilitation [122]: the skeleton of the point cloud was captured by a Microsoft Kinect and registered with the SMPL template to obtain the position and length of the joints.
