**4. Experiment Results**

An Intel RealSense D435 is used as the depth camera in the experiments. The focal length *f* and the frame rate of the depth camera are 325.8 mm and 30 Hz, respectively. The resolution of the depth video is 640 × 480. The thresholds *Tb* in (1) and *T*σ in (6) are set to 100 and 50, respectively. The parameter γ in (9) is 4000, which means that *r* is 2 when *dh* is 2000 mm.
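The stated relation between γ, *dh* and *r* can be checked with a short calculation. Equation (9) is not reproduced in this section, so the sketch below assumes it has the inverse-proportional form *r* = γ / *dh*, which is consistent with the values given above. The function name `r_from_depth` is hypothetical.

```python
def r_from_depth(d_h_mm: float, gamma: float = 4000.0) -> float:
    """Assumed reading of (9): r = gamma / d_h, with d_h in millimetres.

    This form is an assumption inferred from the stated values
    (gamma = 4000 gives r = 2 at d_h = 2000 mm), not the paper's formula.
    """
    if d_h_mm <= 0:
        raise ValueError("d_h must be positive")
    return gamma / d_h_mm


print(r_from_depth(2000.0))  # 2.0
```

Under this assumption, *r* shrinks as the head-top depth grows, for example to 1 at 4000 mm.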

Figures 9 and 10 show the human body regions extracted by mask R-CNN and by the background depth image, respectively. Both extraction methods accurately extract the human body region not only in the standing state but also in the walking state, regardless of the posture of the body. In Figure 9, the areas painted in green and red are the human body and human head regions, respectively. The human head regions are accurately found even when the hand is raised above the head. Extraction by the background depth image yields larger regions than extraction by mask R-CNN, so some parts of the background are included in the human body region. In addition, the bottom area of the human body is not included in the human body region because the depth values of this area are similar to the depth value of the ground.

**Figure 9.** Extraction of human body region by mask R-CNN. (**a**) Standing toward front; (**b**) standing backward; (**c**) standing sideways; (**d**) walking toward camera; (**e**) walking opposite to camera; (**f**) lateral walking; (**g**) standing toward front and waving hand; (**h**) standing backward and waving hand.

**Figure 10.** Extraction of human body region by background depth image. (**a**) Standing toward front; (**b**) standing backward; (**c**) standing sideways; (**d**) walking toward camera; (**e**) walking opposite to camera; (**f**) lateral walking.

Figure 11 shows the correction of the human height by (12) when the body region is extracted by mask R-CNN. Without the correction, the estimated human heights vary widely because of noise in the depth frames. After the correction by (12) is applied, the estimated heights settle to stable values after about 20 frames.

**Figure 11.** Results of height estimation after correction through cumulative average for each of three persons.
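The stabilizing effect of the correction can be illustrated with a running mean. Equation (12) is not reproduced in this section, so the sketch below assumes it is a plain cumulative average of the per-frame height estimates, as the caption of Figure 11 suggests. The function name `cumulative_average` and the sample values are hypothetical.

```python
from itertools import accumulate


def cumulative_average(heights):
    """Return the running mean of per-frame height estimates.

    Element k of the result is the mean of the first k + 1 inputs.
    Assumes (12) is an unweighted cumulative average.
    """
    return [
        total / count
        for count, total in enumerate(accumulate(heights), start=1)
    ]


# Hypothetical noisy per-frame estimates in cm
per_frame = [180.0, 176.0, 178.0]
print(cumulative_average(per_frame))  # [180.0, 178.0, 178.0]
```

Because each new frame contributes only 1/*k* of the *k*-th average, single-frame depth noise has a diminishing effect as frames accumulate.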

Figure 12 shows the results of the human-height estimation for each method of human body region extraction. The actual height of the person is 177 cm. For the extraction through the background depth image, the first 50 frames are accumulated to generate the background depth image. The person stays at a distance of 3.5 m from the camera. The body height is estimated as 176.2 cm when the human body region is extracted by mask R-CNN and as 172.9 cm when it is extracted by the background depth image.

**Figure 12.** Results of height estimation depending on methods of human body region extraction.
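The background-based extraction can be sketched as follows. Two assumptions are made here because (1) is not reproduced in this section: the background depth image is taken as the per-pixel mean of the accumulated frames, and a pixel is assigned to the body region when its depth differs from the background by more than *Tb*. The function names and the synthetic scene are hypothetical.

```python
import numpy as np


def build_background(depth_frames):
    """Average a stack of depth frames (N, H, W) into one background image."""
    return np.mean(np.asarray(depth_frames, dtype=np.float64), axis=0)


def extract_body_mask(depth_frame, background, t_b=100.0):
    """Mark pixels whose depth differs from the background by more than t_b.

    Assumed form of (1): |depth - background| > T_b.
    """
    diff = np.abs(np.asarray(depth_frame, dtype=np.float64) - background)
    return diff > t_b


# Synthetic scene: a wall at 4000 mm with small sensor noise, 50 frames
rng = np.random.default_rng(0)
frames = 4000.0 + rng.normal(0.0, 5.0, size=(50, 48, 64))
background = build_background(frames)

# A new frame with a "person" 500 mm in front of the wall
frame = np.full((48, 64), 4000.0)
frame[10:40, 28:36] = 3500.0
mask = extract_body_mask(frame, background)
print(int(mask.sum()))  # 240 pixels (30 rows x 8 columns)
```

A threshold of this kind also shows why the area near the feet can be lost: where the body and the ground have nearly the same depth, the difference stays below *Tb*.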

Figure 13 shows the results of the human-height estimation according to the distance between the human body and the camera. The actual height of the person is 182 cm. The average estimated heights are 181.5 cm, 181.1 cm, 181.2 cm, 179.7 cm and 179.8 cm at distances of 2.5 m, 3 m, 3.5 m, 4 m and 4.5 m, respectively.

**Figure 13.** Results of height estimation according to distance between camera and human body.

Figure 14 shows the results of the human-height estimation when a person whose height is 180 cm is standing, walking toward the camera and walking laterally. The person stays at a distance of 2.5 m from the camera while standing and walking laterally. When the person walks toward the camera, the distance from the camera ranges from 2.5 m to 4 m. In the standing state, the human height is estimated as 178.9 cm. The estimated heights are 177.1 cm and 174.9 cm for lateral walking and walking toward the camera, respectively. The magnitude of the estimation error in the lateral walking state is similar to that in the standing state. The estimation error for walking toward the camera is larger than the others because the vertical length of the human body is reduced when the knees bend while walking.

**Figure 14.** Results of height estimation in standing and walking states.

Figure 15 shows the positions of *dh* and (*xf*, *yf*) before and after the correction of the head-top and the foot-bottom points, respectively. The green and red points in Figure 15 represent the head-top and foot-bottom points, respectively. In Figure 15a, the position of *dh* is on the hair area, so *dh* has some error and changes widely from frame to frame, as shown in Figure 16. After *dh* is corrected, the changes are smaller. The changes in the estimated human body height are also reduced after the foot-bottom point is corrected, as shown in Figure 17. Two persons whose actual heights are 182 cm and 165 cm are estimated as 188.6 cm and 181.5 cm before correcting *dh*, as 172.2 cm and 163.3 cm after the correction of the head-top point, and as 181.5 cm and 163.1 cm after the correction of both the head-top and foot-bottom points, respectively.

**Figure 15.** Positions of *dh* and (*xf*, *yf*). (**a**) Before correction; (**b**) after correction.

**Figure 16.** Changes in *dh* according to frame order.

**Figure 17.** Result of height estimation after correction of *dh* and (*xf*, *yf*).

Figure 18 shows the results of the human-height estimation depending on *r*0 and *T*σ, which are the parameters for (8) and (9), when *d*0 is 2000. The estimated height drops sharply when *r*0 is less than or equal to 2 and decreases smoothly when *r*0 is larger than 2. In addition, the estimated height increases linearly when *T*σ is less than or equal to 250 and increases slowly when *T*σ is larger than 250. The body height is estimated most accurately when *r*0 is 2 and *T*σ is 125.

**Figure 18.** Results of human-height estimation depending on parameters for (8) and (9). (**a**) *r*0; (**b**) *T*σ.

Tables 2–4 show the results of the human-height estimation for 10 persons in different body postures. All persons are captured within a range of 2.5 m to 4 m from the camera, with 150 frames per person. The error of the human-height estimation is calculated as follows:

$$\frac{\left|\overline{H} - H_{\text{actual}}\right|}{H_{\text{actual}}},$$

where $\overline{H}$ and $H_{\text{actual}}$ are the average of the human heights corrected by (12) and the actual height of a person, respectively. When the human body region is extracted by mask R-CNN, the errors of the human-height estimation for standing, lateral walking and walking towards the camera are 0.7%, 1.3% and 1.8%, respectively. The accurate foot-bottom point for the human-height estimation is the heel, which is on the ground. However, in the proposed method the bottommost pixel of the body region, which is taken as the foot-bottom point, is usually a toe point. The position difference between the heel and toe points may contribute to the error of the human-height estimation. When the body region is extracted by the background depth image, the errors for standing, lateral walking and walking towards the camera are 2.2%, 2.9% and 4.6%, respectively. The human-height estimation using only depth frames therefore has a larger error than the estimation using both color and depth frames.
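As a minimal sketch, the error measure above can be computed as follows. The function name `height_error_percent` is hypothetical, and the factor of 100 is added here only to express the ratio as a percentage, matching how the errors are reported in the text. The example reuses the averages reported for Figure 12 (actual height 177 cm) as single-value inputs.

```python
def height_error_percent(corrected_heights, actual_height):
    """Relative error |mean(H) - H_actual| / H_actual, in percent."""
    if not corrected_heights:
        raise ValueError("at least one height estimate is required")
    mean_height = sum(corrected_heights) / len(corrected_heights)
    return abs(mean_height - actual_height) / actual_height * 100.0


# Averages reported for Figure 12, actual height 177 cm
print(round(height_error_percent([176.2], 177.0), 2))  # 0.45 (mask R-CNN)
print(round(height_error_percent([172.9], 177.0), 2))  # 2.32 (background depth image)
```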

**Table 2.** Results of human-height estimation by proposed method while standing.



**Table 3.** Results of human-height estimation by proposed method while lateral walking.

**Table 4.** Results of human-height estimation by proposed method while walking towards camera.

