**2. Related Works**

#### *2.1. Object Detection from Color Information by Convolutional Neural Network*

Object detection problems in color images can generally be classified into four categories: classification, localization, detection and object segmentation. First, classification determines an object category for a single object in the image. Second, localization finds the boundary box for a single object in the image. Third, detection finds the boundary boxes and determines the object categories for multiple objects. Finally, object segmentation finds the pixels belonging to each object. A CNN can solve all categories of object detection problems. A CNN replaces the weights of a neural network with kernels, which are rectangular filters. Generally, object detection methods based on CNNs are classified as 1-stage and 2-stage methods [27]. The 1-stage method performs the localization and the classification at once. The 2-stage method performs the classification after the localization. The 1-stage method is faster than the 2-stage method but is less accurate. R-CNN [20] is the first proposed method for detection through a CNN. R-CNN applies a selective search algorithm to find the boundary boxes with a high probability of containing an object. The selective search algorithm constructs the boundary boxes by connecting adjacent pixels with similar texture, color and intensity. The feature map of each boundary box is extracted through AlexNet, and the object is classified through an SVM (support vector machine). R-CNN has the disadvantages that the object detection is very slow and the SVM should be trained separately from the CNN. Fast R-CNN [21] has higher performance than R-CNN. Fast R-CNN applies a RoIPool algorithm and introduces a softmax classifier instead of the SVM, so the feature map extraction and the classification are integrated into one neural network. Faster R-CNN [22] replaces the selective search algorithm with a region proposal network (RPN), so the whole process of object detection is performed in one CNN. YOLO [23,24] is a 1-stage method for object detection. YOLO defines the object detection problem as a regression problem. YOLO divides an input image into grid cells of a certain size. The boundary boxes and the reliability of the objects are predicted for each cell at the same time. YOLO detects objects more quickly than the 2-stage methods but is less accurate. SSD [25] allows various sizes of grid cells in order to increase the accuracy of object detection. Mask R-CNN [26] is proposed for object segmentation, unlike the other R-CNNs.

#### *2.2. Object Length Measurement from Color or Depth Information*

Length estimation methods based on color video are classified into estimation by camera parameters [2–6], by vanishing points [7–12], by prior statistical knowledge [28,29], by gaits [30,31] and by neural networks [32,33]. The length estimation methods using camera parameters generate a projection model from 3D space onto the color image using the focal length, the height and the pose of a camera. The object length is estimated by converting the 2D pixel coordinates of the image into 3D coordinates through the projection model. These methods have the disadvantage that accurate camera parameters should be provided in advance. In order to overcome this disadvantage, Liu [2] introduces an estimation method for the camera parameters using prior knowledge about the distribution of relative human heights. Cho [6] proposes an estimation method for the camera parameters by tracking the poses of the human body over a sequence of frames. The length estimation methods using vanishing points rely on the principle that several parallel lines in 3D space meet at one point in a 2D image. The vanishing point is found by detecting straight lines in the image. The length ratio between two objects can be calculated using the vanishing points. If the length of one object is known in advance, then the length of another object can be calculated from the length ratio. Criminisi [7] introduces a length estimation method given the vanishing points of the ground. Fernanda [8] proposes a detection method for the vanishing points by clustering the straight lines iteratively without camera calibration. Jung [9] proposes a method of detecting the vanishing points for color videos captured by multiple cameras. Viswanath [10] proposes an error model for the human-height estimation by the vanishing points. The error of the human-height estimation is corrected by the error model. Rother [11] detects the vanishing points by tracking specific objects such as traffic signs over a sequence of frames. Pribyl [12] estimates the object length by detecting such specific objects. Human height can also be estimated from prior statistical knowledge of human anthropometry [28,29] or from gaits [30,31]. In recent years, estimation studies in various fields have achieved great success by applying neural networks. Neural networks have also been applied to human-height estimation. Gunel [32] proposes a neural network for predicting the relationship between the proportions of human joints and human height. Sayed [33] estimates human height by a CNN using the length ratio between the human body width and the human head size.

Since depth video has distance information from the depth camera, the distance between two points in a depth image can be measured without the camera parameters or the vanishing points. Many studies [34–36] extract a skeleton, which is the connection structure of human body parts, from the depth image for human-height estimation. However, the human body region extraction is somewhat inaccurate due to noise in the depth video.

## **3. Proposed Method**

In this study, we propose a human-height estimation method using color and depth information. It is assumed that a depth camera is fixed in a certain position. Color and depth videos are captured by the depth camera. Then, a human body and a human head are extracted from the current frame of the color or depth video. A head-top point and a foot-bottom point are found in the human head and the human body regions, respectively. The head-top and foot-bottom points are converted into 3D real-world coordinates using the corresponding pixel values of the frame in the depth video. Human height is estimated by calculating the distance between the two real-world coordinates. The flow of the proposed method is shown in Figure 1.

**Figure 1.** Flow of proposed method.

#### *3.1. Human Body Region Extraction*

It is important to detect the human body region accurately in order to estimate human height precisely. Most frames in depth video contain much noise, so it is difficult to extract an accurate human body region from the depth frame. In contrast, color video allows the human body region to be detected accurately by a CNN. In the proposed method, mask R-CNN [26] is applied to extract the accurate human body region from the current frame of the color video. Then, the human body region is mapped to the current frame of the depth video. If color video is not available for extracting the human body region, then the human body region is extracted from the current frame of the depth video directly. In this case, the human body region is extracted by comparing the current depth frame with a pre-captured background depth image. Figure 2 shows the flow of extracting the human body region in the proposed method.

**Figure 2.** Flow of human body region extraction. (**a**) From color information; (**b**) from depth information.

#### 3.1.1. Human Body Region Extraction Using Color Frames

Mask R-CNN [26] consists of three parts: a feature pyramid network (FPN) [37], a residual network (ResNet) [38,39] and an RPN. The FPN detects the categories and the boundary boxes of objects in the color video. ResNet extracts an additional feature map from each boundary box. Figure 3 shows the process of extracting the human body and the human head using mask R-CNN.

**Figure 3.** Processes of extracting human body and human head regions using mask R-CNN.

The FPN condenses the scale of the input frame through bottom-up layers and expands the scale through top-down layers. Objects of various scales can be detected through the FPN. ResNet introduces a skip connection algorithm in which the output value of each layer feeds into the next layer and also directly into layers two or more hops away. The skip connection algorithm reduces the number of weight values to be learned in the layers, so the learning efficiency of ResNet is improved. The feature map is extracted from the frame through the FPN and ResNet. The RPN extracts the boundary boxes and the masks, which are the object areas in rectangles and in pixels, respectively. Compared with faster R-CNN, which applies RoIPool, mask R-CNN extends the RPN to extract not only the boundary boxes but also the masks, and replaces RoIPool with RoIAlign. RoIPool rounds off the coordinates of the boundary box to integers. In contrast, RoIAlign allows floating-point coordinates. Therefore, the detection of object areas by mask R-CNN is more precise than that by faster R-CNN. Figure 4 shows an example of calculating the coordinates of the regions of interest (RoIs) for detecting the boundary box and the masks by RoIPool and RoIAlign. Non-max suppression (NMS) removes overlapping areas between the boundary boxes. The size of the overlapping area is calculated for each boundary box. Two boundary boxes are merged if the overlapping area is more than 70%.

**Figure 4.** Example of region of interest (RoI) coordinate calculation by RoIPool and RoIAlign. (**a**) RoIPool; (**b**) RoIAlign.

Table 1 shows the performance of mask R-CNN when various types of backbones are applied. When X-101-FPN is applied as the backbone of mask R-CNN, the average precision of the boundary boxes (Box AP), which is a metric for detecting the boundary box, is the highest. However, the training and detection times are the slowest. In consideration of the tradeoff between the accuracy and the detection time, the proposed method applies ResNet-50 FPN, which consists of 50 convolutional layers, as the backbone.



The human body and the human head are detected by mask R-CNN. Mask R-CNN is trained using 3000 images of the COCO dataset [40] with annotations for the human body and the human head. In training mask R-CNN, the learning rate and the number of epochs are set to 0.001 and 1000, respectively. A threshold for the detection of the human body and the human head is set to 0.7. If the detection score of an RoI is more than the threshold, then the corresponding RoI is detected as the human body or the human head. The process of extracting the human body and human head regions through mask R-CNN is as follows.
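As an illustration of this detection step, the following is a minimal sketch, not the authors' code. It assumes Detectron2's Mask R-CNN implementation with a ResNet-50 FPN backbone; the fine-tuned weight file name and the two-class layout (0: human body, 1: human head) are assumptions for illustration only.

```python
# Sketch of body/head detection with Detectron2's Mask R-CNN (illustrative, not the paper's code).
import numpy as np
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
# ResNet-50 FPN backbone, as chosen from Table 1.
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 2           # assumed class layout: 0 = body, 1 = head
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.7   # detection threshold from the paper
cfg.MODEL.WEIGHTS = "body_head_model.pth"     # hypothetical fine-tuned weights
predictor = DefaultPredictor(cfg)

def extract_body_and_head(color_frame: np.ndarray):
    """Return binary masks (H x W, bool) for the human body and the human head, or None."""
    instances = predictor(color_frame)["instances"].to("cpu")
    classes = instances.pred_classes.numpy()
    masks = instances.pred_masks.numpy()
    body = masks[classes == 0][0] if (classes == 0).any() else None
    head = masks[classes == 1][0] if (classes == 1).any() else None
    return body, head
```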


#### 3.1.2. Human Body Region Extraction Using Depth Frames

If the depth camera is fixed in a certain position, then the pixels of the human body region in the depth frame have different values from the depth pixels of the background. Therefore, the body region can be extracted by comparing depth pixels between the depth frame and the background depth image, which holds depth information about the background. In order to extract the human body region accurately, the background depth image should be generated from several depth frames that capture the background, because depth video includes temporary noise. The depth value at a certain position of the background depth image is determined as the minimum value among the pixels at the corresponding position of the depth frames capturing the background.

The human body region is extracted by comparing the pixels between the depth frame and the background depth image. A binarization image *B* is generated for the human body region extraction as follows:

$$B(x, y) = \begin{cases} 1, & d_b(x, y) - d(x, y) > T_b \\ 0, & \text{otherwise,} \end{cases} \tag{1}$$

where *db*(*x*, *y*) and *d*(*x*, *y*) are the depth pixel values of the background depth image and the depth frame at position (*x*, *y*), respectively, and *Tb* is a threshold for the binarization.
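The following is a small sketch of this step, assuming the depth frames are NumPy arrays in millimeters with 0 marking invalid pixels; the threshold value passed as `t_b` is an illustrative placeholder, not a value from the paper.

```python
# Sketch of background generation and the binarization in Equation (1).
import numpy as np

def build_background(depth_frames: list[np.ndarray]) -> np.ndarray:
    """Per-pixel minimum over several frames that capture only the background."""
    return np.min(np.stack(depth_frames), axis=0)

def binarize_body(depth: np.ndarray, background: np.ndarray, t_b: float = 100.0) -> np.ndarray:
    """B(x, y) = 1 where the background is deeper than the current frame by more than T_b."""
    valid = (depth > 0) & (background > 0)       # ignore missing depth measurements
    return ((background - depth) > t_b) & valid
```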

#### *3.2. Extraction of Head Top and Foot Bottom Points*

The topmost point of the human head region is extracted as the head-top, (*xh*, *yh*), and the bottommost point of the human body region as the foot-bottom, (*xf*, *yf*). If horizontally continuous pixels exist as shown in Figure 5, then the head-top or the foot-bottom is extracted as the center point among these pixels. If the human stands with legs apart as shown in Figure 6, then two separate regions may be found at the bottommost part of the human body region. In this case, the center points of the two regions are the candidates for the foot-bottom. The candidate whose depth value is closer to the depth pixel value of the head-top point is selected as the foot-bottom point.

**Figure 5.** Extracting the head-top and foot-bottom points.

**Figure 6.** Extracting the foot-bottom point in case of apart human legs.
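An illustrative sketch of this extraction is given below; it is not the authors' code. The head-top is taken as the center of the topmost row of the head mask, the foot-bottom candidates are the centers of the horizontally continuous runs in the bottommost row of the body mask, and the candidate whose depth is closer to the head-top depth is kept.

```python
# Sketch of head-top and foot-bottom extraction from binary masks (Section 3.2).
import numpy as np

def topmost_center(mask: np.ndarray) -> tuple[int, int]:
    ys, xs = np.nonzero(mask)
    y_top = ys.min()
    row = np.sort(xs[ys == y_top])
    return int(np.median(row)), int(y_top)                 # (x_h, y_h)

def bottommost_candidates(mask: np.ndarray) -> list[tuple[int, int]]:
    ys, xs = np.nonzero(mask)
    y_bot = ys.max()
    row = np.sort(xs[ys == y_bot])
    # split the bottommost row into horizontally continuous runs (two runs when legs are apart)
    runs = np.split(row, np.where(np.diff(row) > 1)[0] + 1)
    return [(int(np.median(r)), int(y_bot)) for r in runs]

def select_foot_bottom(candidates, depth: np.ndarray, head_top) -> tuple[int, int]:
    """Keep the candidate whose depth is closest to the head-top depth."""
    xh, yh = head_top
    return min(candidates, key=lambda p: abs(float(depth[p[1], p[0]]) - float(depth[yh, xh])))
```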

#### *3.3. Human Height Estimation*


Human height is estimated by measuring the length in the 3D real world between the head-top and foot-bottom points. In order to measure this length in the real world, the 2D image coordinates of the head-top and the foot-bottom are converted into 3D real-world coordinates by applying a pinhole camera model [41] as follows:

$$\begin{aligned} X &= \frac{(x - W/2)}{f} d(x, y) \\ Y &= \frac{(y - H/2)}{f} d(x, y) \\ Z &= d(x, y), \end{aligned} \tag{2}$$


where *X*, *Y*, *Z* are the real-world coordinates, *f* is the focal length of the depth camera, which is a parameter of the depth camera, and *W* and *H* are the horizontal and vertical resolutions of the depth image, respectively. In (2), the origin of the image coordinate system is the top-left of the image, but the origin of the 3D camera coordinate system is the camera center. In order to compensate for the difference in the position of the origin between the two coordinate systems, the coordinates of the image center are subtracted from the image coordinates. The real-world coordinates of the head-top (*Xh*, *Yh*, *Zh*) and of the foot-bottom (*Xf*, *Yf*, *Zf*) are calculated by substituting the image coordinates and the depth values of the head-top and the foot-bottom into (2), respectively, as follows:

$$\begin{aligned} X_h &= \frac{(x_h - W/2)}{f} d_h \\ Y_h &= \frac{(y_h - H/2)}{f} d_h \\ Z_h &= d_h, \end{aligned} \tag{3}$$

$$\begin{aligned} X_f &= \frac{(x_f - W/2)}{f} d_f \\ Y_f &= \frac{(y_f - H/2)}{f} d_f \\ Z_f &= d_f, \end{aligned} \tag{4}$$

where *dh* and *df* are the depth values of the head-top and the foot-bottom, respectively. Human height is estimated by calculating the Euclidean distance between the real-world coordinates of the head-top and the foot-bottom as follows:

$$\begin{aligned} H &= \sqrt{(X_h - X_f)^2 + (Y_h - Y_f)^2 + (Z_h - Z_f)^2} \\ &= \sqrt{\left(\frac{(x_h - W/2)d_h - (x_f - W/2)d_f}{f}\right)^2 + \left(\frac{(y_h - H/2)d_h - (y_f - H/2)d_f}{f}\right)^2 + (d_h - d_f)^2}. \end{aligned} \tag{5}$$

The unit of the human height estimated by (5) is the same as the unit of the depth pixels. If the pixels of the depth video store the distance in millimeters, then the unit of *H* is millimeters.
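The following is a worked sketch of Equations (2)–(5): the head-top and foot-bottom points are back-projected to camera coordinates with the pinhole model and their Euclidean distance is taken. The focal length, resolution and example pixels used below are illustrative values only.

```python
# Sketch of the height estimation from two image points and their depths (Equations (2)-(5)).
import math

def to_world(x: int, y: int, depth_mm: float, f: float, w: int, h: int):
    """Pinhole back-projection of image point (x, y) with depth d(x, y)."""
    X = (x - w / 2) / f * depth_mm
    Y = (y - h / 2) / f * depth_mm
    Z = depth_mm
    return X, Y, Z

def estimate_height(head_top, foot_bottom, d_h, d_f, f=580.0, w=640, h=480) -> float:
    """Height in the same unit as the depth values (millimeters here)."""
    Xh, Yh, Zh = to_world(*head_top, d_h, f, w, h)
    Xf, Yf, Zf = to_world(*foot_bottom, d_f, f, w, h)
    return math.dist((Xh, Yh, Zh), (Xf, Yf, Zf))

# Example: head-top at (320, 60), foot-bottom at (322, 430), both roughly 3 m away.
print(estimate_height((320, 60), (322, 430), d_h=3000.0, d_f=3050.0))
```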

The human height estimated by (5) may have an error. One source of the error is noise in *dh*. Generally, (*xh*, *yh*) may lie in a hair area. The depth values in the hair area have large noise because the hair causes diffuse reflection of the infrared rays emitted by the depth camera. Therefore, *dh* should be corrected to the depth value of a point which is close to the head top but is not in the hair area. The depth value of a point which is not in the hair area has a high similarity to the depth values of its neighboring pixels. The similarity is obtained by calculating the variance of the pixels located within *r* pixels to the left, right and bottom, including the corresponding pixel, as follows:

$$\sigma_r^2 = \frac{1}{(r+1)(2r+1)} \sum_{i=0}^{r} \sum_{j=-r}^{r} d(x+j, y+i)^2 - \left( \frac{1}{(r+1)(2r+1)} \sum_{i=0}^{r} \sum_{j=-r}^{r} d(x+j, y+i) \right)^2. \tag{6}$$

If σ*r*² is smaller than *T*σ, then *dh* is corrected to the depth value of the corresponding pixel as shown in Figure 7. Otherwise, the point one pixel below is examined and its similarity is calculated by (6). In (6), *r* becomes smaller as *dh* becomes larger, because an object appears wider in the image as its distance from the camera becomes closer, as follows [42]:

$$\frac{P_1}{P_2} = \frac{d_2}{d_1}, \tag{7}$$

where *P1* and *P2* are the pixel lengths of the object width when the depth values are *d1* and *d2*, respectively. Therefore, *r*, which depends on the depth value, is determined as follows:

$$r = \frac{d_0}{d_h} r_0. \tag{8}$$

In (8), *d0* and *r0* are constants, so *d0* × *r0* can be regarded as a single parameter. If *d0* × *r0* is represented as γ, then (6) is modified as follows:

$$\begin{aligned} \sigma_r^2 &= \frac{1}{(\gamma/d_h+1)(2\gamma/d_h+1)} \sum_{i=0}^{\gamma/d_h} \sum_{j=-\gamma/d_h}^{\gamma/d_h} d(x+j, y+i)^2 \\ &\quad - \left( \frac{1}{(\gamma/d_h+1)(2\gamma/d_h+1)} \sum_{i=0}^{\gamma/d_h} \sum_{j=-\gamma/d_h}^{\gamma/d_h} d(x+j, y+i) \right)^2. \end{aligned} \tag{9}$$

**Figure 7.** Flow of correcting *dh* by calculating similarity to neighboring pixels.
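A sketch of this correction, following the flow in Figure 7, is shown below; it is not the authors' implementation. The values of γ (`gamma`) and *T*σ (`t_sigma`) are illustrative assumptions, not the paper's settings.

```python
# Sketch of the head-top depth correction (Equations (6)-(9), Figure 7): walk downward
# from the head-top until the local window variance falls below T_sigma, then take
# that pixel's depth as the corrected d_h.
import numpy as np

def corrected_head_depth(depth: np.ndarray, x_h: int, y_h: int,
                         gamma: float = 3000.0, t_sigma: float = 400.0) -> float:
    h, w = depth.shape
    y = y_h
    while y < h:
        d_h = float(depth[y, x_h])
        if d_h > 0:
            r = max(1, int(round(gamma / d_h)))                       # Equation (8): r = gamma / d_h
            window = depth[y:y + r + 1, max(0, x_h - r):x_h + r + 1]  # left, right and bottom neighbors
            if window.astype(np.float64).var() < t_sigma:             # Equation (6)/(9): E[d^2] - E[d]^2
                return d_h                                            # low variance -> not on the hair area
        y += 1                                                        # otherwise move one pixel down
    return float(depth[y_h, x_h])                                     # fallback: keep the original value
```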

Mask R-CNN occasionally detects a slightly wider human body region than the actual region. In particular, the region detection error in the lower part of the human body may cause a point on the ground to be extracted as the foot-bottom. Assuming the ground is flat, the difference in the depth gradient between two vertically adjacent pixels in the ground area is small. The depth gradient of a certain pixel (*x*, *y*) is defined as follows:

$$d_g(x, y) = d(x+1, y) - d(x, y). \tag{10}$$

If a certain point is on the ground, then its depth gradient is the same as that of the neighboring ground pixel. In order to determine whether the extraction of the foot-bottom is correct, the two depth gradients are compared as follows:

$$\begin{aligned} D &= d_g(x_f - 1, y_f) - d_g(x_f, y_f) \\ &= \left(d(x_f, y_f) - d(x_f - 1, y_f)\right) - \left(d(x_f + 1, y_f) - d(x_f, y_f)\right) \\ &= 2d(x_f, y_f) - d(x_f - 1, y_f) - d(x_f + 1, y_f). \end{aligned} \tag{11}$$

If *D* is 0, then the point is removed from the human body region. The comparison of the depth gradients by (11) is applied to the bottommost pixels of the human body region in order to correct the foot-bottom. If all of the bottommost pixels are removed, then this process is repeated for the pixels of the human body region one row up. The foot-bottom is extracted as the center pixel among the bottommost pixels which are not removed. Figure 8 shows the correction of the foot-bottom point position.

**Figure 8.** Foot bottom point correction.
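A sketch of this correction is given below; it is not the authors' code. Pixels in the bottommost row whose second depth difference *D* is zero are treated as ground and removed, and the search moves up one row until some pixels survive. A small tolerance is used instead of an exact zero test because of depth noise; this tolerance is an assumption, not a value from the paper.

```python
# Sketch of the foot-bottom correction (Equations (10)-(11), Figure 8).
import numpy as np

def correct_foot_bottom(body_mask: np.ndarray, depth: np.ndarray, tol: float = 1.0):
    ys, _ = np.nonzero(body_mask)
    for y in range(int(ys.max()), int(ys.min()) - 1, -1):      # bottommost row, then upward
        xs = np.nonzero(body_mask[y])[0]
        kept = []
        for x in xs:
            if x - 1 < 0 or x + 1 >= depth.shape[1]:
                continue
            # D = 2 d(x, y) - d(x-1, y) - d(x+1, y), Equation (11)
            d_val = 2.0 * depth[y, x] - depth[y, x - 1] - depth[y, x + 1]
            if abs(d_val) > tol:                                # not ground -> keep the pixel
                kept.append(int(x))
            else:
                body_mask[y, x] = False                         # remove the ground pixel
        if kept:
            return int(np.median(kept)), y                      # corrected (x_f, y_f)
    return None
```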

Noise in the depth video may cause a temporary error in the human-height estimation. The temporary error is corrected by averaging the estimated human heights over a sequence of depth frames as follows:

$$\begin{aligned} \overline{H}(n+1) &= \frac{1}{n+1} \sum_{i=1}^{n+1} H(i) \\ &= \frac{n}{n+1} \left\{ \frac{1}{n} \sum_{i=1}^{n} H(i) \right\} + \frac{1}{n+1} H(n+1) \\ &= \frac{n}{n+1} \overline{H}(n) + \frac{1}{n+1} H(n+1), \end{aligned} \tag{12}$$

where *n* is the index of the captured depth frame, and *H*(*n*) and *H̄*(*n*) are the estimated and corrected human heights for the *n*th frame, respectively.
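The following is a minimal sketch of Equation (12): the running average of per-frame height estimates, updated incrementally so that earlier frames do not need to be stored. The example estimates are illustrative values only.

```python
# Sketch of the running-average correction of the estimated height (Equation (12)).
class HeightAverager:
    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0

    def update(self, height: float) -> float:
        """Fold the nth per-frame estimate H(n) into the corrected height H_bar(n)."""
        self.n += 1
        self.mean += (height - self.mean) / self.n
        return self.mean

# Example: three noisy per-frame estimates in millimeters.
avg = HeightAverager()
for h in (1712.0, 1698.0, 1705.0):
    print(avg.update(h))
```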
