**1. Introduction**

Physical measurements of a person, such as height, body width, and stride length, are important cues for identifying a person in video. For example, the height of a person captured in surveillance video is important evidence for identifying a suspect. Such measurements are also used to continuously track a specific person in a video surveillance system consisting of multiple cameras [1]. A specific behavior such as falling down can be recognized by detecting changes in human height. Various studies have estimated human height from color video. One approach estimates height by obtaining 3D information about the human body from color video [2–6]; this requires both the position and the pose of the camera. Another approach calculates the ratio between the length of the human body and that of a reference object whose length is already known [7–12]. Height-estimation methods based on color video therefore have the disadvantage of requiring either the camera parameters or information about a reference object.

Depth video stores depth values, that is, the distances between subjects and the camera. Each pixel of a depth video can be converted to 3D coordinates using its depth value. Object detection [13–15] and behavior recognition [16–18] from depth video are possible by extracting 3D features of objects. Recent smartphones recognize a person's face with a built-in time-of-flight (TOF) sensor to verify identity. Object lengths can also be measured from depth video without additional information, so the problems of human-height estimation based on color video can be solved by using depth video.
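As an illustration of how a depth pixel can be mapped to 3D coordinates, the following is a minimal sketch that assumes a pinhole camera model. The function name `pixel_to_3d` and the intrinsic values `FX`, `FY`, `CX`, `CY` are hypothetical and chosen only for this example; the text above does not specify the conversion model, and a real depth sensor may expose this mapping differently.

```python
import numpy as np

def pixel_to_3d(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with depth value `depth` into camera space.

    Assumes a pinhole model: (fx, fy) are focal lengths in pixels and
    (cx, cy) is the principal point. The result has the same unit as `depth`.
    """
    z = float(depth)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

# Hypothetical intrinsics for a 640x480 depth sensor (illustrative values only).
FX, FY, CX, CY = 500.0, 500.0, 320.0, 240.0

# A pixel 100 columns right of the principal point, 2000 mm from the camera.
point = pixel_to_3d(420, 240, 2000.0, FX, FY, CX, CY)
print(point)
```

Because the depth value fixes the scale of the back-projected point, lengths between two such points can be measured directly, without a reference object in the scene.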

The field of artificial intelligence has made significant progress through research on neural network structures consisting of multiple layers. In particular, the convolutional neural network (CNN) [19] has remarkably improved object detection, which categorizes objects and detects their bounding boxes and pixels [20–26].

In this study, a human-height estimation method using depth and color information is proposed. Human-height estimation is improved by extracting the human body and the human head from color information and by measuring human height from depth information. The human body and the human head in the current frame of the color video are extracted through Mask R-CNN [26]. If color images are not available because of a low-light environment, the human body region is instead extracted by comparing the current frame of the depth video with a pre-stored background depth image. The topmost point of the human head region is extracted as the head-top, and the bottommost point of the human body region as the foot-bottom. The head-top and foot-bottom points are converted to 3D real-world coordinates using their image coordinates and depth pixel values. Human height is estimated by calculating the Euclidean distance between the two real-world coordinates.
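The steps described above can be sketched roughly as follows for the low-light case, in which the body region comes from a comparison with a background depth image. This is a simplified illustration under stated assumptions, not the implementation evaluated in this study: all function names, the `min_diff` threshold, and the pinhole intrinsics `FX`, `FY`, `CX`, `CY` are hypothetical, the scene is synthetic, and both end points are taken from the same body mask because no separate head region is available here.

```python
import numpy as np

def body_mask_from_depth(depth, background, min_diff=100.0):
    """Mark valid pixels that are closer than the stored background by min_diff."""
    depth = depth.astype(np.float64)            # avoid unsigned wrap-around
    background = background.astype(np.float64)
    valid = (depth > 0) & (background > 0)      # assume 0 marks a missing value
    return valid & ((background - depth) > min_diff)

def topmost_point(mask):
    """Return (u, v) of the first pixel on the highest occupied row."""
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        raise ValueError("mask is empty")
    i = rows.argmin()
    return int(cols[i]), int(rows[i])

def bottommost_point(mask):
    """Return (u, v) of the first pixel on the lowest occupied row."""
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        raise ValueError("mask is empty")
    i = rows.argmax()
    return int(cols[i]), int(rows[i])

def estimate_height(depth, head_top, foot_bottom, fx, fy, cx, cy):
    """Euclidean distance between the two back-projected points (pinhole assumption)."""
    def to_3d(u, v):
        z = float(depth[v, u])
        return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])
    return float(np.linalg.norm(to_3d(*head_top) - to_3d(*foot_bottom)))

# Hypothetical intrinsics and a synthetic 640x480 scene, depth in millimetres.
FX, FY, CX, CY = 500.0, 500.0, 320.0, 240.0
background = np.full((480, 640), 4000, dtype=np.uint16)
frame = background.copy()
frame[100:400, 300:340] = 2500                  # a "person" standing 2.5 m away

mask = body_mask_from_depth(frame, background)
head_top = topmost_point(mask)
foot_bottom = bottommost_point(mask)
height = estimate_height(frame, head_top, foot_bottom, FX, FY, CX, CY)
print(head_top, foot_bottom, height)
```

When color images are available, the head and body masks would instead come from an instance-segmentation model, with `topmost_point` applied to the head mask and `bottommost_point` to the body mask. In a real frame the lowest mask pixel may fall on the floor or carry a noisy depth value, which this sketch does not handle.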

The proposed method improves human-height estimation by using both color and depth information and by applying Mask R-CNN, which is a state-of-the-art algorithm for object detection. In addition, by using depth information, the proposed method removes the need for the camera parameters or the length of a reference object in human-height estimation.

This paper is organized as follows. Section 2 describes related work on object detection by CNN and on human-height estimation based on color or depth video. Section 3 proposes the human-height estimation method using depth and color videos. Section 4 presents the experimental results of the proposed method. Finally, Section 5 concludes this study.
