### *2.1. Detector*

In recent years, object detection has improved dramatically with the rise of deep neural networks (DNNs). A two-stage approach that combines region proposal with DNN-based classification was proposed in [4]. This basic approach, called R-CNN, generates bounding boxes in three steps: (i) detecting areas in the image that may contain objects (region proposal); (ii) extracting CNN features from the region candidates; and (iii) classifying objects based on the extracted features. Fast R-CNN [5] also relies on region proposals, but it is more efficient than R-CNN because it pools the CNN features corresponding to each region proposal from a shared feature map rather than running the CNN on every region separately. Faster R-CNN [6] adds a region proposal network (RPN) so that region proposals are generated within the network itself. Current research focuses on one-stage (single-shot) detectors such as the Single Shot MultiBox Detector (SSD) [7] and You Only Look Once (YOLO) [8].
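The three R-CNN steps listed above can be sketched as a short pipeline. The following is only a toy illustration of the control flow, not the method of [4]: all function names are hypothetical, sliding windows stand in for the region proposal algorithm, the mean intensity of a crop stands in for CNN features, and a fixed threshold stands in for the trained classifier.

```python
# Toy sketch of the three R-CNN stages. Every function is a hypothetical
# stand-in chosen for illustration; none of them reproduces the original method.

def propose_regions(image, size=2, stride=2):
    """Stage (i): region proposal (assumption: plain sliding windows)."""
    height, width = len(image), len(image[0])
    return [(x, y, x + size, y + size)
            for y in range(0, height - size + 1, stride)
            for x in range(0, width - size + 1, stride)]

def extract_features(image, box):
    """Stage (ii): feature extraction (assumption: mean intensity, not a CNN)."""
    x1, y1, x2, y2 = box
    pixels = [image[y][x] for y in range(y1, y2) for x in range(x1, x2)]
    return [sum(pixels) / len(pixels)]

def classify(features, threshold=0.5):
    """Stage (iii): classification (assumption: a fixed threshold on the feature)."""
    score = features[0]
    label = "object" if score >= threshold else "background"
    return label, score

def detect(image):
    """Run the three stages once per proposed region and keep the object boxes."""
    detections = []
    for box in propose_regions(image):
        label, score = classify(extract_features(image, box))
        if label == "object":
            detections.append((box, score))
    return detections

# A 4x4 grayscale "image" with one bright patch in the top-right corner.
image = [
    [0.0, 0.0, 0.9, 1.0],
    [0.0, 0.1, 1.0, 0.9],
    [0.0, 0.0, 0.0, 0.0],
    [0.1, 0.0, 0.0, 0.0],
]
detections = detect(image)
print(detections)
```

The loop in `detect` extracts features separately for each proposed region, which mirrors the per-region cost that Fast R-CNN reduces by pooling from a shared feature map.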

Recent work has also focused on high-performance detectors such as M2Det [9] and RetinaNet [10], and on instance segmentation with Mask R-CNN [11]. In this paper, we applied SSD to detect people in the dataset and used a WSPD pre-trained model for self-training.

### *2.2. Pedestrian Detection*

In the past decade, approaches to person detection have improved dramatically. Recent work has proposed configurations that improve recognition and localization, including DNNs, semantic segmentation, combined methods and small image and cloud analysis. However, training these models requires preparing a large dataset and fine-tuning the detector architecture (e.g., SSD or M2Det). Wilson et al. tested whether an object detector can correctly detect pedestrians with different skin colors [12]. In addition, accurately detecting children has been reported to be problematic because their miss rate is higher than that of adults [1]. In this study, we were able to detect pedestrians more reliably than in previous studies.
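To make the miss-rate comparison concrete, the sketch below computes a per-group miss rate as the fraction of ground-truth boxes left without a matching detection, FN / (TP + FN). It is a simplified illustration under stated assumptions: the boxes are invented for the example, the IoU threshold of 0.5 and the greedy one-to-one matching are our choices, and the function names are hypothetical. It does not reproduce the evaluation protocol of [1] or [12].

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def miss_rate(ground_truth, detections, iou_threshold=0.5):
    """Fraction of ground-truth boxes with no matching detection."""
    unused = list(detections)
    missed = 0
    for gt in ground_truth:
        # Greedy matching (assumption): take the best-overlapping unused detection.
        best = max(unused, key=lambda d: iou(gt, d), default=None)
        if best is not None and iou(gt, best) >= iou_threshold:
            unused.remove(best)  # a detection may match only one ground-truth box
        else:
            missed += 1
    return missed / len(ground_truth) if ground_truth else 0.0

# Invented example data: two adults, two children, three detections.
adults_gt = [(0, 0, 10, 30), (20, 0, 30, 30)]
children_gt = [(40, 10, 46, 26), (60, 10, 66, 26)]
detections = [(0, 0, 10, 28), (21, 0, 30, 30), (40, 10, 46, 25)]

adult_mr = miss_rate(adults_gt, detections)
child_mr = miss_rate(children_gt, detections)
print(f"adults: {adult_mr:.2f}, children: {child_mr:.2f}")
```

In this invented example both adults are matched while one of the two children is missed, so the child miss rate is higher. This is the kind of gap the comparison above refers to.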
