**2. Related Work**

#### *2.1. Safety Management in High Places*

Falls from height on construction sites have prompted science-mapping research to reveal the existing research gaps [27]. Common causes of falling accidents are defects in protective devices, poor work organization [28], and workers' unsafe actions, such as sleeping on the baseboard. Ref. [29] built a database that dissects the mechanics of workers falling off the baseboard. An IoT infrastructure, combined with fuzzy markup language for falling objects on construction sites, could greatly help safety managers [30].

Deep learning enhances the automation capabilities of computer vision in safety monitoring [14,23,31]. Convolutional neural networks (CNNs) have shown superior performance in processing high-dimensional data with intricate structures. Related research on using computer vision for worker safety management in working-at-height conditions can be summarized in three aspects:

Aspect 1: To automatically check a worker's safety equipment, such as their helmet and safety belt [31–33]. The detection of safety equipment originated from early feature engineering research, such as the histogram of oriented gradients (HOG) [34]. It has been proposed that combining color thresholding of the helmet with upper-body detection can improve detection accuracy [35]. Deep learning has since been used to build multiple processing layers that extract unknown information without the need to hand-craft image features [36]. In addition, a region-based convolutional neural network (R-CNN) has been used to identify helmets and has achieved good results [37].
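As an illustration of the feature-engineering baseline mentioned above, the core HOG step computes a magnitude-weighted histogram of gradient orientations over an image patch. The sketch below is a simplified toy, not the full pipeline of [34], which additionally groups cells into blocks and normalizes them:

```python
import numpy as np

def orientation_histogram(patch, bins=9):
    """Core HOG step: magnitude-weighted histogram of gradient orientations.

    Simplified sketch; the full HOG descriptor also performs cell/block
    grouping and block normalization.
    """
    gy, gx = np.gradient(patch.astype(float))      # image gradients
    mag = np.hypot(gx, gy)                         # gradient magnitude
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0   # unsigned orientation in [0, 180)
    hist, _ = np.histogram(ang, bins=bins, range=(0.0, 180.0), weights=mag)
    return hist
```

A horizontal intensity ramp, for example, produces gradients pointing entirely along one orientation, so all of the histogram mass falls into a single bin.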

Aspect 2: To automatically identify hazardous areas. Research in this area is generally based on object recognition, covering openings, rims, and groove edges in high places [38–40]. Computer vision has been used to detect whether workers pass through a support structure [39], since workers walking on support structures risk falling.

Aspect 3: To monitor non-compliance with safety regulations [41], in particular climbing scaffolding and lifting workers with a tower crane [42]. One study has addressed the intelligent assessment of interactive work between workers and machines [24]. Research in this area needs to combine computer vision with safety assessment methods so that safety status can be expressed in computer semantics [38,41].

#### *2.2. Computer Vision in Construction*

Scientists designed the CNN to model the primary visual cortex (V1) of the brain, drawing on studies of how neurons in a cat's brain responded to images projected in front of it [43,44]. A CNN imitates the V1 through three fundamental properties [45]:

Property 1: The V1 performs spatial mapping, and a CNN captures this property through two-dimensional feature maps.

Property 2: The V1 includes many simple cells, and the convolution kernel simulates the activity of these simple cells, that is, a linear function of the image within a particular receptive field.

Property 3: The V1 includes many complex cells, which inspire the pooling units of a CNN.
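Properties 2 and 3 can be sketched in plain numpy (an illustrative toy, not a production implementation): a kernel applied linearly over every receptive field plays the role of a simple cell, and max pooling over 2×2 neighborhoods plays the role of a complex cell:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Simple-cell analogue: linear response of one kernel over each receptive field."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2x2(fmap):
    """Complex-cell analogue: tolerance to small shifts via 2x2 max pooling."""
    H, W = fmap.shape
    return fmap[:H // 2 * 2, :W // 2 * 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))
```

For a 4×4 input and a 2×2 kernel, the "valid" convolution yields a 3×3 feature map, and pooling then halves each spatial dimension (discarding the odd remainder).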

Beyond these architectural foundations, two factors still determine the quality of a deep learning object detection model:

**Factor 1:** The dataset used to train the recognition model of workers' unsafe actions should contain unique unsafe-action features that a computer can identify. Many open-source datasets are available for deep learning research, such as ImageNet [46] and COCO [47]; however, not all of them are suitable for object detection on construction sites. Many researchers have therefore established datasets related to construction engineering. For object recognition and detection on construction sites, there are datasets of workers [31]; construction machinery [31]; and on-site structures, such as railings [48]. For workers' activities on construction sites, a dataset dedicated to steelworkers engaged in steel processing activities has been established [49]. Scholars have even enhanced datasets by preprocessing Red-Green-Blue (RGB) images into optical and grayscale images [49], which has provided novel ideas for dataset acquisition.
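As one example of such preprocessing, a grayscale variant of an RGB image is a simple luma-weighted channel mix. The sketch below uses the ITU-R BT.601 weights, a common convention; the exact weights used in [49] are an assumption here:

```python
import numpy as np

def rgb_to_gray(rgb):
    """Luma-weighted grayscale conversion (ITU-R BT.601 weights 0.299/0.587/0.114)."""
    weights = np.array([0.299, 0.587, 0.114])
    return rgb[..., :3].astype(float) @ weights   # (..., 3) -> (...)
```

Because the weights sum to 1.0, a pure-white pixel (255, 255, 255) maps to a grayscale value of 255.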

**Factor 2:** Deep learning algorithms and models, as the mathematical methods for finding optimal solutions, are important because they affect the detection results. LeNet is the foundation of deep learning models [50]; it contains the basic modules of deep learning: the convolutional layer, the pooling layer, and the fully connected layer. AlexNet won the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC2012) [36]. Since then, various neural network models have demonstrated their accuracy and efficiency in feature extraction, including ZF-net [51], VGG-net [52], Res-net [53], and Inception-net [54]. After a feature map is obtained, additional algorithms are needed to classify and locate objects. Faster R-CNN, YOLO, and SSD are widely used deep learning detection algorithms.
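The convolution–pooling–fully-connected stack of LeNet can be illustrated by tracing feature-map sizes through its layers with the standard output-size formula ⌊(n + 2p − k)/s⌋ + 1; the sketch below uses the classic LeNet-5 configuration on a 32×32 input:

```python
def out_size(n, k, s=1, p=0):
    """Spatial size after a conv or pool layer: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

def lenet5_trace(n=32):
    """Feature-map side lengths through the classic LeNet-5 conv/pool stack."""
    sizes = [n]
    n = out_size(n, k=5);       sizes.append(n)   # conv 5x5 -> 28
    n = out_size(n, k=2, s=2);  sizes.append(n)   # pool 2x2 -> 14
    n = out_size(n, k=5);       sizes.append(n)   # conv 5x5 -> 10
    n = out_size(n, k=2, s=2);  sizes.append(n)   # pool 2x2 -> 5
    return sizes                                  # a fully connected layer follows
```

The resulting 5×5 maps are then flattened and fed to the fully connected layers, the third basic module named above.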

In this research, we propose an improved Faster R-CNN based on ZF-net, trained on a dedicated dataset of unsafe actions when working at height. Its application is to provide construction managers with information on workers' unsafe actions and thereby assist with interventions. In addition, it could become a new procedure for behavior-based safety (BBS) observation.
