*3.3. Data Acquisition*

The quality and quantity of the dataset are decisive factors affecting detection accuracy. Many researchers have established datasets for construction safety management, such as datasets used to identify construction machines and workers and to recognize the activities of steel-processing workers. To date, however, there is no dataset covering the characteristics of workers' unsafe actions when working at heights on construction sites; therefore, although several publicly annotated open-source datasets could be used for training and testing deep learning detection models, none of them could be used directly as experimental data for this study. The dataset used in this research needs to include the five types of unsafe actions that workers perform when working at heights: throwing, relying, lying, jumping, and not wearing a helmet. A total of 2200 original sample images were collected. The distribution of the five unsafe-action samples is shown in Table 2.

#### **Table 2.** Sample distribution.


#### *3.4. Model Development*

The detection algorithm is another decisive factor affecting detection accuracy. Deep learning algorithms have received wide attention for their potential to improve construction safety and production efficiency. R-CNN and Fast R-CNN were proposed successively and dramatically improved the accuracy of target recognition. Faster R-CNN, a deep learning algorithm with an "attention" mechanism, introduces the region proposal network (RPN), further shortening the model's training time and the detection network's running time. Built on the convolutional architecture for fast feature embedding (Caffe) framework, Faster R-CNN is mainly composed of two modules. The first module is the RPN, which generates more accurate, high-quality candidate boxes by pre-marking the targets' possible positions. The second module is the Fast R-CNN detector, which refines the target recognition area based on the RPN's candidate boxes.

The entire training process alternates between training the RPN and the Fast R-CNN detector, both of which use the Zeiler and Fergus network (ZF-net) [51]. Figure 3 shows the flowchart of model training for workers' unsafe actions. The procedure to implement the workers' unsafe action detection model is as follows:

Stage 1: Input the training samples into the ZF-net for pretraining. Then, conduct alternate training of the model to obtain the first-stage RPN and Fast R-CNN detector. Alternate training means that the first-stage ZF-net and RPN are obtained in the first round of training, after which the first round of Fast R-CNN detector training uses the training samples and the first-stage RPN.

Stage 2: The second stage of training is almost the same as the first stage, except the input parameters are the results obtained in the first stage of training. After the second stage of training ends, the obtained ZF-net is saved as the final training network.

Stage 3: The validation samples are input, and the final ZF-net is used to adjust and update the RPN and the Fast R-CNN detector. Finally, the proposed detection model is obtained.
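The alternating schedule of Stages 1 to 3 can be summarized as a control-flow sketch. The helper functions below (train_rpn, generate_proposals, train_detector) are hypothetical placeholders standing in for the Caffe/MATLAB training routines, not the paper's actual code; only the ordering of steps is taken from the text.

```python
# Conceptual sketch of the two-stage alternating training described above.
# The helpers are stubs that only illustrate the control flow.

def train_rpn(backbone_weights, samples):
    """Train the RPN starting from the given ZF-net weights; return (updated backbone, rpn)."""
    print("training RPN from", backbone_weights)
    return backbone_weights + "+rpn_tuned", "rpn_model"

def generate_proposals(rpn, samples):
    """Use the current RPN to pre-mark candidate regions on the training samples."""
    print("generating proposals with", rpn)
    return ["proposal_boxes_per_image"]

def train_detector(backbone_weights, samples, proposals):
    """Train the Fast R-CNN detector on the RPN proposals; return (updated backbone, detector)."""
    print("training Fast R-CNN detector from", backbone_weights)
    return backbone_weights + "+det_tuned", "fast_rcnn_detector"

def alternate_training(pretrained_zf, samples, stages=2):
    backbone = pretrained_zf
    rpn = detector = None
    for stage in range(1, stages + 1):
        # Stage 1 starts from the pretrained ZF-net; stage 2 starts from
        # the weights produced by stage 1, as described in the text.
        backbone, rpn = train_rpn(backbone, samples)
        proposals = generate_proposals(rpn, samples)
        backbone, detector = train_detector(backbone, samples, proposals)
        print(f"finished stage {stage}")
    return backbone, rpn, detector  # final ZF-net weights plus the two modules

final_zf, rpn, detector = alternate_training("zf_net_pretrained", samples=["train_images"])
```

After the two stages, the saved ZF-net weights together with the RPN and detector form the model that is then adjusted on the validation samples in Stage 3.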

#### **4. Experiment and Results**

#### *4.1. Experiment*

The data were mostly acquired from graduate students, while a small amount was acquired from workers on a construction site. College students and workers are both adults, and their movements are similar. Since the dataset construction needed to account for image-quality factors, such as illumination conditions and different shooting angles, using students as experimental subjects facilitated the collection of a large number of images. Therefore, we selected students' images as the training set for the model. In total, 31 graduate students, with heights ranging from 158 cm to 181 cm, participated in the experiments. The actions of throwing, relying, lying, jumping, and not wearing a helmet were taken as the unsafe actions. The image sample collection followed 6 principles:


The data collected from construction workers mainly concerned helmet wearing. The total number of participants was approximately 50. Figure 4 shows examples of image samples for the 5 unsafe actions. The samples were labeled using the labeling tool LabelImg, and the annotation files were saved in XML format. The samples were then divided into 3 groups: the training data, the validation data, and the testing data. Finally, the dataset was ready, comprising images of the 5 action types, annotation files, and image sets.
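As an illustration of how such a VOC-style dataset can be prepared, the minimal Python sketch below reads the LabelImg XML annotations and splits the annotation files into the three groups. The class names, directory name, and 70/15/15 split ratios are assumptions for illustration, not the paper's exact settings.

```python
import random
import xml.etree.ElementTree as ET
from pathlib import Path

# Assumed class names for the five unsafe actions (illustrative only).
CLASSES = {"throwing", "relying", "lying", "jumping", "helmet"}

def read_labelimg_xml(xml_path):
    """Return (image filename, [(class, xmin, ymin, xmax, ymax), ...]) from one LabelImg file."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.iter("object"):
        name = obj.findtext("name")
        b = obj.find("bndbox")
        boxes.append((name,
                      int(float(b.findtext("xmin"))), int(float(b.findtext("ymin"))),
                      int(float(b.findtext("xmax"))), int(float(b.findtext("ymax")))))
    return root.findtext("filename"), boxes

def split_dataset(annotation_dir, train=0.7, val=0.15, seed=42):
    """Shuffle the annotation files and split them into training, validation, and testing lists."""
    files = sorted(Path(annotation_dir).glob("*.xml"))
    random.Random(seed).shuffle(files)
    n_train = int(len(files) * train)
    n_val = int(len(files) * val)
    return files[:n_train], files[n_train:n_train + n_val], files[n_train + n_val:]

train_set, val_set, test_set = split_dataset("Annotations")  # assumed directory name
```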

**Figure 4.** Sample examples.

At the implementation stage, the Faster R-CNN algorithm was trained for 20,000 iterations with a learning rate of 0.001. The dataset was the improved VOC 2007 dataset built in Section 3.3. The proposed method was implemented in MATLAB. For the hardware configuration, the model was tested on a computer with an Intel(R) Core(TM) i7-6700 CPU @ 3.40 GHz, 16.0 GB of memory, an NVIDIA GeForce GTX 1080 Ti GPU, and a Windows 10 64-bit OS.
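For readers who wish to reproduce a comparable setup in a modern framework, the sketch below fine-tunes torchvision's built-in Faster R-CNN with the same learning rate and iteration budget. This is a PyTorch analogue rather than the paper's Caffe/MATLAB ZF-net pipeline (it uses a ResNet-50 backbone and joint rather than alternating training), and `data_loader` is assumed to yield VOC-style samples built from the dataset of Section 3.3.

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

NUM_CLASSES = 6  # five unsafe-action classes + background

# Modern stand-in for the paper's ZF-net-based Faster R-CNN (torchvision >= 0.13).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device).train()

optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=0.0005)

def train_iterations(model, data_loader, optimizer, max_iters=20000):
    """Run a fixed number of iterations; `data_loader` yields (images, targets)
    in the torchvision detection format (assumed to exist)."""
    it = 0
    while it < max_iters:
        for images, targets in data_loader:
            images = [img.to(device) for img in images]
            targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
            loss_dict = model(images, targets)   # joint RPN + detector losses
            loss = sum(loss_dict.values())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            it += 1
            if it >= max_iters:
                break
    return model
```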

#### *4.2. Results*

After training, the ZF-net model file for detecting unsafe actions was obtained. The remaining samples were used to test the final model, and the testing results are shown in Figure 5. The average detection time per sample was 0.042 s.

**Figure 5.** Testing results.

Some target detection concepts need to be explained. True positive (TP): the input image is a positive sample and the detection result is also positive. False positive (FP): the input image is a negative sample but the detection result is positive. True negative (TN): the input image is a negative sample and the detection result is negative. False negative (FN): the input image is a positive sample but the detection result is negative. Take the action of relying on a railing as an example: if a relying worker is present and the testing result is "relying", a TP is recorded; if the testing result does not include relying, an FN is recorded. However, if no worker is relying on the railing in the test image but "relying" is detected, an FP is recorded. Figure 6 shows examples of TP, FP, and FN samples.
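These rules for a single action class can be made concrete with a small illustrative function; it is hypothetical and not part of the original implementation.

```python
def classify_outcome(action_present: bool, action_detected: bool) -> str:
    """Classify one test image for a single action class, e.g., 'relying'.

    action_present  -- ground truth: the worker is actually performing the action
    action_detected -- model output: the action was detected in the image
    """
    if action_present and action_detected:
        return "TP"   # relying worker correctly detected as relying
    if action_present and not action_detected:
        return "FN"   # relying worker missed by the detector
    if not action_present and action_detected:
        return "FP"   # "relying" reported although no worker is relying on the railing
    return "TN"       # no relying worker and nothing detected

print(classify_outcome(True, True))    # TP
print(classify_outcome(False, True))   # FP
```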

**Figure 6.** Examples of (**a**) TP, (**b**) FP, and (**c**) FN samples.

The analysis of the TP, TN, FP, and FN test results with the fourfold table is shown in Figure 7, which helps to understand the distribution of the test results.


**Figure 7.** Test results with the fourfold table.

This experiment used four key performance indicators (KPIs) to assess the performance of the model for detecting unsafe actions: (1) accuracy, (2) precision, (3) recall, and (4) F1 measure.

Accuracy, i.e., the ratio of TPs and TNs to the total number of detections, is generally used to evaluate the global accuracy of a trained model. Precision indicates how accurate the positive predictions are, while recall measures how completely the actual positives are covered.

The F1 measure is the harmonic mean of precision and recall, which evaluates the model's overall quality. The higher the F1 measure, the better the model.
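The four indicators follow their standard definitions in terms of the fourfold-table counts:

```latex
\begin{align}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN} \\
F1               &= \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align}
```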

The overall performance of the model is good. The test results were 93.46% for accuracy, 99.71% for precision, 93.72% for recall, and 96.63% for the F1 measure.

#### 4.2.1. Sample Source Analysis

For the sample source analysis, we established a dataset of workers' unsafe actions and trained a CNN model for workers' unsafe action detection. The model's performance on the four indicators was good; the results for all of the indicators were above 90%. However, these performance indicators were based on the examinees included in the dataset, and the images in the dataset had a fairly high degree of similarity. The results can therefore be understood as the model being mainly aimed at workers in the same project and construction scenes. However, unsafe actions outside of the dataset should also be considered, for example, different scenarios and different examinees. Therefore, we tested examinees outside the dataset as a comparison group to analyze whether the model applies to other types of scenarios and workers.

Similar to the previous dataset establishment method, the comparison group examinees were asked to perform the five unsafe actions according to their usual manner of behavior. The shooting scene of the comparison group was entirely different from the scene in the original dataset. The results are shown in Table 3 and Figure 8.

**Table 3.** Test results of the model's overall performance.

**Indicators | Accuracy | Precision | Recall | F1-Measure**

**Figure 8.** Test results of the model's performance on each unsafe action (between the original and comparison group).

We compared the testing results of the comparison group with those of the original dataset. The performance on the comparison group was worse than on the original samples; nevertheless, accuracy remained above 60 percent. The reason can be attributed to individual differences in how the actions are performed. There are a large number of construction sites and a correspondingly large number of workers. An unsafe action detection dataset built for a specific engineering construction site is more suitable for workers who stay at that construction site for a long period. To detect workers' unsafe actions at other engineering projects, rebuilding the unsafe action model dedicated to those projects would be more conducive to observing workers. Environmental interference in the images, such as light, rain, and fog, also affects recognition accuracy.

#### 4.2.2. FP Analysis

In the FP analysis, it was observed that FPs were usually due to the similar characteristics of two unsafe actions. For example, Figure 9 shows an image sequence of a throwing action in which the throwing action was erroneously detected as relying on the railing. At the early stage of the throwing action, the detection result was relying (Figure 9a). As the image sequence progressed, both throwing and relying on the railing were detected (Figure 9b). At the end of the throwing action, a throwing result was detected when the projectile was about to leave the hand (Figure 9c).

**Figure 9.** Testing results analysis for an FP sample. (**a**) At the early stage of the throwing action, the detection result was relying; (**b**) As the image sequence progressed, both throwing and relying on the railing were detected; (**c**) At the end of the throwing action, a throwing result was detected when the projectile was about to leave the hand.

There are two explanations for the phenomenon that appeared above.

First, most of the image samples of the relying action in this study showed workers relying on a railing. When performing sample labeling, the railing was usually included in the labeling box of the relying action. The CNN learns all types of image features, including but not limited to color, shape, and texture. Therefore, when the ZF-net learned action features, it might have mistaken the railing for a feature of the relying action.

Second, the human skeleton recognition method may help to explain this problem. In previous studies, the posture of the human body could be distinguished by the distribution of human bone parameters [61], such as the elbow angle and torso angle. The similarity of the human bone shape parameters in these two actions (throwing and relying) was very high, as shown in Figure 9. When a worker throws an object beside a railing, the throw is accompanied by a relying posture; hence, it is not surprising that such a result was reached. In recent related research, OpenPose has been used as an advanced algorithm that can achieve accurate human pose estimation. The general process by which a CNN learns image features has usually been divided into three layers [57], which can be loosely compared to the way the primary visual cortex (V1) receives and interprets visual information. In the first layer, the learned features usually indicate whether there are edges in a specific direction and position in the image. In the second layer, patterns are detected by discovering specific arrangements of edges, regardless of small changes in edge positions. In the third layer, the patterns are assembled into larger combinations of parts corresponding to familiar objects, and subsequent layers detect the object as a combination of these parts. This is a bottom-up strategy. In contrast to this learning strategy, OpenPose uses part affinity fields (PAFs), which can improve the ability of computer vision in human pose estimation [62]. This provides a new research direction for managing workers' unsafe actions.

#### 4.2.3. Helmet Detection Analysis

For the helmet detection analysis, detecting whether a worker wears a helmet differs from the other unsafe action detections. Detecting only the helmet cannot determine whether the worker is wearing it. Figure 10 gives an intuitive explanation of this conclusion: when a helmet is not being worn, the object can still be detected as a helmet (Figure 10a). However, combining helmet detection with "person" detection can effectively avoid this problem. Therefore, for the helmet-wearing test, if the person and the helmet are detected simultaneously, as shown in Figure 10b, the result is defined as safe; if only one of them appears, it cannot be judged whether an unsafe action occurred (Figure 10a).
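This decision rule can be expressed as a small illustrative function; the label names and confidence threshold below are assumptions rather than the paper's exact implementation.

```python
def helmet_wearing_status(detections, score_threshold=0.8):
    """Apply the helmet-wearing decision rule described above to one image.

    detections -- list of (label, confidence) pairs from the detector; the
                  label names 'person' and 'helmet' are assumed for illustration.
    Returns 'safe' only when a person and a helmet are detected simultaneously;
    a lone helmet or person is not conclusive.
    """
    labels = {label for label, score in detections if score >= score_threshold}
    if "person" in labels and "helmet" in labels:
        return "safe"            # Figure 10b: helmet and person detected together
    if "person" in labels or "helmet" in labels:
        return "undetermined"    # Figure 10a: a lone helmet (or person) is not conclusive
    return "no detection"

print(helmet_wearing_status([("person", 0.95), ("helmet", 0.91)]))  # safe
print(helmet_wearing_status([("helmet", 0.93)]))                    # undetermined
```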

**Figure 10.** Helmet wearing test (**a**) Helmet; (**b**) Helmet & person.

#### **5. Discussion**

This research proposed an automatic identification method based on an improved Faster R-CNN algorithm to detect workers' unsafe actions in a working-at-heights environment. Based on this method, a series of experiments involving five types of unsafe actions was designed and carried out to examine the method's efficiency and accuracy. The results illustrate and verify the method's feasibility for improving safety inspection and supervision, as well as its limitations. According to the experiments and results, the proposed method is an effective way to detect workers' unsafe behaviors.

Compared with previous studies that have utilized computer vision for construction management, this research has the following advantages. This work combines human knowledge with computer semantics in the context of high workplaces, thus leading to better intervention in construction safety management when workers work at heights. For the observation and recording part of BBS, manual observation wastes human resources and cannot capture workers' action information comprehensively; computer vision has an advantage in this respect. Computer vision has been utilized to observe workers, including the efficiency of their activities [49], helmet wearing [37], and even construction activities at night-time [63]. These works are significant to project management and achieved good results. However, accidents are most likely to occur when workers work at heights, and there is a lack of research on observing workers' behavior in high working environments. The proposed method targets workers' observation in high workplaces and is supported by a dataset of unsafe actions that commonly occur in high places on construction sites. It may provide a more reliable method for observing workers' behavior in high places.

Three factors affect the robustness of the model:


This research also has application limitations. It lacks a large-scale dataset: approximately 2200 images were used for CNN model training, which is relatively small. The quality and quantity of images in the dataset affect the performance of the model. Future studies will consider further dataset improvement to enhance the robustness of the model.

#### **6. Conclusions**

Most construction sites are equipped with cameras to observe their safety status. However, manual observation is laborious and may not accurately capture workers' unsafe behavior. The time-saving and intelligent advantages of computer vision technology could help construction safety management when workers work at heights. This paper proposed a deep learning model that can automatically detect unsafe actions when working in a high place. To achieve this, the workers' unsafe actions worth observing and detecting were defined first. A dataset including five workers' unsafe actions was then built for deep learning. Finally, an automatic recognition model was built, trained, validated, and tested on the unsafe actions. The model's accuracy in detecting throwing, relying, lying, jumping, and helmet wearing was 90.14%, 89.19%, 97.18%, 97.22%, and 93.67%, respectively.

This work combines human knowledge with computer semantics, thus leading to better intervention in construction safety management when workers work at heights. Its contribution is to enable computers to identify the unsafe behavior of workers in high working environments. Its application is that it can intelligently provide information about workers' unsafe actions to safety managers and assist with intervention. In addition, an unsafe action detection dataset built for a specific engineering construction site is more suitable for workers who engage with that construction site for a long time. It could become a new means for the BBS observation procedure. All of these would benefit workers, managers, and supervisors working in hazardous construction environments.

Since the research mainly focuses on whether unsafe actions can be well detected, not all scenarios in which workers perform these actions were considered in the dataset production process. According to the rules governing accident occurrence, an accident is often coupled with multiple factors. Although unsafe actions are an essential factor in accidents, they do not lead to accidents directly but only in specific scenarios.

**Author Contributions:** Y.B. proposed the conceptualization and wrote this manuscript under the supervision of Q.H. L.H. provided the resources. J.H. and H.W. conducted the experiment planning and setup. G.C. provided valuable insight in preparing this manuscript. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the National Natural Science Foundation of China (52178357), the Sichuan Science and Technology Program (2021YFSY0307, 2021JDRC0076), and the Graduate Research and Innovation Fund Program of SWPU (2020cxyb009).

**Data Availability Statement:** All data generated or analyzed during the study are included in the submitted article.

**Conflicts of Interest:** The authors declare no conflict of interest.
