1. Introduction
The construction industry has been identified as one of the most hazardous industries, and the nature of construction projects leads to a high incidence of accidents. The interactions between man and machine, man and material, and man and environment make safety management on construction sites complex [
1]. Managers have found that construction workers’ unsafe behaviors are an important cause of accidents on construction sites [
2]. According to statistics, nearly 80% of construction accidents are caused by unsafe behaviors of workers [
3], and 20.6% of fatal industrial workplace accidents in the European Union occur on construction sites [
4]. One important way to prevent accidents is to monitor and manage those unsafe behaviors in real time. Thus, behavior-based safety (BBS) is considered a promising approach to managing unsafe behaviors on construction sites. BBS requires observing and identifying unsafe behaviors on site and then directly providing feedback to the workers [
5,
6]. The traditional way to realize this is manual inspection, which requires substantial manpower and material resources but has limited effectiveness [
7].
In recent years, with the rapid development of artificial intelligence technology, construction industry practitioners have begun to realize its potential in improving construction safety management, especially in monitoring and managing construction workers’ unsafe behaviors. Many automated technologies have been proposed to monitor the behaviors of construction workers on construction sites to improve the efficiency and accuracy of unsafe behavior management [
8,
9,
10,
11,
12]. The most popular way of detecting and identifying workers’ unsafe behaviors is the computer vision-based intelligent monitoring system, which can detect and identify humans or objects in two-dimensional images.
However, most existing research or products have focused only on recognizing workers’ behaviors (i.e., motions) on construction sites, and very limited studies have considered the interaction between man and machine, man and material, or man and environment. In practice, these interactions are essential for judging whether workers’ behaviors are safe from the standpoint of safety management. For example, suppose throwing a hammer is an unsafe behavior on a construction site; if a worker throws rubbish (e.g., a beverage bottle) using very similar motions, it is very difficult to judge whether the worker’s behavior is safe based on the motion recognition result alone. Therefore, identifying unsafe interactions between man and machine/material is necessary and more meaningful, as it can provide more direct and valuable information for safety management. Achieving this goal requires not only recognizing the motion and the objects but also detecting the interaction between them. In other words, decision rules are needed to automatically judge whether unsafe interactions between man and machine/material occur.
Considering the importance of identifying construction workers’ unsafe interactions between man and machine/material and the limitations of existing research, this study aims to develop a method for identifying construction workers’ unsafe behaviors, i.e., unsafe interactions between man and machine/material, based on ST-GCN (for motion recognition) and YOLO (for object detection, including safety signs). In this study, two trained YOLO-based models were used to detect, respectively, safety signs in the workplace and objects that interacted with construction workers. Then, an ST-GCN model was trained to detect and identify construction workers’ behaviors. Lastly, decision rules were formulated, and an algorithm was developed to detect whether unsafe interactions between man and machine/material exist.
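To illustrate how the three components fit together, the minimal Python sketch below applies a decision step to the outputs of the two YOLO models and the ST-GCN model. The function and the rules shown are simplified assumptions for illustration, not the full rule set developed in this study.

```python
# Illustrative sketch of the decision step described above. The inputs are
# the outputs of the two trained YOLO models and the trained ST-GCN model;
# the rules shown here are simplified examples, not the study's full rule set.

def judge_behavior(motion, objects, signs):
    """Judge whether a detected man-machine/material interaction is unsafe.

    motion:  motion label predicted by the ST-GCN model (e.g., "throwing")
    objects: object classes detected by the first YOLO model
    signs:   safety-sign classes detected by the second YOLO model
    """
    if motion == "throwing" and "hammer" in objects:
        return "unsafe: throwing hammer"
    if motion == "crossing" and "railing" in objects and "no crossing" in signs:
        return "unsafe: crossing railing"
    return "safe"

# Example: a throwing motion detected together with a hammer in the scene.
print(judge_behavior("throwing", {"hammer"}, set()))  # -> unsafe: throwing hammer
```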
5. Discussion
At present, limited studies have investigated the identification of unsafe interaction behaviors on construction sites; most research has focused only on motion recognition itself, which might limit its application on real construction sites. This study proposed a new method of identifying construction workers’ unsafe behaviors, i.e., unsafe interactions between man and machine/material, based on ST-GCN and YOLO. Identifying the interaction between man and machine/material and evaluating the risk of behaviors by detecting and recognizing safety signs could improve the practicability of the proposed method, providing more direct and valuable information for safety management.
In this study, objects (hammer, switch, bottle, railing, obstacle, and safety signs) were detected by using YOLO technology, and the performance was very good (see
Table 5). These results were in line with previous studies [
51,
52,
53,
54]. Moreover, YOLO models have advantages in terms of detection speed and low hardware requirements [
55,
56,
57,
58,
59,
60], which could support future real-time monitoring or deployment on lower-end hardware devices. For motion capture, this study utilized OpenPose technology (COCO model) to obtain time-series motion data, which was used for motion identification. In this study, OpenPose had high recognition accuracy. However, when body joints were occluded by objects, the recognition of skeleton keypoints could drift. Compared to other studies using other skeleton keypoint capture techniques (e.g., Kinect) [
41,
61], OpenPose performed significantly better, especially in cases with body occlusions or non-frontal tracking [
62]. In some workplaces, the accuracy of OpenPose in capturing skeleton keypoints is comparable to that of traditional, expensive motion analysis devices [
63]. Therefore, OpenPose has been widely used on construction sites, where complex behaviors exist and workers’ bodies are heavily occluded [
64,
65]. Therefore, YOLO and OpenPose were selected in this study and are recommended as computer vision-based technologies for object identification and motion capture, respectively, at least in application scenarios similar to this study.
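As a concrete illustration of how OpenPose output can feed the recognition stage, the sketch below stacks per-frame COCO-format keypoints (18 joints with x, y, and confidence) into the (channels, frames, joints, persons) tensor layout commonly used by ST-GCN implementations. The exact shapes and preprocessing are assumptions for illustration, not details reported in this paper.

```python
import numpy as np

def build_stgcn_input(keypoint_frames, num_joints=18):
    """Stack per-frame OpenPose (COCO) keypoints into an ST-GCN style tensor.

    keypoint_frames: list of (num_joints, 3) arrays holding x, y, confidence.
    Returns an array of shape (3, T, num_joints, 1) for one tracked person.
    """
    T = len(keypoint_frames)
    data = np.zeros((3, T, num_joints, 1), dtype=np.float32)
    for t, kp in enumerate(keypoint_frames):
        data[0, t, :, 0] = kp[:, 0]  # x coordinates
        data[1, t, :, 0] = kp[:, 1]  # y coordinates
        data[2, t, :, 0] = kp[:, 2]  # keypoint confidence
    return data

# Example: 30 frames of random keypoints for a single tracked worker.
frames = [np.random.rand(18, 3).astype(np.float32) for _ in range(30)]
print(build_stgcn_input(frames).shape)  # (3, 30, 18, 1)
```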
The results of this study show that the performance of motion recognition based on ST-GCN alone was poor. The overall identification accuracies of Throwing, Operating, and Crossing were 51.79%, 61.61%, and 58.04%, respectively (see
Table 6 and
Table 7). The reason is obvious: the motions selected in this study are quite similar. For example, there is nearly no difference in motion characteristics between throwing a hammer and throwing a bottle, or between crossing a railing and crossing an obstacle. Although ST-GCN alone did not perform well in distinguishing similar motions in this study, it is still a recommended technology for motion recognition in a general sense. Many previous studies utilized ST-GCN for non-similar motion recognition and found that it performed well. Cao et al. [
21] identified miners’ unsafe behavior (10 different types of behaviors) based on ST-GCN in their self-built dataset, with an overall identification accuracy of 86.7%. Lee et al. [
65] used ST-GCN to identify 5 different unsafe behaviors of workers, with an overall identification accuracy of 87.20%. The motions in the above studies were quite different in motion characteristics.
Considering the good performance of ST-GCN in recognizing non-similar motions and its poor performance in recognizing similar motions, this study still chose ST-GCN for motion recognition but added and integrated YOLO for object identification. This can improve the identification accuracy in cases where the worker performs similar motions but the objects interacting with the worker differ, since, in practice, those interactions are essential for judging whether workers’ behaviors are safe from the standpoint of safety management. The results of this study show that, compared with using ST-GCN alone, the YOLO-ST-GCN method proposed in this paper greatly improved the identification accuracy. The overall accuracy increased from 51.79% to 85.71%, from 61.61% to 99.11%, and from 58.04% to 100.00% for throwing, operating, and crossing behaviors, respectively, and all interactions between man and objects were well detected and identified. As mentioned above, there is limited research that integrates motion identification with object recognition to detect interaction behaviors between man and machine/material. Liu et al. [
52] studied the interaction between humans and robots based on motion recognition and object recognition and found that people’s behavioral intention depends on the possession of objects, which is consistent with this study. They also used a YOLO model for object recognition and ST-GCN with LSTM for behavior identification, and achieved good recognition results. The difference is that they only used a YOLO model trained on a dataset of handheld objects to detect the interaction, which might perform poorly in the scenario of this study.
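To make this integration concrete, the sketch below shows one simple way a coarse ST-GCN motion label could be refined into an interaction behavior using the object class returned by YOLO. The lookup table and label names are illustrative assumptions rather than the exact rules implemented in this study.

```python
# Hypothetical lookup from (motion label, detected object) to an interaction
# behavior; similar motions are disambiguated by the object they involve.
INTERACTION_RULES = {
    ("throwing", "hammer"): "throwing hammer",
    ("throwing", "bottle"): "throwing bottle",
    ("operating", "switch"): "turning on switch",
    ("operating", "bottle"): "putting bottle",
    ("crossing", "railing"): "crossing railing",
    ("crossing", "obstacle"): "crossing obstacle",
}

def refine_behavior(motion_label, detected_objects):
    """Refine a coarse ST-GCN motion label using YOLO object detections."""
    for obj in detected_objects:
        behavior = INTERACTION_RULES.get((motion_label, obj))
        if behavior is not None:
            return behavior
    return motion_label  # fall back to the coarse motion label

print(refine_behavior("throwing", ["bottle"]))  # -> throwing bottle
```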
To evaluate the effectiveness of other object detection algorithms compared to YOLOv5, we used the latest YOLO-NAS object detection algorithm. The dataset was divided randomly into a training set and a validation set in a ratio of 8:2. The batch size was set to 8, the number of epochs was set to 50, and the learning rate was set to 0.0001. The identification results were drawn into a confusion matrix, as shown in
Figure 12. The comparison results of behavior identification accuracy based on YOLOv5 and YOLO-NAS are shown in
Table 11.
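A minimal sketch of the comparison setup described above is given below, assuming a generic training framework: the 8:2 random split and the stated hyperparameters are reproduced, but the split function and configuration keys are illustrative placeholders, not the exact code used to train YOLO-NAS.

```python
import random

def split_dataset(samples, train_ratio=0.8, seed=0):
    """Randomly split annotated samples into training and validation sets (8:2)."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    cut = int(len(samples) * train_ratio)
    return samples[:cut], samples[cut:]

# Hyperparameters stated above for the YOLO-NAS comparison experiment;
# the configuration keys are generic placeholders, not framework-specific.
TRAIN_CONFIG = {
    "batch_size": 8,
    "epochs": 50,
    "learning_rate": 1e-4,
}

train_set, val_set = split_dataset(list(range(100)))
print(len(train_set), len(val_set), TRAIN_CONFIG)  # 80 20 {...}
```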
For Type I behaviors, the results show that the overall identification accuracy was 91.96%, and the overall accuracies of Throwing and Operating were 83.93% and 100.00%, respectively. The accuracies of throwing hammer, throwing bottle, turning on switch, and putting bottle were 75.00%, 92.86%, 100.00%, and 100.00%, respectively. The evaluation indicators were also calculated.
For Type II behaviors, the results show that the overall identification accuracy was 100.00%, and the accuracies of crossing railing and crossing obstacle were both 100.00%. Crossing railing was set as the positive class, and crossing obstacle as the negative class. The evaluation indicators were also calculated.
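Although the specific indicator values are not reproduced here, the sketch below shows the standard way such indicators (accuracy, precision, recall, and F1) can be computed from a binary confusion matrix once one class is designated positive, as with crossing railing above. The counts in the example are placeholders, not the study's data.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

# Placeholder counts for illustration only (crossing railing = positive class);
# these are not the study's data.
print(binary_metrics(tp=50, fp=0, fn=0, tn=50))
```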
The results show that there is little difference in behavior identification accuracy between YOLOv5 and YOLO-NAS. Although the latest YOLO-NAS offers state-of-the-art object detection with strong accuracy and speed, outperforming other models of the YOLO family such as YOLOv5, YOLOv6, YOLOv7, and YOLOv8 [
66], the performance of YOLOv5 is good enough for this study (i.e., interaction behavior identification based on YOLO-ST-GCN) and can meet the accuracy requirements of object recognition. Many factors can affect the accuracy of object recognition, e.g., occlusion of the object, a low recording frame rate of the camera, and lighting. The influence of these factors may outweigh the improvement from the algorithm itself (i.e., from YOLOv5 to YOLO-NAS). For motion recognition, ST-GCN is based on the coordinates of skeleton keypoints, so accurate keypoint coordinates are very important. However, due to the complexity of human motions and the camera’s blind spots, the recognition results will drift when skeleton keypoints are occluded. This has a certain impact on the results of behavior identification. In the future, multiple depth cameras could be used and their outputs combined to improve the accuracy of the skeleton keypoint coordinates.
This study proposed the YOLO-ST-GCN method for interaction behavior identification, whose foundation is motion and object recognition. This method also has limitations when a worker performs different tasks with similar motions while interacting with the same object. This study added one more task, hammering a nail (see
Figure 13B), which involves a similar motion and the same object as throwing a hammer (see
Figure 13A), to test the performance of the method. The confusion matrix of the behavior identification results is shown in
Figure 14. The overall accuracy was 83.93%, the accuracy of hammering nail was 98.21%, and the accuracy of throwing hammer was 69.64%; the evaluation indicators were also calculated. The results showed that 30.36% of throwing hammer samples were misidentified as hammering nail. Therefore, caution should be taken when using the proposed method in cases like the above.
The limitations of this research need to be acknowledged. Firstly, a more complete dataset for training and testing the models is expected, since a dataset that covers more work tasks, different scenarios, different camera angles, and different lighting conditions could improve the method’s applicability to real construction sites. Secondly, the experimental tasks (i.e., the behaviors in
Table 2) were selected based on field studies, but the participants in this study were recruited as a convenience sample rather than real construction workers. Thirdly, the proposed method still has the limitations discussed in the preceding paragraph, which this study did not overcome.