Article

Multi-Task Intelligent Monitoring of Construction Safety Based on Computer Vision

Lingfeng Liu, Zhigang Guo, Zhengxiong Liu, Yaolin Zhang, Ruying Cai, Xin Hu, Ran Yang and Gang Wang
1 Shenzhen Municipal Group Co., Ltd., Shenzhen 518000, China
2 CCCC Property (Hainan) Co., Ltd., Sanya 572000, China
3 Key Laboratory for Resilient Infrastructures of Coastal Cities, Shenzhen University, Shenzhen 518060, China
4 School of Urban Construction Engineering, Chongqing Technology and Business Institute, Chongqing 400067, China
5 CCDC Shuyu Engineering Construction Co., Ltd., Chengdu 610000, China
* Author to whom correspondence should be addressed.
Buildings 2024, 14(8), 2429; https://doi.org/10.3390/buildings14082429
Submission received: 5 July 2024 / Revised: 25 July 2024 / Accepted: 30 July 2024 / Published: 6 August 2024
(This article belongs to the Special Issue Intelligence and Automation in Construction Industry)

Abstract

Effective safety management is vital for ensuring construction safety. Traditional safety inspections in construction heavily rely on manual labor, which is both time-consuming and labor-intensive. Extensive research has been conducted integrating computer-vision technologies to facilitate intelligent surveillance and improve safety measures. However, existing research predominantly focuses on singular tasks, while construction environments necessitate comprehensive analysis. This study introduces a multi-task computer vision technology approach for the enhanced monitoring of construction safety. The process begins with the collection and processing of multi-source video surveillance data. Subsequently, YOLOv8, a deep learning-based computer vision model, is adapted to meet specific task requirements by modifying the head component of the framework. This adaptation enables efficient detection and segmentation of construction elements, as well as the estimation of person and machine poses. Moreover, a tracking algorithm integrates these capabilities to continuously monitor detected elements, thereby facilitating the proactive identification of unsafe practices on construction sites. This paper also presents a novel Integrated Excavator Pose (IEP) dataset designed to address the common challenges associated with different single datasets, thereby ensuring accurate detection and robust application in practical scenarios.

1. Introduction

Due to the complex and ever-changing environment of construction sites, the incidence of safety accidents remains consistently high. Between 2010 and 2020, there were 6705 production safety accidents in housing and municipal engineering projects in China, resulting in 7988 deaths [1]. Construction safety accidents not only cause casualties and economic losses but also reduce the industry’s attractiveness, significantly impacting the healthy development of the construction sector. Currently, over 80% of construction safety accidents are attributed to unsafe behaviors on construction sites [2,3,4]. Therefore, researching effective methods to enhance safety management at construction sites is of crucial importance for the sustainable and healthy development of the construction industry.
Currently, computer vision technology is widely applied in construction safety management. This technology encompasses a range of complex tasks, such as object detection, object segmentation, pose estimation, and object tracking, all of which can be effectively utilized in construction safety management. By leveraging these techniques, automated visual monitoring can be achieved, significantly enhancing the efficiency of safety monitoring on construction sites and accurately identifying and assessing potential safety risks. Existing research focuses on Personal Protective Equipment (PPE) detection, hazardous areas, and unsafe human behaviors [5]. For instance, the Faster Region-based Convolutional Neural Network (Faster R-CNN) object detection algorithm is employed to identify workers not wearing helmets on construction sites [6]. In another study, You Only Look Once version 3 (YOLOv3) is used to detect collision risks between excavators, wheel loaders, and workers [7]. Additionally, a small object detection system for construction sites based on YOLOv5 has been developed, enabling multi-scale detection from small workers to large equipment [8]. Moreover, the Mask R-CNN segmentation algorithm is utilized to differentiate in detail between personnel and structural supports in images, identifying workers crossing structural supports illegally [9]. The Mask Transformer is applied to segment and provide risk information about three types of hazardous work zones—those with pits, non-hardened areas, and waterlogged zones with poor subgrade bearing capacity—in order to prevent crane tipping accidents [10]. Most current research focuses on single tasks within construction safety management.
A single task based on computer vision has limitations, whereas multi-task visual methods can comprehensively capture the complexity and dynamic changes of construction sites. Some studies have attempted to use multiple tasks in their research. For instance, the YOLOv5 object detection algorithm and the OpenPose pose estimation algorithm are integrated to identify workers who are not properly wearing safety harnesses, thus preventing fall accidents from heights [11]. The YOLOv4 object detection algorithm and a Siamese Network are combined for object tracking, continuously tracking the same trajectory to avoid collisions with excavator equipment [12]. However, most current research is still limited to a few tasks. Single-task or limited-task approaches can only achieve partial monitoring of construction safety, failing to provide comprehensive oversight of the construction site. To address this shortcoming, multi-task visual methods are proposed. These methods integrate various computer vision tasks such as object detection, segmentation, pose estimation, and tracking into a unified framework. By leveraging the strengths of each task, multi-task approaches can provide real-time, multi-faceted critical information.
This paper employs the high-precision, high-speed YOLOv8 for multi-task research, effectively achieving intelligent monitoring of construction sites. It not only performs detection and segmentation of construction elements and person and excavator pose estimation but also tracks the intelligently perceived relevant information. Furthermore, by analyzing the multi-faceted intelligently perceived information, it can identify unsafe behaviors and conditions on construction sites, significantly improving the efficiency and effectiveness of construction safety management. Multi-task visual methods can provide real-time, multi-faceted critical information, enabling in-depth analysis of perceived information to prevent accidents and enhance overall operational safety.

2. Related Works

In traditional construction site management, safety management is a labor-intensive task that requires safety managers to conduct on-site inspections to ensure construction safety. However, manual safety management suffers from low efficiency and poor effectiveness. With the advancement of computer vision technology, its application in construction safety has become increasingly prevalent. Techniques such as object detection, object segmentation, pose estimation, and object tracking enable automatic identification of personnel, machinery, and environmental elements in site images or surveillance videos, thereby facilitating the analysis and resolution of safety issues.

2.1. Research on Object Detection in Construction Safety

Currently, object detection technology is increasingly being applied in construction safety management. By identifying and locating personnel, materials, and construction machinery on site, it effectively prevents potential safety hazards. In recent years, deep learning algorithms such as YOLO [13,14,15,16] and Faster R-CNN [17,18,19] have been widely adopted in construction safety management, enhancing the capability of safety monitoring on construction sites. Tang et al. [20] improved the accuracy of safety compliance inspections by training object detection models to detect interactions between workers and tools on construction sites. Alateeq et al. [13] developed a model that combines computer vision and deep learning methods to quickly and accurately identify workers, PPE usage, and construction machinery on site, thus identifying potential hazards. Fang et al. [6] proposed an automatic detection method based on convolutional neural networks to detect materials, workers, and safety equipment on construction sites, effectively improving safety and productivity. To promote research on object detection in construction, many scholars have provided datasets [20,21] for model training and testing. However, most current research is conducted on proprietary or small-scale datasets [6,18] and has not been validated in actual construction sites [17,22,23,24,25]. Additionally, many systems and methods rely on large amounts of training data [26,27], which may lead to data insufficiency issues [20].

2.2. Research on Object Segmentation in Construction Safety

Object segmentation was initially used in construction management for structural health damage detection. Over time, its application has expanded to safety management on construction sites, aiding in the identification of workers, various materials, and machinery, thereby providing a comprehensive understanding of the scene for enhanced safety management. Numerous scholars have researched the application of object segmentation in construction safety management. Wang et al. [28] developed a system that combines RGB cameras and depth cameras for the visual understanding of construction sites and created a semantic segmentation dataset. Wei et al. [29] compiled an annotated image dataset containing over 150 construction scenes to effectively support safety management tasks. Hu and Al Shafian [30] proposed a method for segmenting moving objects on construction sites to enhance site safety. Wu et al. [31] introduced a framework combining computer vision and ontology techniques to infer hazardous information from visual segmentation data, thereby preventing accidents. Seo et al. [32] reviewed methods for extracting safety information from image data. However, current research still faces several challenges. There is a lack of high-quality segmentation datasets, and existing datasets often suffer from limited scene diversity [28,29]. Additionally, segmentation techniques require significant computational resources [33], which may lead to increased costs [34].

2.3. Research on Pose Estimation in Construction Safety

In construction site management, the behavior of workers and the motion of construction machinery are crucial pieces of information. Pose data can provide insights into workers’ health [35,36] and the actions of construction equipment [37,38,39], making pose estimation vital for construction safety management. Regarding construction machinery, Zhao et al. [37] improved the accuracy of pose estimation for construction equipment by using modified AlphaPose and YOLOv5-FastPose models. Feng et al. [38] proposed a theoretical framework for a camera marker network, which can estimate the poses of complex construction equipment, reducing estimation uncertainty. Liang et al. [40] utilized deep convolutional networks to achieve real-time 2D and 3D pose estimation for construction robots without the need for additional markers or sensors. Luo et al. [39] predicted the poses of construction equipment using neural networks, based on historical monitoring data, to prevent safety hazards. In terms of worker pose estimation, Ray et al. [35] employed deep learning and visual techniques to monitor workers’ poses in real time, assessing ergonomic compliance to reduce the risk of injury. Paudel et al. [41] integrated CMU OpenPose and ergonomic assessment tools to automatically determine whether or not workers’ poses present a risk, thereby creating a safer working environment. Pose estimation is widely applied in construction safety management, and real-time pose analysis can significantly enhance site safety. However, current research still faces several challenges, including the scarcity of datasets [37], poor performance in complex environments [38] and under occlusion conditions [40], and limited generalization capability of models [39].

2.4. Research on Object Tracking in Construction Safety

The real-time positions of workers and the movements of construction machinery significantly impact safety management on construction sites. Son et al. [12] developed a real-time monitoring system for construction workers by combining YOLO with Siamese networks, aiming to ensure the safe operation of construction machinery. Park et al. [42] proposed a method that integrates detection and tracking, enabling continuous worker localization through video surveillance and improving issues related to occlusion and identity switching. Brilakis et al. [43] introduced a vision-based tracking framework that employs a multi-camera system for the automated tracking of workers and equipment on site. Zhu et al. [44] proposed a new framework that integrates visual tracking with worker and equipment detection, enhancing the detection recall rate. Angah et al. [45] combined Mask R-CNN with gradient methods to improve the accuracy of multi-worker tracking through re-matching, achieving high multi-object tracking accuracy. Zhu et al. [46] addressed occlusion issues on construction sites using particle filters, enabling the effective tracking of workers and construction machinery and improving tracking accuracy. However, due to the complex and volatile nature of construction site environments, current research still faces challenges related to occlusion and identification accuracy [12]. Moreover, even for completed studies, the applicability and efficiency of systems in complex construction environments require further validation [46,47].
Overall, the advent of computer vision technologies, including object detection, segmentation, pose estimation, and tracking, has provided a transformative approach to enhancing construction safety management. However, existing research predominantly focuses on a single or a limited number of tasks, restricting a comprehensive understanding of the complex and dynamic construction site environment. This study utilizes the high-precision, high-speed YOLOv8 model to propose a computer vision-based multi-task intelligent monitoring system. Our approach leverages detection, segmentation, and pose estimation to provide a more comprehensive understanding of construction sites. The capability to track intelligently perceived information further enhances the system’s ability to analyze, identify, and monitor unsafe behaviors and conditions in real-time.

3. Methodology

This study employs computer vision technology to facilitate multi-task intelligent monitoring of construction safety, effectively identifying unsafe behaviors and conditions on site. As illustrated in Figure 1, the process initiates with the collection and processing of multi-source image data. Subsequently, it applies convolutional neural network (CNN)-based computer vision techniques to detect and segment critical construction elements and to estimate the poses of humans and excavators. Finally, this paper integrates the tracking algorithm to continuously monitor and analyze the outcomes of detection, segmentation, and pose estimation efforts.

3.1. Data Collection and Preprocessing

The data sources for this paper were collected through a literature review, video surveillance capture, and virtual data generated with Unity for excavator pose estimation. The datasets obtained are divided into three main categories: object detection, segmentation, and pose estimation. The detection and segmentation datasets utilize open-source datasets, namely the SODA and MOCS datasets, while the pose estimation datasets include open-source data, composite data, captured data, and virtual data. The SODA dataset was chosen for object detection due to its extensive variety of categories, including person, PPE, and various materials, which broadly represent the types of objects encountered on construction sites. The MOCS dataset was selected for segmentation because it is an existing, well-annotated dataset specifically designed for object segmentation, making it readily available and suitable for training and evaluating segmentation models. The data processing phase primarily involves the standardization of annotation files to conform to the YOLO format.
The SODA dataset is an object detection dataset with a total of 19,846 images. The categories and their quantities are displayed in Figure 2. The annotation files are in multiple XML formats, which are processed by iterating through the content of each XML file. This data is then converted into multiple TXT files in YOLO format, containing class numbers and normalized bounding box information, including center coordinates, width, and height.
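To illustrate this conversion step, the sketch below rewrites Pascal VOC-style XML annotations as YOLO-format TXT files. It is a minimal example rather than the script used in this study; the category list and file paths are placeholder assumptions.

```python
# Minimal sketch: convert VOC-style XML annotations to YOLO TXT labels.
import os
import xml.etree.ElementTree as ET

CLASSES = ["person", "helmet", "vest", "fence"]  # placeholder category list, not the full SODA taxonomy

def voc_to_yolo(xml_path, out_dir):
    root = ET.parse(xml_path).getroot()
    w = float(root.find("size/width").text)
    h = float(root.find("size/height").text)
    lines = []
    for obj in root.iter("object"):
        cls_id = CLASSES.index(obj.find("name").text)
        box = obj.find("bndbox")
        x1, y1 = float(box.find("xmin").text), float(box.find("ymin").text)
        x2, y2 = float(box.find("xmax").text), float(box.find("ymax").text)
        # YOLO format: class x_center y_center width height, all normalized to [0, 1]
        cx, cy = (x1 + x2) / 2 / w, (y1 + y2) / 2 / h
        bw, bh = (x2 - x1) / w, (y2 - y1) / h
        lines.append(f"{cls_id} {cx:.6f} {cy:.6f} {bw:.6f} {bh:.6f}")
    name = os.path.splitext(os.path.basename(xml_path))[0] + ".txt"
    with open(os.path.join(out_dir, name), "w") as f:
        f.write("\n".join(lines))
```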
The MOCS dataset is a segmentation dataset consisting of 23,404 images. The categories and their quantities are shown in Figure 3. The annotation files are in JSON format, including training, validation, and test subdivisions. These files are converted into multiple TXT files in YOLO format, containing the class numbers and normalized coordinates of the mask points.
The open-source dataset provides an excavator pose dataset consisting of 1280 cropped images, as shown in Figure 4. The annotation files are in CSV format, including information about six key points, but they lack bounding box information for the excavators. Therefore, this paper adds bounding box information corresponding to the image dimensions and converts the CSV annotation files into multiple TXT files in YOLO format. These files contain class numbers, normalized bounding box information, coordinates of six key points, and their visibility information.
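The following sketch shows how a full-image bounding box can be prepended and a CSV row of key points rewritten in YOLO pose format. The CSV column names, visibility default, and image size are assumptions for illustration, not the exact layout of the open-source annotation files.

```python
# Hedged sketch: CSV key point rows -> YOLO pose label lines with an added full-image box.
import csv

def csv_row_to_yolo_pose(row, img_w, img_h, n_kpts=6):
    """row: a dict from csv.DictReader, with assumed columns x1..x6, y1..y6 (and optional v1..v6)."""
    # Each cropped image contains a single excavator, so the added bounding box
    # spans the whole image: center (0.5, 0.5), width 1.0, height 1.0.
    fields = ["0", "0.5", "0.5", "1.0", "1.0"]
    for i in range(1, n_kpts + 1):
        x = float(row[f"x{i}"]) / img_w           # normalize key point coordinates
        y = float(row[f"y{i}"]) / img_h
        v = row.get(f"v{i}", "2")                 # visibility flag (2 = labeled and visible)
        fields += [f"{x:.6f}", f"{y:.6f}", str(v)]
    return " ".join(fields)

with open("excavator_keypoints.csv") as f:        # placeholder file name and image size
    for row in csv.DictReader(f):
        print(csv_row_to_yolo_pose(row, img_w=512, img_h=512))
```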
The cropped images exhibit clean backgrounds, which may lead to issues with generalization. To address this, the study enhances these images by adding more complex backgrounds, thereby creating a library of background images. By superimposing the cropped original images onto these diverse backgrounds, composite images are generated, as shown in Figure 5. Concurrently, the annotation files are updated to reflect the new positions of the cropped original images within these background scenes.
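A minimal compositing sketch is given below, assuming PIL images and the YOLO pose label layout described above; the random placement policy and file handling are illustrative, not the authors' exact procedure.

```python
# Hedged sketch: paste a cropped excavator onto a background and shift its YOLO pose label.
import random
from PIL import Image

def composite_and_shift(fg_path, bg_path, label_line):
    # Assumes the background image is larger than the cropped excavator image.
    fg, bg = Image.open(fg_path), Image.open(bg_path).copy()
    dx = random.randint(0, bg.width - fg.width)    # random placement within the background
    dy = random.randint(0, bg.height - fg.height)
    bg.paste(fg, (dx, dy))

    vals = label_line.split()
    cls = vals[0]
    cx, cy, w, h = map(float, vals[1:5])
    # Re-normalize the bounding box from foreground to background coordinates.
    new = [cls,
           f"{(cx * fg.width + dx) / bg.width:.6f}", f"{(cy * fg.height + dy) / bg.height:.6f}",
           f"{w * fg.width / bg.width:.6f}", f"{h * fg.height / bg.height:.6f}"]
    rest = vals[5:]
    for i in range(0, len(rest), 3):               # key point triples: x, y, visibility
        kx, ky, v = float(rest[i]), float(rest[i + 1]), rest[i + 2]
        new += [f"{(kx * fg.width + dx) / bg.width:.6f}",
                f"{(ky * fg.height + dy) / bg.height:.6f}", v]
    return bg, " ".join(new)
```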
This paper also involves the creation of a proprietary dataset by photographing excavator models and generating virtual images of excavator poses using Unity, as depicted in Figure 6. The dataset consists of 7914 images. The annotation files are in YOLO format with multiple TXT files, which include class numbers, normalized bounding box information, and coordinates for four key points without visibility information.
Additionally, this paper integrates four types of data, including open-source cropped images [48], composite images, captured images, and virtual images, to form the Integrated Excavator Pose (IEP) dataset. The integration process primarily addressed inconsistencies in key points annotation and visibility labeling. The open-source cropped and composite images include annotations for six key points with visibility information, as shown in Figure 7a. The captured and virtual images have been uniformly annotated with four key points without visibility information, as depicted in Figure 7b. For integration purposes, annotations are standardized to the format shown in Figure 7b by removing the last two key points and their visibility information from the cropped and composite images. Consequently, the IEP dataset’s annotation files contain class numbers, normalized bounding box information, and coordinates for four key points.
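The harmonization of the six-key-point labels into the four-key-point IEP format can be sketched as follows; the label layout matches the YOLO pose format above, which is an assumption about the exact annotation files.

```python
# Hedged sketch: reduce a 6-key-point label with visibility flags to the 4-key-point IEP format.
def to_iep_format(label_line, keep_kpts=4):
    vals = label_line.split()
    out = vals[:5]                       # class id + normalized bounding box
    kpts = vals[5:]                      # triples of x, y, visibility
    for i in range(keep_kpts):
        out += [kpts[3 * i], kpts[3 * i + 1]]   # keep coordinates, drop visibility and the last two points
    return " ".join(out)
```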

3.2. YOLOv8 Model Framework

There are numerous existing technologies, but to achieve high precision and speed, we selected the lightweight YOLOv8. The architecture of the model, as shown in Figure 1, primarily includes the backbone for extracting image features, the neck for feature processing and fusion, and the head for outputting results to perform various tasks.
The backbone component primarily consists of CBS and C2f modules, with the final layer being Spatial Pyramid Pooling Fast (SPPF). CBS includes convolution, Batch Normalization (BN), and the SiLU activation function. The C2f module represents a significant improvement over YOLOv5, as illustrated in Figure 8. C2f features more connections across different layers, enabling the concatenation of more features and thereby capturing richer feature information. SPPF, shown in Figure 9, is an enhancement of the SPP network, reducing parameters and improving computation speed while extracting and fusing multi-scale features. The neck component integrates the feature maps output by the backbone, concatenating feature maps at three different scales through upsampling. After passing through the C2f module, the first feature map is produced. Subsequent processing through additional C2f and CBS modules results in the second and third feature maps, as shown in Figure 1. The head component processes the three feature maps from the neck. Unlike YOLOv5, it utilizes a decoupled head approach, where the feature maps are passed through CBS modules and convolutional layers to generate feature maps for predicting classes and bounding boxes, as illustrated in Figure 1.
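As an illustration of the building blocks named above, the PyTorch sketch below implements a CBS block and the SPPF layer; channel sizes and kernel choices follow common YOLOv8 conventions and are assumptions rather than the exact configuration used here.

```python
# Illustrative PyTorch sketch of the CBS and SPPF modules described in the text.
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Convolution + Batch Normalization + SiLU activation."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class SPPF(nn.Module):
    """Spatial Pyramid Pooling Fast: three chained max-poolings whose outputs are concatenated."""
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = CBS(c_in, c_hidden, k=1)
        self.cv2 = CBS(c_hidden * 4, c_out, k=1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))
```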

3.3. Multi-Task Intelligent Monitoring

This study explores the application of various downstream tasks using YOLOv8 in construction safety, including object detection, segmentation, pose estimation, and tracking. These tasks are implemented by modifying the head component of the YOLOv8 model architecture, as illustrated in Figure 1.

3.3.1. Object Detection

The YOLOv8 model framework is inherently designed for object detection tasks. For object detection, the head component of YOLOv8 is used directly. The output from the head component consists of three sets of class (cls) and bounding box (box) information, which are then combined to produce two vectors that predict the location and category of objects. In the context of construction safety, this enables the detection of workers, their PPE, and large machinery, along with their positions in the image. Such capabilities are crucial for analyzing and identifying unsafe behaviors on construction sites.
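A hedged usage sketch with the Ultralytics YOLOv8 API is shown below; the weight file and image path are placeholders, and in practice a checkpoint trained on the SODA classes would be substituted.

```python
# Hedged sketch: run YOLOv8 detection on a site image and read boxes, classes, and scores.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                 # placeholder; replace with a SODA-trained checkpoint
results = model("site_frame.jpg", conf=0.25)
for r in results:
    for box, cls, conf in zip(r.boxes.xyxy, r.boxes.cls, r.boxes.conf):
        # Each detection: pixel-space corners, class index, and confidence score.
        print(int(cls), float(conf), [round(float(v), 1) for v in box])
```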

3.3.2. Object Segmentation

The task of object segmentation is accomplished by augmenting the head component of YOLOv8, as depicted in Figure 1, with mask coefficients and mask feature maps. This approach is based on YOLACT [49], an instance segmentation method for single-stage detection models. In addition to class (cls) and bounding box (box) branches, a mask coefficient branch is added to the three scales of feature maps. A prototype for segmentation is also added at the largest scale. In construction safety applications, this task allows for detailed segmentation of workers, their PPE, and large machinery, along with their positions in the image, enabling more precise analysis and identification of unsafe behaviors on construction sites.
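The YOLACT-style mask assembly can be summarized by the following sketch, in which per-instance mask coefficients linearly combine shared prototype maps; the tensor shapes are illustrative assumptions.

```python
# Illustrative sketch of YOLACT-style mask assembly from prototypes and coefficients.
import torch

def assemble_masks(prototypes, coefficients):
    """
    prototypes:   (k, H, W)  prototype feature maps from the largest-scale branch
    coefficients: (n, k)     one k-dimensional coefficient vector per detected instance
    returns:      (n, H, W)  instance masks with values in [0, 1]
    """
    k, h, w = prototypes.shape
    masks = coefficients @ prototypes.reshape(k, h * w)   # linear combination of prototypes
    return torch.sigmoid(masks).reshape(-1, h, w)         # squash to mask probabilities
```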

3.3.3. Object Pose Estimation

The pose estimation task draws from the design principles of YOLO-pose [50] and involves adding the key points branch to the head component, as depicted in Figure 1. This branch is added at all three scales, like the class (cls) and bounding box (box) branches, and is output through two CBS modules and a convolutional layer. In the context of construction safety, this task facilitates the pose estimation of workers and large machinery, enabling the analysis and identification of unsafe postures and behaviors on construction sites.
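A hedged sketch of reading key points from a trained YOLOv8 pose model via the Ultralytics API follows; the checkpoint name stands in for the excavator pose model trained later in this study.

```python
# Hedged sketch: run a YOLOv8 pose model and print per-instance key point coordinates.
from ultralytics import YOLO

pose_model = YOLO("excavator_pose.pt")       # assumed checkpoint name for the 4-key-point excavator model
results = pose_model("excavation_area.jpg")  # placeholder image path
for r in results:
    if r.keypoints is None:
        continue
    for kpts in r.keypoints.xy:              # (num_keypoints, 2) pixel coordinates per instance
        print([(round(float(x), 1), round(float(y), 1)) for x, y in kpts])
```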

3.3.4. Object Tracking

The object tracking task involves detecting and tracking multiple objects in video or image sequences, assigning a unique ID to each object. This paper integrates the ByteTrack [51] and BoT-SORT [52] tracking algorithms, enabling the tracking of objects that have already undergone detection, segmentation, and pose estimation, as illustrated in Figure 1. ByteTrack utilizes the similarity between detection boxes and tracking trajectories, retaining high-confidence detections to accurately identify true objects, thereby reducing missed detections and improving trajectory coherence. BoT-SORT builds on ByteTrack with three further improvements: camera motion compensation, a refined Kalman filter state, and the fusion of appearance (re-identification) cues into the association step. In the context of construction safety, this task enables the tracking of workers and heavy machinery and the poses of workers and excavators, facilitating continuous monitoring and identification of unsafe behaviors on construction sites.
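Tracking on top of a trained model can be invoked as sketched below with the Ultralytics API; 'bytetrack.yaml' and 'botsort.yaml' are the tracker configurations bundled with the library, and the video path and weights are placeholders.

```python
# Hedged sketch: track objects across a video with ByteTrack (or BoT-SORT) on top of YOLOv8.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                   # placeholder; any of the trained task models could be used
results = model.track("site_video.mp4", tracker="bytetrack.yaml",  # or "botsort.yaml"
                      persist=True, stream=True)
for r in results:
    if r.boxes.id is None:                   # no confirmed tracks in this frame
        continue
    for track_id, box in zip(r.boxes.id, r.boxes.xyxy):
        print(int(track_id), [round(float(v), 1) for v in box])
```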

3.4. Model Evaluation Metrics

There are several evaluation metrics in models based on computer vision. IoU (Intersection over Union) represents the ratio of the intersection to the union of two areas, as shown in Equation (1). For example, it is the ratio of the intersection of the true bounding box and the predicted bounding box to their union. Precision is the ratio of true positives (TP) to the sum of true positives and false positives (TP + FP), as depicted in Equation (2), which indicates the proportion of correctly predicted objects among all predicted objects. Recall is the ratio of true positives (TP) to the sum of true positives and false negatives (TP + FN), as shown in Equation (3), representing the proportion of correctly predicted objects among all actual objects. AP (Average Precision) is calculated by averaging the maximum precision values at 11 different recall levels (0, 0.1, 0.2, …, 1), as described in Equation (4). The mAP (mean Average Precision) is the average of AP across all classes, where N_cls is the number of all classes, as shown in Equation (5). The F1-score finds a balance between Precision and Recall, as shown in Equation (6). Precision and recall are traditional evaluation metrics but are not suitable for multi-object evaluation. Therefore, this paper uses AP, mAP, and F1-score, which consider both precision and recall, as the evaluation metrics for training detection, segmentation, and pose estimation models.

$$\mathrm{IoU} = \frac{A \cap B}{A \cup B} \tag{1}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{2}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{3}$$

$$AP = \frac{1}{11} \sum_{r \in \{0,\, 0.1,\, \ldots,\, 1\}} \max_{\tilde{r}:\, \tilde{r} \ge r} p(\tilde{r}) \tag{4}$$

$$mAP = \frac{1}{N_{cls}} \sum_{i} AP_i \tag{5}$$

$$F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{6}$$
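For reference, a minimal implementation of Equations (1)–(6) is sketched below, assuming axis-aligned boxes in (x1, y1, x2, y2) form; it is illustrative and not the evaluation code used for the experiments.

```python
# Minimal reference implementation of IoU, precision, recall, 11-point AP, mAP, and F1.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def f1_score(precision, recall):
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def ap_11_point(recalls, precisions):
    # 11-point interpolation: best precision at recall >= r for r = 0, 0.1, ..., 1, then averaged.
    return sum(
        max((p for rec, p in zip(recalls, precisions) if rec >= r / 10), default=0.0)
        for r in range(11)
    ) / 11

def mean_ap(ap_per_class):
    return sum(ap_per_class) / len(ap_per_class) if ap_per_class else 0.0
```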

4. Experiments

4.1. Experimental Configuration

The experiments in this paper are conducted using four NVIDIA GeForce RTX 3090 GPUs. Four sets of experiments are performed: detection, segmentation, pose, and tracking. These are detailed as follows. The first three experiments involve training the models and then evaluating their performance. Each dataset is split into training (80%), validation (10%), and test (10%) sets. The datasets used for different tasks and their corresponding quantities are shown in Table 1. The batch size is set to 16, and the models are trained for a total of 100 epochs. Finally, based on the three trained models, the tracking algorithm is implemented and validated through practical applications.
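The training setup described above (batch size 16, 100 epochs, four GPUs) can be expressed with the Ultralytics API roughly as follows; the dataset YAML path is a placeholder for the converted datasets of Section 3.1.

```python
# Hedged sketch of the training configuration; not the exact command used in the experiments.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")      # detection; the -seg / -pose variants would be used for the other tasks
model.train(
    data="soda.yaml",           # assumed dataset configuration file for the converted labels
    epochs=100,
    batch=16,
    device=[0, 1, 2, 3],        # four RTX 3090 GPUs
)
metrics = model.val()           # evaluate on the held-out split
```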

4.2. Detection Experiment

The detection dataset used is SODA, as shown in Figure 2. The training process is illustrated in Figure 10, where it can be observed that the loss consistently decreases during both the training and validation phases, while precision and recall steadily increase, indicating the effectiveness of the trained detection model. The precision–recall and F1 curves are shown in Figure 11, with an mAP of 0.732 at an IoU of 0.5 and an F1-score of 0.72 at a confidence level of 0.326 for all categories. However, the mAP values for the ‘helmet’ and ‘fence’ categories are only 0.517 and 0.551, respectively. Overall, the mAP and F1-score for the SODA dataset are not very high, which could be attributed to the complexity and similarity of the categories. For example, the ‘fence’ category has a large bounding box range, making it difficult to focus on specific features, and the features of the ‘helmet’ and ‘electric box’ categories are similar, making precise differentiation challenging.
The normalized confusion matrix is shown in Figure 12. The detection results do not show confusion with other similar categories; the low precision is due to the high background detection rate and high miss rate. The detection results are presented in Figure 13. From Figure 13a, it can be seen that the ground truth in the fourth image is incorrect, while our detection results are accurate and more complete. Thus, the low precision of our detection model may be due to the poor quality of the dataset, but the results demonstrate effective detection of certain construction elements on site. The detection speed is 17.6 ms per image, including 3.6 ms for preprocessing, 12.7 ms for inference, and 2.0 ms for postprocessing per image. Further analysis of detected workers and the IoU of detected helmet and vest bounding boxes can identify unsafe behaviors.
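As a hedged illustration of this rule, the snippet below flags workers whose detected boxes do not overlap any helmet or vest box; the overlap threshold is an assumption and would need tuning on real footage.

```python
# Hedged sketch: rule-based PPE check using box overlap (IoU computed as in Section 3.4).
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def flag_missing_ppe(worker_boxes, helmet_boxes, vest_boxes, thr=0.1):
    """Return indices of worker boxes without a sufficiently overlapping helmet and vest box."""
    def covered(worker, items):
        return any(iou(worker, item) >= thr for item in items)
    return [i for i, w in enumerate(worker_boxes)
            if not (covered(w, helmet_boxes) and covered(w, vest_boxes))]
```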

4.3. Segmentation Experiment

The segmentation dataset used is MOCS, as shown in Figure 3. The training process is illustrated in Figure 14, where it can be observed that although there are some fluctuations, the overall loss consistently decreases during both the training and validation phases. Similarly, despite some variability, the precision and recall metrics show a general upward trend, indicating the effectiveness of the trained segmentation model. The precision–recall and F1 curves for the box and mask are shown in Figure 15. At an IoU of 0.5, the mAP for all categories is 0.637 for box and 0.615 for mask, with many large machinery categories showing low mAP values. The F1 scores for all categories are 0.65 for box and 0.64 for mask. Overall, the mAP and F1-scores for the MOCS dataset are not high. The normalized confusion matrix in Figure 16 shows that there is some category confusion in the detection results, though it is minimal. However, many categories are misidentified as background, indicating a high miss rate.
The segmentation results are presented in Figure 17. It can be seen that the predictions are generally correct, except for some instances of missed detection. For example, in the third detection result in Figure 17, the ground truth is “crane”, while the prediction is “pump truck”, likely due to the similarity in their features. Although the segmentation model’s precision is not high, the detection effectiveness is still reasonable. The segmentation speed is 21.7 ms per image, including 3.3 ms for preprocessing, 16.0 ms for inference, and 2.4 ms for postprocessing per image. While the inference speed for segmentation is slightly slower than for detection, it remains very fast. Further analysis involves evaluating the IoU of masks for workers, helmets, vests, and medium to large machinery to determine whether there is overlap, which helps in identifying unsafe behaviors or potential collisions between workers and machinery.

4.4. Pose Experiment

The pose estimation tasks primarily include person and excavator poses. Person pose estimation utilizes pre-trained models from open-source datasets, while excavator pose estimation is trained on open-source, composite, captured, and virtual datasets. The experiments on these different categories of datasets are conducted separately. In addition, an experiment on an integrated dataset combining all four categories is also conducted.
The precision–recall and F1 curves for the box and pose on the open-source dataset are shown in Figure 18. At an IoU threshold of 0.5, the mAP for all categories is 0.995 for both box and pose. The F1 scores for all categories are 1.00 for box and 1.00 for pose. The detection results within the open-source dataset are illustrated in Figure 19, including six key points. Although there are slight deviations in key point predictions, the overall performance is satisfactory. To validate the model’s performance across different data types, the trained model is applied in practical scenarios. As shown in Figure 20a, the results are notably poor, possibly due to the characteristics of the cropped images and the key points’ positions within them. Given the limited backgrounds in the open-source dataset, composite images are generated by adding backgrounds to the original dataset. It is found that the training results maintain the same mAP as the open-source dataset, as shown in Figure 18. The detection results on the composite dataset are displayed in Figure 19, indicating that this approach addressed the limitations of using cropped excavator images. Applying this model in real-world scenarios, as shown in Figure 20b, yielded mixed results, with accurate key point recognition in some images but poor performance in many others.
The captured and virtual datasets, which included only the excavator model, also maintained consistent mAP results, as shown in Figure 18, when trained. The detection results on the virtual dataset, illustrated in Figure 19, included four key points and demonstrated good performance in both bounding box detection and key point prediction. When applied in real-world scenarios, as shown in Figure 20c, the model effectively identified the excavator’s bounding box and key points, though some key point deviations were observed. The detection speed is 20.4 ms per image, including 3.2 ms for preprocessing, 15.1 ms for inference, and 2.1 ms for postprocessing per image. The inference speed for pose estimation is slightly slower than for detection but faster than for segmentation.
To address the limitations of open-source datasets, which are confined to cropped excavator regions, and the detection inaccuracies in composite and virtual datasets, this paper integrates these datasets into the Integrated Excavator Pose (IEP) dataset. The IEP dataset includes excavator bounding boxes and four key points. The training results are consistent with the mAP of the open-source dataset shown in Figure 18. When the trained model is applied to real-world scenarios, as illustrated in Figure 21, it achieves excellent results with more accurate detection of bounding boxes and key points, effectively resolving the issues observed with individual datasets.

4.5. Tracking Experiment

This paper integrates two tracking algorithms to intelligently track information perceived through detection, segmentation, and pose estimation. The method is applied to real-world scenarios, with the actual tracking results for detection and segmentation shown in Figure 22a; pose estimation results in Figure 22b; and the combined tracking results of detection, segmentation, and pose estimation illustrated in Figure 22c. The effectiveness of the model in performing intelligent perception of detection, segmentation, and pose estimation is evident from these figures. This method facilitates further tracking and analysis of unsafe behaviors in construction.
Mismatched IDs can be observed in the tracking results for detection, segmentation, and pose estimation, as shown in Figure 23. To address this issue, the IoU between the bounding boxes of detection and segmentation and the bounding boxes of pose estimation is calculated. Different IDs with an IoU exceeding 90% are replaced with the smallest ID to achieve unified ID detection. Additionally, when the video footage is shaky or the scene transitions, the tracked IDs become inconsistent, with the same object being assigned multiple IDs, as depicted in Figure 24.
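The ID-unification rule described above can be sketched as follows: any pair of boxes from the pose track and the detection/segmentation track whose IoU exceeds 0.9 is assigned the smaller of the two IDs. The data structures are illustrative assumptions.

```python
# Hedged sketch: unify track IDs across tasks via high-IoU box matching within one frame.
def unify_ids(det_tracks, pose_tracks, thr=0.9):
    """det_tracks / pose_tracks: dicts mapping track ID -> (x1, y1, x2, y2) box for one frame."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
        return inter / union if union > 0 else 0.0

    id_map = {}
    for det_id, det_box in det_tracks.items():
        for pose_id, pose_box in pose_tracks.items():
            if iou(det_box, pose_box) > thr:
                unified = min(det_id, pose_id)
                id_map[det_id] = unified      # apply this mapping when rendering or logging tracks
                id_map[pose_id] = unified
    return id_map
```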

5. Discussion

This paper applies computer vision technology to multi-task intelligent monitoring of construction safety, enabling the intelligent extraction of detection, segmentation, and pose information of construction elements. Furthermore, this information can be combined with tracking algorithms for continuous monitoring. By using YOLOv8, the system achieves real-time, effective monitoring while maintaining accuracy and improving detection speed. Among the tasks, detection inference is the fastest, followed by pose estimation, with segmentation being the slowest; however, all tasks are performed quickly overall.
However, there are several limitations in the current approach that need to be addressed. Firstly, the accuracy of the model for detection and segmentation is relatively low and requires further improvement. This limitation affects the reliability of safety monitoring and could potentially lead to missed detections or false alarms. Secondly, in the presence of camera shakes, transitions, or occlusions, the tracking ID of the same object may change, leading to inconsistencies in continuous monitoring. This issue necessitates further calibration and the development of more robust tracking algorithms. To address these limitations, future research can consider the following aspects:
Improving precision: Future work should focus on enhancing the precision of detection and segmentation models. This could involve refining the training process, utilizing more diverse and comprehensive datasets, and incorporating advanced techniques to reduce false positives and false negatives.
Maintaining continuous tracking IDs: Ensuring the continuity of tracking IDs for the same object, especially in scenarios involving camera shakes, transitions, and occlusions, is crucial. Future research could explore robust tracking algorithms and techniques to maintain consistent tracking IDs across different frames and scenes.
Further exploration of intelligent perception information: The intelligent perception information extracted from the model can be further utilized for deeper analysis. Future studies could investigate advanced analytics, machine learning algorithms, and integration with other data sources to derive more meaningful insights and improve overall construction site safety monitoring.
While the primary focus of this study is on enhancing construction site safety through advanced computer vision techniques, it is essential to address potential ethical concerns related to worker rights and surveillance. For example, there is a risk that surveillance data could be misused to unfairly blame workers for accidents and deny them proper compensation. Additionally, blanket surveillance may impinge on workers’ autonomy and their sense of control. To mitigate these concerns, future research should consider integrating real-time feedback mechanisms into the system. This would help workers immediately understand and correct dangerous behaviors, thereby reducing the likelihood of unfairly blaming workers and enhancing overall safety.

6. Conclusions

Current research on intelligent construction safety monitoring based on computer vision is typically limited to single tasks. However, real-world construction sites require consideration of multiple aspects simultaneously. This paper proposes a multi-task intelligent monitoring method for construction safety based on computer vision. The primary framework of this study is based on the YOLOv8 model, including the backbone, neck, and head components. Different tasks, such as detection, segmentation, and pose estimation, are achieved by modifying the head part of the model. This approach not only enables effective detection and segmentation of construction elements and estimation of human and excavator poses but also facilitates tracking in video surveillance based on these outputs. Experiments conducted on multiple datasets validate the effectiveness of the model. Notably, this paper constructs an integrated excavator pose dataset by combining open-source, composite, captured, and virtual datasets. The experiments demonstrate that this integrated dataset can be more effectively applied to actual construction sites. However, the accuracy of detection and segmentation still needs improvement, and the continuity of tracking IDs remains an issue that requires further resolution.

Author Contributions

Conceptualization, L.L. and R.C.; methodology, R.C.; software, Z.L.; validation, L.L. and Y.Z.; formal analysis, R.C.; investigation, L.L. and Z.G.; resources, Z.G. and G.W.; data curation, Z.L., R.Y. and Y.Z.; writing—original draft preparation, R.C.; writing—review and editing, X.H. and G.W.; visualization, R.Y.; supervision, X.H.; project administration, L.L.; funding acquisition, L.L., R.C. and G.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China [Grant No. 52308319], the Natural Science Foundation of Guangdong Province, China [Grant No. 2023A1515011119], and the Shenzhen Municipal Natural Science Foundation [Grant No. JCYJ20220531101209020].

Data Availability Statement

The data used in this study are available upon reasonable request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest. Lingfeng Liu, Zhigang Guo, and Zhengxiong Liu were employed by Shenzhen Municipal Group Co., Ltd. Yaolin Zhang was employed by CCCC Property (Hainan) Co., Ltd. Ran Yang was employed by CCDC Shuyu Engineering Construction Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. General Office of the Ministry of Housing and Urban-Rural Development of the People’s Republic of China. Notice of the General Office of the Ministry of Housing and Urban-Rural Development on the Production Safety Accidents of Housing Municipal Engineering in 2019. 2020. Available online: https://www.mohurd.gov.cn/gongkai/zhengce/zhengcefilelib/202006/20200624_246031.html (accessed on 24 June 2020).
  2. Chi, C.-F.; Lin, S.-Z.; Dewi, R.S. Graphical fault tree analysis for fatal falls in the construction industry. Accid. Anal. Prev. 2014, 72, 359–369. [Google Scholar] [CrossRef] [PubMed]
  3. Jiang, Z.; Fang, D.; Zhang, M. Understanding the Causation of Construction Workers’ Unsafe Behaviors Based on System Dynamics Modeling. J. Manag. Eng. 2015, 31, 04014099. [Google Scholar] [CrossRef]
  4. Wei, R.; Love, P.E.D.; Fang, W.L.; Luo, H.B.; Xu, S.J. Recognizing people’s identity in construction sites with computer vision: A spatial and temporal attention pooling network. Adv. Eng. Inform. 2019, 42, 9. [Google Scholar] [CrossRef]
  5. Fang, W.; Love, P.E.D.; Luo, H.; Ding, L. Computer vision for behaviour-based safety in construction: A review and future directions. Adv. Eng. Inform. 2020, 43, 100980. [Google Scholar] [CrossRef]
  6. Fang, Q.; Li, H.; Luo, X.; Ding, L.; Luo, H.; Rose, T.M.; An, W. Detecting non-hardhat-use by a deep learning method from far-field surveillance videos. Autom. Constr. 2018, 85, 1–9. [Google Scholar] [CrossRef]
  7. Kim, D.; Liu, M.; Lee, S.; Kamat, V.R. Remote proximity monitoring between mobile construction resources using camera-mounted UAVs. Autom. Constr. 2019, 99, 168–182. [Google Scholar] [CrossRef]
  8. Kim, S.; Hong, S.H.; Kim, H.; Lee, M.; Hwang, S. Small object detection (SOD) system for comprehensive construction site safety monitoring. Autom. Constr. 2023, 156, 105103. [Google Scholar] [CrossRef]
  9. Fang, W.; Zhong, B.; Zhao, N.; Love, P.E.D.; Luo, H.; Xue, J.; Xu, S. A deep learning-based approach for mitigating falls from height with computer vision: Convolutional neural network. Adv. Eng. Inform. 2019, 39, 170–177. [Google Scholar] [CrossRef]
  10. Lu, Y.; Qin, W.; Zhou, C.; Liu, Z. Automated detection of dangerous work zone for crawler crane guided by UAV images via Swin Transformer. Autom. Constr. 2023, 147, 104744. [Google Scholar] [CrossRef]
  11. Li, J.; Zhao, X.; Zhou, G.; Zhang, M. Standardized use inspection of workers’ personal protective equipment based on deep learning. Saf. Sci. 2022, 150, 105689. [Google Scholar] [CrossRef]
  12. Son, H.; Kim, C. Integrated worker detection and tracking for the safe operation of construction machinery. Autom. Constr. 2021, 126, 103670. [Google Scholar] [CrossRef]
  13. Alateeq, M.M.; PP, F.R.; Ali, M.A. Construction site hazards identification using deep learning and computer vision. Sustainability 2023, 15, 2358. [Google Scholar] [CrossRef]
  14. Li, Z.; Xie, W.; Zhang, L.; Lu, S.; Xie, L.; Su, H.; Du, W.; Hou, W. Toward efficient safety helmet detection based on YoloV5 with hierarchical positive sample selection and box density filtering. IEEE Trans. Instrum. Meas. 2022, 71, 1–14. [Google Scholar] [CrossRef]
  15. Han, K.; Zeng, X. Deep learning-based workers safety helmet wearing detection on construction sites using multi-scale features. IEEE Access 2021, 10, 718–729. [Google Scholar] [CrossRef]
  16. Nath, N.D.; Behzadan, A.H.; Paal, S.G. Deep learning for site safety: Real-time detection of personal protective equipment. Autom. Constr. 2020, 112, 20. [Google Scholar] [CrossRef]
  17. Li, J.; Zhou, G.; Li, D.; Zhang, M.; Zhao, X. Recognizing workers’ construction activities on a reinforcement processing area through the position relationship of objects detected by faster R-CNN. Eng. Constr. Archit. Manag. 2023, 30, 1657–1678. [Google Scholar] [CrossRef]
  18. Lin, Z.-H.; Chen, A.Y.; Hsieh, S.-H. Temporal image analytics for abnormal construction activity identification. Autom. Constr. 2021, 124, 103572. [Google Scholar] [CrossRef]
  19. Son, H.; Choi, H.; Seong, H.; Kim, C. Detection of construction workers under varying poses and changing background in image sequences via very deep residual networks. Autom. Constr. 2019, 99, 27–38. [Google Scholar] [CrossRef]
  20. Tang, S.; Roberts, D.; Golparvar-Fard, M. Human-object interaction recognition for automatic construction site safety inspection. Autom. Constr. 2020, 120, 103356. [Google Scholar] [CrossRef]
  21. Xuehui, A.; Li, Z.; Zuguang, L.; Chengzhi, W.; Pengfei, L.; Zhiwei, L. Dataset and benchmark for detecting moving objects in construction sites. Autom. Constr. 2021, 122, 103482. [Google Scholar] [CrossRef]
  22. Fang, Q.; Li, H.; Luo, X.; Ding, L.; Rose, T.M.; An, W.; Yu, Y. A deep learning-based method for detecting non-certified work on construction sites. Adv. Eng. Inform. 2018, 35, 56–68. [Google Scholar] [CrossRef]
  23. Liu, X.; Jing, X.; Zhu, Q.; Du, W.; Wang, X. Automatic Construction Hazard Identification Integrating On-Site Scene Graphs with Information Extraction in Outfield Test. Buildings 2023, 13, 377. [Google Scholar] [CrossRef]
  24. Fang, W.; Ding, L.; Luo, H.; Love, P.E. Falls from heights: A computer vision-based approach for safety harness detection. Autom. Constr. 2018, 91, 53–61. [Google Scholar] [CrossRef]
  25. Jeelani, I.; Asadi, K.; Ramshankar, H.; Han, K.; Albert, A. Real-time vision-based worker localization & hazard detection for construction. Autom. Constr. 2021, 121, 13. [Google Scholar] [CrossRef]
  26. Kim, H.; Kim, H. 3D reconstruction of a concrete mixer truck for training object detectors. Autom. Constr. 2018, 88, 23–30. [Google Scholar] [CrossRef]
  27. Zhang, S.; Teizer, J.; Lee, J.-K.; Eastman, C.M.; Venugopal, M. Building information modeling (BIM) and safety: Automatic safety checking of construction models and schedules. Autom. Constr. 2013, 29, 183–195. [Google Scholar] [CrossRef]
  28. Wang, Z.; Zhang, Y.; Mosalam, K.M.; Gao, Y.; Huang, S.L. Deep semantic segmentation for visual understanding on construction sites. Comput.-Aided Civ. Infrastruct. Eng. 2022, 37, 145–162. [Google Scholar] [CrossRef]
  29. Wei, Y.; Akinci, B. Construction Scene Parsing (CSP): Structured annotations of image segmentation for construction semantic understanding. In Proceedings of the 18th International Conference on Computing in Civil and Building Engineering: ICCCBE 2020, São Paulo, Brazil, 18–20 August 2020; Springer: Cham, Switzerland, 2021. Available online: https://link.springer.com/chapter/10.1007/978-3-030-51295-8_80 (accessed on 14 July 2020).
  30. Hu, D.; Al Shafian, S. Segmentation and Tracking of Moving Objects on Dynamic Construction Sites. In Proceedings of the Construction Research Congress 2024, Des Moines, IA, USA, 18 March 2024. [Google Scholar] [CrossRef]
  31. Wu, H.; Zhong, B.; Li, H.; Love, P.; Pan, X.; Zhao, N. Combining computer vision with semantic reasoning for on-site safety management in construction. J. Build. Eng. 2021, 42, 103036. [Google Scholar] [CrossRef]
  32. Seo, J.; Han, S.; Lee, S.; Kim, H. Computer vision techniques for construction safety and health monitoring. Adv. Eng. Inform. 2015, 29, 239–251. [Google Scholar] [CrossRef]
  33. Chouhan, S.S.; Kaul, A.; Singh, U.P. Image Segmentation Using Computational Intelligence Techniques: Review. Arch. Comput. Methods Eng. 2019, 26, 533–596. [Google Scholar] [CrossRef]
  34. Kim, J.-S.; Yi, C.-Y.; Park, Y.-J. Image processing and QR code application method for construction safety management. Appl. Sci. 2021, 11, 4400. [Google Scholar] [CrossRef]
  35. Ray, S.J.; Teizer, J. Real-time construction worker posture analysis for ergonomics training. Adv. Eng. Inform. 2012, 26, 439–455. [Google Scholar] [CrossRef]
  36. Kong, L.; Li, H.; Yu, Y.; Luo, H.; Skitmore, M.; Antwi-Afari, M.F. Quantifying the physical intensity of construction workers, a mechanical energy approach. Adv. Eng. Inform. 2018, 38, 404–419. [Google Scholar] [CrossRef]
  37. Zhao, J.; Cao, Y.; Xiang, Y. Pose estimation method for construction machine based on improved AlphaPose model. Eng. Constr. Archit. Manag. 2024, 31, 976–996. [Google Scholar] [CrossRef]
  38. Feng, C.; Kamat, V.R.; Cai, H. Camera marker networks for articulated machine pose estimation. Autom. Constr. 2018, 96, 148–160. [Google Scholar] [CrossRef]
  39. Luo, H.; Wang, M.; Wong, P.K.-Y.; Tang, J.; Cheng, J.C. Vision-based pose forecasting of construction equipment for monitoring construction site safety. In Proceedings of the International Conference on Computing in Civil and Building Engineering, São Paulo, Brazil, 18–20 August 2020; Springer: Cham, Switzerland, 2021. Available online: https://link.springer.com/chapter/10.1007/978-3-030-51295-8_78 (accessed on 14 July 2020).
  40. Liang, C.-J.; Lundeen, K.M.; McGee, W.; Menassa, C.C.; Lee, S.; Kamat, V.R. A vision-based marker-less pose estimation system for articulated construction robots. Autom. Constr. 2019, 104, 80–94. [Google Scholar] [CrossRef]
  41. Paudel, P.; Choi, K.-H. A deep-learning based worker’s pose estimation. In Proceedings of the Frontiers of Computer Vision: 26th International Workshop, IW-FCV 2020, Ibusuki, Kagoshima, Japan, 20–22 February 2020; Revised Selected Papers 26. Springer: Singapore, 2020. [Google Scholar] [CrossRef]
  42. Park, M.-W.; Brilakis, I. Continuous localization of construction workers via integration of detection and tracking. Autom. Constr. 2016, 72, 129–142. [Google Scholar] [CrossRef]
  43. Brilakis, I.; Park, M.-W.; Jog, G. Automated vision tracking of project related entities. Adv. Eng. Inform. 2011, 25, 713–724. [Google Scholar] [CrossRef]
  44. Zhu, Z.; Ren, X.; Chen, Z. Integrated detection and tracking of workforce and equipment from construction jobsite videos. Autom. Constr. 2017, 81, 161–171. [Google Scholar] [CrossRef]
  45. Angah, O.; Chen, A.Y. Tracking multiple construction workers through deep learning and the gradient based method with re-matching based on multi-object tracking accuracy. Autom. Constr. 2020, 119, 103308. [Google Scholar] [CrossRef]
  46. Zhu, Z.; Ren, X.; Chen, Z. Visual tracking of construction jobsite workforce and equipment with particle filtering. J. Comput. Civil. Eng. 2016, 30, 04016023. [Google Scholar] [CrossRef]
  47. Lin, J.J.-C.; Hung, W.-H.; Kang, S.-C. Motion planning and coordination for mobile construction machinery. J. Comput. Civil. Eng. 2015, 29, 04014082. [Google Scholar] [CrossRef]
  48. Luo, H.; Wang, M.; Wong, P.K.-Y.; Cheng, J.C.P. Full body pose estimation of construction equipment using computer vision and deep learning techniques. Autom. Constr. 2020, 110, 103016. [Google Scholar] [CrossRef]
  49. Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. YOLACT: Real-time Instance Segmentation. arXiv 2019, arXiv:1904.02689. [Google Scholar] [CrossRef]
  50. Maji, D.; Nagori, S.; Mathew, M.; Poddar, D. YOLO-Pose: Enhancing YOLO for Multi Person Pose Estimation Using Object Keypoint Similarity Loss. 2022. Available online: https://openaccess.thecvf.com/content/CVPR2022W/ECV/html/Maji_YOLO-Pose_Enhancing_YOLO_for_Multi_Person_Pose_Estimation_Using_Object_CVPRW_2022_paper.html (accessed on 6 April 2022).
  51. Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. ByteTrack: Multi-Object Tracking by Associating Every Detection Box. arXiv 2021, arXiv:2110.06864. [Google Scholar] [CrossRef]
  52. Aharon, N.; Orfaig, R.; Bobrovsky, B.-Z. BoT-SORT: Robust Associations Multi-Pedestrian Tracking. arXiv 2022, arXiv:2206.14651. [Google Scholar] [CrossRef]
Figure 1. Technology roadmap.
Figure 2. The number of classes in the SODA dataset.
Figure 3. The number of classes in the MOCS dataset.
Figure 4. Image information provided by Luo et al.
Figure 5. The composite image.
Figure 6. The created images.
Figure 7. Different datasets of excavator key points information [48].
Figure 8. C2f module.
Figure 9. SPPF.
Figure 10. The training process of the object detection model.
Figure 11. Precision–recall and F1 curves of the object detection model.
Figure 12. Normalized confusion matrix of the object detection model.
Figure 13. The detection results of the object detection model.
Figure 14. The training process of the object segmentation model.
Figure 15. Precision–recall and F1 curves of the object segmentation model.
Figure 16. Normalized confusion matrix of the object segmentation model.
Figure 17. The segmentation results of the object segmentation model.
Figure 18. Precision–recall and F1 curves of the object pose model.
Figure 19. The pose detection results of different object pose models on the corresponding dataset.
Figure 20. Real-world effects of the corresponding model.
Figure 21. Real-world effects of the trained model on the integrated excavator dataset.
Figure 22. The tracking results.
Figure 23. Detection and segmentation, pose tracking ID mismatch.
Figure 24. The changes in tracking ID.
Table 1. Dataset division.

Tasks        | Datasets                               | Training | Validation | Test
Detection    | SODA                                   | 15,877   | 1985       | 1984
Segmentation | MOCS                                   | 18,724   | 2340       | 2340
Pose         | Open-source data                       | 1024     | 128        | 128
Pose         | Composite data                         | 1024     | 128        | 128
Pose         | Captured and virtual data              | 6332     | 791        | 791
Pose         | Integrated Excavator Pose (IEP) dataset | 8380     | 1047       | 1047
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
