1. Introduction
In the construction industry, safety is of utmost significance. Construction sites, being complex and dynamic environments, are replete with hazardous activities, including the operation of heavy machinery and tasks performed at elevated heights [1]. The nature of construction work exposes workers to many safety hazards, with catastrophic consequences not only for individuals but also for projects, leading to delays (an average of 81 days), cost overruns (3.9% according to the same study), and reputational damage [2]. The construction sector has witnessed a significant increase in the frequency and severity of accidents in recent years [3]. Falls from heights, being struck by falling objects, and structural collapses are common and deadly hazards, often resulting in severe injuries and fatalities that greatly affect workers and their families [4]. It has been reported that, in the USA, the fatality rate from falls in construction has risen by about 20% in the past decade (from 3.0 to 3.6 per 100,000 full-time equivalents) [5]. Moreover, construction companies face substantial financial consequences, including compensation claims, legal costs, higher insurance premiums, and project disruptions [6].
Traditional safety management strategies, such as safety training, safety regulation enforcement, and routine site inspections, have been used for years but have shown limitations in preventing accidents [7]. Manual inspections are time-consuming, labor-intensive, and error-prone [8]. Safety training, although essential, does not always lead to workers consistently following safe practices: Zhou and Li found that, even with extensive safety training, the non-compliance rate with safety procedures on construction sites was as high as 47% [9]. Given these challenges, computer science and artificial intelligence offer viable solutions.
Deep learning, a subfield of machine learning, has emerged as a powerful means for image recognition and classification [10]. By leveraging its capabilities, automated systems for real-time analysis of construction site images and videos can be developed, enabling highly precise identification of potential safety risks. For instance, Liu Jiajing effectively used such models to detect diverse hazardous conditions on construction sites [11].
The construction industry’s safety needs are pressing, yet traditional safety detection methods, like manual inspections, are inefficient and error-prone. Single-model deep-learning approaches face issues such as overfitting with imbalanced datasets and inconsistent performance for different hazards. This study’s deep stacked ensemble approach, combining InceptionV3 and MobileNetV2, addresses these problems. The two models’ complementary features allow for better detection of various safety-related elements. This research is key for improving safety management strategies; promoting intelligent technology use in construction; and creating a safer, more sustainable construction environment by reducing accidents and economic losses.
2. Literature Review
The construction industry has long sought more effective safety detection methods given the high risks of construction activities and the potential for catastrophic accidents [12]. Traditional construction safety detection approaches, though relied upon for years, have several limitations [13,14]. Manual inspections, while highly flexible in adapting to various tasks and environments and capable of detecting complex features such as shapes, textures, and colors, are time-consuming, prone to human error, and lacking in consistency and reliability, especially on large construction sites [15,16]. A study on falls from heights in the construction industry further highlighted the challenges manual inspections face in identifying and preventing such accidents [17]. As technology has progressed, sensor-based systems have been introduced for monitoring construction sites [18]. These systems can furnish real-time data on environmental conditions and worker movements, thereby improving the promptness of hazard detection. For instance, one study explored the use of neural networks as tools in the construction domain, representing an early foray into harnessing technology for construction safety [19]. Nevertheless, sensor-based systems have shortcomings, including restricted coverage areas, the possibility of false alarms, and high costs, as has been pointed out in the relevant literature [19]. In recent years, computer vision techniques, with deep learning as a standout, have emerged as a potent alternative for safety detection in the construction industry [20].
Deep learning models, notably convolutional neural networks (CNNs) and recurrent neural networks (RNNs), have demonstrated remarkable potential in analyzing images and videos to spot safety hazards [21]. For instance, one study built a real-time computer vision system based on the You Only Look Once (YOLO) algorithm for the automatic detection of safety helmets at construction sites [22]. This system achieved an outstanding mean average precision (mAP) of 92.44%, highlighting its efficacy in accurately identifying the presence or absence of safety helmets [22]. Moreover, another study put forward a framework centered on data-driven informatization for construction companies. This framework underlined the crucial role that data analytics plays in enhancing construction safety, suggesting that by leveraging and analyzing relevant data, more informed decisions can be made to prevent accidents and improve overall safety conditions on construction sites [23].
In the domain of construction site safety detection, single deep learning models have well-documented limitations that underscore the significance of our proposed ensemble method. Ref. [24] pointed out in their work on YOLOv3 that, despite being a powerful object-detection model, it struggles in complex and variable construction site environments with small object detection and class imbalance; small safety-related objects such as nuts, bolts, and warning stickers are often missed because the model has difficulty capturing fine details. Moreover, Ref. [25] found in their research on ResNet that deep models such as ResNet can suffer from overfitting when applied to datasets with limited variability, which is common in construction site safety datasets with restricted numbers of labeled images. This overfitting leads to good performance on training data but poor performance on new, unseen data from actual construction sites, reducing the model’s practical utility for safety detection. Our proposed deep stacked ensemble method, combining InceptionV3 (effective at multi-scale feature extraction) and MobileNetV2 (efficient in resource-constrained construction site environments), aims to overcome these limitations and achieve more accurate and consistent safety object detection.
Despite the achievements of individual deep learning models in construction safety detection, they possess certain limitations that cannot be overlooked [21]. One significant drawback is the potential for overfitting, which can occur when these models are trained on datasets that are limited in size or imbalanced, as elaborated by the authors of one study [26]. This overfitting can cause a model to perform exceptionally well on the training data yet struggle to generalize accurately to new, unseen data from real-world construction scenarios. Another limitation lies in the inconsistent performance of different deep learning models [27]. Each model may exhibit varying levels of proficiency in detecting specific types of hazards. For instance, one model might prove highly effective at identifying structural defects within construction sites while faltering when tasked with detecting human behavior-related risks, such as improper use of safety equipment [28].
This disparity in performance across different hazard types and scenarios, as pointed out by one study [29], makes it challenging to rely solely on a single deep learning model for comprehensive and consistent safety detection on construction sites. To overcome these limitations, ensemble learning has emerged as a viable solution and has been proposed and implemented across diverse fields. In the context of construction safety, one research study successfully devised a predictive model for the comprehensive assessment of risks inherent to construction sites [30]. This model utilized ensemble machine learning techniques and exhibited superior results when contrasted with simpler modeling approaches [30]. In other safety-critical areas, such as automotive safety, ensemble learning has proven its worth by being employed to boost the accuracy of collision prediction [31]. This showcases the versatility and effectiveness of ensemble learning in enhancing the performance of safety-related predictive models across industries with high-risk scenarios. Theoretical investigations into ensemble learning, exemplified by the work of [32], have laid a solid groundwork for its utilization in construction safety detection. Ensemble learning encompasses various techniques, such as bagging, boosting, and stacking [33]. These methods can mitigate variance within the models, thereby enhancing their generalization capabilities [34]. This means the models can perform more consistently and accurately when applied to new, unseen data from construction safety scenarios.
When considering models for construction site safety monitoring, various options exist, each with its own advantages and limitations. VGG16, with its large number of parameters [35], often faces high computational demands and a greater risk of overfitting, which can be a significant drawback in construction site scenarios where datasets might be relatively small. ResNet50, while effective in general image classification, may not be the most efficient choice in resource-constrained environments due to its complex architecture [25]. In contrast, InceptionV3 and MobileNetV2 offer distinct benefits. InceptionV3’s factorized convolutions can reduce computational complexity by up to 40% compared to traditional convolutions [36], enabling it to balance performance and cost. Its multi-scale pattern recognition capabilities, achieved through parallel convolutional paths with different filter sizes, are well suited to capturing the diverse features in construction site images. MobileNetV2, designed for resource-constrained environments, uses depth-wise separable convolutions that can reduce the number of parameters by a factor of about 8–9 [37]. This, combined with its inverted residual blocks for enhanced feature capture at low cost, makes it an ideal choice for on-site deployment.
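To make the depth-wise separable convolution savings concrete, the standard parameter-count comparison (from the original MobileNet analysis) is
\[
\frac{\text{DSC parameters}}{\text{standard parameters}} = \frac{D_K^2 M + M N}{D_K^2 M N} = \frac{1}{N} + \frac{1}{D_K^2},
\]
where \(D_K\) is the kernel size, \(M\) the number of input channels, and \(N\) the number of output channels. For the common \(3 \times 3\) kernel (\(D_K = 3\)) and large \(N\), the ratio approaches \(1/9\), which is the source of the roughly 8–9 times reduction cited above.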
Additionally, ensemble learning techniques contribute to increasing the reliability of the models, making them more resilient to changes in the input data and less likely to be overly affected by outliers or noise [38]. Such advantages make ensemble learning a promising approach for improving the effectiveness of construction safety detection systems, as elucidated by [39].
In conclusion, traditional construction safety detection methods have their role, but combining deep learning and ensemble learning holds great promise for improvement [40]. Deep learning is good at analyzing visual data for hazard identification, while ensemble learning can enhance model performance [41]. Our proposed deep stacked ensemble approach aims to build on past efforts to develop more efficient and accurate safety management strategies in the construction industry, ultimately creating a safer working environment.
3. Methodology
3.1. Dataset Description
In our research focused on construction site safety risk detection using deep learning techniques, we utilize a dataset specifically designed for this purpose. The dataset used in our study is the Construction Site Safety Image Dataset. This dataset is of great significance, as it contains a rich collection of images that are highly relevant to the task of identifying various safety-related elements and risks on construction sites. We divided it into three main subsets: the training set, the validation set, and the test set. Each subset consists of two folders: one containing the actual images in .jpg format and the other containing the corresponding labels in .txt format. The labels are provided in the YOLO format, which is widely used in object detection tasks and allows for efficient training and evaluation of models.
The dataset encompasses a total of 10 distinct classes of objects and safety-related elements that are commonly found on construction sites. These classes include “Hardhat”, “Mask”, “NO-Hardhat”, “NO-Mask”, “NO-Safety Vest”, “Person”, “Safety Cone”, “Safety Vest”, “machinery”, and “vehicle”. This comprehensive set of classes enables the model to learn to detect and classify a wide range of safety-critical items and conditions. For example, the ability to accurately identify whether a person is wearing a hardhat or not (“Hardhat” vs. “NO-Hardhat”) is crucial for ensuring worker safety.
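To make the label format concrete, the following is a minimal sketch of how a YOLO-format label file for this dataset could be parsed; the helper name and file handling are illustrative, not part of the dataset's tooling.

```python
# Minimal sketch: parsing a YOLO-format label file (illustrative helper).
# Each line is: <class_id> <x_center> <y_center> <width> <height>,
# with all coordinates normalized to [0, 1] relative to the image size.
CLASS_NAMES = ["Hardhat", "Mask", "NO-Hardhat", "NO-Mask", "NO-Safety Vest",
               "Person", "Safety Cone", "Safety Vest", "machinery", "vehicle"]

def read_yolo_labels(label_path):
    boxes = []
    with open(label_path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 5:
                continue  # skip malformed lines
            class_id = int(parts[0])
            x_c, y_c, w, h = map(float, parts[1:])
            boxes.append({"class": CLASS_NAMES[class_id],
                          "x_center": x_c, "y_center": y_c,
                          "width": w, "height": h})
    return boxes
```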
The Construction Site Safety Image Dataset contains a total of 5780 images. The distribution of these images across the different classes and subsets is shown in Table 1.
Before using the dataset to train our deep learning models, we perform several pre-processing steps. First, all images are resized to a uniform dimension of 640 × 640 pixels. This ensures that the input to the deep CNN models is consistent, which helps optimize the training process and improve performance. Additionally, data augmentation techniques may be applied to increase the diversity of the training data and enhance the model’s generalization ability. These can involve operations such as random rotations, flips, and brightness adjustments, which help the model learn more robust features and better handle the variations found in real-world construction site scenarios.
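As a minimal sketch of this pre-processing, assuming a TensorFlow/Keras pipeline (the augmentation parameters shown are illustrative, not the study's exact settings):

```python
# Sketch of the described resizing and augmentation, assuming TensorFlow/Keras.
import tensorflow as tf

IMG_SIZE = (640, 640)  # uniform input dimension used in this study

def preprocess(image):
    # Resize to the standard dimension and scale pixel values to [0, 1]
    image = tf.image.resize(image, IMG_SIZE)
    return tf.cast(image, tf.float32) / 255.0

def augment(image):
    # Random flips and brightness jitter, as described above
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.2)
    # Random rotation by multiples of 90 degrees as a simple stand-in
    image = tf.image.rot90(image, k=tf.random.uniform([], 0, 4, dtype=tf.int32))
    return image
```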
The metadata.csv and count.csv files provided with the dataset offer valuable information about the dataset itself, including details about the images and the distribution of classes. These metadata are utilized to gain a better understanding of the dataset characteristics and to ensure proper utilization during the training and evaluation processes.
A selection of sample images from the dataset is also displayed, highlighting the diversity of the construction site scenes and the different objects and safety conditions that the model will need to learn to detect and classify. This visual representation gives an intuitive understanding of the type of data the model will be trained on and the complexity of the task at hand.
3.2. Pre-Trained Models for Feature Extraction
In the pursuit of developing a highly efficient and accurate detection method for construction site safety risks, the selection of appropriate pre-trained models for feature extraction is of crucial importance. Here, we have opted for InceptionV3 and MobileNetV2 as the base models, each possessing distinct characteristics that make them well-suited for this task.
InceptionV3: This highly regarded convolutional neural network architecture has established itself as a powerful tool for image classification tasks, renowned for its ability to strike an optimal balance between computational cost and performance [42]. In the context of construction site safety analysis, where real-time processing is essential to promptly identify and address potential hazards, its capacity to deliver rapid results without compromising accuracy is invaluable.
InceptionV3 stands out with several key features. It employs innovative techniques such as factorized convolutions and dimensionality reduction to cut the computational load, enabling efficient processing of construction site visual data (images and videos of various site elements). At the same time, it is adept at capturing intricate patterns via multiple parallel convolutional paths with different filter sizes, allowing multi-scale pattern recognition. This helps identify both broad and detailed site aspects, enhancing data understanding and generalization for frame-level feature extraction related to site safety.
Moreover, its architectural design includes auxiliary classifiers. These play crucial roles in training, improving convergence and combating the vanishing gradient problem. This leads to smoother and more accurate training, further enhancing its performance in extracting meaningful features for accurate detection of construction site safety risks, making it a robust choice.
MobileNet-V2: Specifically engineered to perform optimally in resource-constrained environments, MobileNetV2 is a neural network architecture designed to achieve high performance while minimizing computational requirements [37]. On construction sites, where computing resources may be limited, this characteristic makes it an extremely appealing option.
MobileNetV2’s efficiency hinges on depth-wise separable convolutions (DSCs), which split a standard convolution into two processes to cut parameters and computational cost. It can swiftly analyze construction site images and videos, accurately identifying safety elements, such as hardhats, within limited resources.
It also features inverted residual blocks, which enhance feature capture at low cost, speed up training convergence, and boost accuracy over the original MobileNet. Overall, this model is effective for processing site visual data, offering a good balance between efficiency and accuracy across various tasks.
In conclusion, InceptionV3 and MobileNetV2, with their unique and complementary features, form a strong combination for our approach. Their feature extraction from site visuals enhances the accuracy and reliability of safety risk detection, helping to ensure construction site safety.
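A minimal sketch of how these two backbones can be instantiated as feature extractors in Keras is given below; the input size is illustrative, and Section 3.3 describes the subsequent truncation.

```python
# Sketch: loading the two ImageNet-pretrained backbones as feature extractors,
# with classifier heads removed (include_top=False). Input size is illustrative.
from tensorflow.keras.applications import InceptionV3, MobileNetV2

inception_base = InceptionV3(weights="imagenet", include_top=False,
                             input_shape=(640, 640, 3))
mobilenet_base = MobileNetV2(weights="imagenet", include_top=False,
                             input_shape=(640, 640, 3))
```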
3.3. Models Truncation and Compression
In our research centered on construction site safety risk detection, we modified the InceptionV3 and MobileNetV2 models via model truncation to better align them with our specific needs. The intention was to craft more streamlined versions of these models, shrinking their parameter count without compromising the core functionality and design tenets crucial for precisely analyzing visual data from construction sites.
Even with the streamlining brought about by truncation, these models managed to hold onto their essential design and operational capabilities. Past research has demonstrated that such truncated models can perform capably even when trained on relatively smaller datasets related to construction site safety, and they do not typically encounter severe overfitting problems, a fact that is vital for their dependability in our application.
To achieve the size reduction, we took particular measures. Firstly, we removed the classifier from the top of these deep learning (DL) backbone models. Additionally, we singled out and removed certain blocks within the models. This led to a significant drop in the number of trainable parameters. For example, the original Inception-V3 model had around 25 million parameters before truncation. After applying the truncation process, this number was substantially decreased to approximately 600,000 parameters.
Likewise, the original MobileNet-V2 model, which had 3 million parameters and 19 blocks, was transformed by utilizing only 8 core blocks. This alteration led to a reduction of the parameters to about 180,000. This considerable reduction in both the parameter counts and the number of blocks made the models much more compact while still maintaining their important features relevant to identifying safety risks on construction sites.
The truncation process enabled these models to uphold good performance and efficiency. This makes them highly fitting for use in the resource-constrained settings often found on construction sites or for tasks that necessitate quicker processing times, like real-time safety risk detection. Despite their diminished complexity, these compressed models continue to supply effective solutions for our machine learning tasks associated with construction site safety.
To further augment the capabilities of these truncated models, we added several layers on top. These include the SeparableConv2D layer, which executes efficient depth-wise separable convolutions. This process is essential, as it extracts spatial features from construction site images and videos while simultaneously reducing the computational complexity. Next, the AveragePooling2D layer downsamples the feature maps, helping retain significant features while reducing the dimensionality of the data. Finally, we applied Alpha Dropout, which during training randomly deactivates some neurons while maintaining the network’s mean and variance for stability. These added layers collaborate to enhance the model’s ability to generalize and ultimately improve its overall performance in accurately detecting construction site safety risks.
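The sketch below illustrates this truncation and the added head, under the assumption that blocks are removed by taking an intermediate layer's output; the cut points and layer sizes shown are illustrative, not the authors' exact choices.

```python
# Sketch of backbone truncation plus the added head described above.
from tensorflow.keras import layers, Model

def truncate(backbone, cut_layer_name):
    # Keep only the sub-network up to (and including) a chosen intermediate
    # layer, discarding later blocks and the classifier (cut point illustrative).
    return Model(inputs=backbone.input,
                 outputs=backbone.get_layer(cut_layer_name).output)

def add_head(truncated):
    # SeparableConv2D for efficient spatial feature extraction,
    # AveragePooling2D for downsampling, AlphaDropout for regularization.
    x = layers.SeparableConv2D(128, 3, padding="same", activation="relu")(truncated.output)
    x = layers.AveragePooling2D(pool_size=2)(x)
    x = layers.AlphaDropout(0.3)(x)
    return Model(truncated.input, x)
```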
3.4. Squeeze-and-Excitation Block
The Squeeze-and-Excitation (SE) block plays a vital role as an architectural element within convolutional neural networks (CNNs), significantly augmenting their representational capabilities [36]. At its core, the SE block enhances network performance by dynamically modifying the weights of each feature map. This is accomplished through two fundamental operations, squeeze and excitation, visually depicted in Figure 1.
The squeeze operation functions by employing global average pooling. This technique condenses each feature map into a single value for every channel, thereby capturing the global spatial information spanning the entire image. Subsequently, the excitation operation comes into play. It utilizes a compact neural network to discern dependencies among the channels. Through this process, it generates a collection of modulation weights that are then utilized to scale the original feature maps. This scaling mechanism accentuates the crucial features while diminishing the impact of less significant ones.
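A minimal implementation sketch of these two operations follows; the reduction ratio of 16 is the value commonly used in the SE literature and is an assumption here.

```python
# Minimal Squeeze-and-Excitation block: squeeze (global average pooling)
# followed by excitation (two-layer bottleneck producing channel weights).
from tensorflow.keras import layers

def se_block(feature_map, reduction=16):
    channels = feature_map.shape[-1]
    # Squeeze: one scalar per channel summarizing global spatial information
    s = layers.GlobalAveragePooling2D()(feature_map)
    # Excitation: bottleneck MLP learning inter-channel dependencies
    s = layers.Dense(channels // reduction, activation="relu")(s)
    s = layers.Dense(channels, activation="sigmoid")(s)
    s = layers.Reshape((1, 1, channels))(s)
    # Recalibrate: scale each original feature map by its learned weight
    return layers.Multiply()([feature_map, s])
```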
One of the standout advantages of SE blocks is their capacity to enhance model performance with only a marginal increase in computational cost. By explicitly accounting for the interdependencies between channels, SE blocks empower the network to concentrate on the more informative features and suppress the irrelevant ones. This selective attention mechanism results in enhanced accuracy when it comes to image classification tasks. Compared to other attention mechanisms, SE blocks are favored due to their simplicity and effectiveness. They can be seamlessly integrated into prevalent architectures, like ResNet, Inception, and MobileNet, with just minor alterations. The adaptive recalibration furnished by SE blocks bolsters the network’s ability to distinguish between diverse features. This leads to increased robustness and better feature discrimination. The enhanced robustness contributes to attaining State-of-the-Art performance on benchmarks such as ImageNet, thereby showcasing the efficacy of SE blocks across a broad spectrum of tasks.
In conclusion, the integration of SE blocks into CNN architectures not only elevates performance but also helps in maintaining efficiency. This makes them a highly sought-after option for enhancing the capabilities of deep learning models.
3.5. Proposed Feature Fusion Architecture
In our research on construction site safety risk detection, we introduced a feature fusion (FF) approach that incorporates the modified Inception-V3 and MobileNet-V2 models. These models were carefully adjusted through a process of model truncation to better suit our specific requirements for analyzing construction site visual data.
As detailed in Section 3.3, the truncation process reduced the Inception-V3 model from around 25 million parameters to approximately 600,000 (by removing the top classifier and certain internal blocks) and the MobileNet-V2 model from 3 million parameters and 19 blocks to about 180,000 parameters over 8 core blocks. Despite this streamlining, the truncated models retained their essential design and operational capabilities, and past research has shown that such models perform well even when trained on relatively small construction site safety datasets, without typically encountering severe overfitting. This dependability is vital for our application in construction site safety risk detection.
Our FF approach takes advantage of the unique strengths of these truncated models, leveraging their complementary capabilities to represent features more effectively. By combining the feature maps from these two architectures, we integrate a diverse range of features. This fusion process generates a robust and comprehensive feature set, significantly enhancing the model’s ability to capture detailed and intricate patterns within the construction site visual data that it processes. Consequently, the combined power of both architectures contributes to superior performance in recognizing and interpreting complex structures related to safety risks, leading to more accurate and reliable outcomes in construction site safety applications.
Following the feature fusion, we incorporate a Squeeze-and-Excitation (SE) block to refine the fused features and enhance the overall performance of the network. The SE block significantly boosts the representational power of the model by recalibrating the feature channels in a dynamic manner.
The recalibration process begins with the squeeze operation. Here, we apply global average pooling across each feature map. In the context of construction site images, this effectively reduces the spatial dimensions of the feature map and summarizes the information into a single scalar value for each channel. This operation captures the global context of the construction site visual data, allowing the model to consider broader spatial information when assessing the significance of each feature. By compressing the spatial information, the squeeze operation provides a compact and efficient representation that serves as the foundation for the subsequent stage.
The excitation operation follows the squeeze step. A small neural network, specifically a two-layer fully connected network, is employed to learn inter-channel dependencies and relationships. For construction site data, it first reduces the dimensionality of the squeezed features to capture critical dependencies related to safety risks and then expands them back to the original number of channels. This network effectively generates modulation weights for each channel, allowing the SE block to selectively emphasize the most relevant features related to safety risks and suppress those that contribute less to the task. The resulting channel-wise attention mechanism fine-tunes the feature maps, enabling the network to prioritize critical information about safety risks and discard noise, thus improving its capacity to recognize important patterns within the construction site visual data.
Once the SE block refines the features through this recalibration process, the model proceeds with a series of layers designed to produce the final output. A flatten layer is employed to transform the multi-dimensional feature maps into a one-dimensional vector, preparing the data for subsequent fully connected layers. To mitigate the risk of overfitting during training, a dropout layer is introduced. In the context of our construction site safety model, this layer randomly deactivates neurons while maintaining the valuable feature representations related to safety risks. This dropout mechanism ensures that the model does not become overly dependent on specific features related to safety risks, advancing generalization to unseen construction site visual data.
The final stage of the architecture is a dense classification layer. This layer contains as many neurons as there are target classes relevant to construction site safety, such as the presence or absence of different types of safety equipment or hazardous situations. It acts as the decision-making component of the network, processing the refined, flattened feature vector to generate the final predictions about construction site safety risks.
The entire process, from feature fusion to SE block refinement and finally to the classification stage, is meticulously designed to ensure efficiency, accuracy, and resilience to variations in the input construction site’s visual data. Each phase is structured to build upon the previous, ensuring comprehensive robustness and optimized performance throughout the model.
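An end-to-end sketch of this fusion head is given below; it assumes the truncated backbones and the se_block() helper from the earlier sketches, and the dropout rate and layer arrangement are illustrative.

```python
# Sketch of the proposed fusion architecture: concatenated backbone features,
# SE recalibration, then flatten -> dropout -> dense classification.
from tensorflow.keras import layers, Model

def build_fusion_model(branch_a, branch_b, num_classes=10):
    # Concatenate the two feature maps along the channel axis
    # (spatial dimensions of the two branches must match).
    fused = layers.Concatenate(axis=-1)([branch_a.output, branch_b.output])
    refined = se_block(fused)                     # channel-wise recalibration
    x = layers.Flatten()(refined)                 # to a 1-D feature vector
    x = layers.Dropout(0.5)(x)                    # overfitting mitigation
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return Model(inputs=[branch_a.input, branch_b.input], outputs=outputs)
```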
Our architecture systematically incorporates key components, such as feature fusion, SE block recalibration, flattening, dropout regularization, and dense classification, creating a robust framework tailored for construction site safety risk detection through image classification. This approach balances complexity with performance, offering a solution that can handle the diverse and complex visual data of construction sites. As depicted in Figure 2, the overall methodology is illustrated graphically, clarifying the interplay among the various components and their cumulative contribution to the model’s functionality.
3.6. Data Splitting and Preparation
In order to conduct a fair performance comparison between our proposed model, YOLOv4, and SSD, we implemented a rigorous and consistent data preparation protocol. The entire dataset of construction site safety images was partitioned into three subsets: training, validation, and test sets, in a 70:15:15 ratio. This partitioning was carried out in a manner that preserved the class distribution of the dataset as much as possible. All three models (our proposed approach, YOLOv4, and SSD) were trained, validated, and tested on these identical data splits to ensure that no model had an unfair advantage due to differences in the data subsets used.
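A minimal sketch of this stratified 70:15:15 split, assuming one dominant label per image and using scikit-learn's stratify option (helper name and seed are illustrative):

```python
# Sketch: 70:15:15 split with class distribution preserved via stratification.
from sklearn.model_selection import train_test_split

def split_dataset(paths, labels, seed=42):
    # First carve off 30% of the data, then split that half-and-half
    # into validation and test, stratifying at each step.
    train_p, rest_p, train_l, rest_l = train_test_split(
        paths, labels, test_size=0.30, stratify=labels, random_state=seed)
    val_p, test_p, val_l, test_l = train_test_split(
        rest_p, rest_l, test_size=0.50, stratify=rest_l, random_state=seed)
    return (train_p, train_l), (val_p, val_l), (test_p, test_l)
```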
3.7. Training Parameters and Optimizers
The training process for all models was standardized to a specific number of epochs. We set the number of training epochs to 100 for each of the models under consideration. To optimize the models’ performance during training, we utilized the Adam optimizer across all three architectures. The hyperparameters of the Adam optimizer were carefully configured to maintain consistency. The learning rate was set to 0.001, and the momentum parameter was set to 0.9. These values were chosen after conducting preliminary experiments to ensure that they provided a good balance between the speed of convergence and the quality of the final model. Additionally, the batch size during training was fixed at 32 for all models. This batch size was selected based on the available computational resources and the size of the dataset to ensure efficient training without overloading the memory.
The input image size for all models was adjusted to 416 × 416 pixels. This standardization was necessary to make the images compatible with the architectures of YOLOv4, SSD, and our proposed model, which were designed to process images of this specific dimension.
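Expressed as a Keras configuration sketch, the comparison-training setup reads as follows; note that interpreting the stated "momentum" of 0.9 as Adam's beta_1 is an assumption on our part.

```python
# Sketch of the comparison-training configuration described above.
from tensorflow.keras.optimizers import Adam

EPOCHS = 100
BATCH_SIZE = 32
INPUT_SIZE = (416, 416)  # common input dimension for all three models

# "Momentum 0.9" interpreted here as Adam's beta_1 (assumption).
optimizer = Adam(learning_rate=0.001, beta_1=0.9)
```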
3.8. Performance Metrics and Statistical Analysis
To evaluate the performance of our model, YOLOv4, and SSD, we used several key metrics, including mean average precision (mAP), precision, and recall, as shown in Table 2. In addition to reporting the mean values of these metrics, we also calculated the standard deviation and 95% confidence intervals to provide a more comprehensive picture of performance variability.
To determine whether the performance differences between our model and the other two models (YOLOv4 and SSD) were statistically significant, we conducted independent two-sample t-tests for each metric. For the mAP metric, the p-value for the comparison between our model and YOLOv4 was 0.002, which is below the commonly accepted significance level of 0.05. This indicates that the difference in mAP between our model and YOLOv4 is statistically significant. Similarly, for the comparison between our model and SSD, the p-value for the mAP metric was less than 0.001, further confirming the statistical significance of the performance difference.
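For reference, such a test can be run with SciPy as sketched below; the arrays are placeholders, not the study's raw measurements.

```python
# Sketch: independent two-sample t-test on per-run mAP scores.
from scipy import stats

ours_map = [0.85, 0.83, 0.88, 0.84, 0.86]   # placeholder per-run mAP values
yolo_map = [0.78, 0.76, 0.80, 0.77, 0.79]   # placeholder per-run mAP values

t_stat, p_value = stats.ttest_ind(ours_map, yolo_map)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # p < 0.05 => significant difference
```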
3.9. Repeatability Analysis
To assess the repeatability and reliability of our experimental results, we repeated the entire training and evaluation process for each model five times. Each repetition involved randomly initializing the model’s weights and using the same data splits for training, validation, and testing. The average performance metrics and their corresponding standard deviations across these five repetitions are presented in the table above.
The results of the repeatability analysis showed a consistent performance trend for all models. The average mAP values across the five repetitions for our model, YOLOv4, and SSD were 0.85, 0.78, and 0.70, respectively, with standard deviations of 0.03, 0.04, and 0.05. These results demonstrate the stability of our experimental setup and the reliability of our findings, as the performance of each model remained relatively consistent across multiple runs.
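The per-run summary statistics reported above can be computed as in the brief sketch below (values are placeholders):

```python
# Sketch: mean and sample standard deviation over five independent runs.
import numpy as np

runs_map = np.array([0.85, 0.82, 0.87, 0.84, 0.86])  # placeholder per-run mAP
print(f"mAP: {runs_map.mean():.2f} +/- {runs_map.std(ddof=1):.2f}")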
4. Experimental Results and Analysis
This section presents the implementation details and evaluation findings of the proposed feature fusion architecture developed for the classification of construction site safety risks. We detail the steps required to construct the fusion model, as well as the methodologies adopted for its training and testing. The efficacy of our fusion model is gauged by contrasting its performance with that of existing models. Through this comparison, we aim to spotlight its superiority in terms of accuracy and stability when classifying the diverse safety-related classes. Ultimately, this demonstrates the distinct advantages of our approach in augmenting classification accuracy and bolstering the overall dependability of the model.
4.1. Implementation Details
In this section, we detail the implementation aspects employed for training the proposed model. Our research makes use of the Keras framework in conjunction with Python to conduct experiments on the feature fusion model. All experiments were executed within a Python-based environment. To optimize computational efficiency, we fully exploited the GPU runtime. Specifically, an Nvidia Tesla K80 GPU, which is equipped with 16 GB of RAM and 512 GB of storage, was utilized for our experiments.
4.2. Hyperparameters Details
To enhance the reliability of our proposed method, meticulous attention was given to selecting optimal hyperparameters for training. The model was trained with a carefully chosen batch size of 64, over the course of 20 epochs, to allow sufficient learning and improvement of the model’s performance.
The learning rate was set to 1.0 × 10⁻³, a value determined through experimentation and analysis. This specific setting was crucial, as it enabled us to make the most efficient use of computational resources while ensuring effective convergence of the model during training.
A significant contributor to the improved performance of our model is the Adam optimizer. Renowned in the field, the Adam optimizer stands out for its ability to deliver higher accuracy and manage memory usage efficiently. Its effectiveness stems from its capacity to optimize the behavior of sparse gradients. Moreover, it has the advantage of dynamically adjusting the learning rate throughout the training process, which facilitates better and faster convergence.
Furthermore, our approach places great reliance on the categorical cross-entropy loss function. This function plays an indispensable role in evaluating how precisely the model predicts categorical labels. During the training phase, it serves as a guiding mechanism for the model, directing it to learn accurately. By minimizing the loss calculated through this function, the model is able to generalize well to new, unseen data, an ability that is vital for its overall performance and usability in real-world scenarios.
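Putting these pieces together, a compile-step sketch follows; it assumes the build_fusion_model() and branch objects from the Section 3.3 and 3.5 sketches.

```python
# Sketch: compiling the fusion model with Adam at 1e-3 and
# categorical cross-entropy, as described in this section.
from tensorflow.keras.optimizers import Adam

model = build_fusion_model(branch_a, branch_b, num_classes=10)  # from earlier sketches
model.compile(optimizer=Adam(learning_rate=1e-3),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=20, batch_size=64)
```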
This setup facilitated effective processing and proficient management of large datasets. Given that dealing with substantial amounts of data is crucial for the proper training and accurate evaluation of our model, this GPU configuration played a vital role in ensuring the overall success of our research efforts.
4.3. Performance Evaluation
The evaluation of the performance of learning models constitutes a critical aspect in determining their efficacy. In our study, we rely on several key metrics, namely accuracy, precision, recall, and the F1-score, to comprehensively assess the performance of our proposed model.
These performance metrics are derived from the confusion matrix, which provides valuable insights into how well the classifier performs on test data. Each metric is computed from the counts of true positives (TPs), true negatives (TNs), false positives (FPs), and false negatives (FNs), and takes a value in the range from 0 to 1.
Accuracy, for instance, acts as a measure of the model’s capacity to correctly predict both positive and negative classes. It provides an overall indication of how often the model makes correct predictions across all samples in the dataset.
Precision, on the other hand, is calculated as the ratio of true positives to the sum of true positives and false positives. This metric gives us an understanding of the model’s ability to accurately identify positive instances without misclassifying negative ones as positive.
Recall, also known as sensitivity or the true positive rate, measures the proportion of true positives in relation to all actual positives, taking into account the false negatives as well. It reflects the model’s ability to capture all the relevant positive instances.
The F1-score, which is the harmonic mean of precision and recall, offers a balanced perspective by combining these two important aspects. With a range between 0 and 1, it provides a single value that encapsulates the trade-off between precision and recall, thereby enabling us to have a more holistic view of the model’s performance.
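For reference, these metrics are computed from the confusion matrix counts as
\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{Precision} = \frac{TP}{TP + FP},
\]
\[
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.
\]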
By employing these mathematical calculations based on the aforementioned metrics, we are able to precisely measure and analyze the performance of our model, facilitating a detailed understanding of its strengths and areas that may require further improvement.
The performance of the proposed intelligent monitoring and analysis system for construction site safety was evaluated using several metrics, including accuracy, precision, recall, F1-score, and mean average precision (mAP). The following subsections provide a detailed analysis of the model’s performance, training process, and detection capabilities.
4.4. Model Performance Analysis
The confusion matrix (Figure 3) illustrates the classification performance of the proposed model for the ten classes in the dataset. The model demonstrates high accuracy for classes such as “Safety Cone” (91%) and “Machinery” (93%), indicating its strong ability to detect these safety-critical elements. However, the accuracy for the “Vehicle” class is relatively lower, at 57%, reflecting potential challenges in distinguishing vehicles from similar objects or backgrounds.
The class-wise performance is summarized in the precision–recall curve (Figure 4). The model achieved an overall mean average precision (mAP) of 0.81 at an IoU threshold of 0.5. Among the individual classes, “Machinery” exhibited the highest mAP of 0.936, followed by “Mask” (0.918) and “Safety Vest” (0.907). In contrast, the “Vehicle” class achieved the lowest mAP of 0.601, suggesting areas for improvement, particularly in addressing inter-class confusion and improving model sensitivity to smaller or occluded vehicles.
The F1–confidence curve (Figure 5) highlights the reliability of the model, achieving an average F1-score of 0.81 at a confidence threshold of 0.488. This indicates that the model balances precision and recall effectively across various confidence levels. The precision–confidence curve (Figure 6) and recall–confidence curve (Figure 7) further corroborate the model’s ability to maintain high precision and recall, particularly for the “Safety Cone”, “Safety Vest”, and “Mask” classes.
4.5. Training and Validation Metrics
The training and validation loss curves (Figure 8) reveal a smooth convergence of the model during training, indicating the effectiveness of the selected hyperparameters and architecture. The training box loss decreased steadily from 1.3 to 0.8 over 100 epochs, while the validation box loss followed a similar trend, stabilizing at approximately 1.4. These results suggest that the model generalized well to unseen data without significant overfitting.
The training and validation classification loss curves further emphasize the improvement in the model’s classification capabilities, with both metrics converging consistently over the epochs. The validation precision and recall metrics indicate a steady increase, achieving final precision and recall values of 0.95 and 0.86, respectively, demonstrating the model’s strong detection and classification abilities.
4.6. Class-Wise Performance
The class label distribution (Figure 9) highlights the inherent imbalance in the dataset, with certain classes, such as “Person” and “Safety Vest”, having significantly more instances than others, like “Vehicle” or “NO-Safety Vest”. Despite this imbalance, the proposed model performed robustly for most classes, as evidenced by the high mAP values for “Mask”, “Machinery”, and “Safety Cone”. However, the lower performance for the “Vehicle” and “NO-Mask” classes (mAP = 0.601 and 0.669, respectively) indicates that class imbalance and visual similarity between objects may have impacted detection accuracy for these categories.
4.7. Detection Results
Qualitative results from the detection outputs (Figure 10 and Figure 11) demonstrate the model’s ability to accurately detect and localize multiple objects within complex construction site scenarios. The bounding boxes are well aligned with the ground truth for most classes, and the confidence scores reflect the model’s reliability in identifying safety-critical elements, such as hardhats, masks, and machinery. However, some misclassifications were observed, particularly between “Hardhat” and “NO-Hardhat”, which may be attributed to visual similarity or partial occlusion in the images.
4.8. Overall System Performance
The overall performance of the proposed system indicates its potential for practical deployment in real-world construction site safety monitoring. The high mAP@0.5 of 0.81 and the balanced precision–recall trade-offs across most classes highlight the model’s robustness and reliability. Despite some limitations, such as lower accuracy for the “Vehicle” class, the results demonstrate that the model is capable of effectively detecting and classifying safety-critical objects in complex and dynamic construction environments.
7. Limitations
Despite its promising performance, this study has several limitations. First, the reliance on visual data limits the system’s ability to detect non-visual hazards, such as excessive noise, air quality issues, or temperature extremes. Second, dataset imbalance, particularly for underrepresented classes like “Vehicle” (mAP = 0.601) and “NO-Mask” (mAP = 0.669), affected detection accuracy for these categories. These imbalances stem from the inherent distribution of objects in construction site images, which can lead to bias in model training. Additionally, annotations may introduce subjectivity, particularly in challenging cases where objects overlap or blend into the background.
Our model also faces limitations in computational cost and real-time processing. Combining InceptionV3 and MobileNetV2 with Squeeze-and-Excitation blocks demands significant GPU memory and processing time, which may limit adoption in resource-constrained construction sites. Additionally, while trained on diverse data, the model’s ability to maintain performance in real-time amidst the dynamic, ever-changing conditions of actual construction sites is uncertain.
Lastly, while the system demonstrates performance in controlled experimental settings, its real-world deployment remains untested. Variability in lighting, camera angles, and occlusions on active construction sites could potentially reduce its accuracy, underscoring the need for further evaluation in diverse and uncontrolled environments.