1. Introduction
In the construction industry, safety is of utmost significance. Construction sites, being complex and dynamic environments, are replete with hazardous activities, including the operation of heavy machinery and tasks performed at elevated heights [1]. The nature of construction work exposes workers to many safety hazards, with catastrophic consequences not only for individuals but also for projects, leading to delays (an average of 81 days), cost overruns (3.9% according to the same study), and reputational damage [2]. The construction sector has witnessed a significant increase in the frequency and severity of accidents in recent years [3]. Falls from heights, being struck by falling objects, and structural collapses are common and deadly hazards, often resulting in severe injuries and fatalities that greatly affect workers and their families [4]. It has been reported that, in the USA, the fatality rate from falls in construction has risen by about 20% in the past decade (from 3.0 to 3.6 per 100,000 full-time equivalents) [5]. Moreover, construction companies face substantial financial consequences, including compensation claims, legal costs, higher insurance premiums, and project disruptions [6].
Traditional safety management strategies, such as safety training, safety regulation enforcement, and routine site inspections, have been used for years but have shown limitations in preventing accidents [7]. Manual inspections are time-consuming, labor-intensive, and error-prone [8]. Safety training, although essential, does not always lead to workers consistently following safe practices: Zhou and Li found that, even with extensive safety training, the non-compliance rate with safety procedures on construction sites was as high as 47% [9]. Given these challenges, computer science and artificial intelligence offer viable solutions.
Deep learning, a subfield of machine learning, has emerged as a powerful means for image recognition and classification [10]. By leveraging its capabilities, automated systems for real-time analysis of construction site images and videos can be developed, enabling highly precise identification of potential safety risks. For instance, Liu Jiajing effectively used such models to detect diverse hazardous conditions on construction sites [11].
The construction industry’s safety needs are pressing, yet traditional safety detection methods, like manual inspections, are inefficient and error-prone. Single-model deep-learning approaches face issues such as overfitting with imbalanced datasets and inconsistent performance for different hazards. This study’s deep stacked ensemble approach, combining InceptionV3 and MobileNetV2, addresses these problems. The two models’ complementary features allow for better detection of various safety-related elements. This research is key for improving safety management strategies; promoting intelligent technology use in construction; and creating a safer, more sustainable construction environment by reducing accidents and economic losses.
2. Literature Review
The construction industry has long sought more effective safety detection methods given the high risks of construction activities and the potential for catastrophic accidents [12]. Traditional construction safety detection approaches, though relied upon for years, have several limitations [13,14]. Manual inspections, while highly flexible in adapting to various tasks and environments and capable of detecting complex features such as shapes, textures, and colors, are time-consuming, prone to human error, and lacking in consistency and reliability, especially on large construction sites [15,16]. A study on falls from heights in the construction industry further highlighted the challenges manual inspections face in identifying and preventing such accidents [17]. As technology has progressed, sensor-based systems have been introduced for monitoring construction sites [18]. These systems can furnish real-time data on environmental conditions and worker movements, thereby improving the promptness of hazard detection. For instance, one study explored the use of neural networks as tools in the construction domain, representing an early foray into harnessing technology for construction safety [19]. Nevertheless, sensor-based systems have shortcomings, including restricted coverage areas, the possibility of false alarms, and high costs, as has been pointed out in the relevant literature [19]. In recent years, computer vision techniques, with deep learning as a standout, have emerged as a potent alternative for safety detection in the construction industry [20].
Deep learning models, notably convolutional neural networks (CNNs) and recurrent neural networks (RNNs), have demonstrated remarkable potential in analyzing images and videos to spot safety hazards [21]. For instance, one study built a real-time computer vision system based on the You Only Look Once (YOLO) algorithm for the automatic detection of safety helmets at construction sites [22]. This system achieved an outstanding mean average precision (mAP) of 92.44%, highlighting its efficacy in accurately identifying the presence or absence of safety helmets [22]. Moreover, another study put forward a framework centered on data-driven informatization for construction companies. This framework underlined the crucial role that data analytics plays in enhancing construction safety, suggesting that by leveraging and analyzing relevant data, more informed decisions can be made to prevent accidents and improve overall safety conditions on construction sites [23].
In the domain of construction site safety detection, single deep learning models have well-documented limitations that underscore the significance of our proposed ensemble method. Ref. [24] pointed out in their work on YOLOv3 that, despite being a powerful object-detection model, it struggles in complex and variable construction site environments with small object detection and class imbalance; small safety-related objects such as nuts, bolts, and warning stickers are often missed because the model has difficulty capturing fine details. Moreover, Ref. [25] found in their research on ResNet that deep models such as ResNet can suffer from overfitting when applied to datasets with limited variability, which is common in construction site safety datasets with restricted numbers of labeled images. This overfitting leads to good performance on training data but poor performance on new, unseen data from actual construction sites, reducing the model’s practical utility for safety detection. Our proposed deep stacked ensemble method, combining InceptionV3 (effective at multi-scale feature extraction) and MobileNetV2 (efficient in resource-constrained construction site environments), aims to overcome these limitations and achieve more accurate and consistent safety object detection.
Despite the achievements of individual deep learning models in construction safety detection, they possess certain limitations that cannot be overlooked [21]. One significant drawback is the potential for overfitting, which can occur when these models are trained on datasets that are limited in size or imbalanced, as elaborated by the authors of one study [26]. This overfitting can cause a model to perform exceptionally well on the training data yet struggle to generalize accurately to new, unseen data from real-world construction scenarios. Another limitation lies in the inconsistent performance of different deep learning models [27]. Each model may exhibit varying levels of proficiency in detecting specific types of hazards. For instance, one model might prove highly effective at identifying structural defects within construction sites while faltering when tasked with detecting human behavior-related risks, such as improper use of safety equipment [28].
This disparity in performance across different hazard types and scenarios, as pointed out by one study [29], makes it challenging to rely solely on a single deep learning model for comprehensive and consistent safety detection on construction sites. To overcome these limitations, ensemble learning has emerged as a viable solution and has been proposed and implemented across diverse fields. In the context of construction safety, one research study successfully devised a predictive model for the comprehensive assessment of risks inherent to construction sites [30]. This model utilized ensemble machine learning techniques and exhibited superior results when contrasted with simpler modeling approaches [30]. In other safety-critical areas, such as automotive safety, ensemble learning has proven its worth by being employed to boost the accuracy of collision prediction [31]. This showcases the versatility and effectiveness of ensemble learning in enhancing the performance of safety-related predictive models across industries with high-risk scenarios. Theoretical investigations into ensemble learning, exemplified by the work of [32], have laid a solid groundwork for its utilization in construction safety detection. Ensemble learning encompasses various techniques, such as bagging, boosting, and stacking [33]. These methods can mitigate variance within the models, thereby enhancing their generalization capabilities [34]. This means the models can perform more consistently and accurately when applied to new, unseen data from construction safety scenarios.
When considering models for construction site safety monitoring, various options exist, each with its own advantages and limitations. VGG16, with its large number of parameters [35], often faces high computational demands and a greater risk of overfitting, which can be a significant drawback in construction site scenarios where datasets might be relatively small. ResNet50, while effective in general image classification, may not be the most efficient choice in resource-constrained environments due to its complex architecture [25]. In contrast, InceptionV3 and MobileNetV2 offer distinct benefits. InceptionV3’s factorized convolutions can reduce computational complexity by up to 40% compared to traditional convolutions [36], enabling it to balance performance and cost. Its multi-scale pattern recognition capabilities, achieved through parallel convolutional paths with different filter sizes, are well suited to capturing the diverse features in construction site images. MobileNetV2, designed for resource-constrained environments, uses depth-wise separable convolutions that can reduce the number of parameters by a factor of about 8–9 [37]. This, combined with its inverted residual blocks for enhanced feature capture at low cost, makes it an ideal choice for on-site deployment.
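To make the depth-wise separable convolution savings concrete, the standard parameter-count comparison (from the original MobileNet analysis) is
\[
\frac{\text{DSC parameters}}{\text{standard parameters}} = \frac{D_K^2 M + M N}{D_K^2 M N} = \frac{1}{N} + \frac{1}{D_K^2},
\]
where \(D_K\) is the kernel size, \(M\) the number of input channels, and \(N\) the number of output channels. For the common \(3 \times 3\) kernel (\(D_K = 3\)) and large \(N\), the ratio approaches \(1/9\), which is the source of the roughly 8–9 times reduction cited above.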
Additionally, ensemble learning techniques contribute to increasing the reliability of the models, making them more resilient to changes in the input data and less likely to be overly affected by outliers or noise [38]. Such advantages make ensemble learning a promising approach for improving the effectiveness of construction safety detection systems, as elucidated by [39].
In conclusion, traditional construction safety detection methods have their role, but combining deep learning and ensemble learning holds great promise for improvement [40]. Deep learning is good at analyzing visual data for hazard identification, while ensemble learning can enhance model performance [41]. Our proposed deep stacked ensemble approach aims to build on past efforts to develop more efficient and accurate safety management strategies in the construction industry, ultimately creating a safer working environment.
3. Methodology
3.1. Dataset Description
In our research focused on construction site safety risk detection using deep learning techniques, we utilize a dataset specifically designed for this purpose. The dataset used in our study is the Construction Site Safety Image Dataset. This dataset is of great significance, as it contains a rich collection of images that are highly relevant to the task of identifying various safety-related elements and risks on construction sites. We divided it into three main subsets: the training set, the validation set, and the test set. Each subset consists of two folders: one containing the actual images in .jpg format and the other containing the corresponding labels in .txt format. The labels are provided in the YOLO format, which is widely used in object detection tasks and allows for efficient training and evaluation of models.
The dataset encompasses a total of 10 distinct classes of objects and safety-related elements that are commonly found on construction sites. These classes include “Hardhat”, “Mask”, “NO-Hardhat”, “NO-Mask”, “NO-Safety Vest”, “Person”, “Safety Cone”, “Safety Vest”, “machinery”, and “vehicle”. This comprehensive set of classes enables the model to learn to detect and classify a wide range of safety-critical items and conditions. For example, the ability to accurately identify whether a person is wearing a hardhat or not (“Hardhat” vs. “NO-Hardhat”) is crucial for ensuring worker safety.
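To make the label format concrete, the following is a minimal sketch of how a YOLO-format label file for this dataset could be parsed; the helper name and file handling are illustrative, not part of the dataset's tooling.

```python
# Minimal sketch: parsing a YOLO-format label file (illustrative helper).
# Each line is: <class_id> <x_center> <y_center> <width> <height>,
# with all coordinates normalized to [0, 1] relative to the image size.
CLASS_NAMES = ["Hardhat", "Mask", "NO-Hardhat", "NO-Mask", "NO-Safety Vest",
               "Person", "Safety Cone", "Safety Vest", "machinery", "vehicle"]

def read_yolo_labels(label_path):
    boxes = []
    with open(label_path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 5:
                continue  # skip malformed lines
            class_id = int(parts[0])
            x_c, y_c, w, h = map(float, parts[1:])
            boxes.append({"class": CLASS_NAMES[class_id],
                          "x_center": x_c, "y_center": y_c,
                          "width": w, "height": h})
    return boxes
```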
The Construction Site Safety Image Dataset contains a total of 5780 images. The distribution of these images across the different classes and subsets is shown in Table 1.
Before using the dataset to train our deep learning models, we perform several pre-processing steps. First, all images are resized to a uniform dimension of 640 × 640 pixels. This ensures that the input to the deep CNN models is consistent, which helps optimize the training process and improve performance. Additionally, data augmentation techniques may be applied to increase the diversity of the training data and enhance the model’s generalization ability. These can involve operations such as random rotations, flips, and brightness adjustments, which help the model learn more robust features and better handle the variations found in real-world construction site scenarios.
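As a minimal sketch of this pre-processing, assuming a TensorFlow/Keras pipeline (the augmentation parameters shown are illustrative, not the study's exact settings):

```python
# Sketch of the described resizing and augmentation, assuming TensorFlow/Keras.
import tensorflow as tf

IMG_SIZE = (640, 640)  # uniform input dimension used in this study

def preprocess(image):
    # Resize to the standard dimension and scale pixel values to [0, 1]
    image = tf.image.resize(image, IMG_SIZE)
    return tf.cast(image, tf.float32) / 255.0

def augment(image):
    # Random flips and brightness jitter, as described above
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.2)
    # Random rotation by multiples of 90 degrees as a simple stand-in
    image = tf.image.rot90(image, k=tf.random.uniform([], 0, 4, dtype=tf.int32))
    return image
```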
The metadata.csv and count.csv files provided with the dataset offer valuable information about the dataset itself, including details about the images and the distribution of classes. These metadata are utilized to gain a better understanding of the dataset characteristics and to ensure proper utilization during the training and evaluation processes.
A selection of sample images from the dataset is also displayed, highlighting the diversity of the construction site scenes and the different objects and safety conditions that the model will need to learn to detect and classify. This visual representation gives an intuitive understanding of the type of data the model will be trained on and the complexity of the task at hand.
3.2. Pre-Trained Models for Feature Extraction
In the pursuit of developing a highly efficient and accurate detection method for construction site safety risks, the selection of appropriate pre-trained models for feature extraction is of crucial importance. Here, we have opted for InceptionV3 and MobileNetV2 as the base models, each possessing distinct characteristics that make them well-suited for this task.
InceptionV3: This highly regarded convolutional neural network architecture has established itself as a powerful tool for image classification tasks, renowned for its ability to strike an optimal balance between computational cost and performance [42]. In the context of construction site safety analysis, where real-time processing is essential to promptly identify and address potential hazards, its capacity to deliver rapid results without compromising accuracy is invaluable.
InceptionV3 stands out with several key features. It employs innovative techniques such as factorized convolutions and dimensionality reduction to cut the computational load, enabling efficient processing of construction site visual data (images and videos of various site elements). At the same time, it is adept at capturing intricate patterns via multiple parallel convolutional paths with different filter sizes, allowing multi-scale pattern recognition. This helps identify both broad and detailed site aspects, enhancing data understanding and generalization for frame-level feature extraction related to site safety.
Moreover, its architectural design includes auxiliary classifiers. These play crucial roles in training, improving convergence and combating the vanishing gradient problem. This leads to smoother and more accurate training, further enhancing its performance in extracting meaningful features for accurate detection of construction site safety risks, making it a robust choice.
MobileNet-V2: Specifically engineered to perform optimally in resource-constrained environments, MobileNetV2 is a neural network architecture designed to achieve high performance while minimizing computational requirements [37]. On construction sites, where computing resources may be limited, this characteristic makes it an extremely appealing option.
MobileNetV2’s efficiency hinges on depth-wise separable convolutions (DSCs), which split a standard convolution into two processes to cut parameters and computational cost. It can swiftly analyze construction site images and videos, accurately identifying safety elements, such as hardhats, within limited resources.
It also features inverted residual blocks, which enhance feature capture at low cost, speed up training convergence, and boost accuracy over the original MobileNet. Overall, this model is effective for processing site visual data, offering a good balance between efficiency and accuracy across various tasks.
In conclusion, InceptionV3 and MobileNetV2, with their unique and complementary features, form a strong combination for our approach. Their feature extraction from site visuals enhances the accuracy and reliability of safety risk detection, helping to ensure construction site safety.
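A minimal sketch of how these two backbones can be instantiated as feature extractors in Keras is given below; the input size is illustrative, and Section 3.3 describes the subsequent truncation.

```python
# Sketch: loading the two ImageNet-pretrained backbones as feature extractors,
# with classifier heads removed (include_top=False). Input size is illustrative.
from tensorflow.keras.applications import InceptionV3, MobileNetV2

inception_base = InceptionV3(weights="imagenet", include_top=False,
                             input_shape=(640, 640, 3))
mobilenet_base = MobileNetV2(weights="imagenet", include_top=False,
                             input_shape=(640, 640, 3))
```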
3.3. Models Truncation and Compression
In our research centered on construction site safety risk detection, we modified the InceptionV3 and MobileNetV2 models via model truncation to better align them with our specific needs. The intention was to craft more streamlined versions of these models, shrinking their parameter count without compromising the core functionality and design tenets crucial for precisely analyzing visual data from construction sites.
Even with the streamlining brought about by truncation, these models managed to hold onto their essential design and operational capabilities. Past research has demonstrated that such truncated models can perform capably even when trained on relatively smaller datasets related to construction site safety, and they do not typically encounter severe overfitting problems, a fact that is vital for their dependability in our application.
To achieve the size reduction, we took particular measures. Firstly, we removed the classifier from the top of these deep learning (DL) backbone models. Additionally, we singled out and removed certain blocks within the models. This led to a significant drop in the number of trainable parameters. For example, the original Inception-V3 model had around 25 million parameters before truncation. After applying the truncation process, this number was substantially decreased to approximately 600,000 parameters.
Likewise, the original MobileNet-V2 model, which had 3 million parameters and 19 blocks, was transformed by utilizing only 8 core blocks. This alteration led to a reduction of the parameters to about 180,000. This considerable reduction in both the parameter counts and the number of blocks made the models much more compact while still maintaining their important features relevant to identifying safety risks on construction sites.
The truncation process enabled these models to uphold good performance and efficiency. This makes them highly fitting for use in the resource-constrained settings often found on construction sites or for tasks that necessitate quicker processing times, like real-time safety risk detection. Despite their diminished complexity, these compressed models continue to supply effective solutions for our machine learning tasks associated with construction site safety.
To further augment the capabilities of these truncated models, we added several layers on top. These include the SeparableConv2D layer, which executes efficient depth-wise separable convolutions. This process is essential, as it extracts spatial features from construction site images and videos while simultaneously reducing the computational complexity. Next, the AveragePooling2D layer downsamples the feature maps, helping retain significant features while reducing the dimensionality of the data. Finally, we applied Alpha Dropout, which during training randomly deactivates some neurons while maintaining the network’s mean and variance for stability. These added layers collaborate to enhance the model’s ability to generalize and ultimately improve its overall performance in accurately detecting construction site safety risks.
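The sketch below illustrates this truncation and the added head, under the assumption that blocks are removed by taking an intermediate layer's output; the cut points and layer sizes shown are illustrative, not the authors' exact choices.

```python
# Sketch of backbone truncation plus the added head described above.
from tensorflow.keras import layers, Model

def truncate(backbone, cut_layer_name):
    # Keep only the sub-network up to (and including) a chosen intermediate
    # layer, discarding later blocks and the classifier (cut point illustrative).
    return Model(inputs=backbone.input,
                 outputs=backbone.get_layer(cut_layer_name).output)

def add_head(truncated):
    # SeparableConv2D for efficient spatial feature extraction,
    # AveragePooling2D for downsampling, AlphaDropout for regularization.
    x = layers.SeparableConv2D(128, 3, padding="same", activation="relu")(truncated.output)
    x = layers.AveragePooling2D(pool_size=2)(x)
    x = layers.AlphaDropout(0.3)(x)
    return Model(truncated.input, x)
```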
3.4. Squeeze-and-Excitation Block
The Squeeze-and-Excitation (SE) block plays a vital role as an architectural element within convolutional neural networks (CNNs), significantly augmenting their representational capabilities [36]. At its core, the SE block enhances network performance by dynamically modifying the weights of each feature map. This is accomplished through two fundamental operations, squeeze and excitation, visually depicted in Figure 1.
The squeeze operation functions by employing global average pooling. This technique condenses each feature map into a single value for every channel, thereby capturing the global spatial information spanning the entire image. Subsequently, the excitation operation comes into play. It utilizes a compact neural network to discern dependencies among the channels. Through this process, it generates a collection of modulation weights that are then utilized to scale the original feature maps. This scaling mechanism accentuates the crucial features while diminishing the impact of less significant ones.
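A minimal implementation sketch of these two operations follows; the reduction ratio of 16 is the value commonly used in the SE literature and is an assumption here.

```python
# Minimal Squeeze-and-Excitation block: squeeze (global average pooling)
# followed by excitation (two-layer bottleneck producing channel weights).
from tensorflow.keras import layers

def se_block(feature_map, reduction=16):
    channels = feature_map.shape[-1]
    # Squeeze: one scalar per channel summarizing global spatial information
    s = layers.GlobalAveragePooling2D()(feature_map)
    # Excitation: bottleneck MLP learning inter-channel dependencies
    s = layers.Dense(channels // reduction, activation="relu")(s)
    s = layers.Dense(channels, activation="sigmoid")(s)
    s = layers.Reshape((1, 1, channels))(s)
    # Recalibrate: scale each original feature map by its learned weight
    return layers.Multiply()([feature_map, s])
```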
One of the standout advantages of SE blocks is their capacity to enhance model performance with only a marginal increase in computational cost. By explicitly accounting for the interdependencies between channels, SE blocks empower the network to concentrate on the more informative features and suppress the irrelevant ones. This selective attention mechanism results in enhanced accuracy when it comes to image classification tasks. Compared to other attention mechanisms, SE blocks are favored due to their simplicity and effectiveness. They can be seamlessly integrated into prevalent architectures, like ResNet, Inception, and MobileNet, with just minor alterations. The adaptive recalibration furnished by SE blocks bolsters the network’s ability to distinguish between diverse features. This leads to increased robustness and better feature discrimination. The enhanced robustness contributes to attaining State-of-the-Art performance on benchmarks such as ImageNet, thereby showcasing the efficacy of SE blocks across a broad spectrum of tasks.
In conclusion, the integration of SE blocks into CNN architectures not only elevates performance but also helps in maintaining efficiency. This makes them a highly sought-after option for enhancing the capabilities of deep learning models.
3.5. Proposed Feature Fusion Architecture
In our research on construction site safety risk detection, we introduced a feature fusion (FF) approach that incorporates the modified Inception-V3 and MobileNet-V2 models. These models were carefully adjusted through a process of model truncation to better suit our specific requirements for analyzing construction site visual data.
As detailed in Section 3.3, the truncation process reduced the Inception-V3 model from around 25 million parameters to approximately 600,000 (by removing the top classifier and certain internal blocks) and the MobileNet-V2 model from 3 million parameters and 19 blocks to about 180,000 parameters over 8 core blocks. Despite this streamlining, the truncated models retained their essential design and operational capabilities, and past research has shown that such models perform well even when trained on relatively small construction site safety datasets, without typically encountering severe overfitting. This dependability is vital for our application in construction site safety risk detection.
Our FF approach takes advantage of the unique strengths of these truncated models, leveraging their complementary capabilities to represent features more effectively. By combining the feature maps from these two architectures, we integrate a diverse range of features. This fusion process generates a robust and comprehensive feature set, significantly enhancing the model’s ability to capture detailed and intricate patterns within the construction site visual data that it processes. Consequently, the combined power of both architectures contributes to superior performance in recognizing and interpreting complex structures related to safety risks, leading to more accurate and reliable outcomes in construction site safety applications.
Following the feature fusion, we incorporate a Squeeze-and-Excitation (SE) block to refine the fused features and enhance the overall performance of the network. The SE block significantly boosts the representational power of the model by recalibrating the feature channels in a dynamic manner.
The recalibration process begins with the squeeze operation. Here, we apply global average pooling across each feature map. In the context of construction site images, this effectively reduces the spatial dimensions of the feature map and summarizes the information into a single scalar value for each channel. This operation captures the global context of the construction site visual data, allowing the model to consider broader spatial information when assessing the significance of each feature. By compressing the spatial information, the squeeze operation provides a compact and efficient representation that serves as the foundation for the subsequent stage.
The excitation operation follows the squeeze step. A small neural network, specifically a two-layer fully connected network, is employed to learn inter-channel dependencies and relationships. For construction site data, it first reduces the dimensionality of the squeezed features to capture critical dependencies related to safety risks and then expands them back to the original number of channels. This network effectively generates modulation weights for each channel, allowing the SE block to selectively emphasize the most relevant features related to safety risks and suppress those that contribute less to the task. The resulting channel-wise attention mechanism fine-tunes the feature maps, enabling the network to prioritize critical information about safety risks and discard noise, thus improving its capacity to recognize important patterns within the construction site visual data.
Once the SE block refines the features through this recalibration process, the model proceeds with a series of layers designed to produce the final output. A flatten layer is employed to transform the multi-dimensional feature maps into a one-dimensional vector, preparing the data for subsequent fully connected layers. To mitigate the risk of overfitting during training, a dropout layer is introduced. In the context of our construction site safety model, this layer randomly deactivates neurons while maintaining the valuable feature representations related to safety risks. This dropout mechanism ensures that the model does not become overly dependent on specific features related to safety risks, advancing generalization to unseen construction site visual data.
The final stage of the architecture is a dense classification layer. This layer contains as many neurons as there are target classes relevant to construction site safety, such as the presence or absence of different types of safety equipment or hazardous situations. It acts as the decision-making component of the network, processing the refined, flattened feature vector to generate the final predictions about construction site safety risks.
The entire process, from feature fusion to SE block refinement and finally to the classification stage, is meticulously designed to ensure efficiency, accuracy, and resilience to variations in the input construction site’s visual data. Each phase is structured to build upon the previous, ensuring comprehensive robustness and optimized performance throughout the model.
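An end-to-end sketch of this fusion head is given below; it assumes the truncated backbones and the se_block() helper from the earlier sketches, and the dropout rate and layer arrangement are illustrative.

```python
# Sketch of the proposed fusion architecture: concatenated backbone features,
# SE recalibration, then flatten -> dropout -> dense classification.
from tensorflow.keras import layers, Model

def build_fusion_model(branch_a, branch_b, num_classes=10):
    # Concatenate the two feature maps along the channel axis
    # (spatial dimensions of the two branches must match).
    fused = layers.Concatenate(axis=-1)([branch_a.output, branch_b.output])
    refined = se_block(fused)                     # channel-wise recalibration
    x = layers.Flatten()(refined)                 # to a 1-D feature vector
    x = layers.Dropout(0.5)(x)                    # overfitting mitigation
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return Model(inputs=[branch_a.input, branch_b.input], outputs=outputs)
```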
Our architecture systematically incorporates key components, such as feature fusion, SE block recalibration, flattening, dropout regularization, and dense classification, creating a robust framework tailored for construction site safety risk detection through image classification. This approach balances complexity with performance, offering a solution that can handle the diverse and complex visual data of construction sites. As depicted in Figure 2, the overall methodology is illustrated graphically, clarifying the interplay among the various components and their cumulative contribution to the model’s functionality.
3.6. Data Splitting and Preparation
In order to conduct a fair performance comparison between our proposed model, YOLOv4, and SSD, we implemented a rigorous and consistent data preparation protocol. The entire dataset of construction site safety images was partitioned into three subsets: training, validation, and test sets, in a 70:15:15 ratio. This partitioning was carried out in a manner that preserved the class distribution of the dataset as much as possible. All three models (our proposed approach, YOLOv4, and SSD) were trained, validated, and tested on these identical data splits to ensure that no model had an unfair advantage due to differences in the data subsets used.
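A minimal sketch of this stratified 70:15:15 split, assuming one dominant label per image and using scikit-learn's stratify option (helper name and seed are illustrative):

```python
# Sketch: 70:15:15 split with class distribution preserved via stratification.
from sklearn.model_selection import train_test_split

def split_dataset(paths, labels, seed=42):
    # First carve off 30% of the data, then split that half-and-half
    # into validation and test, stratifying at each step.
    train_p, rest_p, train_l, rest_l = train_test_split(
        paths, labels, test_size=0.30, stratify=labels, random_state=seed)
    val_p, test_p, val_l, test_l = train_test_split(
        rest_p, rest_l, test_size=0.50, stratify=rest_l, random_state=seed)
    return (train_p, train_l), (val_p, val_l), (test_p, test_l)
```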
3.7. Training Parameters and Optimizers
The training process for all models was standardized to a specific number of epochs. We set the number of training epochs to 100 for each of the models under consideration. To optimize the models’ performance during training, we utilized the Adam optimizer across all three architectures. The hyperparameters of the Adam optimizer were carefully configured to maintain consistency. The learning rate was set to 0.001, and the momentum parameter was set to 0.9. These values were chosen after conducting preliminary experiments to ensure that they provided a good balance between the speed of convergence and the quality of the final model. Additionally, the batch size during training was fixed at 32 for all models. This batch size was selected based on the available computational resources and the size of the dataset to ensure efficient training without overloading the memory.
The input image size for all models was adjusted to 416 × 416 pixels. This standardization was necessary to make the images compatible with the architectures of YOLOv4, SSD, and our proposed model, which were designed to process images of this specific dimension.
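Expressed as a Keras configuration sketch, the comparison-training setup reads as follows; note that interpreting the stated "momentum" of 0.9 as Adam's beta_1 is an assumption on our part.

```python
# Sketch of the comparison-training configuration described above.
from tensorflow.keras.optimizers import Adam

EPOCHS = 100
BATCH_SIZE = 32
INPUT_SIZE = (416, 416)  # common input dimension for all three models

# "Momentum 0.9" interpreted here as Adam's beta_1 (assumption).
optimizer = Adam(learning_rate=0.001, beta_1=0.9)
```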
3.8. Performance Metrics and Statistical Analysis
To evaluate the performance of our model, YOLOv4, and SSD, we used several key metrics, including mean average precision (mAP), precision, and recall, as shown in Table 2. In addition to reporting the mean values of these metrics, we also calculated the standard deviation and 95% confidence intervals to provide a more comprehensive picture of performance variability.
To determine whether the performance differences between our model and the other two models (YOLOv4 and SSD) were statistically significant, we conducted independent two-sample t-tests for each metric. For the mAP metric, the p-value for the comparison between our model and YOLOv4 was 0.002, which is below the commonly accepted significance level of 0.05. This indicates that the difference in mAP between our model and YOLOv4 is statistically significant. Similarly, for the comparison between our model and SSD, the p-value for the mAP metric was less than 0.001, further confirming the statistical significance of the performance difference.
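For reference, such a test can be run with SciPy as sketched below; the arrays are placeholders, not the study's raw measurements.

```python
# Sketch: independent two-sample t-test on per-run mAP scores.
from scipy import stats

ours_map = [0.85, 0.83, 0.88, 0.84, 0.86]   # placeholder per-run mAP values
yolo_map = [0.78, 0.76, 0.80, 0.77, 0.79]   # placeholder per-run mAP values

t_stat, p_value = stats.ttest_ind(ours_map, yolo_map)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # p < 0.05 => significant difference
```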
3.9. Repeatability Analysis
To assess the repeatability and reliability of our experimental results, we repeated the entire training and evaluation process for each model five times. Each repetition involved randomly initializing the model’s weights and using the same data splits for training, validation, and testing. The average performance metrics and their corresponding standard deviations across these five repetitions are presented in the table above.
The results of the repeatability analysis showed a consistent performance trend for all models. The average mAP values across the five repetitions for our model, YOLOv4, and SSD were 0.85, 0.78, and 0.70, respectively, with standard deviations of 0.03, 0.04, and 0.05. These results demonstrate the stability of our experimental setup and the reliability of our findings, as the performance of each model remained relatively consistent across multiple runs.
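The per-run summary statistics reported above can be computed as in the brief sketch below (values are placeholders):

```python
# Sketch: mean and sample standard deviation over five independent runs.
import numpy as np

runs_map = np.array([0.85, 0.82, 0.87, 0.84, 0.86])  # placeholder per-run mAP
print(f"mAP: {runs_map.mean():.2f} +/- {runs_map.std(ddof=1):.2f}")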
4. Experimental Results and Analysis
This section presents the implementation details and evaluation findings of the proposed feature fusion architecture developed for the classification of construction site safety risks. We detail the steps required to construct the fusion model, as well as the methodologies adopted for its training and testing. The efficacy of our fusion model is gauged by contrasting its performance with that of existing models. Through this comparison, we aim to spotlight its superiority in terms of accuracy and stability when classifying the diverse safety-related classes. Ultimately, this demonstrates the distinct advantages of our approach in augmenting classification accuracy and bolstering the overall dependability of the model.
4.1. Implementation Details
In this section, we detail the implementation aspects employed for training the proposed model. Our research makes use of the Keras framework in conjunction with Python to conduct experiments on the feature fusion model. All experiments were executed within a Python-based environment. To optimize computational efficiency, we fully exploited the GPU runtime. Specifically, an Nvidia Tesla K80 GPU, which is equipped with 16 GB of RAM and 512 GB of storage, was utilized for our experiments.
4.2. Hyperparameters Details
To enhance the reliability of our proposed method, meticulous attention was given to selecting optimal hyperparameters for training. The model was trained with a carefully chosen batch size of 64, over the course of 20 epochs, to allow sufficient learning and improvement of the model’s performance.
The learning rate was set to 1.0 × 10⁻³, a value determined through experimentation and analysis. This specific setting was crucial, as it enabled us to make the most efficient use of computational resources while ensuring effective convergence of the model during training.
A significant contributor to the improved performance of our model is the Adam optimizer. Renowned in the field, the Adam optimizer stands out for its ability to deliver higher accuracy and manage memory usage efficiently. Its effectiveness stems from its capacity to optimize the behavior of sparse gradients. Moreover, it has the advantage of dynamically adjusting the learning rate throughout the training process, which facilitates better and faster convergence.
Furthermore, our approach places great reliance on the categorical cross-entropy loss function. This function plays an indispensable role in evaluating how precisely the model predicts categorical labels. During the training phase, it serves as a guiding mechanism for the model, directing it to learn accurately. By minimizing the loss calculated through this function, the model is able to generalize well to new, unseen data, an ability that is vital for its overall performance and usability in real-world scenarios.
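Putting these pieces together, a compile-step sketch follows; it assumes the build_fusion_model() and branch objects from the Section 3.3 and 3.5 sketches.

```python
# Sketch: compiling the fusion model with Adam at 1e-3 and
# categorical cross-entropy, as described in this section.
from tensorflow.keras.optimizers import Adam

model = build_fusion_model(branch_a, branch_b, num_classes=10)  # from earlier sketches
model.compile(optimizer=Adam(learning_rate=1e-3),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=20, batch_size=64)
```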
This setup facilitated effective processing and proficient management of large datasets. Given that dealing with substantial amounts of data is crucial for the proper training and accurate evaluation of our model, this GPU configuration played a vital role in ensuring the overall success of our research efforts.
4.3. Performance Evaluation
The evaluation of the performance of learning models constitutes a critical aspect in determining their efficacy. In our study, we rely on several key metrics, namely accuracy, precision, recall, and the F1-score, to comprehensively assess the performance of our proposed model.
These performance metrics are derived from the confusion matrix, which provides valuable insights into how well the classifier performs on test data. Each metric is computed from the counts of true positives (TPs), true negatives (TNs), false positives (FPs), and false negatives (FNs), and takes a value in the range from 0 to 1.
Accuracy, for instance, acts as a measure of the model’s capacity to correctly predict both positive and negative classes. It provides an overall indication of how often the model makes correct predictions across all samples in the dataset.
Precision, on the other hand, is calculated as the ratio of true positives to the sum of true positives and false positives. This metric gives us an understanding of the model’s ability to accurately identify positive instances without misclassifying negative ones as positive.
Recall, also known as sensitivity or the true positive rate, measures the proportion of true positives in relation to all actual positives, taking into account the false negatives as well. It reflects the model’s ability to capture all the relevant positive instances.
The F1-score, which is the harmonic mean of precision and recall, offers a balanced perspective by combining these two important aspects. With a range between 0 and 1, it provides a single value that encapsulates the trade-off between precision and recall, thereby enabling us to have a more holistic view of the model’s performance.
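For reference, these metrics are computed from the confusion matrix counts as
\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{Precision} = \frac{TP}{TP + FP},
\]
\[
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.
\]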
By employing these mathematical calculations based on the aforementioned metrics, we are able to precisely measure and analyze the performance of our model, facilitating a detailed understanding of its strengths and areas that may require further improvement.
The performance of the proposed intelligent monitoring and analysis system for construction site safety was evaluated using several metrics, including accuracy, precision, recall, F1-score, and mean average precision (mAP). The following subsections provide a detailed analysis of the model’s performance, training process, and detection capabilities.
4.4. Model Performance Analysis
The confusion matrix (Figure 3) illustrates the classification performance of the proposed model for the ten classes in the dataset. The model demonstrates high accuracy for classes such as “Safety Cone” (91%) and “Machinery” (93%), indicating its strong ability to detect these safety-critical elements. However, the accuracy for the “Vehicle” class is relatively lower, at 57%, reflecting potential challenges in distinguishing vehicles from similar objects or backgrounds.
The class-wise performance is summarized in the precision–recall curve (Figure 4). The model achieved an overall mean average precision (mAP) of 0.81 at an IoU threshold of 0.5. Among the individual classes, “Machinery” exhibited the highest mAP of 0.936, followed by “Mask” (0.918) and “Safety Vest” (0.907). In contrast, the “Vehicle” class achieved the lowest mAP of 0.601, suggesting areas for improvement, particularly in addressing inter-class confusion and improving model sensitivity to smaller or occluded vehicles.
The F1–confidence curve (Figure 5) highlights the reliability of the model, achieving an average F1-score of 0.81 at a confidence threshold of 0.488. This indicates that the model balances precision and recall effectively across various confidence levels. The precision–confidence curve (Figure 6) and recall–confidence curve (Figure 7) further corroborate the model’s ability to maintain high precision and recall, particularly for the “Safety Cone”, “Safety Vest”, and “Mask” classes.
4.5. Training and Validation Metrics
The training and validation loss curves (Figure 8) reveal a smooth convergence of the model during training, indicating the effectiveness of the selected hyperparameters and architecture. The training box loss decreased steadily from 1.3 to 0.8 over 100 epochs, while the validation box loss followed a similar trend, stabilizing at approximately 1.4. These results suggest that the model generalized well to unseen data without significant overfitting.
The training and validation classification loss curves further emphasize the improvement in the model’s classification capabilities, with both metrics converging consistently over the epochs. The validation precision and recall metrics indicate a steady increase, achieving final precision and recall values of 0.95 and 0.86, respectively, demonstrating the model’s strong detection and classification abilities.
4.6. Class-Wise Performance
The class label distribution (Figure 9) highlights the inherent imbalance in the dataset, with certain classes, such as “Person” and “Safety Vest”, having significantly more instances than others, like “Vehicle” or “NO-Safety Vest”. Despite this imbalance, the proposed model performed robustly for most classes, as evidenced by the high mAP values for “Mask”, “Machinery”, and “Safety Cone”. However, the lower performance for the “Vehicle” and “NO-Mask” classes (mAP = 0.601 and 0.669, respectively) indicates that class imbalance and visual similarity between objects may have impacted detection accuracy for these categories.
4.7. Detection Results
Qualitative results from the detection outputs (Figure 10 and Figure 11) demonstrate the model’s ability to accurately detect and localize multiple objects within complex construction site scenarios. The bounding boxes are well aligned with the ground truth for most classes, and the confidence scores reflect the model’s reliability in identifying safety-critical elements, such as hardhats, masks, and machinery. However, some misclassifications were observed, particularly between “Hardhat” and “NO-Hardhat”, which may be attributed to visual similarity or partial occlusion in the images.
4.8. Overall System Performance
The overall performance of the proposed system indicates its potential for practical deployment in real-world construction site safety monitoring. The high mAP@0.5 of 0.81 and the balanced precision–recall trade-offs across most classes highlight the model’s robustness and reliability. Despite some limitations, such as lower accuracy for the “Vehicle” class, the results demonstrate that the model is capable of effectively detecting and classifying safety-critical objects in complex and dynamic construction environments.
7. Limitations
Despite its promising performance, this study has several limitations. First, the reliance on visual data limits the system’s ability to detect non-visual hazards, such as excessive noise, air quality issues, or temperature extremes. Second, dataset imbalance, particularly for underrepresented classes like “Vehicle” (mAP = 0.601) and “NO-Mask” (mAP = 0.669), affected detection accuracy for these categories. These imbalances stem from the inherent distribution of objects in construction site images, which can lead to bias in model training. Additionally, annotations may introduce subjectivity, particularly in challenging cases where objects overlap or blend into the background.
Our model also faces limitations in computational cost and real-time processing. Combining InceptionV3 and MobileNetV2 with Squeeze-and-Excitation blocks demands significant GPU memory and processing time, which may limit adoption in resource-constrained construction sites. Additionally, while trained on diverse data, the model’s ability to maintain performance in real-time amidst the dynamic, ever-changing conditions of actual construction sites is uncertain.
Lastly, while the system demonstrates performance in controlled experimental settings, its real-world deployment remains untested. Variability in lighting, camera angles, and occlusions on active construction sites could potentially reduce its accuracy, underscoring the need for further evaluation in diverse and uncontrolled environments.