Article

SDCB-YOLO: A High-Precision Model for Detecting Safety Helmets and Reflective Clothing in Complex Environments

Xiang Yang, Jizhen Wang and Minggang Dong
1 College of Computer Science and Engineering, Guilin University of Technology, Guilin 541006, China
2 Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin University of Technology, Guilin 541004, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(16), 7267; https://doi.org/10.3390/app14167267
Submission received: 15 July 2024 / Revised: 9 August 2024 / Accepted: 14 August 2024 / Published: 19 August 2024

Abstract

The correct wearing of safety helmets and reflective vests is of great significance on construction sites, in offices, and at civil engineering sites. To address the low detection accuracy and high algorithm complexity that existing algorithms exhibit when detecting small targets such as safety helmets and reflective clothing in complex background environments, an improved algorithm based on YOLOv8n is proposed. Firstly, the SE module is utilized to reduce interference from complex environments. Next, the CIOU loss function is replaced with DIOU to speed up calculations. Then, a lightweight universal upsampling operator (CARAFE) is employed to obtain a larger receptive field. Finally, the Bidirectional Feature Pyramid Network (BiFPN) is used to replace the Concat module of the original head layer. Based on these four modifications, this article names the new model SDCB-YOLO, derived from the initial letters of the four respective modules. The experimental results show that the mAP of the SDCB-YOLO model on the test set reached 97.1%, which is 4.6% higher than YOLOv5s and 3.5% higher than YOLOv8n. Additionally, the model has a parameter count of 3,094,304, a computational load of 8.4 GFLOPs, and a model size of 6.13 MB. Compared to YOLOv5s, with a parameter count of 7,030,417, a computational cost of 16.0 GFLOPs, and a model size of 13.79 MB, the SDCB-YOLO model is significantly smaller. Compared to YOLOv8n, with a parameter count of 3,011,628, a computational complexity of 8.2 GFLOPs, and a model size of 6.11 MB, the SDCB-YOLO model's parameters and model size are only slightly increased, while its computational load remains comparable. Therefore, the improved detection algorithm presented in this article not only ensures the lightweight nature of the model but also significantly enhances its detection accuracy.

1. Introduction

In modern industrial production, there are numerous potential risks, and the life safety of construction workers is often threatened. According to analyses of construction site accidents in recent years, fatal accidents caused solely by objects falling from height account for 24% of all accidents [1]. Additionally, 90% of accidents are attributed to not wearing protective equipment or wearing it incorrectly [2]. This underscores the particular importance of detecting safety helmets and reflective clothing.
To date, there have been roughly two types of detection methods: the first relies entirely on manual inspection, and the second is based on artificial intelligence technology [3]. Manual supervision involves using surveillance footage to monitor the on-site situation in real time. While no technical difficulties are involved, the scene can be chaotic, with numerous surveillance screens, leading to oversights and missed observations during manual supervision. Furthermore, it requires a significant amount of manual labor for shift rotations. With the improvement in GPU computing power and the advancement of artificial intelligence over the years, object detection algorithms have also made significant progress. Using object detection algorithms to detect safety helmets and reflective clothing can not only improve detection accuracy but also reduce labor and time costs, while enabling intelligent monitoring and control. Mainstream object detection algorithms can be divided into two major categories: the first category is two-stage detection algorithms, such as R-CNN [4], Fast R-CNN [5], and Faster R-CNN [6]; the second category is single-stage detection algorithms, including SSD [7] and YOLO [8,9,10,11].
Currently, numerous scholars have undertaken research on the detection of safety helmets and reflective clothing. Jin employed the k-means++ algorithm on the fundamental YOLOv5 model to enhance the size matching accuracy of anchor boxes and incorporated the DWCA attention mechanism into the backbone network. Jin observed that previous studies predominantly relied on prior human knowledge to determine anchor box sizes, which often introduced inaccuracies and overlooked nuances. Consequently, leveraging an algorithmic approach to supersede such prior knowledge yielded superior matching precision. The degree of alignment between anchor boxes and ground truth boxes crucially impacts detection accuracy. In designing the DWCA attention mechanism, Jin incorporated positional information, enabling the network to engage with a broader spatial context. On the custom dataset curated by Jin’s team, the model achieved a remarkable mAP accuracy of 95.9% [12]. However, Jin’s study is not devoid of limitations. Firstly, it neglects considerations regarding model size and computational cost, which are crucial factors for practical deployments. Secondly, the optimization of the backbone network appears relatively straightforward, potentially limiting the model’s generalizability across diverse domains. These aspects represent opportunities for future research to further refine and extend Jin’s findings.
Zhang endeavors to achieve breakthroughs in addressing small object detection challenges. Firstly, he introduces an additional layer of structure to bolster the detection capabilities for small targets. Secondly, he redesigns the pre-selection box sizes utilizing the K-means algorithm (specifically, Zhang employed the fundamental K-means algorithm). Lastly, he modifies the loss function, transitioning from CIOU to EIOU. Given that existing public datasets did not meet Zhang’s specific requirements, he created a custom dataset, categorizing it into six distinct groups based on the wearing status of safety helmets. This approach, which is relatively uncommon in helmet detection, underscores Zhang’s algorithm’s enhanced capabilities in detecting small objects and refines the categorization of helmet detection [13]. However, it is worth noting that these six categories, while meticulous, might introduce complexity in real-world applications, potentially resulting in some detection outcomes with less distinct boundaries. Furthermore, the algorithm still exhibits lower accuracy in distinguishing between categories with subtle differences.
Wang’s algorithm presents a substantial improvement over YOLOv3 by introducing several key enhancements. Firstly, the integration of a larger input layer significantly enlarges the perceptual field of the feature map. Secondly, the adjustment of candidate box sizes is a strategic move aimed at optimizing the detection of small-sized objects. Remarkably, Wang’s algorithm maintains a satisfactory frames per second (FPS) while achieving a superior mean average precision (mAP) at the same resolution compared to YOLOv3. This dual performance ensures real-time monitoring capabilities without compromising on accuracy. However, it is acknowledged that the enhanced algorithm’s increased model size poses challenges for practical deployment, particularly in resource-constrained environments. Additionally, the reliance on the relatively outdated YOLOv3 framework as a starting point limits the algorithm’s potential for further advancements [14].
Zhang’s algorithm builds upon enhancements to YOLOv5. Initially, it employs Ghost modules to replace certain convolutional and C3 modules. Subsequently, it incorporates a CA attention mechanism into the backbone network, and ultimately replaces the C3 module in the neck layer with C3CBAM. On a custom-built dataset, the algorithm achieved a mAP accuracy of 93.6%, reducing computation by 41.7% and the model size by 37.3%. Optimizations have focused on lightweighting. Although the algorithm is primarily aimed at lightweighting, there has been no concomitant improvement in accuracy. Thus, future enhancements and optimizations are required to boost accuracy [15].
Although these methods have increased the accuracy of object detection algorithms to a certain extent, their mAP@0.5 has not reached 95%, and their mAP@0.5:0.95 fluctuates around 60%. This paper argues that this cannot meet our requirements for model accuracy. Furthermore, most researchers have focused solely on optimizing the detection of safety helmets, neglecting the detection of reflective clothing, thereby limiting the application scenarios. In response to these issues, this paper proposes the SDCB-YOLO model, which achieves higher accuracy and more comprehensive detection capabilities. Firstly, this paper introduces the SE attention mechanism to optimize the allocation of computational resources, aiming to achieve higher accuracy with the same computational power consumption. Secondly, the loss function is switched from CIOU to DIOU to reduce computation. Additionally, a more lightweight upsampling operator is used, addressing the issue of simplistic sampling steps in traditional algorithms while maintaining lightweight characteristics. Finally, a BiFPN is adopted to replace the original PAN, offering more targeted feature fusion steps.
Sections 2.1 and 2.2 introduce the dataset and the experimental environment. Sections 2.3, 2.4, 2.5, 2.6, 2.7 and 2.8 detail the modification process and objectives of the model. Section 3 introduces the evaluation metrics and compares the models using these metrics. Section 4 summarizes the contributions of this paper and the directions for future improvement.

2. Materials and Methods

2.1. Data Acquisition

The dataset utilized in this study is a publicly accessible one sourced from Roboflow [16], featuring images captured from diverse real-world construction site scenarios. These images span from sparsely populated scenes to highly crowded ones and encompass various colors of safety helmets and reflective clothing. The initial dataset comprised 4311 images. To bolster the data and emulate potential challenges encountered with camera equipment in different construction sites, data augmentation techniques were employed. These encompassed random horizontal rotations of images, as well as adjustments to saturation, exposure, and brightness. Post-augmentation, the total image count swelled to 10,449.
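As an illustration only, the following is a minimal sketch of comparable augmentations using torchvision, interpreting the horizontal rotations as horizontal flips; the flip probability and jitter ranges are assumptions, not the values used to build the dataset.

```python
import torchvision.transforms as T

# Minimal sketch of augmentations comparable to those described above.
# The flip probability and jitter ranges are illustrative assumptions only.
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),        # random horizontal mirroring
    T.ColorJitter(brightness=0.4,         # brightness adjustment
                  contrast=0.4,           # rough stand-in for "exposure"
                  saturation=0.4),        # saturation adjustment
    T.ToTensor(),
])

# Usage: augmented = augment(pil_image)   # pil_image is a PIL.Image of a site photo
```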
The dataset is categorized into four distinct classes: “no helmet”, “no jacket”, “safe”, and “unsafe”. The specific classification criteria are defined as follows: images displaying both a safety helmet and reflective clothing are designated as “safe”; those with reflective clothing but no safety helmet are labeled “no helmet”; images featuring a safety helmet but lacking reflective clothing are categorized as “no jacket”; and those devoid of both are marked as “unsafe”. A visual reference for this classification system is presented in Figure 1.
The dataset was partitioned into three sets: 9207 images for training, 464 for testing, and 778 for validation. While this dataset comprehensively covers most photographic types, it falls short in capturing images from nighttime environments, a limitation that will be addressed in future endeavors.

2.2. Experimental Environment

In this paper, the performance of SDCB-YOLO is investigated using the experimental environment shown in Table 1.

2.3. YOLOv8 Model

The YOLOv8 model is an improvement based on YOLOv5. Released by Ultralytics in 2023, the YOLOv8 model boasts five variants, YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x, each variant’s number of parameters being directly correlated with its training duration. The intricate model architecture of YOLOv8n, illustrated in Figure 2, can be neatly summarized into four components: Input, Backbone, Neck, and Predict.
In comparison to YOLOv5, YOLOv8 incorporates the innovative C2F (Cross-Stage Partial Network Bottleneck with 2 Convolutions) within its Backbone layer, which harmoniously integrates ELAN and C3 [17]. This refinement contributes to YOLOv8’s enhanced accuracy and quicker performance compared to its predecessor. Additionally, YOLOv8 employs SPPF (Spatial Pyramid Pooling-Fast) as its spatial pyramid pooling module. SPPF efficiently performs pooling operations across feature maps of varying scales without altering their dimensions, thus augmenting overall accuracy [18].
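For reference, a simplified PyTorch sketch of a C2f-style block is given below: the input is split after a 1 × 1 convolution, one branch passes through a chain of bottlenecks whose intermediate outputs are all retained, and everything is concatenated and fused by a final 1 × 1 convolution. The channel widths, bottleneck design, and activation choices shown are assumptions and do not reproduce the official Ultralytics implementation.

```python
import torch
import torch.nn as nn

class ConvBNSiLU(nn.Module):
    """Convolution + BatchNorm + SiLU, the basic building block used below."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """Two 3x3 convolutions with an optional residual connection."""
    def __init__(self, c, shortcut=True):
        super().__init__()
        self.cv1 = ConvBNSiLU(c, c, 3)
        self.cv2 = ConvBNSiLU(c, c, 3)
        self.add = shortcut

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y

class C2f(nn.Module):
    """Simplified C2f: split, run bottlenecks, keep every intermediate output,
    concatenate, then fuse with a 1x1 convolution."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        self.c = c_out // 2
        self.cv1 = ConvBNSiLU(c_in, 2 * self.c, 1)
        self.m = nn.ModuleList(Bottleneck(self.c) for _ in range(n))
        self.cv2 = ConvBNSiLU((2 + n) * self.c, c_out, 1)

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))     # split into two halves
        for m in self.m:
            y.append(m(y[-1]))                    # keep each bottleneck output
        return self.cv2(torch.cat(y, dim=1))      # concatenate and fuse
```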
Exhibiting remarkable versatility, YOLOv8 can seamlessly operate on diverse hardware platforms, including CPUs and GPUs, and offers a smooth transition between its multiple versions, demonstrating unparalleled flexibility.

2.4. Squeeze-and-Excitation Networks

In actual safety helmet and reflective vest detection scenarios, the significance of distinct regions within a single image frequently diverges. The failure to intelligently allocate computational resources to these pivotal areas can markedly compromise the accuracy and reliability of the final recognition outcomes. Unfortunately, the conventional YOLOv8 model lacks an inherent resource allocation optimization mechanism, significantly hindering its actual detection capabilities.
To mitigate this issue, this article innovatively incorporates attention mechanisms as a solution. By mimicking the human visual system’s selective attention prowess, these mechanisms aim to autonomously discern and concentrate on the crucial information zones within images. Through a meticulous series of experiments, we thoroughly evaluated and verified a multitude of attention mechanisms, striving to identify the optimal fit for this specific application context.
Upon rigorous comparison and thorough analysis, the SE (Squeeze-and-Excitation) attention mechanism emerged as the standout performer in enhancing detection accuracy. By explicitly modeling the intricate interdependencies among feature channels and adaptively recalibrating their relative importance, the SE mechanism adeptly amplifies the model’s sensitivity to pivotal features while effectively mitigating the distraction caused by non-essential information. This groundbreaking finding not only offers fresh perspectives for the precise detection of safety helmets and reflective vests but also serves as a robust foundation for performance optimization across a broad spectrum of similar visual tasks.
The attention mechanism closely mimics the human eye’s ability to allocate varying levels of attention based on the significance of image information during observation, effectively tackling the challenge of information overload. Its primary objective is to strategically allocate computational resources towards the crucial elements that have a more profound impact on the final results [19].
The SE (Squeeze-and-Excitation) module aims to derive a set of weights and apply feature weighting across all original channels. It achieves this through a two-step process: Squeeze and Excitation. In the Squeeze stage, the feature maps derived from the convolutional layer are transformed into a distinctive vector, condensing spatial information. Following this, the Excitation stage leverages fully connected layers and specific nonlinear activation functions to learn and generate a weight vector for each channel. This enables the SE module to adaptively adjust the weight proportion of each channel, fine-tuning the channel’s contribution within the feature map according to the specific task requirements.
The detailed algorithm flowchart, as depicted in Figure 3, illustrates this process. By integrating the SE attention mechanism into the SDCB-YOLO model, it becomes capable of allocating diverse computational resources to different regions of the image. This ensures that the model achieves improved accuracy without incurring additional computational complexity.
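As an illustration of the Squeeze and Excitation steps described above, a minimal PyTorch sketch of a standard SE block follows; the reduction ratio r = 16 is the common default and is an assumption, not a value specified in this paper.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Minimal Squeeze-and-Excitation block (standard design; r=16 is an assumed default)."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)      # Squeeze: global average pooling per channel
        self.excite = nn.Sequential(                # Excitation: learn per-channel weights
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.squeeze(x).view(b, c)              # (b, c) channel descriptor
        w = self.excite(w).view(b, c, 1, 1)         # (b, c, 1, 1) channel weights
        return x * w                                # reweight the original feature map
```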
In this paper, a comparison is made in the selection of attention mechanisms, specifically comparing the common attention mechanisms. The detailed test results are shown in Table 2. It can be observed that SE attention outperforms other attention mechanisms across various indicators.

2.5. Distance IoU Loss Function

When using object detection algorithms, the precise positioning of the target bounding box is undoubtedly one of the core metrics for measuring algorithm performance. It is crucial to select an appropriate loss function for optimizing and ranking the position of bounding boxes, as different loss functions have their own advantages in calculation methods and are directly related to the accuracy of the model’s prediction of the target bounding box. YOLOv8, as an advanced object detection model, initially adopted CIOU (Complete Intersection over Union) as its core loss function, aiming to improve detection accuracy through more refined and complex computational logic. CIOU not only considers the degree of overlap between bounding boxes but also introduces various factors such as center point distance and aspect ratio consistency to achieve more accurate matching results. However, in the process of in-depth analysis and experimentation, we noticed that the aspect ratio penalty mechanism in CIOU did not significantly affect the selection accuracy of target bounding boxes as expected for specific datasets. This discovery suggests that, for certain specific application scenarios or datasets, the complex computational mechanism of CIOU may not be necessary and may instead increase the computational burden of the model due to overfitting or redundant calculations, affecting detection efficiency. In view of this, this article proposes adjusting the loss function from CIOU to DIOU (Distance IoU Loss). This change is based on the following considerations:
Simplify calculations and improve efficiency: DIOU retains only the key parts of CIOU regarding the distance between the center points of the bounding boxes and the overlapping area, removing complex calculations such as aspect ratio penalties, thereby significantly reducing the computational complexity of the model and helping to improve detection speed.
Maintaining accuracy advantage: Despite removing aspect ratio penalties, DIOU can still maintain good detection accuracy on most datasets, especially in scenarios where aspect ratio changes have little impact on detection results, where its performance is particularly outstanding.
Stronger adaptability: By simplifying the loss function, the SDCB-YOLO model demonstrates stronger adaptability and flexibility, making it easier to cater to the needs of different datasets and detection tasks, providing more possibilities for practical applications.
In summary, by modifying the loss function from CIOU to DIOU, the SDCB-YOLO model not only simplifies the calculation process and improves detection efficiency but also maintains good detection accuracy.
DIOU minimizes the relative position of the center points of two BBox frames [20]. DIOU is calculated according to Equation (1).
$L_{DIoU} = 1 - IoU + \dfrac{\rho^{2}(b,\, b^{gt})}{c^{2}}$  (1)
Here, b and b^gt denote the center points of the predicted and ground-truth bounding boxes, respectively; ρ(·) is the Euclidean distance; and c is the diagonal length of the smallest enclosing box that covers both bounding boxes. The modified model achieves improved lightweight performance without compromising detection accuracy.
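For illustration, the following is a minimal PyTorch sketch of Equation (1) for axis-aligned boxes; the (x1, y1, x2, y2) box format and the epsilon term are implementation assumptions rather than details specified in this paper.

```python
import torch

def diou_loss(pred, target, eps=1e-7):
    """DIoU loss for boxes in (x1, y1, x2, y2) format, following Equation (1)."""
    # intersection area
    x1 = torch.max(pred[..., 0], target[..., 0])
    y1 = torch.max(pred[..., 1], target[..., 1])
    x2 = torch.min(pred[..., 2], target[..., 2])
    y2 = torch.min(pred[..., 3], target[..., 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)

    # union and IoU
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter + eps)

    # squared distance between box centers (rho^2)
    cx_p = (pred[..., 0] + pred[..., 2]) / 2
    cy_p = (pred[..., 1] + pred[..., 3]) / 2
    cx_t = (target[..., 0] + target[..., 2]) / 2
    cy_t = (target[..., 1] + target[..., 3]) / 2
    rho2 = (cx_p - cx_t) ** 2 + (cy_p - cy_t) ** 2

    # squared diagonal of the smallest enclosing box (c^2)
    ex1 = torch.min(pred[..., 0], target[..., 0])
    ey1 = torch.min(pred[..., 1], target[..., 1])
    ex2 = torch.max(pred[..., 2], target[..., 2])
    ey2 = torch.max(pred[..., 3], target[..., 3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + eps

    return 1 - iou + rho2 / c2
```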

2.6. CARAFE: A Lightweight and General Upsampling Operator for Computer Vision

Compared with algorithms that focus on helmet detection alone, the method discussed in this article addresses a wider range of concerns when processing the dataset, detecting not only safety helmets but also reflective clothing. This extension requires the algorithm to have a larger receptive field to capture wider and more complex scene information, and to flexibly apply kernels of multiple sizes that adjust dynamically to the specific content of each detection instance. Given that practical deployments, especially on terminal devices, often face limited computing resources, lightweight design is an indispensable consideration. Therefore, this article introduces CARAFE (an advanced feature upsampling technique) to optimize the model, replacing the traditional upsampling method. This measure not only improves the model's performance on complex multi-target detection tasks but also ensures efficient operation in resource-limited environments by keeping the computational complexity and parameter count low, striking a balance between detection accuracy and operational efficiency.
CARAFE performs upsampling through feature reorganization: for each output location, the features in a local neighborhood of the corresponding source location are weighted by a content-aware sampling kernel and summed [21]. CARAFE comprises two main modules: the upsampling kernel prediction module and the feature reorganization module. The specific calculation formulas are given in Equations (2) and (3). The kernel prediction module generates the reorganization kernels; each source location on X requires a kup × kup reorganization kernel, where kup is the size of the reorganization kernel. The kernel prediction module is further divided into three sub-modules: a channel compressor, a content encoder, and a kernel normalizer. The channel compressor reduces the channel dimension of the input feature map to lower the computational cost, the content encoder generates the reorganization kernels from the compressed features, and the kernel normalizer applies a softmax so that the weights of each kernel sum to one. The feature reorganization module then reassembles the features using these kernels.
$W_{l'} = \psi\left(N(X_{l}, k_{encoder})\right)$  (2)
$X'_{l'} = \phi\left(N(X_{l}, k_{up}), W_{l'}\right)$  (3)
The CARAFE operator accelerates the computation speed and improves the computational accuracy of the model. The change in receptive field enables the sampled feature maps to contain more information, leading to more accurate model training. Furthermore, the relatively low computational cost of the CARAFE operator also speeds up the model training process.
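To make the two modules concrete, the following is a simplified PyTorch sketch of a CARAFE-style upsampler under the structure described above; the compressed channel width, kernel sizes, and the nearest-neighbour replication used to map each source neighbourhood to its output block are implementation assumptions and do not reproduce the official CARAFE code.

```python
import torch.nn as nn
import torch.nn.functional as F

class SimpleCARAFE(nn.Module):
    """Simplified CARAFE upsampler: predict content-aware reassembly kernels,
    then reassemble features within local neighbourhoods of the source map."""
    def __init__(self, channels, scale=2, k_up=5, k_encoder=3, c_mid=64):
        super().__init__()
        self.scale, self.k_up = scale, k_up
        self.compressor = nn.Conv2d(channels, c_mid, 1)                      # channel compressor
        self.encoder = nn.Conv2d(c_mid, (scale ** 2) * k_up ** 2,
                                 k_encoder, padding=k_encoder // 2)          # content encoder

    def forward(self, x):
        b, c, h, w = x.shape
        H, W = h * self.scale, w * self.scale
        # --- kernel prediction module ---
        k = self.encoder(self.compressor(x))           # (b, scale^2 * k_up^2, h, w)
        k = F.pixel_shuffle(k, self.scale)              # (b, k_up^2, H, W)
        k = F.softmax(k, dim=1)                         # kernel normalizer
        # --- feature reorganization module ---
        neigh = F.unfold(x, self.k_up, padding=self.k_up // 2)   # (b, c*k_up^2, h*w)
        neigh = neigh.view(b, c * self.k_up ** 2, h, w)
        neigh = F.interpolate(neigh, size=(H, W), mode="nearest")  # copy each source neighbourhood
        neigh = neigh.view(b, c, self.k_up ** 2, H, W)
        return (neigh * k.unsqueeze(1)).sum(dim=2)      # weighted reassembly
```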

2.7. Bidirectional Feature Pyramid Network

A particularly significant technical challenge in the field of safety helmet and reflective clothing detection lies in the extreme inconsistency of feature scales. This necessitates not only tackling the detection challenges posed by small targets such as safety helmets, but also accurately identifying large-scale targets like reflective clothing. In complex real-world scenarios, a single image can contain dense crowds, leading to mutual occlusion among individuals. Moreover, the diversity of environmental factors further complicates the detection task. Consequently, achieving effective feature fusion across multiple scales is crucial for enhancing detection accuracy. Traditional single-layer feature fusion methods are often constrained by their limited field of view, overlooking the complementarity between features at different levels, which inevitably compromises detection accuracy to some degree. While blindly increasing the number of fusion layers might potentially improve performance, it also leads to a significant rise in computational complexity, imposing stricter requirements on computing resources, which is particularly undesirable in resource-constrained deployment environments.
In light of these issues, this article introduces an innovative solution: replacing the original PAN (Path Aggregation Network) structure with the more efficient BiFPN (Bidirectional Feature Pyramid Network) structure. The BiFPN, leveraging its unique bidirectional cross-scale connections and weighted feature fusion mechanism, not only bolsters the network’s capacity to capture multi-scale features but also facilitates a more comprehensive and balanced dissemination of feature information. This advancement not only elevates the detection accuracy of safety helmets and reflective clothing across various scales but also effectively mitigates computational costs by optimizing the computational path. As a result, the model remains highly performant while being more adaptable to the computational resource constraints encountered in actual deployment scenarios.
The BiFPN (Bidirectional Feature Pyramid Network) employed in this algorithm [22] supersedes the conventional PAN (Path Aggregation Network) structure, addressing several issues. The key enhancements are outlined below:
Firstly, nodes with a single input edge are eliminated, a strategic move aimed at minimizing unnecessary computations and optimizing the algorithm’s computational efficiency. Secondly, when the initial input and output nodes reside on the same layer, the number of connecting edges is augmented by one, enhancing information flow. Thirdly, in the presence of a bidirectional path, this path is reiterated multiple times as a feature network layer, further enriching feature representation.
A structural comparison is presented in Figure 4. The traditional FPN (Feature Pyramid Network), as depicted in Figure 4a, relies solely on top-down fusion, which can be restrictive due to its unidirectional information flow. In contrast, the YOLOv8 structure, as illustrated in Figure 4b, incorporates a PAN, adding a bottom-up operation to the FPN layer. However, when fusing features across different levels, it adopts a relatively straightforward approach, primarily through summarization, treating features of varying sizes uniformly. This approach may not be optimal for the diverse processing needs of different layer features, particularly for safety helmets and reflective clothing detection.
The BiFPN structure in SDCB-YOLO, as showcased in Figure 4c, excels in providing nuanced attention and processing at multiple scales and levels. It applies weighted influence to the fused feature maps, dynamically adjusting based on feature maps of different resolutions. This enables the model to better capture and represent the intricate characteristics of safety helmets and reflective clothing, enhancing overall detection performance.
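As an illustration of the weighted fusion described above, a minimal PyTorch sketch of the fast normalized fusion used at a BiFPN node (in the EfficientDet style) follows; it assumes the input feature maps have already been resized to a common resolution and channel width, and it is not the exact fusion code used in this paper.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Fast normalized fusion at a BiFPN node: a learnable non-negative weight
    per input feature map, combined as a normalized weighted sum."""
    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))   # learnable fusion weights
        self.eps = eps

    def forward(self, feats):
        w = torch.relu(self.w)                          # keep weights non-negative
        w = w / (w.sum() + self.eps)                    # normalize so weights sum to ~1
        return sum(wi * f for wi, f in zip(w, feats))   # weighted fusion of feature maps

# Usage sketch: fuse a same-level backbone feature with an upsampled top-down feature.
# fuse = WeightedFusion(num_inputs=2)
# p4_td = fuse([p4_backbone, p5_upsampled])
```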

2.8. SDCB-YOLO Network

In summary, the design of the new model is completed through the above four optimizations. In this paper, the new model is named SDCB-YOLO by taking the initials of the four modifications. The SDCB-YOLO architecture diagram is shown in Figure 5. The SDCB-YOLO model has been optimized and improved specifically for the detection of safety helmets and reflective clothing. It employs the backbone network of YOLOv8, simplifying migration to various devices. In the neck layer, CARAFE is innovatively used to replace the original sampling method, enabling the network to achieve a larger receptive field and employ adaptive convolution kernels to capture more informative feature maps, while concurrently reducing some computational costs.
Moreover, the BiFPN is seamlessly integrated into the Concat module to optimize feature fusion across multiple scales. This eliminates unnecessary computations while enhancing the significance of important features, thereby deepening the impact of feature maps with varying resolutions on the fusion process. Subsequently, the SE attention mechanism is strategically placed after the original 18th and 21st layers to dynamically allocate computational resources among different parts of the network, leading to improved utilization of these resources. Regarding the loss function, CIOU in YOLOv8 is replaced with DIOU, discarding some unnecessary computations, accelerating the training speed, and reducing the computational burden. This makes the model more compatible with device computational power during actual deployment, enabling more accurate and faster recognition.
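For readers who wish to reproduce a comparable training setup, the following is a hypothetical sketch using the Ultralytics training API with the hyperparameters from Table 1; the configuration file names are placeholders, as the paper does not publish its model or dataset configuration files, and custom modules such as SE, CARAFE, and BiFPN would need to be registered with the framework before such a configuration could be loaded.

```python
from ultralytics import YOLO

# Hypothetical training sketch. "sdcb-yolo.yaml" and "ppe-updated.yaml" are
# placeholder names for custom model/dataset configurations, not published files.
model = YOLO("sdcb-yolo.yaml")
model.train(
    data="ppe-updated.yaml",   # dataset configuration (placeholder)
    epochs=300,                # hyperparameters below follow Table 1
    batch=24,
    lr0=0.01,
    momentum=0.937,
    weight_decay=0.0005,
)
```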

3. Results and Discussion

3.1. Evaluation Indicators

In this study, we use precision, recall, mean average precision at an IoU threshold of 0.5 (mAP@0.5), mean average precision averaged over IoU thresholds from 0.5 to 0.95 (mAP@0.5:0.95), computational complexity (GFLOPs), and model size as evaluation metrics. Precision is the proportion of samples predicted as positive that are actually positive, calculated as shown in Equation (4).
$P = \dfrac{TP}{TP + FP}$  (4)
TP denotes the number of positive samples correctly detected as positive, FP denotes the number of negative samples incorrectly identified as positive, and FN denotes the number of positive samples that are not correctly detected. Recall is the proportion of all positive samples that are correctly detected, calculated as shown in Equation (5).
$R = \dfrac{TP}{TP + FN}$  (5)
AP is the average precision for a single category, while mAP is the mean of the APs across all categories. mAP@0.5 is the mAP computed at an IoU threshold of 0.5, and mAP@0.5:0.95 is the mAP averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05. The calculation methods are shown in Equations (6) and (7), where N is the number of target categories being detected.
$AP = \int_{0}^{1} P(R)\, dR$  (6)
$mAP = \dfrac{1}{N}\sum_{i=1}^{N} AP_{i}$  (7)
GFLOPs reflects the computational complexity of the model and is closely related to its detection speed, while model size directly affects the feasibility of actual deployment on a device.
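As a small worked illustration of Equations (4), (5), and (7), the following Python sketch computes precision, recall, and mAP from detection counts and per-class APs; the numeric values are illustrative only and are not results from this paper.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from detection counts (Equations (4) and (5))."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

def mean_average_precision(ap_per_class):
    """mAP as the mean of per-class APs (Equation (7))."""
    return sum(ap_per_class) / len(ap_per_class)

# Illustrative values only (not results from the paper):
p, r = precision_recall(tp=90, fp=5, fn=10)                 # precision ~0.947, recall 0.9
map50 = mean_average_precision([0.96, 0.95, 0.98, 0.97])    # mean over four classes
```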

3.2. Comparative Experiment

In this paper, the SDCB-YOLO model is compared with current mainstream algorithms. As presented in Table 3, the YOLOv3-tiny model significantly lags behind the proposed SDCB-YOLO model in terms of model size, number of parameters, computational complexity, and accuracy. While YOLOv4m achieves a considerable improvement in accuracy compared to YOLOv3-tiny, its excessive number of parameters and other indicators render it unsuitable for practical applications on devices. Although the SDCB-YOLO model is slightly larger in size and has a higher computational complexity than YOLOv5n, it demonstrates a notable enhancement in accuracy. Compared to the YOLOv8n, SDCB-YOLO achieves a higher accuracy with only a marginal increase in the number of parameters and computational complexity. It is evident from the comparisons that SDCB-YOLO excels in various evaluation criteria.
To more intuitively demonstrate the advantages of SDCB-YOLO, this paper compares the detection results of YOLOv8n and SDCB-YOLO in various complex scenarios. Figure 6a,b show the detection results of the same image using YOLOv8n and SDCB-YOLO, respectively: Figure 6a exhibits a detection error, while Figure 6b detects correctly. Figure 6c,d compare the results after image blurring, with Figure 6c from YOLOv8n and Figure 6d from SDCB-YOLO; SDCB-YOLO clearly produces more accurate detections. Figure 6e,g show the results of YOLOv8n, while Figure 6f,h show the results of SDCB-YOLO. Even when both models detect the correct results, SDCB-YOLO achieves higher confidence scores. This verifies the superiority of SDCB-YOLO.

3.3. Ablation Experiment

In this paper, ablation experiments are conducted to demonstrate the effectiveness of all improvements. Compared to YOLOv8n, SDCB-YOLO improves P, R, mAP@0.5, and mAP@0.5:0.95 by 2.1%, 5.4%, 3.5%, and 7.8%, respectively. Initially, integrating the SE attention mechanism raised the recall rate, mAP@0.5, and mAP@0.5:0.95; subsequently, switching to DIOU reduced computational demands while further improving the recall rate and mAP@0.5; following this, adopting CARAFE led to substantial gains in precision, recall rate, mAP@0.5, and mAP@0.5:0.95; lastly, including the BiFPN further boosted precision, mAP@0.5, and mAP@0.5:0.95. This shows that the improvements made to SDCB-YOLO in this paper are effective. The specific experimental conditions are shown in Table 4: the check mark in the YOLOv8 column of every row indicates that each model is built on YOLOv8, and a check mark for a module indicates that the module is added on top of YOLOv8. The normalized confusion matrix is shown in Figure 7.

4. Conclusions

In this paper, we propose the SDCB-YOLO object detection algorithm, built upon YOLOv8n, specifically for the detection of safety helmets and reflective clothing. YOLOv8 often encounters challenges in tasks involving crowded scenes, complex environments, blurry images, and small object detection. Consequently, we first introduce the SE attention mechanism to adjust resource allocation, focusing more computational resources on more significant pixels, thereby addressing the issue of poor detection performance. We revise the loss function to reduce computational complexity and incorporate CARAFE for upsampling operations, which enlarges the receptive field and yields feature maps with richer information, leading to more precise feature extraction and a reduced loss of feature information. Furthermore, we substitute the original PAN with the BiFPN to perform weighted calculations on different feature maps. Experimental results showcase that the SDCB-YOLO model exhibits higher accuracy and practicality compared to most other models, rendering it more suitable for real-world applications in safety helmet and reflective clothing detection.
While the algorithm presented in this article represents a significant advancement in enhancing detection accuracy for safety helmets and reflective clothing, it is acknowledged that the solution does have some limitations and areas for further improvement. Primarily, the focus of this article leans heavily towards optimizing accuracy, with a less pronounced emphasis on achieving a lightweight design. This aspect becomes particularly noteworthy when considering the increasing demand for efficient and resource-conscious solutions in diverse application scenarios. Furthermore, the adaptability of the proposed algorithm to more complex and challenging environments remains an area of concern. The harsh and dynamic conditions often encountered in real-world deployments pose additional difficulties that may not be fully addressed by the current approach. These environments can involve extreme lighting conditions, heavy occlusions, and dynamic changes in the visual scene, all of which can compromise detection performance. To address these shortcomings, future research should strive to balance the optimization efforts between accuracy enhancement and lightweight design, ensuring that the algorithm remains practical and efficient for deployment in resource-constrained environments. Additionally, exploring advanced techniques to improve the algorithm’s robustness and adaptability to a broader range of complex scenarios would significantly expand its applicability and effectiveness. Future research will focus on advancing the real-time detection capabilities of SDCB-YOLO and enhancing safe construction practices. Although we have addressed detection issues in various complex environments, detection in special weather conditions such as at night or during heavy rain remains a challenge.

Author Contributions

Conceptualization, J.W.; methodology, J.W.; software, X.Y.; validation, J.W. and X.Y.; formal analysis, X.Y.; investigation, X.Y.; resources, J.W.; data curation, X.Y.; writing—original draft preparation, J.W.; writing—review and editing, J.W.; visualization, J.W.; supervision, J.W.; project administration, M.D.; funding acquisition, M.D. All authors have read and agreed to the published version of the manuscript.

Funding

This study was funded by the National Natural Science Foundation of China (61563012); General Project of Guangxi Natural Science Foundation (2021GXNSFAA220074).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available in the ppe-updated dataset on Roboflow at https://universe.roboflow.com/ppe-detection-csg9b/ppe-updated/dataset/3.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, M.Y.; Cao, Z.Y.; Zhao, X.F.; Yang, Z. On the identification of the safety helmet wearing manners for the construction company workers based on the deep learning theory. J. Saf. Environ. 2019, 19, 535–541. [Google Scholar]
  2. Li, Y.; Wei, H.; Han, Z.; Huang, J.; Wang, W. Deep learning-based safety helmet detection in engineering management based on convolutional neural networks. Adv. Civ. Eng. 2020, 2020, 9703560. [Google Scholar] [CrossRef]
  3. Mu, L.; Zhao, H.; Li, Y.; Qiu, J.; Sun, C.; Liu, X.-T. Vehicle recognition based on gradient compression and YOLO v4 algorithm. Chin. J. Eng. 2022, 44, 940–950. [Google Scholar]
  4. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  5. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Boston, MA, USA, 7–12 June 2015; pp. 1440–1448. [Google Scholar]
  6. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  7. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  8. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  9. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  10. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  11. Bochkovskiy, A.; Wang, C.; Liao, H.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  12. Jin, Z.; Qu, P.; Sun, C.; Luo, M.; Gui, Y.; Zhang, J.; Liu, H. DWCA-YOLOv5: An improve single shot detector for safety helmet detection. J. Sens. 2021, 2021, 4746516. [Google Scholar] [CrossRef]
  13. Zhang, Y.J.; Xiao, F.S.; Lu, Z.M. Helmet wearing state detection based on improved Yolov5s. Sensors 2022, 22, 9843. [Google Scholar] [CrossRef] [PubMed]
  14. Wang, X.; Niu, D.; Luo, P.; Zhu, C.; Ding, L.; Huang, K. A Safety Helmet and Protective Clothing Detection Method based on Improved-Yolo V3. In Proceedings of the Chinese Automation Congress, Shanghai, China, 6–8 November 2020; pp. 5437–5441. [Google Scholar] [CrossRef]
  15. Zhang, X.; Jia, X.; Wang, M.; Zhi, H. Lightweight detection of helmet and reflective clothing: Improved YOLOv5s algorithm. J. Comput. Eng. Appl. 2024, 60, 104–109. [Google Scholar] [CrossRef]
  16. PPE Detection. PPE-Updated Dataset [EB/OL]. Roboflow Universe, 2023-09. Available online: https://universe.roboflow.com/ppe-detection-csg9b/ppe-updated (accessed on 18 November 2023).
  17. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for realtime object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  18. Tang, H.; Liang, S.; Yao, D.; Qiao, Y. A visual defect detection for optics lens based on the YOLOv5-C3CA-SPPF network model. Opt. Express 2023, 31, 2628–2643. [Google Scholar] [CrossRef] [PubMed]
  19. Kosiorek, A. Attention mechanisms in neural networks. Robot. Ind. 2017, 6, 12–17. [Google Scholar]
  20. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 12993–13000. [Google Scholar]
  21. Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. CARAFE: Content-aware reassembly of features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3007–3016. [Google Scholar]
  22. Tan, M.X.; Pang, R.M.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; p. 10778. [Google Scholar]
Figure 1. The four categories of the dataset: (a) safe, (b) unsafe, (c) no helmet, and (d) no jacket.
Figure 2. YOLOv8n network structure.
Figure 3. The algorithm flowchart for the SE (Squeeze-and-Excitation) attention mechanism.
Figure 4. Three different structural diagrams: (a) FPN; (b) PAN; and (c) BiFPN.
Figure 5. SDCB-YOLO network.
Figure 6. Comparison chart of detection before and after improvement.
Figure 7. Normalized confusion matrix.
Table 1. Experimental environment.

| Configuration | Parameter |
| --- | --- |
| CPU | 12th Gen Intel(R) Core(TM) i5-12490F @ 3.00 GHz (Intel, Santa Clara, CA, USA) |
| GPU | NVIDIA GeForce RTX 4060 (Nvidia, Santa Clara, CA, USA) |
| Accelerated environment | CUDA 12.1 |
| Operating system | Windows 11, 64-bit |
| PyTorch | 2.1.0 |
| Compiler language | Python 3.11 |
| Initial learning rate | 0.01 |
| Momentum | 0.937 |
| Weight decay | 0.0005 |
| Batch size | 24 |
| Epochs | 300 |
Table 2. Comparison results of attention mechanisms.

| Model | Precision (%) | Recall (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) |
| --- | --- | --- | --- | --- |
| CA | 91.1 | 87.1 | 93.7 | 69.8 |
| CBAM | 91.8 | 88.0 | 94.1 | 70.1 |
| EMA | 89.9 | 87.4 | 93.9 | 68.2 |
| GAM | 87.8 | 88.1 | 93.6 | 69.1 |
| SE | 92.9 | 88.8 | 94.1 | 70.7 |
Table 3. Comparison results of different models.

| Model | Precision (%) | Recall (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | GFLOPs | Model Size (MB) |
| --- | --- | --- | --- | --- | --- | --- |
| YOLOv3-tiny | 85.6 | 83.5 | 88.6 | 59.0 | 12.9 | 16.6 |
| YOLOv4m | 94.5 | 88.9 | 94.5 | 69.6 | 53.0 | 46.98 |
| YOLOv5n | 90.4 | 86.5 | 90.7 | 64.1 | 4.1 | 3.74 |
| YOLOv5s | 91.8 | 89.2 | 92.5 | 68.7 | 15.8 | 13.79 |
| YOLOv8n | 93.0 | 87.3 | 93.6 | 70.4 | 8.1 | 5.97 |
| SDCB-YOLO | 95.1 | 92.7 | 97.1 | 78.2 | 8.3 | 6.13 |
Table 4. Ablation experiment table.

| YOLOv8 | SE | DIOU | CARAFE | BiFPN | Precision (%) | Recall (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ✓ | | | | | 93.0 | 87.3 | 93.6 | 70.4 |
| ✓ | ✓ | | | | 92.9 | 88.8 | 94.1 | 70.7 |
| ✓ | ✓ | ✓ | | | 91.9 | 90.1 | 94.5 | 69.6 |
| ✓ | ✓ | ✓ | ✓ | | 93.5 | 93.4 | 96.7 | 77.0 |
| ✓ | ✓ | ✓ | ✓ | ✓ | 95.1 | 92.7 | 97.1 | 78.2 |

