1. Introduction
With the rapid development of high-risk industries such as construction and mining, safety issues at construction sites are drawing increasing attention. Especially in environments involving high-altitude operations and heavy machinery, workers face significant safety risks. As a fundamental piece of personal protective equipment, safety helmets play a critical role in reducing head injuries and preventing accidents [1]. Despite regulations mandating helmet use at many sites, workers who do not wear helmets as required (Figure 1) pose a major challenge for site safety management.
In the complex and dynamic environment of construction sites, traditional methods of ensuring helmet compliance, such as manual inspections, face numerous limitations. Manual inspections are highly inefficient and cannot comprehensively or promptly cover the entire site. They often rely on sampling parts of the site, leaving unchecked areas as safety blind spots. Furthermore, human inspections are prone to subjective errors caused by fatigue, negligence, or inconsistent standards, leading to missed or incorrect judgments [2]. If workers without helmets are not promptly detected, the risk of serious accidents increases significantly, jeopardizing worker safety and the project’s progress and reputation. Therefore, with advancements in technology, there is a growing need for more efficient and intelligent helmet detection technologies to monitor helmet compliance comprehensively, accurately, and promptly, thereby safeguarding workers’ lives, enhancing site safety management, and ensuring smooth project execution.
In recent years, UAVs, often referred to as “eyes in the sky”, have steadily entered the realm of construction safety assurance, weaving a tight safety net from above. The application of UAVs on construction sites has become increasingly widespread, establishing them as invaluable tools for safety management. Their outstanding mobility enables them to swiftly navigate various areas of construction sites, whether near towering cranes or within complex structural environments, allowing for flexible and unobstructed comprehensive monitoring. The simplicity of operation means that ordinary safety management personnel can become proficient in UAV operation after short-term training, significantly lowering the usage threshold [3]. Additionally, UAVs exhibit excellent stability, ensuring reliable flight and the capture of clear and accurate imagery even in challenging site environments, such as strong winds or dusty weather conditions. Moreover, they can be equipped with various devices, such as high-definition cameras, to accurately identify workers’ operational behaviors and compliance with safety equipment usage. These advantages provide robust technical support for safe production on construction sites [4]. An example of UAV use for capturing violations on construction sites is shown in Figure 2.
The application of UAV technology in construction site scenarios effectively addresses the limitations of traditional detection methods, showcasing significant technical advantages. UAVs can conduct safety inspections without being hindered by the risks posed by complex construction environments, such as objects falling from heights or narrow passageways, which greatly enhances the safety of the inspection process [5]. They are capable of responding to commands at any time and performing inspection tasks in any corner of the construction site, whether in large-scale construction zones or scattered material storage areas. This flexibility significantly improves the automation and mobility of site inspections. Additionally, the high speed of UAVs far surpasses that of manual inspections, enabling comprehensive site surveys within a short period and dramatically increasing inspection efficiency. Equipped with advanced image recognition technology, UAVs can accurately identify whether workers are wearing safety equipment such as helmets and harnesses and promptly detect potential safety hazards in site facilities, thereby improving the quality and accuracy of inspections. At present, the application of UAVs in construction site safety detection is still in its early stages, with significant room for further development [6]. Nonetheless, their immense potential has already been widely recognized within the construction industry. With continuous technological advancements, UAVs are expected to further optimize their performance and play an indispensable role in construction site safety management.
2. Related Research
Currently, target detection is a research hotspot, and intelligent real-time detection often relies on target detection technologies. Target detection based on deep learning has recently emerged as a prominent research topic in the field of computer vision. Detection algorithms based on deep learning include R-CNN (Region-based Convolutional Neural Network), Fast R-CNN, Faster R-CNN, SPPNet (Spatial Pyramid Pooling Network), YOLO (You Only Look Once), and SSD (Single Shot Multibox Detector), among others [7].
Fang et al. [8] proposed an R-CNN-based method to detect workers not wearing helmets, randomly selecting over 100,000 image frames of construction workers from far-field surveillance videos across 25 different construction sites. Chen et al. [9] introduced a Faster R-CNN-based method for detecting safety helmets, which achieved higher accuracy but still required speed improvements. Wang et al. [10] investigated construction sites, focusing on challenges such as occlusion, overlapping, and workers wearing reflective clothing. By utilizing an improved Faster R-CNN model for target detection, they achieved outstanding detection accuracy. Wu et al. [11] proposed a helmet detection algorithm based on the Single Shot Multibox Detector (SSD).
Compared to other algorithms, the YOLO series offers advantages such as a simpler network structure, better generalization, and enhanced performance. Lin et al. [12] proposed MSCG-YOLO, a novel algorithm for worker detection based on the YOLO framework. They incorporated a multi-head self-attention mechanism into the backbone network and neck connection, designed the CISNeck structure, developed the superficial feature fusion module (SFFM) to optimize features, and used GIoU as the loss function. MSCG-YOLO outperformed existing methods in terms of AP and AP50 on validation and test datasets, effectively meeting the requirements for attire detection in construction scenarios.
Sridhar et al. [13] employed the YOLOv2 deep learning framework and designed strategies using deep convolutional neural networks to detect helmet violations on motorcycles, achieving better results than traditional algorithms. Song [14] proposed improvements to YOLOv3 by replacing the Res8 module with an RSSE module, substituting the CBL × 5 module with Res2, increasing input image resolution, using four-scale feature prediction, and introducing a CIoU-based loss function. These modifications enhanced the precision, recall, and mAP of the algorithm compared to the original, significantly improving detection performance. Ben et al. [15] introduced a safety helmet detection method based on an improved YOLOv4. They utilized the K-means algorithm to cluster the dimensions of anchor boxes, adopted a multi-scale training strategy to improve model performance, and achieved excellent results in helmet detection tasks. Li et al. [16] tackled the challenge of helmet detection by proposing the YOLO-PL algorithm based on YOLOv4. They first designed the YOLO-P algorithm, incorporating the Enhanced PAN (E-PAN) structure to improve precision, and then lightened the model by replacing the Spatial Pyramid Pooling (SPP) with the Dilated Convolution Cross-Stage Partial with X res units (DCSPX), designing a Lightweight VoVNet (L-VoVN) structure, and optimizing down-sampling and activation functions, thereby reducing parameters and enhancing performance and usability.
Yung et al. [17] conducted performance evaluations of previous deep learning algorithms and trained and assessed models such as YOLOv5s, YOLOv6s, and YOLOv7 to explore safety helmet detection. Tan et al. [18] improved YOLOv5 by adding a functional detection scale, replacing NMS with DIoU-NMS, and considering the overlap region and center distance of bounding boxes for more precise suppression of predicted bounding boxes. Shao et al. [19] enhanced YOLOv7 for miner helmet detection by constructing fusion modules, designing a multi-scale feature fusion network, and using an efficient IoU regression loss function. Experiments conducted on a self-built dataset demonstrated improved results. Chen et al. [20] proposed an improved convolutional neural network model called YOLOv7-WFD for detecting workers without helmets. They introduced the content-aware reassembly of features (CARAFE) module to capture effective features, improving the model’s ability to reconstruct details and structural information during image up-sampling. Despite achieving breakthroughs in speed, YOLOv7 still requires further optimization to address high computational resource demands and false detection rates.
Lin et al. [21] built upon YOLOv8n by integrating triplet attention, the ASF structure, and DyHead, and using the Focal-EIoU loss function to improve the model. The resulting YOLOv8n-ASF-DH model showed enhanced performance in detecting complex and small targets, with significant improvements in accuracy, recall, and mAP compared to YOLOv8n. Wang et al. [22] proposed an improved YOLOv8-ADSC safety helmet detection model by enhancing the detection head with Adaptive Spatial Feature Fusion (ASFF) and the Deformable Convolutional Network version 2 (DCNv2), adding a small target detection layer, and replacing the up-sampling module. Experiments on the Safety Helmet Wearing Dataset (SHWD) demonstrated improved precision metrics, making the model more suitable for helmet detection. Fan et al. [23] introduced the LG-YOLOv8 model, which leveraged feature enhancement techniques to make the detection algorithm more efficient and accurate.
In the integration of UAV technology and deep learning, Liang and Seo [3] developed a low-altitude remote sensing system for safety helmet detection, combining a small-target detection network with UAV technology. This system collects site images via low-altitude UAV flights and leverages deep learning models to achieve accurate detection of workers wearing safety helmets. Additionally, Yan et al. [24] integrated deep learning with random forest (RF) techniques for helmet detection, utilizing UAV images for real-time analysis, significantly enhancing helmet monitoring capabilities in power construction. Bian et al. [25] proposed an improved safety helmet detection method based on the YOLOv7 algorithm, which uses UAV-captured image data to markedly improve detection accuracy and robustness. Sharma et al. [4] designed an edge-controlled autonomous UAV system for detecting and counting workers wearing helmets at construction sites, achieving effective results even in complex site environments. Studies show that the combination of UAVs and deep learning not only improves the accuracy of safety helmet detection but also adapts to the requirements of various work environments, especially for large-scale, high-frequency, real-time monitoring. With the continuous development of UAV technology and the optimization of deep learning algorithms, future helmet detection systems are expected to serve construction site safety management more efficiently and enhance workers’ safety protection.
However, most researchers have focused on improving and optimizing existing algorithms. As an advanced object detection algorithm, YOLOv8 demonstrates outstanding performance in both speed and accuracy. It can quickly and precisely detect various targets at construction sites, particularly in helmet detection. Its broad application prospects suggest that YOLOv8 has the potential to become a key technology for ensuring construction site safety, providing strong support for building safe and efficient working environments. Arai et al. [26] utilized transfer learning to perform object detection using the YOLOv8 and Detectron2 models. They annotated datasets in the COCO format via Roboflow and employed multi-view deep learning to detect objects from different angles, enhancing detection accuracy and verifying the use of chin straps.
Although significant progress has been made in optimizing helmet detection algorithms and applying deep learning in combination with drones, many studies have not thoroughly explored the performance differences and targeted optimization strategies of advanced models like YOLOv8 in construction drone scenarios.
In terms of dataset construction and annotation, existing research lacks specialized and efficient data processing methods for drone images in construction sites. This study has developed a drone-based helmet-wearing dataset, annotated in VOC format, and innovatively set up two labels, “person” and “helmet”, for transfer learning. This provides a more practical data foundation for model training, significantly enhancing the model’s detection capability in the complex environment of construction sites. Moreover, when handling the complex real-world scenarios of helmet-wearing detection, existing research has yet to establish a comprehensive judgment system. This study proposes innovative judgment criteria, such as using the IoPPE standard to determine the appropriateness of helmet wearing and combining the bounding box height ratio to identify special cases (e.g., helmets being held), which improve detection accuracy and reliability.
Significant progress has been made in improving model performance in existing research. However, challenges remain in addressing practical demands in real-world engineering scenarios, such as adapting to complex environments, managing occlusion, and optimizing multi-worker detection. This study leverages the YOLOv8 model, combined with drone imaging technology, to focus on the practical application needs of construction sites and enhance model detection performance. To tackle the issue of occlusion in construction site environments, the flexibility of drones is utilized to capture images from multiple angles, thereby improving the model’s robustness and adaptability in complex scenarios. Additionally, the proposed method effectively addresses the challenge of multi-worker detection by optimizing model training and decision logic, ensuring the accurate identification of each worker’s helmet-wearing status in densely populated construction site scenarios. This approach provides an efficient and reliable safety management solution for construction sites, effectively safeguarding personnel safety and ensuring operational compliance.
3. YOLOv8 Target Detection
YOLOv8, the latest version in the YOLO (You Only Look Once) series of object detection algorithms, has become a leading model in the field of object detection due to its superior speed and accuracy. This section introduces the selection of YOLOv8 models, dataset preparation, model training, and analysis of experimental results, with a focus on its performance in detecting “person” and “helmet” targets. By applying YOLOv8 to complex UAV aerial scenarios, this study demonstrates its effectiveness and reliability in construction site safety management.
3.1. Model Selection
Although YOLO models have been validated on standard datasets, their performance remains limited in drone-captured scenes, where targets are often distant and small. Selecting a model suited to such scenes is therefore a key task.
Taking YOLOv8 as an example, the model is divided into several versions based on size (number of parameters): YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x. Generally, the larger the model, the higher the recognition accuracy, but the inference time also increases significantly. A performance comparison of different YOLOv8 models on the COCO dataset is presented in Table 1 [27].
To evaluate the performance of different YOLOv8 models in practical applications, this paper compares the recognition results of several YOLOv8 models under default settings. Pre-trained models based on the official COCO dataset, which contains 80 categories, were used in the test. For clarity, this experiment focuses on the recognition of the “person” category, and the actual inference time measured in the test is indicated in the figure (test platform: 3.2 GHz 6-core Intel Core i7, no GPU). The experimental results show that, while the smallest YOLOv8n model has a shorter inference time, it failed to detect any valid targets. YOLOv8s was able to detect some targets, while YOLOv8l successfully recognized all targets, as shown in Figure 3.
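For illustration, a minimal sketch of how such a comparison can be scripted with the ultralytics Python package is shown below. The image path is hypothetical, COCO class index 0 corresponds to “person”, and the timing loop is illustrative rather than the exact test script used here.

```python
# Sketch: compare COCO-pretrained YOLOv8 variants on a single aerial image.
# "site_aerial.jpg" is a hypothetical file; COCO class 0 is "person".
import time
from ultralytics import YOLO

for weights in ["yolov8n.pt", "yolov8s.pt", "yolov8l.pt"]:
    model = YOLO(weights)  # downloads the official pre-trained weights on first use
    t0 = time.perf_counter()
    results = model.predict("site_aerial.jpg", classes=[0], verbose=False)
    elapsed = time.perf_counter() - t0
    print(f"{weights}: {len(results[0].boxes)} persons in {elapsed * 1000:.0f} ms")
```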
When the model has a large number of parameters, the training time required also increases accordingly. Therefore, this project attempts to improve the recognition performance of smaller models by adjusting the model settings. The recognition results of YOLOv8s with different input image sizes are shown in Figure 4, with other settings unchanged.
As shown in Figure 4, YOLOv8s with an input image size of 1280 is able to detect all targets, and its inference time is slightly shorter than that of YOLOv8l with an input image size of 640. Although the difference in inference time between the two is small, the former is significantly superior once training time is also considered. Thus, YOLOv8s is chosen as the base model for this project.
It is important to note that the input image size used during training was not changed. Instead, the same model was used with different input image sizes during inference. Moreover, since the images obtained from the internet are typically close to 640 pixels in size, an input image size of 640 was still used during training.
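The effect of the inference-time image size can be reproduced in a few lines, since ultralytics allows the prediction resolution to differ from the training resolution; the sketch below assumes the same hypothetical aerial image as above.

```python
from ultralytics import YOLO

# Same YOLOv8s weights, two inference resolutions; small, distant targets
# survive down-scaling much better at imgsz=1280 ("site_aerial.jpg" is hypothetical).
model = YOLO("yolov8s.pt")
for size in (640, 1280):
    results = model.predict("site_aerial.jpg", imgsz=size, classes=[0], verbose=False)
    print(f"imgsz={size}: {len(results[0].boxes)} persons detected")
```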
3.2. Dataset Preparation
The dataset is the foundation for solving practical problems, so a dataset that meets the requirements of the detection task must be collected before the experiment. In this experiment, a helmet-wearing dataset collected through UAV aerial photography is used. The images collected by the UAV are annotated using the AnylableImg (v0.4.10) deep learning image annotation software, as shown in Figure 5. The dataset is annotated in the VOC format, with a total of 1584 images, as shown in Figure 6. The training and test sets are randomly split in an 8:2 ratio.
To prepare the dataset for transfer learning with YOLOv8, this paper sets up two labels, “person” and “helmet”, and performs transfer learning based on the official YOLOv8s model.
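Since the annotations are stored in VOC XML while YOLOv8 expects normalized txt labels, a conversion step is implied; the following sketch shows one way to perform it. The class mapping (person = 0, helmet = 1) is an assumed convention, as the exact indices used in this study are not stated.

```python
# Sketch: convert one VOC XML annotation to a YOLO-format txt label file.
import xml.etree.ElementTree as ET

CLASSES = {"person": 0, "helmet": 1}  # assumed class-index mapping

def voc_to_yolo(xml_path: str, txt_path: str) -> None:
    root = ET.parse(xml_path).getroot()
    w = float(root.findtext("size/width"))
    h = float(root.findtext("size/height"))
    lines = []
    for obj in root.iter("object"):
        cls = CLASSES[obj.findtext("name")]
        box = obj.find("bndbox")
        x1, y1 = float(box.findtext("xmin")), float(box.findtext("ymin"))
        x2, y2 = float(box.findtext("xmax")), float(box.findtext("ymax"))
        # YOLO format: class x_center y_center width height, normalized to [0, 1]
        lines.append(f"{cls} {(x1 + x2) / 2 / w:.6f} {(y1 + y2) / 2 / h:.6f} "
                     f"{(x2 - x1) / w:.6f} {(y2 - y1) / h:.6f}")
    with open(txt_path, "w") as f:
        f.write("\n".join(lines))
```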
3.3. Model Training
The experiment uses PyTorch (v1.13.1) as the framework, Python (v3.7.0) as the programming language, and a GPU for training. The hardware environment for training is as follows: the CPU is an Intel® Core™ i7-10750H, the GPU is an NVIDIA RTX 4060 with 8 GiB of memory, the CUDA version is 10.2, and the operating system is Linux.
The experimental parameters were set as follows: the number of epochs was 100, the batch size was 8, the optimizer was Adam, the initial learning rate was 0.01, the momentum parameter was 0.937, and the weight decay was 0.0005. The training and detection process is shown in Figure 7.
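These hyperparameters map directly onto the ultralytics training interface; the sketch below shows the corresponding call, where “helmet.yaml” is a hypothetical dataset configuration file listing the image paths and the two class names.

```python
from ultralytics import YOLO

# Transfer learning from the official YOLOv8s weights with the stated hyperparameters.
# "helmet.yaml" is a hypothetical dataset config naming the two classes and split paths.
model = YOLO("yolov8s.pt")
model.train(
    data="helmet.yaml",
    epochs=100,
    batch=8,
    imgsz=640,          # training resolution (Section 3.1)
    optimizer="Adam",
    lr0=0.01,           # initial learning rate
    momentum=0.937,
    weight_decay=0.0005,
)
```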
Figure 8 illustrates the precision–recall curves for the model’s detection performance on the “person” and “helmet” categories. The solid blue line represents the precision–recall curve for the “person” category, with an mAP value of 0.982. The dashed orange line corresponds to the “helmet” category, with an mAP value of 0.968. The bold black line indicates the average performance across all categories, with an mAP@0.5 value of 0.975, demonstrating the overall effectiveness of the model.
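Metrics of this kind can be reproduced with the built-in validation routine; a minimal sketch is shown below, where the weights path follows the ultralytics default output layout and “helmet.yaml” is the same hypothetical dataset configuration as above.

```python
from ultralytics import YOLO

# Validate the fine-tuned model; the weights path is the ultralytics default output.
model = YOLO("runs/detect/train/weights/best.pt")
metrics = model.val(data="helmet.yaml")
print("mAP@0.5 (all classes):", metrics.box.map50)
print("per-class AP@0.5:", metrics.box.ap50)  # order follows the class indices
```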
4. Results and Discussion
Simply detecting “person” and “helmet” in the image does not allow us to determine whether the “person” is wearing the “helmet”. Therefore, this project adds the following judgment conditions, as shown in Figure 9.
Intersection over union (IoU) is widely used in image segmentation tasks to check whether an object is well detected. The original conception of IoU is as follows:

$$\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|},$$

where $A$ and $B$ denote the two regions being compared. A similar criterion, IoPPE, is used in this work to check whether the PPE is worn properly:

$$\mathrm{IoPPE} = \frac{|A_{\mathrm{person}} \cap A_{\mathrm{PPE}}|}{|A_{\mathrm{PPE}}|}.$$

IoPPE indicates whether the PPE and the person occur at the same place. Construction workers can be in different poses when working on a construction site, and thus the area corresponding to a person can vary, so only the area of the PPE is used in the denominator of IoPPE.
The recognition results obtained based on the above judgment method are shown in Figure 10. When the intersecting area between the “person” and “helmet” accounts for more than 90% of the helmet’s area, the “helmet” is considered to be correctly worn. In the image, one of the three individuals does not have a matching helmet, so they are detected as “not wearing a helmet”.
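As a concrete reference, a minimal implementation of the IoPPE criterion on axis-aligned bounding boxes might look as follows; the function name and box convention are assumptions for illustration. A detection pair is then flagged as correctly worn when the returned value exceeds 0.9.

```python
def ioppe(person_box, helmet_box):
    """Intersection of the two boxes divided by the helmet (PPE) area.

    Boxes are (x1, y1, x2, y2) in pixel coordinates.
    """
    ix1 = max(person_box[0], helmet_box[0])
    iy1 = max(person_box[1], helmet_box[1])
    ix2 = min(person_box[2], helmet_box[2])
    iy2 = min(person_box[3], helmet_box[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    helmet_area = (helmet_box[2] - helmet_box[0]) * (helmet_box[3] - helmet_box[1])
    return inter / helmet_area if helmet_area > 0 else 0.0
```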
A common exception is when the helmet is held in the hand rather than worn on the head. To address this special case, this paper introduces an additional judgment condition: the helmet is considered worn only when the helmet’s bounding box sits at a height of no less than 90% of the height of the “person” bounding box, i.e., near the head rather than at hand level. The recognition results after adding this judgment condition are shown in Figure 11.
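The two conditions can be combined into a single decision function, as in the sketch below. Reading the height condition as a constraint on where the helmet box sits within the person box is an interpretation of the criterion stated above, not a verbatim specification from the paper.

```python
def helmet_worn(person_box, helmet_box, ioppe_thresh=0.9, height_thresh=0.9):
    """Combine the IoPPE test with the height condition described above.

    Boxes are (x1, y1, x2, y2) with y increasing downward, as in image coordinates.
    The height condition is read here as: the top of the helmet box must lie at
    no less than 90% of the person-box height (i.e., near the head), which is an
    assumed interpretation of the criterion in the text.
    """
    if ioppe(person_box, helmet_box) < ioppe_thresh:
        return False
    person_height = person_box[3] - person_box[1]
    if person_height <= 0:
        return False
    # Relative height of the helmet's top edge inside the person box (1.0 = very top).
    relative_height = (person_box[3] - helmet_box[1]) / person_height
    return relative_height >= height_thresh
```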
In addition, for images captured by drones, considering that recognition accuracy cannot realistically reach 100%, this paper further proposes a time-window-based judgment strategy. When no worker wearing a helmet is detected in 10 consecutive frames (approximately 1 s of video), the program automatically issues a warning. This measure aims to improve the accuracy of helmet-wearing status detection and promptly identify potential safety hazards. The specific process is shown in Figure 12.
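The time-window strategy amounts to a simple per-frame counter; a minimal sketch is given below, with class and method names chosen for illustration.

```python
class HelmetAlarm:
    """Warn when no helmeted worker is seen for `window` consecutive frames."""

    def __init__(self, window=10):
        self.window = window  # 10 frames ≈ 1 s of video, per the strategy above
        self.misses = 0

    def update(self, any_helmet_worn: bool) -> bool:
        """Feed one frame's result; returns True when the warning should fire."""
        self.misses = 0 if any_helmet_worn else self.misses + 1
        return self.misses >= self.window
```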
The mAP@0.5 of all classes achieved by our method is 0.975, demonstrating its effectiveness in addressing safety monitoring challenges in construction sites.
5. Conclusions
This paper primarily discusses the application of YOLOv8 models in helmet-wearing detection in drone-captured scenes. By introducing two labels, “person” and “helmet”, and employing a transfer learning strategy, the model’s detection capability in complex environments was successfully enhanced.
Additionally, new judgment conditions were proposed for special cases in drone-captured images, such as when the helmet is held in hand instead of worn on the head. By introducing a height ratio judgment of the bounding boxes, the accuracy of detecting the wearing status was further improved. To ensure the model can adapt to changes in video streams, the system automatically issues a warning when no helmet-wearing worker is detected in 10 consecutive frames, providing timely safety alerts.
Through comprehensive training and optimization of the YOLOv8 model in object detection tasks, the experimental results show that the proposed solution performs well in complex and dynamic real-world scenarios. The mAP@0.5 of all classes achieved by our method is 0.975, enabling effective detection of helmet-wearing workers in most practical applications. With reasonable judgment conditions and data augmentation, the proposed approach significantly enhances the efficiency and reliability of safety monitoring.
Future work could further explore how to improve the model’s accuracy and real-time performance through more efficient architectures and richer datasets, particularly in adapting to complex environments. Additionally, integrating real-time data streams from drones for multi-target tracking and intelligent decision-making will be an important direction for this research.