1. Introduction
In recent years, with the popularization of electric bikes and the increasing emphasis on traffic safety, the issue of helmet-wearing by electric bike riders has become a focal point of societal attention. Helmets, as crucial equipment for protecting riders’ head safety [1], have a direct impact on the severity of injuries in traffic accidents. However, in real life, due to various reasons such as riders’ lack of safety awareness and the inconvenience of wearing helmets, the helmet-wearing rate is not ideal. Therefore, developing an efficient and accurate algorithm for detecting helmet-wearing by electric bike riders is of great significance for enhancing riders’ safety awareness and reducing traffic accident injuries. Such an algorithm provides strong support for traffic management and safety monitoring. This research not only aids in further advancing the adoption of computer vision technology within the realm of traffic safety but also provides innovative technical means for the intelligent supervision of helmet-wearing by electric bike riders, opening up new avenues for supervision.
Helmet-wearing detection, as an interdisciplinary research topic spanning computer vision and traffic safety, has garnered widespread attention. Early research primarily relied on traditional image processing techniques such as background subtraction and feature extraction, but these methods had limited effectiveness in handling complex scenes and occlusion. With the continued development of deep learning, object detection has become an important research area, with two main categories: one-stage and two-stage object detection. One-stage object detection algorithms, such as the YOLO [2,3,4,5] series and SSD [6], predict the class and position of targets through a single forward pass, offering fast detection speed and good real-time performance. These algorithms are typically suited to scenarios requiring high detection speeds. In contrast, two-stage object detection algorithms, such as the R-CNN [7,8] series, adopt a two-step strategy of first generating candidate regions and then performing classification and location refinement. These algorithms usually have an advantage in detection accuracy but relatively slower detection speeds. Among them, the YOLO series excels in object detection and related areas, possessing unique advantages. Its most notable strength lies in real-time detection: the location and category of all targets in an image are predicted through a single forward pass, greatly improving processing speed. At the same time, while preserving high detection precision, the YOLO series continues to refine its model architectures and detection methods to meet the requirements of diverse scenarios. Furthermore, the YOLO series exhibits good generalization, handling targets of various scales and aspect ratios with a low rate of false detections in the background. These advantages have led to its wide application in multiple fields, demonstrating strong potential for practical use.
In recent years, multiple studies have explored helmet-wearing detection methods based on deep learning. For instance, Jia W et al. [9] integrated a triple attention mechanism into the YOLOv5 model. This mechanism extracts semantic dependencies across different spatial dimensions, eliminating indirect correspondences between channels and weights and thereby enhancing accuracy. Moreover, in crowded and complex road scenarios where targets often overlap and occlude each other, they employed Soft-NMS to gradually reduce the confidence of overlapping target boxes, effectively mitigating the competition between them. This addressed the issue of target occlusion and overlap, improving the model’s detection precision. However, the model still exhibits some limitations, particularly in model size and computational cost, which may make it less suitable for tasks that require fast detection. In addition, for motorcycle helmet-wearing detection [10], a clear qualification standard is that the helmet must fully enclose the head, and most motorcycle helmets on the market have relatively uniform styles. In contrast, the standards for electric bike helmets are relatively lenient, typically requiring only partial coverage of the head, and there are numerous styles on the market, making the target objects far more diverse. The model proposed in [10] may therefore demonstrate insufficient adaptability when confronted with this diversity, posing a risk of missed detections.
Similarly, Zhu et al. [11] introduced the Convolutional Block Attention Module (CBAM) and Coordinate Attention (CA) module into the YOLOv5 model to establish feature mapping relationships and reconstruct the attention on feature maps. This allowed the network to fully leverage global information, making the model more focused on detecting small targets. The traditional Non-Maximum Suppression (NMS) method uses the IoU metric to suppress redundant detection boxes, yet it considers only overlapping areas, frequently resulting in incorrect suppression. Therefore, Zhu et al. introduced DIoU-NMS, which, with IoU as the reference, additionally considers the distance between the centroids of the predicted box and the target box. This provides a more comprehensive evaluation criterion and makes the judgment of detection boxes more effective. However, the article uses only one dataset for experimentation, which is not enough to confirm the model’s generalization ability; its versatility across different scenarios and conditions still needs further validation.
Furthermore, Wu et al. [12] used the YOLOv4 model as a basis and first deepened the backbone network: the number of convolutions applied after the first feature layer output of the backbone and after the three convolution outputs following SPP pooling was increased from one to five, deepening the network and further extracting target features. Subsequently, to enlarge the receptive field, they added an SPP network within the PANet, strengthening feature extraction and fusion. This improved the internal receptive field of the network, ensuring the effective extraction of features of large targets. Experimental results demonstrate good performance in terms of both P and mAP@0.5. However, some issues remain unresolved. During dataset annotation, helmets, electric bikes, and electric bike riders were annotated as a whole. Consequently, if there are multiple individuals on an electric bike, this method cannot individually determine whether each person is wearing a helmet, revealing certain limitations in detection.
Although the series of improvement methods proposed in the aforementioned articles have achieved a certain degree of enhancement in detection accuracy, further improvements are still necessary to excel in object detection tasks. Additionally, most of these improvement methods increase model complexity and computational costs, thereby reducing the speed of target detection. Furthermore, most relevant researchers have only conducted experimental tests using a single dataset, lacking datasets specific to complex traffic monitoring scenarios. As a result, the generalization ability of the models remains unproven. Therefore, to address these issues, based on the YOLOv8n model, this paper proposes a new PRE-YOLO model for detecting helmet-wearing status on electric bikes.
The primary contributions of this paper are outlined as follows:
Enhance the detection capability for small targets by refining the model structure: a specialized small target detection layer is incorporated, improving the extraction of information from shallow feature maps. At the same time, the original detection layer for large targets is pruned to make the model lightweight, enabling improved performance even under resource-constrained conditions.
To enhance the model’s capacity to capture feature information in both the channel and spatial dimensions, we introduce a convolutional module that combines receptive-field attention with the Coordinate Attention mechanism. This suits complex environments and scenarios with densely distributed detection targets.
Furthermore, this paper incorporates the EMA [13] mechanism into the C2f module, enhancing the model’s capabilities in both feature extraction and fusion. This contributes considerably to the improvement of accuracy, meeting the requirements for task detection accuracy.
The remainder of this paper is organized as follows. Section 2 reviews the research background of the algorithm. Section 3 describes the detection method proposed in this paper. In Section 4, Section 4.1 introduces the dataset and experimental environment, Section 4.2 describes the evaluation metrics of the model, and Section 4.3 presents the experimental results along with their analysis. Section 5 discusses the research achievements of this paper and directions for future work.
2. Research Background
YOLOv8n, released by Ultralytics on 10 January 2023, is designed to accomplish tasks within the realm of computer vision, encompassing image classification, object detection, and image segmentation. To adapt to different application scenarios and computational resource requirements, the model family is divided into five versions, n, s, m, l, and x, in order of increasing network depth and width. Among them, although the YOLOv8n model has relatively lower accuracy, its small size and rapid detection speed make it highly suitable for real-time detection tasks.
The YOLOv8n model consists of four core components, Inputs, Backbone, Neck, and Head, as illustrated in Figure 1. The Inputs component employs the Mosaic data augmentation technique, which adjusts certain hyperparameters depending on the size of the model in use. This method effectively expands and enriches the dataset, enhancing the model’s generalization ability and enabling it to maintain stable performance when confronted with various unknown or complex scenarios, ultimately bolstering its robustness. The Backbone and Neck components reference the design approach of stacking multiple ELAN modules from YOLOv7: the C3 module from YOLOv5 is modified to create the C2f structural module. While keeping the model lightweight, this significantly enhances the ability to capture gradient flow information and provides flexibility in adjusting the number of channels as the model scale changes, resulting in a substantial improvement in performance. In the Head component, compared to YOLOv5, the currently prevalent decoupled head structure is adopted: the classification and detection heads are separated, with two parallel branches extracting category feature information and location feature information, respectively, each followed by a 1 × 1 convolution layer to complete the classification and localization tasks. This significantly reduces the model’s parameter count, size, and computational burden while enhancing its generalization ability and robustness.
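As a rough illustration of this decoupled design, the sketch below builds two parallel branches that share an input feature map but learn classification and regression features separately. The hidden width and activation are assumptions for illustration, not Ultralytics’ exact layer configuration.

```python
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    """Illustrative decoupled head: separate class and box branches."""
    def __init__(self, c_in, num_classes, hidden=64):
        super().__init__()
        self.cls_branch = nn.Sequential(
            nn.Conv2d(c_in, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, num_classes, 1),  # 1x1 conv -> class logits
        )
        self.reg_branch = nn.Sequential(
            nn.Conv2d(c_in, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, 4, 1),            # 1x1 conv -> box outputs
        )

    def forward(self, x):
        # The two branches share nothing after the input feature map.
        return self.cls_branch(x), self.reg_branch(x)

cls_out, box_out = DecoupledHead(256, num_classes=2)(torch.randn(1, 256, 80, 80))
print(cls_out.shape, box_out.shape)  # (1, 2, 80, 80) (1, 4, 80, 80)
```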
YOLOv8n abandons the anchor-based detection method used in previous versions and adopts an anchor-free detection strategy. With this method, the center coordinates and extent of each target are predicted directly, greatly streamlining the detection process and significantly reducing the number of candidate boxes. YOLOv8n therefore provides a more efficient and accurate solution for real-time object detection.
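The sketch below shows how such anchor-free outputs can be decoded into boxes, assuming a YOLOv8-style regression branch that predicts, for each grid cell, the distances from the cell center to the four box sides (the DFL step that produces these distances is omitted for brevity).

```python
import torch

def decode_anchor_free(pred, stride):
    """Decode per-cell [left, top, right, bottom] distances to xyxy boxes.

    pred: (4, H, W) tensor of side distances in grid units.
    stride: downsampling factor of this detection scale (e.g., 8, 16, 32).
    """
    _, h, w = pred.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    cx, cy = xs + 0.5, ys + 0.5                   # grid-cell centers
    x1 = (cx - pred[0]) * stride                  # left edge in pixels
    y1 = (cy - pred[1]) * stride                  # top edge
    x2 = (cx + pred[2]) * stride                  # right edge
    y2 = (cy + pred[3]) * stride                  # bottom edge
    return torch.stack([x1, y1, x2, y2], dim=-1)  # (H, W, 4)

boxes = decode_anchor_free(torch.rand(4, 80, 80), stride=8)
print(boxes.shape)  # torch.Size([80, 80, 4])
```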
3. Proposed Methods
Detecting whether electric bike riders are wearing helmets in complex traffic scenarios often faces challenges such as misdetection and missed detection due to the complexity of the scene and the small size of the targets. This research introduces a refinement based on YOLOv8n, with the refined network structure shown in Figure 2. The main features include the introduction of a 160 × 160 small target detection head, which enhances shallow feature extraction and focuses on small target detection. Concurrently, the large target detection head is removed, considerably reducing model parameters and size while maintaining accuracy. The backbone network integrates RFCAConv, a combined module generated from the fusion of receptive field attention convolution and the CA attention mechanism, which enhances feature perception and captures more feature information. Finally, the C2f_EMA module is incorporated to further augment the perception of important feature information, thereby improving the model’s detection performance.
3.1. Improve the Small Object Detection Layer
In the YOLOv8n network structure, three detection scales are designed to capture targets of different sizes. The feature extraction process progresses from shallow to deep layers: shallow features are rich in specific geometric details of the targets due to their high resolution, while deep feature maps possess larger receptive fields and abundant semantic information. The network performs 8×, 16×, and 32× downsampling on the 640 × 640 input image, used for detecting small, medium, and large objects, respectively. When detecting electric bike riders and their helmets on complex traffic roads, the helmets are small targets [14] that must be detected. If the original three detection heads are retained, shallow feature information may be underutilized, resulting in poor recognition of small targets [15], loss of detection accuracy, and frequent missed detections and false positives.
Therefore, a dedicated detection layer for small targets is incorporated into the original network structure. This layer performs 4× downsampling on the input image, significantly enhancing the extraction of shallow feature information and thereby improving the detection capability for small targets [16], yielding a further improvement in accuracy. Additionally, in detection tasks where most targets are small [17], the original detection layer designed for large targets only adds to the model’s parameter count and size. This paper therefore prunes the layer originally designed for detecting large targets, concentrating the model’s capacity on small targets. The experimental findings in Table 1 adequately demonstrate the effectiveness of these changes: adding the small target layer produces the most pronounced gain in detection accuracy while changing the parameter count only slightly, and subsequently pruning the large target layer reduces the parameter count by 33% at the cost of a slight decrease in accuracy, achieving a lightweight model that still improves accuracy overall.
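The resolution arithmetic below illustrates the change, assuming the standard 640 × 640 input: the stride-4 map added for small targets is 160 × 160, while the pruned stride-32 map served large targets. The P2–P5 names and the grid bookkeeping are illustrative, not the authors’ implementation.

```python
# Feature-map sizes per detection scale for a 640x640 input.
INPUT = 640
heads = {
    "YOLOv8n (P3/P4/P5)": [8, 16, 32],   # baseline strides
    "PRE-YOLO (P2/P3/P4)": [4, 8, 16],   # small-target head added, large pruned
}
for name, strides in heads.items():
    maps = [(INPUT // s, INPUT // s) for s in strides]
    cells = sum(h * w for h, w in maps)
    print(f"{name}: {maps}, {cells} grid cells")
# YOLOv8n (P3/P4/P5):  [(80, 80), (40, 40), (20, 20)] -> 8400 grid cells
# PRE-YOLO (P2/P3/P4): [(160, 160), (80, 80), (40, 40)] -> 33600 grid cells
```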
3.2. Replace the Backbone Network Convolution
The standard convolution operation is a core component in constructing convolutional neural networks (CNNs). It effectively extracts feature information from images through sliding windows and parameter sharing, overcoming the inherent limitations of fully connected layers in terms of parameters and computational efficiency. However, this operation is also accompanied by issues such as large model parameter counts and high computational costs. The spatial attention mechanism, an important attention technique, focuses on the spatial dimensions of images, namely the interrelationships between pixels. Through training, the model learns to assign varying weights to different regions within the image, effectively prioritizing and focusing on key information. Combining this mechanism with standard convolution enables CNNs to extract and process image information more efficiently. However, traditional spatial attention mechanisms often fail to fully consider the spatial features of the entire receptive field, so for convolutional kernels larger than 1 × 1 (such as 3 × 3 convolutions) the problem of convolutional parameter sharing is not fully addressed, which somewhat limits their effectiveness.
Receptive-Field Attention convolution (RFAConv) [18] focuses on the spatial feature information of the receptive field and addresses the issue of parameter sharing in convolutional kernels by introducing a receptive field attention mechanism. RFAConv incurs minimal computational overhead and parameter count while delivering significant improvements in detection performance. Coordinate Attention (CA) is a mechanism that incorporates spatial location information into channel attention. By introducing CA’s spatial attention mechanism into the spatial features [19] of the receptive field, we obtain Receptive-Field Coordinate Attention (RFCA); the structure of the module is shown in Figure 3. By matching the spatial attention over the receptive field’s spatial features with convolution, we generate the Receptive-Field Coordinate Attention convolution (RFCAConv) to replace standard convolution, fully resolving the issue of convolutional parameter sharing. At the same time, it considers long-range information to some extent, enhancing convolutional performance. In the present study, RFCAConv replaces some of the standard convolutions in the backbone network, improving model performance.
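A minimal PyTorch sketch of this idea is given below, assuming the general recipe from the RFAConv paper: a grouped convolution expands each position into its k × k receptive-field features, coordinate attention (directional pooling, a shared squeeze, and two sigmoid gates) reweights them, and a stride-k convolution aggregates each block into one output pixel. Layer widths and the reduction ratio are illustrative assumptions, not the authors’ exact configuration.

```python
import torch
import torch.nn as nn

class RFCAConv(nn.Module):
    """Sketch of Receptive-Field Coordinate Attention convolution."""
    def __init__(self, c_in, c_out, k=3, reduction=32):
        super().__init__()
        self.k = k
        # Grouped conv generates k*k receptive-field features per location.
        self.generate = nn.Sequential(
            nn.Conv2d(c_in, c_in * k * k, k, padding=k // 2,
                      groups=c_in, bias=False),
            nn.BatchNorm2d(c_in * k * k),
            nn.ReLU(inplace=True),
        )
        mid = max(8, c_in // reduction)
        self.squeeze = nn.Sequential(
            nn.Conv2d(c_in, mid, 1, bias=False),
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
        )
        self.attn_h = nn.Conv2d(mid, c_in, 1)  # attention along height
        self.attn_w = nn.Conv2d(mid, c_in, 1)  # attention along width
        # Stride-k conv fuses each k x k block into one output pixel.
        self.fuse = nn.Conv2d(c_in, c_out, k, stride=k)

    def forward(self, x):
        b, c = x.shape[:2]
        k = self.k
        rf = self.generate(x)                     # (B, C*k*k, H, W)
        h, w = rf.shape[2:]
        # Rearrange to (B, C, k*H, k*W): each k x k tile is one field.
        rf = rf.view(b, c, k, k, h, w).permute(0, 1, 4, 2, 5, 3)
        rf = rf.reshape(b, c, h * k, w * k)
        # Coordinate attention over the expanded receptive-field map.
        pooled_h = rf.mean(dim=3, keepdim=True)               # (B, C, kH, 1)
        pooled_w = rf.mean(dim=2, keepdim=True)               # (B, C, 1, kW)
        y = self.squeeze(torch.cat([pooled_h,
                                    pooled_w.transpose(2, 3)], dim=2))
        y_h, y_w = torch.split(y, [h * k, w * k], dim=2)
        a_h = self.attn_h(y_h).sigmoid()                      # (B, C, kH, 1)
        a_w = self.attn_w(y_w.transpose(2, 3)).sigmoid()      # (B, C, 1, kW)
        return self.fuse(rf * a_h * a_w)                      # (B, C_out, H, W)

out = RFCAConv(64, 128)(torch.randn(1, 64, 40, 40))
print(out.shape)  # torch.Size([1, 128, 40, 40])
```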
3.3. Improve the C2f Module
This paper introduces EMA, an efficient multi-scale attention module that operates without dimensionality reduction. The module recalibrates the channel weights in each parallel branch by encoding global information and captures pixel-level relationships [20] through cross-dimensional interactions, improving the model’s robustness and generalization ability. EMA is designed to reduce computational overhead while preserving key information from each channel, enhancing the model’s capability to process features effectively. By reorganizing the channel and batch dimensions [21] and leveraging cross-dimensional interactions, the model effectively captures pixel-level relationships, enabling it to focus on important information within the image. The structure of the EMA attention mechanism module is depicted in Figure 4.
Specifically, EMA first divides the input feature map along the channel dimension into G sub-features so that the model can learn and capture different semantic information, as shown in Equation (1):

$$X = [X_0, X_1, \ldots, X_{G-1}], \quad X_i \in \mathbb{R}^{(C/G) \times H \times W} \tag{1}$$

Here, $X \in \mathbb{R}^{C \times H \times W}$ is a three-dimensional tensor containing the feature information; $C$ denotes the number of input channels; $X_i$ denotes the divided sub-features; and $\mathbb{R}$ denotes the set of real numbers.
Secondly, EMA utilizes three parallel paths to extract clustered feature maps, two of which are 1 × 1 branches. When encoding the channels, pooling kernels of sizes (H, 1) and (1, W) are applied along the vertical and horizontal directions, respectively, as shown in Equations (2) and (3):

$$z_c^H(H) = \frac{1}{W} \sum_{0 \le i \le W} x_c(H, i) \tag{2}$$

$$z_c^W(W) = \frac{1}{H} \sum_{0 \le j \le H} x_c(j, W) \tag{3}$$

Here, $z_c^H$ denotes the output of the c-th channel at height $H$, pooled only along the horizontal direction; $x_c$ denotes the input of the c-th channel; $z_c^W$ denotes the output of the c-th channel at width $W$, pooled only along the vertical direction; and $H$ and $W$ denote the two spatial directions.
The feature maps pooled along the vertical and horizontal directions are then fused through Equations (4)–(6), enabling cross-channel information interaction between the two parallel 1 × 1 paths. The fused representation is split into two independent tensors, and two nonlinear Sigmoid functions are applied:

$$f = F_{1 \times 1}\left(\left[z^H, z^W\right]\right) \tag{4}$$

$$A^H = \sigma\left(f^H\right) \tag{5}$$

$$A^W = \sigma\left(f^W\right) \tag{6}$$

Here, $f$ represents the fused feature representation, where $f \in \mathbb{R}^{(C/G) \times 1 \times (H+W)}$; $F_{1 \times 1}$ represents the 1 × 1 convolutional transformation used for adjusting the number of channels; $f^H$ and $f^W$ represent the components of $f$ in the two spatial directions; $\sigma$ stands for the Sigmoid activation function; and $A^H$ and $A^W$ signify the attention weights.
Upon acquiring the attention weights $A^H$ and $A^W$ in the different spatial directions, they are applied to the sub-features and processed together with the output of the third branch using a two-dimensional global average pooling operation to encode the feature information in the vertical and horizontal directions. The two-dimensional global pooling operation is shown in Equation (7):

$$z_c = \frac{1}{H \times W} \sum_{j=1}^{H} \sum_{i=1}^{W} x_c(i, j) \tag{7}$$

Here, $z_c$ signifies the output corresponding to the c-th channel. Finally, cross-spatial information fusion is performed to obtain the final attention weight output of EMA.
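To make the flow of Equations (1)–(7) concrete, the sketch below follows the structure described above: grouped sub-features, two strip-pooled 1 × 1 paths, a 3 × 3 path, and cross-spatial fusion via global pooling, softmax, and matrix products. It is one reading of the EMA design [13], with layer sizes chosen for illustration rather than taken from the authors’ code.

```python
import torch
import torch.nn as nn

class EMA(nn.Module):
    """Sketch of the Efficient Multi-scale Attention module."""
    def __init__(self, channels, groups=8):
        super().__init__()
        self.g = groups
        cg = channels // groups
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # (H, 1) pooling, Eq. (2)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # (1, W) pooling, Eq. (3)
        self.conv1x1 = nn.Conv2d(cg, cg, 1)             # F_1x1 in Eq. (4)
        self.conv3x3 = nn.Conv2d(cg, cg, 3, padding=1)  # third (3x3) branch
        self.gn = nn.GroupNorm(cg, cg)

    def forward(self, x):
        b, c, h, w = x.shape
        g, cg = self.g, c // self.g
        xg = x.reshape(b * g, cg, h, w)                # Eq. (1): G sub-features
        # 1x1 branch: directional pooling, fusion, and sigmoid gates.
        zh = self.pool_h(xg)                           # (bg, cg, h, 1)
        zw = self.pool_w(xg).permute(0, 1, 3, 2)       # (bg, cg, w, 1)
        f = self.conv1x1(torch.cat([zh, zw], dim=2))   # Eq. (4)
        fh, fw = torch.split(f, [h, w], dim=2)         # split the two directions
        x1 = self.gn(xg * fh.sigmoid()                 # Eqs. (5) and (6)
                     * fw.permute(0, 1, 3, 2).sigmoid())
        x2 = self.conv3x3(xg)                          # 3x3 branch
        # Cross-spatial fusion: Eq. (7) global pooling + softmax on each
        # branch, then cross matrix products to form pixel-level weights.
        q1 = x1.mean((2, 3)).reshape(b * g, 1, cg).softmax(-1)
        q2 = x2.mean((2, 3)).reshape(b * g, 1, cg).softmax(-1)
        v1 = x1.reshape(b * g, cg, h * w)
        v2 = x2.reshape(b * g, cg, h * w)
        attn = (q1 @ v2 + q2 @ v1).reshape(b * g, 1, h, w).sigmoid()
        return (xg * attn).reshape(b, c, h, w)         # reweighted features

out = EMA(64)(torch.randn(1, 64, 40, 40))
print(out.shape)  # torch.Size([1, 64, 40, 40])
```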
Furthermore, the C2f module is of great importance in object detection, specializing in fusing feature maps of different scales to enhance detection accuracy [22]. By concatenating feature representations from various levels along the channel dimension and stacking them together, the module forms a deeper feature map. This not only preserves abundant spatial information but also fully retains semantic information, providing strong support for object detection.
Consequently, this paper integrates the EMA module with the C2f module to develop the C2f_EMA module, which combines the advantages of both. While maintaining a lightweight computational load, it enhances feature perception capability and captures more characteristic information. The module is composed of multiple Bottleneck_EMA components, which combine the original features with the enhanced features through residual connections to maintain feature continuity and information flow. The specific design is presented in Figure 5.
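The sketch below shows one plausible way to assemble Bottleneck_EMA and C2f_EMA, reusing the EMA class from the previous sketch: each bottleneck applies EMA to its convolutional output and adds the result back through a residual connection, and the C2f-style wrapper concatenates all intermediate maps. The convolution widths are illustrative guesses; Figure 5 gives the reference design.

```python
import torch
import torch.nn as nn
# Reuses the EMA class defined in the previous sketch.

class Bottleneck_EMA(nn.Module):
    """Bottleneck whose EMA-enhanced features are re-added to the input."""
    def __init__(self, c):
        super().__init__()
        self.cv1 = nn.Conv2d(c, c, 3, padding=1, bias=False)
        self.cv2 = nn.Conv2d(c, c, 3, padding=1, bias=False)
        self.ema = EMA(c)

    def forward(self, x):
        # Residual connection preserves feature continuity and information flow.
        return x + self.ema(self.cv2(self.cv1(x)))

class C2f_EMA(nn.Module):
    """C2f-style block with EMA-enhanced bottlenecks."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        self.c = c_out // 2
        self.cv1 = nn.Conv2d(c_in, 2 * self.c, 1)
        self.cv2 = nn.Conv2d((2 + n) * self.c, c_out, 1)
        self.m = nn.ModuleList(Bottleneck_EMA(self.c) for _ in range(n))

    def forward(self, x):
        # Split, run the EMA bottlenecks, then concatenate every
        # intermediate map along the channel dimension, as in C2f.
        y = list(self.cv1(x).chunk(2, dim=1))
        y.extend(m(y[-1]) for m in self.m)
        return self.cv2(torch.cat(y, dim=1))

print(C2f_EMA(64, 64, n=2)(torch.randn(1, 64, 40, 40)).shape)  # (1, 64, 40, 40)
```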
In this paper, the C2f_EMA module is used in the Backbone, replacing the original C2f modules there; the resulting architecture is illustrated in Figure 6. While maintaining the lightweight nature of the model, it enables the model to learn residual features, preserving feature continuity and information flow. For helmet-wearing detection on roads with complex traffic backgrounds and mostly small targets, the introduction of this module further elevates the perception of vital feature information, reduces the impact of noise interference, and improves detection accuracy.
5. Conclusions
This paper proposes a PRE-YOLO model expressly designed for detecting the helmet-wearing status of electric bike riders in complex traffic scenarios. Based on the YOLOv8n model, several optimizations have been implemented. First, by incorporating a small object detection layer and pruning the large object detection layer, the detection accuracy is notably enhanced while substantially decreasing the model’s parameters and size. Second, the standard convolution in the backbone is replaced with the RFCAConv module to enhance receptive-field spatial features and improve spatial attention, further raising detection accuracy. Lastly, EMA is integrated into the C2f module, which enhances feature perception capabilities and captures more feature information without increasing the model’s computational load. Experimental findings reveal that, compared with most existing mainstream detection models, the proposed PRE-YOLO model exhibits higher accuracy and practicality, making it more suitable for real-world traffic target detection applications. Although the model has achieved significant progress in multiple aspects, some limitations remain. For instance, its detection performance may be affected under extreme lighting conditions, severe occlusion, and dynamic visual scenes. Furthermore, although the model’s detection speed satisfies the demands of current detection tasks, missed detections are still possible when the target vehicle is moving extremely fast. Future work will concentrate on evaluating the PRE-YOLO model’s detection performance in extreme weather such as heavy rain and fog, as well as on achieving true real-time detection, which remains a challenging research direction.