1. Introduction
In recent years, with the popularization of electric bikes and the increasing emphasis on traffic safety, the issue of helmet-wearing by electric bike riders has become a focal point of societal attention. Helmets, as crucial equipment for protecting riders’ head safety [1], have a direct impact on the severity of injuries in traffic accidents. However, in real life, due to various reasons such as riders’ lack of safety awareness and the inconvenience of wearing helmets, the helmet-wearing rate is not ideal. Therefore, developing an efficient and accurate algorithm for detecting helmet-wearing by electric bike riders is of great significance for enhancing riders’ safety awareness and reducing traffic accident injuries. Such an algorithm provides strong support for traffic management and safety monitoring. This research not only aids in further advancing the adoption of computer vision technology within the realm of traffic safety but also provides innovative technical means for the intelligent supervision of helmet-wearing by electric bike riders, opening up new avenues for supervision.
Helmet-wearing detection, as an interdisciplinary research topic spanning computer vision and traffic safety, has garnered widespread attention. Early research primarily relied on traditional image processing techniques such as background subtraction and feature extraction, but these methods had limited effectiveness in handling complex scenes and occlusion. With the continued development of deep learning, object detection has become an important research area, with two main categories: one-stage and two-stage object detection. One-stage object detection algorithms, such as the YOLO [2,3,4,5] series and SSD [6], predict the class and position of targets through a single forward pass, offering fast detection speed and good real-time performance. These algorithms are typically suited to scenarios requiring high detection speeds. In contrast, two-stage object detection algorithms, such as the R-CNN [7,8] series, adopt a two-step strategy of first generating candidate regions and then performing classification and location refinement. These algorithms usually have an advantage in detection accuracy but relatively slower detection speeds. Among them, the YOLO series excels in object detection and related areas, possessing unique advantages. Its most notable strength lies in real-time detection: the location and category of all targets in an image are predicted through a single forward pass, greatly improving processing speed. At the same time, while preserving high detection precision, the YOLO series continues to refine its model architectures and detection methods to meet the requirements of diverse scenarios. Furthermore, the YOLO series exhibits good generalization, handling targets of various scales and aspect ratios with a low rate of false detections in the background. These advantages have led to its wide application in multiple fields, demonstrating strong potential for practical use.
In recent years, multiple studies have explored helmet-wearing detection methods based on deep learning. For instance, Jia W et al. [9] integrated a triple attention mechanism into the YOLOv5 model. This mechanism extracts semantic dependencies across different spatial dimensions, eliminating indirect correspondences between channels and weights and thereby enhancing accuracy. Moreover, in crowded and complex road scenarios where targets often overlap and occlude each other, they employed Soft-NMS to gradually reduce the confidence of overlapping target boxes, effectively mitigating the competition between them. This addressed the issue of target occlusion and overlap, improving the model’s detection precision. However, the model still exhibits some limitations, particularly in model size and computational cost, which may make it less suitable for tasks that require fast detection. In addition, for motorcycle helmet-wearing detection [10], a clear qualification standard is that the helmet must fully enclose the head, and most motorcycle helmets on the market have relatively uniform styles. In contrast, the standards for electric bike helmets are relatively lenient, typically requiring only partial coverage of the head, and there are numerous styles on the market, making the target objects far more diverse. The model proposed in [10] may therefore demonstrate insufficient adaptability when confronted with this diversity, posing a risk of missed detections.
Similarly, Zhu et al. [11] introduced the Convolutional Block Attention Module (CBAM) and Coordinate Attention (CA) module into the YOLOv5 model to establish feature mapping relationships and reconstruct the attention on feature maps. This allowed the network to fully leverage global information, making the model more focused on detecting small targets. The traditional Non-Maximum Suppression (NMS) method uses the IoU metric to suppress redundant detection boxes, yet it considers only overlapping areas, frequently resulting in incorrect suppression. Therefore, Zhu et al. introduced DIoU-NMS, which, with IoU as the reference, additionally considers the distance between the centroids of the predicted box and the target box. This provides a more comprehensive evaluation criterion and makes the judgment of detection boxes more effective. However, the article uses only one dataset for experimentation, which is not enough to confirm the model’s generalization ability; its versatility across different scenarios and conditions still needs further validation.
Furthermore, Wu et al. [12] used the YOLOv4 model as a basis and first deepened the backbone network: the number of convolutions applied after the first feature layer output of the backbone and after the three convolution outputs following SPP pooling was increased from one to five, deepening the network and further extracting target features. Subsequently, to enlarge the receptive field, they added an SPP network within the PANet, strengthening feature extraction and fusion. This improved the internal receptive field of the network, ensuring the effective extraction of features of large targets. Experimental results demonstrate good performance in terms of both P and mAP@0.5. However, some issues remain unresolved. During dataset annotation, helmets, electric bikes, and electric bike riders were annotated as a whole. Consequently, if there are multiple individuals on an electric bike, this method cannot individually determine whether each person is wearing a helmet, revealing certain limitations in detection.
Although the series of improvement methods proposed in the aforementioned articles have achieved a certain degree of enhancement in detection accuracy, further improvements are still necessary to excel in object detection tasks. Additionally, most of these improvement methods increase model complexity and computational costs, thereby reducing the speed of target detection. Furthermore, most relevant researchers have only conducted experimental tests using a single dataset, lacking datasets specific to complex traffic monitoring scenarios. As a result, the generalization ability of the models remains unproven. Therefore, to address these issues, based on the YOLOv8n model, this paper proposes a new PRE-YOLO model for detecting helmet-wearing status on electric bikes.
The primary contributions of this paper are outlined as follows:
Enhance the detection capability for small targets by refining the model structure: a specialized small target detection layer is incorporated, improving the extraction of information from shallow feature maps. At the same time, the original detection layer for large targets is pruned to make the model lightweight, enabling improved performance even under resource-constrained conditions.
To enhance the model’s capacity to capture feature information in both the channel and spatial dimensions, we introduce a convolutional module that combines receptive-field attention with the Coordinate Attention mechanism. This suits complex environments and scenarios with densely distributed detection targets.
Furthermore, this paper incorporates the EMA [13] mechanism into the C2f module, enhancing the model’s capabilities in both feature extraction and fusion. This contributes considerably to the improvement of accuracy, meeting the requirements for task detection accuracy.
The remainder of this paper is organized as follows. Section 2 reviews the research background of the algorithm. Section 3 describes the detection method proposed in this paper. In Section 4, Section 4.1 introduces the dataset and experimental environment, Section 4.2 describes the evaluation metrics of the model, and Section 4.3 presents the experimental results along with their analysis. Section 5 discusses the research achievements of this paper and directions for future work.
2. Research Background
YOLOv8n, released by Ultralytics on 10 January 2023, is designed to accomplish tasks within the realm of computer vision, encompassing image classification, object detection, and image segmentation. To adapt to different application scenarios and computational resource requirements, the model family is divided into five versions, n, s, m, l, and x, in order of increasing network depth and width. Among them, although the YOLOv8n model has relatively lower accuracy, its small size and rapid detection speed make it highly suitable for real-time detection tasks.
The YOLOv8n model consists of four core components, Inputs, Backbone, Neck, and Head, as illustrated in Figure 1. The Inputs component employs the Mosaic data augmentation technique, which adjusts certain hyperparameters depending on the size of the model in use. This method effectively expands and enriches the dataset, enhancing the model’s generalization ability and enabling it to maintain stable performance when confronted with various unknown or complex scenarios, ultimately bolstering its robustness. The Backbone and Neck components reference the design approach of stacking multiple ELAN modules from YOLOv7: the C3 module from YOLOv5 is modified to create the C2f structural module. While keeping the model lightweight, this significantly enhances the ability to capture gradient flow information and provides flexibility in adjusting the number of channels as the model scale changes, resulting in a substantial improvement in performance. In the Head component, compared to YOLOv5, the currently prevalent decoupled head structure is adopted: the classification and detection heads are separated, with two parallel branches extracting category feature information and location feature information, respectively, each followed by a 1 × 1 convolution layer to complete the classification and localization tasks. This significantly reduces the model’s parameter count, size, and computational burden while enhancing its generalization ability and robustness.
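As a rough illustration of this decoupled design, the sketch below builds two parallel branches that share an input feature map but learn classification and regression features separately. The hidden width and activation are assumptions for illustration, not Ultralytics’ exact layer configuration.

```python
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    """Illustrative decoupled head: separate class and box branches."""
    def __init__(self, c_in, num_classes, hidden=64):
        super().__init__()
        self.cls_branch = nn.Sequential(
            nn.Conv2d(c_in, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, num_classes, 1),  # 1x1 conv -> class logits
        )
        self.reg_branch = nn.Sequential(
            nn.Conv2d(c_in, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, 4, 1),            # 1x1 conv -> box outputs
        )

    def forward(self, x):
        # The two branches share nothing after the input feature map.
        return self.cls_branch(x), self.reg_branch(x)

cls_out, box_out = DecoupledHead(256, num_classes=2)(torch.randn(1, 256, 80, 80))
print(cls_out.shape, box_out.shape)  # (1, 2, 80, 80) (1, 4, 80, 80)
```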
YOLOv8n abandons the anchor-based detection method used in previous versions and adopts an anchor-free detection strategy. With this method, the center coordinates and extent of each target are predicted directly, greatly streamlining the detection process and significantly reducing the number of candidate boxes. YOLOv8n therefore provides a more efficient and accurate solution for real-time object detection.
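The sketch below shows how such anchor-free outputs can be decoded into boxes, assuming a YOLOv8-style regression branch that predicts, for each grid cell, the distances from the cell center to the four box sides (the DFL step that produces these distances is omitted for brevity).

```python
import torch

def decode_anchor_free(pred, stride):
    """Decode per-cell [left, top, right, bottom] distances to xyxy boxes.

    pred: (4, H, W) tensor of side distances in grid units.
    stride: downsampling factor of this detection scale (e.g., 8, 16, 32).
    """
    _, h, w = pred.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    cx, cy = xs + 0.5, ys + 0.5                   # grid-cell centers
    x1 = (cx - pred[0]) * stride                  # left edge in pixels
    y1 = (cy - pred[1]) * stride                  # top edge
    x2 = (cx + pred[2]) * stride                  # right edge
    y2 = (cy + pred[3]) * stride                  # bottom edge
    return torch.stack([x1, y1, x2, y2], dim=-1)  # (H, W, 4)

boxes = decode_anchor_free(torch.rand(4, 80, 80), stride=8)
print(boxes.shape)  # torch.Size([80, 80, 4])
```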
3. Proposed Methods
Detecting whether electric bike riders are wearing helmets in complex traffic scenarios often faces challenges such as misdetection and missed detection due to the complexity of the scene and the small size of the targets. This research introduces a refinement based on YOLOv8n, with the refined network structure shown in Figure 2. The main features include the introduction of a 160 × 160 small target detection head, which enhances shallow feature extraction and focuses on small target detection. Concurrently, the large target detection head is removed, considerably reducing model parameters and size while maintaining accuracy. The backbone network integrates RFCAConv, a combined module generated from the fusion of receptive field attention convolution and the CA attention mechanism, which enhances feature perception and captures more feature information. Finally, the C2f_EMA module is incorporated to further augment the perception of important feature information, thereby improving the model’s detection performance.
3.1. Improve the Small Object Detection Layer
In the YOLOv8n network structure, three detection scales are designed to capture targets of different sizes. The feature extraction process progresses from shallow to deep layers: shallow features are rich in specific geometric details of the targets due to their high resolution, while deep feature maps possess larger receptive fields and abundant semantic information. The network performs 8×, 16×, and 32× downsampling on the 640 × 640 input image, used for detecting small, medium, and large objects, respectively. When detecting electric bike riders and their helmets on complex traffic roads, the helmets are small targets [14] that must be detected. If the original three detection heads are retained, shallow feature information may be underutilized, resulting in poor recognition of small targets [15], loss of detection accuracy, and frequent missed detections and false positives.
Therefore, a dedicated detection layer for small targets is incorporated into the original network structure. This layer performs 4× downsampling on the input image, significantly enhancing the extraction of shallow feature information and thereby improving the detection capability for small targets [16], yielding a further improvement in accuracy. Additionally, in detection tasks where most targets are small [17], the original detection layer designed for large targets only adds to the model’s parameter count and size. This paper therefore prunes the layer originally designed for detecting large targets, concentrating the model’s capacity on small targets. The experimental findings in Table 1 adequately demonstrate the effectiveness of these changes: adding the small target layer produces the most pronounced gain in detection accuracy while changing the parameter count only slightly, and subsequently pruning the large target layer reduces the parameter count by 33% at the cost of a slight decrease in accuracy, achieving a lightweight model that still improves accuracy overall.
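The resolution arithmetic below illustrates the change, assuming the standard 640 × 640 input: the stride-4 map added for small targets is 160 × 160, while the pruned stride-32 map served large targets. The P2–P5 names and the grid bookkeeping are illustrative, not the authors’ implementation.

```python
# Feature-map sizes per detection scale for a 640x640 input.
INPUT = 640
heads = {
    "YOLOv8n (P3/P4/P5)": [8, 16, 32],   # baseline strides
    "PRE-YOLO (P2/P3/P4)": [4, 8, 16],   # small-target head added, large pruned
}
for name, strides in heads.items():
    maps = [(INPUT // s, INPUT // s) for s in strides]
    cells = sum(h * w for h, w in maps)
    print(f"{name}: {maps}, {cells} grid cells")
# YOLOv8n (P3/P4/P5):  [(80, 80), (40, 40), (20, 20)] -> 8400 grid cells
# PRE-YOLO (P2/P3/P4): [(160, 160), (80, 80), (40, 40)] -> 33600 grid cells
```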
3.2. Replace the Backbone Network Convolution
The standard convolution operation is a core component in constructing convolutional neural networks (CNNs). It effectively extracts feature information from images through sliding windows and parameter sharing, overcoming the inherent limitations of fully connected layers in terms of parameters and computational efficiency. However, this operation is also accompanied by issues such as large model parameter counts and high computational costs. The spatial attention mechanism, an important attention technique, focuses on the spatial dimensions of images, namely the interrelationships between pixels. Through training, the model learns to assign varying weights to different regions within the image, effectively prioritizing and focusing on key information. Combining this mechanism with standard convolution enables CNNs to extract and process image information more efficiently. However, traditional spatial attention mechanisms often fail to fully consider the spatial features of the entire receptive field, so for convolutional kernels larger than 1 × 1 (such as 3 × 3 convolutions) the problem of convolutional parameter sharing is not fully addressed, which somewhat limits their effectiveness.
Receptive-Field Attention convolution (RFAConv) [18] focuses on the spatial feature information of the receptive field and addresses the issue of parameter sharing in convolutional kernels by introducing a receptive field attention mechanism. RFAConv incurs minimal computational overhead and parameter count while delivering significant improvements in detection performance. Coordinate Attention (CA) is a mechanism that incorporates spatial location information into channel attention. By introducing CA’s spatial attention mechanism into the spatial features [19] of the receptive field, we obtain Receptive-Field Coordinate Attention (RFCA); the structure of the module is shown in Figure 3. By matching the spatial attention over the receptive field’s spatial features with convolution, we generate the Receptive-Field Coordinate Attention convolution (RFCAConv) to replace standard convolution, fully resolving the issue of convolutional parameter sharing. At the same time, it considers long-range information to some extent, enhancing convolutional performance. In the present study, RFCAConv replaces some of the standard convolutions in the backbone network, improving model performance.
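A minimal PyTorch sketch of this idea is given below, assuming the general recipe from the RFAConv paper: a grouped convolution expands each position into its k × k receptive-field features, coordinate attention (directional pooling, a shared squeeze, and two sigmoid gates) reweights them, and a stride-k convolution aggregates each block into one output pixel. Layer widths and the reduction ratio are illustrative assumptions, not the authors’ exact configuration.

```python
import torch
import torch.nn as nn

class RFCAConv(nn.Module):
    """Sketch of Receptive-Field Coordinate Attention convolution."""
    def __init__(self, c_in, c_out, k=3, reduction=32):
        super().__init__()
        self.k = k
        # Grouped conv generates k*k receptive-field features per location.
        self.generate = nn.Sequential(
            nn.Conv2d(c_in, c_in * k * k, k, padding=k // 2,
                      groups=c_in, bias=False),
            nn.BatchNorm2d(c_in * k * k),
            nn.ReLU(inplace=True),
        )
        mid = max(8, c_in // reduction)
        self.squeeze = nn.Sequential(
            nn.Conv2d(c_in, mid, 1, bias=False),
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
        )
        self.attn_h = nn.Conv2d(mid, c_in, 1)  # attention along height
        self.attn_w = nn.Conv2d(mid, c_in, 1)  # attention along width
        # Stride-k conv fuses each k x k block into one output pixel.
        self.fuse = nn.Conv2d(c_in, c_out, k, stride=k)

    def forward(self, x):
        b, c = x.shape[:2]
        k = self.k
        rf = self.generate(x)                     # (B, C*k*k, H, W)
        h, w = rf.shape[2:]
        # Rearrange to (B, C, k*H, k*W): each k x k tile is one field.
        rf = rf.view(b, c, k, k, h, w).permute(0, 1, 4, 2, 5, 3)
        rf = rf.reshape(b, c, h * k, w * k)
        # Coordinate attention over the expanded receptive-field map.
        pooled_h = rf.mean(dim=3, keepdim=True)               # (B, C, kH, 1)
        pooled_w = rf.mean(dim=2, keepdim=True)               # (B, C, 1, kW)
        y = self.squeeze(torch.cat([pooled_h,
                                    pooled_w.transpose(2, 3)], dim=2))
        y_h, y_w = torch.split(y, [h * k, w * k], dim=2)
        a_h = self.attn_h(y_h).sigmoid()                      # (B, C, kH, 1)
        a_w = self.attn_w(y_w.transpose(2, 3)).sigmoid()      # (B, C, 1, kW)
        return self.fuse(rf * a_h * a_w)                      # (B, C_out, H, W)

out = RFCAConv(64, 128)(torch.randn(1, 64, 40, 40))
print(out.shape)  # torch.Size([1, 128, 40, 40])
```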
3.3. Improve the C2f Module
This paper introduces EMA, an efficient multi-scale attention module that operates without dimensionality reduction. The module recalibrates the channel weights in each parallel branch by encoding global information and captures pixel-level relationships [20] through cross-dimensional interactions, improving the model’s robustness and generalization ability. EMA is designed to reduce computational overhead while preserving key information from each channel, enhancing the model’s capability to process features effectively. By reorganizing the channel and batch dimensions [21] and leveraging cross-dimensional interactions, the model effectively captures pixel-level relationships, enabling it to focus on important information within the image. The structure of the EMA attention mechanism module is depicted in Figure 4.
Specifically, EMA first divides the input feature map along the channel dimension into G sub-features so that the model can learn and capture different semantic information, as shown in Equation (1):

$$X = [X_0, X_1, \ldots, X_{G-1}], \quad X_i \in \mathbb{R}^{(C/G) \times H \times W} \tag{1}$$

Here, $X \in \mathbb{R}^{C \times H \times W}$ is a three-dimensional tensor containing the feature information; $C$ denotes the number of input channels; $X_i$ denotes the divided sub-features; and $\mathbb{R}$ denotes the set of real numbers.
Secondly, EMA utilizes three parallel paths to extract clustered feature maps, two of which are 1 × 1 branches. When encoding the channels, pooling kernels of sizes (H, 1) and (1, W) are applied along the vertical and horizontal directions, respectively, as shown in Equations (2) and (3):

$$z_c^H(H) = \frac{1}{W} \sum_{0 \le i \le W} x_c(H, i) \tag{2}$$

$$z_c^W(W) = \frac{1}{H} \sum_{0 \le j \le H} x_c(j, W) \tag{3}$$

Here, $z_c^H$ denotes the output of the c-th channel at height $H$, pooled only along the horizontal direction; $x_c$ denotes the input of the c-th channel; $z_c^W$ denotes the output of the c-th channel at width $W$, pooled only along the vertical direction; and $H$ and $W$ denote the two spatial directions.
The feature maps pooled along the vertical and horizontal directions are then fused through Equations (4)–(6), enabling cross-channel information interaction between the two parallel 1 × 1 paths. The fused representation is split into two independent tensors, and two nonlinear Sigmoid functions are applied:

$$f = F_{1 \times 1}\left(\left[z^H, z^W\right]\right) \tag{4}$$

$$A^H = \sigma\left(f^H\right) \tag{5}$$

$$A^W = \sigma\left(f^W\right) \tag{6}$$

Here, $f$ represents the fused feature representation, where $f \in \mathbb{R}^{(C/G) \times 1 \times (H+W)}$; $F_{1 \times 1}$ represents the 1 × 1 convolutional transformation used for adjusting the number of channels; $f^H$ and $f^W$ represent the components of $f$ in the two spatial directions; $\sigma$ stands for the Sigmoid activation function; and $A^H$ and $A^W$ signify the attention weights.
Upon acquiring the attention weights $A^H$ and $A^W$ in the different spatial directions, they are applied to the sub-features and processed together with the output of the third branch using a two-dimensional global average pooling operation to encode the feature information in the vertical and horizontal directions. The two-dimensional global pooling operation is shown in Equation (7):

$$z_c = \frac{1}{H \times W} \sum_{j=1}^{H} \sum_{i=1}^{W} x_c(i, j) \tag{7}$$

Here, $z_c$ signifies the output corresponding to the c-th channel. Finally, cross-spatial information fusion is performed to obtain the final attention weight output of EMA.
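To make the flow of Equations (1)–(7) concrete, the sketch below follows the structure described above: grouped sub-features, two strip-pooled 1 × 1 paths, a 3 × 3 path, and cross-spatial fusion via global pooling, softmax, and matrix products. It is one reading of the EMA design [13], with layer sizes chosen for illustration rather than taken from the authors’ code.

```python
import torch
import torch.nn as nn

class EMA(nn.Module):
    """Sketch of the Efficient Multi-scale Attention module."""
    def __init__(self, channels, groups=8):
        super().__init__()
        self.g = groups
        cg = channels // groups
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # (H, 1) pooling, Eq. (2)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # (1, W) pooling, Eq. (3)
        self.conv1x1 = nn.Conv2d(cg, cg, 1)             # F_1x1 in Eq. (4)
        self.conv3x3 = nn.Conv2d(cg, cg, 3, padding=1)  # third (3x3) branch
        self.gn = nn.GroupNorm(cg, cg)

    def forward(self, x):
        b, c, h, w = x.shape
        g, cg = self.g, c // self.g
        xg = x.reshape(b * g, cg, h, w)                # Eq. (1): G sub-features
        # 1x1 branch: directional pooling, fusion, and sigmoid gates.
        zh = self.pool_h(xg)                           # (bg, cg, h, 1)
        zw = self.pool_w(xg).permute(0, 1, 3, 2)       # (bg, cg, w, 1)
        f = self.conv1x1(torch.cat([zh, zw], dim=2))   # Eq. (4)
        fh, fw = torch.split(f, [h, w], dim=2)         # split the two directions
        x1 = self.gn(xg * fh.sigmoid()                 # Eqs. (5) and (6)
                     * fw.permute(0, 1, 3, 2).sigmoid())
        x2 = self.conv3x3(xg)                          # 3x3 branch
        # Cross-spatial fusion: Eq. (7) global pooling + softmax on each
        # branch, then cross matrix products to form pixel-level weights.
        q1 = x1.mean((2, 3)).reshape(b * g, 1, cg).softmax(-1)
        q2 = x2.mean((2, 3)).reshape(b * g, 1, cg).softmax(-1)
        v1 = x1.reshape(b * g, cg, h * w)
        v2 = x2.reshape(b * g, cg, h * w)
        attn = (q1 @ v2 + q2 @ v1).reshape(b * g, 1, h, w).sigmoid()
        return (xg * attn).reshape(b, c, h, w)         # reweighted features

out = EMA(64)(torch.randn(1, 64, 40, 40))
print(out.shape)  # torch.Size([1, 64, 40, 40])
```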
Furthermore, the C2f module is of great importance in object detection, specializing in fusing feature maps of different scales to enhance detection accuracy [22]. By concatenating feature representations from various levels along the channel dimension and stacking them together, the module forms a deeper feature map. This not only preserves abundant spatial information but also fully retains semantic information, providing strong support for object detection.
Consequently, this paper integrates the EMA module with the C2f module to develop the C2f_EMA module, which combines the advantages of both. While maintaining a lightweight computational load, it enhances feature perception capability and captures more characteristic information. The module is composed of multiple Bottleneck_EMA components, which combine the original features with the enhanced features through residual connections to maintain feature continuity and information flow. The specific design is presented in Figure 5.
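The sketch below shows one plausible way to assemble Bottleneck_EMA and C2f_EMA, reusing the EMA class from the previous sketch: each bottleneck applies EMA to its convolutional output and adds the result back through a residual connection, and the C2f-style wrapper concatenates all intermediate maps. The convolution widths are illustrative guesses; Figure 5 gives the reference design.

```python
import torch
import torch.nn as nn
# Reuses the EMA class defined in the previous sketch.

class Bottleneck_EMA(nn.Module):
    """Bottleneck whose EMA-enhanced features are re-added to the input."""
    def __init__(self, c):
        super().__init__()
        self.cv1 = nn.Conv2d(c, c, 3, padding=1, bias=False)
        self.cv2 = nn.Conv2d(c, c, 3, padding=1, bias=False)
        self.ema = EMA(c)

    def forward(self, x):
        # Residual connection preserves feature continuity and information flow.
        return x + self.ema(self.cv2(self.cv1(x)))

class C2f_EMA(nn.Module):
    """C2f-style block with EMA-enhanced bottlenecks."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        self.c = c_out // 2
        self.cv1 = nn.Conv2d(c_in, 2 * self.c, 1)
        self.cv2 = nn.Conv2d((2 + n) * self.c, c_out, 1)
        self.m = nn.ModuleList(Bottleneck_EMA(self.c) for _ in range(n))

    def forward(self, x):
        # Split, run the EMA bottlenecks, then concatenate every
        # intermediate map along the channel dimension, as in C2f.
        y = list(self.cv1(x).chunk(2, dim=1))
        y.extend(m(y[-1]) for m in self.m)
        return self.cv2(torch.cat(y, dim=1))

print(C2f_EMA(64, 64, n=2)(torch.randn(1, 64, 40, 40)).shape)  # (1, 64, 40, 40)
```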
In this paper, the C2f_EMA module is used in the Backbone, replacing the original C2f modules there; the resulting architecture is illustrated in Figure 6. While maintaining the lightweight nature of the model, it enables the model to learn residual features, preserving feature continuity and information flow. For helmet-wearing detection on roads with complex traffic backgrounds and mostly small targets, the introduction of this module further elevates the perception of vital feature information, reduces the impact of noise interference, and improves detection accuracy.
5. Conclusions
This paper proposes a PRE-YOLO model expressly designed for detecting the helmet-wearing status of electric bike riders in complex traffic scenarios. Based on the YOLOv8n model, several optimizations have been implemented. First, by incorporating a small object detection layer and pruning the large object detection layer, the detection accuracy is notably enhanced while substantially decreasing the model’s parameters and size. Second, the standard convolution in the backbone is replaced with the RFCAConv module to enhance receptive-field spatial features and improve spatial attention, further raising detection accuracy. Lastly, EMA is integrated into the C2f module, which enhances feature perception capabilities and captures more feature information without increasing the model’s computational load. Experimental findings reveal that, compared with most existing mainstream detection models, the proposed PRE-YOLO model exhibits higher accuracy and practicality, making it more suitable for real-world traffic target detection applications. Although the model has achieved significant progress in multiple aspects, some limitations remain. For instance, its detection performance may be affected under extreme lighting conditions, severe occlusion, and dynamic visual scenes. Furthermore, although the model’s detection speed satisfies the demands of current detection tasks, missed detections are still possible when the target vehicle is moving extremely fast. Future work will concentrate on evaluating the PRE-YOLO model’s detection performance in extreme weather such as heavy rain and fog, as well as on achieving true real-time detection, which remains a challenging research direction.