Article

TLDM: An Enhanced Traffic Light Detection Model Based on YOLOv5

1 College of Information Science and Technology, Nanjing Forestry University, 159 Longpan Road, Nanjing 210037, China
2 School of Foreign Studies, China University of Political Science and Law, 25 West Tu Cheng Road, Haidian District, Beijing 100088, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(15), 3080; https://doi.org/10.3390/electronics13153080
Submission received: 30 June 2024 / Revised: 26 July 2024 / Accepted: 30 July 2024 / Published: 3 August 2024
(This article belongs to the Section Artificial Intelligence)

Abstract

Traffic light detection and recognition are crucial for enhancing the security of unmanned systems. This study proposes a YOLOv5-based traffic light-detection algorithm to tackle the challenges posed by small targets and complex urban backgrounds. Initially, the Mosaic-9 method is employed to enhance the training dataset, thereby boosting the network's ability to generalize and adapt to real-world scenarios. Furthermore, the Squeeze-and-Excitation (SE) attention mechanism is incorporated to improve the network. Moreover, the YOLOv5 algorithm's loss function is optimized by substituting it with Efficient Intersection over Union loss (EIoU_loss), which addresses issues such as missed detections and false alarms. Experimental results demonstrate that the model trained with this enhanced network achieves an mAP (mean average precision) of 99.4% on a custom dataset, 6.3% higher than that of the original YOLOv5, while maintaining a detection speed of 74 f/s. The algorithm therefore offers higher detection accuracy and effectively meets real-time operational requirements. The proposed method has strong application potential and can be widely used in autonomous driving, assisted driving, and related fields; it is significant not only for improving the accuracy and speed of traffic light detection but also for providing technical support for the development of intelligent transportation systems.

1. Introduction

In recent years, advancements in science, technology, and artificial intelligence have led to the development of autonomous and assisted driving technologies. These technologies, which utilize deep learning and computer vision, are progressively supplanting traditional target-detection algorithms in road traffic scenarios. Traffic lights are vital for road safety. Efficient and accurate detection of their status allows intelligent vehicles to gather crucial intersection information beforehand, thereby preventing accidents and enhancing passenger safety.
Traffic light detection currently faces numerous challenges. Complex lighting conditions profoundly affect the visual attributes of traffic lights: variations in illumination across times of day and weather conditions substantially complicate detection, and intense light can induce halos and other visual distortions that hinder the distinction of traffic light shapes and colors. Background interference is another salient issue. In intricate urban street environments, the vibrant lights of surrounding buildings, advertisement signage, vehicle headlights, and obstructions can mimic traffic lights, which both complicates detection and increases the risk of false readings. Adverse weather presents an additional obstacle: rain, fog, and snow can obscure traffic lights, yielding inaccurate or fragmented information for image-capturing devices. Furthermore, the 'small target' problem is particularly acute. When traffic lights appear as minute targets in images taken from long distances or at unconventional angles, their reduced resolution and vague characteristics impede effective feature extraction. Meeting real-time demands poses a significant hurdle as well. In contexts such as autonomous driving, swift detection of and response to traffic lights are paramount for both safety and efficiency, yet balancing detection speed against algorithmic complexity is difficult. Moreover, acquiring high-quality annotated data is costly: collecting an extensive set of traffic light images meticulously annotated across diverse scenes and states requires substantial resources, and ensuring the diversity and coverage of these data is difficult. The generalization ability of the resulting models in genuinely complex and dynamic traffic scenarios remains insufficient, constraining their capacity to adapt to novel environments. Finally, ongoing model updates and maintenance are imperative. As urban landscapes and transportation infrastructures evolve, detection models require periodic updates and optimization, yet these updates often entail cumbersome procedures such as re-collecting data and retraining, adding another layer of complexity.
Most available traffic light-detection algorithms primarily rely on traditional image-processing methods, which involve manual feature extraction using sliding windows and the application of machine learning classifiers for detection and recognition [1,2,3,4,5]. Omachi et al. [6] employed the Hough transform to convert RGB images into standardized forms, accurately pinpointing traffic light positions within candidate regions. Wang Xiaoqiang et al. [7] transformed an image's color space to HSV, applied an H-channel color threshold to segment candidate areas, and used the Hough transform to predict suspected regions after performing grayscale morphological operations on the original image; fusing and filtering these results enabled the recognition of traffic light information. Such traditional image-processing algorithms depend heavily on task-specific, hand-designed features, which leads to low robustness and inadequate generalization, failing to meet the real-time requirements of complex traffic scenarios.
With the continuous development of convolutional neural network architecture, object-detection algorithms are experiencing significant growth. Currently, deep learning-based object-detection algorithms are categorized into two types: Two-stage algorithms, exemplified by Faster R-CNN [8], generate preliminary candidate bounding boxes and subsequently refine classifications, offering high accuracy but slower detection speeds. One-stage algorithms, such as Single Shot Multibox Detector (SSD) [9] and You Only Look Once (YOLO) [10], bypass candidate region generation, enabling faster detection speeds. However, these one-stage algorithms generally exhibit reduced accuracy compared to their two-stage counterparts. Pan Weiguo et al. [11] enhanced domestic traffic signal data, applied the Faster R-CNN algorithm to a custom dataset, and identified the best feature-extraction network through experimental analysis to detect and recognize traffic signals, though the detection efficiency remained low. The study in [12] describes a multi-task convolutional neural network, based on CIFAR-10, designed to detect traffic lights in complex environments; however, the model’s generalization and robustness were found to be lacking. The authors in [13] propose the SplitCS-Yolo algorithm for the rapid detection and recognition of traffic lights; although it is noted for its robustness, it requires improvements to detect yellow and digital traffic lights accurately. The study in [14] improves YOLOv3 by integrating two additional residual units into the second residual block of Darknet53, boosting the network’s ability to detect small targets with high accuracy, though it falls short of real-time processing requirements. Yan et al. [15] enhanced YOLOv5 using K-means clustering, which sped up traffic light detection in the BDD100K dataset, albeit at the cost of increased model complexity. While current deep learning-based traffic light-detection algorithms circumvent issues such as the manual feature extraction and task-specific dependence seen in traditional methods, they still face challenges including complex network architectures, extensive parameterization, reduced detection efficiency, and elevated training cost. The detection performance of these enhanced methods on traffic lights fails to meet practical requirements, struggling to balance speed and accuracy effectively.
To address the issues in traditional image-processing methods and general-purpose target-detection algorithms, a traffic signal light-detection model is designed based on the YOLOv5 target-detection algorithm. The main contributions of this paper are as follows. The Mosaic-9 method is utilized to enhance the training set for better adaptation to the application scenario. Additionally, the SE attention mechanism is incorporated into the network to emphasize target features in the image and enhance the algorithm's feature-extraction ability. Finally, Efficient Intersection over Union loss (EIoU_loss) is employed to optimize the training model and address missed and false detections. Experimental results on a self-made domestic urban traffic signal dataset show the effectiveness of the proposed algorithm in urban road traffic scenes, yielding favorable detection outcomes.
This paper is structured as follows. Section 1 reviews traffic light-detection algorithms. Section 2 introduces the basic YOLOv5 network. Section 3 describes how we improved the algorithm to achieve better detection accuracy with lower computational complexity. Section 4 presents and analyzes the experimental results. The final section concludes the paper.

2. YOLOv5 Network Structure

The YOLO series, YOLOv1 to YOLOv5 [10,16], comprises single-stage object-detection algorithms and integrates many advantages of deep learning object-detection networks. For this study, YOLOv5s, the shallowest and fastest of the YOLOv5 models [17], was chosen as the base network for the experiment. Its network structure is illustrated in Figure 1 and consists of four parts: input, backbone, neck, and prediction.

2.1. Input

Firstly, the input stage of the YOLOv5 network performs Mosaic data enhancement on the dataset. This data-enhancement method concatenates four images using random scaling, cropping, and arrangement to enrich the dataset. In particular, the random scaling adds numerous small targets, improving the detection of small objects and enhancing the network's robustness. During model training, the YOLOv5 network also adaptively generates prediction boxes and uses NMS (Non-Maximum Suppression) to select the prediction box closest to the ground truth box. Finally, to accommodate input images of different sizes, the network's adaptive image-scaling function resizes each image to the appropriate size before it enters the network, preventing problems such as mismatches between the feature tensor and the fully connected layer.
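To make the adaptive scaling step concrete, below is a minimal sketch of letterbox-style preprocessing in the spirit of YOLOv5's input stage: the image is resized with its aspect ratio preserved and the remainder is padded with a constant gray value. The function name, target size, and pad value are illustrative assumptions, not YOLOv5's actual API.

```python
# Sketch of letterbox-style adaptive image scaling (assumed names/values):
# resize while keeping the aspect ratio, then pad to a square target size.
import cv2
import numpy as np

def letterbox(img: np.ndarray, new_size: int = 640, pad_value: int = 114) -> np.ndarray:
    h, w = img.shape[:2]
    r = min(new_size / h, new_size / w)            # scale ratio preserving aspect ratio
    new_h, new_w = int(round(h * r)), int(round(w * r))
    resized = cv2.resize(img, (new_w, new_h), interpolation=cv2.INTER_LINEAR)
    dh, dw = new_size - new_h, new_size - new_w    # padding needed on each axis
    top, bottom = dh // 2, dh - dh // 2
    left, right = dw // 2, dw - dw // 2
    return cv2.copyMakeBorder(resized, top, bottom, left, right,
                              cv2.BORDER_CONSTANT, value=(pad_value,) * 3)
```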

2.2. Backbone

Secondly, the backbone consists of CBS, CSP, and SPPF modules. The CBS module combines conventional convolution, Batch Normalization (BN), and the SiLU (Sigmoid-weighted Linear Unit) activation function, and primarily handles downsampling. The first CBS module in the backbone uses a 6 × 6 convolution kernel, suited to high-resolution input images for capturing global features effectively; subsequent CBS modules use 3 × 3 kernels. The CSP (Cross Stage Partial Network) module focuses on feature extraction and integrates feature information from different levels through a cross-stage structure to minimize repeated gradient information. The SPPF (Spatial Pyramid Pooling-Fast) module is an improved version of the Spatial Pyramid Pooling (SPP) [18] module: it processes input features through three consecutive 5 × 5 max-pooling operations, retaining the SPP module's advantages of reducing repeated feature extraction and computation cost.
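As an illustration of the block just described, a minimal PyTorch sketch of a CBS module might look as follows; the class name and channel sizes are assumptions for demonstration, not the implementation used in the paper.

```python
# A minimal Conv + BatchNorm + SiLU (CBS) building block (illustrative sketch).
import torch
import torch.nn as nn

class CBS(nn.Module):
    def __init__(self, c_in: int, c_out: int, k: int = 3, s: int = 1, p=None):
        super().__init__()
        p = k // 2 if p is None else p             # default "same"-style padding
        self.conv = nn.Conv2d(c_in, c_out, k, s, p, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.conv(x)))

# e.g., a stem stage with a 6 x 6 kernel and stride 2 for downsampling
stem = CBS(3, 64, k=6, s=2, p=2)
out = stem(torch.randn(1, 3, 640, 640))            # -> shape (1, 64, 320, 320)
```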

2.3. Neck

Thirdly, the neck of the YOLOv5 network combines a feature pyramid network (FPN) [19] and a path aggregation network (PAN) [20]; feature maps from different layers are concatenated, as illustrated in Figure 2. High-level features contain rich semantic information but have a low resolution, leading to inaccurate target localization or even partial disappearance of the target. In contrast, low-level features provide accurate target locations at high resolution but carry limited semantic information. The FPN transmits deep semantic features from top to bottom, while the PAN transmits target location information from bottom to top. By fusing top-down and bottom-up feature information, the model propagates features of objects of different sizes, addresses multi-scale object detection, shortens the information-propagation path, and obtains richer features.
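For intuition, the following deliberately simplified PyTorch sketch wires three feature levels through a top-down (FPN) pass followed by a bottom-up (PAN) pass; the 1 × 1 fusion convolutions stand in for YOLOv5's heavier CSP blocks, and all names and channel counts are assumptions.

```python
# Simplified FPN + PAN fusion over three feature levels (illustrative sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

def up(x):                                   # nearest-neighbour 2x upsampling
    return F.interpolate(x, scale_factor=2, mode="nearest")

class FPNPAN(nn.Module):
    def __init__(self, c3=128, c4=256, c5=512):
        super().__init__()
        self.lat5 = nn.Conv2d(c5, c4, 1)
        self.fuse4 = nn.Conv2d(2 * c4, c3, 1)
        self.fuse3 = nn.Conv2d(2 * c3, c3, 1)
        self.down3 = nn.Conv2d(c3, c3, 3, 2, 1)
        self.fuse4b = nn.Conv2d(2 * c3, c4, 1)
        self.down4 = nn.Conv2d(c4, c4, 3, 2, 1)
        self.fuse5b = nn.Conv2d(2 * c4, c5, 1)

    def forward(self, f3, f4, f5):
        # top-down path: deep semantics flow to shallow, high-resolution maps
        p5 = self.lat5(f5)
        p4 = self.fuse4(torch.cat([up(p5), f4], 1))
        p3 = self.fuse3(torch.cat([up(p4), f3], 1))
        # bottom-up path: accurate localization flows back to deep maps
        n4 = self.fuse4b(torch.cat([self.down3(p3), p4], 1))
        n5 = self.fuse5b(torch.cat([self.down4(n4), p5], 1))
        return p3, n4, n5                    # three scales fed to the heads
```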

2.4. Prediction

Finally, the prediction component of the YOLOv5 network encompasses bounding box regression loss, confidence loss, and classification loss. The bounding box regression loss function utilizes CIoU_Loss (Complete Intersection over Union loss), which comprehensively considers three important geometric factors: overlapping area, center point distance, and aspect ratio. This approach enhances the prediction box regression speed and accuracy, particularly for targets with overlapping shading [21]. YOLOv5 uses a weighted NMS to sift through multiple target anchor boxes and eliminate redundant candidate boxes during the post-processing of target detection.
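To illustrate the post-processing step, the sketch below performs confidence filtering followed by NMS. For brevity it uses torchvision's standard hard NMS rather than the weighted variant mentioned above, and the thresholds are typical defaults rather than values reported in the paper.

```python
# Confidence filtering + NMS post-processing (illustrative sketch).
import torch
from torchvision.ops import nms

def postprocess(boxes: torch.Tensor, scores: torch.Tensor,
                conf_thres: float = 0.25, iou_thres: float = 0.45):
    """boxes: (N, 4) in (x1, y1, x2, y2) format; scores: (N,) confidences."""
    keep = scores > conf_thres               # drop low-confidence candidates
    boxes, scores = boxes[keep], scores[keep]
    idx = nms(boxes, scores, iou_thres)      # suppress overlapping duplicates
    return boxes[idx], scores[idx]
```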

3. Improved YOLOv5 Model

This study proposes a YOLOv5-based traffic light-detection algorithm to tackle the challenges posed by small targets and complex urban backgrounds. Initially, the Mosaic-9 method is employed to enhance the training dataset, thereby boosting the network’s ability to generalize and adapt to real-world scenarios. Furthermore, the Squeeze-and-Excitation (SE) attention mechanism is incorporated to improve the network. Moreover, the YOLOv5 algorithm’s loss function is optimized by substituting it with EIoU_loss, which addresses issues like missed detection and false alarms. The overall structure is shown in Figure 3. Therefore, this algorithm offers a higher detection accuracy and effectively meets real-time operational requirements.

3.1. Data Enhancement

Data annotation is time-consuming in practice; an effective way to reduce annotation effort is to create virtual data and add them to the training set. YOLOv5 utilizes the Mosaic data-augmentation method, which crops, scales, and randomly combines four images into a new image to increase the number of target samples. During normalization, the network processes all four images simultaneously, thereby boosting training speed.
The Mosaic-9 data-enhancement method extends the traditional Mosaic method from four images to nine. A series of random transformations, such as flipping, rotating, scaling, and cropping, is applied to these images, which are then stitched together into a single training image. As the stitched image grows, its label information becomes richer; training on one Mosaic-9 image is therefore equivalent to training on nine small images at once, expanding the effective sample size and improving the generalization ability of the model. The implementation is illustrated in Figure 4. By increasing data diversity, this technique brings significant gains to neural network training and helps the model perform well across different scenarios.
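A simplified sketch of the stitching step is shown below, assuming nine already-loaded images; the real augmentation also applies the random flips, rotations, scaling, and cropping described above and remaps the bounding-box labels accordingly, all of which are omitted here.

```python
# Stitch nine images into one 3 x 3 Mosaic-9 training image (illustrative sketch).
import cv2
import numpy as np

def mosaic9(images, tile=213):
    assert len(images) == 9
    tiles = [cv2.resize(img, (tile, tile)) for img in images]   # uniform tile size
    rows = [np.hstack(tiles[i:i + 3]) for i in (0, 3, 6)]       # three rows of three
    return np.vstack(rows)                                      # (3*tile, 3*tile, 3) mosaic
```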

3.2. Squeeze-and-Excitation (SE) Attention Mechanism

The detection of traffic lights in this experiment is challenging due to complex image backgrounds, small pixel sizes, and indistinct features that are susceptible to background interference. The original model's reliance on convolution for feature extraction may cause information loss and subpar target detection. To address the difficulty convolutional neural networks have with global feature extraction, the channel attention mechanism SENet [22] can significantly improve the model's performance. Therefore, in this paper, the SE attention mechanism replaces the C3 structure in the YOLOv5 backbone network, as shown in Figure 3.
The SE module acts as a channel attention mechanism that enhances the input feature map while maintaining its original size. In this study, the SE module is positioned at the end of the backbone to reinforce the overall channel features, enabling the subsequent neck part to effectively consolidate crucial features and ultimately enhance the model’s performance. The structure of the SE module, depicted in Figure 5, comprises three main components: compression, channel feature learning, and excitation. Firstly, the compression phase reduces the input feature map from W × H × C to a compact 1 × 1 × C feature map via global average pooling. Next, in the channel feature learning stage, a 1 × 1 × (C/r) feature map is generated using a 1 × 1 convolutional kernel with a stride of 1 and the SiLU activation function, where r represents the channel scaling factor. Subsequently, the channel weight coefficients for the 1 × 1 × C feature map are computed through a 1 × 1 convolutional layer with a stride of 1 and the Sigmoid activation function. Finally, in the excitation step, the original input features are multiplied by the channel weight coefficients to produce a feature map with channel attention, assigning varying significance levels to channel features based on their respective weight coefficients.
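Following the description above, a compact PyTorch sketch of such an SE block could be written as follows, with the squeeze implemented by global average pooling and the channel weights learned by two 1 × 1 convolutions with SiLU and Sigmoid activations; the class name and the default scaling factor r = 16 are assumptions.

```python
# Squeeze-and-Excitation channel attention block (illustrative sketch).
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: W x H x C -> 1 x 1 x C
        self.weights = nn.Sequential(
            nn.Conv2d(channels, channels // r, 1),   # 1 x 1 x (C/r)
            nn.SiLU(),
            nn.Conv2d(channels // r, channels, 1),   # back to 1 x 1 x C
            nn.Sigmoid(),                            # channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.weights(self.pool(x))        # excitation: reweight channels
```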

3.3. EIoU_Loss Function

Formula (1) defines the CIoU_loss function used in the original YOLOv5 [10]:
$$\text{loss}_{\text{CIoU}} = 1 - \text{IoU} + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha\nu, \qquad \alpha = \frac{\nu}{(1 - \text{IoU}) + \nu}, \qquad \nu = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2 \tag{1}$$
This formula includes the following variables: b and b^gt denote the center points of the predicted box (PB) and the ground truth box (GT); ρ²(·) is the squared Euclidean distance between two points; α is a positive trade-off parameter; c is the diagonal length of the smallest enclosing box covering both PB and GT; and ν measures the consistency of the aspect ratios of PB and GT. Additionally, w^gt, h^gt, w, and h denote the widths and heights of GT and PB, respectively.
While CIoU_Loss considers the center point distance, aspect ratio, and overlap area of bounding box regression, further analysis of the formula reveals that it solely reflects the difference in aspect ratio without capturing the actual disparity between width and height and their respective confidence levels. Consequently, the study in [23] introduced the EIoU loss function by isolating the aspect ratio from CIoU, and its calculation is shown in Formula (2).
$$\text{loss}_{\text{EIoU}} = 1 - \text{IoU} + \frac{\rho^2(b, b^{gt})}{c^2} + \frac{\rho^2(w, w^{gt})}{c_w^2} + \frac{\rho^2(h, h^{gt})}{c_h^2} \tag{2}$$
In the formula, c_w and c_h represent the width and height, respectively, of the smallest enclosing box that covers both boxes simultaneously. Compared with CIoU_Loss, EIoU_Loss not only accelerates the convergence of the prediction box but also improves regression accuracy. Therefore, this paper opts for EIoU_Loss over CIoU_Loss.
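As a reference point, Formula (2) can be implemented directly; the following is a minimal PyTorch sketch for boxes given in (x1, y1, x2, y2) format, where the function name and the eps safeguard against division by zero are assumptions.

```python
# EIoU loss for axis-aligned boxes in (x1, y1, x2, y2) format (illustrative sketch).
import torch

def eiou_loss(pb: torch.Tensor, gt: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    # IoU from intersection and union areas
    x1 = torch.max(pb[..., 0], gt[..., 0]); y1 = torch.max(pb[..., 1], gt[..., 1])
    x2 = torch.min(pb[..., 2], gt[..., 2]); y2 = torch.min(pb[..., 3], gt[..., 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    w_pb, h_pb = pb[..., 2] - pb[..., 0], pb[..., 3] - pb[..., 1]
    w_gt, h_gt = gt[..., 2] - gt[..., 0], gt[..., 3] - gt[..., 1]
    iou = inter / (w_pb * h_pb + w_gt * h_gt - inter + eps)

    # width c_w, height c_h, and squared diagonal c^2 of the smallest enclosing box
    cw = torch.max(pb[..., 2], gt[..., 2]) - torch.min(pb[..., 0], gt[..., 0])
    ch = torch.max(pb[..., 3], gt[..., 3]) - torch.min(pb[..., 1], gt[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # squared distance between the two box centers
    rho2 = ((pb[..., 0] + pb[..., 2]) - (gt[..., 0] + gt[..., 2])) ** 2 / 4 \
         + ((pb[..., 1] + pb[..., 3]) - (gt[..., 1] + gt[..., 3])) ** 2 / 4

    return (1 - iou + rho2 / c2
            + (w_pb - w_gt) ** 2 / (cw ** 2 + eps)
            + (h_pb - h_gt) ** 2 / (ch ** 2 + eps))
```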

4. Experimental Results and Analysis

This section first introduces the composition of the dataset, then describes the experimental environment and evaluation metrics, and finally analyzes the effectiveness of the improved method through ablation and comparison experiments. The experimental results verify the performance of the proposed method.

4.1. Collection of Datasets

Currently, existing open-source traffic signal datasets, such as LISA and LARA, were predominantly collected on foreign roads; they feature uniform backgrounds and high repetition rates, and traffic lights differ considerably between regions at home and abroad. Conversely, domestic open-source traffic sign datasets, such as TT100K and CCTSDB, contain limited signal data, making them unsuitable for traffic signal detection. Therefore, a custom dataset was created for this study.
Traffic signal data were collected in complex urban traffic environments in China using two methods: network screening and real-world photography, yielding approximately 2000 images. The LabelImg software (version 1.5.1) was then employed for manual labeling, resulting in three distinct label categories, as shown in Table 1. The dataset was partitioned into a training set, a test set, and a validation set in an 8:1:1 ratio. Figure 6 presents a partial sample of the custom traffic signal dataset.
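For reproducibility, a short sketch of such an 8:1:1 split is given below; the directory path, file extension, and random seed are illustrative assumptions.

```python
# Randomly split an image list into train/test/validation at 8:1:1 (illustrative sketch).
import random
from pathlib import Path

images = sorted(Path("dataset/images").glob("*.jpg"))   # hypothetical dataset location
random.seed(0)                                          # reproducible shuffle
random.shuffle(images)

n = len(images)
train = images[: int(0.8 * n)]                          # 80% training
test = images[int(0.8 * n): int(0.9 * n)]               # 10% test
val = images[int(0.9 * n):]                             # 10% validation
```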

4.2. Experimental Environment and Evaluation Index

4.2.1. Experimental Environment and Parameter Configuration

The experiments were conducted on the Windows 10 operating system with PyTorch 1.12.1 as the deep learning framework. The software and hardware platform parameters are detailed in Table 2; calculations were performed on an NVIDIA GeForce GTX 1650 graphics card (NVIDIA, Santa Clara, CA, USA), with CUDA 11.4.0 and Python 3.9.2.
The model’s parameter settings are presented in Table 3. The total number of iterations was 400, with the batch size set to 8. The initial learning rate of the model was 0.01.
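For orientation, the hyperparameters in Table 3 could be wired into a PyTorch training setup roughly as follows; the placeholder model, the SGD momentum, and the learning-rate schedule are assumptions rather than settings reported in the paper.

```python
# Training setup using Table 3's epochs, batch size, and initial learning rate
# (illustrative sketch; the model below is a stand-in, not the improved network).
import torch
import torch.nn as nn

EPOCHS, BATCH_SIZE, LR0 = 400, 8, 0.01
model = nn.Conv2d(3, 16, 3)                              # placeholder module
optimizer = torch.optim.SGD(model.parameters(), lr=LR0, momentum=0.937)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)
```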

4.2.2. Evaluation Index

Precision (P), recall (R), mean average precision (mAP), and frames per second (FPS) were used in this study to evaluate the detection capability of the model [24]. Formulas (3) and (4) were used to calculate P and R:
$$\text{Precision} = \frac{TP}{TP + FP} \times 100\% \tag{3}$$
$$\text{Recall} = \frac{TP}{TP + FN} \times 100\% \tag{4}$$
TP represents the number of correctly detected traffic lights, FP represents the number of incorrectly detected traffic lights, and FN represents the number of missed detections. AP (average precision) can be regarded as the area under a specific P-R curve, and mAP is the average AP across all categories. A larger mAP value indicates a better detection effectiveness and identification accuracy of the algorithm. The calculation formulae for AP and mAP are shown in (5) and (6) [24]:
$$AP = \int_0^1 P(R)\,dR \tag{5}$$
$$mAP = \frac{\sum_{i=1}^{N} AP_i}{N} \tag{6}$$
where N is the total number of classes. FPS measures the number of image frames processed per second; a higher value indicates a faster detection speed. In practical applications, an FPS of at least 25 is required to achieve real-time detection.
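To tie Formulas (3)-(6) together, the sketch below computes precision and recall from detection counts and approximates AP as the area under the P-R curve; the trapezoidal numerical integration and the function names are assumptions for illustration.

```python
# Precision, recall, AP, and mAP per Formulas (3)-(6) (illustrative sketch).
import numpy as np

def precision_recall(tp: int, fp: int, fn: int):
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recalls: np.ndarray, precisions: np.ndarray) -> float:
    """AP = integral of P(R) dR, approximated with the trapezoidal rule."""
    order = np.argsort(recalls)                          # integrate along increasing R
    return float(np.trapz(precisions[order], recalls[order]))

def mean_ap(ap_per_class) -> float:
    return sum(ap_per_class) / len(ap_per_class)         # mAP: mean AP over N classes
```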
To sum up, precision, recall, mAP, and FPS are all useful metrics in practical applications, especially in real-time traffic scenarios. Precision ensures the accuracy of detected traffic signals, vehicles, and other targets, avoiding erroneous traffic decisions caused by misjudgments; accurate identification also reduces the subsequent processing of wrong targets and saves computing resources. For example, in the monitoring of traffic violations, it is necessary to judge accurately whether a vehicle is violating the law and to avoid wrongfully penalizing innocent vehicles. Recall ensures that important traffic events or targets are detected as completely as possible, reducing missed detections; in traffic accident detection, a high recall rate helps identify all possible accidents in time for a quick response, and in vehicle congestion detection it allows congested sections to be detected more comprehensively so that traffic can be eased promptly. mAP jointly considers precision and recall to measure the overall quality of a detection algorithm; when choosing among traffic-detection algorithms or models, mAP serves as an important comparison index, and an algorithm with a high mAP value is more likely to perform well in real traffic scenarios. FPS reflects how quickly image data are processed in real-time traffic scenes, enabling timely traffic control and decision-making and ensuring smooth monitoring video for observing and analyzing traffic conditions; for example, in intelligent transportation systems on highways, a high FPS allows speeding vehicles to be detected quickly and alerts to be issued in time. The importance of these indicators in real-time traffic scenarios therefore cannot be ignored: they help improve the efficiency, safety, and reliability of the traffic system.

4.3. Analysis of Experimental Results

4.3.1. Ablation Experiment

To verify the effectiveness of the proposed model, ablation experiments were conducted using the original YOLOv5 as the base network. The experimental results for the different models on the test set are presented in Table 4. Experiment 1 trained the traffic light dataset with the original YOLOv5, yielding a precision (P) of 93.6%, a recall (R) of 92.2%, an mAP of 93.1%, and an FPS of 53. In experiment 2, Mosaic-9 was applied to augment the data from experiment 1, resulting in a P of 96.6%, an R of 97.3%, an mAP of 97.2%, and an FPS of 65. Experiment 3 then introduced the SE attention mechanism on top of experiment 2 to enhance detection accuracy, achieving a P of 97.6%, an R of 98.2%, an mAP of 98.1%, and an FPS of 67. Finally, experiment 4 added the EIoU loss function to experiment 3 to mitigate missed and false detections, leading to a P of 99.5%, an R of 98.9%, an mAP of 99.4%, and an FPS of 74.
In summary, the enhanced algorithm proposed in this paper achieves a detection accuracy of 99.4%, surpassing the original YOLOv5 algorithm by 6.3%. Additionally, the detection speed reaches 74 f/s, meeting the real-time requirements.

4.3.2. Contrast Experiment

To further validate the efficiency of the proposed algorithm, we compared the improved algorithm with the Faster R-CNN, YOLOv3, YOLOv4, SSD, YOLOv6, and YOLOv7 algorithms. The experimental results are presented in Table 5.
Table 5 shows that our improved algorithm, named YOLOv5-MSE, exhibits a significantly faster speed and a 7.9% increase in mAP compared to Faster R-CNN. It also demonstrates improved mAP values compared to YOLOv3 and YOLOv4. While SSD achieves a detection speed of 75 f/s, its accuracy is lower than that of the proposed algorithm. Compared to YOLOv6 and YOLOv7, the overall performance gap is minimal. In summary, the enhanced model achieves an mAP of 99.4% with a parameter size of 42.3 MB and a detection speed of 74 f/s; it is well suited for deployment on embedded devices to meet real-time detection requirements, making it particularly suitable for traffic light recognition.

4.3.3. Comparison of Experimental Results

To visually demonstrate the recognition performance of the enhanced traffic light model, a comparative experiment was conducted to detect traffic lights under various scenarios. In each comparison, the left image shows the result of the original YOLOv5 model, while the right image shows the improved model. The test results are presented in Figure 7, covering four scenarios: normal traffic lights, strong light, long distance, and complex background. In Figure 7a, normal traffic lights are detected with improved accuracy compared to the original YOLOv5 model. In Figure 7b, the left image shows the detection result of the original YOLOv5 model under strong light, whereas the right image demonstrates a significant enhancement in detection accuracy. In Figure 7c, the model accurately identifies targets over long distances with high confidence. In Figure 7d, the proposed algorithm accurately detects the target against a complex background, with improved accuracy.
In general, the reason why the proposed algorithm can achieve improved performance mainly lies in three aspects. First and foremost, the Mosaic-9 method is utilized to enhance the training set for better adaptation. Secondly, the SE attention mechanism is incorporated into the network to emphasize the target features. Last but not least, EIoU_loss is employed to optimize the training model and address missing and false detection.

5. Conclusions

In this paper, an improved YOLOv5 traffic signal-detection algorithm is proposed to address issues of missed detection, false detection, low accuracy, and excessive model parameters in traffic signal detection. This method enhances the training dataset using the Mosaic-9 method, improving network generalization for better adaptation to real-world scenarios. Additionally, the SE attention mechanism is integrated to enhance detection effectiveness, and EIoU_Loss is introduced to replace the original loss function, addressing missed and false detections while ensuring accuracy. The experimental results show that the improved algorithm achieves an mAP value of 99.4%, which is 6.3% higher than the original YOLOv5, with a detection speed of 74 f/s, balancing real-time performance and accuracy. Future work will focus on identifying algorithms that are more suitable for detecting traffic lights in complex urban scenes.

Author Contributions

Data curation, T.H. and Y.Z.; methodology, J.S. and M.C.; software, M.C. and Z.G.; validation, T.H.; writing—original draft, T.H.; writing—review and editing, J.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Postgraduate Research & Practice Innovation Program of Jiangsu Province (grant number: SJCX24_0384) and the college student innovation and entrepreneurship training program of Jiangsu Province (grant number: 202410298057Z).

Data Availability Statement

The data that support the findings of this study are available from the author Jun Song ([email protected]) upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Lin, Y.H.; Wang, Y.S. Modular Learning: Agile Development of Robust Traffic Sign Recognition. IEEE Trans. Intell. Veh. 2024, 9, 764–774.
2. Wang, Q.; Li, X.; Lu, M. An Improved Traffic Sign Detection and Recognition Deep Model Based on YOLOv5. IEEE Access 2023, 11, 54679–54691.
3. Wang, J.; Chen, Y.; Ji, X.; Dong, Z.; Gao, M.; Lai, C.S. Vehicle-Mounted Adaptive Traffic Sign Detector for Small-Sized Signs in Multiple Working Conditions. IEEE Trans. Intell. Transp. Syst. 2024, 25, 710–724.
4. Dharnesh, K.; Prramoth, M.M.; Sivabalan, M.A.; Sivraj, P. Performance Comparison of Road Traffic Sign Recognition System Based on CNN and Yolov5. In Proceedings of the 2023 Innovations in Power and Advanced Computing Technologies (i-PACT), Kuala Lumpur, Malaysia, 8–10 December 2023; pp. 1–6.
5. Chen, Y.; Chungui, L.; Bo, H. An improved feature point extraction algorithm for field navigation. J. Guangxi Univ. Technol. 2018, 29, 71–76.
6. Omachi, M.; Omachi, S. Traffic Light Detection with Color and Edge Information. In Proceedings of the IEEE International Conference on Computer Science and Information Technology, Beijing, China, 8–11 August 2009; pp. 284–287.
7. Wang, X.; Cheng, X.; Wu, X.; Zhou, H.; Chen, X.; Wang, L. Design of traffic light identification scheme based on TensorFlow and HSV color space. J. Phys. Conf. Ser. 2018, 1074, 012081.
8. Zhu, Z.; Li, C.; Li, W.; Huang, W. The defect detection algorithm of vehicle injector seat based on Faster R-CNN model is improved. J. Guangxi Univ. Technol. 2020, 31, 1–10.
9. Tian, Y.; Gelernter, J.; Wang, X.; Chen, W.; Gao, J.; Zhang, Y.; Li, X. Lane marking detection via deep convolutional neural network. Neurocomputing 2018, 280, 46–55.
10. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
11. Pan, W.; Chen, Y.; Liu, B.; Shi, H. Traffic light detection and recognition based on Faster-RCNN. Sens. Micro-Syst. 2019, 38, 147–149+160.
12. Li, H. Research on Traffic Signal Detection Algorithm Based on Deep Learning in Complex Environment. Master's Thesis, Zhengzhou University, Zhengzhou, China, 2018.
13. Qian, H.; Wang, L.; Mou, H. Fast detection and identification of traffic lights based on deep learning. Comput. Sci. 2019, 46, 272–278.
14. Ju, M.R.; Luo, H.B.; Wang, Z.B.; He, M.; Chang, Z.; Hui, B. Improved YOLOv3 algorithm and its application in small target detection. Acta Opt. 2019, 39, 253–260.
15. Yan, S.; Liu, X.; Qian, W.; Chen, Q. An end-to-end traffic light detection algorithm based on deep learning. In Proceedings of the 2021 International Conference on Security, Pattern Analysis, and Cybernetics (SPAC), Chengdu, China, 18–20 June 2021; pp. 370–373.
16. Garg, H.; Bhartee, A.K.; Rai, A.; Kumar, M.; Dhakrey, A. A Review of Object Detection Algorithms for Autonomous Vehicles: Trends and Developments. In Proceedings of the 5th International Conference on Advances in Computing, Communication Control and Networking (ICAC3N), Greater Noida, India, 15–16 December 2023; pp. 1173–1181.
17. Zhang, C.; Hu, X.; Niu, H. Research on vehicle target detection based on improved YOLOv5. J. Sichuan Univ. (Nat. Sci. Ed.) 2022, 59, 79–87.
18. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916.
19. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 936–944.
20. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768.
21. Wang, P.; Huang, H.; Wang, M. Complex road target detection algorithm based on improved YOLOv5. Comput. Eng. Appl. 2022, 58, 81–92.
22. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
23. Cira, C.-I.; Díaz-Álvarez, A.; Serradilla, F.; Manso-Callejo, M. Convolutional Neural Networks Adapted for Regression Tasks: Predicting the Orientation of Straight Arrows on Marked Road Pavement Using Deep Learning and Rectified Orthophotography. Electronics 2023, 12, 3980.
24. Guo, S.; Li, L.; Guo, T.; Cao, Y.; Li, Y. Research on Mask-Wearing Detection Algorithm Based on Improved YOLOv5. Sensors 2022, 22, 4933.
Figure 1. YOLOv5 network structure [17].
Figure 2. Feature pyramid network (FPN) [19] and path aggregation network (PAN) [20].
Figure 3. The overall structure of the improved model.
Figure 4. Mosaic-9 data-enhancement flowchart.
Figure 5. Squeeze-and-Excitation attention mechanism.
Figure 6. Sample traffic signal datasets.
Figure 7. Comparison of the model detection results across various traffic scenarios. (a) Traffic lights with normal background; (b) traffic lights with strong light background; (c) traffic lights in the distance; (d) traffic lights with complex background.
Table 1. Distribution of categories in traffic signal datasets.

Label Category    Red    Green    Yellow
Number            758    740      689
Table 2. Platform configuration parameters.

Configuration Name         Version Parameter
Operating system           Windows 10
Deep learning framework    PyTorch 1.12.1
GPU                        NVIDIA GeForce GTX 1650
CUDA                       11.4.0
Programming language       Python 3.9.2
Table 3. Model parameter settings.

Epochs    Batch_Size    Initial Learning Rate
400       8             0.01
Table 4. Ablation experiment.

Experiment      Model                           Precision (P)/%    Recall (R)/%    mAP/%    FPS/(frame·s⁻¹)
Experiment 1    YOLOv5                          93.6               92.2            93.1     53
Experiment 2    YOLOv5 + Mosaic-9               96.6               97.3            97.2     65
Experiment 3    YOLOv5 + Mosaic-9 + SE          97.6               98.2            98.1     67
Experiment 4    YOLOv5 + Mosaic-9 + SE + EIoU   99.5               98.9            99.4     74
Table 5. Comparative experimental results.

Models                    mAP/%    FPS/(frame·s⁻¹)    Params/MB
Faster R-CNN              91.5     53                 74.4
YOLOv3                    92.6     58                 68.4
YOLOv4                    92.1     67                 50
SSD                       93.5     75                 43.5
YOLOv6                    98.2     73                 41.3
YOLOv7                    99.3     73                 41.5
Our improved algorithm    99.4     74                 42.3


