Article

An Enhanced Deep Learning Model for Obstacle and Traffic Light Detection Based on YOLOv5

School of Medical Technology and Engineering, Henan University of Science and Technology, Luoyang 471023, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(10), 2228; https://doi.org/10.3390/electronics12102228
Submission received: 13 April 2023 / Revised: 12 May 2023 / Accepted: 12 May 2023 / Published: 14 May 2023
(This article belongs to the Section Artificial Intelligence)

Abstract

Timely detection of dynamic and static obstacles and accurate identification of signal lights using image processing techniques is one of the key technologies for guidance robots and is a necessity to assist blind people with safe travel. Due to the complexity of real-time road conditions, current obstacle and traffic light detection methods generally suffer from missed detection and false detection. In this paper, an improved deep learning model based on YOLOv5 is proposed to address the above problems and to achieve more accurate and faster recognition of different obstacles and traffic lights that the blind may encounter. In this model, a coordinate attention layer is added to the backbone network of YOLOv5 to improve its ability to extract effective features. Then, the feature pyramid network in YOLOv5 is replaced with a weighted bidirectional feature pyramid structure to fuse the extracted feature maps of different sizes and obtain more feature information. Finally, an SIoU loss function is introduced to add an angle term to the bounding-box regression calculation. The proposed model’s detection performance for pedestrians, vehicles, and traffic lights under different conditions is tested and evaluated using the BDD100K dataset. The results show that the improved model can achieve higher mean average precision and better detection ability, especially for small targets.

1. Introduction

At present, there are more than 2.2 billion visually impaired people worldwide, a very large vulnerable group. The vast majority of visually impaired people carry out their daily lives within a limited area because they encounter many “obstacles” when traveling, such as pedestrians, stalled vehicles, road signs, and safety facilities. These obstacles can injure them and may, in severe cases, lead to their deaths. In recent years, researchers have continuously attempted to use guide robots to help blind people navigate and improve their quality of life. Multi-form and multifunctional guide robots continue to enter the public eye, among which the most representative are mobile guidance robots [1,2,3]. Timely detection of dynamic and static obstacles and accurate identification of signal lights is one of the key technologies for guidance robots and is a necessity to assist blind people with safe travel [4]. There are many obstacle detection techniques in the area of guide robots, for example, ultrasonic ranging, infrared ranging, impact sensing [1], and image sensing [5]. Obstacle detection based on ultrasonic ranging and infrared ranging can easily locate an obstacle, but these methods are easily affected by rain and fog and cannot obtain an actual image of the detection target. Impact sensors can only detect objects that are very close. These shortcomings limit the development and practical application of these three types of obstacle detection techniques.
With the development of neural-network-related technologies, obstacle detection and target identification based on image sensing using deep learning has made a breakthrough. Obstacle detection algorithms based on deep learning fall into two categories: the two-stage algorithms of the R-CNN series and the one-stage algorithms of the YOLO series [6]. The R-CNN algorithms can be divided into two stages: the first stage mainly extracts candidate target regions, and the second stage performs target detection on them. Two-stage obstacle detection algorithms include the SPP-Net model, Fast R-CNN, and, further, Faster R-CNN [7,8,9]. Since the obstacle detection algorithms of the R-CNN family need to traverse all candidate frames when detecting obstacles, which takes a long time, they are unable to meet the needs of real-time applications. The YOLO algorithm proposed by Redmon is a one-stage obstacle detection algorithm that directly extracts the feature information of the image through the neural network model and outputs the result at the last layer of the model [10]. The detection efficiency of the YOLO algorithm is higher than that of the R-CNN series of target detection algorithms, but some target features may be lost in its detection process, which lowers its accuracy. One-stage obstacle detection models include the YOLOv2, YOLOv3, YOLOv4, YOLOv5, YOLOv6, YOLOv7, and SSD network structures [11,12,13,14,15,16]. In terms of obstacle detection for guiding the blind, Qiu Xuyang [17] adopted the DenseNet network to improve YOLOv3 and combined it with a stereo matching model to realize obstacle detection in traffic scenes. Duan Zhongxing [18] used the deep learning model YOLOv4 for blind path obstacle recognition, introduced asymmetric convolution and other modules to improve the network, and achieved higher accuracy in a practical scenario test. Wang Weifeng [19] adopted the method of increasing the receptive field to realize target recognition with the SSD network and improved the detection speed. Jiang and Yu proposed a pedestrian detection method based on XGBoost, which optimizes the XGBoost model with a genetic algorithm [20]. Xin Huang realized the recognition of crowded pedestrians using a novel Paired-Box Model (PBM) [21]. Ouyang realized real-time traffic light detection on the road through a CNN [22]. Vasiliki Balaska designed a system for generating enhanced semantic maps to detect obstacles in ground and satellite images [23]. Avinash Padmakar Rangari used YOLOv7 to design an intelligent traffic light system that enables rapid traffic flow by detecting vehicles, pedestrians, and obstacles on the road [24]. Yan Zhou proposed an improved Faster-RCNN obstacle detection algorithm to recognize small targets and occluded targets in automatic driving scenarios [25]. Mazin Hnewa designed a new multiscale domain-adaptive MS-DAYOLO for object detection [26]. Shuiye Wu proposed a YOLOX-based network model for multi-scale object detection tasks in complex scenes [27]. To help reduce accidents in advanced driver assistance systems, Vicent Ortiz Castelló replaced the original Leaky ReLU activation function of the YOLO implementation with cutting-edge activation functions in the YOLOv4 network to improve detection performance [28].
When identifying obstacles in front of the blind, the obstacle detection algorithms mentioned above often suffer from missed and false detections because of the diverse types of obstacles and complex conditions, such as occlusion, low contrast between target and background, and small target size.
In this paper, an improved obstacle detection model based on YOLOv5 is proposed to address the above problems and to achieve more accurate and faster recognition of different obstacles and traffic lights that the blind may encounter. The main contributions of this paper are as follows:
  • An improved deep learning model based on YOLOv5 is proposed. A coordinate attention layer is added to the backbone network of YOLOv5 to improve its ability to extract effective features from the target image. The feature pyramid network in YOLOv5 is replaced with a weighted bidirectional feature pyramid structure to fuse the extracted feature maps of different sizes and obtain more feature information. An SIoU loss function is introduced to add an angle term to the box regression calculation, accelerate the convergence of the box regression, and improve the mean average precision.
  • The proposed model’s detection performance for pedestrians, vehicles, and traffic lights under different conditions is tested and evaluated using the BDD100K database [29]. The results show that the improved model can achieve higher mean average precision and better detection ability, especially for small targets.

2. Methodology

In YOLOv5, the CSPDarknet53 backbone network is used to extract image features and includes Conv-BatchNorm-SiLU (CBS), Spatial Pyramid Pooling—Fast (SPPF), and cross-stage partial network modules. The CBS structure is composed of convolution, batch normalization, and the SiLU activation function. The main role of SPPF is to extract high-level features and then fuse them. During the fusion process, the max-pooling operation is applied several times to extract as many high-level semantic features as possible. The Cross-Stage Partial (CSP) modules used in the backbone network incorporate residual structures to extract high-level features more effectively [30]. In addition, CSP modules without residual structures are also used in the feature fusion module. The cross-stage network can describe the change in the image gradient in the feature map, which reduces the network parameters and maintains the speed without losing precision [31]. The prediction part of the YOLOv5 model uses feature maps of three scales to generate prediction boxes for targets in the image and uses a non-maximum suppression method to obtain the box that is most similar to the real box [32].
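For illustration, the following is a minimal PyTorch sketch of the CBS and SPPF blocks described above. It is written in the spirit of the YOLOv5 backbone rather than copied from the authors' code; the channel sizes, argument names, and default pooling kernel are assumptions.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Conv + BatchNorm + SiLU: the basic block of the CSPDarknet53 backbone."""
    def __init__(self, c_in, c_out, k=1, s=1, p=None):
        super().__init__()
        p = k // 2 if p is None else p  # same-padding by default
        self.conv = nn.Conv2d(c_in, c_out, k, s, p, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class SPPF(nn.Module):
    """Spatial Pyramid Pooling - Fast: repeated max pooling followed by fusion
    of the pooled maps through concatenation and a 1x1 CBS block."""
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = CBS(c_in, c_hidden, 1, 1)
        self.cv2 = CBS(c_hidden * 4, c_out, 1, 1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))
```

Applying the pooling layer repeatedly with a single kernel size emulates a pyramid of receptive fields while keeping the spatial resolution unchanged, which is why the pooled maps can simply be concatenated and fused.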
In this paper, we propose the improved YOLOv5 algorithm to solve problems such as missed detection and false detection of different obstacles and traffic lights under different conditions. By improving the backbone network and replacing the feature fusion network and loss function, the network model is more suitable for road-object detection and recognition in practical application scenarios. The structure of the improved YOLOv5 model is shown in Figure 1.

2.1. Coordinate Attention Module

In recent years, the effectiveness of the attention mechanism in computer vision has been proven, and it has been widely used in target classification [33], detection [34], and segmentation [35]. The attention mechanism can help the network model pay more attention to the feature and location information of the region of interest and improve the performance of the model. It is worth noting that most attention mechanisms require considerable computation, which means longer runtimes and larger computing devices. The coordinate attention (CA) mechanism avoids the computational overhead incurred by most attention mechanisms, and it can embed the location information into the channel attention to help the model extract features containing channel information, direction information, and location information [36,37].
The feature map is globally average-pooled along the width and height dimensions and then passed through convolution and nonlinear activation functions, adding the location information of the features to the channel attention. The coordinate-attention-weighted feature is calculated as shown in Equation (1):
$$y_c(i,j) = x_c(i,j) \times g_c^h(i) \times g_c^w(j), \tag{1}$$
where $g_c^h$ and $g_c^w$ are the attention weights obtained in the height and width directions of the feature map, respectively, $x_c$ is the input feature, and $y_c$ is the output feature.
The backbone network of the YOLOv5 algorithm uses the CSP-Darknet53 network structure for feature extraction, which is based on the introduction of the CSPNet network structure on top of Darknet53. To further improve the feature extraction capability of the backbone network and ensure a reduction in computation and the improvement of the inference speed without a decrease in detection accuracy, a coordinate attention network structure is added. The CA module is shown in the upper right of Figure 1.
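As a concrete reference, below is a minimal PyTorch sketch of a coordinate attention block in the spirit of Hou et al. [36] and Equation (1). The reduction ratio, the shared 1×1 convolution, and the SiLU activation are assumptions for illustration, not details taken from the authors' implementation.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Coordinate attention: pool along height and width separately, then
    re-weight the input with direction-aware attention maps (Equation (1))."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        hidden = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # -> (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # -> (B, C, 1, W)
        self.conv1 = nn.Conv2d(channels, hidden, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(hidden)
        self.act = nn.SiLU()
        self.conv_h = nn.Conv2d(hidden, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(hidden, channels, kernel_size=1)

    def forward(self, x):
        _, _, h, w = x.shape
        x_h = self.pool_h(x)                          # (B, C, H, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)      # (B, C, W, 1)
        y = torch.cat([x_h, x_w], dim=2)              # shared 1x1 conv over both directions
        y = self.act(self.bn1(self.conv1(y)))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        g_h = torch.sigmoid(self.conv_h(y_h))                      # (B, C, H, 1)
        g_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (B, C, 1, W)
        # y_c(i, j) = x_c(i, j) * g_c^h(i) * g_c^w(j)
        return x * g_h * g_w
```

Because the two attention maps are one-dimensional (one per spatial direction), the extra computation over plain channel attention is small, which is consistent with the comparison reported later in Table 2.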

2.2. Bidirectional Feature Pyramid Network Model

The original YOLOv5 uses the classical approach of combining a feature pyramid network (FPN) and path aggregation network (PANet) for feature fusion [38,39]. The FPN algorithm extracts deep semantic features from the top down, while the PANet algorithm obtains target location information from the bottom up. By integrating the information obtained from these two algorithms, the information about features is increased and the sensitivity of the network is improved. Compared with the classical approach of combining FPN and PANet, the bidirectional feature pyramid network (BiFPN) adds cross-layer connections and removes nodes with only one input edge [40]. Multiple operations can be performed on the same layer to realize feature fusion at a higher level. BiFPN is simpler and faster in multi-scale feature fusion than PANet because it requires fewer parameters and calculations. The BiFPN structure can make the prediction network more sensitive to objects with different resolutions and improve the performance of the detection network. Therefore, this paper adopts the BiFPN structure to ensure that the model can enhance the semantic and localization information of the features and improve the detection rate at the same time. The BiFPN structure is shown in Figure 2.
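The core of BiFPN is the weighted (fast normalized) fusion applied at each node before the cross-layer connections. The sketch below illustrates that fusion step only; the node wiring, channel count, and epsilon value are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Fast normalized fusion used at each BiFPN node: one learnable,
    non-negative weight per input feature map, normalized to sum to ~1."""
    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, feats):
        # feats: list of tensors with identical shape (resized/projected beforehand)
        w = torch.relu(self.weights)          # keep weights non-negative
        w = w / (w.sum() + self.eps)          # normalize without a softmax
        return sum(w[i] * feats[i] for i in range(len(feats)))

# Example: fusing a top-down feature with a lateral backbone feature of the same size
fuse = WeightedFusion(num_inputs=2)
p4_td = fuse([torch.randn(1, 256, 40, 40), torch.randn(1, 256, 40, 40)])
```

Learning a scalar weight per input lets the network express how much each resolution contributes at every fusion node, which is the property that makes BiFPN more sensitive to objects of different sizes than an unweighted FPN + PANet combination.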

2.3. Loss Function

The original YOLOv5 model uses CIoU as its loss function [41]. In bounding box regression, the CIoU loss function considers the distance between the center points of the real frame and the predicted frame, the aspect ratio, and the area of the overlapping part of the two frames, but it does not consider the regression direction between the real frame and the predicted frame, resulting in slow convergence and low efficiency. The SIoU loss function introduces the vector angle between the required regressions, redefines the penalty metric, and adds the angle to the calculation of the distance between the real frame and the predicted frame [42]. Compared to CIoU, the SIoU loss function achieves faster convergence and higher computational efficiency. Therefore, this paper introduces SIoU to replace CIoU.
The SIoU loss function is as follows:
$$L_{SIoU} = 1 - IoU + \frac{\Delta + \Omega}{2},$$
where $\Delta$ and $\Omega$ represent the distance cost and the shape cost, respectively. $\Delta$ is defined as:
$$\Delta = \sum_{t=x,y}\left(1 - e^{-\gamma \rho_t}\right),$$
where
$$\rho_x = \left(\frac{b_{c_x}^{gt} - b_{c_x}}{c_w}\right)^2, \qquad \rho_y = \left(\frac{b_{c_y}^{gt} - b_{c_y}}{c_h}\right)^2,$$
$$\gamma = 2 - \Lambda,$$
where $\Lambda$ represents the angle cost:
$$\Lambda = 1 - 2\sin^2\left(\arcsin(x) - \frac{\pi}{4}\right);$$
$\Omega$ is defined as:
$$\Omega = \sum_{t=w,h}\left(1 - e^{-\omega_t}\right)^{\theta}.$$
The SIoU loss function improves both the training and the inference of the target detection algorithm, helping the model converge faster and regress boxes more accurately.
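To make the computation concrete, here is a hedged PyTorch sketch of an SIoU loss for axis-aligned boxes, following the formulas above and Gevorgyan [42]. The box format, the shape-cost exponent θ = 4, and the numerical-stability constants are assumptions, not the authors' exact implementation.

```python
import torch

def siou_loss(pred, target, theta=4.0, eps=1e-7):
    """SIoU loss for boxes in (x1, y1, x2, y2) format; pred and target are (N, 4)."""
    # IoU term
    inter_w = (torch.min(pred[:, 2], target[:, 2]) - torch.max(pred[:, 0], target[:, 0])).clamp(0)
    inter_h = (torch.min(pred[:, 3], target[:, 3]) - torch.max(pred[:, 1], target[:, 1])).clamp(0)
    inter = inter_w * inter_h
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Centers, widths/heights, and the smallest enclosing box (c_w, c_h)
    px, py = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    tx, ty = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    pw, ph = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    tw, th = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])

    # Angle cost: x is the normalized vertical offset between the two centers
    sigma = torch.sqrt((tx - px) ** 2 + (ty - py) ** 2) + eps
    x = (torch.abs(ty - py) / sigma).clamp(0, 1 - eps)
    angle = 1 - 2 * torch.sin(torch.arcsin(x) - torch.pi / 4) ** 2

    # Distance cost, with gamma = 2 - angle cost
    gamma = 2 - angle
    rho_x = ((tx - px) / (cw + eps)) ** 2
    rho_y = ((ty - py) / (ch + eps)) ** 2
    distance = (1 - torch.exp(-gamma * rho_x)) + (1 - torch.exp(-gamma * rho_y))

    # Shape cost
    omega_w = torch.abs(pw - tw) / torch.max(pw, tw)
    omega_h = torch.abs(ph - th) / torch.max(ph, th)
    shape = (1 - torch.exp(-omega_w)) ** theta + (1 - torch.exp(-omega_h)) ** theta

    return 1 - iou + (distance + shape) / 2
```

The angle cost steers the predicted center toward the nearest axis of the ground-truth center first, which is what allows the distance term to shrink along a shorter path and gives the faster convergence described above.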

3. Experiment Results and Discussion

3.1. BDD100K Dataset

The dataset used in this experiment is the road detection dataset BDD100K, released by the University of California, Berkeley, in 2018, which contains 100,000 images of the streets of several different cities. The images contain common static and dynamic targets, such as traffic lights, pedestrians, motorcycles, and cars. The images in the dataset have different levels of clarity to ensure their diversity. Considering that the main obstacles encountered by blind people while traveling are pedestrians, vehicles, and traffic lights, we select these three types of obstacles as the detection targets in the experiment and randomly split the data into a training set and a test set at a ratio of 8:2.
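A minimal sketch of such a random 8:2 split is shown below; the image identifiers, the fixed seed, and the prior filtering to the three target classes are purely illustrative assumptions and do not reflect the authors' exact data pipeline.

```python
import random

# Hypothetical list of BDD100K image IDs after keeping only the three target
# classes (pedestrians, vehicles, traffic lights); names are illustrative.
image_ids = [f"img_{i:06d}" for i in range(100_000)]

random.seed(0)                        # fixed seed for a reproducible split
random.shuffle(image_ids)

split = int(0.8 * len(image_ids))     # 8:2 train/test ratio
train_ids, test_ids = image_ids[:split], image_ids[split:]
print(len(train_ids), len(test_ids))  # 80000 20000
```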

3.2. Evaluation Metrics

In this paper, IoU, Precision, Recall, and mAP50 are used to assess the performance. IoU refers to the area overlap ratio between the detected box and the ground-truth box and can be expressed as:
$$IoU = \frac{|A \cap B|}{|A \cup B|}.$$
Precision and Recall are defined as:
$$Precision = \frac{TP}{TP + FP},$$
$$Recall = \frac{TP}{TP + FN},$$
where $TP$, $FP$, and $FN$ stand for true positives, false positives, and false negatives, respectively. The mean average precision ($mAP$) can be calculated as follows:
$$mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i,$$
where $N$ denotes the number of classes and $AP_i$ is the average precision of class $i$. AP50 denotes the average precision at $IoU = 0.5$ and is one of the main indicators used to compare the performance of different models.
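For illustration, the following Python sketch computes IoU for a pair of boxes and Precision/Recall from counts. The 0.5 IoU threshold mirrors the AP50 criterion; the helper names and example numbers are assumptions for demonstration only.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN)."""
    return tp / (tp + fp + 1e-9), tp / (tp + fn + 1e-9)

# A detection counts as a true positive when its IoU with a ground-truth box
# is at least 0.5, which is the threshold behind the AP50 / mAP50 metric.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))    # ~0.143 -> would not count at IoU 0.5
p, r = precision_recall(tp=80, fp=20, fn=10)  # 0.8, ~0.889 (illustrative counts)
```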

3.3. Results and Discussion

All experiments in this paper were conducted with the same hardware and training configuration: an AMD 5950X CPU, 64 GB RAM, and an NVIDIA GeForce RTX 3090 24 GB GPU. The network was built with the PyTorch 1.12 framework and an Anaconda Python 3.7 interpreter. The model was trained for 300 epochs, and its precision and mean average precision were calculated.

3.3.1. Detection Results

We evaluated the results of the proposed method and eight other representative methods using the mAP and AP50 indicators on the same BDD100K dataset. Table 1 shows the Precision and AP50 of the detection results for the three detection categories, pedestrians, vehicles, and traffic lights, with the proposed model. It can be seen from Table 1 that vehicles have the highest detection precision at 84.60%, while traffic lights have the lowest at only 69.20%. The AP50s of the three categories are 68.10%, 80.60%, and 55.20%, respectively. Similarly, vehicles have the highest AP50, while traffic lights have the lowest. After analyzing the dataset, we found that the number of yellow traffic light labels was only 510, about 1/200 of the number of vehicle labels, which affects the overall detection accuracy of the traffic lights. Meanwhile, the traffic lights in the images are much smaller than the pedestrians and vehicles, which, to some extent, makes them harder to detect. In brief, insufficient training data and small size in the images are the two main factors that affect the accuracy of traffic light detection.
The computational complexity of the CA module was evaluated through comparative experiments. In the experiments, a Convolutional Block Attention Module (CBAM), Squeeze Excitation (SE), and a CA module were added to the YOLOv5 model in turn, and then their parameters, GFLOPs, and mAP were calculated. Table 2 gives the comparison results. It can be seen from the table that the CA module in this paper results in higher predictive performance with less computational cost compared with CBAM and SE.
Table 3 gives the AP50 and mAP of the proposed method and eight other classical object detection algorithms. For AD-Faster-RCNN [25], SSD [16], YOLOv4-416 [28], and YOLOv4, as there are no detection data for traffic lights, only their pedestrian and vehicle detection results are provided, and the number of detection categories is N = 2. For MS-DAYOLO [26], YOLOv5, YOLOv6, YOLOv7-tiny [27], and the method in this paper, N = 3. From Table 3, the AP50s of the proposed model for pedestrian, vehicle, and traffic light detection reach 68.1%, 80.6%, and 55.2%, respectively, which are the best results in the comparative experiments. Compared with the original YOLOv5 model, the AP50s of pedestrian, vehicle, and traffic light detection of our method have been increased by 11.9%, 4.6%, and 8.17%, respectively, and the mAP increases by 8.23%, demonstrating the effectiveness of the improved model. Among YOLOv5, YOLOv6, YOLOv7-tiny, and the method in this paper, our model has the highest inference time, but it is still real-time at 4.7 ms, an increase of 1.3 ms compared with the YOLOv5 model. It is wise to trade time for accuracy without compromising the real-time requirements of the model, since higher-precision detection models are more capable of sensing the environment.

3.3.2. Detection Performance under Special Circumstances

To test the performance of the model under special circumstances, such as occlusion, low contrast between target and background, and small target size, three different images for each type of detection target are randomly selected, and each image to be detected represents a situation. The detection results are shown in Figure 3, Figure 4 and Figure 5, and some details are enlarged using yellow rectangular boxes to better carry out subjective visual evaluation. The first column (Figure 3a,d,h, Figure 4a,d,h and Figure 5a,d,h) in each set of images represents the image to be detected, the second column (Figure 3b,e,i, Figure 4b,e,i and Figure 5b,e,i) represents the detection results of the original YOLOv5 model, and the third column (Figure 3c,f,j, Figure 4c,f,j and Figure 5c,f,j) represents the detection results of the improved model.
Figure 3 shows the comparison of vehicle detection results between the original YOLOv5 model and our model. The first row of images shows the detection ability of the proposed model for occluded targets. From Figure 3b, it can be seen that the original YOLOv5 mistakenly identified the parts of the two vehicles in the marked box on the left side of Figure 3b as one vehicle, while in Figure 3c, this false detection has been corrected. The second row of images shows the detection performance under low contrast between the background and the target. In Figure 3e, the gray and black car in the yellow box was not successfully detected, while in Figure 3f, it was accurately identified. The third row of images illustrates the comparison of detection capabilities for small targets, in this case, vehicles in the distance. Figure 3i is the detection result of the original YOLOv5; the vehicle in the distance within the identification box was not recognized, but it was correctly detected in Figure 3j using our model. Figure 4 shows the pedestrian detection results using the original YOLOv5 model and our model. The first row compares the detection results under normal conditions (good image quality), the second row provides the detection comparison of pedestrians in shadows (low image contrast), and the third row illustrates the detection of distant pedestrians (small targets). In Figure 4b, one pedestrian was detected as two pedestrians; in Figure 4e, the vehicle and background in the yellow box were incorrectly detected as pedestrians; in Figure 4i, only one of the two pedestrians in the yellow box was detected. All the false and missed detections in Figure 4b,e,i made by the original YOLOv5 model were corrected by our model, as can be seen in Figure 4c,f,j.
Accurately detecting traffic lights at a distance can help guidance robots plan their travel routes reasonably, thereby saving travel time for blind people. The main problem in long-distance traffic light detection is the missed and false detection caused by the small size of traffic lights in the image. Therefore, this paper takes the detection of traffic lights as one of the factors used to evaluate the performance of the proposed model. Figure 5 compares the long-distance traffic light detection performance of the original YOLOv5 model and our model. In Figure 5a, there are two small green lights in the yellow box, only one of which was recognized by the YOLOv5 model (as shown in Figure 5b). In Figure 5c, both green lights were detected using the proposed model. In the second row, the original YOLOv5 model mistakenly detected the yellow signal light in Figure 5d as red (Figure 5e), while the proposed model identified it correctly, as shown in Figure 5f. In the third row, the improved model successfully identified the red traffic light at the next intersection that appears in the identification box (Figure 5j), while the YOLOv5 model did not (Figure 5i).
From the above analysis, it can be seen that the improved YOLOv5 model can detect targets more accurately and effectively in complex environments.

3.3.3. Ablation Experiments

Ablation experiments were carried out to assess the contributions of the three modules to the performance of the improved model. The CA, BiFPN, and SIoU modules were removed from the proposed model separately. The experiments are designed as follows:
Experiment 1: Remove SIoU, keep CA and BiFPN;
Experiment 2: Remove BiFPN, keep CA and SIoU;
Experiment 3: Remove CA, keep BiFPN and SIoU;
Experiment 4: Keep all three modules.
The mAPs and AP50s of the four experiments are given in Table 4. It can be easily seen from the table that each innovative module in this paper contributes to the overall performance, and the mean average precision decreases when any one of them is removed. The CA module has the greatest impact on the proposed model. This is because coordinate attention can improve the model’s ability to filter important features, increase the proportion of effective features, and improve the model’s feature expression ability, thereby improving its detection performance. The ablation experiment results show that the combination of the CA module, the weighted BiFPN, and the SIoU loss function effectively improves the object recognition performance of the model as a whole.

4. Conclusions

This paper proposed an improved deep learning model based on YOLOv5 for the detection of static and dynamic obstacles and traffic lights. Considering the characteristics of the obstacles (targets) that blind people may encounter in real time when traveling, three different modules (CA, BiFPN, and the SIoU loss function) were introduced into the model to improve its detection ability. Detection precision tests, tests of detection performance under special conditions, and ablation experiments were conducted. The detection precision tests showed that the AP50s of pedestrian, vehicle, and traffic light detection of our method increased by 11.9%, 4.6%, and 8.17%, respectively, compared with the original YOLOv5 model, and the mAP increased by 8.23%, demonstrating the effectiveness of the improved model. The tests of detection performance under special conditions showed that the proposed model detects targets more accurately in complex environments and can effectively detect small targets, for example, long-distance pedestrians and signal lights. The ablation experiment results showed that each module contributed to the overall performance, and the mean average precision decreased when a module was removed.

Author Contributions

Conceptualization, Z.L. and W.Z.; methodology, W.Z.; formal analysis, Z.L., X.Y. and W.Z.; investigation, W.Z.; data curation, W.Z. and Z.L.; writing, Z.L. and W.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

All data generated from this study are included in this published article. Raw data are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Mori, H.; Kotani, S.; Saneyoshi, K.; Sanada, H.; Kobayashi, Y.; Mototsune, A.; Nakata, T. The matching fund project for practical use of robotic travel aid for the visually impaired. Adv. Robot. 2004, 18, 453–472. [Google Scholar] [CrossRef]
  2. Kumar, K.M.A.; Krishnan, A. Remote controlled human navigational assistance for the blind using intelligent computing. In Proceedings of the Mediterranean Symposium on Smart City Application, New York, NY, USA, 25–27 October 2017; pp. 1–4. [Google Scholar]
  3. Nanavati, A.; Tan, X.Z.; Steinfeld, A. Coupled indoor navigation for people who are blind. In Proceedings of the Companion of the 2018 ACM/IEEE International Conference on Human-Robot Interaction, Chicago, IL, USA, 5–8 March 2018; pp. 201–202. [Google Scholar]
  4. Keroglou, C.; Kansizoglou, I.; Michailidis, P.; Oikonomou, K.M.; Papapetros, I.T.; Dragkola, P.; Michailidis, I.T.; Gasteratos, A.; Kosmatopoulos, E.B.; Sirakoulis, G.C. A Survey on Technical Challenges of Assistive Robotics for Elder People in Domestic Environments: The ASPiDA Concept. IEEE Trans. Med. Robot. Bionics 2023. [Google Scholar] [CrossRef]
  5. Elachhab, A.; Mikou, M. Obstacle Detection Algorithm by Stereoscopic Image Processing. In Proceedings of the Ninth International Conference on Soft Computing and Pattern Recognition (SoCPaR 2017), Marrakech, Morocco, 11–13 December 2018; pp. 13–23. [Google Scholar]
  6. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  7. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef] [PubMed]
  8. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 11–18 December 2015; pp. 1440–1448. [Google Scholar]
  9. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28. [Google Scholar] [CrossRef]
  10. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  11. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  12. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  13. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  14. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  15. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696. [Google Scholar]
  16. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  17. Qiu, X.; Huang, Y.; Guo, Z.; Hu, X. Obstacle detection and depth estimation using deep learning approaches. J. Univ. Shanghai Sci. Technol. 2020, 42, 558–565. [Google Scholar]
  18. Duan, Z.; Wang, J.; Ding, Q.; Wen, Q. Research on Obstacle Detection Algorithm of Blind Path Based on Deep Learning. Comput. Meas. Control 2021, 29, 27–32. [Google Scholar]
  19. Wang, W.; Jin, J.; Chen, J. Rapid Detection Algorithm for Small Objects Based on Receptive Field Block. Laser Optoelectron. Progress 2020, 57, 021501. [Google Scholar] [CrossRef]
  20. Jiang, Y.; Tong, G.; Yin, H.; Xiong, N. A pedestrian detection method based on genetic algorithm for optimize XGBoost training parameters. IEEE Access 2019, 7, 118310–118321. [Google Scholar] [CrossRef]
  21. Huang, X.; Ge, Z.; Jie, Z.; Yoshie, O. Nms by representative region: Towards crowded pedestrian detection by proposal pairing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10750–10759. [Google Scholar]
  22. Ouyang, Z.; Niu, J.; Liu, Y.; Guizani, M. Deep CNN-based real-time traffic light detector for self-driving vehicles. IEEE Trans. Mob. Comput. 2019, 19, 300–313. [Google Scholar] [CrossRef]
  23. Balaska, V.; Bampis, L.; Kansizoglou, I.; Gasteratos, A. Enhancing satellite semantic maps with ground-level imagery. Robot. Auton. Syst. 2021, 139, 103760. [Google Scholar] [CrossRef]
  24. Rangari, A.P.; Chouthmol, A.R.; Kadadas, C.; Pal, P.; Singh, S.K. Deep Learning based smart traffic light system using Image Processing with YOLO v7. In Proceedings of the 2022 4th International Conference on Circuits, Control, Communication and Computing (I4C), Bangalore, India, 21–23 December 2022; pp. 129–132. [Google Scholar]
  25. Zhou, Y.; Wen, S.; Wang, D.; Mu, J.; Richard, I. Object detection in autonomous driving scenarios based on an improved faster-RCNN. Appl. Sci. 2021, 11, 11630. [Google Scholar] [CrossRef]
  26. Hnewa, M.; Radha, H. Multiscale domain adaptive yolo for cross-domain object detection. In Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP), Anchorage, AK, USA, 19–22 September 2021; pp. 3323–3327. [Google Scholar]
  27. Wu, S.; Yan, Y.; Wang, W. CF-YOLOX: An Autonomous Driving Detection Model for Multi-Scale Object Detection. Sensors 2023, 23, 3794. [Google Scholar] [CrossRef] [PubMed]
  28. Castelló, V.O.; Igual, I.S.; del Tejo Catalá, O.; Perez-Cortes, J.C. High-Profile VRU Detection on Resource-Constrained Hardware Using YOLOv3/v4 on BDD100K. J. Imaging 2020, 6, 142. [Google Scholar] [CrossRef]
  29. Yu, F.; Xian, W.; Chen, Y.; Liu, F.; Liao, M.; Madhavan, V.; Darrell, T. Bdd100k: A diverse driving video database with scalable annotation tooling. arXiv 2018, arXiv:1805.04687. [Google Scholar]
  30. Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 390–391. [Google Scholar]
  31. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  32. Neubeck, A.; Van Gool, L. Efficient non-maximum suppression. In Proceedings of the 18th International Conference on Pattern Recognition (ICPR’06), Hong Kong, China, 20–24 August 2006; Volume 3, pp. 850–855. [Google Scholar]
  33. Haut, J.M.; Paoletti, M.E.; Plaza, J.; Plaza, A.; Li, J. Visual attention-driven hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2019, 57, 8065–8080. [Google Scholar] [CrossRef]
  34. Santavas, N.; Kansizoglou, I.; Bampis, L.; Karakasis, E.; Gasteratos, A. Attention! A lightweight 2d hand pose estimation approach. IEEE Sens. J. 2020, 21, 11488–11496. [Google Scholar] [CrossRef]
  35. Sun, J.; Jiang, J.; Liu, Y. An introductory survey on attention mechanisms in computer vision problems. In Proceedings of the 2020 6th International Conference on Big Data and Information Analytics (BigDIA), Shenzhen, China, 4–6 December 2020; pp. 295–300. [Google Scholar]
  36. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
  37. Xie, C.; Zhu, H.; Fei, Y. Deep coordinate attention network for single image super-resolution. IET Image Process. 2022, 16, 273–284. [Google Scholar] [CrossRef]
  38. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  39. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768. [Google Scholar]
  40. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
  41. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000. [Google Scholar]
  42. Gevorgyan, Z. SIoU Loss: More Powerful Learning for Bounding Box Regression. arXiv 2022, arXiv:2205.12740. [Google Scholar]
Figure 1. The improved YOLOv5 network structure.
Figure 2. YOLOv5 with improved feature fusion module.
Figure 3. Comparison of vehicle detection results between the original YOLOv5 model and our model. (a,d,h) are the images to be detected; (b,e,i) are the detection effect of YOLOv5; (c,f,j) are the detection effect of the improved model.
Figure 4. Comparison of pedestrian detection results between the original YOLOv5 model and our model. (a,d,h) are the images to be detected; (b,e,i) are the detection effect of YOLOv5; (c,f,j) are the detection effect of the improved model.
Figure 5. Comparison of long-distance traffic light detection results between the original YOLOv5 model and our model. (a,d,h) are the images to be detected; (b,e,i) are the detection effect of YOLOv5; (c,f,j) are the detection effect of the improved model.
Table 1. The Precision and AP50 of the three different detection categories.

| Detection Category | Precision (%) | AP50 (%) |
|---|---|---|
| pedestrians | 75.10 | 68.10 |
| vehicles | 84.60 | 80.60 |
| traffic lights | 69.30 | 55.20 |
Table 2. Comparison of computational complexity between different attention mechanisms (N = 2, Epoch = 50).

| Attention Mechanism | Parameters | GFLOPs | mAP (%) |
|---|---|---|---|
| SE | 7.12 × 10^6 | 16.0 | 62.95 |
| CBAM | 7.13 × 10^6 | 16.1 | 62.70 |
| CA | 7.10 × 10^6 | 16.0 | 63.05 |
Table 3. Comparison of detection results of nine different models.

| Model | mAP (%) | AP50 People (%) | AP50 Car (%) | AP50 Traffic Lights (%) | Inference Time (ms) |
|---|---|---|---|---|---|
| AD-Faster-RCNN | 56.15 | 62.00 | 50.30 | - | - |
| SSD | 45.39 | 44.33 | 46.46 | - | - |
| MS-DAYOLO | 55.70 | 45.37 | 73.74 | 48.00 | - |
| YOLOv4-416 | 60.86 | 50.32 | 71.40 | - | - |
| YOLOv4 | 60.45 | 51.70 | 69.20 | - | - |
| YOLOv5 | 66.10 | 56.20 | 76.00 | 47.03 | 3.4 |
| YOLOv6 | 53.00 | 47.00 | 71.20 | 40.80 | 3.15 |
| YOLOv7-tiny | 56.03 | 51.70 | 72.60 | 43.80 | 2.6 |
| Ours | 67.97 | 68.10 | 80.60 | 55.20 | 4.7 |
Table 4. Ablation experiment results (✓: module included; ×: module removed).

| Model | CA | BiFPN | SIoU | mAP (%) | AP50 Pedestrians (%) | AP50 Vehicles (%) | AP50 Traffic Lights (%) |
|---|---|---|---|---|---|---|---|
| Experiment 1 | ✓ | ✓ | × | 65.50 | 64.60 | 80.30 | 51.60 |
| Experiment 2 | ✓ | × | ✓ | 61.38 | 58.20 | 77.40 | 48.53 |
| Experiment 3 | × | ✓ | ✓ | 60.97 | 53.70 | 76.90 | 48.20 |
| Experiment 4 | ✓ | ✓ | ✓ | 67.96 | 68.10 | 80.60 | 55.17 |
