
Traffic-Sign-Detection Algorithm Based on SK-EVC-YOLO

Faguo Zhou, Huichang Zu, Yang Li, Yanan Song, Junbin Liao and Changshuo Zheng *
School of Artificial Intelligence, China University of Mining and Technology-Beijing, Beijing 100083, China
* Authors to whom correspondence should be addressed.
Mathematics 2023, 11(18), 3873; https://doi.org/10.3390/math11183873
Submission received: 16 August 2023 / Revised: 8 September 2023 / Accepted: 8 September 2023 / Published: 11 September 2023

Abstract

Traffic sign detection is an important research direction in intelligent transportation in the Internet era and plays a crucial role in ensuring traffic safety. This research proposes a traffic-sign-detection algorithm based on selective kernel (SK) attention, the explicit visual center (EVC), and the YOLOv5 model to address small targets, incomplete detection, and insufficient detection accuracy under natural and complex road conditions. First, the feature map with a smaller receptive field in the backbone network is fused with feature maps at other scales to add a small-target detection layer. Then, the SK attention mechanism is introduced to extract and weight features at different scales and levels, strengthening attention to the target. By fusing the explicit visual center to aggregate local features within a layer, the detection of small targets is further improved. According to the experimental results, the mean average precision (mAP) of the proposed algorithm on the Tsinghua-Tencent Traffic Sign Dataset (TT100K) is 88.5%, which is 4.6 percentage points higher than that of the original model, demonstrating its practicality for detecting small traffic signs.

1. Introduction

With rapid socio-economic development, the fast-growing automobile industry has seen a notable rise in the number of cars. Thanks to the development of the whole industry chain of China’s new energy vehicles, electricity has become the main driving energy, and with the widespread use of automotive chips, private vehicles have become more advanced [1]. In this context, the intelligent transportation system (ITS) has emerged. ITS enables effective interaction among drivers, road conditions, and vehicles [2]. Traffic signs are graphics, text, and symbols used on roads to indicate traffic rules, provide information, and warn of dangers. They are critical components in maintaining traffic order and ensuring road safety. Traffic signs can be classified into warning signs, regulatory signs, and prohibition signs, among others. Traffic sign detection [3] is a critical link in the ITS. The technology that detects traffic signs in assisted driving systems offers drivers an easier driving experience: traffic signs are detected and the results are provided to the driver or the autonomous driving system so that the vehicle’s speed and direction can be adjusted. Drivers can anticipate and adapt to different situations when unfamiliar with the road conditions, enhancing driving comfort and reducing mental stress. In addition, this technology can help traffic management departments monitor and control traffic flow, effectively reducing traffic jams, congestion, and traffic accidents [4].
Traditional detection methods exploit the regular shapes, bright colors, and easily recognized characteristics of traffic signs. Such a method first extracts color features, shape features, or both, and then classifies them to determine the location and category of traffic signs. However, these features are susceptible to interference from the external environment, such as color fading and shape deformation, so real-time performance, result accuracy, and detection robustness cannot be guaranteed [5]. With the development of deep learning, detection methods based on it have gradually become the main research direction, offering fast and efficient detection. Deep learning-based detection can be divided into one-stage and two-stage detection. One-stage algorithms, such as the SSD [6] algorithm and the YOLO [7] series, are end-to-end regression algorithms; two-stage algorithms first select candidate regions for the target and then refine the localization and classification of those regions. Among the common two-stage algorithms, such as RCNN [8], Fast RCNN [9], and Faster RCNN [10], the RPN network in Faster RCNN is particularly advantageous for generating candidate regions, greatly improving detection speed and the quality of candidate boxes, and it has strongly influenced subsequent algorithms. While two-stage traffic-sign-detection algorithms perform well in certain respects, they also have several drawbacks: high complexity, as they typically involve multiple stages, resulting in a complex workflow that requires handling multiple steps and tuning many parameters; slower speed, making them less suitable for real-time applications with strict timing requirements; and dependency on candidate boxes, as they often generate a large number of candidate boxes before classifying and localizing objects within them, which can cause background areas or irrelevant objects to be identified as traffic signs and thus increase false detection rates. Therefore, one-stage object-detection algorithms are now primarily used for traffic sign detection.
Gao et al. [11] used the SSD algorithm to detect traffic signs under natural road conditions to meet real-time requirements, but the algorithm is not accurate enough on small targets. Lin et al. [12] used a YOLOv3 network with an optimized feature pyramid and loss function and improved detection accuracy. Wang et al. [13] improved the YOLOv5 network and proposed an optimized feature pyramid model, raising the mAP@0.5 indicator to 0.6514 on the TT100K dataset. To better perform feature extraction and feature fusion, Jiang et al. [14] proposed an advanced YOLOv5 method using a global context module and a balanced pyramid structure, increasing the evaluation criterion mAP@0.5 by 1.9%. Based on the Faster RCNN model, Yuan et al. [15] improved the region proposal network, processed the low-level feature map, and used max pooling to fuse high-level and low-level features to improve the detection of small targets. Li et al. [16] proposed a traffic-sign-recognition method based on a lightweight YOLOv5; they used the k-means clustering algorithm for anchor box calculation and integrated the Stem module and ShuffleNetV2 module into YOLOv5, achieving a detection accuracy of 95.9%. However, the model’s ability to detect small objects was not comprehensively considered. In view of the above problems, this paper proposes the following improvements based on the YOLOv5 algorithm:
(a) The feature layer for small targets in the backbone network is extracted and fused to improve the detection of small targets.
(b) The SK attention mechanism is integrated into the original model to increase the network’s attention to traffic sign targets.
(c) The explicit visual center (EVC) of CFPNet is introduced into the backbone network to strengthen the connections between features.
The detection accuracy of this model is effectively improved. The method solves the problem of missing small traffic sign targets under natural road conditions to some extent.

2. Related Work

The YOLOv5 network architecture is based on YOLOv4 and consists of four parts: inputs, backbone network, neck fusion and detection head. The structure is shown in Figure 1.
The attention mechanism plays an important role in improving the performance and effectiveness of traffic sign detection. In terms of region attention, it enables models to focus on image areas related to the target, so the model can capture the target’s position information more accurately; it can also adjust the attention weights based on the size and position of the target in order to better fuse multi-scale information. Attention mechanisms are mainly divided into spatial attention, which extracts relationships between pixel-level features, and channel attention, which captures relationships between different channels, as in ECA-Net [17] and SE-Net [18]. Hu et al. [19] proposed a sub-pixel convolutional attention module (SCAM) on top of the YOLOv5 model to reduce the impact of scale distribution. Wei et al. [20], also based on the YOLOv5 algorithm, introduced recursive gated convolution and the SOCA attention mechanism; the improved YOLOv5 raised mAP by 43.7 percentage points on the TT100K dataset. Lang et al. [21] combined YOLOv5 with a coordinate attention (CA) mechanism and introduced a bidirectional feature pyramid network (BiFPN) to improve the detection accuracy of traffic signs.
Small objects typically have fewer pixels in an image, making them more prone to being overlooked or falsely detected. By introducing a dedicated small object detection layer, the model’s performance in handling these targets can be improved. This can involve using a smaller receptive field to focus the model on local details as well as introducing high-resolution feature maps in the network. This aids in capturing subtle features of the targets. Moreover, multi-scale feature fusion enables the model to effectively detect small objects at different scales [22]. Wang et al. [23] added a shallow detection layer as the detection layer for smaller targets in the YOLOv5 model, replacing the original algorithm’s three scales with four scales for detection.

3. Improved YOLOv5 Model

Compared with other models, the YOLOv5 model is relatively shallow while still guaranteeing a certain accuracy, so this paper chooses this algorithm for optimization. The improved model structure is shown in Figure 2.

3.1. Small Target Detection Layer

The YOLOv5 model can express image semantics and has a wide range of receptive fields. However, the feature information of small objects is severely lost in this model, resulting in poor small object detection performance. The traffic signs captured by the vehicle-mounted camera are generally small targets under a wide-angle lens. Aiming to preserve the feature information of the small target, this paper chooses to retain the multi-scale receptive field information [24]. The added small target detection layer is shown in Figure 3.
The backbone network of the original YOLOv5 model performs convolution operations on the 640 × 640 input image to extract features. The model downsamples the feature map by factors of 8, 16, and 32, finally obtaining feature maps of sizes 80 × 80, 40 × 40, and 20 × 20. YOLOv5 initially sets nine anchor boxes of different sizes to detect targets. The smallest anchor sizes, (10, 13), (16, 30), and (33, 23), are used to detect small objects; the medium anchor sizes, (30, 61), (62, 45), and (59, 119), are used to detect medium-sized targets; and sizes (116, 90), (159, 198), and (373, 326) are used to detect the largest objects. However, in real scenes, the position and occlusion of traffic signs vary. There may be even smaller targets, and an anchor size of (10, 13) is not sufficient to meet the needs of small-object detection.
In order to retain more small-target features, this paper adds a four-times-downsampled feature map and a corresponding detection head. The feature map size is 160 × 160, and the model uses three anchor sizes, (5, 6), (8, 14), and (15, 11), to improve the multi-scale detection effect. Because this paper adds a feature layer, the neck feature-fusion stage of YOLOv5 must also be changed. In the model, the feature pyramid network (FPN) transmits high-level semantic information, the path aggregation network (PAN) transmits low-level positional information to deep layers, and small-target feature maps are obtained from the lower spatial features. After four-times downsampling, the output feature map has a small receptive field and rich target information. It undergoes multi-scale fusion with the three feature maps combined from top to bottom, and is then fused again with the multi-scale feature maps from bottom to top. Afterwards, the detection heads perform target detection.
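As a rough illustration of the resulting four-scale setup, the following minimal Python sketch (not the authors' configuration file; the anchor values are simply those quoted above) lists the stride, feature-map size, and anchor set of each detection head for a 640 × 640 input:

```python
# Minimal sketch of the four-scale detection setup described above:
# strides 4/8/16/32 give 160x160, 80x80, 40x40, and 20x20 feature maps,
# each paired with three anchor boxes.
INPUT_SIZE = 640

ANCHORS = {  # stride -> (width, height) anchor pairs in pixels, as quoted above
    4:  [(5, 6), (8, 14), (15, 11)],          # added small-target head (P2)
    8:  [(10, 13), (16, 30), (33, 23)],       # small objects (P3)
    16: [(30, 61), (62, 45), (59, 119)],      # medium objects (P4)
    32: [(116, 90), (159, 198), (373, 326)],  # large objects (P5)
}

for stride, anchors in ANCHORS.items():
    size = INPUT_SIZE // stride
    print(f"stride {stride:>2}: {size}x{size} feature map, anchors {anchors}")
```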

3.2. SK Attention Mechanism

SKNet [25] (selective kernel network) is an image classification model based on a deep residual network structure. It uses the SK attention mechanism to strengthen the feature expression and generalization ability of the model. It is used to adaptively extract features of different scales and levels and select the most important features. The core idea of SKNet is to solve the problem of information overlap and redundancy in the model by introducing an adaptive attention mechanism to further improve the performance of the model.
In the SK attention mechanism, there are three key operations on the input feature maps: split, fuse, and select, as shown in Figure 4.
The split operation applies grouped/depthwise convolutions with 3 × 3 and 5 × 5 kernels to the input feature map, followed by batch normalization and ReLU activation.
The fuse operation uses gates to combine the information from the branches. It obtains U by summing U1 and U2. $F_{gp}$ performs global average pooling, and $F_{fc}$ is a two-layer fully connected layer that first reduces and then restores the dimension; its outputs are a and b, where matrix b is redundant. The c-th element of s is computed by shrinking U over the spatial dimensions H × W:

$$s_c = F_{gp}(U_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} U_c(i, j)$$
z is a compact feature vector that compresses the dimension for efficiency. It is implemented by the $F_{fc}$ fully connected layer:

$$z = F_{fc}(s) = \delta\big(B(Ws)\big)$$

where $\delta$ is the ReLU activation function and $B$ represents batch normalization.
The select operation weights and adds these features to obtain the vector V; note that a and b serve as the weights. In this paper, SK attention is fused between the most informative feature maps of the original model and the detection head. The feature maps processed by the attention module are fused between the multi-scale feature network and the detection head, preserving target weights for the most informative feature maps and improving detection accuracy.
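For concreteness, the following is a minimal PyTorch sketch of the split/fuse/select pipeline, assuming two branches (a 3 × 3 convolution and a dilated 3 × 3 convolution standing in for the 5 × 5 kernel, as in the SKNet paper) and a softmax across branches; the reduction ratio and layer choices are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class SKAttention(nn.Module):
    """Minimal sketch of SK attention (split/fuse/select) with two branches."""

    def __init__(self, channels: int, reduction: int = 16, min_dim: int = 32):
        super().__init__()
        d = max(channels // reduction, min_dim)  # dimension of compact vector z
        # Split: two convolutions with different receptive fields (U1, U2).
        self.branch3 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.branch5 = nn.Sequential(  # dilated 3x3 == effective 5x5 field
            nn.Conv2d(channels, channels, 3, padding=2, dilation=2, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        # Fuse: global average pooling (F_gp) followed by a bottleneck FC (F_fc).
        self.fc_z = nn.Sequential(
            nn.Linear(channels, d), nn.BatchNorm1d(d), nn.ReLU(inplace=True))
        # Select: per-branch FC heads produce the attention logits a and b.
        self.fc_a = nn.Linear(d, channels)
        self.fc_b = nn.Linear(d, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        u1, u2 = self.branch3(x), self.branch5(x)
        u = u1 + u2                       # element-wise sum of the branches
        s = u.mean(dim=(2, 3))            # F_gp: squeeze H x W away, shape (B, C)
        z = self.fc_z(s)                  # compact feature vector z
        ab = torch.stack([self.fc_a(z), self.fc_b(z)], dim=1)
        ab = torch.softmax(ab, dim=1)     # softmax across branches, so a + b = 1
        a, b = ab[:, 0], ab[:, 1]
        return u1 * a[..., None, None] + u2 * b[..., None, None]  # selected V

# Example: attach after a neck feature map with 256 channels.
# attn = SKAttention(256); y = attn(torch.randn(2, 256, 40, 40))
```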

3.3. Explicit Visual Center

CFPNet [26] (centralized feature pyramid) is a centralized feature pyramid network that provides an explicit visual center to focus on a few key points and capture long-range dependencies.
The explicit visual center consists of two parallel blocks: a multilayer perceptron (MLP) and a learnable visual center (LVC). The lightweight MLP is used to capture the long-range dependencies of top-level features, while the LVC focuses on local corner features. A Stem block, consisting of a 7 × 7 convolutional block followed by batch normalization and an activation function, is used to smooth the feature map. The structure diagram is shown in Figure 5.
The MLP consists of a depthwise convolution module followed by a channel MLP, each with a residual connection. The input features first undergo a depthwise convolution operation, which improves feature expression and completes the scaling and regularization of the channels, and the result is then residually connected with the input. The process can be expressed as

$$\tilde{X}_{in} = \mathrm{DConv}\big(\mathrm{GN}(X_{in})\big) + X_{in}$$

where $\tilde{X}_{in}$ is the output result and DConv is a depthwise convolution with a 1 × 1 convolution kernel.
The following operation is then performed on this output:

$$\mathrm{MLP}(X_{in}) = \mathrm{CMLP}\big(\mathrm{GN}(\tilde{X}_{in})\big) + \tilde{X}_{in}$$

The output of the depthwise convolution is first normalized and then fed into the channel MLP (the CMLP in Formula (4)), which performs channel scaling and a residual link.
The LVC module first performs a convolution operation and CBR processing on the output of the Stem block. In the codebook part, it then uses scaling factors $s_k$ to relate the position information of the pixels $\tilde{x}_i$ to the codewords $b_k$. The formula is as follows:

$$e_k = \sum_{i=1}^{N} \frac{e^{-s_k \lVert \tilde{x}_i - b_k \rVert^2}}{\sum_{j=1}^{K} e^{-s_j \lVert \tilde{x}_i - b_j \rVert^2}} \,(\tilde{x}_i - b_k)$$

Here, $\tilde{x}_i$ is the i-th pixel and $b_k$ represents the k-th visual codeword; the relative position information is obtained by subtracting $b_k$ from $\tilde{x}_i$, and $s_k$ is the scaling factor. All $e_k$ are then merged, with $\phi$ implementing the ReLU and BN calculations, as shown in Formula (6):

$$e = \sum_{k=1}^{K} \phi(e_k)$$
The result then enters a fully connected layer and a convolution operation and is multiplied with $X_{in}$ along the feature channel. This can be expressed as

$$Z = X_{in} \otimes \delta\big(\mathrm{Conv}_{1 \times 1}(e)\big)$$
Finally, the input $X_{in}$ and the local feature Z are added channel-wise, which can be expressed as follows:

$$\mathrm{LVC}(X_{in}) = X_{in} \oplus Z$$
The YOLOv5 model concatenates the MLP and LVC outputs to obtain the feature map after the above steps. This paper integrates the EVC module into the backbone network: the preliminary features extracted by the first part of the network are processed by the LVC submodule and the MLP to improve feature-extraction ability, and the resulting more detailed feature maps then enter the remaining part of the backbone network. In summary, based on the original model, this paper adds a small-target detection layer to fuse multi-scale receptive-field feature maps, integrates the explicit visual center into the backbone network to improve feature extraction, and finally introduces the SK attention module in the detection stage.
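To make the codebook step concrete, here is a minimal PyTorch sketch of an LVC-style module following Formulas (5)-(8); the codeword count, the initialization, and the sigmoid gate standing in for the activation in Formula (7) are illustrative assumptions, not the CFPNet reference implementation:

```python
import torch
import torch.nn as nn

class LVCBlock(nn.Module):
    """Sketch of a learnable visual center: a codebook of K visual words b_k
    with smoothing factors s_k performs the soft-assignment encoding above."""

    def __init__(self, channels: int, num_codes: int = 64):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_codes, channels))  # b_k
        self.scale = nn.Parameter(torch.ones(num_codes))                # s_k
        self.phi = nn.Sequential(nn.BatchNorm1d(channels), nn.ReLU(inplace=True))
        self.fc = nn.Linear(channels, channels)          # Conv1x1 analogue

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        B, C, H, W = x.shape
        feats = x.flatten(2).transpose(1, 2)              # pixels x_i: (B, N, C)
        # Residuals x_i - b_k for every pixel/codeword pair: (B, N, K, C)
        resid = feats.unsqueeze(2) - self.codebook[None, None]
        # Soft assignment over codewords: softmax of -s_k * ||x_i - b_k||^2
        logits = -self.scale * resid.pow(2).sum(dim=-1)   # (B, N, K)
        assign = torch.softmax(logits, dim=2).unsqueeze(-1)
        e_k = (assign * resid).sum(dim=1)                 # Formula (5): (B, K, C)
        e = self.phi(e_k.sum(dim=1))                      # Formula (6): (B, C)
        gate = torch.sigmoid(self.fc(e))[..., None, None] # Formula (7) gate (assumed sigmoid)
        Z = x * gate                                      # channel-wise product with X_in
        return x + Z                                      # Formula (8): X_in (+) Z

# Example: process a 20x20 top-level feature map with 512 channels.
# lvc = LVCBlock(512); y = lvc(torch.randn(2, 512, 20, 20))
```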

4. Experiment and Analysis

4.1. Dataset

This paper conducts experiments on the TT100K [27] dataset, jointly collected by Tsinghua University and Tencent from Tencent Street View panoramas and targeted at traffic sign detection under natural road conditions. This dataset closely matches the natural streets and road conditions recorded by dashboard cameras: it contains entities such as vehicles, pedestrians, and street trees, which interfere with traffic sign detection to a certain degree. The dataset is labeled with 128 categories at 2048 × 2048 pixels, but the categories are imbalanced. To avoid overfitting during training due to this imbalance, this paper keeps only the categories with more than 100 instances, yielding a total of 9170 images. A script converts the label information of the dataset from JSON format to txt format and splits it at a ratio of 8:2, giving 7201 images in the training set and 1969 images in the testing set.
Dividing the dataset into an 80% training set and a 20% testing set strikes a balance between training and evaluation, ensuring ample samples for both. Experiments and experience show that when the training set accounts for 70%, the model’s generalization ability may be insufficient, whereas at 80% the model can fully learn the features of traffic signs. At the same time, reserving 20% of the data for the test set makes it possible to detect overfitting and confirm that the model is not merely memorizing the training data.
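The conversion and split described above could look roughly like the following sketch; the annotation field names ("imgs", "objects", "bbox", "category") follow the published TT100K format, but the paths, output layout, and threshold handling are assumptions rather than the authors' actual script:

```python
import json
import random
from pathlib import Path

IMG_SIZE = 2048          # TT100K images are 2048 x 2048 pixels
MIN_INSTANCES = 100      # keep only categories with more than 100 instances

anno = json.loads(Path("TT100K/annotations.json").read_text())
Path("labels").mkdir(exist_ok=True)

# Count instances per category and keep only the frequent ones.
counts = {}
for img in anno["imgs"].values():
    for obj in img.get("objects", []):
        counts[obj["category"]] = counts.get(obj["category"], 0) + 1
classes = sorted(c for c, n in counts.items() if n > MIN_INSTANCES)
class_id = {c: i for i, c in enumerate(classes)}

# Write one YOLO-format txt per image: "class x_center y_center w h", normalized.
kept = []
for img in anno["imgs"].values():
    lines = []
    for obj in img.get("objects", []):
        if obj["category"] not in class_id:
            continue
        b = obj["bbox"]
        xc = (b["xmin"] + b["xmax"]) / 2 / IMG_SIZE
        yc = (b["ymin"] + b["ymax"]) / 2 / IMG_SIZE
        w = (b["xmax"] - b["xmin"]) / IMG_SIZE
        h = (b["ymax"] - b["ymin"]) / IMG_SIZE
        lines.append(f"{class_id[obj['category']]} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}")
    if lines:
        Path("labels", Path(img["path"]).stem + ".txt").write_text("\n".join(lines))
        kept.append(img["path"])

# 8:2 train/test split of the retained images.
random.seed(0)
random.shuffle(kept)
cut = int(0.8 * len(kept))
train_set, test_set = kept[:cut], kept[cut:]
print(len(train_set), "train /", len(test_set), "test images")
```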
The sample dataset is shown in Figure 6.

4.2. Experiment Environment and Parameters

The hardware and software environment parameters are shown in Table 1. The hyperparameters are shown in Table 2.

4.3. Evaluation Indicators

The model takes precision, recall, and mAP as the main evaluation indexes. In this paper, mAP is used as the reference: the higher its value, the better the model’s detection performance. The formulas are as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$\mathrm{AP} = \int_{0}^{1} \mathrm{Precision}(t)\, dt$$

$$\mathrm{mAP} = \frac{1}{N} \sum_{n=1}^{N} \mathrm{AP}_n$$
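As a small worked illustration of these definitions (a sketch, not the evaluation code used in the experiments), AP can be computed by numerically integrating a precision-recall curve and mAP by averaging the per-class APs:

```python
import numpy as np

def average_precision(precision: np.ndarray, recall: np.ndarray) -> float:
    """Sketch of the AP integral: area under the precision-recall curve,
    with the conventional monotone precision envelope. Inputs are the
    per-threshold precision/recall values sorted by descending confidence."""
    p = np.concatenate(([1.0], precision, [0.0]))
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]        # enforce non-increasing precision
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))  # integrate over recall

def mean_average_precision(ap_per_class) -> float:
    """mAP as in the last formula: the mean of per-class AP values."""
    return float(np.mean(ap_per_class))

# Example: two detections, one TP then one FP, for a class with two ground truths:
# average_precision(np.array([1.0, 0.5]), np.array([0.5, 0.5])) -> 0.5
```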

4.4. Comparison with Other Algorithms

To test the performance of the improved algorithm, the SSD algorithm, Faster RCNN algorithm, YOLOv3 algorithm, YOLOv4 algorithm, EfficientNet [28], Swin Transformer [29] and YOLOX [30] are selected for training and learning on the same dataset, and compared with the algorithm in this paper. As shown in Table 3, the mAP of the improved algorithm in this paper exceeds that of other commonly used target detection algorithms, with a result of 88.5%. Compared with the original model of YOLOv5, the improvement is 4.6 percentage points, which shows that the optimization in this paper is effective in improving the detection accuracy.

4.5. Comparison of Different Attention Mechanisms

This paper compares the effect of combining different attention mechanisms, including the common CBAM [31], CA [32], ECA [17], NAM [33], SimAM [34], and SK. As Table 4 shows, combining the original model with SK attention yields the largest improvement in detection accuracy. SK attention can dynamically select the convolution kernel and adjusts the receptive field through multi-scale features, and its highest recall indicates that targets are detected most completely. The results are shown in Table 4.

4.6. Ablation Experiments

In order to confirm the effectiveness of the improved model, this paper conducts ablation experiments over eight model configurations with consistent environments and parameters. Models 1 to 8 are YOLOv5; YOLOv5 + small-target detection layer; YOLOv5 + SK; YOLOv5 + EVC; YOLOv5 + small-target + SK; YOLOv5 + small-target + EVC; YOLOv5 + EVC + SK; and YOLOv5 + small-target + SK + EVC, respectively. The results of the ablation experiments are shown in Table 5.
As can be seen from Table 5, the original model can accomplish the task of detecting traffic signs but is not effective on small targets. After the small-target detection layer and the EVC module are added, detection accuracy improves markedly. The SK attention mechanism then further strengthens the model’s attention to small targets and improves its detection performance.

4.7. Visualization of Test Results

Figure 7 shows an example of the detection performance of the improved model; it can be seen that the model also detects smaller targets well. The non-English characters in the figure do not affect reading, and an enlarged result image is displayed at the bottom left of Figure 7. In Figure 8, the left side shows the detection results of the improved model and the right side those of the original model. As can be seen from Figure 8, small targets are missed in the right-hand image, while the improved model on the left detects them. Therefore, the model proposed in this paper is effective.

5. Conclusions

The SK-EVC-YOLO model addresses the insufficient accuracy of the initial YOLOv5 model in detecting small traffic signs under natural road conditions. First, SK-EVC-YOLO adds a feature layer with a smaller receptive field and fuses it with the other feature layers for multi-scale detection. Then, it uses the SK attention mechanism to apply different convolution kernels to different inputs when processing the output. Finally, the model integrates the explicit visual center of CFPNet to capture long-range dependencies among features and improve the detection of small targets. The experimental results show that the mean average precision (mAP) and the recall rate are 88.5% and 84.3%, respectively, which are 4.6 and 5.5 percentage points higher than those of the original model. Consequently, the detection of small objects is improved.
The present investigation has some limitations, primarily concerning sample bias and data quality. The dataset includes only common traffic sign categories, which could impede precise detection in some rare sign scenarios. Concerning data quality, the majority of targets are concentrated in the middle region of the image, with fewer at the edges. Additionally, the chosen model may have an excessive number of parameters. These limitations necessitate further research and refinement.
Future studies in the field of transportation could center on various applications and alternative statistical analysis techniques. Cluster analysis may be employed to assess the correlation between tire texture, wear amount, and road noise [35]. Additionally, it is feasible to investigate the utilization of machine learning algorithms, such as the double deep Q-network, in mitigating distributed denial of service (DDoS) attacks in the realm of Internet-connected vehicles, even in the absence of ample labeled data [36]. Presently, the implementation of deep learning approaches to tackle the problem of driver identification and verification is enjoying widespread popularity [37]. The transportation sector is an extensive and intricate system, with various areas suitable for examination.

Author Contributions

Conceptualization, F.Z. and H.Z.; methodology, H.Z.; software, Y.L.; validation, J.L.; data curation, C.Z.; writing—original draft preparation, H.Z. and Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation of China (No.62072008).

Data Availability Statement

Data sharing is not applicable to this article, as no datasets were generated during the current study.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zhang, X.; Zhang, Z. Research on a traffic sign recognition method under small sample conditions. Sensors 2023, 23, 5091. [Google Scholar] [CrossRef] [PubMed]
  2. Garg, T.; Kaur, G. A systematic review on intelligent transport systems. J. Comput. Cogn. Eng. 2023, 2, 175–188. [Google Scholar] [CrossRef]
  3. Li, W.; Li, X.; Qin, Y.; Song, W.; Cui, W. Application of Improved LeNet-5 Network in Traffic Sign Recognition. In Proceedings of the ICVIP 2019: 2019 the 3rd International Conference on Video and Image Processing, Shanghai, China, 20–23 December 2019. [Google Scholar]
  4. Li, D.; Su, Z.; Liu, Y. Road traffic sign recognition based on improved YOLOv4. Opt. Precis. Eng. 2023, 31, 1366–1378. [Google Scholar] [CrossRef]
  5. Huang, K. Traditional methods and machine learning-based methods for traffic sign detection. In Proceedings of the Third International Conference on Intelligent Computing and Human-Computer Interaction, Guangzhou, China, 12–14 August 2022. [Google Scholar]
  6. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Proceedings, Part I 14, Amsterdam, The Netherlands, 11–14 October 2016; Springer International Publishing: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
  7. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  8. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the IEEE Computer Society, Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
  9. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 13–16 December 2015; pp. 1440–1448. [Google Scholar]
  10. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; p. 28. [Google Scholar]
  11. Gao, B.; Jiang, Z.; Zhang, J. Traffic Sign Detection based on SSD. In Proceedings of the 2019 4th International Conference, Guilin, China, 19–21 July 2019; pp. 1–6. [Google Scholar]
  12. Lin, Y.; Chen, L.; Wang, G.; Sheng, Y.; Sun, L. Improved YOLOv3 Traffic Sign Recognition Algorithm. Sci. Technol. Eng. 2022, 22, 12030–12037. [Google Scholar]
  13. Wang, J.; Chen, Y.; Dong, Z.; Gao, M. Improved YOLOv5 network for real-time multi-scale traffic sign detection. Neural Comput. Appl. 2022, 35, 7853–7865. [Google Scholar] [CrossRef]
  14. Jiang, L.; Liu, H.; Zhu, H.; Zhang, G. Improved YOLO v5 with balanced feature pyramid and attention module for traffic sign detection. MATEC Web Conf. Edp Sci. 2022, 355, 03023. [Google Scholar] [CrossRef]
  15. Yuan, X.; Wang, G.; Wang, Y.; Wang, Y.; Sun, H. Traffic sign recognition method based on improved convolutional neural network. Electron. Sci. Technol. 2019, 32, 28–32. [Google Scholar]
  16. Li, Z.; Zhang, N. A lightweight YOLOv5 traffic sign identification method. Telecommun. Technol. 2022, 62, 1201–1206. [Google Scholar]
  17. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Online, 14–19 June 2020; pp. 11534–11542. [Google Scholar]
  18. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  19. Hu, J.; Wang, Z.; Chang, M.; Xie, L.; Xu, W.; Chen, N. PSG-Yolov5: A Paradigm for Traffic Sign Detection and Recognition Algorithm Based on Deep Learning. Symmetry 2022, 14, 2262. [Google Scholar] [CrossRef]
  20. Wei, Q.; Hu, X.; Zhao, H. Improving the Traffic Sign Detection Method of YOLOv5. Comput. Eng. Appl. 2023, 59, 229–237. [Google Scholar]
  21. Lang, B.; Lv, B.; Wu, J.; Wu, R. A Traffic Sign Detection Model Based on CA-BIFPN. J. Shenzhen Univ. (Sci. Eng. Ed.) 2023, 4014, 335–343. [Google Scholar]
  22. Li, K.; Wang, X.; Lin, H.; Li, L.; Yang, Y.; Meng, C.; Gao, J. Review of Single-stage Small Target Detection Methods in Deep Learning. Comput. Sci. Explor. 2022, 16, 41. [Google Scholar]
  23. Wang, P.; Huang, H.; Wang, M. Complex Road Object Detection Algorithm for Improved YOLOv5. J. Comput. Eng. Appl. 2022, 58, 81–92. [Google Scholar]
  24. Mao, Z.; Ren, Y.; Chen, X.; Ren, K.; Zhao, Y. An improved multi-scale object detection algorithm for YOLOv5s. J. Sens. Technol. 2023, 36, 267–274. [Google Scholar]
  25. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective kernel networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 510–519. [Google Scholar]
  26. Quan, Y.; Zhang, D.; Zhang, L.; Tang, J. Centralized Feature Pyramid for Object Detection. arXiv 2022, arXiv:2210.02093. [Google Scholar] [CrossRef] [PubMed]
  27. Zhu, Z.; Liang, D.; Zhang, S.; Huang, X.; Li, B.; Hu, S. Traffic-sign detection and classification in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 2110–2118. [Google Scholar]
  28. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; pp. 6105–6114. [Google Scholar]
  29. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  30. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  31. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  32. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Online, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
  33. Liu, Y.; Shao, Z.; Teng, Y.; Hoffmann, N. NAM: Normalization-based attention module. arXiv 2021, arXiv:2111.12419. [Google Scholar]
  34. Yang, L.; Zhang, R.Y.; Li, L.; Xie, X. Simam: A simple, parameter-free attention module for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 11863–11874. [Google Scholar]
  35. Wang, H.; Zhang, X.; Jiang, S. A laboratory and field universal estimation method for tire–pavement interaction noise (TPIN) based on 3D image technology. Sustainability 2022, 14, 12066. [Google Scholar] [CrossRef]
  36. Li, Z.; Kong, Y.; Jiang, C. A Transfer Double Deep Q Network Based DDoS Detection Method for Internet of Vehicles. IEEE Trans. Veh. Technol. 2023, 72, 5317–5331. [Google Scholar] [CrossRef]
  37. Xu, J.; Pan, S.; Sun, P.Z.; Park, S.H.; Guo, K. Human-factors-in-driving-loop: Driver identification and verification via a deep learning approach using psychological behavioral data. IEEE Trans. Intell. Transp. Syst. 2022, 24, 3383–3394. [Google Scholar] [CrossRef]
Figure 1. YOLOv5 model structure diagram.
Figure 2. Improved model structure.
Figure 3. Feature fusion network.
Figure 4. SK attention mechanism.
Figure 5. Explicit visual center.
Figure 6. Dataset example.
Figure 7. The detection performance of the improved model.
Figure 8. Comparison of detection effects.
Table 1. Software and hardware parameters.

Name            Parameters
System          Windows 10
CPU             Intel Core i9-10980
GPU             Nvidia RTX 3090
Video memory    24 GB
Memory          256 GB
CUDA            11.7
CUDNN           8.5.0
PyTorch         1.13.0
Table 2. Hyperparameters.

Parameter Name      Parameter Value
Epochs              250
Batch size          16
Learning rate       0.01
Input image size    640 × 640
Table 3. Comparison with other algorithms.

Model               mAP
SSD                 0.537
Faster RCNN         0.743
YOLOv3              0.684
YOLOv4              0.762
YOLOv5              0.839
EfficientNet        0.663
Swin Transformer    0.705
YOLOX               0.804
Ours                0.885
Table 4. Integrating different attention mechanisms.

Model             Precision   Recall   mAP
YOLOv5 + CBAM     0.851       0.81     0.833
YOLOv5 + CA       0.853       0.82     0.844
YOLOv5 + ECA      0.849       0.801    0.834
YOLOv5 + NAM      0.867       0.808    0.847
YOLOv5 + SimAM    0.851       0.82     0.845
YOLOv5 + SK       0.852       0.83     0.848
Table 5. Ablation experiments.

Model   Small Object Detection Layer   SK   EVC   Precision   Recall   mAP
1       —                              —    —     0.876       0.788    0.839
2       ✓                              —    —     0.859       0.83     0.867
3       —                              ✓    —     0.852       0.83     0.848
4       —                              —    ✓     0.847       0.827    0.86
5       ✓                              ✓    —     0.875       0.824    0.875
6       ✓                              —    ✓     0.868       0.827    0.876
7       —                              ✓    ✓     0.88        0.841    0.88
8       ✓                              ✓    ✓     0.872       0.843    0.885