Article

M-YOLO: Traffic Sign Detection Algorithm Applicable to Complex Scenarios

College of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
* Author to whom correspondence should be addressed.
Symmetry 2022, 14(5), 952; https://doi.org/10.3390/sym14050952
Submission received: 7 April 2022 / Revised: 27 April 2022 / Accepted: 1 May 2022 / Published: 7 May 2022

Abstract

Traffic signs are everywhere in daily life. They are symmetrical objects, yet their detection is easily affected by distortion, distance, light intensity and other factors, which increases the safety risks of assisted driving in practical applications. To address this problem, a symmetrical traffic sign detection algorithm for complex scenes, M-YOLO, is proposed. The algorithm reduces the computational overhead of the network to mitigate latency and accelerate feature extraction. While improving detection efficiency, it retains a degree of generalization and robustness and strengthens the detection of traffic signs under complex conditions such as scale and illumination changes. Experimental results on the CCTSDB dataset, which contains traffic signs in complex scenes, and on the HRRSD small-target dataset show that M-YOLO achieves good detection performance, with higher detection accuracy and speed than the compared algorithms. Tests in real complex scenes show that M-YOLO outperforms the YOLOv5l algorithm and accurately detects traffic signs that YOLOv5l misses. The proposed algorithm therefore effectively improves traffic sign detection accuracy, is suitable for complex scenes, and performs well on small targets.

1. Introduction

With rising per capita income, the number of cars has grown year by year, and the automobile has become an indispensable part of daily life. As a means of transportation, it greatly shortens travel time and improves travel efficiency. However, with a growing population and more vehicles on the road, the traffic system is becoming increasingly complex, and the incidence of congestion and traffic accidents is rising. Intelligent transportation systems emerged to address this problem. As an important part of the smart city, an intelligent transportation system connects the facilities and equipment, vehicles, drivers and pedestrians in road traffic, establishes efficient and reliable communication among them, and uses big data and artificial intelligence to plan traffic modes and routes reasonably, thereby reducing environmental pollution, easing the traffic load and ensuring traffic safety.
Traffic sign detection is an important part of an intelligent transportation system, providing key information for vehicle decision-making and navigation. The most widely used approach obtains traffic sign information through an on-board camera: the camera detects signs in the captured images, feeds the detected information back to the on-board system and the driver, and guides route planning and driver operation, thereby improving the safety of assisted driving. However, traffic sign detection is easily affected by changes in weather, shadow and light intensity, which poses serious safety risks for assisted-driving applications. At present, most detection is carried out only in normal environments; in complex and changeable environments, detection becomes inaccurate, creating serious safety hazards. Deep learning, with its advantages of fast target acquisition and accurate detection, offers a feasible scheme for traffic sign detection. By combining the detected road signs with the intelligent transportation system, traffic information can be provided to drivers accurately and in real time, effectively reducing the accident rate. However, it remains difficult for deep learning detectors to find targets in complex scenes while maintaining detection accuracy. How to detect traffic signs from complex scenes with high quality has therefore become a problem to be solved.
Since AlexNet [1] achieved great success in the visual recognition challenge in 2012, more and more deep learning methods have been proposed and applied to target detection. As an important part of computer vision, target detection is widely used in image segmentation, target tracking, autonomous driving and other fields. Compared with traditional detection algorithms, deep learning methods extract classification features directly from the image, avoiding cumbersome manual feature extraction, reducing algorithm complexity and improving efficiency. Popular traffic sign detection algorithms include Faster R-CNN [2], YOLO [3], SSD [4], CenterNet [5] and CornerNet [6], among which Faster R-CNN and YOLO are the most representative. The former is a two-stage model whose main idea is to extract candidate regions based on edge information and selective search [7]; its detection accuracy is high, but its detection speed is comparatively slow. The latter is a single-stage model with fast detection but lower accuracy. By optimizing the network structure, the Google research team proposed the lightweight network MobileNetv3 [8], which combines fast speed with good detection accuracy. He et al. [9] constructed the SPPNet network to avoid the image distortion caused by scaling, cropping and similar operations. Chien-Yao Wang et al. [10] constructed the CSPNet network to reduce the computation required for feature extraction. The recent YOLOv5l algorithm introduces a new module, Focus, which rearranges the raw image pixels to help the network extract shallow features, improving both inference speed and the speed of gradient back-propagation. The current YOLO algorithm has not fused these modules; fusing them can improve the detection performance of the original YOLO algorithm, reduce the miss and false detection rates for small targets in complex scenes, and improve the robustness and reliability of the model. Therefore, we propose the M-YOLO algorithm, a detection algorithm for traffic signs in complex scenes. Building on the lightweight MobileNetv3 network, M-YOLO extracts bottom-up features from the input image through the Focus, SPPNet and CSPNet modules, and predicts targets with the multiple detection heads of YOLOv3 [11]. Compared with the original YOLOv3, M-YOLO combines the advantages of these networks to improve both detection accuracy and speed. The CCTSDB traffic sign dataset, captured in complex scenes, is used to verify the effectiveness of the proposed algorithm. A large number of comparative experiments show that M-YOLO can effectively detect traffic signs in complex scenes and achieves higher detection accuracy on this dataset than the latest detection algorithms.
This article is organized as follows: Section 2 reviews previous work on traffic sign recognition. Section 3 presents our method. Section 4 presents the experiments and results, and Section 5 is dedicated to the conclusions.

2. Related Work

In order to improve the detection accuracy of traffic signs, scholars first started from the color of the signs. Traffic signs are generally red, yellow or blue; this is their most obvious feature, and early research exploited it. Color-based traffic sign detection and recognition is mainly divided into two categories: RGB-based and HSI-based. Benallal et al. [12] analyzed traffic signs under different lighting conditions and found that the RGB components differ with illumination, so signs can be separated by comparing only two RGB components. The defect of this method is also obvious: when signs appear in a complex environment and are affected by various external factors, it is difficult to detect them accurately. The HSI model describes color directly with the H, S and I components. Yang et al. [13] converted the input color image into a probability map and detected traffic signs with the maximally stable extremal region method. Gao et al. [14] processed traffic signs with the CIECAM97 color model and detected them with the FOSTS model. In addition to color, the shape of traffic signs is also distinctive: they are circles, triangles or rectangles. The required candidate regions can therefore be obtained by segmenting shape features and then classified [15]. Gavrila [16] detects signs by computing a distance transform of the image and matching it against warning-sign templates. Wang et al. [17] combined the HOG descriptor with an SVM to detect traffic signs of arbitrary shape. Paulo et al. [18] generate regions of interest with a Harris detector and then search the corners of each region with six preset control regions to detect triangular signs. Beyond these two approaches, multi-feature fusion has also been used for traffic sign detection. Creusen et al. [19] calculated HOG features separately on the RGB channels to improve recognition accuracy. Achanta et al. [20] proposed a frequency-tuned saliency method because the saliency maps produced by visual attention systems have low resolution. With the development of deep learning, Schmidhuber and others applied supervised learning with neural networks to traffic sign detection. Zhang Jianming et al. first proposed an improved YOLOv2 algorithm, but its detection accuracy was low; they then proposed a cascaded R-CNN [21] with multi-scale fusion, which improved accuracy but suffered from a slow detection rate. To better meet real-time requirements, Li Xudong et al. [22] proposed a traffic sign detection algorithm based on a multi-scale nested residual network; on CCTSDB its detection speed reaches 200 fps, but its accuracy drops. Chen Changchuan et al. [23] proposed the T-YOLO algorithm based on YOLOv3, fusing the residual network and replacing the general pooling layers with convolution layers; its detection accuracy on CCTSDB reaches 97.3%, with a detection speed of 19.3 frames per second.
Liu Fei [24] proposed an improved YOLOv4-tiny traffic sign detection algorithm that can detect traffic signs in a variety of environments. Zhou Ke et al. [25] proposed a region-based attention network (PFANet) for high-resolution traffic sign classification in harsh environments. The latest YOLOv5s algorithm is very fast, but its detection accuracy is low.

3. M-YOLO Traffic Sign Detection Algorithm

In the field of traffic sign detection, improvements are usually made on the basis of the YOLOv3 algorithm, but current improved schemes suffer from low detection accuracy or slow detection speed. This article therefore proposes the M-YOLO model to address these problems.

3.1. Improvement Scheme

(1)
Replace the YOLOv3 Backbone Network with MobileNetv3
The MobileNet family is widely used in target detection because of its fast and accurate detection, and it has become representative of lightweight networks. Building on the first two generations, v1 and v2, MobileNetv3 offers excellent performance and speed and has been widely adopted by academia and industry. Its architecture parameters are obtained by NAS [26]. MobileNetv3 inherits practical elements of v1 and v2 and introduces the SE channel attention mechanism, which improves detection accuracy to a certain extent. MobileNet uses depthwise separable convolution; compared with the classic CNN model, it mainly replaces part of the standard convolution layers to reduce the amount of computation. Depthwise separable convolution and standard convolution structures are shown in Figure 1.
Computational cost of the depthwise convolution:

D_K \times D_K \times M \times D_F \times D_F  (1)

Computational cost of the pointwise convolution:

M \times N \times D_F \times D_F  (2)

The ratio of the computational cost of a depthwise separable convolution (depthwise plus pointwise) to that of a standard convolution is:

\frac{D_K \times D_K \times M \times D_F \times D_F + M \times N \times D_F \times D_F}{D_K \times D_K \times M \times N \times D_F \times D_F} = \frac{1}{N} + \frac{1}{D_K^2}  (3)

where D_K is the convolution kernel size, D_F is the spatial size of the feature map, M is the number of input channels and N is the number of output channels.
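To illustrate how the reduction in Formula (3) arises in practice, the following PyTorch sketch builds a depthwise separable convolution from a depthwise 3 × 3 convolution followed by a pointwise 1 × 1 convolution and compares its parameter count with a standard convolution. This is an illustrative implementation, not the authors' released code; the channel counts are chosen only for the example.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3x3 convolution followed by a pointwise 1x1 convolution."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # Depthwise: one 3x3 filter per input channel (groups = in_ch)
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        # Pointwise: 1x1 convolution fuses the per-channel outputs
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

def num_params(module):
    return sum(p.numel() for p in module.parameters())

standard = nn.Conv2d(32, 64, kernel_size=3, padding=1, bias=False)
separable = DepthwiseSeparableConv(32, 64)
# For M = 32, N = 64, D_K = 3 the ratio should be close to 1/N + 1/D_K^2 ≈ 0.127
print(num_params(separable) / num_params(standard))
```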
MobileNetv3-YOLOv3 replaces the YOLOv3 backbone darknet-53 with the lightweight MobileNetv3 network to build a symmetric network structure, taking advantage of the characteristics of MobileNetv3 while keeping the model lightweight. MobileNetv3 operates with multi-channel 3 × 3 depthwise convolution kernels and then completes feature map fusion through 1 × 1 pointwise convolutions, which reduces the size of the model. MobileNetv1 and MobileNetv2 both start from a conventional 3 × 3 convolution layer with 32 filters; however, experiments show that this layer is relatively time-consuming, and 16 filters are enough to filter the 224 × 224 feature maps. Although this does not save many parameters, it does reduce the amount of calculation. Combined with the fast data processing of YOLOv3, the training speed of the model can be accelerated to achieve real-time processing. The network structure of the MobileNetv3-YOLOv3 algorithm is shown in Figure 2.
Figure 2 shows the structure of the MobileNetv3-YOLOv3 algorithm. First, a 416 × 416 × 3 image is input into the MobileNetv3 network, and feature maps are obtained through the 16-layer convolutional feature extraction. The resulting feature maps are then fed into the YOLOv3 head for prediction at three levels, namely feature blocks of 52 × 52, 26 × 26 and 13 × 13; the feature maps of the three levels are associated with the labels of the corresponding anchor boxes, and the loss function is established. Finally, non-maximum suppression screens the candidate boxes and produces the detected targets. The flow chart is shown in Figure 3.
MobileNetv3-YOLOv3 reduces the computation of YOLOv3, but some disadvantages remain. Experiments show that, for 3 × 3 kernels, depthwise separable convolution requires 8–9 times less computation than standard convolution, which shrinks the model and reduces computation cost; however, when processing large-scale features there is still insufficient feature extraction. For an input frame, the multi-scale feature maps of YOLOv3 are defined as Formulas (4) and (5).
F_n = t_n(F_{n-1}) = t_n(t_{n-1}(\cdots t_1(I)))  (4)

D = f_n(d_n(F_n), \cdots, d_{n-k}(F_{n-k})), \quad n > k > 0  (5)
where F_n is the feature map of the nth layer, I is the input image, and t_n(\cdot) is the nonlinear mapping of the nth layer, so the nth-level feature map is obtained from the input through a series of convolution operations. In Formula (5), d_n(\cdot) is the detection function applied to the nth-level feature map and D is the detection result fused over k + 1 levels. It is not hard to see that when n changes, the detection result of the corresponding feature layer also changes, so to guarantee accurate detection the features of each layer must contain enough information. Although MobileNetv3-YOLOv3 adopts a deep convolutional network, it easily overlooks some shallow feature information during feature extraction; too little information is obtained and the final results degrade. It follows that MobileNetv3-YOLOv3 is not accurate on small targets, and the objects to be detected in this article, traffic signs, fall into the category of small target detection.
(2)
Add Focus Module
Because MobileNetv3 suffers from insufficient shallow feature extraction, this article introduces the Focus module, a concept first applied in YOLOv5. The Focus module slices the picture before it enters the backbone network, sampling the pixel values of the picture for down-sampling processing; after a subsequent convolution, a feature map without information loss is obtained. This effectively reduces the amount of calculation and speeds up processing. The Focus layer takes, for example, a 4 × 4 image and samples a value every other pixel, yielding four complementary sub-images with similar characteristics. In this way, the input channels are expanded to 4 times the original number, which is equivalent to stacking four copies of the image information. Taking Figure 4 as an example, an original image of 2r × 2r × n is input into the Focus structure for slicing; it first becomes a feature map of r × r × 4n, and then, after one convolution operation, a feature map of r × r × 8n. The number of channels is expanded by 4 times and no features are lost.
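A minimal sketch of this slicing operation is given below, following the Focus layer popularized by YOLOv5. The class name, output channel count and kernel size are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Slice the input into four pixel-interleaved sub-images, concatenate them
    along the channel axis (w x h x c -> w/2 x h/2 x 4c), then convolve."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch * 4, out_ch, k, stride=1,
                              padding=k // 2, bias=False)

    def forward(self, x):
        # Take every other pixel in each spatial direction: four complementary views
        patches = [x[..., ::2, ::2], x[..., 1::2, ::2],
                   x[..., ::2, 1::2], x[..., 1::2, 1::2]]
        return self.conv(torch.cat(patches, dim=1))

x = torch.randn(1, 3, 416, 416)
print(Focus(3, 32)(x).shape)   # torch.Size([1, 32, 208, 208])
```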
(3)
Add SPPNet Module
At present, deep neural networks require input pictures of a fixed size, such as 416 × 416; faced with arbitrary sizes and aspect ratios, this requirement reduces accuracy and lowers the recognition rate. To solve this problem, the spatial pyramid pooling network (SPPNet) was proposed. SPPNet can improve essentially any CNN: by fusing features of different resolutions, more effective features can be obtained, which makes it well suited to image recognition. SPPNet supports input pictures of any size, which makes multi-scale training more convenient, and it produces a fixed-size output from an input of any size, making it possible to extract features from multiple regions of a picture at one time. The module converts a feature map of any resolution into a feature vector matching the dimension of the fully connected layer, so the feature vector is no longer single-scale, effectively avoiding the image distortion caused by scaling, cropping and similar operations. SPPNet maps image pixels to the centers of the receptive fields of the feature map, enlarging the receptive field and improving the accuracy of the results. SPPNet also avoids repeated feature extraction in the network, which speeds up candidate box generation and greatly reduces cost. The SPPNet algorithm divides a feature candidate region of any size into blocks with size ratios of 4 × 4, 2 × 2 and 1 × 1, applies max pooling to each block, concatenates the pooled features into a fixed-dimensional output, and then performs the fully connected operation. The processing flow of the SPPNet module is shown in Figure 5.
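As a concrete reference, the sketch below shows a YOLO-style SPP block using the 5/9/13 max-pooling kernels listed for the SPP layer in Table 1. The wrapper convolutions and channel reduction are assumptions about a typical implementation, not the authors' code.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """YOLO-style spatial pyramid pooling: parallel max-pools with different
    kernel sizes, concatenated with the input and fused by a 1x1 convolution."""
    def __init__(self, in_ch, out_ch, kernels=(5, 9, 13)):
        super().__init__()
        hidden = in_ch // 2
        self.reduce = nn.Conv2d(in_ch, hidden, 1, bias=False)
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernels])
        self.fuse = nn.Conv2d(hidden * (len(kernels) + 1), out_ch, 1, bias=False)

    def forward(self, x):
        x = self.reduce(x)
        # Every branch keeps the spatial size, so receptive fields of several
        # scales are stacked along the channel dimension before fusion.
        return self.fuse(torch.cat([x] + [pool(x) for pool in self.pools], dim=1))

print(SPP(1024, 1024)(torch.randn(1, 1024, 13, 13)).shape)  # [1, 1024, 13, 13]
```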
(4)
Add CSPNet Module
The main function of the CSPNet module is to enhance the learning ability of the CNN, so that detection accuracy is preserved even when the network model is made lightweight. The CSPNet module can be easily integrated into ResNet [27] and DenseNet [28]; its generalization is very good, and it effectively reduces the amount of calculation when combined with other models, typically by 10% to 20%, while its accuracy remains higher than that of the original algorithm. Another important feature of CSPNet is that it can remove computational bottlenecks that demand high computing power; if a computational bottleneck is too high, the inference time becomes too long or some computing units sit idle. CSPNet reduces the computational bottleneck of PeleeNet by almost half and speeds up the operating efficiency of the algorithm. CSPNet can also reduce memory occupancy and make the algorithm more efficient to run. This article uses CSPNet's ease of combination, reduced computational bottleneck and lower memory consumption to optimize the YOLO network structure and make the algorithm more efficient, accurate and faster. CSPNet splits the features of a stage into two parts and then re-merges them across the stage; the main concept is to divide the gradient flow so that it propagates through different network paths. The flow is shown in Figure 6.
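The two-path idea can be sketched as follows. This is a simplified CSP-style block modeled on CSPNet [10]; the internal layout (bottleneck design, channel split) is an assumption for illustration, since the paper specifies the block only at the level of Table 1.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Residual unit used inside the transformed CSP path."""
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 1, bias=False)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)
        self.act = nn.SiLU()

    def forward(self, x):
        return x + self.act(self.conv2(self.act(self.conv1(x))))

class CSPBlock(nn.Module):
    """Split the feature map into two channel halves; only one half passes
    through the stacked bottlenecks, then the two paths are re-merged."""
    def __init__(self, in_ch, out_ch, n=3):
        super().__init__()
        hidden = in_ch // 2
        self.split1 = nn.Conv2d(in_ch, hidden, 1, bias=False)   # transformed path
        self.split2 = nn.Conv2d(in_ch, hidden, 1, bias=False)   # shortcut path
        self.blocks = nn.Sequential(*[Bottleneck(hidden) for _ in range(n)])
        self.merge = nn.Conv2d(hidden * 2, out_ch, 1, bias=False)

    def forward(self, x):
        return self.merge(torch.cat([self.blocks(self.split1(x)),
                                     self.split2(x)], dim=1))

print(CSPBlock(1024, 1024, n=6)(torch.randn(1, 1024, 13, 13)).shape)  # [1, 1024, 13, 13]
```

Splitting the channels this way means the gradient of the shortcut half never passes through the bottleneck stack, which is the mechanism CSPNet uses to cut redundant gradient information and computation.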

3.2. M-YOLO Backbone Network Design

Although MobileNetv3-YOLOv3 reduces the amount of calculation, it also has some defects: its detection accuracy for small targets is not high. Because of their low resolution, small objects occupy few pixels and carry little information, so their features are easily lost. This is because the image is not preprocessed before entering the MobileNetv3 backbone network. To solve these problems, this article uses the Focus module to preprocess the image, preventing the loss of image features, and introduces the spatial pyramid pooling network (SPPNet) to prevent image distortion. These two modules enhance small-target detection but also increase the amount of calculation. Therefore, the cross-stage partial network (CSPNet) is finally introduced to eliminate the repetitive features generated in the calculation process and remove the computational bottleneck, which preserves the accuracy and training speed of the model and reduces its size. The structure of the M-YOLO backbone network is shown in Table 1. The parameter configuration indicates, respectively, the number of input channels, the number of output channels, the stride and the size of the convolution kernel of each module. InvertedResidual(benck) denotes the inverted residual block of the MobileNetv3 network structure, Focus the Focus module, Conv a convolution layer, SPP the spatial pyramid pooling layer, and CSP the cross-stage partial network.
As shown in Table 1, the backbone network of M-YOLO consists of 18 network modules. First, the input image is processed by the Focus module, which slices the image to obtain a feature map without information loss; features are then extracted by convolution. Next, the processed features are optimized by the SPP and CSP modules. Finally, the processed feature map is convolved again to extract further features and ensure the integrity of the extracted features.

3.3. M-YOLO Network Structure

The M-YOLO network structure is shown in Figure 7. The network accepts any RGB color image as input. First, an image of dimension 416 × 416 × 3 is fed into the M-YOLO backbone; the Focus module slices the image and down-samples its pixel values, and after a series of convolution operations a feature map without information loss is obtained. The features are then passed to the SPP network through depthwise separable convolutions, which transforms the feature map into a feature vector matching the dimension of the fully connected layer; this strengthens the network layer and helps preserve the information in the feature map. The CSP module then performs the two-path operation, eliminating the repeated features generated during computation and improving the operating efficiency of the model. Finally, the features reach the detection head and, after successive up-sampling steps, produce output scales of 13 × 13 × 255, 26 × 26 × 255 and 52 × 52 × 255. Different scales detect objects of different sizes: 13 × 13 × 255 detects large targets, 26 × 26 × 255 medium targets, and 52 × 52 × 255 small targets.

3.4. M-YOLO Algorithm Loss Function

M-YOLO uses CIoU as the loss function. IoU [29] denotes the intersection-over-union ratio and is a common indicator in target detection: it compares the predicted detection box with the ground-truth box, and the closer the value is to 1, the better the prediction matches the ground truth.
IoU = \frac{|A \cap B|}{|A \cup B|}  (6)
As a loss function, IoU has some problems. For example, if the predicted box does not intersect the ground-truth box, the IoU value is 0, which cannot reflect how far apart the two boxes are; the loss is then also 0, no gradient can be propagated, and the model cannot be trained. GIoU [30] was proposed to solve this problem of non-overlapping boxes blocking gradient propagation. The GIoU loss is shown in Formula (7), where A is the prediction box, B is the ground-truth box and C is the minimum enclosing box of A and B. However, GIoU converges slowly. DIoU [31] adds a penalty term that minimizes the distance between the two boxes; its formula is shown in Formula (8), where A_ctr and B_ctr are the center coordinates of the prediction box and the ground-truth box, and ρ(·) denotes the Euclidean distance. CIoU is an enhanced version of DIoU that additionally considers the aspect ratio: its loss combines the overlap area, the center-point distance and the aspect ratio, as shown in Formulas (9)–(11), where w_gt and h_gt are the width and height of the ground-truth box and w and h are the width and height of the prediction box.
L_{GIoU} = 1 - IoU(A, B) + |C \setminus (A \cup B)| / |C|  (7)

L_{DIoU} = 1 - IoU(A, B) + \rho^2(A_{ctr}, B_{ctr}) / c^2  (8)

L_{CIoU} = 1 - IoU(A, B) + \rho^2(A_{ctr}, B_{ctr}) / c^2 + \alpha \upsilon  (9)

\upsilon = \frac{4}{\pi^2} \left( \arctan\frac{w_{gt}}{h_{gt}} - \arctan\frac{w}{h} \right)^2  (10)

\alpha = \frac{\upsilon}{(1 - IoU) + \upsilon}  (11)
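For reference, Formulas (9)–(11) can be computed from axis-aligned boxes as in the sketch below, with boxes given in (x1, y1, x2, y2) format. This is an illustrative implementation, not the training code used in the paper.

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss (Formulas (9)-(11)) for boxes in (x1, y1, x2, y2) format."""
    # Intersection and union -> IoU
    inter_w = (torch.min(pred[..., 2], target[..., 2]) - torch.max(pred[..., 0], target[..., 0])).clamp(0)
    inter_h = (torch.min(pred[..., 3], target[..., 3]) - torch.max(pred[..., 1], target[..., 1])).clamp(0)
    inter = inter_w * inter_h
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Squared center distance rho^2 and squared diagonal c^2 of the enclosing box
    ctr_p = (pred[..., :2] + pred[..., 2:]) / 2
    ctr_t = (target[..., :2] + target[..., 2:]) / 2
    rho2 = ((ctr_p - ctr_t) ** 2).sum(-1)
    enc_w = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    enc_h = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    c2 = enc_w ** 2 + enc_h ** 2 + eps

    # Aspect-ratio consistency term v (Formula (10)) and its weight alpha (Formula (11))
    w_p, h_p = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    w_t, h_t = target[..., 2] - target[..., 0], target[..., 3] - target[..., 1]
    v = (4 / math.pi ** 2) * (torch.atan(w_t / (h_t + eps)) - torch.atan(w_p / (h_p + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v
```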

4. Experimental Results and Analysis

4.1. Experimental Dataset

4.1.1. CCTSDB Dataset

The CCTSDB dataset [32] was collected in China and contains Chinese traffic signs in three categories: prohibition, warning and instruction, as shown in Figure 8. A large amount of data is needed to ensure reliable results. The dataset contains 16,123 images, of which 10,123 are used as the training set, 3000 as the validation set and 3000 as the test set. All images were taken in real scenes, which matches traffic sign detection in the complex scenes considered here.

4.1.2. HRRSD Dataset

Objects occupying fewer than 80 pixels in a 256 × 256 image can be defined as small objects. Detecting traffic signs in complex scenes, especially small-target traffic signs, is currently very difficult, and there is no dataset dedicated to small-target traffic signs. Therefore, to verify the small-target detection ability of the proposed algorithm, the public HRRSD small-target dataset is used for testing. The HRRSD dataset was released by the Xi'an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences. Each target in the dataset occupies between 35 and 80 pixels, which fully conforms to the definition of a small target. HRRSD contains 13 categories, including ship, bridge, ground track field, aircraft, vehicle and crossroad, and each category is divided into training, validation and test sets, as shown in Table 2.

4.2. Experimental Configuration

The experiments were conducted on an Ubuntu 18.04.4 LTS system using the PyTorch 1.8.1 deep learning framework; the configuration of the training platform is shown in Table 3.
The training parameters are listed in Table 4: lr0 is the initial learning rate, lrf is the cosine annealing hyperparameter, momentum is the learning-rate momentum, weight_decay is the weight decay coefficient, epoch is the number of training epochs, and batchsize is the batch size.
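As a convenience for reproduction, the values in Table 4 could be gathered into a configuration dictionary such as the one below. The key names and the img_size entry follow common YOLO training conventions and are assumptions, not the authors' configuration file.

```python
# Hyperparameters from Table 4, collected for a training script (illustrative only)
train_cfg = {
    "lr0": 0.01,            # initial learning rate
    "lrf": 0.2,             # final learning-rate factor for cosine annealing
    "momentum": 0.937,      # SGD momentum
    "weight_decay": 0.0005, # weight decay coefficient
    "epochs": 110,          # number of training epochs
    "batch_size": 12,       # batch size
    "img_size": 416,        # input resolution used by M-YOLO (assumed key name)
}
```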

4.3. Experimental Evaluation Index

Target detection uses several indicators to evaluate an algorithm. Precision (P) is the proportion of detections that are correctly classified traffic signs: for example, if there are 10 traffic signs in a picture and the algorithm reports four targets, three of which are traffic signs, the precision is 75%. Recall (R) is the proportion of actual traffic signs that are detected: for example, if there are 10 traffic signs in a picture and the algorithm detects three of them, the recall is 30%. The mean average precision (mAP) averages the per-class average precision over all classes. Frames per second (FPS) indicates how many pictures can be processed per second. The calculations are given in Formulas (12)–(15), where TP is the number of correctly predicted positive samples, FP is the number of falsely predicted positive samples, FN is the number of missed positive samples, and p(r_c) is the precision at recall r_c for class c.
P = \frac{TP}{TP + FP}  (12)

R = \frac{TP}{TP + FN}  (13)

AP_c = \frac{1}{N_c} \sum_{r_c \in R_c} p(r_c)  (14)

mAP = \frac{1}{N} \sum_c AP_c  (15)
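To make Formulas (12) and (13) concrete, the counts can be tallied as below once detections have been matched to ground truth by an IoU threshold. This is a simplified sketch; the paper's evaluation follows the standard mAP@0.5 protocol.

```python
def precision_recall(num_tp, num_fp, num_fn):
    """Formulas (12) and (13): precision over all detections, recall over all ground truths."""
    precision = num_tp / (num_tp + num_fp) if (num_tp + num_fp) else 0.0
    recall = num_tp / (num_tp + num_fn) if (num_tp + num_fn) else 0.0
    return precision, recall

# Example from the text: 10 signs in an image, 4 detections, 3 of them correct
# -> precision 3/4 = 75%, recall 3/10 = 30%
print(precision_recall(num_tp=3, num_fp=1, num_fn=7))  # (0.75, 0.3)
```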

4.4. Experimental Results

4.4.1. Effectiveness Experiments

In order to verify the proposed algorithm, the added modules are combined with YOLOv3 in ablation experiments on the public CCTSDB dataset. The detection results are shown in Table 5, which lists the three evaluation indicators P, R and mAP; a check mark indicates that the corresponding module is used. With a CIoU threshold of 0.6, the algorithm proposed in this article improves the P value by 7.4%, the R value by 0.5% and the mAP value by 3.3% compared with YOLOv3. The P and PR curves of the proposed algorithm on the CCTSDB dataset are shown in Figure 9a,b, where class 0 is a warning sign, class 1 an instruction sign and class 2 a prohibition sign. To enrich the verification, the confusion matrix is shown in Figure 10, and the training result curves in Figure 11. These experiments show that the proposed algorithm achieves a clear improvement.
In order to test the effectiveness of M-YOLO algorithm, this article adds relevant tests in the real scene, uses the vehicle camera to shoot, selects a captured video containing traffic signs for frame extraction processing, processes the video into multiple pictures, and tests the processed pictures with M-YOLO algorithm. The test results are shown in Figure 12 and Figure 13.
As shown in Figure 12, these test pictures were taken on urban roads, matching traffic sign detection in complex scenes. The orange boxes mark instruction signs and the green boxes mark prohibition signs. The detection accuracy of the M-YOLO algorithm is very high, and recognition is essentially complete.
As shown in Figure 13, these test pictures were taken on a highway, where the blue boxes mark warning signs. The detection accuracy of M-YOLO is again very high and recognition is essentially complete. In summary, the M-YOLO algorithm can be applied to detect traffic signs in complex environments.

4.4.2. Performance Comparison

In order to fully verify the algorithm proposed in this article, a large number of comparative experiments are carried out on the CCTSDB public dataset. The comparison results are shown in Table 6. According to the comparative experiments in Table 6, the algorithm M-YOLO proposed in this article achieves the best results in P value, R value and mAP value.
Several algorithms were selected for comparison. The SSD algorithm has many defects and cannot accurately detect traffic signs. Shan H et al. [33] proposed an improved SSD traffic sign detection algorithm that fuses the SPPNet module into SSD to improve average detection accuracy. To further improve the detection speed and accuracy of SSD, Kun Ren et al. [34] proposed an improved MobileNetv2-SSD traffic sign detection algorithm, which fuses MobileNetv2 with SSD to reduce repeated feature extraction. Faster R-CNN is the representative two-stage detection algorithm; it can detect traffic signs accurately, but the two-stage design limits its detection speed. The most representative one-stage algorithm is YOLOv3, which achieves good results in both detection speed and accuracy. Chen Changchuan et al. [23] proposed the T-YOLO algorithm based on YOLOv3, which integrates the residual network and down-sampling and replaces the general pooling layers with convolution layers; its detection accuracy on CCTSDB reaches 97.3%, but its detection speed is only 19.3 frames per second. The latest algorithms, YOLOv4 and YOLOv5l, have excellent detection accuracy and speed, but there is still room for improvement. The M-YOLO algorithm proposed in this article is optimized by combining multiple modules; it maintains a detection speed close to that of YOLOv5l and outperforms the previous algorithms in precision, recall and average detection accuracy.
In this article, six conditions of CCTSDB dataset were selected for testing, including normal environment, deformation environment, noise environment, dark environment, reflective environment and ultra-distant perspective. The YOLOv5l algorithm is compared with the algorithm in this article. The test results under normal environment are shown in Figure 14a–d, where 0 stands for warning traffic sign, 1 for indicating sign, and 2 for prohibiting traffic sign.
As can be seen from the figure, the algorithm in this article performs well in the normal environment. In the first pair of images, the detection confidences of the YOLOv5l algorithm for the warning signs are 79% and 71%, while those of the M-YOLO algorithm are 92% and 91%, a clear improvement. In figure (c), YOLOv5l detects the prohibition sign with 67% confidence, while M-YOLO achieves 74%. In the normal environment, the M-YOLO algorithm therefore provides a good improvement. When deformation interference is added, the results are compared in Figure 15a–d.
As can be seen from the figure, the algorithm in this article outperforms the YOLOv5l algorithm in the distorted deformation environment. Figure (c) shows that YOLOv5l makes detection errors and mistakes other red signs for traffic signs. In figure (d), the detection confidence of the proposed algorithm is 37% higher than that of YOLOv5l. When noise interference is added, the results are compared in Figure 16a–d.
In the noise environment, the difference between the detection results of the M-YOLO and YOLOv5l algorithms is small, within 5%. The comparison of detection results in the reflective environment is shown in Figure 17a–d.
In the reflective environment, there is little difference between the detection effect of the proposed algorithm and that of YOLOv5l, with the detection confidence improved by about 3% over YOLOv5l. The comparison of detection results in the dark environment is shown in Figure 18a–d.
Comparing the detection results in the dark environment shows that the M-YOLO algorithm performs very well and is better at detecting small traffic signs, such as the warning sign in figure (a); its confidence on small targets is 30% higher than that of YOLOv5l. The comparison of detection results in the ultra-long-distance environment is shown in Figure 19a–d.
From the ultra-long-distance perspective, the detection effect of the M-YOLO algorithm is much better than that of YOLOv5l. As shown in figure (a), the detection confidence for the small prohibition sign is 41% higher than that of YOLOv5l. In figure (c), YOLOv5l fails to recognize the distorted traffic sign high above the road, while M-YOLO detects the distorted small-target sign with good results.
Because there is no dedicated dataset for small-target traffic signs, the proposed algorithm is compared with different algorithms on the public HRRSD dataset to verify its small-target detection ability. The comparison results are shown in Table 7. Compared with the other mainstream algorithms, the detection accuracy and detection speed of the proposed algorithm are the best; its mAP value is 9.9% higher than that of YOLOv5l. These experiments show that the algorithm has clear advantages in small-target detection and is suitable for this task.
In order to more fully verify the small-target detection capability of the algorithm in this article, the test results of the algorithm in this article and YOLOv5l algorithm on HRRSD are shown in Figure 20a,b.
As can be seen from the figure, the detection accuracy of M-YOLO is higher than that of YOLOv5l, which mistakenly identifies other targets as storage tanks. M-YOLO therefore detects small targets with higher accuracy.

5. Conclusions

To address the problems that current mainstream algorithms have low accuracy in traffic sign detection and are easily disturbed by various factors, this article proposes the M-YOLO traffic sign detection algorithm, which detects traffic signs quickly and accurately. The comparative experiments on the CCTSDB dataset show that M-YOLO achieves the best precision, recall and average detection accuracy, with an average detection accuracy of 97.8%. The algorithm was also tested in a variety of scenarios and compared with the YOLOv5l algorithm, and the results show that the proposed algorithm performs better. Because no small-target traffic sign dataset has been published, the algorithm was also compared with other algorithms on the HRRSD small-target dataset; it achieves the best detection speed and detection accuracy among the compared methods, with an average detection accuracy of 85.5%. However, the M-YOLO algorithm still needs improvement in many respects. Although fusing multiple modules improves the detection accuracy and operating efficiency of the algorithm, it may also limit the acquisition of global information. In the future, an attention mechanism module can be introduced to better capture global information and further improve detection performance. We will also continue to optimize the performance and stability of the detection model and try to transfer it to other work scenarios. Object detection in real road scenes remains a topic worthy of long-term research.

Author Contributions

Conceptualization, Y.L. (Yuchen Liu); methodology, Y.L. (Yuchen Liu); software, Y.L. (Yuchen Liu); validation, Y.L. (Yuchen Liu) and Z.Z.; formal analysis, G.S.; investigation, G.S.; resources, Y.L. (Yuchen Liu); data curation, Y.L. (Yuchen Liu); writing-original draft preparation, Y.L. (Yuchen Liu); writing-review and editing, G.S., Z.Z. and Y.L. (Yanxiang Li). All authors have read and agreed to the published version of the manuscript.

Funding

This work is financially supported by Natural Science Foundation of Xinjiang Uygur Autonomous Region (Grant No. 2020D01C047).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Krizhevsky, A.; Sutskever, I.; Hinton, G. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; MIT Press: Cambridge, MA, USA, 2012; Volume 1, pp. 1097–1105. [Google Scholar]
  2. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  3. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 779–788. [Google Scholar]
  4. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; Springer: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  5. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as points. arXiv 2019, arXiv:1904.07850. [Google Scholar]
  6. Law, H.; Deng, J. Cornernet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 734–750. [Google Scholar]
  7. Liu, X.; Chi, M.; Zhang, Y.; Qin, Y. Classifying high resolution remote sensing images by fine-tuned VGG deep networks. In Proceedings of the IGARSS 2018—2018 IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 23–27 July 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 7137–7140. [Google Scholar]
  8. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  9. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  10. Wang, C.Y.; Liao, H.Y.M.; Yeh, I.; Wu, Y.; Chen, P.; Hsieh, J. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 390–391. [Google Scholar]
  11. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  12. Benallal, M.; Meunier, J. Real-time color segmentation of road signs. In Proceedings of the Canadian Conference on Electrical and Computer Engineering, Montreal, QC, Canada, 4–7 May 2003; pp. 1823–1826. [Google Scholar]
  13. Yang, Y.; Wu, F. Real-time traffic sign detection via color probability model and integral channel features. In Proceedings of the Chinese Conference on Pattern Recognition, Montreal, QC, Canada, 4–7 May 2003; Springer: Berlin/Heidelberg, Germany, 2014; pp. 545–554. [Google Scholar]
  14. Gao, X.W.; Podladchikova, L.; Shaposhnikov, D.; Hong, K.; Shevtsova, N. Recognition of traffic signs based on their colour and shape features extracted using human vision models. J. Vis. Commun. Image Represent. 2006, 17, 675–685. [Google Scholar] [CrossRef]
  15. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–26 June 2005; IEEE: Piscataway, NJ, USA, 2005; Volume 1, pp. 886–893. [Google Scholar]
  16. Gavrila, D.M. Traffic sign recognition revisited. In Mustererkennung 1999; Springer: Berlin/Heidelberg, Germany, 1999; pp. 86–93. [Google Scholar]
  17. Wang, G.Y.; Ren, G.H.; Wu, Z.L.; Zhao, Y.; Jiang, L. A robust, coarse-to-fine traffic sign detection method. In Proceedings of the 2013 International Joint Conference on Neural Networks (IJCNN), Dallas, TX, USA, 4–9 August 2013; pp. 1–5. [Google Scholar]
  18. Paulo, C.F.; Correia, P.L. Automatic detection and classification of traffic signs. In Proceedings of the Eighth International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS’07), Santorini, Greece, 6–8 June 2007; IEEE: Piscataway, NJ, USA, 2007; p. 11. [Google Scholar]
  19. Creusen, I.M.; Wijnhoven, R.G.J.; Herbschleb, E.; de With, P.H.N. Color exploitation in hog-based traffic sign detection. In Proceedings of the 2010 IEEE International Conference on Image Processing, Hong Kong, 26–29 September 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 2669–2672. [Google Scholar]
  20. Achanta, R.; Hemami, S.; Estrada, F.; Susstrunky, S. Frequency-tuned salient region detection. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 1597–1604. [Google Scholar]
  21. Zhang, J.; Xie, Z.; Sun, J.; Zou, X.; Wang, J. A cascaded R-CNN with multiscale attention and imbalanced samples for traffic sign detection. IEEE Access 2020, 8, 29742–29754. [Google Scholar] [CrossRef]
  22. Li, X.; Zhang, J.; Xie, Z.; Wang, J. Fast Traffic Sign Detection Algorithm based on Three-scale Nested Residual Structure. Comput. Res. Dev. 2020, 057, 1022–1036. [Google Scholar]
  23. Chen, C.; Wang, H.; Zhao, Y.; Wang, Y.; Li, L.; Li, K.; Zhang, T. A depth based traffic sign recognition algorithm. Telecommun. Technol. 2021, 61, 76–82. [Google Scholar]
  24. Liu, F. Traffic sign Detection based on YOLOv4-Tiny. Inf. Technol. Informatiz. 2021, 5, 18–20. [Google Scholar]
  25. Zhou, K.; Zhan, Y.; Fu, D. Learning region-based attention network for traffic sign recognition. Sensors 2021, 21, 686. [Google Scholar] [CrossRef] [PubMed]
  26. Tan, M.; Chen, B.; Pang, R.; Vasudevan, V.; Sandler, M.; Howard, A.; Le, Q.V. Mnasnet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 2820–2828. [Google Scholar]
  27. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  28. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  29. Jiang, B.; Luo, R.; Mao, J.; Xiao, T.; Jiang, Y. Acquisition of localization confidence for accurate object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 784–799. [Google Scholar]
  30. Rezatofighi, H.; Tsoi, N.; Gwak, J.Y.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 658–666. [Google Scholar]
  31. Zheng, Z.; Wang, P.; Liu, W.; Li, J. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000. [Google Scholar]
  32. Zhang, Z.; Wang, H.; Zhang, J.; Yang, W. A vehicle real-time detection algorithm based on YOLOv2 framework. In Real-Time Image and Video Processing 2018. Int. Soc. Opt. Photonics 2018, 10670, 106700N. [Google Scholar]
  33. Shan, H.; Zhu, W. A small traffic sign detection algorithm based on modified ssd. In Proceedings of the IOP Conference Series: Materials Science and Engineering, Wuhan, China, 10–12 October 2019; IOP Publishing: Tokyo, Japan, 2019; Volume 646, p. 012006. [Google Scholar]
  34. Ren, K.; Huang, L.; Fan, C. Real-time Small Traffic Sign Detection Algorithm based on Multi-scale Pixel Feature Fusion. Signal Process. 2020, 36, 1457–1463. [Google Scholar]
Figure 1. Depthwise separable convolution and standard convolution structures.
Figure 2. MobileNetv3-YOLOv3 network structure. CBH: the smallest component of the MobileNetv3 network structure, consisting of Conv + BN + H-swish activation. CBL: the smallest component of the YOLOv3 network structure, consisting of Conv + BN + LeakyReLU activation. DBH: depthwise separable convolution + pointwise convolution.
Figure 3. MobileNetv3-YOLOv3 algorithm processing. FC: fully connected layer. NMS: non-maximum suppression.
Figure 4. Focus module processing.
Figure 5. SPPNet module processing.
Figure 6. CSPNet module processing.
Figure 7. M-YOLO network structure. CBH: the smallest component of the MobileNetv3 network structure, consisting of Conv + BN + H-swish activation. CBL: the smallest component of the YOLOv3 network structure, consisting of Conv + BN + LeakyReLU activation. DBH: depthwise separable convolution + pointwise convolution. Focus: slices the picture; composed of 4 slice layers. SPP: feature pooling; composed of 3 max-pool layers. CSP: dual-channel operation, mainly composed of CBL + Resunit + Conv.
Figure 8. Sample CCTSDB dataset.
Figure 9. P and PR curves. (a) P curve. (b) PR curve.
Figure 10. Confusion matrix.
Figure 11. Result curves.
Figure 12. Urban road test results.
Figure 13. Highway test results.
Figure 14. Comparison of algorithms in the normal environment. (a) YOLOv5l test results. (b) M-YOLO test results. (c) YOLOv5l test results. (d) M-YOLO test results.
Figure 15. Comparison of algorithms in the distortion deformation environment. (a) YOLOv5l test results. (b) M-YOLO test results. (c) YOLOv5l test results. (d) M-YOLO test results.
Figure 16. Comparison of algorithms in the noise environment. (a) YOLOv5l test results. (b) M-YOLO test results. (c) YOLOv5l test results. (d) M-YOLO test results.
Figure 17. Comparison of algorithms in the reflective environment. (a) YOLOv5l test results. (b) M-YOLO test results. (c) YOLOv5l test results. (d) M-YOLO test results.
Figure 18. Comparison of algorithms in the dark environment. (a) YOLOv5l test results. (b) M-YOLO test results. (c) YOLOv5l test results. (d) M-YOLO test results.
Figure 19. Comparison of algorithms from the ultra-long-distance perspective. (a) YOLOv5l test results. (b) M-YOLO test results. (c) YOLOv5l test results. (d) M-YOLO test results.
Figure 20. HRRSD dataset algorithm comparison. (a) YOLOv5l test results. (b) M-YOLO test results.
Table 1. M-YOLO backbone network.
Layer | Module | Parameter Configuration
1 | Focus | [3, 32, 3]
2 | InvertedResidual(benck) | [32, 16, 1, 1]
3 | InvertedResidual(benck) | [16, 24, 2, 6]
4 | InvertedResidual(benck) | [24, 24, 1, 6]
5 | InvertedResidual(benck) | [24, 32, 2, 6]
6 | InvertedResidual(benck) | [32, 32, 1, 6]
7 | InvertedResidual(benck) | [32, 32, 1, 6]
8 | Conv | [32, 1024, 3, 2]
9 | SPP | [1024, 1024, [5, 9, 13]]
10 | CSP | [1024, 1024, 6]
11 | InvertedResidual(benck) | [102, 64, 1, 6]
12 | InvertedResidual(benck) | [64, 96, 1, 6]
13 | InvertedResidual(benck) | [96, 96, 1, 6]
14 | InvertedResidual(benck) | [96, 96, 1, 6]
15 | InvertedResidual(benck) | [96, 160, 2, 6]
16 | InvertedResidual(benck) | [160, 160, 1, 6]
17 | InvertedResidual(benck) | [160, 160, 1, 6]
18 | InvertedResidual(benck) | [160, 230, 1, 6]
Table 2. HRRSD dataset details.
No. | Name | N_Train | N_Val | N_Test
1 | ship | 950 | 948 | 1988
2 | bridge | 1123 | 1121 | 2326
3 | ground track field | 859 | 856 | 2017
4 | storage tank | 1099 | 1092 | 2215
5 | basketball court | 923 | 920 | 2233
6 | tennis court | 1043 | 1040 | 2212
7 | airplane | 1226 | 1222 | 2451
8 | baseball diamond | 1007 | 1004 | 2022
9 | harbor | 967 | 964 | 1953
10 | vehicle | 1188 | 1186 | 2382
11 | crossroad | 903 | 901 | 2219
12 | T junction | 1066 | 1065 | 2289
13 | parking lot | 1241 | 1237 | 2480
Table 3. Experimental platform configuration.
Attribute | Value
OS | Ubuntu 18.04.4 LTS
GPU | NVIDIA RTX 2080Ti
CUDA | 10.0
Deep learning framework | PyTorch 1.8.1
Table 4. Experimental training configuration.
Attribute | Value
lr0 | 0.01
lrf | 0.2
momentum | 0.937
weight_decay | 0.0005
epoch | 110
batchsize | 12
Table 5. M-YOLO ablation results on the CCTSDB dataset.
Model | SPPNet | CSPNet | FOCUS | P/% | R/% | mAP@0.5/%
baseline | | | | 86.1 | 95.8 | 94.5
a | | | | 92 | 95.1 | 95.3
b | | | | 88 | 96 | 95.1
c | | | | 90.7 | 95.6 | 96.9
d | | | | 91.8 | 96.1 | 97.2
e | | | | 93.1 | 96.1 | 97.5
f | | | | 91.5 | 96.8 | 96.7
ours | | | | 93.5 | 96.3 | 97.8
Table 6. CCTSDB dataset comparison results of different algorithms.
Model | P/% | R/% | mAP@0.5/% | FPS/(f·s^-1)
Improved SSD [33] | - | - | 85 | -
Improved MobileNetv2-SSD [34] | - | - | 93.2 | 45
Faster R-CNN | 91.6 | 90.7 | 93.5 | 21.7
YOLOv3 | 88.1 | 94.6 | 96 | 73
T-YOLO [23] | 91.3 | - | 97.3 | 19.3
YOLOv4 | 88.1 | 92.8 | 95.8 | 78
YOLOv5l | 84.9 | 95.2 | 95.4 | 85
Ours | 93.5 | 96.3 | 97.8 | 84
Table 7. HRRSD dataset comparison results of different algorithms.
Model | mAP@0.5/% | FPS/(f·s^-1)
Fast R-CNN | 72.4 | 5
Faster R-CNN | 74.6 | 7
YOLOv2 | 79.2 | 80
SSD | 78.9 | 120
YOLOv3 | 81.2 | 110
YOLOv3-SPP | 83.3 | 85
YOLOv5l | 75.6 | 148
YOLOv5s-MobileNetv3 | 75.5 | 82
Ours | 85.5 | 150
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
