Article

Improved Detection of Multi-Class Bad Traffic Signs Using Ensemble and Test Time Augmentation Based on Yolov5 Models

by Ibrahim Yahaya Garta, Shao-Kuo Tai and Rung-Ching Chen *
Department of Information Management, Chaoyang University of Technology, Taichung 41349, Taiwan
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(18), 8200; https://doi.org/10.3390/app14188200
Submission received: 12 August 2024 / Revised: 6 September 2024 / Accepted: 10 September 2024 / Published: 12 September 2024

Abstract

Various factors such as natural disasters, vandalism, weather, and environmental conditions can affect the physical state of traffic signs. The proposed model aims to improve the detection of traffic signs affected by partial occlusion caused by overgrown vegetation, as well as displaced signs (knocked down or bent), perforated signs (damaged with holes), faded signs (color degradation), rusted signs (corroded surfaces), and defaced signs (graffiti and other vandalism). This research aims to improve the detection of bad traffic signs using three approaches. In the first approach, Spatial Pyramid Pooling-Fast (SPPF) and C3TR modules are introduced into the architecture of Yolov5 models. SPPF provides a multi-scale representation of the input feature map by pooling at different scales, which improves the quality of the feature maps and helps detect bad traffic signs of various sizes and perspectives. The C3TR module uses convolutional layers to enhance local feature extraction and transformers to boost understanding of the global context. Secondly, we use the predictions of the Yolov5 base models to implement a mean ensemble that improves performance. Thirdly, test time augmentation (TTA) is applied at test time using scaling and flipping to improve accuracy. Some signs are generated using stable diffusion techniques to augment certain classes. We test the proposed models on the CCTSDB2021, TT100K, GTSDB, and GTSRD datasets to ensure generalization and use k-fold cross-validation to further evaluate the performance of the models. The proposed models outperform other state-of-the-art models in comparison.

1. Introduction

Traffic sign detection plays an important role in modern transport management systems by promoting road safety and enhancing traffic management [1,2]. Ideally, traffic signs should be visible, reflective, retro-reflective, and legible to assist drivers in making decisions and maintaining road safety [3]. However, the physical condition of traffic signs and environmental factors can impact their detection and recognition [4]. The physical condition of signs can be affected by natural disasters such as hurricanes, earthquakes, and floods, which can even displace a sign. Vandalism, such as graffiti, can deface a sign, prolonged exposure to excessive sunlight can cause signs to fade, and moisture can cause rust [5]. Additionally, environmental factors such as overgrown vegetation can lead to occlusion [6]. All these conditions can affect the safety of road users, especially drivers of conventional and autonomous vehicles. We use the term ‘bad traffic sign’ to refer to signs that are physically degraded, in addition to those affected by poor lighting and illumination.
Traffic sign recognition (TSR) is a technology introduced to assist drivers by providing real-time information about road conditions ahead [7]. TSR systems face performance challenges in detecting bad traffic signs [4]; therefore, they need to be designed and trained to handle these varying conditions and challenges.
In recent years, various techniques have been developed and implemented to significantly improve automated systems for detecting traffic signs, particularly those affected by physical, weather, and lighting conditions. These techniques include modifications to deep learning architectures [8,9], ensemble methods in machine and deep learning [10], and test time augmentation [11], among others.
To address the low detection accuracy of obscured traffic signs, Luo et al. [12] proposed a method for detecting and recognizing partially obscured traffic signs during vehicle motion. The authors use a color-shape recognition approach to extract traffic sign regions and fuse information from multiple frames to enhance the accuracy of the traffic sign information. The algorithm can identify traffic signs not recognized by YOLOv4 and YOLOv8 when a vehicle is moving at speeds of 18 km/h, 36 km/h, and 54 km/h.
Yan et al. [13] introduced an adaptive enhancement technique to improve the quality of images in complex illumination scenes. The authors integrate a lightweight attention block into the single-stage target detection algorithm SSD, with ResNet and VGG as the backbone networks. Their approach is effective in detecting and recognizing images in dark scenes.
Lim et al. [14] used predictions from ResNet50, DenseNet121, and VGG16 pre-trained models to enhance traffic sign detection by implementing a voting ensemble. They evaluated their method on the German Traffic Sign Recognition Benchmark (GTSRB), the Belgium Traffic Sign Dataset (BTSD), and the Chinese Traffic Sign Database (TSRD) datasets, achieving accuracies of 98.84%, 98.33%, and 94.55%, respectively.
Wang et al. [15] proposed a Context-Aware module and an Attention-Driven Weighted Fusion Network module for the detection of small and occluded traffic signs. The context modules help the model understand the broader context in which traffic signs exist, which is useful in detecting small and occluded traffic signs. The results of the proposed method show improved performance on the TT100K traffic sign dataset.
In this study, we use Yolov5 models to improve the detection accuracy of bad traffic signs because of their speed, support for real-time detection, range of sizes suitable for our model, ease of customization, and support for ensemble and test time augmentation (TTA) techniques.
The major contributions of the research are: (1) introducing SPPF and C3TR modules into the head of Yolov5 models, which enhances the ability of the models to capture complex features, leading to improved accuracy; (2) using predictions of trained Yolov5 models as base learners to implement a mean ensemble, which shows improvement in performance metrics; (3) implementing test time augmentation by applying flipping and scaling to test images, which improves performance; (4) utilizing a dataset containing physically degraded signs, which are uncommon in public datasets, and using stable diffusion techniques to augment images of some classes; (5) understanding and comparing how classes in our dataset perform across the three methods.
The rest of this paper is organized as follows: Methods are presented briefly in Section 2. Results and analysis are presented in Section 3. Discussions on the overall results, performance, and future research direction are given in Section 4, while conclusions are drawn in Section 5.

2. Materials and Methods

2.1. Dataset

We collected 1495 traffic signs online and used a stable diffusion technique to generate 605 additional images, augmenting the dataset to 2100. The dataset is processed and labeled in YOLO format using the LabelImg annotation tool [16]. It consists of seven classes with 300 images each. We augmented the dataset with 191 faded, 179 defaced, 91 occluded, 139 perforated, and 5 rusted signs. The description of the dataset is presented in Figure 1. A total of 1470 images are used for training, 420 for validation, and 210 for testing.
The dataset is diverse and contains images with a wide range of variations in terms of physical conditions, angles, backgrounds, and object appearances. The diversity is advantageous in training a robust model that can generalize well to unseen data.

2.2. The Yolov5 Model

The Yolov5 model [17,18] is a state-of-the-art object detection model in the You Only Look Once (YOLO) family known for its accuracy and speed in real-time object detection tasks. There are five variants of the Yolov5 model, as shown in Table 1, each with a different number of parameters and Giga Floating Point Operations (GFLOP).
The structure of the Yolov5 model consists of the backbone, neck, and head, as shown in Figure 2. The input images are fed into the network via the CSPDarknet backbone [19] to extract features. In the neck, PANet [20,21] takes and enhances the features from the backbone, while the head makes the final predictions based on the features extracted by the backbone and neck. It outputs the bounding boxes, class predictions, and confidence scores for each detected object.

2.3. C3TR Module and SPPF

The C3TR module is an enhanced C3 module that integrates transformer blocks to improve its performance [22], as shown in Figure 3. The C3 module consists of multiple convolutional layers and shortcut connections. The convolutional layers are good at capturing local features and spatial hierarchies of bad traffic sign images while the shortcut connections help mitigate the problem of vanishing gradients. In the C3TR module, the traditional bottleneck layer is replaced with a transformer block, which incorporates a self-attention mechanism. The self-attention mechanism allows the model to weigh the importance of different parts of the input bad traffic sign image and allows the model to capture long-term dependencies. It is represented in Equation (1).
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V,$  (1)
where Q denotes the query, K the key, and V the value; $d_k$ is the dimension of the key vectors, and the softmax ensures that the attention weights sum to 1.
The Spatial Pyramid Pooling-Fast (SPPF) module [23] is an optimized version of the traditional Spatial Pyramid Pooling (SPP) layer [24]. The module extracts features at multiple spatial scales by applying pooling operations with different kernel sizes, providing a rich representation that is further processed by the network to detect bad traffic signs of varying sizes and conditions.
In the proposed model, as shown in Table 2, placing SPPF at the beginning of the head and C3TR modules before the detection head has significantly improved the performance of Yolov5n, Yolov5s, and Yolov5m models.

2.4. Mean Ensemble

A deep learning ensemble is a technique that combines the predictions of multiple deep learning models to obtain an improved final prediction, using strategies such as training different base models on the same dataset or using different data samples [25]. The goal is to combine the strengths of various deep learning architectures as base learners to achieve a better prediction than the respective base models and also reduce overfitting [26]. In the proposed models, as shown in Figure 4, Yolov5n, Yolov5s, and Yolov5m are used as base models. Each model is trained on images of bad traffic signs, and the predictions from all three models are aggregated using the mean ensemble method to obtain the final prediction.

2.5. Test Time Augmentation

Test time augmentation (TTA) is a method used to enhance a trained model’s performance by applying different transformations to test images and aggregating the predictions from these augmented versions at test time [27,28]. In the proposed model, predictions from Yolov5n, Yolov5s, and Yolov5m are aggregated to implement TTA. Each image is upscaled from 640 × 640 pixels to 832 × 832 pixels to capture more detail and improve accuracy. The model processes images at three different scales (1, 0.83, and 0.67) and applies two flip operations (no flip and left–right flip). However, only three augmented versions, namely the original image size with no flip, the 0.83 scale with a left–right flip, and the 0.67 scale with no flip, are selected for inference to balance the benefits of augmentation against computational cost. The final prediction for a test image is derived by aggregating the predictions from all augmented samples, which increases accuracy by considering different perspectives of the bad traffic sign images. Figure 5 shows the proposed model based on test time augmentation.

2.6. Experimental Environment and Settings

We use a GPU-enabled Google Colab environment with Python 3.10.12, PyTorch 2.2.1+cu121, and CUDA:0 (Tesla T4, 15102 MiB (NVIDIA, Santa Clara, CA, USA)) for the experiments. We use an SGD optimizer with an initial learning rate of 1 × 10−2 and a momentum of 0.937 to accelerate descent in the relevant direction and dampen oscillations. The network is trained for 100 epochs with a constant learning rate. A weight decay of 0.0005 is used to prevent overfitting. We set the batch size to 16 and the input image size to 640 × 640.

2.7. Performance Evaluation Metrics

We use precision (P), recall (R), and mean average precision at an IoU threshold of 0.5 (mAP@50) to evaluate the performance of the proposed models, as follows.
Precision (P) is the ratio of correctly predicted positive observations to the total predicted positives as given in Equation (2):
$\mathrm{Precision}\ (P) = \frac{TP}{TP + FP},$  (2)
where TP (true positive) is the number of correctly recognized bad traffic signs. FP (false positive) is the number of instances where the model incorrectly identifies an object as a bad traffic sign when it is not.
Recall (R) is the ratio of correctly predicted positive observations to all the observations in the actual class. It is given in Equation (3):
$\mathrm{Recall}\ (R) = \frac{TP}{TP + FN},$  (3)
where FN (false negative) is the number of bad traffic signs that are present but not identified by the model.
Mean average precision mAP@50 is calculated by averaging the precision values at different recall levels across all object classes. It is calculated in Equation (4):
$\mathrm{mAP} = \frac{1}{C}\sum_{i=1}^{C} AP_i,$  (4)
where $AP_i$ is the average precision (AP) value for class i, C is the total number of classes, and AP summarizes the precision–recall performance of the model across recall levels ranging from 0 to 1.
In addition to the above, we use the k-fold cross-validation technique to evaluate the stability of the metrics and provide a reliable performance evaluation. In k-fold cross-validation [29,30], a dataset is split into k subsets of equal size. The model is then trained on k−1 folds and validated on the remaining fold. The process is repeated k times with a different fold, which serves as the validation set. The performance metrics are averaged across all folds to obtain the cross-validated performance. The mean of these performance metrics provides an estimate of the model’s general performance. The standard deviation indicates the variability or consistency of the model’s performance across different data splits. This technique helps to assess how well the model is likely to generalize to unseen data and reduces bias in the performance evaluation.
In this study, we use k = 5. The test set has 210 images, while the remaining 1890 images are randomly split into five folds of 378 images each. One fold is used for validation and the remaining folds are used for training. Performance metrics (precision, recall, and mAP@50) are averaged over all folds. In addition, we calculate the mean, variance, and standard deviation to gain further insight into the behavior of the models.
Mean, also called average, is the sum of all the values in a dataset divided by the number of values. It is calculated in Equation (5).
$\mathrm{Mean}\ (\mu) = \frac{1}{N}\sum_{i=1}^{N} x_i$  (5)
where N is the number of observations and $x_i$ represents each individual observation.
Variance measures how much the values in a dataset differ from the mean. In k-fold cross-validation, it measures how consistent a model’s performance is across different data splits (folds). It is calculated in Equation (6).
$\mathrm{Variance}\ (\sigma^2) = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$  (6)
Standard deviation measures the amount of variation or dispersion in a dataset. It offers a clear measure of how much a model’s performance varies or remains consistent across the different folds. It is the square root of the variance as given in Equation (7).
$\mathrm{Standard\ deviation}\ (\sigma) = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$  (7)

3. Results

3.1. Results of Yolov5 Models

3.1.1. Base Models

The results of training the base models are presented in Table 3. The performance of each model is evaluated in terms of precision (P), recall (R), and mean average precision at 50% (mAP@50). From the table, we can see that Yolov5m has the highest precision, recall, and mAP@50, indicating improvement in detecting bad traffic signs. However, it also has the most layers, the most parameters, and the highest GFLOPs, making it the most computationally expensive compared to Yolov5n and Yolov5s. On the other hand, Yolov5n has the lowest precision, recall, and mAP@50, but it is the least computationally expensive model.

3.1.2. Improved Models

The results of the improved models, as shown in Table 4, indicate that Yolov5m has the highest mAP@50 of 0.845, Yolov5n recorded the lowest mAP@50 of 0.818, while Yolov5s achieved a mAP@50 of 0.839. In terms of precision and recall, Yolov5m recorded the highest values while Yolov5n had the lowest.

3.1.3. Comparisons of ‘All Class’ Performance

We compare the overall performance of the base models and the improved models across all classes. Figure 6 shows a comparison of mAP@50 before and after improvement. The introduction of the SPPF and C3TR modules in the model architecture has led to increases in performance. Specifically, the mAP@50 of Yolov5n increased from 0.803 to 0.818, Yolov5s from 0.820 to 0.839, and Yolov5m from 0.838 to 0.845. Additionally, Tables 3 and 4 indicate that the models have recorded increases in precision, parameters, and GFLOPs.

3.1.4. K-Fold Cross-Validation

Table 5 shows the performance metrics across each fold of the dataset for Yolov5n, Yolov5s, and Yolov5m models.
Table 6 shows the average precision, recall, and mAP@50 across the folds, together with the mean, variance, and standard deviation of each metric. The mean precision is 0.6467, the mean recall is 0.578, and the mean mAP@50 is 0.602. The standard deviation of 0.0420 for mAP@50 indicates that the overall detection performance does not differ significantly across the models.
Table 7 shows the performance of the splits across the improved models. The mAP@50 variance of 0.00176 indicates that the models perform stably across folds.

3.1.5. Performance of Each Class on the Base and Improved Models

A comprehensive view of the performance of each model across different classes before the introduction of the modules is presented in Table 8 and after in Table 9. The mAP@50 of each class is summarized below:
  • Occluded: The mAP@50 improved from 0.71 to 0.773 for Yolov5n, decreased from 0.821 to 0.80 for Yolov5s, and also decreased from 0.82 to 0.808 for Yolov5m;
  • Displaced: The mAP@50 decreased from 0.796 to 0.786 for Yolov5n, increased from 0.763 to 0.971 for Yolov5s, and increased from 0.771 to 0.80 for Yolov5m;
  • Perforated: The mAP@50 increased from 0.875 to 0.876 for Yolov5n, increased from 0.872 to 0.903 for Yolov5s, and increased from 0.887 to 0.919 for Yolov5m;
  • Faded: The mAP@50 decreased from 0.764 to 0.719 for Yolov5n, increased from 0.765 to 0.795 for Yolov5s, and increased from 0.788 to 0.832 for Yolov5m;
  • Rusted: The mAP@50 increased from 0.872 to 0.927 for Yolov5n, increased from 0.884 to 0.892 for Yolov5s, and increased slightly from 0.915 to 0.919 for Yolov5m;
  • Defaced: The mAP@50 increased from 0.76 to 0.785 for Yolov5n, increased from 0.778 to 0.827 for Yolov5s, and decreased from 0.814 to 0.798 for Yolov5m;
  • Good: The mAP@50 increased from 0.852 to 0.861 for Yolov5n, increased slightly from 0.86 to 0.864 for Yolov5s, and decreased from 0.869 to 0.841 for Yolov5m;
In general, the mAP@50, precision, and recall have improved for some classes, which indicates that the modifications made to the models have been effective in improving their performance.

3.2. Results of Mean Ensemble

Table 10 shows the results for the ‘all class’ category using the mean ensemble of the base models. The ensemble achieved a mAP@50 of 0.84, which is higher than the mAP@50 of any of the base models shown in Table 3. However, Yolov5m has a precision of 0.839, slightly above the mean ensemble’s 0.837, and a recall of 0.782 compared to the mean ensemble’s 0.772. The ensemble can process approximately 31.95 frames per second.
Figure 7 shows the precision–recall curve graph for all the classes. The graph evaluates the performance of the mean ensemble model at different thresholds. The model has an average precision (AP) of 0.785 when detecting occluded traffic signs, AP of 0.779 when detecting displaced signs, and AP of 0.898 when detecting perforated traffic signs. Furthermore, it has an AP of 0.798 when detecting faded traffic signs, an AP of 0.904 when detecting rusted traffic signs, an AP of 0.818 when detecting defaced traffic signs, and an AP of 0.897 when detecting good traffic signs. The model has a mean average precision (mAP) of 0.84 at an intersection over union (IoU) of 0.5 when detecting all classes of traffic signs.

3.3. Results of Test Time Augmentation

Table 11 shows the precision, recall, and mAP@50 of all the classes obtained with test time augmentation. The mAP@50 for ‘all class’ under TTA is 0.847, which outperforms the mAP@50 of any of the base models shown in Table 3. Precision also increased from 0.808 to 0.867, and recall increased from 0.755 to 0.762.
Figure 8 shows the F1 score curve of TTA, which peaks at 0.80 at a confidence threshold of 0.736. This implies that the proposed model minimizes both false positives and false negatives across all classes. The model is most effective in detecting good traffic signs, with a peak F1 score of 0.85.

3.4. Comparison of All the Proposed Models

We compare Yolov5n, Yolov5s, Yolov5m, the mean ensemble, and test time augmentation based on the mean average precision at a 50% Intersection over Union (mAP@50) threshold, as shown in Figure 9. Yolov5n recorded the lowest accuracy of 0.803 among the base models, Yolov5s improved slightly to 0.820, and Yolov5m showed further improvement to 0.838, indicating that increasing the complexity of the model improves detection accuracy. The mean ensemble recorded a mAP@50 of 0.84, outperforming the base models. TTA achieved the highest performance with a mAP@50 of 0.847, which indicates that applying various augmentation techniques at test time improves performance significantly.

3.5. Test on Benchmark Datasets

We test the proposed models on the GTSRD, TT100K, CCTSDB2021, and GTSDB datasets. Figure 10 shows some detection results on the public datasets.
The German Traffic Sign Recognition Benchmark (GTSRD) [31] contains a large number of images of German Traffic Signs captured under various conditions. It is commonly used for training and evaluating traffic sign recognition systems.
Tsinghua-Tencent (TT100K) [32] includes 100,000 images of traffic signs from China, which provides a diverse set of traffic sign types and conditions. It is widely used for training models to recognize a wide variety of traffic signs.
Chinese Traffic Sign Detection Benchmark 2021 (CCTSDB2021) [33] is an updated version of the CCTSDB dataset. It consists of over 4000 real traffic scene images with detailed annotations.
The German Traffic Sign Detection Benchmark (GTSDB) [34] is similar to GTSRD but focuses on detection rather than recognition of traffic signs. The images are annotated with bounding boxes.
In Figure 10c, Yolov5m correctly detects good traffic signs but misclassifies an object as a rusted traffic sign, while Figure 10e is a misdetection recorded by Yolov5s.

4. Discussion

In training the base models, Yolov5m shows the best performance across all metrics, but it has higher computational costs, which can be reduced through layer reduction, model pruning, and quantization. Yolov5n and Yolov5s are more suitable when computational resources and speed are priorities. The modifications improved precision and mAP@50 for all models, with a slight increase in GFLOPs and the number of parameters. Recall remained stable, with minor decreases for Yolov5n and Yolov5m and an increase for Yolov5s; it can be improved through hyper-parameter tuning, increasing the number of training epochs, and lowering the confidence threshold during inference. The number of layers also increased, indicating improved performance at the cost of increased complexity and computational requirements. Despite the performance of the improved models, MSGC YOLOv8 [38] achieved a higher frame rate of 142 FPS compared to Yolov5n with an FPS of 75, as shown in Table 12; however, the FPS of Yolov5n meets the threshold for real-time applications. The proposed mean ensemble model performs best on rusted and good traffic signs, with the highest AP values, but performs worst on displaced traffic signs. Test time augmentation significantly improves accuracy but at a higher computational cost and increased inference time. Overall, our proposed models outperform the state-of-the-art models in terms of accuracy.
Employing k-fold cross-validation has further highlighted and provided insight into the performance of the models across data splits. In Table 6 and Table 7, YOLOv5n has high precision but low recall, which indicates that it misses detection of some traffic signs, while YOLOv5m has high recall but low precision, which leads to false positives. YOLOv5s shows balanced performance but needs improvement in overall accuracy. However, all models have a moderate mAP@50, indicating room for better localization accuracy, and show some variability across folds, suggesting the need for model stability improvements.
The three techniques we employ to improve the detection accuracy of bad traffic signs show a few cases where the mAP@50 score decreased, which suggests that there is room for further optimization.
In this study, we use YOLOv5 models to improve the detection accuracy of bad traffic signs because of their speed, support for real-time detection, the range of model sizes suitable for our task, ease of customization, and support for ensemble and test time augmentation (TTA) techniques. Other models such as YOLOv4 and YOLOv8 can also achieve good accuracy on the dataset, with YOLOv8 generally being more accurate than YOLOv5; nevertheless, YOLOv5 still provides state-of-the-art accuracy for our purposes.
Future research will focus on using different ensemble techniques on modified architectures of other YOLO models and introducing additional classes to the dataset for diversity.

5. Conclusions

In conclusion, this study enhances the detection accuracy of bad traffic signs using improved Yolov5 models, a mean ensemble, and test time augmentation techniques. The introduction of the SPPF and C3TR modules into the Yolov5n, Yolov5s, and Yolov5m architectures has improved their performance. The mean ensemble technique further improves mAP@50 over the base models. Implementing TTA through scaling and flipping at test time enhances precision, recall, and mAP@50 across all classes. These three techniques show significant performance improvements when tested on the TT100K, GTSRD, and CCTSDB2021 datasets. The use of the k-fold cross-validation technique gives insight into the performance of the models across five dataset splits and indicates directions for further improvement.
The results indicate that the improved Yolov5 models, mean ensemble, and test time augmentation (TTA) techniques can be effectively deployed in real-time scenarios. This study has made significant contributions to the field of traffic sign detection and recognition systems, which are essential for both conventional and autonomous vehicles.
Despite the improved performance, this study has limitations: (1) the size of the dataset and the lack of a class representing signs affected by varying lighting conditions; in future studies, we will increase the size of the dataset and extend the number of classes to include such signs. (2) Ensemble and test time augmentation techniques increase inference time; in future studies, techniques such as layer reduction, model pruning, and model quantization will be explored to reduce it.

Author Contributions

Writing—original draft, I.Y.G.; Writing—review & editing, S.-K.T.; Supervision, R.-C.C. All authors have read and agreed to the published version of the manuscript.

Funding

This paper is supported by the Ministry of Science and Technology, Taiwan. The Nos are NSTC-112-2221-E-324-003-MY3 and NSTC-112-2221-E-324-011-MY2.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wali, S.A.; Abdullahi, M.A.; Hanna, M.A.; Hussain, A.; Samad, S.A.; Ker, P.J.; Mansor, M.B. Vision-Based Traffic Sign Detection and Recognition Systems: Current Trends and Challenges. Sensors 2019, 19, 2093. [Google Scholar] [CrossRef] [PubMed]
  2. Prakash, A.J.; Scruthy, S. Enhancing Traffic Sign Recognition (TRS) by Classifying Deep Learning Models to Promote Road Safety. Signal Image Video Process. 2024, 18, 4713–4729. [Google Scholar] [CrossRef]
  3. Saleh, R.; Fleyeh, H. Factors Affecting Night-Time Visibility of Retroreflective Road Traffic Signs: A Review. Int. J. Traffic Trans. Eng. 2021, 11, 115–128. [Google Scholar]
  4. Gua, J.; Lu, J.; Qu, Y.; Li, C. Traffic-Sign Spotting in the Wild via Deep Features. In Proceedings of the IEEE Intelligent Vehicles Symposium (IV), Changshu, China, 26–30 June 2018; pp. 120–125. [Google Scholar]
  5. Trpković, A.; Šelmić, M.; Jevremović, S. Model for the Identification and Classification of Partially Damaged and Vandalized Traffic Signs. KSCE J. Civ. Eng. 2021, 25, 3953–3965. [Google Scholar] [CrossRef]
  6. Chandnani, M.; Shukla, S.; Wadhvani, R. Multistage Traffic Sign Recognition under Harsh Environment. Multimed. Tools Appl. 2024. [Google Scholar] [CrossRef]
  7. Lim, X.R.; Lee, C.P.; Lim, K.M.; Ong, T.S.; Alqahtani, A.; Ali, M. Recent Advances in Traffic Sign Recognition: Approaches and Datasets. Sensors 2023, 23, 4674. [Google Scholar] [CrossRef]
  8. Cui, Y.; Guo, D.; Yuan, H.; Gu, H.; Tang, H. Enhanced YOLO Network for Improving the Efficiency of Traffic Sign Detection. Appl. Sci. 2024, 14, 555. [Google Scholar] [CrossRef]
  9. Dewi, C.; Chen, R.-C.; Jiang, X.; Yu, H. Deep Convolutional Neural Network for Enhancing Traffic Sign Recognition Developed on Yolov4. Multimed. Tools Appl. 2022, 88, 37821–37845. [Google Scholar] [CrossRef]
  10. Utane, A.S.; Mohod, S.W. Traffic Sign Recognition Using Hybrid Deep Ensemble Learning for Advanced Driving Assistance Systems. In Proceedings of the 2nd International Conference on Emerging Smart Technologies and Applications, Ibb, Yemen, 25–26 October 2022; pp. 1–5. [Google Scholar]
  11. Magalhães, R.; Bernardino, A. Quantifying Object Detection Uncertainty in Autonomous Driving with Test-Time Augmentation. In Proceedings of the 2023 IEEE Intelligent Vehicles Symposium (IV), Anchorage, AK, USA, 4–7 June 2023. [Google Scholar]
  12. Luo, S.; Wu, C.; Li, L. Detection and Recognition of Obscured Traffic Signs During Vehicle Movement. IEEE Access 2023, 11, 122516–122525. [Google Scholar] [CrossRef]
  13. Yan, Y.; Deng, C.; Ma, J.; Wang, Y.; Li, Y. A Traffic Sign Recognition Method under Complex Illumination Conditions. IEEE Access 2023, 11, 39185–39196. [Google Scholar] [CrossRef]
  14. Lim, X.P.; Lee, C.P.; Lim, K.M.; Ong, T.S. Enhanced Traffic Sign Recognition with Ensemble Learning. J. Sens. Actuator Netw. 2023, 12, 33. [Google Scholar] [CrossRef]
  15. Wang, G.; Zhou, K.; Wang, L.; Wang, L. Context-Aware and Attention-Driven Weighted Fusion Traffic Sign Detection Network. IEEE Access 2023, 11, 42104–42112. [Google Scholar] [CrossRef]
  16. Pande, B.; Padamwar, K.; Bhattacharya, S.; Roshan, S.; Bhamare, M. A Review of Image Annotation Tools for Object Detection. In Proceedings of the 2022 International Conference on Applied Artificial Intelligence and Computing (ICAAIC), Salem, India, 9–11 May 2022; pp. 976–982. [Google Scholar]
  17. Terven, J.; Córdova-Esparza, D.-M.; Romero-González, J.-A. A Comprehensive Review of YOLO Architectures in Computer Vision: From YOLOv1 to YOLOv8 and YOLO-NAS. Mach. Learn. Knowl. Extr. 2023, 5, 1680–1716. [Google Scholar] [CrossRef]
  18. Vijayakumar, A.; Vairavasundaram, S. YOLO-based Object Detection Models: A Review and its Applications. Multimed. Tools Appl. 2024. [Google Scholar] [CrossRef]
  19. Kamal, B.; Kishore, A.; Rajkumar, S.; Saravanakumar, K.; Dhanaselvam, J.; Rajesh, R. Traffic Speed Limit Sign Recognition using Deep Learning. In Proceedings of the 2024 International Conference on Inventive Computation Technologies (ICICT), Lalitpur, Nepal, 24–26 April 2024; pp. 754–761. [Google Scholar]
  20. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
  21. Hossain, M.A.; Hossain, A.; Jabiullah, M.I. Traffic Sign Detection and Recognition System Using Improved YOLOV5s. In Emerging Technologies in Computing (iCETiC 2022); Miraz, M.H., Southall, G., Ali, M., Ware, A., Eds.; Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering; Springer: Cham, Switzerland, 2022; Volume 463. [Google Scholar]
  22. Gao, F.; Huang, W.; Chen, X.; Weng, L. Traffic Sign Recognition Model Based on Small Object Detection. In PRICAI 2023: Trends in Artificial Intelligence (PRICAI 2023); Liu, F., Sadanandan, A.A., Pham, D.N., Mursanto, P., Lukose, D., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2024; Volume 14327. [Google Scholar]
  23. Krolkral, N.W.; Faraoun, K.M.; Bousahba, N.; Rezzouk, B.; Hamouda, I.A. Improved YOLOv5s for Object Detection. In Proceedings of the 2023 International Conference on Electrical Engineering and Advanced Technology (ICEEAT), Batna, Algeria, 5–7 November 2023; pp. 1–6. [Google Scholar]
  24. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef]
  25. Ganaie, M.A.; Hu, M.; Malik, A.K.; Tanveer, M.; Suganthan, P.N. Ensemble deep learning: A review. Eng. Appl. Artif. Intell. 2022, 115, 105151. [Google Scholar] [CrossRef]
  26. Mienye, I.D.; Sun, Y. A Survey of Ensemble Learning: Concepts, Algorithms, Applications, and Prospects. IEEE Access 2022, 10, 99129–99149. [Google Scholar] [CrossRef]
  27. Kimura, M. Understanding test-time augmentation. In Proceedings of the International Conference on Neural Information Processing 2021, Sanur, Bali, Indonesia, 8–12 December 2021; pp. 558–569. [Google Scholar]
  28. Shanmugam, D.; Blalock, D.; Balakrishnan, G.; Guttag, J. Better Aggregation in Test-Time Augmentation. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 1194–1203. [Google Scholar]
  29. Refaeilzadeh, P.; Tang, L.; Liu, H. Cross-Validation. In Encyclopedia of Database Systems; Liu, L., Özsu, M.T., Eds.; Springer: Boston, MA, USA, 2009. [Google Scholar] [CrossRef]
  30. Kohavi, R. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the International Joint Conference on Artificial Intelligence, Montreal, QC, Canada, 20–25 August 1995. [Google Scholar]
  31. Stallkamp, J.; Schlipsing, M.; Salmen, J.; Igel, C. The German Traffic Sign Recognition Benchmark: A multi-class classification competition. In Proceedings of the 2011 International Joint Conference on Neural Networks, San Jose, CA, USA, 2011; pp. 1453–1460. [Google Scholar]
  32. Zhu, Z.; Liang, D.; Zhang, S.; Huang, X.; Li, B.; Hu, S. Traffic-Sign Detection and Classification in the Wild. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016; pp. 2110–2118. [Google Scholar]
  33. Zhang, J.; Zou, X.; Kuang, L.-D.; Wang, J.; Sherratt, R.S.; Yu, X. CCTSDB 2021: A more comprehensive traffic sign detection benchmark. Hum.-Centric Comput. Inf. Sci. 2021, 12, 23. [Google Scholar]
  34. Houben, S.; Stallkamp, J.; Salmen, J.; Schlipsing, M.; Igel, C. Detection of traffic signs in real-world images: The German traffic sign detection benchmark. In Proceedings of the 2013 International Joint Conference on Neural Networks (IJCNN), Dallas, TX, USA, 4–9 August 2013; pp. 1–8. [Google Scholar]
  35. Chung, H.; Lim, M. Feature-Selection-Based Attentional-Deconvolution Detector for German Traffic Sign Detection Benchmark. Electronics 2023, 12, 725. [Google Scholar] [CrossRef]
  36. Abraham, A.; Purwanto, D.; Kusuma, H. Traffic Lights, and Traffic Signs Detection System using Modified You Only Look Once. In Proceedings of the 2021 International Seminar on Intelligence Technology and its Applications (ISITA), Surabaya, Indonesia, 21–22 July 2021; pp. 141–146. [Google Scholar]
  37. Wang, J.; Chen, Y.; Dong, Z.; Gao, M. Improved YOLOv5 Network for Real-Time Multi-Scale Traffic Sign Detection. Neural Comput. Appl. 2023, 35, 7853–7865. [Google Scholar] [CrossRef]
  38. Chen, B.; Fan, X. MSGC-YOLO: An Improved Lightweight Traffic Sign Detection Model under Snow Conditions. Mathematics 2024, 12, 1539. [Google Scholar] [CrossRef]
Figure 1. Sample images representing all the classes in the dataset: (a) occluded; (b) displaced; (c) faded; (d) perforated; (e) good; (f) rusted; (g) defaced.
Figure 2. Structure of the Yolov5 model.
Figure 3. Structures of C3 and C3TR modules.
Figure 4. Flowchart of the proposed ensemble model.
Figure 5. Flowchart of the proposed test time augmentation.
Figure 6. Comparison of accuracy of all classes for base and improved models.
Figure 7. Precision–recall curve of the mean ensemble.
Figure 8. F1 score of the TTA model.
Figure 9. Graph showing mAP@50 of the proposed models.
Figure 10. Detection results on some public datasets: (a) detection results on GTSRD by TTA; (b) detection results on a TT100K test image by the mean ensemble; (c) detection result showing a misclassification alongside correct detection of good traffic signs on a GTSRD image by the improved Yolov5m; (d) detection result by the improved Yolov5s on CCTSDB2021; (e) misdetection by Yolov5s on CCTSDB2021 (an object misdetected as a rusted traffic sign) alongside a correctly detected good traffic sign; (f) detection result by Yolov5m on the GTSDB dataset.
Table 1. Variants of the Yolov5 model showing parameters and GFLOPs.
Model | Parameters (Millions) | GFLOPs (Billion)
Yolov5n | 1.9 | 4.5
Yolov5s | 7.2 | 16.5
Yolov5m | 21.2 | 49.0
Yolov5l | 46.5 | 109.1
Yolov5x | 86.7 | 205.7
Table 2. Architecture of the proposed model.
Module | Filter | Kernel Size | Stride | Padding
Conv | 64 | 6 | 2 | 2
Conv | 128 | 3 | 2 | -
C3 | 128 | - | - | -
Conv | 256 | 3 | 2 | -
C3 | 256 | - | - | -
Conv | 512 | 3 | 2 | -
C3 | 512 | - | - | -
Conv | 1024 | 3 | 2 | -
C3 | 1024 | - | - | -
SPPF | 1024 | 5 | - | -
SPPF | 1024 | 5 | - | -
nn.Upsample | - | - | - | -
Concat | - | - | - | -
C3 | 512 | - | - | -
Conv | 512 | 3 | 2 | -
Concat | - | - | - | -
C3 | 1024 | - | - | -
C3TR | 1024 | - | - | -
Detect | nc, anchors | - | - | -
Table 3. Results of training the base models.
Model | Precision | Recall | mAP@50 | Layers | Parameters | GFLOPs
Yolov5n | 0.808 | 0.739 | 0.803 | 214 | 1,773,388 | 4.3
Yolov5s | 0.836 | 0.755 | 0.820 | 214 | 7,038,508 | 16.0
Yolov5m | 0.839 | 0.782 | 0.838 | 293 | 20,895,564 | 48.3
Table 4. Results of the improved models.
Model | Precision | Recall | mAP@50 | Layers | Parameters | GFLOPs
Yolov5n | 0.822 | 0.736 | 0.818 | 243 | 2,280,140 | 4.7
Yolov5s | 0.850 | 0.772 | 0.839 | 243 | 9,051,436 | 17.5
Yolov5m | 0.863 | 0.776 | 0.845 | 303 | 22,225,356 | 48.4
Table 5. Performance metrics of folds across models.
Split | Metric | Yolov5n | Yolov5s | Yolov5m
Fold 1 | P | 0.607 | 0.514 | 0.530
  | R | 0.627 | 0.667 | 0.792
  | mAP@50 | 0.644 | 0.672 | 0.703
Fold 2 | P | 0.639 | 0.765 | 0.893
  | R | 0.541 | 0.601 | 0.653
  | mAP@50 | 0.580 | 0.704 | 0.644
Fold 3 | P | 0.736 | 0.797 | 0.650
  | R | 0.476 | 0.496 | 0.605
  | mAP@50 | 0.581 | 0.609 | 0.666
Fold 4 | P | 0.560 | 0.429 | 0.489
  | R | 0.464 | 0.636 | 0.670
  | mAP@50 | 0.538 | 0.658 | 0.672
Fold 5 | P | 0.296 | 0.443 | 0.355
  | R | 0.452 | 0.459 | 0.538
  | mAP@50 | 0.394 | 0.437 | 0.567
Table 6. Average performance of k-fold cross-validation.
Model | Precision | Recall | mAP@50
Yolov5n | 0.842 | 0.512 | 0.547
Yolov5s | 0.589 | 0.571 | 0.610
Yolov5m | 0.509 | 0.651 | 0.649
Mean | 0.6467 | 0.578 | 0.602
Variance | 0.0201 | 0.003245 | 0.00177
Standard deviation | 0.142 | 0.057 | 0.0420
Table 7. Average performance of k-fold cross-validation on improved models.
Model | Precision | Recall | mAP@50
Yolov5n | 0.735 | 0.610 | 0.630
Yolov5s | 0.746 | 0.650 | 0.671
Yolov5m | 0.615 | 0.701 | 0.732
Mean | 0.698 | 0.654 | 0.677
Variance | 0.00352 | 0.00139 | 0.00176
Standard deviation | 0.059 | 0.0373 | 0.0419
Table 8. Performance of models on each class before improvement.
Class | Metric | Yolov5n | Yolov5s | Yolov5m
Occluded | P | 0.813 | 0.953 | 0.895
  | R | 0.691 | 0.648 | 0.680
  | mAP@50 | 0.710 | 0.821 | 0.820
Displaced | P | 0.843 | 0.811 | 0.790
  | R | 0.657 | 0.675 | 0.700
  | mAP@50 | 0.796 | 0.763 | 0.771
Perforated | P | 0.870 | 0.813 | 0.906
  | R | 0.770 | 0.786 | 0.787
  | mAP@50 | 0.875 | 0.872 | 0.887
Faded | P | 0.846 | 0.940 | 0.942
  | R | 0.710 | 0.684 | 0.710
  | mAP@50 | 0.764 | 0.765 | 0.788
Rusted | P | 0.777 | 0.811 | 0.803
  | R | 0.873 | 0.900 | 0.933
  | mAP@50 | 0.872 | 0.884 | 0.915
Defaced | P | 0.835 | 0.807 | 0.849
  | R | 0.636 | 0.727 | 0.767
  | mAP@50 | 0.760 | 0.778 | 0.814
Good | P | 0.669 | 0.713 | 0.684
  | R | 0.836 | 0.866 | 0.896
  | mAP@50 | 0.852 | 0.860 | 0.869
Table 9. Performance of models on each class after improvement.
Class | Metric | Yolov5n | Yolov5s | Yolov5m
Occluded | P | 0.923 | 0.832 | 0.869
  | R | 0.567 | 0.667 | 0.736
  | mAP@50 | 0.773 | 0.800 | 0.808
Displaced | P | 0.896 | 0.804 | 0.845
  | R | 0.615 | 0.700 | 0.703
  | mAP@50 | 0.786 | 0.971 | 0.800
Perforated | P | 0.833 | 0.896 | 0.959
  | R | 0.820 | 0.820 | 0.768
  | mAP@50 | 0.876 | 0.903 | 0.919
Faded | P | 0.760 | 0.938 | 0.959
  | R | 0.652 | 0.696 | 0.667
  | mAP@50 | 0.719 | 0.795 | 0.832
Rusted | P | 0.843 | 0.801 | 0.886
  | R | 0.883 | 0.917 | 0.917
  | mAP@50 | 0.927 | 0.892 | 0.919
Defaced | P | 0.856 | 0.857 | 0.838
  | R | 0.722 | 0.727 | 0.704
  | mAP@50 | 0.785 | 0.827 | 0.798
Good | P | 0.644 | 0.724 | 0.684
  | R | 0.896 | 0.881 | 0.866
  | mAP@50 | 0.861 | 0.864 | 0.841
Table 10. Results of ‘all class’ for the mean ensemble model.
Model | Precision | Recall | mAP@50 | FPS
Mean ensemble | 0.837 | 0.772 | 0.84 | 31.95
Table 11. Results of TTA across the classes.
Metric | All Class | Occluded | Displaced | Perforated | Faded | Rusted | Defaced | Good
Precision | 0.867 | 0.931 | 0.913 | 0.927 | 0.941 | 0.807 | 0.846 | 0.707
Recall | 0.762 | 0.619 | 0.657 | 0.803 | 0.697 | 0.950 | 0.727 | 0.881
mAP@50 | 0.847 | 0.806 | 0.804 | 0.909 | 0.808 | 0.900 | 0.810 | 0.891
Table 12. Comparisons of the proposed models with state-of-the-art models.
Model | Dataset | Precision | Recall | mAP@50 | FPS
FSADD [35] | GTSDB | - | - | 0.739 | -
YOLOv4-CSP [36] | Self-constructed | - | - | 0.797 | 29
Yolov5 [37] | TT100K | - | - | 0.6514 | -
MSGC YOLOv8 [38] | TT100K | - | - | 0.751 | 142
Yolov5n | Bad traffic signs | 0.822 | 0.736 | 0.818 | 75
Yolov5s | Bad traffic signs | 0.850 | 0.772 | 0.839 | 68
Yolov5m | Bad traffic signs | 0.863 | 0.776 | 0.845 | 53
Mean ensemble | Bad traffic signs | 0.837 | 0.772 | 0.84 | 32
TTA | Bad traffic signs | 0.867 | 0.762 | 0.847 | 12
