Article

YOLO-TSF: A Small Traffic Sign Detection Algorithm for Foggy Road Scenes

Rongzhen Li, Yajun Chen, Yu Wang and Chaoyue Sun
1 School of Computer Science, China West Normal University, Nanchong 637009, China
2 School of Physics and Electronic Engineering, Sichuan Normal University, Chengdu 610101, China
3 School of Electronic Information Engineering, China West Normal University, Nanchong 637009, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(18), 3744; https://doi.org/10.3390/electronics13183744
Submission received: 8 August 2024 / Revised: 5 September 2024 / Accepted: 9 September 2024 / Published: 20 September 2024
(This article belongs to the Section Artificial Intelligence)

Abstract

The accurate and rapid detection of traffic signs is crucial for intelligent transportation systems. To address the problems of traffic sign detection in road scenes, namely the large proportion of small targets and the misdetection, omission, and low recognition accuracy caused by fog, we propose YOLO-TSF, a model for detecting traffic signs in foggy road scenes. Firstly, we design the CCAM attention module and combine it with the idea of local–global residual learning to propose the LGFFM, which enhances the model's recognition capability in foggy weather. Secondly, we design MASFFHead by introducing the idea of ASFF to solve the feature loss problem of cross-scale fusion and to perform a secondary extraction of small targets. Additionally, we design the NWD-CIoU loss by combining NWD and CIoU to address the inadequate ability of IoU to learn the features of small targets. Finally, to address the lack of foggy traffic sign datasets, we construct a new foggy traffic sign dataset, Foggy-TT100k. The experimental results show that the mAP@0.5, mAP@0.5:0.95, Precision, and F1-score of YOLO-TSF are improved by 8.8%, 7.8%, 7.1%, and 8.0%, respectively, compared with YOLOv8s, which demonstrates its effectiveness in detecting small traffic signs in foggy scenes with visibility between 50 and 200 m.

1. Introduction

The field of autonomous driving is currently engaged in extensive research into Traffic Sign Detection and Recognition (TSDR) systems, and TSDR is an indispensable link in the driving systems of future automobiles. Traditional traffic sign recognition relies mainly on detecting distinctive visual characteristics such as color and shape, but since traffic signs are relatively small targets in the image and are susceptible to factors such as weather and occlusion by external objects, it suffers from serious problems such as omission, misdetection, and low recognition accuracy. In recent years, through the sustained efforts of researchers, artificial intelligence technology has made significant advances and has been integrated into many aspects of work and life, and autonomous driving is one area where these developments are particularly evident. The application of deep learning to traffic sign recognition has therefore become the subject of extensive research, with the resulting algorithms demonstrating a notable improvement in accuracy over traditional methods. The principal process can be divided into three stages. First, a real-time picture of the road is captured by the car's onboard camera. Then, the TSDR system extracts regions of interest in the image, and the traffic signs are localized and classified according to their feature information. Finally, the TSDR system feeds the classification results back to the driver by displaying or announcing them, so that the driver can respond according to the traffic sign information. However, weather strongly affects the effectiveness of the currently dominant traffic sign detection algorithms; in addition, problems such as small targets and complex backgrounds remain, especially in foggy and low-light conditions, so these algorithms cannot fully meet the requirements of automated driving, and further optimization by researchers is needed.
In the field of traffic sign recognition, researchers have made continuous improvements. Gong et al. [1] proposed a new model based on YOLOv3 [2] that fused a spatial pyramid pooling structure for traffic sign detection and was capable of fusing local and global features. Li et al. [3] selected Faster R-CNN [4] as the Baseline and designed a ResNet50-D [5] feature extractor, an Attention-guided Context Feature Pyramid Network (AC-FPN) [6], and automatic enhancement techniques to address the difficulty that a large proportion of traffic signs are small targets in the input image and are therefore hard to detect. Zhao et al. [7] proposed a lightweight model, SEDG-Yolov5, and introduced a response-based object-scaling knowledge distillation strategy to optimize the model, addressing the accuracy degradation caused by the lightweight design. Wang et al. [8] designed a novel feature pyramid structure, AF-FPN, to enhance the detection capability of YOLOv5 [9] for multi-scale targets while ensuring real-time detection. However, the performance of current traffic sign detection algorithms is susceptible to weather factors, and the problems of small targets and complex backgrounds remain. Detecting and recognizing traffic signs under complex weather and lighting conditions is therefore still a challenging task that requires further optimization.
Based on the above problems, more and more researchers are devoted to mitigating the effects of complex environments and weather on traffic sign detection. Saxena et al. [10] improved the YOLOv4 [11] algorithm and used data preprocessing and image enhancement strategies to enhance the illumination of nighttime images, addressing the poor detection of traffic signs at night. Wang et al. [12] used an image preprocessing module before feeding images into an improved YOLOv4 algorithm to achieve the classification and denoising of images in complex environments. Yao et al. [13] used a two-phase approach of light enhancement followed by target detection for low-illumination traffic sign images: they first used the Illumination–Reflect model to adjust the light and then used Mask R-CNN [14] for traffic sign detection, which detects low-illumination images better. However, although the above research performs well in traffic sign detection and image denoising, most current studies on object detection in complex environments focus primarily on applying denoising networks or image enhancement operations before feeding images into the object detection network. They do not solve the fundamental problem that current object detection algorithms detect poorly in complex weather conditions, and they do not make targeted improvements to these algorithms. Therefore, we propose YOLO-TSF, a foggy traffic sign detection algorithm based on YOLOv8s, which optimizes the object detection model without any preprocessing of the original image and improves the accuracy of traffic sign detection in foggy road scenes, especially for small targets.
The principal contributions of this paper are as follows:
  • We proposed the Channel-Coordinate Attention Module (CCAM) and combined it with the local–global residual learning structure to design the Local–Global Feature Fusion Module (LGFFM). This effectively addresses the issue of indistinct features in foggy images due to a decrease in contrast.
  • To solve the feature loss problem of small targets in cross-scale fusion, we designed the Multi-Head Adaptive Spatial Feature Fusion Detection Head (MASFFHead). It can effectively handle targets of different scales and, by integrating more shallow features, it reduces the feature loss problem of small targets after convolution pooling operations and performs secondary extraction of small targets.
  • We designed a new loss function, NWD-CIoU, which can calculate the similarity between bounding boxes without considering whether they overlap. This solves the problem of IoU being highly sensitive to objects of different scales.
  • In response to the lack of foggy traffic signs datasets, we took photos of foggy road scenes and performed fogging operations on the TT100K dataset. Finally, data enhancement processing, such as random flipping, Copy-Paste, and brightness adjustment, was performed to construct the Foggy-TT100k dataset.

2. Related Works

2.1. Object Detection Models

With the development of artificial intelligence technology, automatic driving requires higher detection accuracy and faster speed, so the object detection algorithm plays a crucial role. From the early traditional methods to today's deep learning-based methods, object detection algorithms have more than twenty years of development history. Current deep learning-based object detection algorithms can be categorized into two types: two-stage and one-stage detection algorithms. Two-stage algorithms such as R-CNN [15], Fast R-CNN [16], and Faster R-CNN [4] first extract candidate boxes based on target features and then classify and localize the targets with a convolutional neural network. One-stage algorithms, represented by the YOLO series, DETR [17], and SSD [18], do not need to extract candidate boxes in advance but directly predict the locations and categories of different targets. Two-stage algorithms have the advantage of higher accuracy, but their more complex structure leads to slower detection, making it challenging to meet the recognition speed requirements of the automatic driving task. One-stage algorithms, under continuous development in recent years, offer faster detection while their accuracy has also steadily improved. Considering the real-time requirements of autonomous driving in traffic scenarios, we chose the YOLO series of one-stage algorithms to complete the traffic sign detection and recognition task.
YOLOv1 [19] was released by Redmon et al. in 2016; it directly adopts a regression approach to classify and localize targets, with the core idea of dividing the input image into multiple grid cells, each of which simultaneously predicts bounding boxes and computes category probabilities, giving a great improvement in detection speed over traditional target detection algorithms. Subsequently, YOLOv2 [20] and YOLOv3 [2] were released in 2017 and 2018. As the classic version of the YOLO series, YOLOv3 uses the DarkNet-53 network as the backbone along with the idea of FPN [21], which significantly improves its detection performance and real-time performance. YOLOv4 [11] and YOLOv5 [9] were both released in 2020; YOLOv4 adopts CSPDarknet53 as the backbone network and uses SPP [22] and a feature fusion network to make the network more lightweight, while YOLOv5 uses improvement strategies such as data augmentation, the BottleneckCSP and Focus modules, and an FPN-PAN structure in place of the original PAN structure, which captures relatively richer feature information of small targets and further improves detection accuracy and real-time efficiency; it has thus become one of the most widely used algorithms in target detection in recent years. More recently, the YOLO series has been continuously enhanced, and YOLOX [23], YOLOv6 [24], YOLOv7 [25], and other subsequent versions have been successively released, each further improving target recognition performance.
In 2023, the Ultralytics team, which had previously developed YOLOv5, released YOLOv8 [26], with support for image classification, object detection, and instance segmentation tasks. YOLOv8 uses a C2f structure in the backbone and neck instead of the C3 structure of YOLOv5, which enhances feature fusion and makes YOLOv8 even lighter. Compared with earlier YOLO family models, the head is modified more substantially: it uses a decoupled head structure, separates the classification and detection heads, and changes from the original anchor-based approach to an anchor-free one. Overall, YOLOv8 draws on the designs of the historical YOLO versions and other cutting-edge models, which gives it higher detection accuracy and faster detection speed and makes it better suited to engineering practice. Given this, we selected YOLOv8s as the traffic sign detection algorithm and improved it.
However, although YOLO and other target detection algorithms achieve accurate detection results in many applications, their performance degrades greatly in some special environments. For example, in foggy weather, a large number of suspended particles in the air affect the propagation of light, resulting in blurred images and weak illumination, which in turn cause missed detections, false detections, and other problems. For autonomous driving, this is a serious issue that affects the safety of both the driver and other passengers. It is therefore all the more necessary for researchers to focus on developing and optimizing target detection algorithms for various complex environments and extreme weather conditions so as to handle the complexity and variety of road scenes efficiently.

2.2. Foggy Object Detection

Recently, researchers have started looking into target detection techniques for complex weather and illumination conditions, typically foggy days, and the resulting low-light conditions. To solve this challenging problem, some researchers have improved the target detection in foggy conditions by refining the defogging network and combining it with conventional target detection algorithms. Others have tried to optimize the foggy target detection methods by designing image preprocessing methods as well as data enhancement methods.
For instance, Ma et al. [27] improved the accuracy of target detection in foggy scenes by performing a defogging operation with the DehazeNet [28] algorithm, followed by the SSD target detection algorithm. Huang et al. [29] designed DSNet, an object detection algorithm for bad weather conditions, in which the image is first enhanced and target detection is then performed with RetinaNet [30]. Li et al. [31] designed a non-associative foggy target detection method combining the defogging network PDR-Net and Faster R-CNN, and their experiments proved that the method achieved better results on various metrics for both real and synthetic images. To increase the accuracy of foggy target detection, Li et al. [32] designed an end-to-end method based on AODNet and Faster R-CNN, which improves foggy target detection accuracy, but artifacts can occur in the defogged images, resulting in an unsatisfactory defogging effect.
Although the above studies perform well in target detection as well as image denoising, there are still some shortcomings in the current solutions. Most of the studies focus on using defogging algorithms or image enhancement techniques to preprocess the images before they are fed into the target detection network. While such solutions have made progress to a certain extent, they do not make targeted improvements to existing target detection algorithms and thus do not address the fundamental problem of their poor detection performance in complex weather scenes. Furthermore, the effect of image preprocessing is not always stable and may even introduce unnecessary noise or discard valuable feature information. In addition, this approach increases the overall computation and affects the real-time performance of the model. Therefore, we aim to improve the target detection algorithm without any preprocessing operation on the input image, improving traffic sign detection accuracy in foggy road scenes, especially for small targets.

3. Methods

Object detection in foggy scenes and small object detection have long been two major challenges in the area of object detection. Foggy conditions can lead to detail loss, reduced image contrast, and blurred traffic signs. Additionally, finding and identifying traffic sign information as soon as possible is essential in real-world driving scenarios so that the system has ample time and space to react, but this also results in images that contain many small targets. Because of this, traffic sign detection in existing algorithms is challenging in foggy conditions. In response to the above problems, we propose improvements to the YOLOv8s network, introducing the YOLO-TSF network. The improved model incorporates a Local–Global Feature Fusion Module (LGFFM), which combines the CCAM attention module with a local–global residual learning structure during the feature extraction phase. A Multi-head Adaptive Spatial Feature Fusion Detection Head (MASFFHead) is designed, and a new loss function, NWD-CIoU, is developed by combining NWD and CIoU with specific weights. Figure 1 shows the enhanced overall network structure.

3.1. LGFFM

To enhance the detection performance of traffic signs in foggy traffic scenarios, this paper designs the CCAM attention module and combines it with the concept of local–global residual learning, proposing the Local–Global Feature Fusion Module (LGFFM). This module retains shallow information through local–global residual learning and transmits it to deeper layers for fusion. Additionally, CCAM adaptively assigns weights to features in both the channel and spatial dimensions, achieving effective fusion between the input feature channels and coordinates. The structure is depicted in Figure 2.
In the LGFFM, the input image is first preprocessed by a layer of 3 × 3 convolution. Then, the input features are passed to three RAM modules, where the CCAM module adaptively assigns weights to the input features in the channel and spatial dimensions. Multiple local residual connections are used to retain the learned input features, achieving feature extraction, enhancement, and optimization. Next, the features of different depths output by the three RAM modules are concatenated and input into the CCAM module, where they are fused and enhanced based on the significance of the extracted features in the channel and spatial dimensions, generating optimized feature maps. Finally, two convolution layers are used to modify the dimensions and channels in order to produce a feature map that is the same size as the input image. The unprocessed input image is then connected through a global residual connection to achieve effective processing and optimization of the input image.
RAM (Residual Attention Module): The RAM is composed of the CCAM and a local residual learning structure, as shown in Figure 2. Its core purpose is to extract and enhance features. In the RAM, features are first extracted from the input feature map by a convolutional layer and added back to the input through a residual connection, enhancing the network's stability and expressiveness. The features are then further extracted by another convolutional operation, and channel and spatial weights are assigned by the CCAM. Finally, a residual connection adds the resulting features to the original feature map, enabling a better fusion of the deeper features with the original features and producing the final improved feature map.
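To make the structure described in the previous two paragraphs concrete, the following PyTorch sketch shows one possible arrangement of RAM and LGFFM. The exact channel widths, kernel sizes, and placement of activation functions are not stated in the text (the paper reports the layout in Figure 2), so they are assumptions, and CCAM is taken as a pluggable module that is sketched later in this subsection.

```python
import torch
import torch.nn as nn

class RAM(nn.Module):
    """Residual Attention Module: conv + local residual, conv + CCAM, second residual back to the input."""
    def __init__(self, channels, ccam):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.act = nn.ReLU(inplace=True)
        self.ccam = ccam  # channel-coordinate attention (any module mapping C channels to C channels)

    def forward(self, x):
        y = self.act(self.conv1(x)) + x          # local residual connection
        y = self.act(self.conv2(y))
        return self.ccam(y) + x                  # attention-weighted features added to the original input

class LGFFM(nn.Module):
    """Local-Global Feature Fusion Module: 3x3 stem, three RAMs, concat -> CCAM, two convs, global residual."""
    def __init__(self, in_ch, mid_ch, ccam_cls):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, mid_ch, 3, padding=1)
        self.rams = nn.ModuleList([RAM(mid_ch, ccam_cls(mid_ch)) for _ in range(3)])
        self.fuse_ccam = ccam_cls(3 * mid_ch)
        self.tail = nn.Sequential(
            nn.Conv2d(3 * mid_ch, mid_ch, 3, padding=1),
            nn.Conv2d(mid_ch, in_ch, 3, padding=1),
        )

    def forward(self, x):
        f = self.stem(x)
        outs = []
        for ram in self.rams:                    # features at three different depths
            f = ram(f)
            outs.append(f)
        fused = self.fuse_ccam(torch.cat(outs, dim=1))
        return self.tail(fused) + x              # global residual back to the unprocessed input

# Shape test with a placeholder attention module:
# lgffm = LGFFM(in_ch=3, mid_ch=64, ccam_cls=lambda c: nn.Identity())
# out = lgffm(torch.randn(1, 3, 640, 640))
```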
CCAM (Channel-Coordinate Attention Module): The attention mechanism [33] is now widely used in artificial intelligence, and the CCAM module proposed in this paper is a combination of Channel Attention (CA) [34] and Coordinate Attention (CoordA) [35]. The reason for combining CA and CoordA is that CA determines weights based on the relative significance of the different input feature channels, so each channel is given a different weight, but the weights at different coordinate positions within the same channel are identical. CoordA, in contrast, learns features based on the importance of different coordinate regions, giving each channel the same weight but different weights at different coordinate locations within a channel; CoordA also pays more attention to the edges and textures of objects in the image as well as to thick blurred regions. Therefore, by combining CA with CoordA, CCAM can adaptively learn importance weights for different channels and coordinate regions of the input features, which improves the model's ability to extract important feature information. Figure 3 illustrates the precise composition of the CA.
First, the feature map undergoes a global average pooling operation to compress the features of each channel and transform the global spatial information of each channel into channel descriptors as shown in Equation (1):
$$g_c = H_p(F_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_c(i, j)$$
In Equation (1), $H_p$ is the global pooling function, $X_c$ is the c-th channel, and $X_c(i, j)$ is the value of that channel at position $(i, j)$. At this point the size of the feature map is transformed from $C \times H \times W$ to $C \times 1 \times 1$, and each node in the resulting channel descriptor contains global information about one channel. Subsequently, the importance weights of the different channels are obtained through a convolutional layer, a ReLU activation function, another convolutional layer, and a Sigmoid activation function. The detailed calculation process is shown in Equation (2):
$$CA_c = \sigma\left(\mathrm{Conv}\left(\delta\left(\mathrm{Conv}(g_c)\right)\right)\right)$$
In Equation (2), the ReLU activation function is represented by $\delta$ and the Sigmoid activation function by $\sigma$. The input feature $F_c$ is then multiplied elementwise with the learned channel importance weights, as shown in Equation (3) below:
$$F_c^{*} = CA_c \odot F_c$$
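A minimal PyTorch sketch of the Channel Attention branch implementing Equations (1)–(3). The reduction ratio r and the use of 1×1 convolutions for the two Conv layers are common choices rather than details given in the paper.

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel Attention: global average pooling (Eq. 1), Conv-ReLU-Conv-Sigmoid (Eq. 2),
    then channel-wise reweighting of the input (Eq. 3)."""
    def __init__(self, channels, r=16):
        super().__init__()
        mid = max(1, channels // r)
        self.gap = nn.AdaptiveAvgPool2d(1)               # C x H x W -> C x 1 x 1
        self.fc = nn.Sequential(
            nn.Conv2d(channels, mid, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        ca = self.fc(self.gap(x))                        # per-channel importance weights
        return x * ca                                    # elementwise (broadcast) multiplication
```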
Coordinate Attention (CoordA) integrates spatial coordinate information while considering a broader range of positional information for each pixel. Its primary function is to determine the importance of each coordinate position on the feature map, thereby capturing both local and global spatial relationships more effectively. Coordinate information embedding and coordinate attention generation are the two main processes in the Coordinate Attention mechanism that encode channel relations and long-range dependencies, as illustrated in Figure 4.
After the feature map is input, global average pooling is applied to it. The features in the horizontal and vertical directions are independently encoded through two pooling kernels of size $(H, 1)$ and $(1, W)$, respectively, producing two direction-aware attention features, $Z^h$ and $Z^w$. The detailed calculation formulas are shown in Equations (4) and (5):
$$Z_c^{h}(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i)$$
$$Z_c^{w}(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w)$$
Next, the two output feature maps are concatenated, and a 1 × 1 convolution is used to extract features and compress the channel dimension, followed by Batch Normalization (BN) and a non-linear activation, as shown in Equation (6):
$$f = \delta\left(F_1\left(\left[Z^{h}, Z^{w}\right]\right)\right)$$
Subsequently, a split operation divides $f$ into two independent features, $f^{h}$ and $f^{w}$, along the height and width directions. These features are then transformed by two 1 × 1 convolutions, $F_h$ and $F_w$, and the Sigmoid function so that the output dimensions match the input, yielding the final weights in the height and width directions. The specific calculation process is shown in Equations (7) and (8):
$$g^{h} = \sigma\left(F_h\left(f^{h}\right)\right)$$
$$g^{w} = \sigma\left(F_w\left(f^{w}\right)\right)$$
Lastly, the input feature map is multiplied with the weights $g^{h}$ and $g^{w}$ to adjust its importance at each height and width position, which strengthens the feature representation, as shown in Equation (9):
$$y_c(i, j) = x_c(i, j) \times g_c^{h}(i) \times g_c^{w}(j)$$
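The sketch below implements the Coordinate Attention branch of Equations (4)–(9) and one possible way to assemble the CCAM, reusing the ChannelAttention class from the previous sketch. Whether the paper applies the two branches sequentially or in parallel is not stated, so the sequential composition and the reduction ratio are assumptions.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Coordinate Attention: direction-aware pooling (Eqs. 4-5), shared 1x1 conv + BN + ReLU (Eq. 6),
    split, two 1x1 convs + Sigmoid (Eqs. 7-8), then reweighting of the input (Eq. 9)."""
    def __init__(self, channels, r=32):
        super().__init__()
        mid = max(8, channels // r)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))    # (H, 1) pooling -> Z^h
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))    # (1, W) pooling -> Z^w
        self.conv1 = nn.Conv2d(channels, mid, 1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        zh = self.pool_h(x)                               # B x C x H x 1
        zw = self.pool_w(x).permute(0, 1, 3, 2)           # B x C x W x 1
        f = self.act(self.bn(self.conv1(torch.cat([zh, zw], dim=2))))   # Eq. (6)
        fh, fw = torch.split(f, [h, w], dim=2)
        gh = torch.sigmoid(self.conv_h(fh))               # B x C x H x 1, Eq. (7)
        gw = torch.sigmoid(self.conv_w(fw.permute(0, 1, 3, 2)))         # B x C x 1 x W, Eq. (8)
        return x * gh * gw                                # Eq. (9), broadcast over H and W

class CCAM(nn.Module):
    """Channel-Coordinate Attention Module: channel reweighting followed by coordinate reweighting."""
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)              # from the previous sketch
        self.coord = CoordinateAttention(channels)

    def forward(self, x):
        return self.coord(self.ca(x))
```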

3.2. MASFFHead

In the YOLOv8 algorithm, the semantic information contained in feature layers of different scales differs, and large and small targets are mapped to the deep and shallow feature layers, respectively. However, when an object is assigned as a positive sample in one feature layer, the corresponding regions in the other feature layers are treated as background, which leads to large differences in the semantic information that the various feature layers contain; this inconsistency between features at different scales disrupts the training gradients and degrades the feature fusion effect. Images captured while the car is being driven contain many small targets, so more of their feature information is concentrated in the shallow feature layers, and this detail is easily lost after feature fusion operations, resulting in missed detections of small targets.
The Adaptive Spatial Feature Fusion (ASFF) [36] structure can suppress the inconsistency between features at different scales by filtering out contradictory information in the spatial domain, which improves the scale invariance of the features and thus the detection accuracy. In this paper, ASFF is combined with the detection head and, given that the traffic sign detection task contains mostly small targets, a small target detection layer with quadruple downsampled features is added to the initial three detection layers of the YOLOv8s algorithm, yielding the Multi-head Adaptive Spatial Feature Fusion Detection Head (MASFFHead), which resolves the issue of feature loss in cross-scale fusion and performs a secondary extraction of small targets. Its structure is shown in Figure 5.
The two primary components of the MASFFHead are adaptive fusion and feature adjustment. Feature adjustment ensures scale invariance during feature fusion by rescaling feature maps of different scales to match the corresponding feature map. Adaptive fusion multiplies the four feature maps by learnable weights $\alpha$, $\beta$, $\gamma$, and $\delta$, which represent the importance of each feature map at the current scale. The computation process is shown in Equation (10) as follows:
$$y_{ij}^{l} = \alpha_{ij}^{l} X_{ij}^{1 \to l} + \beta_{ij}^{l} X_{ij}^{2 \to l} + \gamma_{ij}^{l} X_{ij}^{3 \to l} + \delta_{ij}^{l} X_{ij}^{4 \to l}$$
In Equation (10), $y_{ij}^{l}$ represents the new feature map obtained through MASFF-1, $X_{ij}^{s \to l}$ represents the feature map resized from layer $s$ to match the dimensions of layer $l$, and $\alpha_{ij}^{l}$, $\beta_{ij}^{l}$, $\gamma_{ij}^{l}$, and $\delta_{ij}^{l}$ represent the importance weights of $\alpha$, $\beta$, $\gamma$, and $\delta$ at position $(i, j)$, respectively. Their computation is shown in Equation (11), and they satisfy $\alpha_{ij}^{l} + \beta_{ij}^{l} + \gamma_{ij}^{l} + \delta_{ij}^{l} = 1$ with $\alpha_{ij}^{l}, \beta_{ij}^{l}, \gamma_{ij}^{l}, \delta_{ij}^{l} \in [0, 1]$.
$$\alpha_{ij}^{l} = \frac{e^{\lambda_{\alpha_{ij}}^{l}}}{e^{\lambda_{\alpha_{ij}}^{l}} + e^{\lambda_{\beta_{ij}}^{l}} + e^{\lambda_{\gamma_{ij}}^{l}} + e^{\lambda_{\delta_{ij}}^{l}}}$$
Therefore, we use a Multi-head Adaptive Spatial Feature Fusion Detection Head to suppress the inconsistency between the features at different scales by learning the connection between the different feature maps to solve the feature loss problem of cross-scale fusion and perform a secondary extraction of small targets. This module can effectively deal with targets of different scales and fuses more shallow features to reduce the feature loss problem of small targets after convolutional pooling operation, which allows the algorithm to greatly enhance the detection of small targets.
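A sketch of the adaptive fusion step for one detection level, following Equations (10)–(11): the four rescaled feature maps are weighted by a pixel-wise softmax over learned weight maps. The feature-adjustment (rescaling) step and the intermediate weight-channel width are simplified assumptions here; the paper's exact head layout is given in Figure 5.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveFusion(nn.Module):
    """Pixel-wise adaptive fusion of four same-sized feature maps (Eqs. 10-11)."""
    def __init__(self, channels, weight_ch=16):
        super().__init__()
        # Each input gets a small 1x1 conv; the concatenated outputs are mapped to the four
        # weight maps (the lambda_alpha ... lambda_delta logits of Eq. 11).
        self.weight_convs = nn.ModuleList(
            [nn.Conv2d(channels, weight_ch, 1) for _ in range(4)]
        )
        self.weight_head = nn.Conv2d(4 * weight_ch, 4, 1)

    def forward(self, feats):
        # feats: list of 4 tensors, all B x C x H x W (already resized to this level's resolution).
        w = torch.cat([conv(f) for conv, f in zip(self.weight_convs, feats)], dim=1)
        w = F.softmax(self.weight_head(w), dim=1)        # alpha, beta, gamma, delta sum to 1 per pixel
        fused = sum(w[:, i:i + 1] * feats[i] for i in range(4))
        return fused
```

In MASFFHead one such fusion block would sit in front of each of the four detection heads, with the other levels' feature maps up- or downsampled to that level before fusion.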

3.3. NWD-CIoU Loss Function

YOLOv8’s loss function comprises classification and regression losses. It combines the CIoU (Complete IoU) loss function and DFL (Distribution Focal Loss) to compute the regression loss of bounding boxes. DFL measures the distance loss between the regression predicted box and the target box, while CIoU measures the intersection-over-union loss (Box_Loss) between the predicted box and the ground truth box. The following are the formulas for calculating CIoU:
$$L_{CIoU} = 1 - IoU + \frac{\rho^{2}\left(b, b^{gt}\right)}{c^{2}} + \alpha v$$
$$v = \frac{4}{\pi^{2}} \left(\arctan\frac{\omega^{gt}}{h^{gt}} - \arctan\frac{\omega}{h}\right)^{2}$$
$$\alpha = \frac{v}{(1 - IoU) + v}$$
In Equation (12), IoU represents the ratio of the intersection area to the union area of the predicted box and the ground truth box, $\rho(b, b^{gt})$ denotes the Euclidean distance between the centers of the predicted box and the ground truth box, and $c$ is the diagonal length of the smallest box enclosing both. $h$ and $\omega$ represent the height and width of the predicted box, and $h^{gt}$ and $\omega^{gt}$ denote the height and width of the ground truth box, respectively.
However, CIoU is very sensitive to the positional deviation of small targets and only takes effect when the bounding boxes overlap, which leads to poor detection of small targets; since one characteristic of the traffic sign recognition task is that many targets are small, its detection of small traffic signs is less than ideal. To resolve this issue, we introduce a metric based on the Normalized Wasserstein Distance (NWD) [37]. NWD is insensitive to the positional deviation of the target and no longer considers whether the two boxes overlap, so it handles small targets better; its formula is shown in Equation (15) as follows:
$$W_2^{2}\left(N_a, N_b\right) = \left\| \left[c_{x_a}, c_{y_a}, \frac{\omega_a}{2}, \frac{h_a}{2}\right]^{\mathrm{T}} - \left[c_{x_b}, c_{y_b}, \frac{\omega_b}{2}, \frac{h_b}{2}\right]^{\mathrm{T}} \right\|_2^{2}$$
In the formula, $N_a = (c_{x_a}, c_{y_a}, \frac{\omega_a}{2}, \frac{h_a}{2})$ and $N_b = (c_{x_b}, c_{y_b}, \frac{\omega_b}{2}, \frac{h_b}{2})$ represent the Gaussian distributions of the ground truth and predicted bounding boxes, respectively. However, $W_2^{2}(N_a, N_b)$ cannot be used directly to measure the similarity between bounding boxes, so it is normalized as shown in Equation (16) as follows:
$$NWD\left(N_a, N_b\right) = \exp\left(-\frac{\sqrt{W_2^{2}\left(N_a, N_b\right)}}{C}\right)$$
However, since traffic sign images contain a certain number of medium and large targets in addition to small targets, directly replacing the original CIoU with NWD alone as the localization loss leads to poor detection results. Therefore, in this paper, NWD and CIoU are combined in a certain ratio to design a new loss function, NWD-CIoU, which calculates the similarity between bounding boxes without considering whether they overlap. Several experiments were carried out on the ratio factor λ to determine an appropriate value, where λ represents the proportion of the whole NWD-CIoU loss function accounted for by CIoU. This loss function is expressed in Equation (17) as follows:
$$loss_{box} = (1 - \lambda)\left(1 - NWD\right) + \lambda\left(1 - CIoU\right)$$
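A sketch of the combined localization loss of Equation (17), assuming boxes in (x1, y1, x2, y2) format. The CIoU term below reuses torchvision's complete_box_iou_loss (which includes the distance and aspect-ratio penalties), and the constant C of Equation (16) is treated as a hyperparameter (the NWD paper ties it to the dataset's average absolute object size), so these choices are assumptions rather than the paper's exact implementation.

```python
import torch
from torchvision.ops import complete_box_iou_loss

def nwd(pred, target, C=12.8):
    """Normalized Wasserstein Distance for boxes given as (x1, y1, x2, y2) tensors of shape (N, 4)."""
    cx_p, cy_p = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cx_t, cy_t = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    w_p, h_p = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w_t, h_t = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    # Squared 2-Wasserstein distance between the two Gaussians of Eq. (15).
    w2 = (cx_p - cx_t) ** 2 + (cy_p - cy_t) ** 2 \
         + ((w_p - w_t) / 2) ** 2 + ((h_p - h_t) / 2) ** 2
    return torch.exp(-torch.sqrt(w2.clamp(min=1e-7)) / C)   # Eq. (16)

def nwd_ciou_loss(pred, target, lam=0.3):
    """Eq. (17): weighted sum of the NWD loss and the CIoU loss; lam defaults to the value
    selected in Section 4.3."""
    loss_nwd = 1.0 - nwd(pred, target)
    loss_ciou = complete_box_iou_loss(pred, target, reduction="none")
    return (1.0 - lam) * loss_nwd + lam * loss_ciou
```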

3.4. Construction of Foggy Traffic Signs Dataset

To improve the safety and driving standardization of autonomous driving systems, it is essential to accurately detect and identify traffic signs in foggy traffic driving scenarios. However, since foggy traffic sign data are not easy to obtain, there is no publicly available foggy traffic sign dataset. Therefore, to improve the recognition accuracy of traffic signs on foggy roads and in the resulting low-light scenarios, as well as to demonstrate the viability of the approach proposed in this paper, the dataset used in this experiment has two components: one part consists of foggy traffic sign data we collected in real traffic scenarios, and the other part is derived from fogged and brightness-adjusted images of the large-scale traffic sign recognition dataset Tsinghua-Tencent 100K (TT100K) [38], which was jointly created by Tsinghua University and Tencent. Some images from this dataset are shown in Figure 6.
Fog warning signals are divided into three levels, yellow, orange, and red, which indicate visibility of 200 to 500 m, 50 to 200 m, and less than 50 m, respectively. In this paper, firstly, according to the haze levels issued by the Meteorological Bureau, under the above yellow and orange haze warnings, that is, in real foggy road environments with visibility in the range of 50 to 200 m, we drove a car and collected data on traffic signs in the driving state of the vehicle under different light intensities, angles, backgrounds, and shooting distances; the filming equipment used in this paper is an iPhone 15. Secondly, we performed fogging and brightness adjustments on the public traffic sign dataset TT100K to simulate foggy days and the low-illumination traffic scenes they cause. Lastly, to diversify the data and improve its quality, we applied different data enhancement methods, such as random cropping, flipping, brightness adjustment, and Copy-Paste [39], to construct a foggy traffic sign dataset named Foggy-TT100k. Images after this series of data enhancement operations are shown in Figure 7.
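The paper does not spell out how the fogging operation on TT100K is implemented; a common choice is the atmospheric scattering model, and the sketch below follows that assumption. The fog density beta, atmospheric light A, the darkening factor, and the simple center-weighted depth proxy are illustrative parameters, not the paper's settings.

```python
import numpy as np
import cv2

def add_synthetic_fog(image_bgr, beta=0.08, A=0.9):
    """Add synthetic fog with the atmospheric scattering model I = J * t + A * (1 - t),
    where the transmission t = exp(-beta * d) uses a crude depth proxy that treats pixels
    near the image center (roughly the vanishing point) as farther away."""
    img = image_bgr.astype(np.float32) / 255.0
    h, w = img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    d_max = np.sqrt((w / 2) ** 2 + (h / 2) ** 2)
    d = d_max - np.sqrt((xs - w / 2) ** 2 + (ys - h / 2) ** 2)      # larger near the center
    t = np.exp(-beta * d / 100.0)[..., None]                        # transmission map in (0, 1]
    foggy = img * t + A * (1.0 - t)
    return (np.clip(foggy, 0.0, 1.0) * 255.0).astype(np.uint8)

def adjust_brightness(image_bgr, factor=0.7):
    """Darken the image to mimic the low-light conditions that accompany fog."""
    out = image_bgr.astype(np.float32) * factor
    return np.clip(out, 0, 255).astype(np.uint8)

# Example (hypothetical file name): foggy = adjust_brightness(add_synthetic_fog(cv2.imread("tt100k_sample.jpg")))
```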
The Foggy-TT100k dataset contains a total of 6710 images and 32 common traffic sign categories, consisting of the 2504 images we took of foggy roads and 4206 images selected from the TT100K dataset. In addition, the traffic sign categories in the Foggy-TT100k dataset can be grouped into three main classes, instructions, prohibitions, and warnings; some of the traffic sign categories are shown in Figure 8 below. In this paper, we randomly divided the dataset, using 5367 images for training and 1343 images for testing. Some images from the dataset are shown in Figure 9.

4. Experiments

4.1. Implementation Details

The experiments in this article use Windows 10 as the operating system and the deep learning framework PyTorch 2.0.1. The GPU used for the experiments is an NVIDIA RTX 4090 (24 GB). The initial learning rate for training is set to 0.01, the weight decay parameter is 0.001, the batch size is set to 16, and the number of epochs is 300.
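For reference, a baseline training run with the stated hyperparameters could be launched with the Ultralytics API roughly as follows. The custom YOLO-TSF modules (LGFFM, MASFFHead, NWD-CIoU) are not part of the stock package and would need to be registered through a modified model YAML and loss, so this snippet only reproduces the YOLOv8s baseline setup; the dataset YAML path is a placeholder and the input resolution is an assumption not stated in the paper.

```python
from ultralytics import YOLO

# Baseline YOLOv8s training with the hyperparameters reported in Section 4.1.
model = YOLO("yolov8s.pt")
model.train(
    data="foggy_tt100k.yaml",   # placeholder dataset config (train/val paths, 32 class names)
    epochs=300,
    batch=16,
    lr0=0.01,                   # initial learning rate
    weight_decay=0.001,
    imgsz=640,                  # assumed input resolution
)
```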

4.2. Evaluation Metrics

In order to objectively evaluate the model proposed in this paper, the experiments use Precision, mAP (mean Average Precision), and F1-score as the evaluation metrics. mAP represents the average precision across all categories and is a comprehensive indicator of the algorithm's performance. The specific calculation formulas are as follows:
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
$$mAP = \frac{1}{N} \sum_{i=1}^{N} \int_{0}^{1} P(R)\, \mathrm{d}R$$
$$F1\text{-}score = \frac{2 \times P \times R}{P + R}$$
In these formulas, TP (True Positive) is the number of correctly predicted positive samples, FP (False Positive) is the number of negative samples wrongly predicted as positive, and FN (False Negative) is the number of positive samples wrongly predicted as negative. Additionally, mAP is divided into two evaluation metrics based on the IoU threshold: mAP@0.5 and mAP@0.5:0.95. mAP@0.5 is the mAP at an IoU threshold of 0.5 and mainly reflects the algorithm's recognition ability, while mAP@0.5:0.95 is the average mAP over IoU thresholds from 0.5 to 0.95 with a step of 0.05 and mainly reflects the algorithm's ability to localize targets and regress their boundaries.
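A small sketch of Equations (18)–(21) over raw TP/FP/FN counts and a per-class PR curve. The all-point interpolation used to approximate the integral in Equation (20) is a common convention and an assumption about implementation detail, not necessarily the exact procedure used by the paper's evaluation code.

```python
import numpy as np

def precision_recall_f1(tp, fp, fn):
    """Eqs. (18), (19), and (21) from raw counts."""
    p = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    r = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f1

def average_precision(precisions, recalls):
    """Eq. (20) for one class: area under the P(R) curve with a monotone precision envelope."""
    order = np.argsort(recalls)
    r = np.concatenate(([0.0], np.asarray(recalls)[order], [1.0]))
    p = np.concatenate(([0.0], np.asarray(precisions)[order], [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]          # make precision non-increasing in recall
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))

# mAP is the mean of average_precision over all N classes
# (and additionally over IoU thresholds 0.5:0.05:0.95 for mAP@0.5:0.95).
```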

4.3. Exploration Experiment of Scale Factor λ

Although traffic sign images contain mostly small targets, there are also a certain number of medium and large targets, so the value of the scale factor λ also affects the model's detection performance. To ensure that the model can better detect traffic sign targets of different sizes, we conducted several experiments with different scale factors λ in the NWD-CIoU loss function to determine an appropriate value. The experimental results are shown in Table 1.
Since λ represents the proportion of CIoU in the whole NWD-CIoU loss function, a smaller λ means a larger proportion of NWD in the loss. According to the experimental results in the table, the model performs best when λ is set to 0.3, which is consistent with the characteristic of traffic sign images noted above: there are many small targets but also a certain number of medium and large ones. Therefore, we set the scale factor λ to 0.3.

4.4. Ablation Experiment

To validate the efficacy of the YOLO-TSF algorithm and its improvement modules, we ran tests on the self-built foggy dataset Foggy-TT100k to assess how each module improves network performance. The experimental results are plotted in Figure 10:
From Figure 10, it can be observed that the proposed YOLO-TSF algorithm and its improved modules outperform the original YOLOv8s algorithm in terms of mAP@0.5 and mAP@0.5:0.95 and have better detection performance. In addition, to evaluate in detail the improvement in network performance from combining different modules, we experimentally verified all combinations of the proposed modules, for a total of eight experiments; the specific results of the ablation experiments are shown in Table 2 below.
From the table, it can be seen that, after adding LGFFM, the mAP@0.5, mAP@0.5:0.95, Precision, and F1-score of the Baseline improve by 2.4%, 3.4%, 4.7%, and 1.9%, respectively, which shows that LGFFM effectively addresses the problem of inconspicuous features caused by the reduction in contrast in foggy images and improves the model's ability to identify traffic signs on foggy days. Secondly, after adding MASFFHead to the Baseline, mAP@0.5, mAP@0.5:0.95, Precision, and F1-score increase by 5.9%, 5.6%, 4.7%, and 5.1%, respectively, which shows that it effectively addresses the loss of small-target features in the cross-scale fusion of the original algorithm and improves the model's ability to detect small targets. In addition, with the NWD-CIoU loss function, mAP@0.5, mAP@0.5:0.95, Precision, and F1-score improve by 1.3%, 0.4%, 4.4%, and 0.8%, respectively, which shows that combining CIoU and NWD into the NWD-CIoU loss has a positive effect on small-target detection.
Additionally, to demonstrate the effectiveness and compatibility of the proposed improvement modules, we also combined LGFFM, MASFFHead, and NWD-CIoU in pairs on the Baseline. As shown in the table, every pairwise combination improves all indices to some extent compared with using a single module. Finally, compared with the Baseline, combining all modules improves mAP@0.5, mAP@0.5:0.95, Precision, and F1-score by 8.8%, 7.8%, 7.1%, and 8.0%, respectively, all of which are the best results; although the number of parameters increases, the detection performance improves significantly, indicating that the proposed YOLO-TSF algorithm is very effective at detecting small traffic signs in foggy scenes with visibility between 50 and 200 m.

4.5. Comparison of Different Detectors

To demonstrate the superiority of the proposed algorithm, we conducted comparative experiments between YOLO-TSF and other YOLO series algorithms as well as a variety of mainstream one-stage and two-stage target detection algorithms, and we also compared YOLO-TSF against mainstream defogging networks combined with the YOLOv8s network. The experimental results are shown in Table 3. Compared with these advanced detection algorithms, the proposed YOLO-TSF algorithm shows excellent performance, which demonstrates the effectiveness of the improved method.
In addition, in order to demonstrate more intuitively the improved performance of the algorithm proposed in this paper for small target traffic sign detection in foggy scenarios, we compared the detection results of the original model as well as the two representative models, YOLOv5s and AODNet-YOLOv8s, in the above comparison experiments with the YOLO-TSF algorithm, as shown in Figure 11 below. The image on the left is the ground truth, the image in the center shows the detection results of the Baseline and the two representative models, and the image on the right shows the detection results of the YOLO-TSF algorithm. The wrongly detected targets are marked with yellow ellipses, while the missed targets are marked with red ellipses and zoomed in and annotated in the upper left corner of the Baseline image. From the figure, we can see that, in the foggy traffic scene, the original YOLOv8 algorithm as well as the other two comparison models miss and wrongly detect the small targets under the interference of fog. In contrast, the YOLO-TSF algorithm proposed in this paper effectively avoids the omission and misdetection of the small targets of the traffic signs, and it can be seen from the figure that the targets detected by the YOLO-TSF algorithm have higher confidence scores compared with the three experiments in the comparison, which intuitively reflects the improvement in the YOLO-TSF algorithm’s performance for detecting small target traffic signs in foggy scenes.
Finally, to further validate the YOLO-TSF algorithm's ability to detect small traffic signs on foggy days, this study compares the Baseline and YOLO-TSF algorithms using the Grad-CAM heat map analysis method, which visualizes the degree of attention paid to different regions with colors ranging from blue (low attention) to red (high attention), as shown in Figure 12 below. The left side shows the original images, in which we mark the traffic sign targets with ellipses, while the middle and right sides show the heat maps generated by the Baseline and YOLO-TSF algorithms, respectively. From the figure, we can see that the YOLOv8 algorithm often lacks attention to the target region and focuses more on irrelevant background regions; even where it does focus on the target region, its attention area is large and covers most of the background around the traffic sign, which shows that YOLOv8 is susceptible to interference from complex backgrounds and that its ability to locate traffic sign targets needs improvement. In the heat map of the YOLO-TSF algorithm, it can be seen that YOLO-TSF not only pays more attention to the target area but also locates the target accurately, showing a deeper red in the target area than the YOLOv8s heat map, without paying excessive attention to the irrelevant background area around the target. This shows that the YOLO-TSF algorithm focuses more on the traffic sign itself in foggy scenarios, leading to better recognition performance.
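For reproducibility, a minimal from-scratch Grad-CAM along the lines used for Figure 12 is sketched below. The choice of target layer and of the scalar score to backpropagate (for a detector, e.g., the summed confidence of all predictions) are assumptions, since the paper does not state them.

```python
import torch
import torch.nn.functional as F

class GradCAM:
    """Minimal Grad-CAM: weight the target layer's activations by the spatially averaged gradients
    of a scalar score, apply ReLU, upsample to input size, and normalize to [0, 1]."""
    def __init__(self, model, target_layer):
        self.model = model.eval()
        self.activations, self.gradients = None, None
        target_layer.register_forward_hook(self._save_activation)
        target_layer.register_full_backward_hook(self._save_gradient)

    def _save_activation(self, module, inputs, output):
        self.activations = output.detach()

    def _save_gradient(self, module, grad_input, grad_output):
        self.gradients = grad_output[0].detach()

    def __call__(self, x, score_fn):
        # score_fn maps the raw model output to a scalar, e.g. the sum of predicted confidences.
        self.model.zero_grad()
        score = score_fn(self.model(x))
        score.backward()
        weights = self.gradients.mean(dim=(2, 3), keepdim=True)        # global-average-pooled gradients
        cam = F.relu((weights * self.activations).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
        return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
```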

5. Conclusions

In this paper, we propose YOLO-TSF, a foggy traffic sign detection algorithm based on YOLOv8s, which addresses the problems of the many small targets in road scenes and of misdetection, omission, and low recognition accuracy in foggy scenes. First, a Local–Global Feature Fusion Module (LGFFM) is proposed to enhance the model's ability to recognize traffic signs in fog. Second, a Multi-head Adaptive Spatial Feature Fusion Detection Head (MASFFHead) is designed to resolve the feature loss problem of cross-scale fusion and to perform secondary extraction of small targets. In addition, NWD-CIoU is designed to reduce the sensitivity of the IoU to the positional deviation of small targets. Finally, experiments are conducted on our Foggy-TT100k dataset and compared with current mainstream target detection algorithms, as well as combinations of defogging algorithms with the YOLOv8 algorithm, and the results show that the YOLO-TSF algorithm performs excellently in detecting small traffic signs in foggy scenes with different fog concentrations.
We will carry out further research in two directions. On the one hand, we will improve the dataset by classifying images according to fog concentration levels using visibility analysis, in order to explore the fog-concentration limit at which the detection algorithm still works and then push that limit further so that traffic signs can be recognized accurately even in dense fog. On the other hand, we will optimize the algorithm by introducing more powerful preprocessing techniques to cope with more extreme situations, and we will tune its accuracy and speed to the recognition-speed requirements of real driving scenarios, so as to further improve its detection performance in complex traffic scenes.

Author Contributions

Conceptualization, R.L. and Y.C.; methodology, R.L. and Y.W.; software, R.L.; validation, R.L., Y.W. and C.S.; formal analysis, R.L., Y.W. and C.S.; investigation, R.L. and C.S.; resources, Y.C.; data curation, R.L.; writing—original draft preparation, R.L.; writing—review and editing, R.L.; visualization, R.L.; supervision, Y.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the China West Normal University Talent Fund (No. 463177).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are available on request due to restrictions (e.g., privacy, legal, or ethical reasons). Since the self-built traffic sign dataset may involve private information about the people and vehicles in the pictures, the data presented in this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Gong, C.; Li, A.; Song, Y.; Xu, N.; He, W. Traffic sign recognition based on the YOLOv3 algorithm. Sensors 2022, 22, 9345. [Google Scholar] [CrossRef] [PubMed]
  2. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  3. Li, X.; Xie, Z.; Deng, X. Traffic sign detection based on improved faster R-CNN for autonomous driving. J. Supercomput. 2022, 78, 7982–8002. [Google Scholar] [CrossRef]
  4. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the Annual Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015. [Google Scholar]
  5. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  6. Cao, J.; Chen, Q.; Guo, J.; Shi, R. Attention-guided context feature pyramid network for object detection. arXiv 2020, arXiv:2005.11475. [Google Scholar]
  7. Zhao, L.; Wei, Z.; Li, Y.; Jin, J.; Li, X. Sedg-yolov5: A lightweight traffic sign detection model based on knowledge distillation. Electronics 2023, 12, 305. [Google Scholar] [CrossRef]
  8. Wang, J.; Chen, Y.; Dong, Z.; Gao, M. Improved YOLOv5 network for real-time multi-scale traffic sign detection. Neural Comput. Appl. 2023, 35, 7853–7865. [Google Scholar] [CrossRef]
  9. YOLOv5. 2021. Available online: https://github.com/ultralytics/yolov5 (accessed on 2 October 2022).
  10. Saxena, S.; Dey, S.; Shah, M.; Gupta, S. Traffic sign detection in unconstrained environment using improved YOLOv4. Expert Syst. Appl. 2023, 238, 121836. [Google Scholar] [CrossRef]
  11. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  12. Wang, Y.; Bai, M.; Wang, M.; Zhao, F.; Guo, J. Multiscale traffic sign detection method in complex environment based on YOLOv4. Comput. Intell. Neurosci. 2022, 2022, 5297605. [Google Scholar] [CrossRef]
  13. Yao, J.; Huang, B.; Yang, S.; Xiang, X.; Lu, Z. Traffic sign detection and recognition under low illumination. Mach. Vis. Appl. 2023, 34, 75. [Google Scholar] [CrossRef]
  14. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  15. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  16. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  17. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer International Publishing: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar]
  18. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Springer International Publishing: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  19. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  20. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  21. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  22. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef] [PubMed]
  23. Ge, Z. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  24. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Wei, X. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  25. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  26. Jocher, G.; Chaurasia, A.; Qiu, J. YOLO by Ultralytics. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 9 September 2024).
  27. Ma, Y.; Cai, J.; Tao, J.; Yang, Q.; Gao, Y.; Fan, X. Foggy image detection based on dehazenet with improved ssd. In Proceedings of the 2021 5th International Conference on Innovation in Artificial Intelligence, Virtually, 2–9 February 2021; pp. 82–86. [Google Scholar]
  28. Cai, B.; Xu, X.; Jia, K.; Qing, C.; Tao, D. Dehazenet: An end-to-end system for single image haze removal. IEEE Trans. Image Process. 2016, 25, 5187–5198. [Google Scholar] [CrossRef]
  29. Huang, S.C.; Le, T.H.; Jaw, D.W. DSNet: Joint semantic learning for object detection in inclement weather conditions. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 2623–2633. [Google Scholar] [CrossRef]
  30. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  31. Li, C.; Guo, C.; Guo, J.; Han, P.; Fu, H.; Cong, R. PDR-Net: Perception-inspired single image dehazing network with refinement. IEEE Trans. Multimed. 2019, 22, 704–716. [Google Scholar] [CrossRef]
  32. Li, B.; Peng, X.; Wang, Z.; Xu, J.; Feng, D. Aod-net: All-in-one dehazing network. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4770–4778. [Google Scholar]
  33. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  34. Wang, Z.; Wang, J.; Li, Y.; Wang, S. Traffic sign recognition with lightweight two-stage model in complex scenes. IEEE Trans. Intell. Transp. Syst. 2020, 23, 1121–1131. [Google Scholar] [CrossRef]
  35. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
  36. Liu, S.; Huang, D.; Wang, Y. Learning spatial fusion for single-shot object detection. arXiv 2019, arXiv:1911.09516. [Google Scholar]
  37. Wang, J.; Xu, C.; Yang, W.; Yu, L. A normalized Gaussian Wasserstein distance for tiny object detection. arXiv 2021, arXiv:2110.13389. [Google Scholar]
  38. Zhu, Z.; Liang, D.; Zhang, S.; Huang, X.; Li, B.; Hu, S. Traffic-sign detection and classification in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2110–2118. [Google Scholar]
  39. Ghiasi, G.; Cui, Y.; Srinivas, A.; Qian, R.; Lin, T.Y.; Cubuk, E.D.; Zoph, B. Simple copy-paste is a strong data augmentation method for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2918–2928. [Google Scholar]
  40. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162. [Google Scholar]
  41. Lyu, C.; Zhang, W.; Huang, H.; Zhou, Y.; Wang, Y.; Liu, Y.; Chen, K. Rtmdet: An empirical study of designing real-time object detectors. arXiv 2022, arXiv:2212.07784. [Google Scholar]
  42. Zhang, S.; Wang, X.; Wang, J.; Pang, J.; Lyu, C.; Zhang, W.; Chen, K. Dense distinct query for end-to-end object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7329–7338. [Google Scholar]
  43. Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Shum, H.Y. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv 2022, arXiv:2203.03605. [Google Scholar]
  44. Qin, X.; Wang, Z.; Bai, Y.; Xie, X.; Jia, H. FFA-Net: Feature fusion attention network for single image dehazing. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 11908–11915. [Google Scholar]
Figure 1. Overall structure of YOLO-TSF.
Figure 2. Structure of LGFFM.
Figure 3. Structure of Channel Attention.
Figure 4. Structure of Coordinate Attention.
Figure 5. Structure of MASFFHead.
Figure 6. Images from the TT100K dataset.
Figure 7. Image after data enhancement operation.
Figure 8. Traffic sign categories as part of the Foggy-TT100k dataset. (a) are some examples in the instruction category, (b) are some examples in the prohibit category, and (c) are some examples in the warning category.
Figure 9. Images from the Foggy-TT100k dataset.
Figure 10. Performance Curve Comparison Chart.
Figure 11. Visual comparison of detection results.
Figure 12. Comparison of heat maps between Baseline and YOLO-TSF.
Table 1. Exploration Experiment of Scale Factor λ. The best-performing methods are highlighted in bold.

λ      mAP(0.5)   mAP(0.5:0.95)
0.1    74.9%      54.5%
0.2    75.1%      54.8%
0.3    75.6%      55.4%
0.4    74.7%      55.2%
0.5    75.2%      55.3%
0.6    74.7%      55.3%
0.7    74.3%      55.0%
0.8    74.9%      55.2%
0.9    74.1%      55.0%
1.0    74.3%      54.9%
Table 2. Ablation experiments on Foggy-TT100k dataset. The best-performing methods are highlighted in bold. (The module check marks, lost in extraction, are reconstructed from the parameter counts and the descriptions in Section 4.4.)

Baseline  LGFFM  MASFFHead  NWD-CIoU  Params   mAP(0.5)  mAP(0.5:0.95)  P      F1-Score
✓         —      —          —         11.2 M   74.3%     54.9%          72.7%  71.6%
✓         ✓      —          —         11.2 M   76.7%     58.3%          77.4%  73.5%
✓         —      ✓          —         13.6 M   80.2%     60.5%          77.4%  76.7%
✓         —      —          ✓         11.2 M   75.6%     55.3%          77.1%  72.4%
✓         ✓      ✓          —         13.6 M   82.6%     62.4%          77.9%  78.6%
✓         ✓      —          ✓         11.2 M   77.6%     58.5%          78.6%  75.9%
✓         —      ✓          ✓         13.6 M   81.8%     61.5%          79.3%  77.5%
✓         ✓      ✓          ✓         13.7 M   83.1%     62.7%          79.8%  79.6%
Table 3. Comparison of results on the Foggy-TT100k dataset with different algorithms. The best result is indicated in bold.

Method               Params   mAP@0.5   mAP@0.5:0.95
Faster R-CNN [4]     41.4 M   63.7%     48.8%
Cascade R-CNN [40]   69.3 M   68.2%     52.6%
RTMDet [41]          52.3 M   69.1%     52.5%
DDQ-DETR [42]        53.4 M   69.0%     45.9%
DINO [43]            47.6 M   55.1%     36.8%
YOLOv5 [8]           7.1 M    71.4%     52.9%
YOLOv6 [24]          18.5 M   67.3%     49.9%
YOLOv7 [25]          6.1 M    72.8%     50.7%
YOLOv8 [26]          11.2 M   74.3%     54.9%
AOD [32] + YOLOv8    11.3 M   74.8%     53.7%
FFA [44] + YOLOv8    12.6 M   75.2%     55.4%
YOLO-TSF (Ours)      13.7 M   83.1%     62.7%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
