Article

LLE-STD: Traffic Sign Detection Method Based on Low-Light Image Enhancement and Small Target Detection

1 School of Electrical and Control Engineering, North China University of Technology, Beijing 100144, China
2 School of Information Science and Technology, North China University of Technology, Beijing 100144, China
3 School of Computer and Artificial Intelligence, Beijing Technology and Business University, Beijing 100048, China
4 Computer Science and Applied Math, Brown University, Providence, RI 02912, USA
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(19), 3125; https://doi.org/10.3390/math12193125
Submission received: 5 September 2024 / Revised: 29 September 2024 / Accepted: 4 October 2024 / Published: 6 October 2024

Abstract

With the continuous development of autonomous driving, traffic sign detection, as an essential subtask, has witnessed constant updates in corresponding technologies. Currently, traffic sign detection primarily confronts challenges such as the small size of detection targets and the complexity of detection scenarios. This paper focuses on detecting small traffic signs in low-light scenarios. To address these issues, this paper proposes a traffic sign detection method that integrates low-light image enhancement with small target detection, namely, LLE-STD. This method comprises two stages: low-light image enhancement and small target detection. Based on classic baseline models, we tailor the model structures by considering the requirements of lightweight traffic sign detection models and their adaptability to varying image qualities. The two stages are then coupled to form an end-to-end processing procedure. During experiments, we validate the performance of low-light image enhancement, small target detection, and adaptability to images of different qualities using the public datasets GTSDB, TT-100K, and GLARE. Compared to classic models, LLE-STD demonstrates significant advantages. For example, the mAP results tested on the GLARE dataset show that LLE-STD outperforms RetinaNet by approximately 15%. This research can facilitate the practical application of deep learning-based intelligent methods in the field of autonomous driving.

1. Introduction

Autonomous driving is a crucial component in achieving high-level intelligent transportation systems. Among its various aspects, traffic sign detection plays a significant role in guiding vehicles to travel in an orderly manner, enhancing driving safety, and providing vital assistance for seamless vehicle control [1]. With the continuous advancement of deep learning, traffic sign detection research has also shifted towards data-driven methods that automatically extract features, attracting widespread attention from scholars [2].
However, the small size and complex backgrounds of traffic signs pose significant challenges in processing. In computer vision, small objects typically refer to those occupying minimal regions in an image, lacking the rich texture, color, and other detailed features of regular-sized objects, thus making detection more difficult. The MS COCO dataset [3] defines small objects as those with a resolution less than 32 px × 32 px. The TT-100K dataset [4], curated by Zhu et al., is the most commonly used Chinese traffic sign detection dataset, and its traffic signs exhibit clear characteristics of small objects. As a result, small object detection has emerged as the primary challenge in traffic sign detection.
Small object detection algorithms can be categorized into two main types: single-stage and two-stage algorithms. The two-stage approach is characterized by high detection accuracy but relatively slow speed. Notable examples of two-stage small object detection algorithms include R-CNN [5], Fast R-CNN [6], and Faster R-CNN [7]. In 2017, Lin et al. proposed the feature pyramid network (FPN) algorithm [8], which fuses high-level semantic feature maps with low-level semantic feature maps. This fusion ensures that the output feature maps contain both high-level semantic information and information about small objects, making them suitable for small object detection. In 2018, Cai et al. introduced the Cascade R-CNN algorithm [9], which employs a multi-stage structure that progressively raises the intersection over union (IoU) threshold to obtain a high-quality detector for small object detection. The single-stage approach aims to balance detection speed and accuracy. Representative single-stage small object detection algorithms include the you only look once (YOLO) series proposed by J. Redmon et al. [10,11] and the single-shot multibox detector (SSD) algorithm by W. Liu et al. [12]. However, due to the smaller size and limited information content of small objects, combined with the constraints of the YOLO network architecture, the original YOLO and YOLO9000 algorithms are not suitable for small object detection tasks. In 2018, J. Redmon addressed this issue with YOLOv3 [13], which, through structural reconfiguration, successfully applies the YOLO series to small object detection while maintaining high detection accuracy. In 2020, A. Bochkovskiy et al. introduced YOLOv4 [14], which significantly improves detection accuracy while maintaining the same speed as YOLOv3; YOLOv4 achieves an mAP of 43.5% on the COCO dataset. Additionally, fully convolutional one-stage object detection (FCOS) [15], based on the concept of fully convolutional networks (FCN) [16], is a single-stage anchor-free detector that uses pixel-level predictions to solve the object detection problem. By avoiding anchor box calculations and region of interest (RoI) generation, FCOS is also frequently used for small object detection tasks.
In nighttime conditions, the lack of sufficient lighting can lead to low visibility and color distortion in captured dark images, significantly impacting traffic sign detection. To address this issue, deep learning has emerged as the dominant approach for low-light image enhancement (LLE) [17]. In this paper, we propose to integrate lightweight low-light image enhancement techniques with small object detection methods specifically designed for traffic sign detection. The goal is to improve the detection of small traffic signs in complex low-light scenarios, thereby enhancing the overall performance of traffic sign detection systems under challenging nighttime conditions.
Based on the above, the main contributions of this paper are as follows:
(1)
A cascaded model architecture combining LLE and small object detection is proposed to improve the detection performance of small traffic signs in complex low-light scenarios.
(2)
By addressing the real-time requirements of autonomous driving scenarios, especially for traffic sign detection tasks, this paper investigates the use of blueprint separable convolutions (BSConv) [18] as a means to reduce model complexity by replacing standard convolutions.
(3)
To address the issue of insufficient detection accuracy for small traffic signs, this paper introduces attention mechanisms and modifies activation functions during the object detection stage of the LLE-STD model. These enhancements effectively prevent feature loss and improve the overall detection accuracy.
In summary, to address the small target sizes and complex scenarios faced by the traffic sign detection task, this paper proposes the LLE-STD model, which balances the effectiveness and reliability of detection as much as possible.
The remainder of this paper is organized as follows: Related works on small object detection and LLE models are discussed in Section 2. A brief description of the proposed approach is given in Section 3. Section 4 presents the assessment criteria used to evaluate the effectiveness of the proposed method. The results of our experiments are illustrated and compared with other state-of-the-art methods in Section 5. Finally, the discussion and conclusions are presented in Section 6 and Section 7, respectively.

2. Related Works

In intelligent driving systems, traffic sign detection plays a critical role in supporting real-time decision-making. However, traffic signs are typically smaller than other objects in natural scenes, and the actual traffic environment is complex and highly variable. Therefore, traffic sign detection faces challenges posed by small objects and complex scenes. This paper specifically focuses on the issue of detecting small traffic signs under low-light conditions.

2.1. LLE Models

The first CNN-based LLE model [19] utilizes an autoencoder to simultaneously learn denoising and enhancement. Inspired by the Retinex theory, several LLE models have been proposed [20,21,22,23,24], which typically decompose the low-light image input into reflectance and illumination maps, then adjust the illumination map to enhance the image intensity. Most of these methods integrate denoising modules into the reflectance map to suppress noise in the enhanced results. For instance, Zheng et al. [25] propose an unfolded total variation network to estimate the noise level for LLE. To improve generalization capabilities, unsupervised methods have also been introduced. The Enlightenment GAN [26], an attention-based U-Net, is trained using adversarial losses. Zero-DCE [27] and Zero-DCE++ [28] approach image enhancement as an image-specific curve estimation task, demonstrating robust enhancement effects. Given that autonomous driving systems require real-time information acquisition, processing, and decision-making, traffic sign detection must also satisfy real-time requirements. Consequently, optimizing network structures to reduce computational complexity is crucial.
The network architecture of Zero-DCE [27] is illustrated in Figure 1. It mainly comprises two parts: the deep curve estimation network (DCE-Net) and the light enhancement curve (LE-curve). Different from traditional supervised learning models, Zero-DCE learns a high-order curve through DCE-Net and applies this curve to perform a nonlinear dynamic range adjustment for each pixel in the input image. This adjustment expands the brightness range and increases the contrast, thereby enhancing the low-light image.
DCE-Net takes low-light images as input and outputs a pixel-wise mapping of high-order curve parameters. This model consists of seven convolutional layers, with the first six layers each containing 32 convolutional kernels of size 3 × 3 and a stride of 1, followed by ReLU activation functions. The final layer contains 24 convolutional kernels of size 3 × 3 and a stride of 1, using a Tanh activation function. As a result, the model’s output, the curve parameter mapping, comprises 24 channels. These channels can be split into groups based on the RGB color channels, yielding eight sets of iterative curve parameter mappings. Each set of parameters corresponds to a different stage or iteration in the curve estimation process, enabling a more refined and adaptive adjustment of the image’s brightness and contrast.
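To make this architecture concrete, the following PyTorch sketch stacks the seven convolutional layers exactly as described above; the class and variable names are our own, and any skip connections used in the reference Zero-DCE implementation are omitted for brevity.

```python
# Minimal sketch of the DCE-Net backbone described above: six 3x3 convolutions
# with 32 kernels and ReLU, followed by a 24-channel 3x3 output layer with Tanh.
import torch
import torch.nn as nn


class DCENet(nn.Module):
    def __init__(self, width: int = 32, iterations: int = 8):
        super().__init__()
        layers = []
        in_ch = 3  # RGB input
        for _ in range(6):
            layers += [nn.Conv2d(in_ch, width, 3, stride=1, padding=1), nn.ReLU(inplace=True)]
            in_ch = width
        # Final layer: 24 kernels = 8 iterations x 3 color channels, Tanh output.
        layers += [nn.Conv2d(width, 3 * iterations, 3, stride=1, padding=1), nn.Tanh()]
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Returns the per-pixel curve parameter map with 24 channels.
        return self.net(x)
```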
In addition, the design of LE-curve includes the following:
(1)
The intensity value of each pixel in the enhanced image should be within the range of [0, 1] to avoid information loss caused by overflow truncation.
(2)
The LE-curve needs to maintain monotonicity to ensure stable contrast between adjacent pixels.
(3)
The gradient form of the LE-curve should be as simple and differentiable as possible.
LE-curve can be designed as follows:
$$LE(I; \alpha) = I + \alpha I (1 - I)$$
where $I$ is the input image; $LE(I;\alpha)$ is the enhanced image; and $\alpha$ is the curve parameter learned by DCE-Net, whose range is [0, 1]. Performing multiple iterations of the LE-curve results in the following:
$$LE_n(x) = LE_{n-1}(x) + A_n(x)\, LE_{n-1}(x) \big( 1 - LE_{n-1}(x) \big)$$
where $n$ is the number of iterations; according to the structure of DCE-Net, $n = 8$. $A$ is a parameter map of the same size as the given image. Applying the LE-curve separately to the RGB color channels better preserves the inherent colors and reduces the risk of oversaturation.
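The iterative application of the LE-curve can be written compactly; the sketch below follows the formula above, splitting the 24-channel parameter map into eight RGB groups. Function and argument names are illustrative.

```python
# Sketch of the iterative LE-curve: LE_n = LE_{n-1} + A_n * LE_{n-1} * (1 - LE_{n-1}),
# applied for eight iterations separately on each RGB channel.
import torch


def apply_le_curve(image: torch.Tensor, curve_params: torch.Tensor, iterations: int = 8) -> torch.Tensor:
    """image: (B, 3, H, W) in [0, 1]; curve_params: (B, 3 * iterations, H, W) from DCE-Net."""
    enhanced = image
    alphas = torch.split(curve_params, 3, dim=1)  # eight groups of RGB parameter maps
    for n in range(iterations):
        enhanced = enhanced + alphas[n] * enhanced * (1.0 - enhanced)
    return enhanced
```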

2.2. Small Traffic Sign Detection Methods

Focusing specifically on traffic sign detection tasks, Huang et al. [29] employ Faster R-CNN as the detection model and utilize generative adversarial networks (GANs) for data augmentation. Zhang et al. [30] combine Cascade R-CNN with sample balancing to detect traffic signs, achieving accuracies of 99.7% and 98.7% on the CCTSDB [31] and GTSDB [32] datasets, respectively. Zhao et al. [33] utilize the Libra R-CNN algorithm in conjunction with a balanced feature pyramid to detect traffic signs, reporting a detection accuracy of 77.3% in their experiments. The results of other typical traffic sign detection methods are shown in Table 1. Two-stage traffic sign detection algorithms generally exhibit higher detection accuracy than one-stage algorithms but are slower in speed. Currently, few studies explicitly address the challenge of small object detection as a crucial issue in traffic sign detection. Instead, the majority of research focuses on applying existing classical models. This paper, however, takes a different approach by enhancing the representation of small object features based on classic small object detection models as a baseline, aiming to improve the detection accuracy of traffic signs. By doing so, we aim to contribute to the development of more effective and efficient traffic sign detection systems.
Here, FCOS [15] is a single-stage anchor-free detection model proposed based on the idea of FCN. It primarily addresses the problem of object detection through pixel-level prediction, thereby avoiding the need for anchor box computations and RoI generation. This algorithm first feeds a preprocessed image into a backbone network for feature extraction, resulting in an initial feature layer. Then, this feature layer is input into a feature pyramid network (FPN) for multi-level predictions. Finally, the detection head outputs the detection results. The network structure is illustrated in Figure 2.
As depicted in Figure 2, FCOS processes the input image through a backbone network for feature extraction, yielding feature maps C3, C4, and C5. These feature maps are then passed through 1 × 1 convolutions to generate P3, P4, and P5 feature maps, which are connected in a top-down fashion. Additionally, P5 undergoes two downsampling steps to obtain P6 and P7 feature maps. Finally, each of these five feature maps with different scales is processed by a detection head, leveraging shared feature information to achieve multi-level object detection. For each pixel point (x, y) on the feature maps, it can be mapped back to a corresponding pixel point (x′, y′) in the original input image using the following formula:
$$(x', y') = \left( \left\lfloor \frac{s}{2} \right\rfloor + x s,\ \left\lfloor \frac{s}{2} \right\rfloor + y s \right)$$
where $s$ is the stride and $\lfloor \cdot \rfloor$ denotes rounding down. If the pixel falls within a target box and corresponds to a specific category, then that pixel is considered a positive sample; otherwise, it is considered a negative sample. The regression targets for each pixel can be expressed as follows:
$$l^* = x - x_0^{(i)}, \quad t^* = y - y_0^{(i)}, \quad r^* = x_1^{(i)} - x, \quad b^* = y_1^{(i)} - y$$
where $(x_0^{(i)}, y_0^{(i)})$ and $(x_1^{(i)}, y_1^{(i)})$ are the coordinates of the upper-left and lower-right corners of the bounding box, respectively, and $l^*$, $t^*$, $r^*$, and $b^*$ represent the distances from the pixel to the left, top, right, and bottom sides of the bounding box, respectively. If a feature point on the feature map falls within multiple target boxes, it is deemed an ambiguous sample. To determine which class this ambiguous sample belongs to, multi-level predictions are made through the FPN, and the box with the smallest area is selected as its regression target. When a feature point in a feature map satisfies the condition in Equation (5), the sample is treated as a negative sample and does not require regression prediction.
$$\max(l^*, t^*, r^*, b^*) > m_i \quad \text{or} \quad \max(l^*, t^*, r^*, b^*) < m_{i-1}$$
where $m_i$ is the maximum regression distance of the $i$th effective feature layer. In FCOS, $m_2$ through $m_7$ are set to 0, 64, 128, 256, 512, and ∞. If there are two target boxes that both meet the requirements, the one with the smallest area is selected for regression prediction. In addition, since the FCOS algorithm performs pixel-level regression operations, it generates a large number of predicted boxes, including some low-quality boxes generated by pixels far from the target center. These predicted boxes have a certain impact on detection performance. To address this issue, the FCOS algorithm introduces a “center-ness” branch to suppress these low-quality bounding boxes. The definition of center-ness is:
$$\mathrm{Centerness} = \sqrt{ \frac{\min(l^*, r^*)}{\max(l^*, r^*)} \times \frac{\min(t^*, b^*)}{\max(t^*, b^*)} }$$
where the range of the center-ness is [0, 1]. It serves as a single-layer branch that runs in parallel with the classification branch.
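For illustration, the following sketch computes the per-pixel regression targets and the center-ness score defined above for a single ground-truth box, together with the feature-map-to-image mapping; the helper names are ours, and a real implementation would vectorize this over all locations and boxes.

```python
# Illustrative per-pixel FCOS computations for one ground-truth box (x0, y0, x1, y1).
import math


def map_to_image(x: int, y: int, stride: int):
    # Map a feature-map location back to input-image coordinates.
    return math.floor(stride / 2) + x * stride, math.floor(stride / 2) + y * stride


def fcos_targets(px: float, py: float, box):
    x0, y0, x1, y1 = box
    l = px - x0  # distance to the left side
    t = py - y0  # distance to the top side
    r = x1 - px  # distance to the right side
    b = y1 - py  # distance to the bottom side
    if min(l, t, r, b) <= 0:
        return None, 0.0  # the point lies outside the box: negative sample
    centerness = math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))
    return (l, t, r, b), centerness
```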

3. Methodology

This section introduces LLE-STD, including its overall structure, core module structure, and the design of its loss function.

3.1. The Overall Structure

The overall structure of LLE-STD is shown in Figure 3. It can be intuitively seen that LLE-STD consists of two stages, namely, the low-light image enhancement stage and the small traffic sign detection stage. The baseline models for these two stages are Zero-DCE and FCOS, respectively. The main differences between LLE-STD and the classic baseline models are as follows:
(1)
The two stages in the LLE-STD model are coupled and connected, enabling end-to-end learning. This means that during the inference stage, when a low-light image is input into LLE-STD, it can directly output the traffic sign detection results corresponding to the enhanced image. There is no need to separately perform low-light image enhancement and traffic sign detection tasks.
(2)
Given the requirement for real-time processing in autonomous driving scenarios, especially for traffic sign detection tasks, the proposed method should aim for high computational efficiency and a small model size. Therefore, in the low-light image enhancement stage of the LLE-STD model, we utilize BSConv [18] to replace standard convolutions. This modification reduces the number of multiplication operations in the model, thereby lowering the computational complexity.
(3)
To address the issue of insufficient detection accuracy for small traffic signs, we incorporate an attention mechanism into the object detection stage of the LLE-STD model. Additionally, we replace the classic ReLU and Sigmoid activation functions with the Swish activation function. This effectively avoids the loss of features and enhances the accuracy of object detection.

3.2. The Improved DCE-Net Model

As can be seen from Figure 3, the low-light image enhancement stage in LLE-STD uses Zero-DCE as the baseline model, but with improvements made to the DCE-Net component by replacing the classic convolutional processing with BSConv. The introduction of BSConv [18] is motivated by the strong correlation of standard convolution kernels along the depth dimension. Through extensive experiments, its authors verified that BSConv can significantly enhance the performance of MobileNet and other architectures based on depthwise separable convolutions without introducing additional complexity. For fine-grained problems, BSConv achieves a 13.7% performance improvement, and in the ImageNet classification task it achieves a 9.5% improvement when used as a plug-and-play replacement in ResNet.
The first layer of BSConv is a pointwise convolution, while the second layer is a depthwise convolution per channel. This order effectively exploits the correlations within the convolution kernels. The pointwise convolution uses a filter size of 1 × 1, which performs an element-wise weighting operation on each pixel in the input feature map. During the pointwise convolution, the number of input feature map channels needs to be the same as the number of channels in the convolution kernel, and the number of output feature map channels is determined by the number of convolution kernels. The depthwise convolution process per channel requires that the number of input feature map channels be the same as the number of convolution kernels, which also determines the number of output feature map channels.
After applying a standard convolution with a stride of 1 to an input feature map $U \in \mathbb{R}^{M \times Y \times X}$, the resulting feature map is $V \in \mathbb{R}^{N \times Y \times X}$. Here, the convolutional kernels are $F^{(1)}, F^{(2)}, \dots, F^{(N)}$, each of size $M \times K \times K$, so the number of trained parameters is $M \times N \times K^2$. For BSConv, the convolutional kernels are as follows:
$$F^{(n)}_{m,:,:} = \omega_{n,m} \cdot B^{(n)}, \quad m \in \{1, \dots, M\},\ n \in \{1, \dots, N\}$$
where $\omega_{n,m}$ are scalar weights and $B^{(n)}$ are blueprint kernels of size $K \times K$. The output feature map again has size $N \times Y \times X$, while the number of trained parameters in BSConv is $M \times N + K^2 \times N$, a significant reduction compared to standard convolution.
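A minimal PyTorch sketch of this pointwise-then-depthwise structure is given below; module and argument names are our own, and it follows the unconstrained BSConv variant described here rather than the authors' exact implementation.

```python
# Blueprint separable convolution sketch: 1x1 pointwise convolution followed by
# a per-channel (depthwise) KxK convolution.
import torch.nn as nn


class BSConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, k: int = 3, stride: int = 1):
        super().__init__()
        # Pointwise step: mixes the M input channels into N output channels (M*N weights).
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        # Depthwise step: one KxK blueprint kernel per output channel (K*K*N weights).
        self.depthwise = nn.Conv2d(out_ch, out_ch, kernel_size=k, stride=stride,
                                   padding=k // 2, groups=out_ch, bias=False)

    def forward(self, x):
        return self.depthwise(self.pointwise(x))
# Total parameters: M*N + K^2*N, versus M*N*K^2 for a standard convolution.
```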

3.3. The Improved FCOS Model

Taking images from the classic public traffic sign dataset TT-100K [4] as an example, the image size is 2048 px × 2048 px. According to statistics, traffic signs generally occupy 0.2% to 1% of the entire image, indicating that traffic sign detection falls into the category of small object detection problems. The difficulty of small object detection lies in the small size of the objects, which may lead to the loss of object region features after multiple layers of processing, resulting in missed detections. To address this issue, LLE-STD introduces the convolutional block attention module (CBAM) [39] and employs the Swish activation function, which can both enhance feature representation and effectively prevent feature loss.
CBAM is an integration of channel attention and spatial attention. It generates attention feature map information on two dimensions, then multiplies these two types of feature map information with the input feature maps to complete adaptive feature refinement, and finally produces the refined feature maps. CBAM is a lightweight module that can be embedded into any backbone network to improve performance. The structure is illustrated in Figure 4.
CBAM is divided into a channel attention module and a spatial attention module. This process can be described mathematically as follows:
$$F' = M_C(F) \otimes F$$
$$F'' = M_S(F') \otimes F'$$
where $M_C$ and $M_S$ denote channel attention and spatial attention processing, respectively; $F$ is the input feature map of CBAM; $F'$ is the input feature map of the spatial attention module; $F''$ is the output feature map of CBAM; and $\otimes$ denotes pixel-by-pixel multiplication. The channel attention module aggregates spatial information from the input feature maps through max pooling and average pooling, generating two different spatial descriptors. Then, these two descriptors are passed through a shared network consisting of a multilayer perceptron (MLP) with one hidden layer to generate a channel attention map. After applying the shared network to each descriptor, an element-wise summation operation is performed. The process can be described as follows:
$$M_C(F) = \sigma\big(\mathrm{MLP}(\mathrm{MeanPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) = \sigma\big(W_1(W_0(F^c_{\mathrm{mean}})) + W_1(W_0(F^c_{\mathrm{max}}))\big)$$
where $\sigma$ is the Sigmoid activation function; $W_0$ and $W_1$ are the shared weights applied to both inputs; $\mathrm{MeanPool}$ and $\mathrm{MaxPool}$ denote mean and max pooling; and $\mathrm{MLP}$ denotes the shared multilayer perceptron. It can be observed that after channel attention processing, although the spatial dimension is significantly compressed, the channel dimension remains unchanged.
The spatial attention module takes the feature maps output by the channel attention module as input. First, it performs average and max pooling along the channel dimension, then concatenates the results to generate an effective feature descriptor. Finally, a convolutional layer is utilized to generate a spatial attention map. The process can be described as follows:
$$M_S(F') = \sigma\big(f\big([\mathrm{MeanPool}(F');\ \mathrm{MaxPool}(F')]\big)\big) = \sigma\big(f\big([F'^{S}_{\mathrm{mean}};\ F'^{S}_{\mathrm{max}}]\big)\big)$$
where $f(\cdot)$ denotes a convolution operation. After the spatial attention module, the channel dimension is compressed, while the spatial dimension remains unchanged.
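The following compact PyTorch sketch implements the channel and spatial attention described above; the reduction ratio of 16 and the 7 × 7 spatial convolution are assumptions taken from the original CBAM paper rather than values stated here.

```python
# CBAM sketch: channel attention (shared MLP over mean- and max-pooled
# descriptors) followed by spatial attention over pooled channel maps.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))  # MLP over the average-pooled descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))   # MLP over the max-pooled descriptor
        return torch.sigmoid(avg + mx).view(b, c, 1, 1)


class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)   # average over channels
        mx = x.amax(dim=1, keepdim=True)    # max over channels
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))


class CBAM(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, f):
        f1 = self.ca(f) * f      # F' = M_C(F) (x) F
        return self.sa(f1) * f1  # F'' = M_S(F') (x) F'
```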
In addition, the classical ResNet50 is selected as the backbone network for the LLE-STD object detection stage in this paper. To achieve better detection results, the Swish activation function is used to replace the ReLU activation in the classical ResNet50. The definition of Swish activation function is as below:
$$f(x) = x \cdot \sigma(\beta x)$$
where $\beta$ is a trainable parameter. When $\beta = 0$, the Swish function becomes a linear function; when $\beta = 1$, it becomes the SiLU function; and as $\beta \to \infty$, it approaches the ReLU function. The Swish function is thus a smooth function lying between a linear function and the ReLU function. It has no upper bound but a lower bound, and it is smooth and non-monotonic, which helps reduce gradient explosion and gradient vanishing.
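A small sketch of the Swish activation with a trainable $\beta$ is given below; initializing $\beta = 1$ (the SiLU case) is our assumption.

```python
# Swish activation with a learnable beta parameter.
import torch
import torch.nn as nn


class Swish(nn.Module):
    def __init__(self, beta: float = 1.0):
        super().__init__()
        self.beta = nn.Parameter(torch.tensor(beta))

    def forward(self, x):
        return x * torch.sigmoid(self.beta * x)  # f(x) = x * sigmoid(beta * x)
```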

3.4. Loss Function

As mentioned earlier, LLE-STD enables an end-to-end connection. At the same time, during parameter training, backpropagation should propagate the gradients from the detection results gradually towards the input end of the entire network. Based on this, the loss function of the LLE-STD model is designed as follows:
$$L_{total} = L_{LLE} + \gamma L_{STD}$$
where $L_{LLE}$ and $L_{STD}$ are the loss terms driving low-light image enhancement and traffic sign detection, respectively, and $\gamma$ is the weight used to adjust the influence of both on the total loss function. Drawing on the design of the loss function in the Zero-DCE model [27], $L_{LLE}$ is designed as follows:
$$L_{LLE} = L_{spa} + L_{exp} + W_{col} L_{col} + W_A L_A$$
where $L_{spa}$, $L_{exp}$, $L_{col}$, and $L_A$ are the spatial consistency loss, exposure control loss, color constancy loss, and illumination smoothness loss, respectively, and $W_{col}$ and $W_A$ are weights. Firstly, the spatial consistency loss term measures the difference in intensity values between adjacent regions of the image before and after enhancement. Its expression is as follows:
$$L_{spa} = \frac{1}{K} \sum_{i=1}^{K} \sum_{j \in \Omega(i)} \big( |Y_i - Y_j| - |I_i - I_j| \big)^2$$
where $K$ is the number of pre-allocated local regions; $Y$ is the enhanced image; $I$ is the original low-light image; and $\Omega(i)$ represents the four (top, bottom, left, and right) regions adjacent to region $i$. Obviously, the smaller the difference in intensity values between the neighborhoods of the images before and after enhancement, the smaller the $L_{spa}$ value will be. In addition, $L_{exp}$ evaluates the distance between the average intensity value of local regions and a threshold to avoid overexposure of the enhanced image. Its expression is the following:
$$L_{exp} = \frac{1}{M} \sum_{k=1}^{M} \big| \bar{Y}_k - E \big|$$
where $M$ represents the number of non-overlapping regions, each of size 16 × 16 in this paper; $\bar{Y}_k$ is the average intensity of the enhanced image in region $k$; and $E$ is the preset intensity threshold, set to $E = 0.6$ in this paper. $L_{col}$ evaluates the degree of deviation between the three color channels; a smaller value indicates that the three color channels are more balanced. The definition is as follows:
$$L_{col} = \sum_{(p,q) \in \varepsilon} \big( J^p - J^q \big)^2, \quad \varepsilon = \{(R,G), (R,B), (G,B)\}$$
where $J^p$ represents the average intensity value of color channel $p$ in the enhanced image. $L_A$ is used to make the parameter maps output by the improved DCE-Net smoother. The expression is as below:
$$L_A = \frac{1}{N} \sum_{n=1}^{N} \sum_{c \in \xi} \big( \big| \nabla_x A_n^c \big| + \big| \nabla_y A_n^c \big| \big)^2, \quad \xi = \{R, G, B\}$$
where $N$ is the number of iterations of the improved DCE-Net, which is set to 8 in this paper, and $\nabla_x$ and $\nabla_y$ are the horizontal and vertical gradient operators applied to the parameter maps, respectively. The smaller the gradient values, the smaller $L_A$ becomes.
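As an illustration, the sketch below implements two of the enhancement loss terms above: the exposure control loss over 16 × 16 regions with $E = 0.6$, and the color constancy loss over the three RGB channel pairs. Function names are ours, and the remaining terms are omitted.

```python
# Sketches of the exposure control and color constancy losses described above.
import torch
import torch.nn.functional as F


def exposure_loss(enhanced: torch.Tensor, patch: int = 16, e: float = 0.6) -> torch.Tensor:
    gray = enhanced.mean(dim=1, keepdim=True)  # (B, 1, H, W): average over RGB
    region_mean = F.avg_pool2d(gray, patch)    # mean intensity of each 16x16 region
    return torch.abs(region_mean - e).mean()


def color_constancy_loss(enhanced: torch.Tensor) -> torch.Tensor:
    # Average intensity of each color channel over the enhanced image.
    r, g, b = enhanced[:, 0].mean(), enhanced[:, 1].mean(), enhanced[:, 2].mean()
    return (r - g) ** 2 + (r - b) ** 2 + (g - b) ** 2
```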
In addition, the loss function of the FCOS model [15] is mainly used as a reference for the target detection task, and is specifically expressed as follows:
$$L_{STD} = \frac{1}{N_{pos}} \sum_{x,y} L_{cls}\big(p_{x,y}, c^*_{x,y}\big) + \frac{\lambda}{N_{pos}} \sum_{x,y} \chi_{\{c^*_{x,y} > 0\}}\, L_{reg}\big(t_{x,y}, t^*_{x,y}\big)$$
where $N_{pos}$ is the number of positive samples; $\lambda$ is a weight parameter; and $\chi$ is the indicator function, i.e., its value is 1 if the condition is met and 0 otherwise. $L_{cls}$ and $L_{reg}$ are the focal loss term of RetinaNet and the IoU loss term of UnitBox, respectively.
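The practical effect of this coupled loss can be sketched as a single end-to-end training step, in which the detection loss also backpropagates into the enhancement network. Here `enhancer`, `detector`, `lle_loss`, `std_loss`, and `train_step` are placeholder names rather than the actual implementation.

```python
# Illustrative end-to-end training step: gradients from both loss terms reach
# both the enhancement and detection stages.
def train_step(enhancer, detector, lle_loss, std_loss, low_light, targets, optimizer, gamma):
    enhanced = enhancer(low_light)     # stage 1: low-light image enhancement
    predictions = detector(enhanced)   # stage 2: small traffic sign detection
    loss = lle_loss(low_light, enhanced) + gamma * std_loss(predictions, targets)
    optimizer.zero_grad()
    loss.backward()                    # gradients propagate through both stages
    optimizer.step()
    return loss.item()
```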

4. Assessment Criteria

In this experimental section, to quantitatively evaluate the enhancement effect of low-light images, we introduce the classical peak signal-to-noise ratio (PSNR), structure similarity index measure (SSIM) and learned perceptual image patch similarity (LPIPS). Their definitions are as follows:
(1)
PSNR
PSNR is an objective and quantitative evaluation criterion for image quality. It compares the gray-scale differences between the processed image and the original image. The definition is as follows:
$$PSNR = 10 \times \log_{10}\left( \frac{(2^n - 1)^2}{MSE} \right)$$
where $n$ is the number of bits per pixel. In this paper, we process 24-bit RGB images, so $n = 24$. MSE stands for mean squared error, which is defined as follows:
$$MSE = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \big( X(i,j) - Y(i,j) \big)^2$$
where $H \times W$ is the number of pixels in the image; $H$ and $W$ are the height and width of the image; $X$ is the enhanced image; and $Y$ is the real clear image.
(2)
SSIM
SSIM evaluates the similarity between a processed image and the ground truth image by considering three aspects: image luminance, image contrast, and image structural information. The definition of SSIM is as follows:
$$SSIM = l(X,Y) \times c(X,Y) \times s(X,Y)$$
where l ( X , Y ) , c ( X , Y ) , and s ( X , Y ) are defined as follows:
$$l(X,Y) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}, \quad c(X,Y) = \frac{2\delta_x \delta_y + C_2}{\delta_x^2 + \delta_y^2 + C_2}, \quad s(X,Y) = \frac{\delta_{xy} + C_3}{\delta_x \delta_y + C_3}$$
where $\mu_x$ is the mean of the enhanced image and $\mu_y$ is the mean of the clear image. Moreover, $\delta_x^2$ and $\delta_y^2$ are the variances of the enhanced image and the clear image, respectively, and $\delta_{xy}$ is the covariance between the two images. $C_1$, $C_2$, and $C_3$ are constants. In order to avoid the situation where the denominator is zero, we usually set the following:
$$C_1 = (K_1 \times L)^2, \quad C_2 = (K_2 \times L)^2, \quad C_3 = C_2 / 2$$
Let $K_1 = 0.01$, $K_2 = 0.03$, and $L = 255$. Clearly, the value of SSIM lies in the range [0, 1], with a higher value indicating that the enhanced image is closer to the original clear image, resulting in better enhancement performance. Conversely, a lower SSIM value suggests a greater disparity between the two images, indicating less ideal enhancement results.
(3)
LPIPS
The design philosophy of LPIPS is to measure the perceptual similarity between an enhanced image and its original, authentic counterpart during the process of inversely recovering the original image from the enhanced one. Its definition is formulated as follows:
$$LPIPS = \sum_{l} \frac{1}{H_l W_l} \sum_{h,w} \big\| \omega_l \odot \big( y^l_{hw} - y^l_{0hw} \big) \big\|_2^2$$
where $y^l$ is the $l$th feature map. It is unit-normalized in the channel dimension together with the reference feature map $y_0^l$, the activated channels are scaled by $\omega_l$, and the $L_2$ distance is then calculated; here, $\odot$ denotes the channel-wise scaling operation. Finally, by averaging over the spatial dimensions and summing over the layers, the LPIPS metric is obtained. A lower value of this metric indicates better image enhancement performance.
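The image quality metrics above can be computed directly from their definitions; the NumPy sketch below follows the PSNR and SSIM formulas as stated ($n = 24$ and global statistics for SSIM), whereas practical SSIM implementations usually compute the statistics over local windows. LPIPS, which relies on learned network features, is omitted.

```python
# PSNR and a global-statistics SSIM, following the definitions above.
import numpy as np


def psnr(x: np.ndarray, y: np.ndarray, bits: int = 24) -> float:
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(((2 ** bits - 1) ** 2) / mse)


def ssim_global(x: np.ndarray, y: np.ndarray, k1: float = 0.01, k2: float = 0.03, L: float = 255.0) -> float:
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    c3 = c2 / 2.0
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    std_x, std_y = np.sqrt(var_x), np.sqrt(var_y)
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    contrast = (2 * std_x * std_y + c2) / (var_x + var_y + c2)
    structure = (cov_xy + c3) / (std_x * std_y + c3)
    return luminance * contrast * structure
```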
Furthermore, to quantitatively evaluate the effectiveness of object detection, this paper employs mAP and F1 measure as evaluation metrics. Among them, average precision (AP) is used to measure the precision for a specific type of object, representing the area under the precision–recall (PR) curve. mAP, typically serving as an indicator for quantitatively assessing the overall precision of a detection model, is the average result of AP across different categories and is one of the commonly used metrics for evaluating object detectors. The definition of mAP is as follows:
$$mAP = \frac{1}{C} \sum_{i=1}^{C} AP_i$$
where $C$ is the number of categories. Additionally, the F1 measure is the weighted harmonic mean of precision and recall. A higher F1 measure indicates better detection performance, and its definition is as follows:
$$F1\text{-}measure = \frac{2PR}{P + R}$$
where $P$ is precision and $R$ is recall. In addition, to evaluate whether there is a significant difference between the accuracies of the training and testing detection results, we use the t-test indicator. It is defined as follows:
$$t\text{-}test = \frac{1}{N} \sum_{i=1}^{N} \frac{mean(X_i) - mean(Y_i)}{std(X_i - Y_i)}$$
where $X_i$ and $Y_i$ are the $i$th detection results in the training and test stages, respectively; $N$ is the number of samples; $mean(\cdot)$ is the average of the detection results; and $std(\cdot)$ is the standard deviation. If the value is in the range of [−2, 2], we judge that there is no overfitting phenomenon.
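For reference, the sketch below computes the F1 measure and a paired t-statistic for the training/testing comparison; the t-test function is a loose NumPy reading of the formula above rather than an exact transcription.

```python
# F1 measure and a paired t-statistic for comparing training and testing results.
import numpy as np


def f1_measure(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)


def t_statistic(train_results: np.ndarray, test_results: np.ndarray) -> float:
    diff = train_results - test_results
    # Values within [-2, 2] are taken to indicate no significant overfitting.
    return (train_results.mean() - test_results.mean()) / diff.std(ddof=1)
```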

5. Experiment and Results

5.1. Experimental Data and Experimental Settings

This paper conducts experimental verification based on data from multiple public traffic sign datasets, mainly selecting three datasets: GTSDB [32], TT-100K [4], and GLARE [40]. The basic information of these three datasets is shown in Table 2. Additionally, to increase the proportion of low-light images, an exponential transformation function is applied to over half of the images in each dataset to synthesize low-light images. Specifically, each original image corresponds to four synthesized low-light images, and by adjusting the parameters of the exponential transformation, the image quality varies accordingly. Figure 5 shows an original high-quality image from each of the three datasets and its corresponding four synthesized low-light images. The synthesized images are merged with the original images to form new datasets, which are then divided into training, validation, and test data according to a ratio of 6:2:2.
In this experiment, two NVIDIA RTX 3090 Ti GPUs are used. The synchronous stochastic gradient descent (SGD) optimizer is adopted with an initial learning rate of 0.01, and the batch size is set to 8. At iterations 60 K and 80 K, the learning rate is reduced by a factor of 10. The weight decay and momentum are set to 0.0001 and 0.9, respectively. To ensure the reliability of the results, these experiments use fourfold cross-validation. Moreover, training continues until the model loss functions reach a minimum and become stable.
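A hedged sketch of this optimization setup in PyTorch is shown below; `model` is a placeholder for the full LLE-STD network, and stepping the scheduler once per iteration is our assumption.

```python
# SGD with momentum 0.9, weight decay 1e-4, initial learning rate 0.01, and a
# 10x learning-rate drop at 60 K and 80 K iterations.
import torch


def build_optimizer(model: torch.nn.Module):
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60_000, 80_000], gamma=0.1)
    return optimizer, scheduler
```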

5.2. Performance Analysis of Low-Light Image Enhancement Module

Firstly, to validate the processing effect of the low-light image enhancement stage in the LLE-STD model, an experimental analysis of low-light enhancement is conducted on the merged GTSDB, TT-100K, and GLARE traffic sign datasets, which contain the synthesized low-light images. Here, classical models, including LIME, RetinexNet, EnlightenGAN, RUAS, Zero-DCE, Zero-DCE++, and Retinexformer, as well as the proposed method, are selected for quantitative and qualitative comparisons. The quantitative comparison results are shown in Table 3.
As can be seen from Table 3, the proposed method ensures that most image enhancement metrics are at a relatively high level when processing data from the three different datasets. In particular, it has significant advantages compared to the LIME, RetinexNet, and EnlightenGAN methods. Compared to the RUAS and Zero-DCE methods, the processing effect is roughly the same. Even for the GTSDB dataset, the image enhancement metrics of the proposed method are slightly better than those of the two aforementioned methods. Additionally, the low-light image enhancement effect can also be compared qualitatively. Comparing with the aforementioned classical models across multiple datasets, the enhanced images are shown in Figure 6. It can be observed that each low-light image enhancement method can produce enhanced images with more detailed information compared to the original low-light images. However, the processing results of the proposed method, as well as the RUAS and Zero-DCE methods, are closer to the original clear images.
Compared to RUAS and Zero-DCE, the method proposed in this paper has certain advantages in terms of computational resource utilization due to the replacement of classical convolutional processing with BSConv. Specifically, this can be compared by examining the model runtime (RT) and floating-point operations (FLOPs), as shown in Table 4. It is evident that the runtime of the proposed method is reduced by 0.0140 s and 0.0047 s, respectively, compared to RUAS and Zero-DCE, while the FLOPs are reduced by 347.21 G and 38.47 G, respectively. This demonstrates the significant lightweight effect of the proposed model. In other words, while ensuring that the low-light image enhancement performance remains at a near-optimal level, the model is lightweight and saves computational resources.

5.3. Performance Analysis of Small Target Detection Module

This section focuses on the performance of the object detection stage model in LLE-STD. To evaluate the reliability of the proposed method, it is compared with classical object detection models. The comparison models include Faster-RCNN [7], SSD [12], RetinaNet [42], Yolov3 [13], Yolov4-tiny [14], Yolov4 [14], and FCOS [15]. To ensure the persuasiveness of the comparison results, it is crucial to train different models using the same training data and test them on the same test data during the comparative experiments. Figure 7 first presents the detection results of eight methods on the same image from the TT-100K dataset. For intuitive comparison, this paper also enlarges the regions that mainly contain traffic signs. The detection types and their corresponding probability values are also clearly displayed.
The results in Figure 7 show that the traffic sign targets are very small, further indicating that traffic sign detection in intelligent transportation scenarios indeed belongs to the task of small object detection. Among the eight methods mentioned above, Faster-RCNN, SSD, and RetinaNet exhibit significant missed detections, which is due to the fact that the missed traffic signs are close to the edges of the entire image, potentially leading to information loss during the detection process. From the results of the other five detection methods, it can also be observed that the detected types are basically correct, but in terms of probability values, FCOS and the proposed method have obvious advantages, thereby demonstrating the superior detection performance of FCOS and the proposed method.
Furthermore, this paper also compares the object detection performance of some SOTA methods quantitatively by training and testing the models on three public datasets. The comparison results of the detection outcomes are shown in Table 5. Here, to verify the effectiveness of CBAM, an FCOS model improved with self-attention is also trained, and its detection results are shown in Table 5 as well. It can be clearly seen that the proposed method only underperforms FCOS in terms of mAP on the GLARE dataset and F1 measure on the GTSDB dataset, and the differences are relatively close. In other cases, the proposed method has significant advantages. Under the mAP metric, the proposed method achieves at least a 6.93% improvement over the classic Faster-RCNN model. Under the F1 measure metric, the proposed method achieves at least a 1.12% improvement over the classic Yolo series methods. In addition, we use t-test values to assess the differences between the detection results of the proposed method during the training and testing stages. The t-test values for the mAP metrics under the GTSDB, TT-100K, and GLARE datasets are 0.89, 1.32, and −0.44, respectively. The t-test values for the F1 measure metrics are 0.54, −1.00, and 0.87, respectively. Obviously, these values are all within the range of [−2, 2], indicating that there is no significant difference in the accuracy of training and testing object detection results, proving that the method does not exhibit significant overfitting.
In summary, the object detection stage in the LLE-STD method proposed in this paper exhibits strong adaptability to the task of small object detection for traffic signs, effectively enhancing detection accuracy and reducing missed detections.

5.4. Adaptability Analysis of the Proposed Model

The LLE-STD proposed in this paper does not simply cascade low-light image enhancement with small object detection but ensures end-to-end execution of the inference process through coupled connections. Meanwhile, during the backpropagation of parameter training, the gradient values at different stages will mutually influence the direction of parameters in both stages, thereby enhancing the adaptability of LLE-STD to images under different illumination conditions. This section verifies this adaptability through experiments.
Firstly, we use the Grad-CAM [46] to visualize the heatmap of the prediction process of the LLE-STD network, helping us better understand the key points of interest in LLE-STD. Figure 8 shows the heat maps of three groups of real traffic sign images. Obviously, LLE-STD can focus on the small target region of traffic signs in the image more specifically. Therefore, it is proved that the features learned by LLE-STD match the processing task.
Moreover, Figure 9 shows the detection results of an image from the TT-100K dataset after applying different levels of low-light simulation processing, using the proposed LLE-STD method. We also locally enlarge the regions containing traffic signs and their detection results. Comparing the results, it can be observed that as the image quality degrades, most of the detection results remain unchanged, with only slight decreases in probability values. The only missed detection occurs in the image with the weakest quality shown in Figure 9a, where the traffic sign corresponding to p11 is not detected. This is due to the reduced difference between small objects and the background caused by the deterioration of image quality. However, overall, the impact of low light on LLE-STD is within an acceptable range.
In addition, Figure 10 shows the LLE-STD detection results for three images with different light intensities from the real public dataset. It is evident that although illumination still has an impact on the detection of traffic sign targets, the LLE-STD method can ensure accurate detection results in real night scenes.
Furthermore, to further analyze the adaptability of LLE-STD to different image qualities from a quantitative perspective, this section also conducts statistics on mAP and F1 measure for data in the three datasets and compares them with three other classic object detection models. The results are shown in Figure 11. In these results, we need to pay attention not only to the detection results corresponding to different methods but also to the rate of change in detection results as image quality increases, i.e., the gradient situation. In Figure 11, levels 1 to 4 correspond to the low-light image synthesis levels described in Section 5.1, with image quality gradually improving. However, even at level 4, the image quality is still inferior to the original images in the public datasets.
As can be seen from the results in Figure 11, the proposed LLE-STD achieves the optimal values of mAP and F1 measure for data from all datasets. In particular, the advantage of LLE-STD is more pronounced under level 1 condition, with the mAP tested on the GLARE dataset being approximately 15% higher than that of RetinaNet. Additionally, as the horizontal axis moves to the right, the values of LLE-STD change most smoothly, indicating the smallest gradient. These results demonstrate that LLE-STD has the strongest adaptability to images of different qualities. This phenomenon occurs because the other three classic object detection models exhibit significant overfitting and can only adapt to high-quality original images. However, LLE-STD enhances adaptability by coupling the two stages of low-light image enhancement and small object detection.

6. Discussions

LLE-STD, proposed in this paper, can balance the effectiveness and reliability of detection as much as possible. In the Experiment and Results Section of this paper, comparative experiments are conducted on data from three public datasets: GTSDB, TT-100K, and GLARE. Firstly, in the performance analysis of the low-light enhancement module, the proposed method is qualitatively and quantitatively compared with five classic low-light image enhancement methods to demonstrate its advantages in image enhancement processing. It is also compared with two classic methods with better image enhancement effects in terms of runtime and FLOPs, proving that the proposed method is lightweight in the low-light image enhancement stage. Furthermore, in the performance analysis of the small object detection module, the performance of LLE-STD is compared with seven other classic object detection methods using both qualitative and quantitative methods. It is found that the proposed method achieves slightly lower mAP values than FCOS on the GLARE dataset and comparable F1 measure values on the GTSDB dataset but outperforms other methods under most conditions. Specifically, under the mAP metric, the proposed method achieves at least a 6.93% improvement over the classic Faster-RCNN model across different datasets. Under the F1 measure metric, it achieves at least a 1.12% improvement over the classic Yolo series methods. Finally, in the analysis of the adaptability of the LLE-STD model to images, images of different qualities, both simulated and real, are used for object detection tests, and the results of LLE-STD are compared with other classic models. It is found that the advantage of LLE-STD is more pronounced, especially in low-light images. For example, the mAP results tested on the GLARE dataset show that LLE-STD outperforms RetinaNet by approximately 15%, demonstrating that LLE-STD enhances image adaptability due to its coupled connection mode. In addition, we also conduct tests on the embedded deployment of the proposed method. We select the Zynq UltraScale+ MPSoCs CG series chip with model number XCZU2CG-1SFVC784E for testing. The proposed method can meet the real-time requirements of the autonomous driving field with a processing time of about 30 ms per frame.

7. Conclusions

This paper focuses on the problem of small object detection in low-light complex scenes for traffic sign detection tasks and proposes a two-stage coupled model for low-light enhancement and small object detection, namely, LLE-STD. This method takes the classic low-light image enhancement model and small object detection model, Zero-DCE and FCOS, as baselines, and makes corresponding improvements by reducing model computation and enhancing small object feature representation. The two models are then coupled to form an end-to-end processing mode. Through the design of loss functions, the gradients of both stages jointly influence the training of model parameters, thereby enhancing the adaptability of the model to images of different qualities. Moreover, some experiments verify the effectiveness and reliability of LLE-STD. In summary, the proposed LLE-STD model in this paper can well adapt to low-light images and effectively detect small objects such as traffic signs, which can promote the application of related intelligent algorithms in the field of autonomous driving.

Author Contributions

Conceptualization, T.W. and T.Z.; methodology, H.Q.; software, C.L.; validation, T.Z.; formal analysis, T.W.; investigation, T.Z.; resources, Z.L.; data curation, C.L.; writing—original draft preparation, T.W.; writing—review and editing, T.Z.; visualization, T.W.; supervision, Z.L.; project administration, H.Q.; funding acquisition, T.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the R&D Program of Beijing Municipal Education Commission under Grant KM202410011002.

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, M.; Gao, J.; Zhao, L.; Shen, X. Adaptive computing scheduling for edge-assisted autonomous driving. IEEE Trans. Veh. Technol. 2021, 70, 5318–5331. [Google Scholar] [CrossRef]
  2. Wang, Z.; Zhu, H.; He, M.; Zhou, Y.; Luo, X.; Zhang, N. GAN and multi-agent DRL based decentralized traffic light signal control. IEEE Trans. Veh. Technol. 2022, 71, 1333–1348. [Google Scholar] [CrossRef]
  3. Everingham, M.; Gool, L.V.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) challenge. Int. J. Comput. Vision 2010, 88, 303–338. [Google Scholar] [CrossRef]
  4. Zhu, Z.; Liang, D.; Zhang, S.; Huang, X.; Li, B.; Hu, S. Traffic-sign detection and classification in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2110–2118. [Google Scholar]
  5. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  6. Girshick, R. Fast R-CNN. arXiv 2015, arXiv:1504.08083. [Google Scholar]
  7. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  8. Lin, T.Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  9. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  10. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7371. [Google Scholar]
  11. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  12. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  13. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  14. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  15. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: A simple and strong anchor-free object detector. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 9627–9636. [Google Scholar] [CrossRef]
  16. Long, J.; Shelhamer, E. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  17. Li, C.; Guo, C.; Han, L.; Jiang, J.; Cheng, M.M.; Gu, J.; Loy, C.C. Low-light image and video enhancement using deep learning: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 9396–9416. [Google Scholar] [CrossRef]
  18. Li, Z.; Liu, Y.; Chen, X.; Cai, H.; Gu, J.; Qiao, Y.; Dong, C. Blueprint separable residual network for efficient image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 833–843. [Google Scholar]
  19. Lore, K.G.; Akintayo, A.; Sarkar, S. LLNet: A deep autoencoder approach to natural low-light image enhancement. Pattern Recognit. 2017, 61, 650–662. [Google Scholar] [CrossRef]
  20. Wei, C.; Wang, W.; Yang, W.; Liu, J. Deep retinex decomposition for low-light enhancement. arXiv 2018, arXiv:1808.04560. [Google Scholar]
  21. Zhang, Y.; Zhang, J.; Guo, X. Kindling the darkness: A practical low-light image enhancer. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 1632–1640. [Google Scholar]
  22. Wang, R.; Zhang, Q.; Fu, C.W.; Shen, X.; Zheng, W.S.; Jia, J. Underexposed photo enhancement using deep illumination estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6849–6857. [Google Scholar]
  23. Yang, W.; Wang, W.; Huang, H.; Wang, S.; Liu, J. Sparse gradient regularized deep retinex network for robust low-light image enhancement. IEEE Trans. Image Process. 2021, 30, 2072–2086. [Google Scholar] [CrossRef] [PubMed]
  24. Liu, R.; Ma, L.; Zhang, J.; Fan, X.; Luo, Z. Retinex-inspired unrolling with cooperative prior architecture search for low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10561–10570. [Google Scholar]
  25. Zheng, C.; Shi, D.; Shi, W. Adaptive unfolding total variation network for low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4439–4448. [Google Scholar]
  26. Jiang, Y.; Gong, X.; Liu, D.; Cheng, Y.; Fang, C.; Shen, X.; Yang, J.; Zhou, P.; Wang, Z. Enlightengan: Deep light enhancement without paired supervision. IEEE Trans. Image Process. 2021, 30, 2340–2349. [Google Scholar] [CrossRef]
  27. Guo, C.; Li, C.; Guo, J.; Loy, C.C.; Hou, J.; Kwong, S.; Cong, R. Zero-reference deep curve estimation for low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1780–1789. [Google Scholar]
  28. Li, C.; Guo, C.; Loy, C.C. Learning to enhance low-light image via zero-reference deep curve estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 4225–4238. [Google Scholar] [CrossRef] [PubMed]
  29. Huang, W.; Huang, M.; Zhang, Y. Detection of traffic signs based on combination of GAN and faster-RCNN. J. Phys. Conf. Ser. 2019, 1069, 012159. [Google Scholar] [CrossRef]
  30. Zhang, J.; Xie, Z.; Sun, J.; Zou, X.; Wang, J. A cascaded R-CNN with multiscale attention and imbalanced samples for traffic sign detection. IEEE Access 2020, 8, 29742–29754. [Google Scholar] [CrossRef]
  31. Zhang, J.; Zou, X.; Kuang, L.D.; Sherratt, J.; Yu, X. CCTSDB 2021: A more comprehensive traffic sign detection benchmark. Hum.-Centric Comput. Inf. Sci. 2022, 12, 023. [Google Scholar] [CrossRef]
  32. Houben, S.; Stallkamp, J.; Salmen, J.; Schlipsing, M.; Igel, C. Detection of traffic signs in real-world images: The German traffic sign detection benchmark. In Proceedings of the 2013 International Joint Conference on Neural Networks (IJCNN), Dallas, TX, USA, 4–9 August 2013; pp. 1–8. [Google Scholar]
  33. Zhao, Z.; Li, X.; Liu, H.; Xu, C. Improved target detection algorithm based on libra R-CNN. IEEE Access 2020, 8, 114044–114056. [Google Scholar] [CrossRef]
  34. Aghdam, H.H.; Heravi, E.J.; Puig, D. A practical approach for detection and classification of traffic signs using Convolutional Neural Networks. Rob. Autom. Syst. 2016, 84, 97–112. [Google Scholar] [CrossRef]
  35. Chen, T.; Lu, S. Accurate and efficient traffic sign detection using discriminative adaboost and support vector regression. IEEE Trans. Veh. Technol. 2016, 65, 4006–4015. [Google Scholar] [CrossRef]
  36. Liu, C.; Chang, F.; Chen, Z.; Li, S. Rapid traffic sign detection and classification using categories-first-assigned tree. J. Comput Inf. Syst. 2013, 9, 7461–7468. [Google Scholar]
  37. Liu, C.; Chang, F.; Chen, Z. Rapid multiclass traffic sign detection in high-resolution images. IEEE Trans. Intell. Transp. Syst. 2014, 15, 2394–2403. [Google Scholar] [CrossRef]
  38. Zang, D.; Zhang, J.; Zhang, D.; Bao, M.; Cheng, J.; Tang, K. Traffic sign detection based on cascaded convolutional neural networks. In Proceedings of the 2016 17th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), Shanghai, China, 30 May–1 June 2016; pp. 201–206. [Google Scholar]
  39. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–18. [Google Scholar]
  40. Gray, N.; Moraes, M.; Bian, J.; Wang, A.; Tian, A.; Wilson, K.; Huang, Y.; Xiong, H.; Guo, Z. GLARE: A dataset for traffic sign detection in sun glare. IEEE Trans. Intell. Transp. Syst. 2023, 24, 12323–12330. [Google Scholar] [CrossRef]
  41. Guo, X. LIME: A method for low-light image enhancement. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 87–91. [Google Scholar]
  42. Cai, Y.; Bian, H.; Lin, J.; Wang, H.; Timofte, R.; Zhang, Y. Retinexformer: One-stage retinex-based transformer for low-light image enhancement. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 12504–12513. [Google Scholar]
  43. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  44. Yao, Y.; Han, L.; Du, C.; Xu, X.; Jiang, X. Traffic sign detection algorithm based on improved YOLOv4-Tiny. Signal Process. Image Commun. 2022, 107, 116783. [Google Scholar] [CrossRef]
  45. Li, Y.; Li, J.; Meng, P. Attention-YOLOV4: A real-time and high-accurate traffic sign detection algorithm. Multimed. Tools Appl. 2023, 82, 7567–7582. [Google Scholar] [CrossRef]
  46. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
Figure 1. The overall structure of Zero-DCE [27].
Figure 2. The overall structure of FCOS [15].
Figure 3. The overall structure of LLE-STD.
Figure 4. The structure of CBAM [39].
Figure 5. Comparison between original images and synthesized low-light images.
Figure 6. Qualitative comparison of image enhancement effects.
Figure 7. Qualitative comparison of detection results of different detection models: (a) Faster-RCNN, (b) SSD, (c) RetinaNet, (d) Yolov3, (e) Yolov4-tiny, (f) Yolov4, (g) FCOS, and (h) the proposed model.
Figure 8. Grad-CAM results of actual traffic sign images.
Figure 9. Comparison of LLE-STD detection results for the same image under different illumination conditions: (a–d) simulated low-light images with gradually increasing illumination; (e) the original image.
Figure 10. LLE-STD detection results on actual traffic sign images under different illumination conditions.
Figure 11. Detection results as image quality varies under different methods: (a,b) mAP and F1-measure on the GTSDB dataset; (c,d) mAP and F1-measure on the TT-100K dataset; (e,f) mAP and F1-measure on the GLARE dataset.
Table 1. The existing traffic sign detection methods.
Method | Dataset | Accuracy (%) | Time
Huang et al. [29] | TT-100K | 89.65 | -
Zhang et al. [30] | CCTSDB/GTSDB | 99.7/98.7 | -
Zhao et al. [33] | MS COCO | 77.3 | -
Aghdam et al. [34] | GTSDB | 99.89 | 26.506 ms
Chen et al. [35] | STSD | 80.85 | -
Houben et al. [32]:
  HOG + LDA | GTSDB | 69.2 | -
  Hough-like | GTSDB | 65.1 | -
  Viola-Jones | GTSDB | 67.3 | -
Liu et al. [36] | GTSDB | 93.5 | -
Liu et al. [37] | GTSDB | 98.57 | 192 ms
Zang et al. [38] | GTSDB | 96.50 | 35 ms
Table 2. Overview of the public traffic sign datasets.
Dataset | Image Number | Resolution | Types | Country | Year
GTSDB [20] | 900 | 1360 × 1024 | 43 | Germany | 2013
TT-100K [4] | 100,000 | 2048 × 2048 | 221 | China | 2016
GLARE [40] | 2157 | 720 × 480~1920 × 480 | 41 | USA | 2022
Table 3. Quantitative comparison of enhancement results on low-light images.
Method | PSNR (dB) | SSIM | LPIPS
LIME [41] | 14.84 | 0.55 | 0.61
RetinexNet [42] | 15.53 | 0.50 | 0.58
EnlightenGAN [26] | 15.88 | 0.51 | 0.55
RUAS [24] | 16.16 | 0.54 | 0.54
Zero-DCE [27] | 16.09 | 0.55 | 0.51
Zero-DCE++ [28] | 16.11 | 0.54 | 0.52
Retinexformer [42] | 16.09 | 0.53 | 0.53
The proposed | 16.18 | 0.55 | 0.52
(a) GTSDB
LIME [41] | 14.88 | 0.44 | 0.65
RetinexNet [42] | 15.02 | 0.45 | 0.69
EnlightenGAN [26] | 15.22 | 0.49 | 0.71
RUAS [24] | 15.30 | 0.48 | 0.68
Zero-DCE [27] | 15.44 | 0.50 | 0.65
Zero-DCE++ [28] | 15.20 | 0.50 | 0.68
Retinexformer [42] | 15.35 | 0.49 | 0.66
The proposed | 15.40 | 0.51 | 0.66
(b) TT-100K
LIME [41] | 13.15 | 0.41 | 0.66
RetinexNet [42] | 13.89 | 0.45 | 0.65
EnlightenGAN [26] | 13.95 | 0.50 | 0.60
RUAS [24] | 14.11 | 0.52 | 0.55
Zero-DCE [27] | 14.52 | 0.52 | 0.56
Zero-DCE++ [28] | 14.14 | 0.50 | 0.58
Retinexformer [42] | 14.30 | 0.49 | 0.57
The proposed | 14.38 | 0.50 | 0.55
(c) GLARE
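As a point of reference for Table 3, PSNR and SSIM are standard full-reference image quality metrics and can be computed with off-the-shelf tools. The minimal sketch below is not the authors' evaluation code; it assumes 8-bit RGB image pairs and uses scikit-image, while LPIPS would additionally require a pretrained perceptual model (e.g., the lpips package) and is omitted here.

# Minimal sketch (not the authors' evaluation code): PSNR and SSIM for an
# enhanced image against its normal-light reference, assuming uint8 RGB inputs.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def image_quality(reference: np.ndarray, enhanced: np.ndarray):
    """Return (PSNR in dB, SSIM) for two equally sized uint8 RGB images."""
    psnr = peak_signal_noise_ratio(reference, enhanced, data_range=255)
    ssim = structural_similarity(reference, enhanced, channel_axis=-1, data_range=255)
    return psnr, ssim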
Table 4. Comparison of running time and floating-point computation of different networks.
Methods | RT (s) | FLOPs (G)
RUAS [24] | 0.0186 | 411.35
Zero-DCE [27] | 0.0093 | 102.61
The proposed | 0.0046 | 64.14
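Running-time and FLOPs figures such as those in Table 4 can be approximated with a simple benchmarking routine. The sketch below is illustrative rather than the authors' measurement code: it assumes a PyTorch model, and the benchmark helper, input resolution, and iteration counts are hypothetical choices.

# Rough benchmarking sketch (assumes a PyTorch model; values are illustrative).
import time
import torch
from fvcore.nn import FlopCountAnalysis  # counts fused multiply-adds as one FLOP

@torch.no_grad()
def benchmark(model: torch.nn.Module, input_size=(1, 3, 512, 512), runs: int = 100):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(*input_size, device=device)
    for _ in range(10):                  # warm-up passes before timing
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    rt = (time.perf_counter() - start) / runs            # average runtime per image (s)
    gflops = FlopCountAnalysis(model, x).total() / 1e9   # FLOPs reported in G
    return rt, gflops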
Table 5. Quantitative comparison of detection results of different detection models.
Detection Method | mAP (%): GTSDB | TT-100K | GLARE | F1-Measure (%): GTSDB | TT-100K | GLARE
Faster-RCNN [7] | 70.85 | 71.28 | 68.51 | 58.86 | 70.59 | 67.51
SSD [13] | 70.47 | 72.22 | 67.94 | 69.05 | 70.41 | 65.38
RetinaNet [43] | 72.55 | 78.51 | 68.88 | 71.10 | 75.38 | 68.00
Yolov3 [13] | 73.28 | 81.05 | 71.89 | 71.11 | 76.88 | 69.95
Yolov4-tiny [14] | 75.85 | 81.10 | 72.04 | 72.94 | 80.94 | 71.05
Yolov4 [14] | 75.41 | 81.38 | 72.28 | 72.85 | 81.18 | 70.98
AFPN [45] | 78.06 | 81.75 | 75.45 | 75.87 | 82.05 | 72.22
Attention-Yolov4 [45] | 78.55 | 81.59 | 75.40 | 75.81 | 81.95 | 72.08
FCOS [15] | 78.08 | 81.41 | 75.33 | 75.75 | 81.88 | 72.10
FCOS-self-attention | 78.15 | 81.84 | 75.40 | 75.30 | 82.08 | 72.19
The proposed | 78.88 | 82.23 | 75.44 | 75.44 | 82.30 | 72.31
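For clarity on the F1-measure columns in Table 5, the score combines precision and recall computed from true positives, false positives, and false negatives at a fixed IoU threshold. The short sketch below is purely illustrative; the detection counts are hypothetical and not taken from the paper.

# Illustrative only: F1-measure from detection counts at a fixed IoU threshold.
def f1_measure(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(round(100 * f1_measure(755, 180, 230), 2))  # 78.65 with these hypothetical counts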
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
