Article

3L-YOLO: A Lightweight Low-Light Object Detection Algorithm

Zhenqi Han, Zhen Yue and Lizhuang Liu
1 School of Information Science and Technology, Fudan University, Shanghai 200438, China
2 Shanghai Advanced Research Institute, Chinese Academy of Sciences, Shanghai 201210, China
* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(1), 90; https://doi.org/10.3390/app15010090
Submission received: 16 November 2024 / Revised: 24 December 2024 / Accepted: 24 December 2024 / Published: 26 December 2024

Abstract

Object detection in low-light conditions presents significant challenges due to issues such as weak contrast, high noise, and blurred boundaries. Existing methods often use image enhancement to improve detection, which consumes substantial computational resources. To address these challenges, this paper proposes a detection method, 3L-YOLO, based on YOLOv8n, which eliminates the need for image enhancement modules. First, we introduce switchable atrous convolution (SAConv) into the C2f module of YOLOv8n, improving the model’s ability to efficiently capture global contextual information. Second, we present a multi-scale neck module that aggregates shallow features and incorporates a channel attention mechanism to prioritize the most relevant features. Third, we introduce a dynamic detection head, which employs a cascade of spatial, scale, and channel attention mechanisms to enhance detection accuracy and robustness. Finally, we replace the original loss function with MPDIoU loss, improving bounding box regression and overall reliability. Additionally, we create a synthetic low-light dataset to evaluate the performance of the proposed method. Extensive experiments on the ExDark, ExDark+, and DARK FACE datasets demonstrate that 3L-YOLO outperforms YOLOv8n in low-light object detection, with improvements in mAP@0.5 of 2.7%, 4.3%, and 1.4%, respectively, across the three datasets. In comparison to the LOL-YOLO low-light object detection algorithm, 3L-YOLO requires 16.6 GFLOPs, representing a reduction of 4 GFLOPs.

1. Introduction

Low-light object detection is a critical and fundamental task in vision, with a wide range of applications in areas such as security surveillance systems [1,2], autonomous driving [3], agricultural production [4,5], and other fields. Current research predominantly focuses on object detection in high-quality images under good lighting conditions, leading to the development of both one-stage [6,7,8,9,10,11,12] and two-stage methods [13,14,15]. In practical applications, poor lighting conditions are common, including low-light environments, nighttime, and underexposure. As a result of hardware limitations, the quality of the acquired images often deteriorates, leading to issues such as low contrast, high noise, and blurred boundaries. These factors weaken the geometric and texture features of the object, resulting in degraded object detection performance. Therefore, improving recognition capabilities under low-light conditions is crucial for advancing the practical applications of object detection.
To address these challenges, researchers have adopted hardware-based approaches to improve object detection performance under low-light conditions. For example, Altay et al. [16] employed thermal cameras for pedestrian detection and introduced a novel detection network that integrates saliency maps derived from thermal images. Similarly, Xie et al. [17] proposed a feature fusion framework that combines thermal and visible light data to improve detection performance in low-light conditions. While these methods enhance detection performance, they require costly data processing and specialized imaging equipment, limiting their widespread adoption. Therefore, high-performance low-light object detection algorithms have attracted significant attention from researchers recently.
In low-light object detection tasks, most current methods decompose the problem into two stages: image enhancement and object detection. Image enhancement serves as a preprocessing step to improve the visibility and sharpness of images captured under low-light conditions before they are passed into a detector for object detection. Image enhancement techniques can be broadly classified into conventional methods and deep learning-based methods. Conventional image enhancement methods include histogram equalization [18], gamma correction [19], adaptive contrast enhancement [20], and Retinex [21]. In contrast, deep learning-based approaches [22,23,24] focus on end-to-end image enhancement. Object detection using deep learning-enhanced images has gained popularity due to its superior performance. For instance, Vinoth et al. [25] designed a PWDN network for image enhancement, followed by using YOLOv8 for object detection. Cui et al. [26] proposed an illumination adaptive transformer (IAT) network to enhance images, followed by YOLOv3 for object detection. Guo et al. [27] designed a lightweight network to predict dynamic range and high-order curves for image enhancement, followed by the use of DSFD [28] for face detection, thereby addressing the challenge of face detection under low-light conditions. However, these image enhancement-based detection methods require substantial computational resources, making them challenging to deploy on devices with limited computing power. Furthermore, they are not easily trained end-to-end with object detection models.
To reduce the complexity of two-stage low-light object detection, many researchers have focused on end-to-end networks. These architectures integrate image enhancement directly into the object detection network, improving overall system efficiency by simplifying the processing flow and reducing computational load. These methods provide an effective solution for object detection in low-light environments. For example, Yin et al. [29] proposed a pyramid-enhanced network integrated with YOLOv3, creating an end-to-end dark object detection framework. Building on YOLOv8, Jiang et al. [30] introduced a self-correcting photo module to improve the quality of low-light images and employed a dynamic feature extraction method to capture contextual information, thereby enhancing detection accuracy and robustness. Liu et al. [31] proposed a data-driven, stylization-based, neural–image–adaptive YOLO, which improves model robustness by adaptively enhancing image quality and learning relevant information related to extreme weather conditions through neural style transfer. Hashmi et al. [32] developed FeatEnHancer, a general-purpose plug-and-play module. FeatEnHancer uses CNNs to generate multi-scale feature representations, which are then fused through scale-aware attentional feature aggregation and skip connections. It can be incorporated into any low-light vision pipeline. While end-to-end architectures offer a promising solution for object detection in low-light environments, these algorithms still face the challenge of real-time processing to meet the efficiency demands of practical applications. Therefore, enhancing the feature extraction capabilities of end-to-end networks for weak signal targets and reducing reliance on image enhancement modules is critical for achieving lightweight, low-light object detection networks. Furthermore, numerous studies [33,34] have shown that image enhancement does not always improve object detection performance. This is often due to the introduction of noise or distortion during the enhancement process, which can interfere with the feature extraction capabilities of detection algorithms.
In summary, while the methods discussed above have improved the detection accuracy of low-light objects, several major challenges remain: (1) Many approaches focus on enhancing image quality in low-light conditions, a process that consumes significant computational resources and is difficult to deploy on devices with limited memory and processing power. (2) The inability to integrate image context and details limits model performance. To address these challenges, this paper proposes 3L-YOLO, an end-to-end lightweight low-light object detector based on the YOLOv8n architecture. Unlike traditional methods, 3L-YOLO eliminates the need for an image enhancement module and directly performs object detection on images captured in low-light environments. Specifically, we propose the following improvements to enhance low-light object detection performance. First, the switchable atrous convolution is integrated into the C2f module of YOLOv8 to improve the algorithm’s ability to extract contextual features of low-illumination objects. Second, a multi-scale feature fusion neck module is introduced, combined with a channel attention mechanism, to facilitate bidirectional feature flow (bottom–up and top–down) and local cross-channel information interaction. This enables the network to better capture features at various scales and improves its ability to detect objects of different sizes and complexities. Third, a dynamic head combined with deformable convolution is introduced for the detection head. The proposed improvements in 3L-YOLO provide both important research contributions and practical solutions to the challenge of low-light object detection.
The remainder of this paper is organized as follows: Section 2 presents the proposed 3L-YOLO network. Section 3 provides an extensive experimental evaluation of the feasibility and effectiveness of the 3L-YOLO network. Section 4 concludes the paper.

2. Methods

2.1. YOLOv8n Model

YOLOv8 (You Only Look Once Version 8) [35] is a one-stage, real-time object detector released by Ultralytics in 2023. Its speed, accuracy, and ease of use make it an excellent choice for object detection. YOLOv8 is available in five versions, differentiated by network depth and feature map width: YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x. Among these, YOLOv8n has the smallest number of parameters, making it suitable for deployment on devices with limited computational power, while YOLOv8x has the largest number of parameters and offers higher detection accuracy. Compared to YOLOv5 and YOLOv7, YOLOv8 enhances both object detection accuracy and processing speed through refined improvements in its network architecture. Specifically, YOLOv8n offers improved accuracy over YOLOv5n while being more efficient in terms of parameter count and computational cost compared to YOLOv7-tiny. Therefore, this paper selects YOLOv8n as the benchmark model to meet the real-time detection requirements on edge devices. Its architecture consists of a backbone, neck, and prediction head, as shown in Figure 1.
The backbone module mainly extracts rich, shallow feature information from images through convolution operations. The backbone includes a convolution block (Conv), a CSP bottleneck with two convolutions (C2f) modules, and a spatial pyramid pooling (SPPF) module. The Conv block consists of 2D convolution layers (Conv2d), a batch normalization (BN) layer, and a sigmoid linear unit (SiLU). Its primary function is to further extract features and perform downsampling to obtain shallow features at multiple scales. Meanwhile, the residual module facilitates model convergence and mitigates the issue of gradient vanishing. The C2f module integrates the Conv block, split layer, bottleneck block, and concatenation layer. Its multi-branch design enhances the model’s feature extraction capabilities while maintaining a lightweight structure. The SPPF module combines the Conv block, pooling layer, and concatenation layer. The use of multiple small pooling layers reduces computational load and fuses feature information at different scales, improving the model’s robustness to object scale variations.
The neck module primarily fuses multi-scale features extracted by the backbone. It includes an upsampling layer, the C2f module, and a concatenation layer. The neck adopts the FPN [36] and PAN [37] structures, enhancing feature map representation through bottom–up and top–down feature aggregation.
The head module is responsible for final object detection and classification. To improve efficiency, a decoupled head structure is employed, with separate detection and classification heads. This design allows the tasks of detection and classification to be handled independently. By leveraging feature maps at different scales, the model predicts object categories and locations at multiple scales. The decoupled head design enhances both training and inference efficiency, thereby improving the overall performance of YOLOv8n.
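For reference, the YOLOv8n baseline described above can be exercised with a few lines of the Ultralytics API [35]; the weight file and test image below are illustrative placeholders rather than artifacts of this paper.

```python
# Minimal baseline sketch using the Ultralytics API [35]; file names are placeholders.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")               # nano variant: smallest depth/width, edge-friendly
results = model("low_light_sample.jpg")  # single-image inference
for r in results:
    print(r.boxes.xyxy, r.boxes.conf, r.boxes.cls)  # predicted boxes, confidences, class ids
```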

2.2. 3L-YOLO Model

To enhance the feature extraction capability for low-light objects while maintaining a lightweight network, this paper proposes the 3L-YOLO low-light object detection algorithm based on YOLOv8n. The architecture of 3L-YOLO, as illustrated in Figure 2, primarily consists of a backbone network for multi-scale feature extraction, a neck network for multi-scale feature fusion, and an object detection head. The 3L-YOLO algorithm introduces three key innovations: an improved C2f module with switchable atrous convolution, a neck module that combines multi-scale features with a channel attention mechanism, and an enhanced dynamic detection head utilizing deformable convolution. The rationale behind these design choices is discussed in detail in the following section.
In the C2f module of YOLOv8, switchable atrous convolution is incorporated to expand the receptive field, facilitating the effective extraction of detailed contextual features from the image. In the neck module, inspired by BiFPN [38], multi-scale feature fusion is achieved by integrating shallow features, while an efficient channel attention mechanism (ECA) [39] is employed to enhance feature fusion across different levels. The ECA, built on the channel attention mechanism of SENets [40], adaptively reweights channel features so that, during training, each layer’s feature map emphasizes the most informative channels and suppresses less relevant ones. The head module employs deformable convolution and various attention mechanisms to extract features at different scales, helping to suppress background noise and improve the extraction of weak signals in the object region.

2.2.1. Improved C2f Module by Switchable Atrous Convolution

In low-light environments, the object signal intensity in images is typically weak and often submerged in the background, making it challenging to extract rich local features. Therefore, leveraging global context information is crucial. Using convolutional kernels with a large receptive field helps enhance the model’s ability to capture global information, thereby compensating for the lack of local details. To address this, we integrate switchable atrous convolution (SAConv) [41] into the C2f module of YOLOv8n to enhance its ability to extract global context features. Specifically, in the C2f module, all convolutions, except for the first and last, are replaced with SAConv, while the remaining structure of the C2f module remains unchanged. SAConv captures multi-scale contextual information by applying convolution kernels with varying dilation rates, thereby expanding the receptive field without increasing the number of parameters. The spatial attention mechanism improves the signal from low-light object regions, strengthens object features, and suppresses background noise. The enhanced C2f module is shown in Figure 3.
SAConv [41] consists of three components: the pre-global context component, the switchable atrous convolution component, and the post-global context component, as shown in Figure 3. In the global context component, the input features undergo a global pooling operation to produce a feature map of size 1 × 1 × C. This is followed by a 1 × 1 convolution for feature channel fusion, after which the features are sent to a residual module for further fusion. In the switchable atrous convolution component, a switching function is generated using an average pooling layer with a 5 × 5 kernel, followed by a 1 × 1 convolution. Two atrous convolutions with different dilation rates are applied to the input features, and the resulting features are combined. The switching function assigns different weightings to the two atrous convolutions, allowing for adaptive feature extraction at multiple scales. The principle of SAConv can be expressed by the formula:
$$y = S(x)\cdot \mathrm{Conv}(x, w, 1) + \big(1 - S(x)\big)\cdot \mathrm{Conv}(x, w + \Delta w, r) \tag{1}$$
where $r$ is the atrous (dilation) rate hyperparameter of SAConv, set to $r = 3$ in our experiments; $w$ is the convolution weight, $\Delta w$ is a trainable weight difference, and $S(x)$ is the switch function.
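The following PyTorch sketch illustrates the switching mechanism of Equation (1). It is a simplified rendition for clarity, not the exact implementation used in DetectoRS [41]: the pre-/post-global-context branches are collapsed into a single residual 1 × 1 path, and all layer names are our own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleSAConv(nn.Module):
    """Simplified switchable atrous conv: y = S(x)*Conv(x, w, 1) + (1 - S(x))*Conv(x, w + dw, r)."""

    def __init__(self, channels, r=3):
        super().__init__()
        self.r = r
        # Shared 3x3 weight w; dw is a trainable weight difference, initialized to zero.
        self.weight = nn.Parameter(torch.empty(channels, channels, 3, 3))
        nn.init.kaiming_normal_(self.weight)
        self.delta_w = nn.Parameter(torch.zeros(channels, channels, 3, 3))
        # Switch S(x): 5x5 average pooling followed by a 1x1 conv, squashed to [0, 1].
        self.switch = nn.Sequential(
            nn.AvgPool2d(5, stride=1, padding=2),
            nn.Conv2d(channels, 1, 1),
            nn.Sigmoid(),
        )
        # Pre-/post-global-context branches collapsed into one residual 1x1 path (simplification).
        self.global_ctx = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(channels, channels, 1))

    def forward(self, x):
        x = x + self.global_ctx(x)                                   # pre global context
        s = self.switch(x)                                           # per-pixel switch in [0, 1]
        y_dense = F.conv2d(x, self.weight, padding=1, dilation=1)    # small receptive field
        y_atrous = F.conv2d(x, self.weight + self.delta_w,
                            padding=self.r, dilation=self.r)         # enlarged receptive field
        y = s * y_dense + (1 - s) * y_atrous                         # Equation (1)
        return y + self.global_ctx(y)                                # post global context

# Example: drop-in replacement for a 3x3 convolution inside the C2f bottlenecks.
out = SimpleSAConv(channels=64)(torch.randn(1, 64, 80, 80))
print(out.shape)  # torch.Size([1, 64, 80, 80])
```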

2.2.2. Neck Module Based on Multi-Scale Features and Channel Attention Mechanism

The neck network of YOLOv8 combines the path aggregation network (PAN) [37] and feature pyramid network (FPN) [36], which enhances the model’s ability to extract fine features through rich hierarchical information flow. Specifically, FPN transmits deep feature information to the shallow layers and injects key semantic information in a top-down manner. Conversely, the PAN structure uses a bottom–up approach to transfer shallow position information to the deeper layers, providing location cues for the deep layer features. However, in low-light images, the object signal features are weak, lacking rich color and texture details. The neck module of YOLOv8 does not sufficiently fuse shallow features, limiting the detection accuracy for low-light objects, particularly small objects, as shallow features retain more important position information. To address this limitation, enhancing the fusion of shallow features in YOLOv8 is essential.
To improve the utilization of shallow features in the PAN-FPN structure, we propose a multi-scale feature fusion neck module inspired by BiFPN [38], as shown in Figure 4. This module extends the FPN + PAN architecture by adding two horizontal connection paths to integrate the original shallow features extracted from the backbone network. Specifically, compared to YOLOv8, we fuse P2 scale features into the P3 scale features in the FPN, preserving rich low-light object location information with minimal convolution operations, and merge P3 and P4 scale features into the P4 and P5 scale features in PANet, respectively. Unlike BiFPN’s feature weighting approach, our method employs feature concatenation for fusion.
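As an illustration of the concatenation-based fusion described above, the sketch below merges a P2-scale map into P3 with a stride-2 convolution followed by concatenation; the channel widths are arbitrary examples, not the actual 3L-YOLO dimensions.

```python
import torch
import torch.nn as nn

c2, c3 = 64, 128                        # illustrative channel widths only
p2 = torch.randn(1, c2, 160, 160)       # shallow, high-resolution feature (P2)
p3 = torch.randn(1, c3, 80, 80)         # deeper feature (P3)

down = nn.Conv2d(c2, c2, kernel_size=3, stride=2, padding=1)   # bring P2 to P3 resolution
fuse = nn.Conv2d(c2 + c3, c3, kernel_size=1)                   # compress channels after concat

p3_fused = fuse(torch.cat([down(p2), p3], dim=1))   # concatenation fusion instead of BiFPN weighting
print(p3_fused.shape)                               # torch.Size([1, 128, 80, 80])
```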
To further refine the feature fusion process, the ECA module is applied after each feature concatenation. As illustrated in Figure 5, the ECA module first performs global average pooling on the feature map and then applies a one-dimensional convolution to enable local cross-channel interactions without dimensionality reduction. The size of the one-dimensional convolution kernel is determined adaptively. Afterward, the sigmoid function maps the result to a 1 × 1 × C vector, where each element represents the weight of the corresponding channel. Subsequently, the weight of each channel is multiplied by the features of the corresponding channel in the input feature map, thereby performing channel-wise weighting. Given the channel dimension $C$, the adaptive convolution kernel size $k$ is calculated using the following formula:
$$k = \left| \frac{\log_2 C}{r} + \frac{b}{r} \right|_{\mathrm{odd}} \tag{2}$$
where $b$ and $r$ are control coefficients and $C$ is the number of channels; $\left| t \right|_{\mathrm{odd}}$ denotes the odd number nearest to $t$. In this paper, we set $r$ and $b$ to 2 and 1, respectively, in all experiments.
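A compact, generic ECA block following Equation (2) might look as follows (after [39]); the layer names are ours and this is not the authors' implementation.

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient channel attention: GAP -> 1D conv with adaptive kernel k -> sigmoid -> reweight."""

    def __init__(self, channels, r=2, b=1):
        super().__init__()
        k = int(abs(math.log2(channels) / r + b / r))
        k = k if k % 2 else k + 1            # force an odd kernel size, as in Equation (2)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                    # x: (N, C, H, W)
        y = x.mean(dim=(2, 3))               # global average pooling -> (N, C)
        y = self.conv(y.unsqueeze(1))        # local cross-channel interaction -> (N, 1, C)
        w = self.sigmoid(y).squeeze(1)       # per-channel weights in [0, 1] -> (N, C)
        return x * w[:, :, None, None]       # channel-wise reweighting

# Applied after a feature concatenation in the neck, e.g. (hypothetical tensors):
# fused = ECA(channels=192)(torch.cat([p2_down, p3], dim=1))
```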

2.2.3. Improved Dynamic Detection Head Based on Deformable Convolution

In low-light environments, objects exhibit variations in class, location, and scale. To effectively distinguish objects from low-contrast backgrounds and improve detection accuracy, the network must leverage spatial, scale, and channel information. Inspired by the dynamic head [42] and based on deformable convolution (DCNv3) [43], we propose an object detection head that cascades spatial, scale, and channel attention mechanisms, as illustrated in Figure 6. We first apply spatial attention, scale attention, and channel attention mechanisms to the input features and then perform object detection and recognition using the decoupled head of YOLOv8n.
Given the feature tensor $F \in \mathbb{R}^{L \times S \times C}$, the three sequential attention mechanisms are formulated as follows:
$$W(F) = \pi_C\Big(\pi_L\big(\pi_S(F)\cdot F\big)\cdot F\Big)\cdot F \tag{3}$$
where $\pi_C(\cdot)$, $\pi_L(\cdot)$, and $\pi_S(\cdot)$ are the channel, scale, and spatial attention functions, respectively.
In the spatial attention module, deformable convolution (DCNv3) [43] is employed to aggregate the spatial features of objects at the same scale. In comparison with DCNv2 [44] and DCN [45], DCNv3 employs separable convolution, a multi-grouping mechanism, and normalization of the modulation scalars along the sampling points, thereby enhancing feature extraction capabilities and stabilizing the model’s training process. By learning offsets for irregular samples, deformable convolution enables the model to focus on object regions with variations in scale, aspect ratio, and rotation while suppressing background noise. The mathematical formulation of spatial attention is as follows:
$$\pi_S(F)\cdot F = \sum_{l=1}^{L}\sum_{k=1}^{K} w_{l}\cdot m_{l,k}\cdot F_l\big(p + p_k + \Delta p_{l,k}\big) \tag{4}$$
where $L$ is the number of feature-map levels and $K$ is the number of sparse sampling locations; $p_k + \Delta p_{l,k}$ is a location shifted from $p_k$ by the self-learned spatial offset $\Delta p_{l,k}$ in the $l$-th feature map, and $m_{l,k}$ is a self-learned importance scalar at location $p_k$ in the $l$-th feature map.
In the scale attention module, global pooling, 1 × 1 convolution, and the sigmoid function are used to calculate the weights of different scale levels and perform weighted aggregation of features across these levels.
$$\pi_L(F)\cdot F = \sigma\!\left( f\!\left( \frac{1}{SC}\sum_{S,C} F \right)\right)\cdot F \tag{5}$$
where $f(\cdot)$ is a 1 × 1 convolutional layer and $\sigma(x) = \max\!\big(0, \min\!\big(1, \frac{x+1}{2}\big)\big)$ is the hard sigmoid function.
In the channel attention module, the threshold for controlling feature channel switching is derived by first performing global pooling, followed by two fully connected layers, and concluding with a normalization layer:
$$\pi_C(F)\cdot F = \max\!\big(\alpha^1(F)\cdot F_c + \beta^1(F),\ \alpha^2(F)\cdot F_c + \beta^2(F)\big) \tag{6}$$
where $F_c$ is the feature slice at the $c$-th channel and $\big[\alpha^1, \alpha^2, \beta^1, \beta^2\big]^{T}$ denotes a hyperfunction that learns to control the activation thresholds, derived from the output of the normalization layer.
By cascading the spatial, scale, and channel attention modules, the feature map is enhanced, successfully aggregating spatial, scale, and channel information for objects in low-illumination environments. Finally, we use decoupled heads in YOLOv8 to map the enhanced features to object locations and classes.
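To make the cascade of Equation (3) concrete, the sketch below chains simplified scale (Equation (5)) and channel (Equation (6)) attention behind a placeholder spatial step. In the actual 3L-YOLO head, the spatial step is DCNv3-based deformable attention (Equation (4)); it is replaced here by an ordinary 3 × 3 convolution purely for readability, and all layer names are our own.

```python
import torch
import torch.nn as nn

class SimplifiedDyHead(nn.Module):
    """Cascade of (placeholder) spatial, scale, and channel attention over stacked feature levels."""

    def __init__(self, channels):
        super().__init__()
        # Spatial step: a plain 3x3 conv stands in for the DCNv3-based spatial attention.
        self.spatial = nn.Conv2d(channels, channels, 3, padding=1)
        # f(.) in Equation (5): 1x1 conv producing one scalar weight per level.
        self.scale_fc = nn.Conv2d(channels, 1, 1)
        # Hyperfunction for Equation (6): predicts alpha1, alpha2, beta1, beta2 per channel.
        self.channel_fc = nn.Sequential(
            nn.Linear(channels, channels // 4),
            nn.ReLU(),
            nn.Linear(channels // 4, 4 * channels),
        )

    def forward(self, feats):
        # feats: list of L maps (N, C, H, W), assumed already resized to a common resolution.
        feats = [self.spatial(f) for f in feats]          # simplified spatial attention
        x = torch.stack(feats, dim=1)                     # (N, L, C, H, W)
        n, L, c, h, w = x.shape

        # Scale attention (Equation (5)): hard sigmoid of a 1x1 conv over level-wise pooled features.
        pooled = x.mean(dim=(3, 4)).reshape(n * L, c, 1, 1)
        level_w = ((self.scale_fc(pooled) + 1) / 2).clamp(0, 1).reshape(n, L, 1, 1, 1)
        x = x * level_w

        # Channel attention (Equation (6)): max of two learned affine transforms per channel.
        ctx = x.mean(dim=(1, 3, 4))                       # (N, C) global context
        a1, a2, b1, b2 = self.channel_fc(ctx).reshape(n, 4, c).unbind(dim=1)
        a1, a2, b1, b2 = (t.reshape(n, 1, c, 1, 1) for t in (a1, a2, b1, b2))
        x = torch.max(a1 * x + b1, a2 * x + b2)
        return x.unbind(dim=1)                            # back to a list of L enhanced maps

# Example with three levels of 64-channel features at a shared 40 x 40 resolution.
head = SimplifiedDyHead(channels=64)
outs = head([torch.randn(2, 64, 40, 40) for _ in range(3)])
print([o.shape for o in outs])
```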

2.2.4. MPDIoU Loss

YOLOv8 uses CIoU to calculate the overall overlap between the predicted and ground truth bounding boxes. However, CIoU loss is difficult to optimize when the predicted and ground truth boxes have the same aspect ratio but different sizes. To improve the positioning accuracy in low-light target detection, MPDIoU loss [46] was employed. Unlike traditional IoU, MPDIoU not only measures the overlap between the predicted and ground truth boxes but also considers the center point distance and the deviation in width and height. This approach minimizes the distance between the top-left and bottom-right corners of the predicted and ground truth bounding boxes, thereby enhancing the accuracy and efficiency of bounding box regression. The expression of the MPDIoU loss is as follows:
$$\mathcal{L}_{MPDIoU} = 1 - MPDIoU = 1 - \frac{\lvert A \cap B \rvert}{\lvert A \cup B \rvert} + \frac{d_1^2}{w^2 + h^2} + \frac{d_2^2}{w^2 + h^2} \tag{7}$$
where $A$ and $B$ are the predicted and ground truth bounding boxes, respectively; $w$ and $h$ represent the image width and height; $d_1$ is the distance between the upper-left corners of the predicted box and the ground truth box, while $d_2$ is the distance between the lower-right corners of these boxes. $d_1$ and $d_2$ are calculated as follows:
$$d_1^2 = \big(x_1^{gt} - x_1^{prd}\big)^2 + \big(y_1^{gt} - y_1^{prd}\big)^2, \qquad d_2^2 = \big(x_2^{gt} - x_2^{prd}\big)^2 + \big(y_2^{gt} - y_2^{prd}\big)^2 \tag{8}$$
where $\big(x_1^{prd}, y_1^{prd}\big)$ and $\big(x_2^{prd}, y_2^{prd}\big)$ are the coordinates of the upper-left and lower-right corners of the predicted box, and $\big(x_1^{gt}, y_1^{gt}\big)$ and $\big(x_2^{gt}, y_2^{gt}\big)$ are the coordinates of the upper-left and lower-right corners of the ground truth box.
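For clarity, Equations (7) and (8) can be transcribed directly into a standalone loss function; the box format (x1, y1, x2, y2) and the variable names are our own choices, not the training code actually used in 3L-YOLO.

```python
import torch

def mpdiou_loss(pred, gt, img_w, img_h, eps=1e-7):
    """MPDIoU loss for boxes in (x1, y1, x2, y2) format; pred and gt have shape (N, 4)."""
    # Intersection and union areas.
    ix1, iy1 = torch.max(pred[:, 0], gt[:, 0]), torch.max(pred[:, 1], gt[:, 1])
    ix2, iy2 = torch.min(pred[:, 2], gt[:, 2]), torch.min(pred[:, 3], gt[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    iou = inter / (area_p + area_g - inter + eps)
    # Squared corner distances, normalized by the image diagonal (Equation (8)).
    d1 = (gt[:, 0] - pred[:, 0]) ** 2 + (gt[:, 1] - pred[:, 1]) ** 2
    d2 = (gt[:, 2] - pred[:, 2]) ** 2 + (gt[:, 3] - pred[:, 3]) ** 2
    diag = img_w ** 2 + img_h ** 2
    mpdiou = iou - d1 / diag - d2 / diag
    return 1.0 - mpdiou        # Equation (7)

# Example: a prediction shifted by 10 px inside a 640 x 640 image.
pred = torch.tensor([[110.0, 110.0, 210.0, 210.0]])
gt = torch.tensor([[100.0, 100.0, 200.0, 200.0]])
print(mpdiou_loss(pred, gt, 640, 640))
```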

3. Experimental Results

3.1. Experimental Datasets

3.1.1. ExDark Dataset

The Exclusively Dark (ExDark) dataset [47] is widely used in computer vision for training and evaluating network models in low-light environments. Specifically designed to address a range of low-light scenarios, this dataset includes 10 different conditions, such as extremely low light, morning, and dusk. It covers 12 object categories and contains a total of 7363 images. The ExDark dataset offers a diverse range of low-light samples to improve the detection and recognition capabilities of models under low-light conditions. We divided the dataset into training, validation, and test sets using an 8:1:1 ratio, resulting in 5884, 741, and 738 samples, respectively.

3.1.2. DARK FACE Dataset

The DARK FACE dataset [48] consists of 6000 low-light images captured in real-world environments at night, including school buildings, streets, bridges, overpasses, and parks. All images are annotated with bounding boxes around faces. We divided the dataset into training, validation, and test sets using an 8:1:1 ratio, resulting in 4800, 600, and 600 samples, respectively.

3.1.3. Low Light Synthesis Dataset

To evaluate the generalization capability of the model, we extend the ExDark dataset by incorporating the Microsoft Common Objects in Context (COCO) dataset [49]. The COCO dataset is a large-scale computer vision dataset containing over 300,000 images and annotations across 80 object classes. However, most of the images in the COCO dataset were captured under standard lighting conditions. To address this, we design a method for synthesizing low-light images by manipulating pixel intensity and applying gamma transformations to COCO images. Under low-light conditions, the noise introduced by the image acquisition sensor primarily follows Gaussian and Poisson distributions [50]. Therefore, the process of synthesizing low-light images involves four key steps (a code sketch follows the list):
  • Initial Evaluation: Calculate the average gray value of the image to decide whether it needs to be darkened; only images whose average brightness exceeds the threshold are processed.
  • Brightness Reduction: Randomly decrease the image brightness to 60% to 80% of its original level.
  • Gamma Correction: Utilize gamma transformation to simulate the dark light effect, with a random gamma parameter ranging between 2.0 and 5.0.
  • Noise Addition: Introduce random Gaussian and Poisson noise. The Gaussian noise has a mean of 0, with a standard deviation randomly varying between 0.1 and 0.3.
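A minimal sketch of this four-step pipeline on a single RGB image is given below; the brightness threshold (60 on a 0–255 scale) and the Poisson-noise scaling are illustrative assumptions, as the paper does not specify them.

```python
import numpy as np

def synthesize_low_light(img, brightness_thresh=60):
    """Darken a well-lit RGB uint8 image following the four steps described above."""
    if img.mean() <= brightness_thresh:          # step 1: skip images that are already dark
        return img
    x = img.astype(np.float32) / 255.0
    x *= np.random.uniform(0.6, 0.8)             # step 2: reduce brightness to 60-80%
    gamma = np.random.uniform(2.0, 5.0)          # step 3: gamma transformation
    x = np.power(x, gamma)
    sigma = np.random.uniform(0.1, 0.3)          # step 4: Gaussian noise (mean 0)
    x = x + np.random.normal(0.0, sigma, x.shape)
    # Poisson (shot) noise; the 255 scaling is an illustrative assumption.
    x = np.random.poisson(np.clip(x, 0, 1) * 255.0) / 255.0
    return (np.clip(x, 0, 1) * 255).astype(np.uint8)

# Usage (placeholder file name): import cv2; dark = synthesize_low_light(cv2.imread("coco_sample.jpg"))
```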
Since the COCO dataset includes more categories than ExDark, we select the relevant labels based on ExDark’s category list and remove any unrelated ones to create a subset. This subset is then darkened and merged with ExDark to form a new ExDark+ dataset. The extended dataset consists of 12 categories, 11,287 images, and 30,415 labels, significantly enhancing the diversity of sample characteristics. Figure 7 presents the results of synthesizing low-light images from the COCO dataset. The synthesized images closely resemble low-light images captured in real-world environments, exhibiting characteristics such as uneven lighting, low contrast, and blurred boundaries. The label distribution in the ExDark+ dataset is highly imbalanced, as shown in Figure 8. The most frequent label is “people” with 8279 instances, while the least frequent label is “bus” with only 1190 instances.

3.2. Evaluation Metrics

This study uses precision (P), recall (R), and mean average precision (mAP) as performance evaluation metrics for the object detection task while employing floating-point operations (FLOPs) and the number of parameters to assess the model’s efficiency and lightweight characteristics.
Precision reflects the proportion of correctly predicted positive samples out of all samples with positive predictions, while recall indicates the proportion of correctly predicted positive samples out of all actual positive samples. The formulas for precision and recall are given in Equations (9) and (10), respectively. Here, T P denotes the number of true positive samples, F P represents the number of false positives, and F N represents the number of false negatives.
$$P = \frac{TP}{TP + FP} \tag{9}$$
$$R = \frac{TP}{TP + FN} \tag{10}$$
The mean average precision (mAP) is a comprehensive performance metric that combines precision and recall at different confidence levels. It is computed by constructing the precision-recall curve, calculating the area under the curve to obtain the average precision (AP), and then averaging the AP for each class to compute the mAP. In this study, mAP with an Intersection over Union (IoU) threshold of 50%, denoted as mAP@0.5, was used as the primary performance evaluation metric.
$$mAP = \frac{1}{N}\sum_{i=1}^{N}\int_{0}^{1} P(R)\, dR \tag{11}$$
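For illustration, the following sketch computes precision, recall, and a single-class AP by integrating the precision-recall curve from scored detections; it is a didactic simplification of the full mAP@0.5 protocol rather than the evaluation code used here.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """AP for one class: sort detections by confidence, accumulate TP/FP, integrate P(R)."""
    order = np.argsort(-np.asarray(scores))
    hits = np.asarray(is_tp, dtype=float)[order]
    tp = np.cumsum(hits)
    fp = np.cumsum(1.0 - hits)
    recall = tp / max(num_gt, 1)                    # Equation (10), accumulated over the ranking
    precision = tp / np.maximum(tp + fp, 1e-9)      # Equation (9), accumulated over the ranking
    return float(np.trapz(precision, recall))       # area under the precision-recall curve

# Three detections, two of which match a ground-truth box at IoU >= 0.5, out of 2 GT boxes.
print(average_precision(scores=[0.9, 0.8, 0.3], is_tp=[1, 0, 1], num_gt=2))
```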

3.3. Implementation Details

The network was trained in the PyTorch [51] framework using the stochastic gradient descent (SGD) optimizer. The specific settings for the learning rate, weight decay, momentum coefficient, and hardware configuration are detailed in Table 1. The model processes 24 samples per batch and converges after 200 training epochs. All image samples are maintained at their original aspect ratio, with the longer side scaled to 640 pixels and the shorter side padded to 640 pixels, resulting in images of resolution 640 × 640 pixels.
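With the settings of Table 1, a comparable run could be launched through the Ultralytics training interface roughly as follows; the dataset YAML and model YAML names are placeholders, and we assume the standard Ultralytics argument names.

```python
from ultralytics import YOLO

# Placeholder configs: 3L-YOLO would substitute its own modified model YAML and dataset YAML.
model = YOLO("yolov8n.yaml")
model.train(
    data="exdark.yaml",     # dataset definition (train/val paths, class names)
    epochs=200,
    batch=24,
    imgsz=640,
    optimizer="SGD",
    lr0=1e-2,               # initial learning rate (Table 1)
    weight_decay=5e-4,      # weight decay (Table 1)
    momentum=0.937,         # momentum coefficient (Table 1)
)
```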

3.4. Results and Analysis

3.4.1. Ablation Experiment

In this paper, YOLOv8n is improved in three key aspects: the improved C2f module with switchable atrous convolution (C2f_SAConv), the neck module based on multi-scale features and the channel attention mechanism (MSFCA_Neck), and the enhanced dynamic detection head based on deformable convolution (DCNv3_Dyhead). We conducted ablation experiments on the ExDark dataset to validate the effectiveness of these three improvements. As shown in Table 2, using the original YOLOv8n model as a baseline, the effectiveness of each improvement was evaluated by adding one module at a time. The symbol “√” denotes that an improvement module has been incorporated into the model. Detection performance is evaluated with P, R, and mAP@0.5, while Params and GFLOPs are used to assess how lightweight the model is.
As can be seen from Table 2, by incorporating switchable atrous convolution into the C2f module of YOLOv8n, the mAP@0.5 on the ExDark dataset improves to 66.6%, which is a 0.5% increase compared to the 66.1% achieved by YOLOv8n. Enhancing the neck module with multi-scale features and a channel attention mechanism results in a 0.7% improvement in mAP@0.5. Similarly, incorporating the deformable convolution-based dynamic detection head leads to a 3.2% increase in mAP@0.5. Ultimately, by integrating all three improved modules, the model achieves an mAP@0.5 of 68.8% on the ExDark dataset, representing a 4.1% improvement over the baseline model. The ablation experiment results demonstrate that the three proposed improvements (C2f_SAConv, MSFCA_Neck, and DCNv3_Dyhead) are effective and can collaborate synergistically to enhance low-light object detection performance.

3.4.2. Comparison Experiment on ExDark

In Table 3, we compared general object detection algorithms such as YOLOv5n, YOLOv7-tiny, and YOLOv8n, as well as other state-of-the-art low-light object detection algorithms, including Zero_DCE + YOLOv8n [27] and LOL-YOLO [30]. All methods were retrained and tested on the ExDark dataset. We used precision, recall, and mAP as evaluation metrics to assess the performance of each method. Bold values indicate the best results for each metric among low-light object detection algorithms.
As shown in Table 3, YOLOv5n has the fewest parameters and the lowest computational load, but its detection accuracy is also low. In contrast, although the proposed 3L-YOLO model is similar in size to YOLOv7-tiny, with 5.89 million parameters and 16.6 GFLOPs of computation, it achieves the highest mAP@0.5 and precision among the compared methods. Specifically, it achieves an mAP@0.5 of 0.688, an mAP@0.5:0.95 of 0.42, and a precision of 0.76. Compared to YOLOv8n and YOLOv7-tiny, mAP@0.5 has improved by 2.7% and 5.3%, respectively. In contrast to Zero-DCE and LOL-YOLO, which rely on image enhancement modules, the proposed method performs object detection directly without the need for such modules. As a result, the computational cost is reduced by 57 GFLOPs and 4 GFLOPs, respectively, while the mAP@0.5 increases by 4.9% and 0.7%, respectively. The proposed method achieves the highest accuracy while maintaining a lightweight model. These results demonstrate that the improved algorithm is highly effective for low-light object detection.

3.4.3. Comparison Experiment on ExDark+

To further verify the generalization ability of 3L-YOLO, experiments were conducted on the synthesized low-light ExDark+ dataset. The ExDark+ dataset presents greater challenges due to its complex lighting conditions, significant brightness fluctuations, and noise. The 3L-YOLO method is compared with YOLOv5n [52], YOLOv5s [52], YOLOv7-tiny [11], and YOLOv8n [35].
As shown in Table 4, the 3L-YOLO model performs well on the ExDark+ dataset, a synthetic low-light dataset. Specifically, 3L-YOLO achieves an mAP@0.5 score of 0.673 and a precision of 0.701 on the ExDark+ dataset, representing improvements of 4.3% and 3.9%, respectively, compared to YOLOv8n.
To fully highlight the advantages of the 3L-YOLO network in low-light object detection, Figure 9 presents a comparison of detection performance in representative complex scenes. In the second column, the bounding box predicted by YOLOv8n is significantly larger than the actual bounding box, whereas the bounding box predicted by our method closely matches the ground truth. In the last column, YOLOv8n not only mispositions the bounding box but also misclassifies the target, while our method achieves correct predictions and classifications. Under low-light conditions and across targets of varying scales, the proposed 3L-YOLO network outperforms YOLOv8n in classification accuracy, localization precision, and confidence.

3.4.4. Comparison Experiment on DARK FACE

To further validate the robustness and generalizability of the 3L-YOLO detection network in real low-light scenarios, we conducted experiments using the DARK FACE dataset. The DARK FACE dataset provides 6000 real-world low-light images captured during the nighttime. The performance of the 3L-YOLO method was compared with YOLOv8n, and the results are presented in Table 5.
On the DARK FACE dataset, the 3L-YOLO model achieves a mean average precision (mAP@0.5) of 0.47, which is 1.4% higher than that of the YOLOv8n model. Precision decreases slightly, from 70.5% to 70.1%, while recall increases by 1.7%, from 40.2% to 41.9%. These results demonstrate that the proposed method exhibits strong generalization and robustness in low-light object detection.

4. Conclusions

To address the challenges of feature extraction and high computational resource requirements caused by low contrast and unclear boundaries in low-light object detection, this paper proposes a low-light object detection method based on an improved YOLOv8n model. First, global context information is captured using switchable atrous convolution. Second, a neck module that integrates multi-scale features and a channel attention mechanism is introduced to leverage shallow features and enhance model performance. Finally, a dynamic detection head is used to enhance the model’s performance in low-light object detection. This dynamic detection head integrates multiple attention mechanisms, including spatial, scale, and channel attention. Compared with the YOLOv8n, the improved 3L-YOLO algorithm achieves an average detection accuracy of 68.8%, 67.3%, and 47% on the ExDark, ExDark+, and DARK FACE datasets, respectively, representing improvements of 2.7%, 4.3%, and 1.4%. Compared to image-enhanced low-light object detection methods, the 3L-YOLO model significantly reduces computational power requirements while improving detection accuracy, making it more suitable for deployment on edge devices. In future work, we will investigate more efficient convolution techniques and enhance the model’s ability to capture small target features to improve its recall rate. Additionally, we will further optimize the model parameters and explore real-time deployment across a broader range of low-light environments.

Author Contributions

Conceptualization, Z.H. and L.L.; methodology, Z.H. and Z.Y.; software, Z.H.; validation, Z.H., Z.Y. and L.L.; investigation, Z.Y.; resources, Z.H.; data curation, Z.Y.; writing—original draft preparation, Z.H. and Z.Y.; writing—review and editing, Z.H. and L.L.; visualization, Z.Y.; supervision, Z.H.; project administration, Z.H.; funding acquisition, L.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Shanghai Science and Technology Innovation Project (Grant No. XTCX-KJ-2024-24).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Based on the MSCOCO dataset and ExDark, we synthesized a low-light dataset named ExDark+, designed to closely simulate real low-light conditions. The ExDark+ dataset comprises 12 categories, 11,287 images, and 30,415 tags. To facilitate future research, this dataset is made publicly available on Google Drive at https://drive.google.com/drive/folders/1gHBV-U1NyLUZ2C15lgbtw2YyDkP0Vczh?usp=sharing (accessed on 23 December 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Abba, S.; Bizi, A.M.; Lee, J.-A.; Bakouri, S.; Crespo, M.L. Real-time object detection, tracking, and monitoring framework for security surveillance systems. Heliyon 2024, 10, e34922. [Google Scholar] [CrossRef] [PubMed]
  2. Akhtar, M.J.; Mahum, R.; Butt, F.S.; Amin, R.; El-Sherbeeny, A.M.; Lee, S.M.; Shaikh, S. A Robust Framework for Object Detection in a Traffic Surveillance System. Electronics 2022, 11, 3425. [Google Scholar] [CrossRef]
  3. Wang, Z.; Zhao, D.; Cao, Y. Visual Navigation Algorithm for Night Landing of Fixed-Wing Unmanned Aerial Vehicle. Aerospace 2022, 9, 615. [Google Scholar] [CrossRef]
  4. Mujkic, E.; Christiansen, M.P.; Ravn, O. Object Detection for Agricultural Vehicles: Ensemble Method Based on Hierarchy of Classes. Sensors 2023, 23, 7285. [Google Scholar] [CrossRef]
  5. Wosner, O.; Farjon, G.; Bar-Hillel, A. Object detection in agricultural contexts: A multiple resolution benchmark and comparison to human. Comput. Electron. Agric. 2021, 189, 106404. [Google Scholar] [CrossRef]
  6. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  7. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar]
  8. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  9. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  10. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  11. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  12. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the Computer Vision–ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  13. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  14. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  15. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  16. Altay, F.; Velipasalar, S. The Use of Thermal Cameras for Pedestrian Detection. IEEE Sens. J. 2022, 22, 11489–11498. [Google Scholar] [CrossRef]
  17. Xie, Y.; Zhang, L.; Yu, X.; Xie, W. Detection Algorithm with Visible Infrared Feature Interaction and Fusion. Control Theory Appl. 2024, 41, 914–922. [Google Scholar]
  18. Bai, L.; Zhang, W.; Pan, X.; Zhao, C. Underwater Image Enhancement Based on Global and Local Equalization of Histogram and Dual-Image Multi-Scale Fusion. IEEE Access 2020, 8, 128973–128990. [Google Scholar] [CrossRef]
  19. Li, C.; Tang, S.; Yan, J.; Zhou, T. Low-Light Image Enhancement via Pair of Complementary Gamma Functions by Fusion. IEEE Access 2020, 8, 169887–169896. [Google Scholar] [CrossRef]
  20. Zhang, W.; Zhuang, P.; Sun, H.H.; Li, G.; Kwong, S.; Li, C. Underwater Image Enhancement via Minimal Color Loss and Locally Adaptive Contrast Enhancement. IEEE Trans. Image Process. 2022, 31, 3997–4010. [Google Scholar] [CrossRef]
  21. Land, E.H.; McCann, J.J. Lightness and retinex theory. Josa 1971, 61, 1–11. [Google Scholar] [CrossRef]
  22. Guo, X.; Li, Y.; Ling, H. LIME: Low-Light Image Enhancement via Illumination Map Estimation. IEEE Trans. Image Process. 2017, 26, 982–993. [Google Scholar] [CrossRef]
  23. Wei, C.; Wang, W.; Yang, W.; Liu, J. Deep Retinex Decomposition for Low-Light Enhancement. arXiv 2018, arXiv:1808.04560. [Google Scholar]
  24. Jiang, Y.; Gong, X.; Liu, D.; Cheng, Y.; Fang, C.; Shen, X.; Yang, J.; Zhou, P.; Wang, Z. EnlightenGAN: Deep Light Enhancement Without Paired Supervision. IEEE Trans. Image Process. 2021, 30, 2340–2349. [Google Scholar] [CrossRef]
  25. Vinoth, K.; Sasikumar, P. Lightweight object detection in low light: Pixel-wise depth refinement and TensorRT optimization. Results Eng. 2024, 23, 102510. [Google Scholar] [CrossRef]
  26. Cui, Z.; Li, K.; Gu, L.; Su, S.; Gao, P.; Jiang, Z.; Qiao, Y.; Harada, T. You only need 90k parameters to adapt light: A light weight transformer for image enhancement and exposure correction. arXiv 2022, arXiv:2205.14871. [Google Scholar]
  27. Guo, C.; Li, C.; Guo, J.; Loy, C.C.; Hou, J.; Kwong, S.; Cong, R. Zero-Reference Deep Curve Estimation for Low-Light Image Enhancement. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 1777–1786. [Google Scholar]
  28. Li, J.; Wang, Y.; Wang, C.; Tai, Y.; Qian, J.; Yang, J.; Wang, C.; Li, J.; Huang, F. DSFD: Dual Shot Face Detector. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5055–5064. [Google Scholar]
  29. Yin, X.; Yu, Z.; Fei, Z.; Lv, W.; Gao, X. PE-YOLO: Pyramid Enhancement Network for Dark Object Detection. In Proceedings of the Artificial Neural Networks and Machine Learning–ICANN 2023, Heraklion, Greece, 26–29 September 2023; pp. 163–174. [Google Scholar]
  30. Jiang, C.; He, X.; Xiang, J. LOL-YOLO: Low-Light Object Detection Incorporating Multiple Attention Mechanisms. Comput. Eng. Appl. 2024, 60, 177–187. [Google Scholar] [CrossRef]
  31. Liu, W.; Ren, G.; Yu, R.; Guo, S.; Zhu, J.; Zhang, L. Image-Adaptive YOLO for Object Detection in Adverse Weather Conditions. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021. [Google Scholar]
  32. Hashmi, K.A.; Kallempudi, G.; Stricker, D.; Afzal, M.Z. FeatEnHancer: Enhancing Hierarchical Features for Object Detection and Beyond Under Low-Light Vision. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 6702–6712. [Google Scholar]
  33. Xiao, Y.; Jiang, A.; Ye, J.; Wang, M.W. Making of Night Vision: Object Detection Under Low-Illumination. IEEE Access 2020, 8, 123075–123086. [Google Scholar] [CrossRef]
  34. Hong, Y.; Wei, K.; Chen, L.; Fu, Y. Crafting Object Detection in Very Low Light. In Proceedings of the British Machine Vision Conference, Online, 22–25 November 2021. [Google Scholar]
  35. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. Available online: https://github.com/ultralytics/ultralytics (accessed on 16 November 2024).
  36. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar]
  37. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
  38. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10778–10787. [Google Scholar]
  39. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11531–11539. [Google Scholar]
  40. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  41. Qiao, S.; Chen, L.C.; Yuille, A. DetectoRS: Detecting Objects with Recursive Feature Pyramid and Switchable Atrous Convolution. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 10208–10219. [Google Scholar]
  42. Dai, X.; Chen, Y.; Xiao, B.; Chen, D.; Liu, M.; Yuan, L.; Zhang, L. Dynamic Head: Unifying Object Detection Heads with Attentions. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 7369–7378. [Google Scholar]
  43. Wang, W.; Dai, J.; Chen, Z.; Huang, Z.; Li, Z.; Zhu, X.; Hu, X.; Lu, T.; Lu, L.; Li, H.; et al. InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 14408–14419. [Google Scholar]
  44. Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable ConvNets V2: More Deformable, Better Results. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 9300–9308. [Google Scholar]
  45. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]
  46. Ma, S.; Xu, Y. MPDIoU: A Loss for Efficient and Accurate Bounding Box Regression. arXiv 2023, arXiv:2307.07662. [Google Scholar] [CrossRef]
  47. Loh, Y.P.; Chan, C.S. Getting to know low-light images with the Exclusively Dark dataset. Comput. Vis. Image Underst. 2019, 178, 30–42. [Google Scholar] [CrossRef]
  48. Yang, W.; Yuan, Y.; Ren, W.; Liu, J.; Scheirer, W.J.; Wang, Z.; Zhang, T.; Zhong, Q.; Xie, D.; Pu, S.; et al. Advancing Image Understanding in Poor Visibility Environments: A Collective Benchmark Study. IEEE Trans. Image Process. 2020, 29, 5737–5752. [Google Scholar] [CrossRef]
  49. Lin, T.-Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common Objects in Context. arXiv 2014, arXiv:1405.0312. [Google Scholar] [CrossRef]
  50. Wei, K.; Fu, Y.; Yang, J.; Huang, H. A Physics-Based Noise Formation Model for Extreme Low-Light Raw Denoising. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 2755–2764. [Google Scholar]
  51. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Curran Associates Inc.: Red Hook, NY, USA, 2019; p. 721. [Google Scholar]
  52. Jocher, G. Ultralytics YOLOv5. Available online: https://github.com/ultralytics/yolov5 (accessed on 16 November 2024).
Figure 1. YOLOv8n network architecture.
Figure 2. The architecture of the proposed 3L-YOLO.
Figure 3. Improved C2f module by switchable atrous convolution. Except for the first and last, all convolutions are replaced by switchable atrous convolutions.
Figure 4. Neck module structure design: (a) FPN structure in YOLOv8n; (b) PANet structure in YOLOv8n; (c) BiFPN structure; (d) our neck module with multi-scale features and channel attention mechanism.
Figure 5. Efficient channel attention mechanism (ECA).
Figure 6. Dynamic detection head based on DCNv3. The input features are fed into the detect head of YOLOv8n after the application of spatial attention, scale attention, and channel attention mechanisms. DCNv3 is employed as a spatial attention mechanism for aggregating spatial features.
Figure 7. Synthesized low-light image samples.
Figure 8. Distribution of class labels in the ExDark+ dataset.
Figure 9. Comparison of object detection effect between 3L-YOLO and YOLOv8n on the ExDark+ dataset. The color of the box corresponds to the predicted category.
Table 1. Experimental environment and configuration. The processors used for the experiments and the parameters employed during training.

Category | Item | Params
Hardware | CPU | Intel(R) Xeon(R) Platinum 8336C CPU @ 2.30 GHz
Hardware | GPU | NVIDIA GeForce RTX 4090
Training | Optimizer | SGD
Training | Learning rate | 1.0 × 10⁻²
Training | Weight decay | 5.0 × 10⁻⁴
Training | Momentum coefficient | 0.937
Table 2. Ablation experiment of 3L-YOLO. The symbol “√” denotes that an improvement module has been incorporated into the model.

C2f_SAConv | MSFCA_Neck | DCNv3_Dyhead | mAP@0.5 (%) | P (%) | R (%) | Params (M) | GFLOPs
 |  |  | 66.1 | 71.3 | 58.8 | 3.01 | 8.1
√ |  |  | 66.6 | 73.2 | 60.0 | 3.31 | 7.4
 | √ |  | 66.6 | 71.4 | 58.3 | 3.24 | 8.5
 |  | √ | 68.2 | 71.8 | 60.5 | 5.19 | 17.6
 |  |  | 68.1 | 68.0 | 63.1 | 3.97 | 12.0
√ | √ | √ | 68.8 | 76.0 | 59.3 | 5.89 | 16.6
Table 3. Performance of different models on the ExDark dataset. Bold values indicate the best results for each metric among low-light object detection algorithms.

Method | mAP@0.5 (%) | mAP@0.5:0.95 (%) | P (%) | R (%) | Params (M) | GFLOPs
YOLOv5n | 65.1 | 38.2 | 68.5 | 58.0 | 2.5 | 7.1
YOLOv7-tiny | 63.5 | 35.5 | 68.8 | 56.6 | 6.04 | 13.3
YOLOv8n | 66.1 | 39.6 | 71.3 | 58.8 | 3.01 | 8.1
Zero_DCE + YOLOv8n | 63.9 | 38.5 | 72.3 | 55.6 | 3.08 | 73.7
LOL-YOLO [30] | 68.1 | 42.3 | 70.9 | 62.5 | 5.66 | 20.6
3L-YOLO (Ours) | 68.8 | 42.0 | 76.0 | 59.3 | 5.89 | 16.6
Table 4. Comparison results of different models on the ExDark+ dataset.

Method | mAP@0.5 (%) | P (%) | R (%)
YOLOv5n | 58.0 | 63.3 | 53.9
YOLOv8n | 63.0 | 58.2 | 66.6
YOLOv7-tiny | 63.4 | 66.2 | 59.8
YOLOv5s | 64.5 | 59.5 | 64.5
3L-YOLO (Ours) | 67.3 | 70.1 | 61.8
Table 5. Comparing 3L-YOLO with YOLOv8n on the DARK FACE dataset.

Method | mAP@0.5 (%) | P (%) | R (%)
YOLOv8n | 45.6 | 70.5 | 40.2
3L-YOLO (Ours) | 47.0 | 70.1 | 41.9
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
