Article

Research on Low-Light Environment Object Detection Algorithm Based on YOLO_GD

College of Electronic Information and Artificial Intelligence, Shaanxi University of Science and Technology, Computer Science Department, Xi’an 710021, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(17), 3527; https://doi.org/10.3390/electronics13173527
Submission received: 21 June 2024 / Revised: 31 August 2024 / Accepted: 31 August 2024 / Published: 5 September 2024

Abstract

In low-light environments, the presence of numerous small, dense, and occluded objects limits the effectiveness of conventional object detection methods, which often fail to achieve desirable results. To address this, this paper proposes an efficient object detection network, YOLO_GD, designed for the precise detection of targets in low-light scenarios. Building on the YOLOv5s framework, the algorithm implements a cross-layer feature fusion method founded on an information gathering and distribution mechanism. This method mitigates the issue of information loss during inter-layer feature exchange, and on this basis a Bi-level routing spatial attention module is constructed to reduce the computational redundancy caused by the self-attention mechanism, thereby enhancing the model’s detection accuracy for small objects. Furthermore, through the introduction of a novel deformable convolution, a cross-stage local feature fusion module is established, enabling the model to capture the complex features of the input data more accurately and improving detection precision for dense objects. Lastly, the introduction of a probabilistic distance metric in the bounding box regression loss function enhances the network’s generalization capability, further increasing detection accuracy in occluded scenarios. Experimental results on the ExDark dataset demonstrate a 5.97% improvement in mean average precision (mAP) over YOLOv5, effectively enhancing object detection performance in low-light conditions.

1. Introduction

In the field of computer vision, object detection is regarded as one of the most challenging tasks. With the rapid advancement of deep learning, methods based on deep learning have come to dominate this area. Low-light object detection has a wide range of applications, including night-time surveillance, night-time driving assistance, nocturnal reconnaissance and attack by drones, and robotic operations during the night. Although traditional object detection methods have achieved satisfactory results on datasets with normal lighting conditions, their direct application in low-light scenarios still leads to issues such as missed detections and false positives. Furthermore, the presence of small, dense, and occluded objects further decreases the detection accuracy of models under low-light conditions. Therefore, designing an efficient and accurate method for low-light object detection is of particular importance.
In recent years, many researchers have applied the YOLO series of networks to object detection in low-light environments. In 2021, Ge et al. [1] proposed a low-light object detection model based on the Dark-YOLO network, which uses the CSPDarkNet-53 backbone to extract features from low-light images and introduces a path aggregation enhancement module to further strengthen feature representation; it improves detection of low-light objects but still suffers from missed detections of dense and small objects. Addressing the complexity of low-light environments and the scarcity of training data, Shu et al. [2] synthesized low-light images through gamma transformation and the addition of Gaussian noise to expand the dataset, improving the model’s generalization ability; incorporating a feature localization module in the network’s neck improved detection performance for low-light images, though the model’s average detection accuracy still required further enhancement. With the continual iteration of deep learning models, YOLOv6, v7, and v8 [3] have evolved rapidly, increasing the accuracy of low-light object detection but still encountering missed detections in occluded scenarios.
YOLOv5 offers an excellent balance of speed and efficiency, which is particularly important for object detection in low-light environments: low-light objects often require higher resolution and more complex feature extraction, placing higher demands on real-time processing. The YOLOv5 architecture is optimized to achieve fast inference while maintaining high accuracy, making it well suited to applications that require real-time processing. To address the challenges above, this paper proposes a low-light object detection method named YOLO_GD. First, building on the YOLOv5s framework, the algorithm designs a cross-layer information fusion method based on an information gathering and distribution mechanism to mitigate information loss during inter-layer feature interaction. On this basis, a Bi-level routing spatial attention module is constructed to reduce the computational redundancy caused by the self-attention mechanism, improving the detection of small objects. Second, a cross-stage local feature fusion module built on novel deformable convolutions enables the model to capture the complex features of the input data more accurately, enhancing detection precision for dense objects. Lastly, a probabilistic distance metric introduced into the bounding box regression loss function strengthens the model’s generalization capability, further improving detection accuracy in occluded scenarios. The method realizes end-to-end low-light object detection and has been evaluated empirically on the ExDark low-light object detection dataset, demonstrating high detection accuracy and effective detection performance.
In this paper, we propose YOLO_GD, which is an innovative approach for object detection in low-light environments. The key contributions include a cross-layer information fusion method based on an information gathering and distribution mechanism to mitigate inter-layer feature loss, a Bi-level routing spatial attention module to reduce computational redundancy and improve small object detection, novel deformable convolutions to enhance feature capture and dense object detection, and a probabilistic distance metric in the bounding box regression loss function to improve generalization and detection accuracy in occluded scenarios. Empirical tests on the ExDark dataset demonstrate high detection accuracy and effective performance.

2. Related Works

2.1. Object Detection Methods

Object detection methods are primarily divided into two-stage and one-stage detection methods. Currently, mainstream one-stage object detection methods are based on a regression concept, directly predicting category probabilities and bounding box offsets in images to obtain classification and regression results. For example, in 2020, Du et al. [4] proposed SpineNet, which improves the backbone network’s computational efficiency and accuracy through a learned scale-permuted architecture, further reducing model parameters and training time. Tan et al. [5] proposed EfficientDet, which uses EfficientNet [6] as its backbone network and introduces the weighted bi-directional feature pyramid network (BiFPN), achieving better detection precision and reduced computational requirements under resource constraints through a compound scaling method. In 2021, Chen et al. [7] introduced YOLOF, employing a dilated encoder and uniform matching to significantly enhance the model’s performance. Chen and Shah [8] studied object detection in low-light environments by first identifying the image enhancement algorithms most suitable for feature retrieval, then applying various object detection models to the enhanced images, and finally comparing the results using mean average precision (mAP) and suggesting directions for future research. In 2022, Cheng et al. [9] introduced Mask2Former, achieving new breakthroughs in image segmentation by combining Transformers with convolutional neural networks. Wang et al. [10] proposed the YOLOv7 algorithm, which incorporates additional data augmentation and training optimization methods, improving the model’s robustness. Wu et al. [11] proposed an edge-computing-driven, end-to-end framework for image enhancement and object detection in low-light environments, combining cloud-based image enhancement with edge-based object detection, which significantly improves detection performance with low latency on edge devices. In 2023, Hashmi et al. [12] proposed a novel module named FeatEnHancer, which optimizes low-light image representations by hierarchically combining multi-scale features using multi-headed attention guided by task-related loss functions, significantly enhancing performance across various low-light vision tasks. Wang et al. [13] proposed a detection model named DK_YOLOv5, which improves the YOLOv5 architecture, optimizes the low-light image enhancement algorithm, introduces the R-SPPF and C2f_SKA modules, and employs a decoupled detection head suited to low-light object detection, significantly enhancing detection accuracy in low-light scenarios. Qiu et al. [14] proposed IDOD-YOLOv7, an image-dehazing YOLOv7 framework for object detection in low-light, foggy traffic environments. In 2024, Lu et al. [15] proposed a framework named HDNet for detecting salient objects in low-light images, which combines a foreground highlight sub-network (HNet) and an appearance-aware detection sub-network (DNet); they also contributed the first annotated dataset for salient object detection in low-light images (SOD-LL).
Experimental results demonstrated the effectiveness and generalization ability of their method. Cui et al. [16] proposed a novel method that extends illumination-based enhancers into a scene decomposition module, utilizing the removed illumination to assist in extracting detection-friendly features. Additionally, they introduced a semantic aggregation module to integrate multi-scale scene-related semantic information, significantly improving object detection performance in low-light scenarios. Li et al. [17] proposed a parallel dual-stream backbone network named FISNet, which combines Swin Transformer and local convolution in a feature interaction structure (FIS) to effectively analyze, utilize, and integrate local and global information from low-light images. Yao et al. [18] proposed an end-to-end pipeline named LAR-YOLO, which leverages convolutional networks to extract image transformation parameters and applies the Retinex theory to enhance low-light image quality. By employing cross-domain learning to supplement the low-light model with knowledge from normal light scenarios, this approach significantly improves object detection accuracy in low-light environments. Zhen et al. [19] proposed a low-light target detection network named NLE-YOLO, based on YOLOv5, which enhances detection accuracy and performance in low-light environments through improved preprocessing techniques, the C2fLEFEM feature extraction module, the AMC2fLEFEM multi-scale feature extraction module, and the AMRFB attention mechanism receptive field module. However, under low-light conditions, both one-stage and two-stage detection methods’ performance can be significantly affected or even fail. Therefore, to enhance the accuracy of model detection, this paper proposes a low-light object detection algorithm, YOLO_GD.

2.2. Attention Mechanism

The attention mechanism simulates the human brain’s focus on key information, enabling networks to concentrate on relevant content and disregard irrelevant information. It plays a crucial role in tasks such as image classification, image segmentation, object detection, and image enhancement, improving model performance and robustness. For instance, in 2018, Hu et al. [20] proposed the SENet image classification model, which compresses the spatial dimensions of feature maps and learns attention weights through a multi-layer perceptron, emphasizing the channels that contribute most to the outcome and thus improving accuracy. In the same year, Woo et al. [21] introduced CBAM, which effectively combines channel and spatial attention to reweight input features for adaptive feature refinement. Coordinate attention, in turn, decomposes channel attention into two one-dimensional feature encoding processes that aggregate features along the two spatial directions. In 2021, Wu et al. [22] proposed FSOD-UP (universal-prototype augmentation for few-shot object detection), which exploits the knowledge of universal prototypes and applies channel attention to conditional universal prototypes and candidate boxes to improve the quality of candidate box generation and enhance detection performance. The introduction of these attention mechanisms provides an effective means of improving the performance and accuracy of object detection algorithms.

3. The YOLO_GD Network Model

Addressing the low detection accuracy caused by small, dense, and occluded objects in low-light environments, this paper presents an efficient object detection algorithm, YOLO_GD. The network is divided into four parts: Input, Backbone, Neck, and Head. The Backbone is composed of Conv, C3, and SPPF modules and extracts multi-scale deep feature representations from the input image. Different from the work of Qi et al. [23], the Neck takes these multi-scale features as input and employs a feature fusion method based on an information gathering and distribution mechanism, combined with a Bi-level routing spatial attention module, to align and fuse features before distributing them to the corresponding network layers. To enhance the model’s detection capability for multi-scale targets, this process is divided into two branches, Low-GD and High-GD, each consisting of a Feature Alignment Module (FAM) and an Integration Feature Module (IFM). Multi-scale feature maps from the Backbone are fed into Low-GD and High-GD to obtain the corresponding fused feature representations, which are further locally fused by a Cross-Stage Local Feature Fusion Module (SF_DCNv3). Finally, the fused features output by the Neck are passed to the Head, where the position and category information of the target bounding boxes are regressed through an anchor box loss function incorporating a minimum point distance mechanism. In summary, YOLO_GD builds on the YOLOv5s framework and enhances the detection of small objects and complex features through an information gathering and distribution mechanism, a Bi-level routing spatial attention module, and deformable convolutions, and it employs a minimum point distance loss function to improve accuracy in occluded scenarios. By contrast, Qi et al. optimized the attention learning mechanism in attention-based object detection through neural attention learning, using multi-head attention and improved feature extraction to handle complex backgrounds and occluded targets, and enhanced regression accuracy and detection robustness by optimizing traditional loss functions. The overall algorithm structure is illustrated in Figure 1.

3.1. The Bi-Level Routing Spatial Attention Module

Our module is different from the multi-head attention module improved by Qi et al. [24]. They employed a 1D convolutional neural network, utilizing multi-kernel temporal blocks (MKTBs) and a global refinement block (GRB) to effectively model multi-scale temporal features and global temporal features, improving the accuracy and efficiency of gesture recognition. In this paper, a Bi-level Routing Spatial Attention Module (BS_Attention) is constructed based on a dynamic sparse attention mechanism through a Bi-level routing method, addressing the challenge of detecting small objects in low-light environments. This includes the Bi-level Attention module [25] and the spatial attention module. Feature inputs sequentially pass through these two modules with residual connections introduced between the output of each module and the original input.
The Bi-level Routing Attention Module divides the input feature map $X \in \mathbb{R}^{H \times W \times C}$ into an $S \times S$ grid of non-overlapping regions. After this step, the original feature map is transformed into $X^r \in \mathbb{R}^{S^2 \times \frac{HW}{S^2} \times C}$, and through linear mappings it yields the corresponding matrices $Q$, $K$, $V$. This procedure can be represented by Equation (1):

$$Q = X^r W^q, \quad K = X^r W^k, \quad V = X^r W^v \tag{1}$$

Subsequently, for the matrices $Q$ and $K$, the mean value of the tokens within each of the $S \times S$ regions is computed and serves as the token representation of that region, resulting in the region-level matrices $Q^r$ and $K^r$. These matrices are then subjected to a transpose-and-multiplication operation to derive the corresponding adjacency matrix $A^r$. This methodology can be encapsulated in Equation (2):

$$A^r = Q^r (K^r)^T \tag{2}$$

For the adjacency matrix thus derived, the TopK operation is applied to ascertain the indices of the top $k$ regions with the highest degree of association for each region, yielding the index matrix $I^r$. The matrix $I^r$ is then used in a gather operation on the $K$ and $V$ matrices, producing the matrices $K^g$ and $V^g$, respectively. This procedure is formalized in Equation (3):

$$K^g = \mathrm{gather}(K, I^r), \quad V^g = \mathrm{gather}(V, I^r) \tag{3}$$

Finally, the matrices $Q$, $K^g$, $V^g$ are used as inputs for the attention computation, and a Local Context Enhancement (LCE) term [4] is introduced as an adjustment. The computation process can be encapsulated in Equation (4):

$$O = \mathrm{Attention}(Q, K^g, V^g) + \mathrm{LCE}(V) \tag{4}$$

The spatial attention module first applies a convolution operation to the output feature map $X \in \mathbb{R}^{H \times W \times C}$ of the Bi-level Routing Attention Module to generate its corresponding attention weight map $A$. The Softmax activation function is then used to ensure that the attention weights lie in the range 0 to 1 and sum to 1. The input feature map $X$ is then combined with the attention weight map $A$ through a weighted summation to yield the final spatially attended feature representation $Y$, as shown in Equation (5):

$$Y = \sum_{i=1}^{H} \sum_{j=1}^{W} A_{ij} X_{ij} \tag{5}$$
Overall, the Bi-level Routing Spatial Attention Module, upon receiving input feature information, initially filters out irrelevant key-value pairs at a coarse regional level. It then applies fine-grained dynamic attention within the union of the remaining candidate regions (the routing regions). This approach not only achieves the objective of reducing computational costs but also preserves detailed information, enhancing detection accuracy for small objects in low-light conditions. The structure of this module is illustrated in Figure 2.
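To make the data flow of Equations (1)–(5) concrete, the following is a minimal PyTorch sketch of a bi-level routing spatial attention block. The class name `BSAttentionSketch`, the region-grid size `s`, the `topk` value, the use of a depth-wise convolution for the LCE term, and the position-wise application of the spatial weights of Equation (5) with a residual connection are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BSAttentionSketch(nn.Module):
    def __init__(self, dim, s=8, topk=4):
        super().__init__()
        self.s, self.topk = s, topk
        self.qkv = nn.Linear(dim, dim * 3)                        # W^q, W^k, W^v as one projection
        self.lce = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # local context enhancement on V
        self.spatial = nn.Conv2d(dim, 1, 1)                       # spatial attention weights (Eq. 5)

    def _to_map(self, t, B, s, hs, ws, C):
        # (B, s*s, hs*ws, C) -> (B, C, H, W), inverting the region partition
        return t.reshape(B, s, s, hs, ws, C).permute(0, 5, 1, 3, 2, 4).reshape(B, C, s * hs, s * ws)

    def forward(self, x):                      # x: (B, C, H, W); H and W must be divisible by s
        B, C, H, W = x.shape
        s, hs, ws = self.s, H // self.s, W // self.s
        # Eq. (1): partition into s*s regions and project to Q, K, V
        xr = x.view(B, C, s, hs, s, ws).permute(0, 2, 4, 3, 5, 1).reshape(B, s * s, hs * ws, C)
        q, k, v = self.qkv(xr).chunk(3, dim=-1)
        # Eq. (2): region-level tokens and adjacency A^r = Q^r (K^r)^T
        ar = q.mean(2) @ k.mean(2).transpose(-1, -2)               # (B, s*s, s*s)
        # Eq. (3): keep the top-k most related regions and gather their K, V tokens
        idx = ar.topk(self.topk, dim=-1).indices                   # (B, s*s, k)
        def gather(t):
            t_exp = t.unsqueeze(1).expand(-1, s * s, -1, -1, -1)   # (B, s*s, s*s, hs*ws, C)
            i_exp = idx[..., None, None].expand(-1, -1, -1, hs * ws, C)
            return torch.gather(t_exp, 2, i_exp).flatten(2, 3)     # (B, s*s, k*hs*ws, C)
        kg, vg = gather(k), gather(v)
        # Eq. (4): fine-grained attention inside the routed regions, plus LCE(V)
        attn = F.softmax(q @ kg.transpose(-1, -2) / C ** 0.5, dim=-1)
        o = self._to_map(attn @ vg, B, s, hs, ws, C) + self.lce(self._to_map(v, B, s, hs, ws, C))
        # Eq. (5): softmax-normalised spatial weights applied to the attended feature map
        a = F.softmax(self.spatial(o).flatten(2), dim=-1).view(B, 1, H, W)
        return x + a * o                                           # residual connection around the module

# Example: a 128-channel, 40x40 feature map with an 8x8 region grid keeps its shape.
y = BSAttentionSketch(dim=128, s=8, topk=4)(torch.randn(1, 128, 40, 40))
print(y.shape)   # torch.Size([1, 128, 40, 40])
```

In this sketch, the coarse filtering is realised by the TopK selection over the region adjacency matrix, and the gathered key/value tokens restrict the fine-grained attention to the routed regions only, which is what keeps the computation sparse.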
To validate the effectiveness of the Bi-level Routing Spatial Attention Module, it is necessary to demonstrate the distribution of attention across different regions when processing input images. Therefore, the distribution of attention information can be visualized. The results are illustrated in Figure 3.
The areas of high intensity in red indicate that the network pays closer attention to these regions during processing. These areas typically contain information about the targets being detected. As shown in Figure 3, the Bi-level Routing Spatial Attention Module developed in this study exhibits a more concentrated focus on small targets.

3.2. Cross-Layer Feature Fusion Method Based on Information Aggregation and Distribution Mechanism

Our mechanism differs from that in the work of Qi et al. [26], which achieves efficient synthesis of unpaired medical images through a lightweight patch classifier and multi-head self-attention layers, based on a CycleGAN framework that includes two generators, GXY and GYX, and two discriminators, DX and DY, trained with adversarial and cycle-consistency losses. In this paper, a feature fusion method based on the Gather–Distribute mechanism [27] is designed for the network’s Neck, circumventing the issue that traditional Feature Pyramid Networks (FPNs) fuse features across different network layers only indirectly and iteratively during cross-layer information interaction. To enhance the model’s ability to detect targets of varying scales, the Neck is divided into two branches, Low-GD and High-GD. A Bi-level routing spatial attention module is introduced into each branch, and the resulting branches are named Low-BIFM and High-BIFM (as shown in Figure 4), addressing the large parameter count and prolonged training and inference time caused by traditional multi-head attention mechanisms.
The Low-BIFM consists of multiple layers of Reparameterized Convolution Modules (RepBlock) [28], the Bi-level Routing Spatial Attention Module, and a Feature Decomposition Module. Firstly, the Low-FAM module aligns the multi-scale features $B_2, B_3, B_4, B_5 \in \mathbb{R}^{H \times W \times C}$ output by the Backbone (for an input image size of $640 \times 640$, the sizes of $B_2, B_3, B_4, B_5$ are $160 \times 160 \times 64$, $80 \times 80 \times 128$, $40 \times 40 \times 256$, and $20 \times 20 \times 512$), producing $F_{align}$, whose channel number equals the sum of the channels of $B_2, B_3, B_4, B_5$. This process is represented by Equation (6):

$$F_{align} = \mathrm{Low\_FAM}(B_2, B_3, B_4, B_5) \tag{6}$$

Then, the multi-layer Reparameterized Convolution Modules combined with the Bi-level Routing Spatial Attention Module BS_Attention fuse the aligned features $F_{align}$ to obtain the fused feature $F_{fuse}$, whose channel number equals the sum of the channels of $B_4$ and $B_5$. This process is represented by Equation (7):

$$F_{fuse} = \mathrm{BS\_Attention}(\mathrm{RepBlock}(F_{align})) \tag{7}$$

Finally, the Feature Decomposition Module decomposes the fused feature $F_{fuse}$ into $F_{inj\_P3}$ and $F_{inj\_P4}$, which are then injected into the corresponding network levels by the Feature Injection Module. This process can be represented by Equation (8):

$$F_{inj\_P3}, F_{inj\_P4} = \mathrm{Split}(F_{fuse}) \tag{8}$$

High-BIFM consists of a Transformer module combined with the Bi-level Routing Spatial Attention Module and a Feature Decomposition Module. Firstly, the High-FAM module aligns the features $P_3, P_4 \in \mathbb{R}^{H \times W \times C}$ produced by the Low-BIFM module and the feature $P_5 \in \mathbb{R}^{H \times W \times C}$ from the Backbone (for an input image size of $640 \times 640$, $P_3, P_4, P_5$ have sizes $80 \times 80 \times 64$, $40 \times 40 \times 128$, and $20 \times 20 \times 256$), producing $F_{align}$, whose channel number equals the sum of the channels of $P_3, P_4, P_5$. This process is represented by Equation (9):

$$F_{align} = \mathrm{High\_FAM}(P_3, P_4, P_5) \tag{9}$$

Then, the Transformer module combined with BS_Attention fuses the aligned features $F_{align}$, and a $1 \times 1$ convolution reduces the output channel number to the sum of the channels of $P_4$ and $P_5$, yielding the fused feature $F_{fuse}$. This process is represented by Equation (10):

$$F_{fuse} = \mathrm{BS\_Attention}(F_{align}) \tag{10}$$

Finally, the Feature Decomposition Module decomposes $F_{fuse}$ into $F_{inj\_N4}$ and $F_{inj\_N5}$, which are then injected into the corresponding network levels by the Feature Injection Module. This process can be represented by Equation (11):

$$F_{inj\_N4}, F_{inj\_N5} = \mathrm{Split}(\mathrm{Conv}_{1 \times 1}(F_{fuse})) \tag{11}$$
The above feature sampling process is shown in Figure 5.
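As a reading aid for Equations (6)–(8), the sketch below shows how a Low-BIFM-style gather–distribute step could look in PyTorch: the backbone features are aligned to a common resolution, fused, and split into the two injected features. The class name, the choice of the B4 resolution as the alignment target, and the 1 × 1-convolution stand-in for the RepBlock + BS_Attention fusion are assumptions made for brevity, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowBIFMSketch(nn.Module):
    """Gather-distribute sketch for Eqs. (6)-(8): align -> fuse -> split."""
    def __init__(self, chs=(64, 128, 256, 512), align_hw=(40, 40)):
        super().__init__()
        self.align_hw = align_hw
        c_fuse = chs[2] + chs[3]                       # F_fuse channels = C(B4) + C(B5)
        self.fuse = nn.Sequential(                     # stand-in for RepBlock + BS_Attention
            nn.Conv2d(sum(chs), c_fuse, 1), nn.BatchNorm2d(c_fuse), nn.SiLU())
        self.split_sizes = (chs[2], chs[3])            # channel split for F_inj_P3 / F_inj_P4

    def forward(self, b2, b3, b4, b5):
        # Eq. (6): Low-FAM -- resize every level to one shared resolution, then concatenate
        aligned = [F.interpolate(b, self.align_hw, mode='bilinear', align_corners=False)
                   for b in (b2, b3, b4, b5)]
        f_align = torch.cat(aligned, dim=1)
        # Eq. (7): fuse the aligned features
        f_fuse = self.fuse(f_align)
        # Eq. (8): split F_fuse into the features injected at the P3 and P4 levels
        f_inj_p3, f_inj_p4 = torch.split(f_fuse, self.split_sizes, dim=1)
        return f_inj_p3, f_inj_p4

# Shapes for a 640x640 input, as in the text above:
b2, b3 = torch.randn(1, 64, 160, 160), torch.randn(1, 128, 80, 80)
b4, b5 = torch.randn(1, 256, 40, 40), torch.randn(1, 512, 20, 20)
p3, p4 = LowBIFMSketch()(b2, b3, b4, b5)
print(p3.shape, p4.shape)   # torch.Size([1, 256, 40, 40]) torch.Size([1, 512, 40, 40])
```

The High-BIFM branch in Equations (9)–(11) follows the same align–fuse–split pattern at the coarser P3–P5 resolutions, with a Transformer-style fusion block and a final 1 × 1 convolution before the split.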
To assess the necessity of employing Low-BIFM and High-BIFM, a set of comparative experiments is conducted using images containing multiple small target objects to evaluate the model’s detection performance in complex low-light environments. The comparative results are depicted in Figure 6.
From the comparison chart, it is evident that the feature fusion methods of Low-BIFM and High-BIFM demonstrate remarkable detection capabilities for small targets in low-light conditions. They are more suitable for detection tasks in complex low-light environments compared to FPN approaches.

3.3. Cross-Stage Local Feature Fusion Module

Inspired by Deformable Convolutional Networks version 3 (DCNv3) [29], this paper enhances the classic BottleNeck structure by introducing the DF_BottleNeck architecture. It addresses the shortcomings of traditional convolution employing fixed grid operations in complex scenarios characterized by dense and irregular target distributions, which fail to effectively capture the intricate features of input data.
We denote the number of channels in the input feature map as $C$. The DF_BottleNeck architecture initially employs a $1 \times 1$ convolution for feature compression, reducing the channel count of the feature map to $\frac{1}{2}C$, thereby effectively lowering the model’s parameter count. This is followed by feature extraction through a $3 \times 3$ convolution, where the deformable convolutional network version 3 (DCNv3) enables the model to more accurately capture the complex features of the input data. This process introduces a series of learnable sampling points, allowing the convolution kernel’s sampling locations to dynamically adjust according to the target’s shape, thus enhancing the network’s adaptability to various target forms. Furthermore, by incorporating a residual structure that adds the output features to the original input, issues such as gradient vanishing and explosion that may occur during model training are effectively mitigated. The structure is illustrated in Figure 7.
To augment the network’s depth and receptive field, thus boosting its capability for feature extraction, a cross-stage local feature fusion module, designed based on the DF_BottleNeck structure, has been developed and is designated as SF_DCNv3. This module utilizes a bifurcated architecture, with one branch incorporating a sequence of standard convolution structures and DF_BottleNeck layers, while the other branch is processed solely through standard convolutional structures. Subsequently, the outputs from both branches are merged through a concatenation operation. All the convolutional elements engaged in this procedure employ convolution kernels of size 1 × 1 , with stride and padding parameters set to 1 and 0, respectively. The architecture is depicted in Figure 8.
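A hedged sketch of the two structures is given below. Because DCNv3 ships with InternImage's custom operators, torchvision's DeformConv2d (a DCNv2-style deformable convolution) is used here as a readily available stand-in, and the class names are illustrative only.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DFBottleneckSketch(nn.Module):
    """DF_BottleNeck sketch: 1x1 compression, deformable 3x3 convolution, residual connection."""
    def __init__(self, c):
        super().__init__()
        c_half = c // 2
        self.reduce = nn.Conv2d(c, c_half, 1)                     # compress channels to C/2
        self.offset = nn.Conv2d(c_half, 2 * 3 * 3, 3, padding=1)  # learnable sampling offsets
        self.dcn = DeformConv2d(c_half, c_half, 3, padding=1)     # deformable 3x3 convolution
        self.expand = nn.Conv2d(c_half, c, 1)                     # restore the channel count
        self.act = nn.SiLU()

    def forward(self, x):
        y = self.act(self.reduce(x))
        y = self.act(self.dcn(y, self.offset(y)))                 # offsets adapt the sampling grid
        return x + self.expand(y)                                 # residual connection

class SFDCNv3Sketch(nn.Module):
    """Two-branch cross-stage fusion: one branch stacks DF_BottleNecks, the other is a plain 1x1 conv."""
    def __init__(self, c, n=1):
        super().__init__()
        c_half = c // 2
        self.branch_a = nn.Sequential(nn.Conv2d(c, c_half, 1),
                                      *[DFBottleneckSketch(c_half) for _ in range(n)])
        self.branch_b = nn.Conv2d(c, c_half, 1)
        self.merge = nn.Conv2d(c, c, 1)

    def forward(self, x):
        return self.merge(torch.cat([self.branch_a(x), self.branch_b(x)], dim=1))

x = torch.randn(1, 128, 40, 40)
print(SFDCNv3Sketch(128)(x).shape)   # torch.Size([1, 128, 40, 40])
```

The residual addition in the bottleneck keeps gradients well behaved, while the two-branch concatenation mirrors the cross-stage design described above.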
To verify the effectiveness of the SF_DCNv3 module in scenarios with dense targets, a series of comparative experiments is conducted, comparing the detection performance of a traditional Feature Pyramid Network (FPN) and the Cross-Stage Local Feature Fusion module proposed in this paper in complex low-light environments. The comparative results are illustrated in Figure 9.

3.4. Anchor Box Loss Function with Integrated Minimum Point Distance Mechanism

Our approach differs from the improved multi-head attention of Qi et al. [30], who use multi-head attention and improved feature extraction methods to handle complex backgrounds and occluded objects, enhancing regression accuracy and detection robustness through optimized traditional loss functions. This paper introduces a new anchor box loss function based on the minimum point distance mechanism (MPDLoss) to address missed detections caused by the heavy occlusion and overlap of multiple targets in low-light conditions. When two bounding boxes overlap, the traditional Intersection over Union (IoU) metric tends to produce a high score, which can mislead the model and hinder further improvement of detection accuracy. Inspired by the MPDIoU [31] loss function, this paper incorporates the concept of probabilistic distance measurement into the anchor box loss calculation, enabling a more accurate assessment of the discrepancy between predicted and ground-truth bounding boxes. Specifically, for two convex regions $A, B \subseteq S \subseteq \mathbb{R}^n$, with $w$ and $h$ denoting the width and height of the input image, let the top-left coordinates of $A$ and $B$ be $(x_1^A, y_1^A)$ and $(x_1^B, y_1^B)$ and the bottom-right coordinates be $(x_2^A, y_2^A)$ and $(x_2^B, y_2^B)$; the corresponding distance measures $d_1^2$ and $d_2^2$ are computed as in Equations (12) and (13):

$$d_1^2 = (x_1^B - x_1^A)^2 + (y_1^B - y_1^A)^2 \tag{12}$$

$$d_2^2 = (x_2^B - x_2^A)^2 + (y_2^B - y_2^A)^2 \tag{13}$$

The corresponding anchor box regression loss $L$ can then be derived from these distance measurements. This computation is formally represented in Equation (14):

$$L = 1 - \left( \frac{|A \cap B|}{|A \cup B|} - \frac{d_1^2}{w^2 + h^2} - \frac{d_2^2}{w^2 + h^2} \right) \tag{14}$$
The loss function employed in this work computes the loss value based on the maximum probabilistic distance between the predicted and true bounding boxes. Thus, even in cases of overlapping bounding boxes, if there are variations in their positions or shapes, the loss function is adept at detecting such differences and will allocate a higher loss value. This methodology significantly improves the detection precision for occluded objects.
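The following is a minimal sketch of the loss in Equations (12)–(14) for axis-aligned boxes given as (x1, y1, x2, y2); following MPDIoU [31], w and h are taken to be the width and height of the input image, and the epsilon guard and tensor shapes are assumptions.

```python
import torch

def mpd_box_loss(pred, target, img_w, img_h, eps=1e-7):
    """pred, target: (N, 4) boxes as (x1, y1, x2, y2); returns the per-box loss of Eq. (14)."""
    # IoU term: intersection over union of the two boxes
    inter_w = (torch.min(pred[:, 2], target[:, 2]) - torch.max(pred[:, 0], target[:, 0])).clamp(min=0)
    inter_h = (torch.min(pred[:, 3], target[:, 3]) - torch.max(pred[:, 1], target[:, 1])).clamp(min=0)
    inter = inter_w * inter_h
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    # Eqs. (12)-(13): squared distances between the top-left and bottom-right corner pairs
    d1 = (pred[:, 0] - target[:, 0]) ** 2 + (pred[:, 1] - target[:, 1]) ** 2
    d2 = (pred[:, 2] - target[:, 2]) ** 2 + (pred[:, 3] - target[:, 3]) ** 2
    # Eq. (14): subtract the corner penalties, normalised by the squared image diagonal
    mpd_iou = iou - d1 / (img_w ** 2 + img_h ** 2) - d2 / (img_w ** 2 + img_h ** 2)
    return 1.0 - mpd_iou

# Identical boxes give a loss close to 0; shifting one box raises the loss even at equal IoU.
boxes = torch.tensor([[10.0, 10.0, 50.0, 50.0]])
print(mpd_box_loss(boxes, boxes, img_w=640, img_h=640))
```

Because the corner-distance terms remain nonzero whenever the predicted box is displaced, overlapping but misaligned boxes still receive a higher loss than a perfectly aligned prediction with the same IoU.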

4. Experimental Results

4.1. Experimental Datasets

In this study, we conducted experiments on the low-light object detection dataset ExDark to evaluate the YOLO_GD method. The ExDark dataset comprises a total of 7363 images, which are divided into a training–validation set and a test set at a ratio of 8:2. From the training–validation set, 10% of the images are allocated as a validation set. Consequently, the dataset is segmented into a training set, a validation set, and a test set, containing 5301, 589, and 1473 images, respectively. The dataset annotates a total of 12 predefined categories, including Bicycle, Boat, Bottle, Bus, Car, Cat, Chair, Cup, Dog, Motorbike, People, and Table.
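For reproducibility, the 8:2 split followed by the 10% validation hold-out can be sketched as below; the random seed and file names are placeholders, not values taken from the paper.

```python
import random

random.seed(0)                                               # assumed seed; the paper does not state one
images = [f"exdark/img_{i:05d}.jpg" for i in range(7363)]    # hypothetical file names
random.shuffle(images)

n_test = round(len(images) * 0.2)                            # 20% held out for testing
test_set, trainval = images[:n_test], images[n_test:]
n_val = round(len(trainval) * 0.1)                           # 10% of the remainder for validation
val_set, train_set = trainval[:n_val], trainval[n_val:]
print(len(train_set), len(val_set), len(test_set))           # 5301 589 1473
```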

4.2. Experimental Environment and Parameter Settings

To validate the effectiveness of the proposed algorithm, this study conducts a comparative analysis with several existing widely used algorithms. The experiments are carried out in an Ubuntu 16.04 environment on a Tesla V100 GPU with 32 GB of VRAM. During the training process, the input image size is set to 640 × 640, and the batch size is configured at 64. The training employs an SGD optimizer, with an initial learning rate of 0.01 and a weight decay of 0.0005. The total number of training epochs is set to 100.
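Expressed with PyTorch's SGD optimizer, the stated settings look as follows; the momentum value is an assumption (YOLOv5's default), since the paper only lists the learning rate, weight decay, batch size, image size, and epoch count.

```python
import torch

model = torch.nn.Conv2d(3, 16, 3)            # placeholder standing in for the YOLO_GD network
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.01,         # initial learning rate stated above
                            momentum=0.937,  # assumed YOLOv5 default; not stated in the paper
                            weight_decay=0.0005)
IMG_SIZE, BATCH_SIZE, EPOCHS = 640, 64, 100  # input size, batch size, and training epochs
```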

4.3. Evaluation Metrics

This paper utilizes precision (P), parameter count (Params), and mean average precision (mAP) across all samples as evaluation metrics. Precision is defined as the proportion of correctly predicted positive samples out of all samples predicted as positive, as demonstrated in Equation (15):
$$P = \frac{TP}{TP + FP} \tag{15}$$
True Positives (TPs) are defined as instances where the predicted and true values both agree as positive cases. False Positives (FPs) refer to instances where the prediction is positive while the actual value is negative. The parameter count (Params) denotes the total number of learnable parameters, including weights and biases, within the algorithm. The volume of parameters serves as an indicator of the model’s complexity and its storage requirements.
Average precision (AP) is determined by the area under the precision–recall (P-R) curve, which is bordered by the coordinate axes. This metric assesses the detection accuracy across different classes within the dataset. The primary measure for evaluating overall detection performance is the mean average precision (mAP), which is calculated by averaging the AP values across all detected classes, as indicated in Equations (16) and (17). A higher mAP value signifies superior performance in object detection.
$$AP = \int_0^1 P(R)\, dR \tag{16}$$

$$mAP = \frac{1}{class\_num} \sum_{i=1}^{class\_num} \int_0^1 P_i(R)\, dR \tag{17}$$
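A compact sketch of Equations (15)–(17) is given below; the trapezoidal integration of the precision–recall curve and the toy PR points are illustrative assumptions.

```python
import numpy as np

def precision(tp, fp):
    """Eq. (15): fraction of predicted positives that are correct."""
    return tp / (tp + fp)

def average_precision(recall, prec):
    """Eq. (16): area under the precision-recall curve, via trapezoidal integration."""
    order = np.argsort(recall)
    return np.trapz(np.asarray(prec)[order], np.asarray(recall)[order])

def mean_average_precision(per_class_pr):
    """Eq. (17): mean of the per-class AP values."""
    return np.mean([average_precision(r, p) for r, p in per_class_pr])

# Toy example with two classes, each given as (recall points, precision points):
print(mean_average_precision([([0.0, 0.5, 1.0], [1.0, 0.8, 0.6]),
                              ([0.0, 0.5, 1.0], [1.0, 0.9, 0.7])]))
```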

4.4. Model Training Results

Figure 10 shows the loss of the improved model on the training and validation sets, where box_loss is the localization loss, indicating the error between the predicted box and the ground-truth box; dfl_loss is the distribution focal loss, measuring the difference between the predicted bounding-box distribution and the target distribution; and cls_loss is the classification loss, indicating whether each anchor box is assigned to the correct category.
From the training results in Figure 10, it can be seen that all types of loss curves tend to stabilize, and the model converges stably.

4.5. Quantitative Analysis

In this section, experiments were conducted using the ExDark dataset to evaluate the effectiveness of the YOLO_GD method and to perform an objective comparison and analysis with several current mainstream object detection methods in terms of evaluation metrics and visual results. Due to the prevalence of small, dense, and occluded targets in low-light environments, the detection accuracy of existing object detection methods is not optimal. The experimental results, as shown in Table 1, indicate that apart from YOLOv5, YOLOv7, and YOLOv8, the mAP scores of most object detection methods are below 70%. In contrast, the YOLO_GD method proposed in this paper outperforms other methods in detection performance. Compared to the high-accuracy YOLOv5, YOLOv7, and YOLOv8 methods, the improvements in mAP are, respectively, 5.97%, 5.35%, and 5.22%, and the precision (P) is enhanced by 0.33%, 0.98%, and 0.80%, respectively. The experimental outcomes demonstrate that the YOLO_GD method proposed in this paper achieves the best detection effect, validating its effectiveness and feasibility.

4.6. Qualitative Analysis

The comparative visualization results of the YOLO_GD, YOLOv3, YOLOv5, YOLOv7, and YOLOv8 methods are illustrated in Figure 11. The visualization selects images under varying low-light conditions, with detection boxes annotating the target category and confidence. In the first row, the methods in Figure 11a–d all miss detections because of the dense distribution of objects, which leads to inaccurate feature localization between the targets and the edges of the railings, so these methods fail to detect most of the bicycles; only the YOLO_GD method successfully detects the majority of targets. In the second row, due to occlusions between targets, the comparison methods either fail to detect the occluded person or, as in Figure 11a, detect the person but suffer from false detections; only the YOLO_GD method accurately detects all targets without false detections. In the third row, under conditions that are difficult even for the human eye to discern, the method in Figure 11a mistakenly identifies two targets as three because of their small size, and owing to the low distinction between targets and background, the methods in Figure 11b–d fail to detect the people in the distance; only the YOLO_GD method fully detects the targets in the image and accurately annotates the bounding boxes. Compared with the visualization results of the other methods, the YOLO_GD method demonstrates superior performance in low-light target detection, effectively addressing missed and false detections and improving detection precision and target localization accuracy.
Compared with the visualization results of the YOLOv5s method, the YOLO_GD method can more effectively complete low-light object detection tasks, significantly addressing issues of missed and false detections. This improvement enhances the model’s detection precision and the accuracy of target localization.

4.7. Ablation Experiments

Through comparative analysis of the YOLO_GD algorithm and the various improvement strategies, it is evident that all of the enhancement approaches improve the effectiveness of low-light object detection; in Table 2, √ indicates that the corresponding module is used. As presented in Table 2, compared with the original algorithm, Experiment 1 shows an increase of 4.55% in mean average precision (mAP), with a slight decrease of 0.07% in detection precision (P), indicating that the information gathering and distribution mechanism effectively addresses the information loss inherent in traditional feature pyramids. Experiments 2 and 3 both yield further improvements in mAP, with Experiment 3 exhibiting a greater increase in mAP than Experiment 2, alongside a 0.2% increase in detection precision (P). Building upon this, Experiments 4 and 5 demonstrate that the new low-light detection algorithm, YOLO_GD, surpasses the original YOLOv5s algorithm in both detection precision (P) and mean average precision (mAP), with respective increases of 0.33% and 5.97% over the baseline model. These results validate the effectiveness of the cross-stage local feature fusion module and MPDLoss, yielding superior detection outcomes.

4.8. Supplementary Experiments on the UAVDT Dataset

To validate the robustness and generalization ability of the model, Table 3 lists the performance of the proposed method and other comparative algorithms on the UAVDT dataset. The experimental results demonstrate that the proposed method achieves a mean average precision (mAP) of 48.7% on the UAVDT dataset, which is superior to other comparative methods in terms of average detection precision. In summary, the proposed method exhibits high detection accuracy and strong generalization ability.
To better analyze the generalization ability of the model in low-light environments, we randomly selected a subset of low-light images from the UAVDT dataset for detection. The results are shown in Figure 12.

5. Conclusions and Future Studies

Because general object detection algorithms suffer from low detection accuracy in low-light environments, this paper introduces a new low-light object detection algorithm, YOLO_GD, focusing on three major issues: the missed detection of small objects, the false detection of dense objects, and the missed detection of occluded objects in low-light conditions. Firstly, a cross-layer feature fusion method is designed to mitigate information loss during inter-layer feature interaction, and on this basis a Bi-level routing spatial attention module is constructed to reduce the computational redundancy brought about by the self-attention mechanism, enhancing the model’s detection accuracy for small objects. Secondly, a cross-stage local feature fusion module built on a novel deformable convolution enables the model to capture the complex features of input data more accurately and improves detection precision for dense objects. Lastly, the bounding box regression loss function is improved with a probabilistic distance metric, endowing the network with better generalization and further increasing detection accuracy in occluded scenarios. Compared with mainstream object detection methods on the ExDark dataset, YOLO_GD achieves higher detection precision in low-light scenarios, though its detection speed and its behaviour in real-world scenarios still leave room for optimization. Future work will investigate lightweight network structures and data sharing between modules to reduce parameters and computation and thereby improve detection speed, and will explore new modules and structures to further improve real-time performance and detection accuracy under more complex conditions, such as extremely low-light environments.

Author Contributions

Conceptualization, X.W. and Q.C.; methodology, X.W.; software, X.W.; validation, Y.W., H.C. and X.W.; formal analysis, J.L.; investigation, J.L.; resources, J.L.; data curation, X.W.; writing—original draft preparation, X.W.; writing—review and editing, X.W.; visualization, X.W.; supervision, J.L.; project administration, J.L.; funding acquisition, H.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Key project of National Natural Science Foundation of China, Grant Number 62306172. The fund sponsor is Haifeng Chen.

Data Availability Statement

The ExDark dataset used in this work is publicly available on the Internet.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  2. Shu, Z.; Zhang, Z.; Song, Z.; Wu, M.; Yuan, X. Low-Light Image Object Detection Based on Improved YOLOv5 Algorithm. Laser Optoelectron. Prog. 2023, 60, 67–74. [Google Scholar] [CrossRef]
  3. Liu, K.; Sun, Q.; Sun, D.; Peng, L.; Yang, M.; Wang, N. Underwater target detection based on improved YOLOv7. J. Mar. Sci. Eng. 2023, 11, 677. [Google Scholar] [CrossRef]
  4. Du, X.; Lin, T.Y.; Jin, P.; Ghiasi, G.; Tan, M.; Cui, Y.; Le, Q.V.; Song, X. Spinenet: Learning scale-permuted backbone for recognition and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11592–11601. [Google Scholar]
  5. Tan, M.X.; Pang, R.M.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; IEEE Computer Society Press: Los Alamitos, CA, USA, 2020; pp. 10778–10787. [Google Scholar]
  6. Tan, M.; Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; PMLR: New York, NY, USA, 2019; pp. 6105–6114. [Google Scholar]
  7. Chen, Q.; Wang, Y.M.; Yang, T.M.; Zhang, X.; Cheng, J.; Sun, J. You only look one-level feature. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; IEEE Computer Society Press: Los Alamitos, CA, USA, 2021; pp. 13034–13043. [Google Scholar]
  8. Chen, W.; Shah, T. Exploring low-light object detection techniques. arXiv 2021, arXiv:2107.14382. [Google Scholar]
  9. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1290–1299. [Google Scholar]
  10. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
  11. Wu, Y.; Guo, H.; Chakraborty, C.; Khosravi, M.R.; Berretti, S.; Wan, S. Edge computing driven low-light image dynamic enhancement for object detection. IEEE Trans. Netw. Sci. Eng. 2022, 10, 3086–3098. [Google Scholar] [CrossRef]
  12. Hashmi, K.A.; Kallempudi, G.; Stricker, D.; Afzal, M.Z. Featenhancer: Enhancing hierarchical features for object detection and beyond under low-light vision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 6725–6735. [Google Scholar]
  13. Wang, J.; Yang, P.; Liu, Y.; Shang, D.; Hui, X.; Song, J.; Chen, X. Research on improved yolov5 for low-light environment object detection. Electronics 2023, 12, 3089. [Google Scholar] [CrossRef]
  14. Qiu, Y.; Lu, Y.; Wang, Y.; Jiang, H. IDOD-YOLOV7: Image-dehazing YOLOV7 for object detection in low-light foggy traffic environments. Sensors 2023, 23, 1347. [Google Scholar] [CrossRef] [PubMed]
  15. Lu, X.; Yuan, Y.; Liu, X.; Wang, L.; Zhou, X.; Yang, Y. Low-Light Salient Object Detection by Learning to Highlight the Foreground Objects. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 7712–7724. [Google Scholar] [CrossRef]
  16. Cui, X.; Ma, L.; Ma, T.; Liu, J.; Fan, X.; Liu, R. Trash to treasure: Low-light object detection via decomposition-and-aggregation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 26–27 February 2024; Volume 38, pp. 1417–1425. [Google Scholar]
  17. Wen, M.A.I.; Hao, L.I.; Yan, K. Low-Light Object Detection Based on Feature Interaction Structure. J. Comput. Eng. Appl. 2024, 60. [Google Scholar]
  18. Yao, M.; Lu, Y.; Mou, J.; Yan, C.; Liu, D. End-to-end adaptive object detection with learnable Retinex for low-light city environment. Nondestruct. Test. Eval. 2024, 39, 142–163. [Google Scholar] [CrossRef]
  19. Peng, D.; Ding, W.; Zhen, T. A novel low light object detection method based on the YOLOv5 fusion feature enhancement. Sci. Rep. 2024, 14, 4486. [Google Scholar] [CrossRef] [PubMed]
  20. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; IEEE Computer Society Press: Los Alamitos, CA, USA, 2018; pp. 7132–7141. [Google Scholar]
  21. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; Springer: Heidelberg, Germany, 2018; pp. 3–19. [Google Scholar]
  22. Wu, A.; Han, Y.; Zhu, L.; Yang, Y. Universal-prototype augmentation for few-shot object detection. arXiv 2021, arXiv:2103.01077. [Google Scholar]
  23. Ge, C.; Song, Y.; Ma, C.; Qi, Y.; Luo, P. Rethinking attentive object detection via neural attention learning. IEEE Trans. Image Process. 2023, 33, 1726–1739. [Google Scholar] [CrossRef] [PubMed]
  24. Jiang, S.; Qi, Y.; Zhang, H.; Bai, Z.; Lu, X.; Wang, P. D3d: Dual 3-d convolutional network for real-time action recognition. IEEE Trans. Ind. Inform. 2020, 17, 4584–4593. [Google Scholar] [CrossRef]
  25. Zhu, L.; Wang, X.; Ke, Z.; Zhang, W.; Lau, R. BiFormer: Vision Transformer with Bi-level Routing Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 10323–10333. [Google Scholar]
  26. Phan, V.M.H.; Xie, Y.; Zhang, B.; Qi, Y.; Liao, Z.; Perperidis, A.; Phung, S.L.; Verjans, J.W.; To, M.-S. Structural Attention: Rethinking Transformer for Unpaired Medical Image Synthesis. arXiv 2024, arXiv:2406.18967. [Google Scholar]
  27. Wang, C.; He, W.; Nie, Y.; Guo, J.; Liu, C.; Han, K.; Wang, Y. Gold-YOLO: Efficient Object Detector via Gather-and-Distribute Mechanism. arXiv 2023, arXiv:2309.11331. [Google Scholar]
  28. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  29. Wang, W.; Dai, J.; Chen, Z.; Huang, Z.; Li, Z.; Zhu, X.; Hu, X.; Lu, T.; Lu, L.; Li, H.; et al. Internimage: Exploring large-scale vision foundation models with deformable convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
  30. Yi, Y.; Ni, F.; Ma, Y.; Zhu, X.; Qi, Y.; Qiu, R.; Zhao, S.; Li, F.; Wang, Y. High Performance Gesture Recognition via Effective and Efficient Temporal Modeling. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, Macao, 10–16 August 2019; pp. 1003–1009. [Google Scholar]
  31. Ma, S.; Xu, Y. MPDIoU: A Loss for Efficient and Accurate Bounding Box Regression. arXiv 2023, arXiv:2307.07662. [Google Scholar]
  32. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  33. Zhao, L.; Li, S. Object detection algorithm based on improved YOLOv3. Electronics 2020, 9, 537. [Google Scholar] [CrossRef]
  34. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Part I 14. Springer International Publishing: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
  35. Dai, J.; Li, Y.; He, K.; Sun, J. R-fcn: Object detection via region-based fully convolutional networks. Adv. Neural Inf. Process. Syst. 2016, 29. [Google Scholar]
  36. Li, C.; Yang, T.; Zhu, S.; Chen, C.; Guan, S. Density map guided object detection in aerial images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 13–19 June 2020; pp. 190–191. [Google Scholar]
Figure 1. YOLO_GD algorithm structure.
Figure 2. The Bi-level Routing Spatial Attention Module.
Figure 3. The attention distribution map.
Figure 4. The structure diagrams of Low-BIFM and High-BIFM.
Figure 5. Feature sampling process.
Figure 6. Comparison of results graph.
Figure 7. DF_BottleNeck structure diagram.
Figure 8. SF_DCNv3 structure diagram.
Figure 9. Comparison of results graph.
Figure 10. Loss situation.
Figure 11. Visualization comparison chart.
Figure 12. The detection results on the UAVDT dataset.
Table 1. Comparison of results from different detection algorithms.

| Model | Backbone | P | mAP | Parameters | FLOPs |
|---|---|---|---|---|---|
| Faster R-CNN [32] | ResNet-50 | 70.97% | 63.52% | 43.10 M | 251.40 G |
| EfficientDet [5] | Efficient-B0 | 66.73% | 60.04% | 77.0 M | 515.40 G |
| YOLOv3 [33] | Darknet-53 | 74.58% | 67.80% | 61.53 M | 67.80 G |
| YOLOv5 | DarkNet-53 | 78.17% | 71.05% | 7.02 M | 15.80 G |
| YOLOv7 [10] | ELAN | 77.52% | 71.67% | 6.79 M | 15.70 G |
| YOLOv8 | DarkNet-53 | 77.70% | 71.80% | 3.9 M | 10.10 G |
| SSD [34] | VGG-16 | 69.87% | 61.75% | 41.10 M | 387.00 G |
| Dark-YOLOv8 | DarkNet-53 | 77.90% | 72.10% | 8.53 M | 14.51 G |
| Dark-YOLO | CSPDarkNet-53 | 75.04% | 74.76% | 63.9 M | 60.9 G |
| Night-YOLOX | CSPDarkNet | 74.70% | 74.10% | 121.90 M | 148.26 G |
| YOLO_GD | DarkNet-53 | 78.50% | 77.02% | 6.29 M | 15.61 G |
Table 2. Comparison of detection performance.

| Model | Gather-Distribute | BS_Attention | SF_DCNv3 | MPDLoss | P | mAP |
|---|---|---|---|---|---|---|
| YOLOv5s | | | | | 78.17% | 71.05% |
| Experiment 1 | | | | | 78.10% | 75.60% |
| Experiment 2 | | | | | 78.30% | 76.40% |
| Experiment 3 | | | | | 77.70% | 76.50% |
| Experiment 4 | | | | | 77.75% | 76.57% |
| Experiment 5 | | | | | 78.50% | 77.02% |
Table 3. Comparison of results from different detection algorithms on the UAVDT dataset.

| Model | Input | Backbone | mAP |
|---|---|---|---|
| YOLOv8 | 640 × 640 | DarkNet-53 | 33.80% |
| YOLOv7 | 640 × 640 | ELAN | 26.80% |
| SSD | 512 × 512 | VGG-16 | 21.40% |
| R-FCN [35] | 600 × 100 | ResNet-101 | 17.50% |
| DMNet [36] | 600 × 1000 | ResNet50 | 24.60% |
| YOLO_GD | 640 × 640 | DarkNet-53 | 48.70% |