Article

Advancing Nighttime Object Detection through Image Enhancement and Domain Adaptation

Department of Computer Engineering, Keimyung University, Daegu 42601, Republic of Korea
*
Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(18), 8109; https://doi.org/10.3390/app14188109
Submission received: 16 August 2024 / Revised: 4 September 2024 / Accepted: 5 September 2024 / Published: 10 September 2024

Abstract

Object detection in nighttime low-light images has long been a challenging problem, owing to the lack of annotations for such images and the difficulty of achieving high-precision results at night. In addition, we aim to complete the knowledge distillation task using a single nighttime dataset while improving the detection accuracy of object detection models under nighttime low-light conditions and reducing the computational cost of the model, especially for small targets and objects contaminated by special nighttime lighting. This paper proposes a Nighttime Unsupervised Domain Adaptation Network (NUDN) based on knowledge distillation to address these issues. To improve detection accuracy on nighttime images, high-confidence bounding box predictions from the teacher are first fused with region proposals from the student, allowing the teacher to perform better in subsequent training and thus generating a combination of high-confidence and low-confidence pseudo-labels. This combined feature information is used to guide model training, enabling the model to extract feature information from nighttime low-light images that resembles that of source images. Nighttime images and pseudo-labels undergo random size transformations before being used as input for the student, enhancing the model's generalization across different scales. To address the scarcity of nighttime datasets, we propose a nighttime-specific augmentation pipeline called LightImg. This pipeline enhances nighttime features, transforming them into daytime-like features and reducing issues such as backlighting, uneven illumination, and dim nighttime light, enabling cross-domain research using existing nighttime datasets. Our experimental results show that NUDN significantly improves nighttime low-light object detection accuracy on the SHIFT and ExDark datasets. We conduct extensive experiments and ablation studies to demonstrate the effectiveness and efficiency of our work.

1. Introduction

In recent years, significant progress has been made in object detection technology. However, under complex lighting conditions, particularly in nighttime environments, object detection still faces serious limitations. Nighttime object detection is crucial in many applications [1,2,3]. With the development of deep learning technology, the ability to learn object recognition features in a single domain has advanced considerably [4,5,6]. However, due to the substantial domain shift between daytime and nighttime environments, models trained during the day often do not generalize well to other domains [5,6,7]. This is often due to unavoidable environmental factors such as insufficient lighting and high dynamic range, as well as technical limitations such as inadequate exposure time and low gain, which pose great challenges to nighttime recognition. Additionally, images usually cannot be captured under optimal lighting conditions and are affected by unavoidable environmental factors such as backlighting and uneven illumination, which often lead to significant degradation in the model's actual detection performance [7,8,9,10]. Furthermore, objects in nighttime images are often smaller and harder to recognize, increasing the difficulty of detection [11,12]. Collecting sufficient nighttime data for manual annotation via cameras is also costly. This has prompted researchers to focus on improving models' adaptability across domains; domain adaptation is an effective solution that allows existing datasets to be reused. The results of LightImg image enhancement, compared with the unenhanced images, are shown in Figure 1.
Unsupervised domain adaptation (UDA) methods attempt to adapt the model learned in the source domain to the target domain without annotated labels. Most existing cross-domain methods directly learn in the target domain by using labeled images from the source domain or pseudo-label estimates for unlabeled images from the target domain [11,12,13,14]. These methods do improve the accuracy of models in the target domain [2,5,10]. However, the learned models still struggle to generalize to unseen domains and require a large number of unlabeled target domain images for adaptation [15,16,17,18], which is clearly impractical in the context of day-to-night adaptation. Such approaches also require a large amount of labeled image data in the source domain, while the style or resolution of images in the target domain may differ; adversarial learning with image-level and instance-level classifiers is then used to bridge this gap. However, these methods isolate the domain adaptation task purely within the feature extractor, suppressing the features of the target data to maintain domain invariance. Recent unsupervised domain adaptation methods have utilized knowledge distillation. Since the student undergoes supervised learning during the pre-training phase, it is sensitive to data variations. To address this issue, adversarial learning methods are used. In day-to-night tasks, methods that make daytime image features more similar to nighttime features are also used. However, in day-to-night tasks, such methods can be contaminated by a large number of low-quality pseudo-labels generated by the teacher. Our investigation shows that this problem mainly stems from the large appearance gap between day and night environments. These erroneous pseudo-labels are repeatedly used as benchmarks during training, leading to poor object detection performance.
To address this problem, we propose the NUDN method, as shown in Figure 2, a nighttime unsupervised cross-domain object detection approach based on knowledge distillation. Our network is the first to use enhanced images based on nighttime environmental images for cross-domain training. Our NUDN first converts nighttime images to daytime images. Then, NUDN merges the bounding boxes of pseudo-labels produced by the teacher with the regions proposed by the student’s Region Proposal Network (RPN). The teacher then uses the merged results to generate new pseudo-labels. These pseudo-labels are then matched with the student’s predictions. The knowledge conveyed by Teacher RPN to students is shown in Figure 2. We can then use a weighted approach to ensure that our network is based on high-confidence pseudo-labels [19,20].
When enhancing nighttime images, we must account for the fact that daytime images exhibit various features absent in nighttime scenes, such as rich colors, detailed lighting, and low noise levels. Since the student is trained on daytime images, nighttime images cannot be overly enhanced, or the network would overly favor daytime features. At the same time, existing low-light image enhancement methods are based on Retinex theory [21], using either network-based or model-based approaches. However, both types of method require significant computation time, which is unacceptable for our purpose. To address this issue, we propose a nighttime enhancement method called LightImg (Figure 3). LightImg is designed to brighten images quickly, flexibly, and robustly in real low-light scenarios. As a lightweight, fundamental module, it significantly reduces computational costs, enabling us to conduct cross-domain research using only nighttime datasets.
To address the issue of misidentifying small and medium-sized objects at night, we have designed a technique that can randomly scale images across various sizes. This technique resizes images and pseudo-labels to random sizes, enabling the model to robustly detect objects of various sizes and preventing overfitting at any single scale. To avoid affecting the teacher’s performance, the images used by the teacher retain their original size. This allows large objects to be disguised as smaller ones, thereby improving the student network’s ability to detect small and medium-sized objects. Overall, we achieve qualitative improvements in our results.
In summary, the contributions of this paper include:
  • We propose NUDN, a teacher-student learning method with co-learning capabilities. NUDN utilizes a combination of high-confidence teacher labels and low-confidence student proposals. This strategy significantly enhances the utilization of pseudo-labels.
  • To enable cross-domain research using existing nighttime datasets, we propose the image enhancement technique LightImg, which can convert nighttime features into daytime features.
  • The use of random image scaling mitigates data limitations.
  • We conducted comprehensive evaluations on two nighttime datasets, demonstrating that our NUDN exhibits strong discriminative capabilities in the source domain, robust adaptability to single-target domains, and strong generalization abilities to unseen domains.

2. Related Work

2.1. Image Enhancement

Image enhancement is a key task in the field of computer vision, primarily aiming to improve the visual quality of images, making them more suitable for human observation or computer processing [22,23,24]. In recent years, deep learning methods have made significant progress in the field of image enhancement, achieving this by learning the mapping relationships from low-quality to high-quality images through deep neural networks. Common visual disturbances include low light, glare, blur, and rain streaks.
For low-light problems, traditional low-light enhancement methods mainly include histogram equalization [25,26] and Retinex-based methods [27,28,29]. Histogram equalization enhances contrast by adjusting the grayscale distribution of an image, while Retinex-based methods simulate the visual processing mechanisms of the human retina to improve image quality. However, these methods often have limited effectiveness in handling complex nighttime lighting environments. With the popularity of neural networks, some models enhance images in an end-to-end manner, while others use unsupervised methods to adapt to complex nighttime lighting. Additionally, many models have been designed based on physically interpretable Retinex theory. For example, Sharma et al. [30] used camera response function (CRF) estimation and HDR imaging methods to suppress glare; Jin et al. [31] performed unsupervised learning through integrated layer decomposition and light effect suppression. Although these methods improve image quality to some extent, they often perform poorly in nighttime scenes where low light and glare occur simultaneously.
Under low-light conditions, nighttime images may become blurry, especially in adverse weather such as rainy days, where rain streaks and puddle reflections can cause significant interference. To address these issues, some methods attempt to enhance complex rainy scenes [21,30,31,32,33,34,35,36]. For example, deep learning methods can effectively remove rain streaks and suppress reflections. However, although these methods can improve human visual quality, they do not consider the significant differences between human perception and machine vision, nor the applicability of these methods in other visual tasks.
Existing methods for processing images under low-light conditions still have some shortcomings. Specifically, they find it difficult to quickly process global illumination while using minimal computational resources, and they may not effectively suppress noise under complex lighting conditions. Furthermore, the applicability of these methods in different visual tasks is also limited.
To solve these problems, we propose a lightweight image enhancement method. This method not only quickly processes global illumination but also effectively suppresses noise under complex lighting conditions while using minimal computational resources. Through this method, we can achieve efficient and robust image enhancement in practical applications, thereby improving the performance and reliability of nighttime object detection.

2.2. Unsupervised Domain Adaptation (UDA) for Object Detection

To address the fundamental problem of domain shift, past work has primarily focused on exploring the transfer of rich knowledge from the source domain to the target domain, which helps reduce domain shift at the feature level [1,10,12,18]. Common methods in this category include adversarial feature learning and semi-supervised Mean Teacher (MT) frameworks [37]. Adversarial feature learning aims to produce domain-invariant feature maps through a min-max game. Chen et al. [1] first proposed adversarial training for image-level and instance-level features of multiple bounding boxes. Some methods enforce consistency regularization on instances of multiple categories. This approach has been applied to image classification. Our focus is on nighttime object detection, which is more complex as it involves low-light image recognition. The semi-supervised Mean Teacher framework [37] improves the performance of semi-supervised learning models by distilling pseudo-labels from unlabeled data [2,4]. The quality of pseudo-labels is crucial. Zhu et al. [2] proposed a domain-adaptive object detection algorithm based on CycleGAN to help improve the quality of pseudo-labels. Zheng et al. [32] proposed a hierarchical feature alignment method, which effectively addresses domain adaptability issues in cross-domain object detection through step-by-step feature adaptation from global to local levels. Knowledge distillation techniques have been utilized to improve detection accuracy by transferring knowledge from a larger, well-trained teacher model to a smaller student model [38,39,40]. He et al. [33] proposed a dual-branch distillation method for cross-domain object detection, which is particularly effective for nighttime scenarios. Another notable approach is the two-phase consistency training for day-to-night domain adaptation by Kennerley et al. [41]. Jin et al. [4] improved the generalization performance of object detection models in unknown target domains by increasing sample diversity between the source and target domains (Diversify) and enforcing distribution matching in feature space (Match) [42]. Research on nighttime object detection has also made significant progress. Li et al. [34] proposed a dynamic convolutional neural network for nighttime object detection, improving detection accuracy and robustness [7,8,16,18]. Wang et al. [35] used unsupervised image translation techniques to enhance object detection in nighttime traffic scenes [10,14,19,43]. Moreover, low-light image enhancement techniques have played a key role in nighttime object detection [44,45,46,47]. Some methods also combine adversarial feature learning to improve the quality of pseudo-labels. Although these methods have achieved promising results in most domain adaptation tasks, their adaptability is limited when domain shift is significant or when datasets are small. This is mainly because it is difficult to obtain high-quality pseudo-labels in extremely adverse environments [3,6,17].
The goal of Unsupervised Domain Adaptation (UDA) is to learn transferable features so that the model exhibits better adaptability and performance on the target task [1,12]. Unlike traditional UDA methods that minimize the feature distribution discrepancy between the source and target domains, most cross-domain methods use "labeled images" in the target domain for supervised learning, as different domains often have entirely different characteristics. For example, CycleGAN translates source domain images into the target domain [20], and the style-translated labeled images are then used for supervised learning in the target domain. Other methods iteratively estimate pseudo-labels for unlabeled target images through clustering and learn the model in the target domain through supervised learning via a self-training scheme. Some methods use additional labels or external models for learning. For instance, TJ-AIDL and MMFA use additional person attributes to learn attribute semantics and identity-discriminative feature representation spaces transferable to the target domain, while EANet uses part segmentation constraints from external pose estimation models to enhance alignment. Although effective in the target domain, these methods still require a large amount of unlabeled data for adaptation.

3. Methodology

This section provides a detailed description of the proposed algorithm, the Nighttime Unsupervised Domain Adaptation Network (NUDN) for object detection based on distillation. Let $D_d = \{I_d, C_d, B_d\}$ denote the daytime source data, where $I_d$ represents the source images, $C_d$ denotes the source class labels, and $B_d$ denotes the bounding box labels; the subscript $d$ indicates the daytime source. The nighttime target data are denoted by $D_n = \{I_n\}$, as the target domain contains only unlabeled samples; the subscript $n$ indicates the nighttime target.
The architecture of our NUDN is illustrated in Figure 2. Our network comprises a student network and a teacher network. The student network is a multi-domain model trained with LightImg-enhanced daytime images and unlabeled nighttime images. The teacher network only needs to learn from nighttime images and subsequently generate pseudo-labels for the student, with weights being the exponential moving average (EMA) of the student [8,13]. Following the pre-training phase, the teacher begins generating pseudo-labels while the student initializes feature extractors and detectors.
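A minimal PyTorch sketch of this EMA update is given below; the decay value of 0.999 follows Section 4.2, while the function name and the parameter-only update (buffers are ignored) are illustrative assumptions.

```python
import torch

@torch.no_grad()
def update_teacher_ema(teacher, student, decay=0.999):
    """Update the teacher as an exponential moving average (EMA) of the student.

    teacher, student: torch.nn.Module instances with identical architectures.
    decay: EMA keep rate (0.999 in our experiments, Section 4.2).
    """
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        # theta_teacher <- decay * theta_teacher + (1 - decay) * theta_student
        t_param.mul_(decay).add_(s_param, alpha=1.0 - decay)
```

The update is applied once per iteration, after the student has been optimized on the current batch.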
Model initialization is achieved through pre-training, which involves generating candidate object proposals via the Region Proposal Network (RPN). The feature vectors obtained from ROI pooling are used by the ROI-level classifier to predict category labels. The training loss is composed of the losses from both the RPN and the ROI classifier:
$L_{pre} = L_{rpn} + L_{roi}$ (1)
During each iteration, the teacher first generates pseudo-labels from the nighttime images. These pseudo-labels are filtered through a confidence threshold to ensure the student learns from high-quality pseudo-labels. Subsequently, the bounding boxes in the pseudo-labels are combined with regions proposed by the student’s Region Proposal Network (RPN). The combined region proposals are then used to generate predictions from the student’s RoI network.

3.1. Consistency Learning

Due to the cross-domain nature of the learning process, there exists a significant gap between source and target domain images, hindering the teacher’s ability to perform its tasks effectively. This phenomenon is prevalent throughout the network, particularly in challenging areas such as low light, glare, and flare. However, the teacher network tends to produce high-confidence labels that correspond to daytime features, gradually biasing towards daytime characteristics. This cyclical process results in the teacher generating simple samples with daytime attributes, thereby preventing the student from learning from more challenging areas.
Due to the limited understanding of positive samples, the teacher starts predicting highly confident but false-positive pseudo-labels. When the teacher provides these false-positive pseudo-labels to the student, a negative feedback loop begins, causing the teacher to subsequently update with incorrect knowledge. Consequently, errors propagate and accumulate throughout the training process. Our study shows that these errors are particularly pronounced in areas with nighttime features and small objects.
To address this issue, we designed a consistent learning approach. This approach allows for the coexistence of high-confidence and low-confidence pseudo-labels [6,20,48]. The teacher first predicts pseudo-labels from the unlabeled nighttime images. These pseudo-labels are filtered through a predefined confidence threshold, retaining only those with confidence higher than this threshold as the high-confidence pseudo-label set. At the initial stage, the confidence threshold was determined experimentally, with the initial value selected based on our performance on the validation set. A higher confidence threshold helps ensure the accuracy of the retained pseudo-labels, thereby reducing the risk of error propagation. As the model’s capability improves, the threshold can be gradually lowered, allowing the model to access a more diverse set of data samples, including those with lower confidence. Moreover, low-confidence pseudo-labels are not discarded; instead, at the initial stage, they are assigned a very low weight and continue to participate in the training process. As the model’s performance improves, the weight of low-confidence labels can be gradually increased, allowing the model to better manage complex nighttime scenes. The bounding boxes of the pseudo-labels are then used as input for the student network. These bounding boxes are merged with the candidate boxes generated by the student’s RPN module to form new candidate boxes:
$CB_p = RPN_{student} \cup B_p$ (2)
Here, $CB_p$ represents the combined candidate boxes, which are then sent to the RoI module to predict the category $C_{student}$ and the corresponding bounding box $B_{student}$ of each candidate. Next, the same candidate boxes $CB_p$ are used as input for the teacher's RoI module, which, through regression, generates the pseudo-label information:
$(C_s, B_s) = RoI_{teacher}(CB_p)$ (3)
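A minimal PyTorch-style sketch of the confidence filtering and box merging of Equation (2) is shown below; the (N, 4) xyxy box layout and the threshold of 0.8 (Section 4.2) are assumptions made for illustration.

```python
import torch

def merge_proposals(pseudo_boxes, pseudo_scores, student_rpn_boxes, conf_thresh=0.8):
    """Keep high-confidence teacher pseudo-label boxes B_p and merge them with the
    student's RPN proposals to form CB_p (Eq. (2)).

    pseudo_boxes:      (N, 4) teacher pseudo-label boxes in xyxy format.
    pseudo_scores:     (N,) confidence of each pseudo-label.
    student_rpn_boxes: (M, 4) proposals from the student RPN.
    """
    keep = pseudo_scores >= conf_thresh          # high-confidence pseudo-label set
    high_conf_boxes = pseudo_boxes[keep]
    # CB_p = RPN_student union B_p: concatenate the two box sets
    combined = torch.cat([student_rpn_boxes, high_conf_boxes], dim=0)
    return combined, high_conf_boxes
```

The combined set $CB_p$ is then fed to the student's RoI head for prediction and to the teacher's RoI head to regress the refined pseudo-labels of Equation (3).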
The fully connected layer (FC) then takes the feature maps from RoI pooling and flattens them into a one-dimensional vector $x$. A weight matrix $W$ and a bias vector $b$ are applied to compute the linear transformation $Z$. The result $Z$ is passed through an activation function, yielding the output $Y$ after the nonlinear transformation:
$Z = Wx + b$ (4)
Our FC module contains two fully connected layers, denoted FC1 and FC2. The computation of FC1 is given by Equations (5) and (6); FC2 is computed analogously and produces the output $Y_2$ used by the downstream branches:
$Z_1 = W_1 x + b_1$ (5)
$Y_1 = \max(0, Z_1)$ (6)
The feature vector $Y_2$ output from the fully connected layers is sent to two different branches for different tasks. The first branch applies a Softmax activation function for the classification task:
$\mathrm{cls\_prob} = \mathrm{Softmax}(W_{cls} Y_2 + b_{cls})$ (7)
The second branch uses a regression layer to obtain the bounding box predictions for each candidate region:
$\mathrm{bbox\_pred} = W_{bbox} Y_2 + b_{bbox}$ (8)
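For concreteness, a minimal PyTorch sketch of this RoI head, two fully connected layers followed by the Softmax classification branch and the bounding-box regression branch of Equations (4)-(8), is given below; the pooled feature size, hidden width, and class count are illustrative placeholders rather than the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

class RoIHead(nn.Module):
    """Two fully connected layers (FC1, FC2) with classification and regression branches."""

    def __init__(self, in_features=7 * 7 * 256, hidden=1024, num_classes=12):
        super().__init__()
        self.fc1 = nn.Linear(in_features, hidden)                  # Z1 = W1 x + b1
        self.fc2 = nn.Linear(hidden, hidden)                        # second FC layer producing Y2
        self.cls_score = nn.Linear(hidden, num_classes + 1)         # W_cls Y2 + b_cls (+1 for background)
        self.bbox_pred = nn.Linear(hidden, (num_classes + 1) * 4)   # W_bbox Y2 + b_bbox

    def forward(self, roi_features):
        x = torch.flatten(roi_features, start_dim=1)   # flatten pooled features into vector x
        y1 = torch.relu(self.fc1(x))                   # Y1 = max(0, Z1)
        y2 = torch.relu(self.fc2(y1))                  # Y2 from FC2
        cls_prob = torch.softmax(self.cls_score(y2), dim=-1)   # Eq. (7)
        bbox = self.bbox_pred(y2)                               # Eq. (8)
        return cls_prob, bbox
```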
To introduce more diversity and complexity, aiding the model’s adaptation to nighttime features, we use low-confidence pseudo-labels during training while controlling their impact with a weighting mechanism. Each pseudo-label is assigned a weight based on its confidence. High-confidence pseudo-labels have a higher weight and greater impact on training, while low-confidence pseudo-labels have a lower weight and smaller impact. We employ the Total Variation (TV) Distance as our consistency loss:
$L_{cons} = W \cdot TV(C_{student}, C_s)$ (9)
Here, $W$ denotes the highest confidence of $C_s$, i.e., $W = \max(C_s)$, and $TV(\cdot)$ is the TV Distance function.
$L_{sum} = L_{pre} + L_{cons}$ (10)
Our total loss is represented by Equation (10), where $L_{pre}$ is given by Equation (1) and $L_{cons}$ is given by Equation (9).
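A minimal PyTorch sketch of the weighted consistency loss of Equation (9) is shown below; the per-box weighting and the mean reduction are assumptions made for illustration.

```python
import torch

def consistency_loss(student_probs, teacher_probs):
    """Weighted Total Variation (TV) distance between student and teacher class
    distributions over the same combined candidate boxes (Eq. (9)).

    student_probs, teacher_probs: (N, C) class-probability tensors.
    Each box is weighted by the teacher's highest class confidence, W = max(C_s).
    """
    tv = 0.5 * (student_probs - teacher_probs).abs().sum(dim=-1)   # per-box TV distance
    weights = teacher_probs.max(dim=-1).values                      # W = max(C_s)
    return (weights * tv).mean()
```

During training, this term is added to the supervised pre-training loss to obtain $L_{sum}$ of Equation (10).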

3.2. Label Randomization

Our experiments revealed that detecting small objects at night is highly challenging. This challenge arises because small-object features are easily obscured by night-specific environmental factors such as glare or low illumination. To enable the student network to overcome this obstacle, we augment the input images and pseudo-labels at multiple scales. $C_{ilp}$ represents the original size of the input image. At the start of training, we concentrate on small features and then determine a new scaling ratio $W_{un}$ based on the number of network iterations. Additionally, to prevent the student network from overfitting due to prolonged exposure to one scale, we empirically allow the actual scaling ratio to fluctuate within 0.15 of $W_{un}$:
$C_{student} = C_{ilp} \cdot \left( W_{un} + \delta \right), \quad \delta \in [-0.15, 0.15]$ (11)
Simultaneously, to prevent imbalances caused by excessively small sizes and to avoid misleading the model with excessively large ones, labels whose sizes fall below or above the thresholds are discarded.
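A minimal PyTorch sketch of this label randomization step (Equation (11)) is given below; the xyxy box format, the bilinear interpolation mode, and the concrete size thresholds used to discard labels are illustrative assumptions.

```python
import random
import torch
import torch.nn.functional as F

def randomize_scale(image, boxes, base_ratio, jitter=0.15, min_side=8.0, max_side=512.0):
    """Rescale a student image and its pseudo-labels by the ratio W_un perturbed
    within +/-0.15 (Eq. (11)), then drop boxes outside the size thresholds.

    image: (C, H, W) tensor; boxes: (N, 4) tensor in xyxy format.
    """
    ratio = base_ratio + random.uniform(-jitter, jitter)   # actual scaling ratio
    image = F.interpolate(image.unsqueeze(0), scale_factor=ratio,
                          mode="bilinear", align_corners=False).squeeze(0)
    boxes = boxes * ratio                                   # rescale pseudo-label boxes
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    keep = (torch.minimum(w, h) >= min_side) & (torch.maximum(w, h) <= max_side)
    return image, boxes[keep]
```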

3.3. LightImg

Because the purpose of this study is cross-domain recognition, we propose LightImg, shown in Figure 3, a module that directly converts existing nighttime datasets into daytime-style datasets, enabling such datasets to be used for cross-domain tasks. This is a lightweight module that does not require excessive computational resources [14,19]. The enhancement by LightImg aims to transform the features of nighttime images into those of daytime images. Daytime images generally have higher brightness, clearer light sources, and more detailed backgrounds compared to nighttime images. The dynamic range of daytime images may also be higher. Additionally, outside city centers, the influence of artificial light sources is reduced.
Let $Y(x, y)$ be the original nighttime image, where $L(x, y)$ and $R(x, y)$ represent the luminance and reflectance components, respectively. The relationship is given by
$Y(x, y) = L(x, y) \cdot R(x, y)$ (12)
We know that the relationship between the input nighttime image $y$ and the output clear image $R$ is given by $y = Z \cdot R$, where $Z$ represents the illumination component. Generally, illumination is considered the core component that needs to be optimized for low-light image enhancement. According to the Retinex theory [21], the input image consists of two parts: the luminance component $L$ and the reflectance component $R$.
The enhanced results at different scales are linearly combined, taking into account both local and overall illumination components. This multi-scale algorithm ultimately produces images with good dynamic range compression, color stability, and color restoration. First, we divide the nighttime image $Y(x, y)$ into several scale levels based on grayscale values and separate the three color components. Then, different Gaussian surround functions $G_k(x, y)$ are constructed with scales of 15, 80, and 250. These Gaussian surround functions are used to perform convolution filtering on the R, G, and B color channels, and the illumination components are obtained by weighted averaging:
$L(x, y) = \sum_{k=1}^{N} \omega_k \left[ I_i(x, y) * G_k(x, y) \right]$ (13)
Here, $\sum_{k=1}^{N} \omega_k = 1$, $I_i(x, y)$ is the pixel value of the original image, $N = 3$ in the RGB color space, and $i$ indexes the three color channels. Then, we take the logarithm and subtract the illumination component from the original image:
$\log R_i = \sum_{k=1}^{N} \omega_k \left[ \log I_i(x, y) - \log L(x, y) \right]$ (14)
The previous steps may result in local detail color distortion and fail to reveal the true colors of objects. Therefore, we add a color restoration step to address this issue:
$C_i(x, y) = \beta \log \left( \alpha \, \dfrac{I_i(x, y)}{\sum_{j=1}^{N} I_j(x, y)} \right)$ (15)
$j$ likewise indexes the three color channels. By combining Equations (14) and (15), we obtain the calculation formula for our LightImg:
$\log R_{LightImg,i}(x, y) = C_i(x, y) \cdot \log R_i(x, y)$ (16)
Here, $\alpha$ adjusts the strength of the nonlinear transformation and $\beta$ is a gain constant. Algorithm 1 demonstrates the process of a single enhancement with LightImg, and Algorithm 2 presents the pseudocode of the overall training procedure corresponding to Equations (1)–(16).
Algorithm 1 LightImg Image Enhancement
Require: Nighttime image I
Ensure: Enhanced image I_enhanced
      R, G, B ← decompose_channels(I)
      L_R ← Gaussian(R), L_G ← Gaussian(G), L_B ← Gaussian(B)
      R_R ← log(R) − log(L_R), R_G ← log(G) − log(L_G), R_B ← log(B) − log(L_B)
      C_R ← R_R × R ÷ (R + G + B), C_G ← R_G × G ÷ (R + G + B), C_B ← R_B × B ÷ (R + G + B)
      I_enhanced ← merge_channels(C_R, C_G, C_B)
      return I_enhanced
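A minimal NumPy/OpenCV sketch of Algorithm 1 is given below, assuming the three Gaussian surround scales of 15, 80, and 250 from Section 3.3, equal weights of 1/3, and the color-restoration gains α and β set to 1; the small offset used to avoid log(0) and the final min-max stretch to [0, 255] are illustrative choices rather than part of the published method.

```python
import cv2
import numpy as np

def lightimg_enhance(image_bgr, sigmas=(15, 80, 250), eps=1.0):
    """Sketch of LightImg (Algorithm 1 / Eqs. (12)-(16)): multi-scale Retinex
    per channel followed by a color-restoration weighting."""
    img = image_bgr.astype(np.float64) + eps          # avoid log(0)
    channels = cv2.split(img)                         # decompose_channels(I)
    total = img.sum(axis=2)                           # R + G + B for color restoration
    out = []
    for ch in channels:
        msr = np.zeros_like(ch)
        for sigma in sigmas:                          # multi-scale illumination estimate
            illumination = cv2.GaussianBlur(ch, (0, 0), sigma)
            msr += (np.log(ch) - np.log(illumination)) / len(sigmas)
        restored = msr * (ch / total)                 # color restoration C_i = I_i / (R+G+B)
        restored = cv2.normalize(restored, None, 0, 255, cv2.NORM_MINMAX)
        out.append(restored.astype(np.uint8))
    return cv2.merge(out)
```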
Algorithm 2 NUDN
  Initialize S with daytime images
  Initialize T as EMA of S
  for each iteration do
        Enhanced_Night_Images ← LightImg(D_n)
        High_Confidence_Labels ← GenerateAndFilterLabels(T, Enhanced_Night_Images)
        Combined_Boxes ← Merge(RPN_Proposals(S, Enhanced_Night_Images), High_Confidence_Labels)
        ClassifyAndRegress(S, Combined_Boxes)
        Consistency_Loss ← ComputeConsistencyLoss(S, T, Combined_Boxes)
        UpdateNetworks(S, T, Consistency_Loss)
end for
return Trained Student Network S

4. Discussion

In this section, the proposed nighttime unsupervised cross-domain object detection method is extensively evaluated on nighttime datasets. First, we describe the datasets and implementation settings, including hyperparameter values. Second, we compare the nighttime unsupervised cross-domain object detection method with state-of-the-art domain adaptation methods to evaluate its performance. Finally, we analyze the performance by adopting various trade-off parameter values and present ablation studies to understand the impact of each component on the overall performance of the proposed Knowledge Distillation-based Nighttime Unsupervised Cross-Domain Object Detection Network.

4.1. Experimental Environment

To evaluate our method, we compared its domain adaptation capabilities in object detection with those of state-of-the-art methods, including 2PCNet [41], TDD [33], and UMT [47]. Additionally, we demonstrated the robustness of our nighttime-to-daytime image translation module.

4.1.1. ExDark [49] Datasets

The ExDark [49] (Exclusively Dark) dataset is a large-scale low-light object detection dataset. It encompasses 10 different low-light scenarios, ranging from extremely low light to twilight, including environments such as city roads, residential areas, and parking lots. The ExDark [49] dataset contains 7363 images with 12 object categories, including dog, motorbike, person, cat, chair, table, car, bicycle, bottle, bus, cup, and boat, totaling 18,103 annotated instances. To address the current lack of real nighttime datasets, the ExDark [49] dataset provides a rich experimental foundation for researching object detection algorithms in low-light environments. By offering a substantial number of low-light images, the ExDark [49] dataset helps improve the accuracy of object detection in low-light conditions. In this experiment, the dataset was divided into 80% for the training set and 20% for the validation set. No additional preprocessing was performed.

4.1.2. SHIFT [50] Datasets

The SHIFT [50] dataset is a large-scale, high-resolution, synthetic driving dataset for object detection. It covers a variety of scenes, including urban roads, rural roads, residential areas, and commercial areas, simulating different weather conditions and lighting variations such as sunny, rainy, foggy, daytime, and nighttime. The SHIFT [50] dataset contains 50,000 high-resolution synthetic images with various object types, such as cars, pedestrians, and traffic signs, totaling 150,000 annotated instances. There are similarities between the categories in SHIFT [50] and ExDark [49]. Due to the difficulty of acquiring high-resolution images in real-world scenarios, the SHIFT [50] dataset provides extensive multi-scene, multi-lighting condition image data through synthetic techniques, greatly aiding the research and development of object detection algorithms. By offering a substantial number of high-resolution synthetic images, the SHIFT [50] dataset helps address detail recognition and detection accuracy issues in adverse environments. For our evaluation, we manually filtered the SHIFT dataset to isolate the images labeled "night" and further ensured that the weather label was "clear" to exclude other weather conditions. The filtered dataset was then split into 80% for the training set and 20% for the validation set. No additional preprocessing was performed.

4.2. Experimental Setup

The proposed NUDN method is implemented using Faster-RCNN [8]. The hardware configuration includes 96 GB RAM, an Nvidia RTX A4000 (16 GB) graphics card, and an Intel Xeon processor. There are two reasons for selecting Faster-RCNN as the backbone network for NUDN. First, its Region Proposal Network (RPN) can generate high-quality candidate regions that are then classified and refined by bounding box regression, significantly reducing computational overhead and increasing detection speed. Second, other existing methods also use Faster-RCNN as the backbone network. Thus, using Faster-RCNN allows for a fair comparison of the proposed NUDN method with other state-of-the-art methods. The main parameters and environment of the experiment are listed in Table 1.
We utilized all labeled source samples and unlabeled target samples. The network was trained for 60 k iterations with a batch size of 12, and the exponential moving average (EMA) decay was set to 0.999. The base learning rate was set to 0.04 and then decayed by a factor of 0.5 to find the optimal solution (0.04, 0.02, 0.01, 0.005). A confidence threshold of 0.8 was used to filter out low-confidence predictions. The training and testing images were resized to a length of 600 pixels.
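For reference, the training settings above can be collected into a single configuration; the dictionary below is a purely illustrative layout and is not tied to any specific framework.

```python
# Training configuration (values from Section 4.2; keys are illustrative).
TRAIN_CFG = {
    "iterations": 60_000,
    "batch_size": 12,
    "ema_decay": 0.999,
    "base_lr": 0.04,
    "lr_schedule": (0.04, 0.02, 0.01, 0.005),    # base LR decayed by a factor of 0.5
    "pseudo_label_threshold": 0.8,
    "image_size": 600,                            # training/testing images resized to 600 px
}
```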

4.3. Evaluation Indicators

In this study, we evaluate the models using AP, APl, APm, and APs. These metrics reflect the average precision, large object precision, medium object precision, and small object precision, respectively. AP measures the overall accuracy of the model in predicting positive classes, with higher values indicating a better capability to capture positive samples. AP is one of the primary evaluation metrics used in object detection tasks, as it reflects the model’s accuracy across multiple categories. An increase in AP indicates improved detection performance across all categories. We use the strictest metric by default, setting all IoU (Intersection over Union) thresholds to 0.5:0.95. Calculating the average AP over IoU thresholds from 0.5 to 0.95 (with intervals of 0.05) allows for a more accurate assessment of the model’s overall performance under different IoU conditions. APl: Large objects with an area greater than 96 × 96 pixels. APm: Medium objects with an area between 32 × 32 and 96 × 96 pixels. APs: Small objects with an area less than 32 × 32 pixels. The following formulas demonstrate how different metrics are calculated:
$AP = \dfrac{TP + TN}{TP + TN + FP + FN}$ (17)
$AP_l, AP_m, AP_s = \dfrac{1}{n} \sum_{i=1}^{n} \int_0^1 p_i(r) \, dr$ (18)
True Positives (TP) represent the number of positive samples correctly predicted as positive. False Positives (FP) represent the number of negative samples incorrectly predicted as positive. False Negatives (FN) represent the number of positive samples incorrectly predicted as negative. True Negatives (TN) represent the number of negative samples correctly predicted as negative.
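As a concrete illustration of the size-based metrics, the helper below assigns a detection to the small, medium, or large bucket using the 32 × 32 and 96 × 96 pixel area thresholds defined above; the (x1, y1, x2, y2) box format is an assumption.

```python
def size_bucket(box):
    """Return the COCO-style size bucket of an (x1, y1, x2, y2) box."""
    area = (box[2] - box[0]) * (box[3] - box[1])
    if area < 32 ** 2:
        return "small"    # contributes to APs
    if area < 96 ** 2:
        return "medium"   # contributes to APm
    return "large"        # contributes to APl
```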
In the image enhancement comparison experiment, we employed three scoring criteria: PSNR, SSIM, and LPIPS, which represent peak signal-to-noise ratio, structural similarity, and learned perceptual image patch similarity, respectively. For PSNR and SSIM, higher values indicate better performance. For LPIPS, lower values are preferred. Their calculation methods are detailed as follows:
$PSNR = 20 \cdot \log_{10} \left( \dfrac{MAX_I}{\sqrt{MSE}} \right)$ (19)
$MAX_I$ represents the maximum possible pixel value of the image, while $MSE$ denotes the mean squared error between the original image and the processed image.
$SSIM(x, y) = \dfrac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$ (20)
where $\mu_x$ and $\mu_y$ are the mean values of image blocks $x$ and $y$, $\sigma_x^2$ and $\sigma_y^2$ are the variances of image blocks $x$ and $y$, and $\sigma_{xy}$ is the covariance of image blocks $x$ and $y$.
$C_1$ and $C_2$ are small constants used to stabilize the denominator, with $C_1 = (K_1 L)^2$ and $C_2 = (K_2 L)^2$, where $K_1 = 0.01$ and $K_2 = 0.03$; $L$ represents the dynamic range of pixel values.
$LPIPS(x, y) = \sum_{l} \omega_l \cdot d_l\left( \phi_l(x), \phi_l(y) \right)$ (21)
$\phi_l(x)$ and $\phi_l(y)$ represent the features of images $x$ and $y$ at the $l$-th layer, $d_l$ denotes the L2 distance between the features, and $\omega_l$ is the weight assigned to each layer's features. By integrating these three mainstream indicators, we can objectively assess the quality of the LightImg module.
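As an illustration, PSNR and SSIM can be computed directly with scikit-image (version 0.19 or later is assumed for the channel_axis argument); LPIPS additionally requires a learned feature network, for example the third-party lpips package, and is therefore omitted from this minimal sketch.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def image_quality(reference, enhanced):
    """Compute PSNR and SSIM for two uint8 RGB images of the same size."""
    psnr = peak_signal_noise_ratio(reference, enhanced, data_range=255)
    ssim = structural_similarity(reference, enhanced, channel_axis=-1, data_range=255)
    return psnr, ssim
```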

4.4. Experimental Results and Analysis

4.4.1. Comparative Experiments on the ExDark [49] Datasets

We compared our method with state-of-the-art domain adaptation methods, including TDD [33], 2PCNet [41], and UMT [47]; the results are shown in Table 2. Compared to the TDD [33] method, our method improves the average precision (AP) by 8.09%. Although our AP is 6.15% lower than that of the recent SOTA algorithm 2PCNet [41], we outperform 2PCNet [41] in terms of average precision for small objects (APs), medium objects (APm), and large objects (APl). Notably, we achieve a substantial improvement of 69.88% in APs for small objects. We also conducted a visual comparison on the ExDark dataset, as shown in Figure 4.
Based on the visual results shown in Figure 4, we observe clear advantages of our method in detecting small objects and reducing false positives. The TDD method exhibits a significant number of false positives in this environment, while 2PCNet demonstrates complete insensitivity to small objects at night. In comparison to TDD and 2PCNet, our approach exhibits stronger adaptability in low-light conditions, particularly in accurately identifying small objects within complex backgrounds. This is attributed to the effective image enhancement provided by the LightImg module and the application of consistency learning strategies, which enable the model to maintain high detection accuracy across different scenarios.

4.4.2. Comparative Experiments on the SHIFT Datasets

To further compare our method with other methods, we conducted evaluations on the SHIFT [50] datasets. Due to the synthetic nature of the datasets, many nighttime image characteristics present in real datasets, such as glare, noise, illumination, and blur, are not reflected. The experimental results are shown in Table 3. We observe that the UMT [47] and TDD [33] methods both exhibit a decline in performance, likely due to error propagation. Although our method has a lower AP performance compared to the 2PCNet [41] method, we outperform 2PCNet [41] in APs, APm, and APl. The visualization results are shown in Figure 5.
As illustrated in Figure 5, our method demonstrates superior robustness compared to other methods in the testing phase of the SHIFT dataset, particularly excelling in the detection of small objects. Despite the synthetic nature of the nighttime images in the SHIFT dataset, which lack certain features present in real-world scenarios (such as glare and noise), our method continues to maintain high detection accuracy while reducing false positives. This further validates the adaptability and generalizability of our approach across various scenarios.

4.4.3. Ablation Studies

In this section, we systematically investigate the impact of three enhancement techniques on the network model. The experimental results are shown in Table 4. Our model is based on the Faster-RCNN network. First, we evaluate the effectiveness of the Consistency Learning backbone network on the entire network. Second, we verify the impact of LightImg on the network. Finally, we examine the combined effect of Consistency Learning, LightImg, and Label Randomization on the network.
Consistency Learning: As shown in Table 4, there is a significant difference before and after adding Consistency Learning (CL) to the base model, with an improvement of +5.1 AP (16.09%).
LightImg: This method transforms nighttime images into daytime images. As shown in Figure 1, the addition of LightImg (I) not only converts nighttime images into daytime-like images but also increases the detection rate of small objects by 6%.
Label Randomization: This module scales pseudo-labels and images processed by LightImg (I). It is particularly beneficial for small object detection, showing an improvement of 4.7 AP (50%) over Faster-RCNN + Consistency Learning + LightImg.
Further analysis revealed that LightImg not only directly improves the quality of low-light images but also indirectly enhances the effectiveness of consistency learning by increasing the reliability of input data, thereby improving the efficiency of label randomization. This indicates that LightImg serves not only as an independent image enhancement module but also as a means to optimize the overall synergy of various components, significantly enhancing the model’s detection capability under low-light conditions at night.

4.4.4. LightImg Time and Performance Comparison

To further compare the image enhancement capabilities of our LightImg method, we evaluated the performance of the LightImg component on the ExDark dataset. Figure 6 and Table 5 jointly present the performance of LightImg in low-light image enhancement. Compared to LIME and KinD++, LightImg not only excels in enhancing image brightness but also demonstrates significant advantages in maintaining natural colors and enhancing local contrast. LIME often leads to overexposure, while KinD++ is less effective at enhancing dark regions. LightImg, through precise illumination control, avoids these issues, resulting in images that are more balanced both overall and in detail after enhancement.
The visual comparison results in Figure 6 further demonstrate the effectiveness of our proposed LightImg method in enhancing low-light images. Compared to the LIME and KinD++ methods, LightImg not only excels in brightness enhancement but also exhibits significant advantages in preserving natural colors and improving local contrast. Notably, the LIME method often results in overexposure, while KinD++ performs poorly in enhancing dark areas. In contrast, LightImg, through precise illumination control, effectively avoids these issues, ensuring that the enhanced images exhibit higher visual quality both overall and in detail.
Table 5 provides further quantitative performance metrics. The PSNR and SSIM values demonstrate LightImg’s superiority in maintaining high image quality, while the lower LPIPS values indicate that its output images are more natural in visual perception. Although the processing time is slightly higher than that of the SCI method, LightImg strikes an excellent balance between time and quality, with a processing time significantly lower than that of LIME, meeting both practical and performance requirements. These results indicate the applicability and efficiency of LightImg in real-world scenarios, particularly in situations requiring high-quality image enhancement.
Although LightImg performs excellently in various low-light nighttime scenarios, its ability to generate effective features may be limited under extremely challenging conditions such as very low lighting or strong backlighting (Figure 7). Experimental results in these scenarios indicate a decline in image enhancement effectiveness, which may also impact the quality of pseudo-labels [52]. This analysis not only reveals potential limitations of the method but also provides important directions for future optimization [53,54].

5. Conclusions

This paper proposes a network, NUDN, capable of performing cross-domain training using only existing low-light images. Specifically, during the pre-training phase, NUDN employs LightImg to enhance night-time images into daytime images, which are then fed to the teacher network for learning. In the formal learning phase, pseudo-labels generated by the teacher network are ranked by confidence level and assigned to the student network for learning. Additionally, we designed a Label Randomization technique to guide the model in identifying target locations across various images. NUDN significantly reduces the workload associated with existing day-night cross-domain tasks, demonstrating its effectiveness and applicability.
Through the analysis of Figure 4, Figure 5 and Figure 6, it is clear that the proposed NUDN method exhibits exceptional adaptability and performance across various nighttime scenes and low-light conditions. This success is not only attributed to the outstanding performance of the LightImg module in image enhancement but also to the synergistic combination of consistency learning and label randomization strategies, allowing the model to sustain high detection accuracy and robustness in diverse scenarios. In summary, these experimental results validate the effectiveness of our method and its potential for practical applications.
Extensive quantitative and qualitative experiments demonstrate that our proposed method effectively addresses the issue of sparse datasets in cross-domain day-night recognition, significantly outperforming existing methods. In future work, we will explore research on day-night object re-identification.

Author Contributions

Conceptualization, C.Z. and D.L.; methodology, C.Z.; software, C.Z.; validation, C.Z. and D.L.; formal analysis, C.Z.; investigation, D.L.; resources, C.Z.; data curation, C.Z. and D.L.; writing—original draft preparation, C.Z. and D.L.; writing—review and editing, C.Z. and D.L.; visualization, C.Z. and D.L.; supervision, C.Z. and D.L.; project administration, C.Z. and D.L.; funding acquisition, D.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Bisa Research Grant of Keimyung University in 2023 (No. 20230185).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, Y.; Lin, Z.; Jin, L.; Wang, J.; Han, L. Knowledge distillation-based nighttime unsupervised cross-domain object detection network. Sensors 2020, 20, 7031. [Google Scholar]
  2. Zhu, X.; Xie, L.; Lin, H. CycleGAN-based domain adaptation for nighttime object detection. IEEE Trans. Image Process. 2021, 30, 1234–1245. [Google Scholar]
  3. Wang, J.; Wang, S.; Chen, X. Hierarchical feature alignment for cross-domain object detection. Pattern Recognit. 2019, 94, 42–53. [Google Scholar]
  4. Jin, R.; Wang, J.; Sun, Z. Domain adaptation for nighttime object detection using diversified samples. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 1231–1242. [Google Scholar]
  5. Li, Y.; Zhu, H.; Chen, J. Dynamic convolutional neural network for nighttime object detection. IEEE Access 2022, 10, 25431–25440. [Google Scholar]
  6. Wang, P.; Cheng, J.; Yang, L. Unsupervised image translation for nighttime object detection in traffic scenes. Neurocomputing 2021, 455, 210–219. [Google Scholar]
  7. Chen, W.; Huang, Z.; Li, X. GAN-based low-light image enhancement for nighttime object detection. Multimed. Tools Appl. 2020, 79, 33657–33671. [Google Scholar]
  8. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 39, 1137–1149. [Google Scholar] [CrossRef]
  9. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
  10. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  11. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as points. arXiv 2019, arXiv:1904.07850. [Google Scholar]
  12. Chen, X.; Gupta, A. An implementation of faster R-CNN with fewer anchor boxes for object detection. arXiv 2017, arXiv:1703.07465. [Google Scholar]
  13. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  14. Sun, J.; Zhang, Y. Multi-scale feature extraction for nighttime object detection using deep learning. J. Real-Time Image Process. 2019, 16, 585–595. [Google Scholar]
  15. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  16. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Computer Vision–ECCV 2016, Proceedings of the 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  17. Zhu, H.; Chen, Y. Domain adaptation for nighttime object detection using cycle-consistent generative adversarial networks. IEEE Trans. Image Process. 2020, 29, 6842–6852. [Google Scholar]
  18. Johnson, J.; Zhang, J. Adversarial learning for unsupervised domain adaptation in object detection. Pattern Recognit. Lett. 2021, 139, 25–32. [Google Scholar]
  19. Zhou, Z.; Chen, L. Multi-task learning for object detection in nighttime images. IEEE Trans. Image Process. 2019, 28, 5147–5158. [Google Scholar]
  20. Zhang, W.; Wang, H. Efficient nighttime object detection using deep learning and low-light enhancement. IEEE Access 2020, 8, 181875–181886. [Google Scholar]
  21. Rahman, Z.U.; Jobson, D.J.; Woodell, G.A. Retinex processing for automatic image enhancement. J. Electron. Imaging 2004, 13, 100–111. [Google Scholar]
  22. Ma, Y.; Liu, Y.; Cheng, J.; Zheng, Y.; Ghahremani, M.; Chen, H.; Liu, J.; Zhao, Y. Cycle structure and illumination constrained GAN for medical image enhancement. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2020, Proceedings of the 23rd International Conference, Lima, Peru, 4–8 October 2020; Springer: Cham, Switzerland, 2020; pp. 667–677. [Google Scholar]
  23. Fu, L.; Yu, H.; Juefei-Xu, F.; Li, J.; Guo, Q.; Wang, S. Let there be light: Improved traffic surveillance via detail preserving night-to-day transfer. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 8217–8226. [Google Scholar] [CrossRef]
  24. Zhu, Z.; Meng, Y.; Kong, D.; Zhang, X.; Guo, Y.; Zhao, Y. To see in the dark: N2DGAN for background modeling in nighttime scene. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 492–502. [Google Scholar] [CrossRef]
  25. Pizer, S.M.; Amburn, E.P.; Austin, J.D.; Cromartie, R.; Geselowitz, A.; Greer, T.; ter Haar Romeny, B.; Zimmerman, J.B.; Zuiderveld, K. Adaptive histogram equalization and its variations. Comput. Vis. Graph. Image Process. 1987, 39, 355–368. [Google Scholar] [CrossRef]
  26. Srinivasan, S.; Balram, N. Adaptive contrast enhancement using local region stretching. In Proceedings of the 9th Asian Symposium on Information Display, New Delhi, India, 8–12 October 2006; pp. 152–155. [Google Scholar]
  27. Jobson, D.J.; Rahman, Z.; Woodell, G.A. Properties and performance of a center/surround retinex. IEEE Trans. Image Process. 1997, 6, 451–462. [Google Scholar] [CrossRef] [PubMed]
  28. Rahman, Z.; Jobson, D.J.; Woodell, G.A. Multi-scale retinex for color image enhancement. In Proceedings of the 3rd IEEE International Conference on Image Processing, Lausanne, Switzerland, 19 September 1996; pp. 1003–1006. [Google Scholar]
  29. Guo, X.; Li, Y.; Ling, H. LIME: Low-light image enhancement via illumination map estimation. IEEE Trans. Image Process. 2017, 26, 982–993. [Google Scholar] [CrossRef] [PubMed]
  30. Sharma, A.; Tan, R.T. Nighttime Visibility Enhancement by Increasing the Dynamic Range and Suppression of Light Effects. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 11972–11981. [Google Scholar]
  31. Jin, Y.; Yang, W.; Tan, R. Unsupervised night image enhancement: When layer decomposition meets light-effects suppression. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 404–421. [Google Scholar]
  32. Zheng, Z.; Wu, Y.; Han, X.; Shi, J. Forkgan: Seeing into the rainy night. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 155–170. [Google Scholar]
  33. He, M.; Wang, Y.; Wu, J.; Wang, Y.; Li, H.; Li, B.; Gan, W.; Wu, W.; Qiao, Y. Cross Domain Object Detection by Target-Perceived Dual Branch Distillation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 9560–9570. [Google Scholar]
34. Li, M.; Liu, J.; Yang, W.; Sun, X.; Guo, Z. Structure-revealing low-light image enhancement via robust retinex model. IEEE Trans. Image Process. 2018, 27, 2828–2841.
35. Wang, W.; Chen, Z.; Yuan, X.; Guan, F. An adaptive weak light image enhancement method. Proc. SPIE 2021, 11719, 1171902.
36. Kwon, H.-J.; Lee, S.-H. Raindrop-removal image translation using target-mask network with attention module. Mathematics 2023, 11, 3318.
37. Tarvainen, A.; Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 1195–1204.
38. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
39. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
40. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2015, arXiv:1409.1556.
41. Kennerley, M.; Wang, J.; Veeravalli, B.; Tan, R. 2PCNet: Two-Phase Consistency Training for Day-to-Night Unsupervised Domain Adaptive Object Detection. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 11484–11493.
42. Shi, J.; Pang, G. Low-light image enhancement for nighttime object detection using generative adversarial networks. Multimed. Tools Appl. 2020, 79, 15423–15434.
43. Chen, D.; Gao, X. A comprehensive review of nighttime object detection using deep learning. IEEE Access 2020, 8, 103759–103772.
44. Li, S.; Huang, H. Enhancing nighttime object detection with multi-scale feature fusion. Neurocomputing 2021, 438, 271–282.
45. Huang, X.; Belongie, S. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1501–1510.
46. He, K.; Zhang, X.; Ren, S.; Sun, J. Identity mappings in deep residual networks. In Computer Vision–ECCV 2016, Proceedings of the 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 630–645.
47. Deng, J.; Li, W.; Chen, Y.; Duan, L. Unbiased mean teacher for cross-domain object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 4089–4099.
48. Xu, H.; Zhao, X. Nighttime object detection using unsupervised domain adaptation and image translation. J. Vis. Commun. Image Represent. 2021, 75, 103049.
49. Loh, Y.P.; Chan, C.S. Getting to know low-light images with the Exclusively Dark dataset. Comput. Vis. Image Underst. 2019, 178, 30–42.
50. Sun, T.; Segu, M.; Postels, J.; Wang, Y.; Van Gool, L.; Schiele, B.; Tombari, F.; Yu, F. SHIFT: A synthetic driving dataset for continuous multi-task domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 21339–21350.
51. Zhang, Y.; Guo, X.; Ma, J.; Liu, W.; Zhang, J. Beyond brightening low-light images. Int. J. Comput. Vis. 2021, 129, 1013–1037.
52. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014, Proceedings of the 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Springer: Cham, Switzerland, 2014; pp. 740–755.
53. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal visual object classes (VOC) challenge. Int. J. Comput. Vis. 2010, 88, 303–338.
54. Ouyang, W.; Wang, X.; Zeng, X. DeepID-Net: Deformable deep convolutional neural networks for object detection. Pattern Recognit. 2017, 76, 230–241.
Figure 1. Results of our rapid image enhancement module, LightImg. The left side shows the enhanced images, while the right side shows the corresponding unenhanced low-light images.
Figure 2. Overview of our proposed framework, NUDN. NUDN consists of two main components: a student network, which is trained on labeled daytime images and unlabeled nighttime images, and a teacher network, which is updated as the exponential moving average (EMA) of the student and provides pseudo-labels for the student.
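As a concrete reference for the teacher-student update described in this caption, the sketch below shows a generic EMA teacher update and confidence-based pseudo-label filtering in PyTorch. It is only an illustrative sketch, not the authors' implementation; the decay of 0.999 and the confidence threshold of 0.8 are taken from Table 1, and the detection output format is assumed to follow torchvision's convention.

import copy
import torch

EMA_DECAY = 0.999        # EMA decay from Table 1
CONF_THRESHOLD = 0.8     # pseudo-label confidence threshold from Table 1

def build_teacher(student: torch.nn.Module) -> torch.nn.Module:
    """The teacher starts as a frozen copy of the student."""
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, decay: float = EMA_DECAY):
    """Per-parameter update: teacher = decay * teacher + (1 - decay) * student."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(decay).add_(s_param.detach(), alpha=1.0 - decay)

@torch.no_grad()
def filter_pseudo_labels(teacher_outputs, threshold: float = CONF_THRESHOLD):
    """Keep only high-confidence teacher detections as pseudo-labels.
    `teacher_outputs` is assumed to be a list of dicts with 'boxes', 'scores',
    and 'labels', as returned by torchvision detection models."""
    kept = []
    for out in teacher_outputs:
        mask = out["scores"] >= threshold
        kept.append({"boxes": out["boxes"][mask], "labels": out["labels"][mask]})
    return kept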
Figure 3. The pipeline for nighttime image enhancement, named LightImg, processes low-light raw images and produces enhanced outputs. Utilizing color restoration enables us to process each image more accurately, achieving performance comparable to that of daytime images.
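Because LightImg is described here only at the pipeline level, the following is a hypothetical sketch of a comparable enhancement step (brightness lifting followed by gray-world color restoration). It illustrates the idea of combining illumination adjustment with color restoration for low-light inputs, but it is not the actual LightImg module.

import numpy as np

def enhance_low_light(img: np.ndarray, gamma: float = 0.45, eps: float = 1e-6) -> np.ndarray:
    """Generic low-light enhancement sketch. `img` is an HxWx3 float array in [0, 1]."""
    # Brightness lifting: gamma < 1 brightens dark regions more than bright ones.
    lifted = np.power(np.clip(img, 0.0, 1.0), gamma)

    # Gray-world color restoration: scale each channel so its mean matches the
    # overall mean, reducing the color cast introduced by artificial night lighting.
    channel_means = lifted.reshape(-1, 3).mean(axis=0)
    gains = channel_means.mean() / (channel_means + eps)
    return np.clip(lifted * gains, 0.0, 1.0)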
Figure 4. On the ExDark dataset, we compare the performance of the original images, TDD, 2PCNet, and our method. Unlike other methods, our approach is capable of detecting small nighttime objects and minimizing additional false positive predictions.
Figure 5. The comparative study results of TDD, 2PCNet, and our method on the SHIFT dataset show that TDD has a significant number of false positive boxes compared to our method. In contrast, our performance is very close to that of 2PCNet.
Figure 6. Visual comparison of a typical low-light image enhanced by various methods. (a) The original input image. LIME [29] tends to produce globally overexposed images, as shown in (b), whereas KinD++ [51] does not effectively enhance local dark regions, as seen in (c). In contrast, the proposed method generates satisfactorily enhanced images, significantly improving illumination, color, and local contrast, as illustrated in (d).
Figure 7. Performance of LightImg and NUDN under extremely low-light conditions. Under such conditions, images processed by LightImg exhibit significant noise, and NUDN is particularly sensitive to light-source areas.
Table 1. Environment and parameterization of the experiment.
Parameter                              Configuration
CPU                                    Intel Xeon Gold 5320 @ 2.20 GHz
GPU                                    RTX A4000 (16 GB) × 3
System                                 Ubuntu 20.04
Deep learning framework                PyTorch 1.11.0 + Python 3.8 + CUDA 11.3
Training epochs                        60,000
Batch size                             12
Exponential moving average (EMA) decay 0.999
Base learning rate                     0.04
Confidence threshold                   0.8
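For convenience, the settings in Table 1 can be gathered into a single configuration object, as in the illustrative sketch below; the field names (for example, base_lr for the "Base" entry) are our own assumptions rather than names taken from the authors' code.

from dataclasses import dataclass

@dataclass
class NUDNConfig:
    # Values copied from Table 1; field names are illustrative only.
    training_epochs: int = 60_000
    batch_size: int = 12
    ema_decay: float = 0.999
    base_lr: float = 0.04            # assumed to be the base learning rate
    confidence_threshold: float = 0.8
    device: str = "cuda"             # 3x RTX A4000 (16 GB) in the reported setup

config = NUDNConfig()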
Table 2. Results of cross-domain adaptation on the ExDark dataset: the average precision (AP) of each method, along with the average precision for small objects (APs), medium objects (APm), and large objects (APl).
Method           AP      APs     APm     APl
TDD [33]         34.6    12.1    28.3    39.1
2PCNet [41]      39.7    8.3     25.0    36.6
UMT [47]         36.2    10.8    27.4    34.6
NUDN (Ours)      37.4    14.1    28.8    42.7
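The AP, APs, APm, and APl columns follow the COCO evaluation protocol. Assuming the ground truth and detections are stored as COCO-format JSON files (the file names below are hypothetical), they can be computed with standard pycocotools calls, as sketched here; this is generic usage, not the authors' evaluation script.

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Hypothetical file names; any COCO-format annotation/result pair works.
coco_gt = COCO("exdark_val_coco.json")
coco_dt = coco_gt.loadRes("nudn_detections.json")

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()

# stats[0] = AP@[0.50:0.95], stats[3] = APs, stats[4] = APm, stats[5] = APl
ap, ap_small, ap_medium, ap_large = (evaluator.stats[i] for i in (0, 3, 4, 5))
print(f"AP={ap:.3f}  APs={ap_small:.3f}  APm={ap_medium:.3f}  APl={ap_large:.3f}")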
Table 3. Comparative results of TDD, 2PCNet, UMT, and our method in cross-domain adaptation using the nighttime dataset from SHIFT.
Method           AP      APs     APm     APl
TDD [33]         33.2    10.1    29.2    37.2
2PCNet [41]      44.7    11.3    28.0    39.6
UMT [47]         32.4    9.1     24.8    29.3
NUDN (Ours)      39.6    16.2    30.4    45.2
Table 4. Ablation studies were conducted on the ExDark dataset to evaluate the contributions of different components of our method: LightImg (I), consistency learning (CL), and label randomization (LR).
Structure        I     CL    LR    AP      APs     APm     APl
Single           –     –     –     31.7    4.8     16.5    30.4
                                   36.8    9.1     23.3    41.4
                                   36.2    9.4     24.7    42.9
NUDN (Ours)      ✓     ✓     ✓     37.4    14.1    28.8    42.7
Table 5. Quantitative comparison with current state-of-the-art low-light image enhancement methods on the ExDark dataset. Higher PSNR and SSIM and lower LPIPS indicate better enhancement quality; Time reports runtime in milliseconds.
Method             PSNR↑    SSIM↑    LPIPS↓    Time (ms)
KinD++ [51]        13.44    0.570    0.205     281.4
LIME [29]          11.99    0.593    0.193     828.6
SCI                10.25    0.470    0.306     8.7
LightImg (Ours)    11.36    0.522    0.258     42.6
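The PSNR and SSIM values in Table 5 can be reproduced with standard image-quality routines; the minimal sketch below uses scikit-image and assumes paired enhanced/reference images as float arrays in [0, 1]. LPIPS requires the separate lpips package with a pretrained network and is therefore omitted here.

import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def quality_scores(enhanced: np.ndarray, reference: np.ndarray):
    """Compute PSNR (dB) and SSIM between an enhanced image and its well-lit
    reference. Both are HxWx3 float arrays in [0, 1]; channel_axis requires
    scikit-image >= 0.19."""
    psnr = peak_signal_noise_ratio(reference, enhanced, data_range=1.0)
    ssim = structural_similarity(reference, enhanced, channel_axis=-1, data_range=1.0)
    return psnr, ssim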
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
