Article

DANet: A Domain Alignment Network for Low-Light Image Enhancement

Qiao Li, Bin Jiang, Xiaochen Bo, Chao Yang and Xu Wu
1 College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China
2 Academy of Military Medical Science, Beijing 100850, China
3 College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Electronics 2024, 13(15), 2954; https://doi.org/10.3390/electronics13152954
Submission received: 18 June 2024 / Revised: 15 July 2024 / Accepted: 23 July 2024 / Published: 26 July 2024
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

Abstract

We propose a deep-learning approach for restoring low-light images that suffer from severe degradation. A significant domain gap exists between low-light and real images, which previous methods have not addressed through domain alignment. To tackle this, we introduce a domain alignment network that leverages dual encoders and a domain alignment loss. Specifically, we train dual encoders to transform low-light and real images into two latent spaces and align these spaces using a domain alignment loss. Additionally, we design a Convolution-Transformer Module (CTM) for the encoding process to comprehensively extract both local and global features. Experimental results on four benchmark datasets demonstrate that the proposed Domain Alignment Network (DANet) outperforms state-of-the-art methods.

1. Introduction

Enhancing low-light images is a significant challenge in computer vision. The objective is to address poor visibility, low contrast, and various degradations such as artifacts, color distortion, and noise in low-light images. These problems not only hinder human visual perception but also degrade downstream computer vision tasks such as image processing [1,2], object detection [3], and autonomous driving [4].
With the advancement of deep learning, numerous algorithms have emerged to tackle low-light image enhancement. Despite these efforts, existing methods have limitations. Most algorithms use an encoder-decoder or recurrent structure to process low-light images, aligning them with well-exposed images only at the output. Because this alignment of data distributions occurs solely at the output, it often fails to restore low-light images to properly exposed ones, producing results that are either too dark (Figure 1c,f) or over-exposed (Figure 1b,j). Some methods aim to match the output to the original image as closely as possible, or to learn the residual between them, which yields brightness that remains excessively dark compared with the ground truth. Other methods use Generative Adversarial Networks (GANs) to fit the distribution at the output, such as EnlightenGAN [5] and Signal-to-Noise-Ratio (SNR) [6]. However, because they align only at the output and ignore the alignment of intermediate-layer features, their results can be unstable and sometimes overexposed, as shown in Figure 1e. Furthermore, most convolution-based methods tend to overlook long-range dependencies, while Transformer-based methods ignore the local perceptual nature of convolutions. Consequently, many methods produce locally blurry enhancement results or introduce noise into the image. Although certain methods attempt to combine long-range and short-range features, insufficient fusion often leads to inadequate texture details in the results.
In this paper, to address the above-mentioned drawbacks, we propose a novel network named the Domain Alignment Network (DANet) for low-light image enhancement. Unlike traditional encoder-decoder structures, we introduce a dual-encoder, single-decoder architecture. Dual encoders can extract deep feature representations that are highly useful for understanding the content and textural structure of images. In the context of image enhancement, they can also be used for domain adaptation by learning the mapping between the source domain (low-light images) and the target domain (real images), thereby improving enhancement quality. In addition, dual encoders can improve robustness by learning feature representations that withstand noise, compression artifacts, and other forms of degradation. Specifically, DANet consists of dual encoders (an Enhancement Encoder and a Reconstruction Encoder) and a decoder. Each encoder is composed of convolutional layers and a Convolution-Transformer Module (CTM). The dual encoders take ground-truth and low-light images as inputs: one encoder extracts ground-truth features for reconstruction, while the other extracts low-light features for illumination enhancement. The decoder reconstructs images under both normal-light and low-light conditions, yielding two reconstructed images. To extract features more effectively, we capture long-range dependencies and local perceptual features by integrating convolution and CTM modules. Moreover, to reduce the discrepancy between low-light and normal-light image data, which lie in two different domains, and to enable better feature fusion and complementarity, we propose a domain alignment loss function that improves the enhancement of low-light images. These components work together to extract relevant features from different scales and regions of the input images; by leveraging feature interactions, we achieve an effective fusion of the extracted features, improving feature representation and enhancement performance.
Experiments on benchmark datasets indicate that the proposed method performs well compared to state-of-the-art techniques. It produces clear, natural images and achieves brightness levels close to ground truth images. Our contributions can be summarized as follows:
  • We propose a novel framework, named the Domain Alignment Network (DANet), that combines dual encoders with domain alignment for enhancing low-light images.
  • We propose dual encoders and a domain alignment loss function: the dual encoders encode images from two different domains separately, and the domain alignment loss aligns the feature distributions of the two latent spaces into the same domain, achieving feature alignment between the two domains.
  • We design a Convolution-Transformer Module (CTM) comprising local and non-local branches. The local branch extracts local features, while the non-local branch extracts global features. The two sets of features are then deeply fused, enriching the image’s detailed texture information and improving its quality.
  • Experimental results show that our approach outperforms other methods in both objective and subjective metrics.

2. Related Works

2.1. Non-Learning Based Low-Light Image Enhancement

Non-learning methods for low-light image enhancement primarily include histogram equalization and Retinex-theory approaches. For histogram equalization, Zhuang et al. [7] introduced an enhancement technique based on entropy-adaptive histogram equalization, while Mun et al. [8] proposed an edge-enhanced dual histogram equalization method using guided image filters. These methods often produce unwanted artifacts in enhanced real-world images while losing image details. Compared with histogram-equalization-based methods, image enhancement algorithms based on Retinex theory can achieve better enhancement effects. Retinex-based methods decompose an image into reflectance and illumination components, using the reflectance component as a reliable basis for enhancement. Illumination correction is then applied to suppress artifacts [9,10], leading to more realistic and natural results. Li et al. [11] also modeled noise in the Retinex framework, improving robustness. Ghosh et al. [12] presented a variational filtering solution based on the Retinex model. However, when enhancing complex real-world images, these methods often introduce local color distortions [13], resulting in over-enhancement and other local artifacts.

2.2. Learning-Based Low-Light Image Enhancement

In recent years, there has been a surge of learning-based methods for enhancing low-light images [14,15,16,17]. Jiang et al. [5] introduced EnlightenGAN, designed for no-reference low-light image enhancement, reducing the dependence on paired datasets. Guo et al. [18] introduced the Zero method, which reformulates the nonlinear relationship between low-light and normal-light images as a task of learning image-specific curve estimation. SNR [6] employs Transformers for signal-to-noise-ratio perception and a CNN model with spatially varying operations to achieve spatially variant enhancement of low-light images. Zheng et al. [19] presented UTVNet, which leverages balanced parameter learning in a model-based denoising approach; it guides the noise-layer map to recover finer details and suppress noise in realistically captured low-light scenes. Ma et al. [20] proposed CSDNet, a context-sensitive decomposition network architecture that exploits scene-level context for spatial-scale dependency and constructs illumination-guided, spatially variant operations with edge-aware smoothing properties, enhancing image brightness and details. Ma et al. [21] introduced the self-calibrated illumination (SCI) learning framework for rapidly, flexibly, and robustly improving image brightness in real-world low-light scenarios. Unsupervised methods have also been explored [18,22,23]; for instance, Guo et al. [18] constructed a lightweight network for pixel-wise curve estimation. Prior methods, however, overlooked domain alignment when handling low-light images. We therefore propose a method that uses dual encoders and a domain alignment network trained with a three-term loss. Our approach not only produces clear images but also exhibits brightness closer to that of real images.

2.3. Enhancement of Low-Light Images Using Vision Transformers

In recent years, since Transformers were proposed by Vaswani et al. in 2017 [24], they and their variants have been applied to numerous computer vision tasks, including image classification [25,26], semantic segmentation [27,28], object detection [29,30,31], image restoration [32,33,34], and others. In particular, since the introduction of the Vision Transformer (ViT), efforts have been made to better adapt Transformers to visual tasks. Many works have focused on reducing the quadratic computational cost of global self-attention within Transformers. Some works [35,36,37] concentrate on establishing pyramid Transformer architectures similar to convnet-based structures. For instance, Xu et al. [6] introduced an SNR-aware hybrid network for low-light image enhancement; however, due to the high computational burden of the original global Transformer, SNR uses only a single global Transformer layer at the lowest resolution of a U-shaped CNN. While Transformer-based methods have shown promising results in various computer vision tasks, their full potential in low-light image enhancement remains underexplored.
Unlike previous works, our proposed method enhances low-light images using dual encoders and a novel domain alignment network. The framework comprises dual encoders with Convolution-Transformer Modules and a decoder, transforming low-light and real images into two latent spaces and enhancing low-light images through domain alignment.

3. Proposed Method

In this section, we first present the overall framework of the proposed DANet, followed by an introduction to its key component, the CTM. Finally, we describe the loss functions used.
The overall framework of the proposed Domain Alignment Network (DANet) is illustrated in Figure 2. Suppose an image pair $(I_{ll}, I_{gt})$ is given, where $I_{gt}$ denotes the ground-truth image and $I_{ll}$ its corresponding low-light image. The image pair is fed into the dual encoders $E_{ll}$ and $E_{gt}$ for separate processing. In the deepest layer of each encoder, we introduce the CTM, which models local and long-range dependencies with a parallel dual-branch structure: one branch employs a Transformer, while the other uses convolutions. The dual encoders produce two latent features, $l_{ll}$ and $l_{gt}$. A domain alignment loss function is designed to map $l_{ll}$ and $l_{gt}$, which follow different distributions, onto the same latent space during training, reducing the discrepancy between the two domains and enabling complementary feature fusion. The decoder then reconstructs both $l_{ll}$ and $l_{gt}$ concurrently, yielding two reconstructed images, $I_{en}$ and $I_{rec}$, respectively. This stage can be formulated as:
$$F_{ll} = H_{en}(I_{ll}) = H_{CTM}(H_{cnn}(I_{ll})), \quad F_{gt} = H_{en}(I_{gt}) = H_{CTM}(H_{cnn}(I_{gt})),$$
$$I_{en} = H_{de}(F_{ll}), \quad I_{rec} = H_{de}(F_{gt}),$$
where $F_{ll}$ and $F_{gt}$ are the features of the low-light and normal-light images, respectively; $H_{en}(\cdot)$ denotes the encoding operation performed by the dual encoders $E_{ll}$ and $E_{gt}$; $H_{CTM}(\cdot)$ and $H_{cnn}(\cdot)$ denote the proposed CTM and the convolutional layers; $H_{de}(\cdot)$ is the decoder; and $I_{en}$ and $I_{rec}$ represent the light-enhanced and reconstructed images.
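To make the data flow above concrete, the following is a minimal PyTorch-style sketch of the dual-encoder, single-decoder pipeline. It is an illustrative assumption of how this stage could be wired up, not the authors' released code; the encoder, decoder, and CTM modules are passed in as generic `nn.Module`s.

```python
import torch
import torch.nn as nn

class DANetSketch(nn.Module):
    """Sketch of the dual-encoder / single-decoder data flow."""
    def __init__(self, enc_ll: nn.Module, enc_gt: nn.Module, dec: nn.Module):
        super().__init__()
        self.enc_ll = enc_ll  # E_ll: encodes the low-light input
        self.enc_gt = enc_gt  # E_gt: encodes the ground-truth input
        self.dec = dec        # shared decoder D

    def forward(self, i_ll: torch.Tensor, i_gt: torch.Tensor):
        f_ll = self.enc_ll(i_ll)   # F_ll = H_CTM(H_cnn(I_ll))
        f_gt = self.enc_gt(i_gt)   # F_gt = H_CTM(H_cnn(I_gt))
        i_en = self.dec(f_ll)      # enhanced image I_en
        i_rec = self.dec(f_gt)     # reconstructed image I_rec
        # latent features are returned so the domain alignment loss can act on them
        return i_en, i_rec, f_ll, f_gt
```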

3.1. Dual Encoders with Convolution-Transformer Module

Traditional low-light image enhancement algorithms typically employ encoder-decoder architectures or iterative structures (such as diffusion models) to reconstruct images, aligning the output with the ground truth images. However, aligning only at the output stage fails to enable the deepest layers of the network to learn the distribution of the corresponding domain’s high-level features, resulting in low-quality outcomes.
To solve these problems, we devise a dual-encoder, single-decoder architecture, as shown in Figure 2, where {$E_{gt}$, D} forms a reconstruction network that maps the real image $I_{gt}$ to latent-space features and reconstructs it back to the original image, while {$E_{ll}$, D} forms a low-light enhancement network that restores the low-light image $I_{ll}$ to an illuminated image resembling $I_{gt}$ as closely as possible. Since $E_{ll}$ and $E_{gt}$ share the same structure, we describe $E_{ll}$ for illustration. Given a low-light image $I_{ll} \in \mathbb{R}^{H \times W \times 3}$, $E_{ll}$ first applies a sequence of three convolutional layers to extract features and reduce the spatial dimensions, producing the feature $F_{ll}$. $F_{ll}$ is then fed into the Convolution-Transformer Module (CTM) to obtain the latent-space feature. Finally, three deconvolutional layers in the decoder restore this latent feature to the predicted image $I_{en}$. The CTM is detailed below.
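As a rough sketch of the encoder/decoder layout just described (three convolutional layers, a CTM at the deepest level, and three deconvolutions in the decoder), one possible PyTorch realization is shown below. Channel widths, kernel sizes, strides, and activations are assumptions for illustration; the paper does not specify them here.

```python
import torch.nn as nn

def down_block(c_in, c_out):
    # stride-2 convolution: extracts features while halving the spatial size
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                         nn.LeakyReLU(0.2, inplace=True))

class Encoder(nn.Module):
    """E_ll / E_gt sketch: three convolutional layers followed by a CTM."""
    def __init__(self, ctm: nn.Module, base: int = 32):
        super().__init__()
        self.convs = nn.Sequential(down_block(3, base),
                                   down_block(base, 2 * base),
                                   down_block(2 * base, 4 * base))
        self.ctm = ctm

    def forward(self, x):
        return self.ctm(self.convs(x))  # latent feature F

class Decoder(nn.Module):
    """Shared decoder sketch: three deconvolutions back to an RGB image."""
    def __init__(self, base: int = 32):
        super().__init__()
        self.deconvs = nn.Sequential(
            nn.ConvTranspose2d(4 * base, 2 * base, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.ConvTranspose2d(2 * base, base, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.ConvTranspose2d(base, 3, 4, stride=2, padding=1),
        )

    def forward(self, f):
        return self.deconvs(f)
```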

3.2. Convolution-Transformer Module (CTM)

Traditional methods often employ deep residual convolutional networks to extract features but overlook the issue of long-range dependencies. In recent years, the use of vision transformer (ViT) models has effectively addressed this problem. However, pure ViT networks lack the local perceptual capabilities of convolutions, leading to poor texture details in the results. Although some methods, such as SNR [6], consider both long-range and local features, the lack of effective interaction hinders proper feature fusion, resulting in the presence of some noise in the results.
To address the aforementioned issues, we design the Convolution-Transformer Module (CTM) illustrated in Figure 3. Given an input feature, we split it into two parallel branches: a local branch and a non-local branch. The local branch employs deep residual convolution blocks to extract local features, capturing the detailed nuances of local image regions. The non-local branch divides the feature into patches and feeds them into a Transformer to extract global features. The features from the two branches are then deeply fused through a series of cross-attention Transformers and residual blocks for comprehensive feature fusion. This stage can be formulated as:
$$F_{CTM} = H_{CTM}(F) = H_{cnn}\big(H_{ctran}(H_{tran}(F), H_{res}(F))\big),$$
where $F$ is the input feature of the CTM, and $H_{ctran}(\cdot)$, $H_{tran}(\cdot)$, and $H_{res}(\cdot)$ represent the Transformer with cross-attention, the standard Transformer, and the residual (ResNet) blocks, respectively.
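The following is a minimal sketch of how the CTM formula could be realized in PyTorch. It is an assumption for illustration: patch embedding is simplified to pixel-wise tokens, and the channel width, head count, and block depths are placeholders rather than the paper's actual settings.

```python
import torch
import torch.nn as nn

class CTMSketch(nn.Module):
    """Sketch of F_CTM = H_cnn(H_ctran(H_tran(F), H_res(F)))."""
    def __init__(self, c: int = 128, heads: int = 4):
        super().__init__()
        # H_res: local branch (residual convolutions)
        self.res = nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True),
                                 nn.Conv2d(c, c, 3, padding=1))
        # H_tran: non-local branch (self-attention over tokens)
        self.self_attn = nn.MultiheadAttention(c, heads, batch_first=True)
        # H_ctran: cross-attention fusing the two branches
        self.cross_attn = nn.MultiheadAttention(c, heads, batch_first=True)
        # H_cnn: final convolutional projection
        self.out_conv = nn.Conv2d(c, c, 3, padding=1)

    def forward(self, f):
        b, c, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)            # (B, H*W, C); 1x1 "patches" for simplicity
        g, _ = self.self_attn(tokens, tokens, tokens)    # global features
        local = f + self.res(f)                          # local features with a residual connection
        l_tok = local.flatten(2).transpose(1, 2)
        fused, _ = self.cross_attn(g, l_tok, l_tok)      # global queries attend to local keys/values
        fused = fused.transpose(1, 2).reshape(b, c, h, w)
        return self.out_conv(fused)
```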

4. Domain Alignment Loss

Existing alignment loss functions. Existing loss functions such as cross-entropy loss, triplet loss, and contrastive loss could be utilized for alignment. The cross-entropy loss aligns the distribution of $F_{ll}$ towards $F_{gt}$ in the form:
$$\mathcal{L}_{cro} = -\big(F_{gt} \log(F_{ll})\big),$$
Assume that, during training, another negative sample $F_{gt}^{-}$ is drawn from the same batch. The triplet loss can then be written as:
$$\mathcal{L}_{tri} = \max\big(0, \|F_{ll} - F_{gt}\|_2 - \|F_{ll} - F_{gt}^{-}\|_2 + a\big),$$
where $F_{gt}$ and $F_{ll}$ represent the two latent-space vectors to be aligned, $F_{gt}^{-}$ denotes the negative sample, and $a$ is the margin hyperparameter.
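For reference, the two existing alignment losses discussed above can be written compactly in PyTorch as follows. This is a hedged sketch: the softmax normalization used to turn the latent features into distributions for the cross-entropy term is an assumption, not something the paper specifies.

```python
import torch
import torch.nn.functional as F

def cross_entropy_align(f_ll: torch.Tensor, f_gt: torch.Tensor, eps: float = 1e-8):
    # L_cro = -(F_gt * log(F_ll)); features are softmax-normalized here (an assumption)
    p_gt = torch.softmax(f_gt.flatten(1), dim=1)
    p_ll = torch.softmax(f_ll.flatten(1), dim=1)
    return -(p_gt * torch.log(p_ll + eps)).sum(dim=1).mean()

def triplet_align(f_ll: torch.Tensor, f_gt: torch.Tensor, f_gt_neg: torch.Tensor, margin: float = 0.3):
    # L_tri = max(0, ||F_ll - F_gt||_2 - ||F_ll - F_gt^-||_2 + a)
    return F.triplet_margin_loss(f_ll.flatten(1), f_gt.flatten(1), f_gt_neg.flatten(1), margin=margin)
```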
In our experiments, we observed that as the number of training epochs increased, both the cross-entropy loss and the triplet loss exhibited mode collapse. A possible reason is that both losses require channel-wise correspondence between features, whereas in the latent space the features learned in individual channels may differ between low-light and normal-light images. Contrastive loss, on the other hand, tends to show its benefit only after extensive, large-scale training over many epochs.
Domain alignment loss function. Based on these observations, we design a latent-space discriminator as our domain alignment loss. Specifically, as shown in Figure 2, the discriminator consists of three linear layers and outputs a value representing the probability that a latent vector is authentic. The loss is formulated as follows:
$$\mathcal{L}_{dan} = -\log\big(D(F_{ll})\big) - \log\big(1 - D(F_{gt})\big),$$
where $F_{gt}$ serves as the fake sample and $F_{ll}$ as the true sample.
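A minimal sketch of the latent-space discriminator and the resulting domain alignment loss is given below. The hidden width, activations, and the sigmoid output are assumptions; only the three-layer linear structure and the loss form follow the description above.

```python
import torch
import torch.nn as nn

class LatentDiscriminator(nn.Module):
    """Sketch of the three-layer linear discriminator D; hidden sizes are assumptions."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(hidden, 1), nn.Sigmoid(),  # probability that a latent vector is "real"
        )

    def forward(self, f):
        return self.net(f.flatten(1))

def domain_alignment_loss(d: LatentDiscriminator, f_ll, f_gt, eps: float = 1e-8):
    # L_dan = -log(D(F_ll)) - log(1 - D(F_gt)), with F_ll treated as the "true" sample
    return -(torch.log(d(f_ll) + eps) + torch.log(1.0 - d(f_gt) + eps)).mean()
```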
The reconstruction loss aids DANet in achieving results with complete image structures and can be formulated as follows:
$$\mathcal{L}_{rec} = \frac{1}{N}\sum_{i=1}^{N}\big(I_{n}(i) - I_{e}(i)\big)^{2},$$
where $I_{n}$ represents the target images and $I_{e}$ denotes the light-enhanced images.
The overall loss function. The overall loss is designed to optimize DANet in terms of image structure, feature-domain alignment, and human visual perception. It is expressed as follows:
$$\mathcal{L} = \mathcal{L}_{rec} + \mathcal{L}_{dan} + \lambda \mathcal{L}_{tri},$$
where $\mathcal{L}_{rec}$, $\mathcal{L}_{tri}$, and $\mathcal{L}_{dan}$ are the reconstruction loss, triplet loss, and domain alignment loss, respectively, and $\lambda$ is the weight coefficient used to balance the triplet loss. We set $\lambda = 0.3$.
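Putting the three terms together, a sketch of the overall training objective might look as follows; it reuses `triplet_align` and `domain_alignment_loss` from the sketches above, uses a plain pixel-wise MSE for the reconstruction term, and takes `disc` as the latent-space discriminator.

```python
import torch.nn.functional as F

def total_loss(i_en, i_gt, f_ll, f_gt, f_gt_neg, disc, lam: float = 0.3):
    l_rec = F.mse_loss(i_en, i_gt)                           # reconstruction term (pixel-wise MSE)
    l_dan = domain_alignment_loss(disc, f_ll, f_gt)          # latent-space domain alignment term
    l_tri = triplet_align(f_ll, f_gt, f_gt_neg, margin=0.3)  # triplet term, weighted by lambda
    return l_rec + l_dan + lam * l_tri                       # L = L_rec + L_dan + 0.3 * L_tri
```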

5. Experimental Results and Discussion

5.1. Experimental Settings

Datasets. We evaluated our method on the LOLv1, LOLv2-synthetic [38], LSRW-Nikon, and LSRW-Huawei [39] datasets (dataset link: https://github.com/JianghaiSCU/Diffusion-Low-Light, accessed on 5 November 2023).
LOLv1 consists of 485 pairs of low-light and normal-light images for training and 15 pairs for testing, with each pair including a low-light input image and its corresponding normal-illumination image. In contrast, the LOLv2-synthetic dataset generates low-light images from RAW images by analyzing illumination distribution. This subset includes 1000 pairs of low-light and normal images, with 900 pairs for training and 100 pairs for testing.
Implementation Details. Our implementation uses PyTorch and was trained and tested on a PC with a single 1080Ti GPU. The model was trained with the Adam optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.999$) for $2.5 \times 10^{5}$ iterations. The initial learning rate was set to $2 \times 10^{-4}$ and decreased to $1 \times 10^{-6}$ following a cosine annealing schedule for stability. Training samples were randomly cropped to $128 \times 128$ patches from the low-light/normal-light image pairs, with a batch size of 16. Data augmentation techniques, such as random rotation and flipping, were used to enrich the training data.
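The training configuration above can be expressed in PyTorch roughly as follows. This is a sketch under stated assumptions: `model`, `disc`, `train_loader`, and `total_loss` are taken from (or assumed alongside) the earlier sketches, the loader supplies random 128x128 crops with rotation/flip augmentation at batch size 16, and the discriminator is updated jointly here for brevity rather than with a separate alternating step.

```python
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR

total_iters = 250_000  # 2.5e5 iterations
opt = Adam(list(model.parameters()) + list(disc.parameters()),
           lr=2e-4, betas=(0.9, 0.999))
sched = CosineAnnealingLR(opt, T_max=total_iters, eta_min=1e-6)  # cosine decay 2e-4 -> 1e-6

step = 0
while step < total_iters:
    for i_ll, i_gt in train_loader:
        i_en, i_rec, f_ll, f_gt = model(i_ll, i_gt)
        f_gt_neg = f_gt.roll(1, dims=0)  # negative sample drawn from the same batch
        loss = total_loss(i_en, i_gt, f_ll, f_gt, f_gt_neg, disc, lam=0.3)
        opt.zero_grad()
        loss.backward()
        opt.step()
        sched.step()
        step += 1
        if step >= total_iters:
            break
```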
Evaluation Metrics. We employed three full-reference distortion metrics to evaluate the proposed method: Peak Signal-to-Noise Ratio (PSNR) [40], Structural Similarity Index (SSIM) [40], and Mean Absolute Error (MAE). Additionally, we used a perceptual metric, the Fréchet Inception Distance (FID) [41], commonly used for generative adversarial networks, to assess the visual quality of the enhanced results.
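As a small illustration of the distortion metrics, PSNR and MAE can be computed directly as below for images scaled to [0, 1]; SSIM and FID are typically taken from standard library implementations and are omitted here.

```python
import torch

def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    # Peak Signal-to-Noise Ratio in dB for images in [0, max_val]
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

def mae(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Mean Absolute Error between the enhanced and reference images
    return torch.mean(torch.abs(pred - target))
```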

5.2. Comparison with State-of-the-Art (SOTA) Methods

We compared our method with current SOTA methods for low-light image enhancement, including EnlightenGAN [5], JED [42], KinD [43], LIME [44], RetinexNet [45], CSDGAN [20], Zero [18], UTVNet [19], SCI [21], URetinex [46], SNR [6], PairLIE [47], DSHGAL [48].

5.2.1. Quantitative Analysis

We compared our method with other approaches on the LOLv1, LOLv2-synthetic [38], LSRW-Huawei, and LSRW-Nikon [39] test sets. The performance comparisons on these four public datasets are reported in Table 1 and Table 2. Overall, DANet achieves either the best or the second-best performance across all metrics, demonstrating the superiority of the proposed approach. In Table 1, our method significantly outperforms previous state-of-the-art methods on the distortion metrics, indicating that our results contain more high-frequency details and structure. In particular, on the LSRW-Huawei test set our method improves PSNR by 1.534 dB over the second-best method, DSHGAL, and achieves the second-best SSIM and MAE; on the LSRW-Nikon test set it achieves the second-best FID. As shown in Table 2, we observe similar improvements on the LOLv2-synthetic test set, i.e., gains of 0.82 dB in PSNR and 0.007 in SSIM. On the LOLv1 test set, our method improves the MAE and FID metrics by at least 0.007 and 4.31, respectively, and achieves the second-best PSNR. These experimental outcomes demonstrate that the proposed method delivers satisfactory visual quality for high-resolution low-light image restoration, proving its effectiveness.

5.2.2. Qualitative Comparison

To facilitate a clearer comparison, we provide visual results of all methods in Figure 4, Figure 5, Figure 6, Figure 7 and Figure 8 for the LOLv1, LSRW-Nikon, LSRW-Huawei, and LOLv2-synthetic datasets.
Figure 4 compares the enhancement results of each method on the LOLv1 dataset. The top row shows the enhanced images (RGB) produced by several methods, and the bottom row shows the corresponding colour histograms. In terms of colour, our method successfully recovers the original colour histogram and closely approximates the histogram of the real image, whereas the other methods do not fully recover it. Figure 5 shows a further comparison on the LOLv1 dataset. We observe the following trends: (1) the inability to effectively integrate local and global features affects the signal-to-noise ratio and results in inconsistent texture details; for example, URetinex produces blurry details; (2) SCI and KinD exhibit locally dim lighting and, by neglecting long-range dependencies and domain alignment, also produce excessively bright regions and artifacts; (3) due to the lack of domain alignment in the latent space, CSDGAN, Zero, and PairLIE show uneven brightness distributions in their outputs, while EnlightenGAN suffers from local overexposure. Similarly, the noise and missing local texture details in KinD stem from the absence of long-range or local dependency modeling, which prevents effective feature extraction.
Figure 6 compares the enhancement results of each method on the LSRW-Nikon dataset. EnlightenGAN, Zero, and KinD exhibit color distortion, while SNR and URetinex produce blurred details, all due to insufficient extraction of effective feature information. Figure 7 shows the comparison on the LSRW-Huawei dataset. EnlightenGAN, Zero, and KinD again show color distortion, while URetinex, LIME, and CSDGAN over-enhance the distant sunlight owing to a lack of long-range dependencies in feature extraction. Figure 8 presents the comparison on the LOLv2-synthetic dataset. The sky and clouds in EnlightenGAN and CSDGAN suffer from overexposure, whereas in KinD and Zero the cloud illumination is insufficient. Overall, the other methods lack domain alignment in the latent space, resulting in locally overexposed or underexposed regions in their outputs. In contrast, our approach applies domain alignment and extracts both long-range and short-range dependencies, yielding clear, locally coherent images whose illumination more closely resembles that of the ground-truth images.

5.3. Ablation Study

To showcase the effectiveness of the domain alignment network, the CTM module, and the domain alignment loss function, we conducted the following three ablation experiments on the LSRW-Huawei dataset.
The effectiveness of DANet. We removed the aligned encoder and used a single encoder and decoder for image reconstruction, corresponding to a traditional encoder-decoder structure (referred to as “w/o DANet”). The results for the various metrics are reported in row 3 of Table 3, with visualization results in Figure 9. Compared with the real images, the contrast and saturation of this variant appear relatively poor, and the enhancement results are not entirely satisfactory. Without the domain alignment approach, the brightness distribution of the results becomes unbalanced: since the proposed domain alignment loss function is not used, the illumination of the images is not properly distinguished, which degrades the contrast and saturation of the enhanced images. In the full DANet, which includes the domain alignment loss function and the dual encoders, one encoder reconstructs images under normal illumination while the other enhances images captured under weak lighting. The domain alignment loss aligns the latent features with those of normally illuminated images, yielding enhanced images with better contrast, saturation, and texture details that are also closer to the real images.
The effectiveness of CTM. We removed the CTM module (referred to as “w/o CTM”). The results for each metric are shown in row 4 of Table 3, and the visualization results are shown in Figure 9. Without the CTM, the enhanced image exhibits locally blurred regions and some noise: long-range dependencies are ignored, and global features are not sufficiently extracted alongside the local features, leading to local blurriness in the enhanced image.
The effectiveness of the loss function. The evaluation of the loss function is shown in Figure 9. To validate the effectiveness of the proposed domain alignment loss, we replaced it during training with contrastive loss (referred to as “RP con”), triplet loss (referred to as “RP tri”), and cross-entropy loss (referred to as “RP cro”). The results in Figure 9 show that with RP con, RP cro, and RP tri, the contrast and saturation of the enhancement results are poor, and the contrast is somewhat distorted compared with the ground truth. With our proposed domain alignment loss, the contrast and saturation of the DANet results are better, the details are clearer and closer to the real image, and the visual effect is more natural. These experimental results show that the design of the domain alignment loss function effectively improves the quality of the enhanced images.

6. Conclusions

This paper proposed DANet, a network for low-light image enhancement comprising dual encoders with a Convolution-Transformer Module (CTM). The encoders reduce the spatial dimensions of the low-light and normal-light images and map them into two latent spaces. The CTM captures long-range dependencies, with separate branches extracting global and local features, and the domain alignment effectively brings these feature distributions together. These innovations not only enhance the detailed information in low-light images but also achieve illumination closer to that of real images, mitigating over-enhancement and under-enhancement. Experimental results on four public datasets demonstrated the superior performance of our method compared with state-of-the-art techniques. Future research will explore Transformer-based methods for low-light image enhancement in more depth, with an emphasis on detail and illumination improvements.

Author Contributions

Conceptualization, Q.L. and B.J.; methodology, Q.L.; software, Q.L.; validation, Q.L., B.J. and C.Y.; formal analysis, Q.L.; investigation, Q.L.; resources, Q.L. and B.J.; data curation, Q.L. and X.W.; writing—original draft preparation, Q.L.; writing—review and editing, Q.L. and C.Y.; visualization, Q.L.; supervision, X.B.; project administration, B.J. and X.B.; funding acquisition, B.J. and C.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under grants 62072169 and 62172156, and by the Scientific Research Project of the Hunan Provincial Education Department under grants 19A286 and 21A0607.

Data Availability Statement

The training and testing datasets are available at the web address provided in the article; other data can be shared upon request by contacting the authors.

Conflicts of Interest

The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; in the decision to publish the results.

References

  1. Guo, X.; Hu, Q. Low-light Image Enhancement Via Breaking Down the Darkness. Int. J. Comput. Vis. 2023, 131, 48–66. [Google Scholar] [CrossRef]
  2. Feng, B.; Ai, C.; Zhang, H. Fusion of Infrared and Visible Light Images Based on Improved Adaptive Dual-Channel Pulse Coupled Neural Network. Electronics 2024, 13, 2337. [Google Scholar] [CrossRef]
  3. Liang, J.; Wang, J.; Quan, Y.; Chen, T.; Liu, J.; Ling, H.; Xu, Y. Recurrent Exposure Generation for Low-light Face Detection. IEEE Trans. Multimed. 2021, 24, 1609–1621. [Google Scholar] [CrossRef]
  4. Li, G.; Yang, Y.; Qu, X.; Cao, D.; Li, K. A Deep Learning Based Image Enhancement Approach for Autonomous Driving at Night. Knowl.-Based Syst. 2021, 213, 106617. [Google Scholar] [CrossRef]
  5. Jiang, Y.; Gong, X.; Liu, D.; Cheng, Y.; Fang, C.; Shen, X.; Yang, J.; Zhou, P.; Wang, Z. Enlightengan: Deep Light Enhancement without Paired Supervision. IEEE Trans. Image Process. 2021, 30, 2340–2349. [Google Scholar] [CrossRef] [PubMed]
  6. Xu, X.; Wang, R.; Fu, C.W.; Jia, J. SNR-Aware Low-Light Image Enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2022; pp. 17714–17724. [Google Scholar]
  7. Zhuang, L.; Guan, Y. Adaptive Image Enhancement Using Entropy-based Subhistogram Equalization. Comput. Intell. Neurosci. 2018, 2018, 3837275. [Google Scholar] [CrossRef] [PubMed]
  8. Mun, J.; Jang, Y.; Nam, Y.; Kim, J. Edge-enhancing Bi-histogram Equalisation Using Guided Image Filter. J. Vis. Commun. Image Represent. 2019, 58, 688–700. [Google Scholar] [CrossRef]
  9. Wang, S.; Zheng, J.; Hu, H.M.; Li, B. Naturalness Preserved Enhancement Algorithm for Non-uniform Illumination Images. IEEE Trans. Image Process. 2013, 22, 3538–3548. [Google Scholar] [CrossRef]
  10. Rahman, Z.u.; Jobson, D.J.; Woodell, G.A. Retinex Processing for Automatic Image Enhancement. J. Electron. Imaging 2004, 13, 100–110. [Google Scholar]
  11. Li, M.; Liu, J.; Yang, W.; Sun, X.; Guo, Z. Structure-revealing Low-light Image Enhancement Via Robust Retinex Model. IEEE Trans. Image Process. 2018, 27, 2828–2841. [Google Scholar] [CrossRef]
  12. Ghosh, S.; Chaudhury, K.N. Fast Bright-pass Bilateral Filtering for Low-light Enhancement. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 205–209. [Google Scholar]
  13. Wang, R.; Zhang, Q.; Fu, C.W.; Shen, X.; Zheng, W.S.; Jia, J. Underexposed Photo Enhancement Using Deep Illumination Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6849–6857. [Google Scholar]
  14. Wu, Y.; Pan, C.; Wang, G.; Yang, Y.; Wei, J.; Li, C.; Shen, H.T. Learning Semantic-Aware Knowledge Guidance for Low-Light Image Enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 1662–1671. [Google Scholar]
  15. Yang, S.; Ding, M.; Wu, Y.; Li, Z.; Zhang, J. Implicit Neural Representation for Cooperative Low-light Image Enhancement. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 12918–12927. [Google Scholar]
  16. Wu, Y.; Wang, G.; Wang, Z.; Yang, Y.; Li, T.; Wang, P.; Li, C.; Shen, H.T. ReCo-Diff: Explore Retinex-Based Condition Strategy in Diffusion Model for Low-Light Image Enhancement. arXiv 2023, arXiv:2312.12826. [Google Scholar]
  17. Zhang, X.; Wang, X.; Yan, C.; Jiao, G.; He, H. Polarization-Based Two-Stage Image Dehazing in a Low-Light Environment. Electronics 2024, 13, 2269. [Google Scholar] [CrossRef]
  18. Guo, C.; Li, C.; Guo, J.; Loy, C.C.; Hou, J.; Kwong, S.; Cong, R. Zero-reference Deep Curve Estimation for Low-light Image Enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1780–1789. [Google Scholar]
  19. Zheng, C.; Shi, D.; Shi, W. Adaptive Unfolding Total Variation Network for Low-light Image Enhancement. In Proceedings of the IEEE/CVF international conference on computer vision, Montreal, BC, Canada, 11–17 October 2021; pp. 4439–4448. [Google Scholar]
  20. Ma, L.; Liu, R.; Zhang, J.; Fan, X.; Luo, Z. Learning Deep Context-sensitive Decomposition for Low-light Image Enhancement. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 5666–5680. [Google Scholar] [CrossRef] [PubMed]
  21. Ma, L.; Ma, T.; Liu, R.; Fan, X.; Luo, Z. Toward Fast, Flexible, and Robust Low-light Image Enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5637–5646. [Google Scholar]
  22. Chen, Y.S.; Wang, Y.C.; Kao, M.H.; Chuang, Y.Y. Deep Photo Enhancer: Unpaired Learning for Image Enhancement from Photographs with Gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6306–6314. [Google Scholar]
  23. Tuncal, K.; Sekeroglu, B.; Abiyev, R. Self-Supervised and Supervised Image Enhancement Networks with Time-Shift Module. Electronics 2024, 13, 2313. [Google Scholar] [CrossRef]
  24. Lyu, H.; Sha, N.; Qin, S.; Yan, M.; Xie, Y.; Wang, R. Advances in Neural Information Processing Systems. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 12345–12355. [Google Scholar]
  25. Ali, A.; Touvron, H.; Caron, M.; Bojanowski, P.; Douze, M.; Joulin, A.; Laptev, I.; Neverova, N.; Synnaeve, G.; Verbeek, J.; et al. Xcit: Cross-covariance Image Transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 20014–20027. [Google Scholar]
  26. Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lučić, M.; Schmid, C. Vivit: A video Vision Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 6836–6846. [Google Scholar]
  27. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2022; pp. 205–218. [Google Scholar]
  28. Wu, B.; Xu, C.; Dai, X.; Wan, A.; Zhang, P.; Yan, Z.; Tomizuka, M.; Gonzalez, J.E.; Keutzer, K.; Vajda, P. Visual Transformers: Where Do Transformers Really Belong in Vision Models? In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 599–609. [Google Scholar]
  29. Amjoud, A.B.; Amrouch, M. Object Detection Using Deep Learning, CNNs and Vision Transformers: A Review. IEEE Access 2023, 11, 35479–35516. [Google Scholar] [CrossRef]
  30. Gehrig, M.; Scaramuzza, D. Recurrent Vision Transformers for Object Detection with Event Cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 13884–13893. [Google Scholar]
  31. Huang, Z.; Dai, H.; Xiang, T.Z.; Wang, S.; Chen, H.X.; Qin, J.; Xiong, H. Feature Shrinkage Pyramid for Camouflaged Object Detection with Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 5557–5566. [Google Scholar]
  32. Yu, D.; Li, Q.; Wang, X.; Zhang, Z.; Qian, Y.; Xu, C. DSTrans: Dual-Stream Transformer for Hyperspectral Image Restoration. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 3739–3749. [Google Scholar]
  33. Wang, Z.; Cun, X.; Bao, J.; Zhou, W.; Liu, J.; Li, H. Uformer: A General U-shaped Transformer for Image Restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 17683–17693. [Google Scholar]
  34. Zhao, H.; Gou, Y.; Li, B.; Peng, D.; Lv, J.; Peng, X. Comprehensive and Delicate: An Efficient Transformer for Image Restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 14122–14132. [Google Scholar]
  35. Heo, B.; Yun, S.; Han, D.; Chun, S.; Choe, J.; Oh, S.J. Rethinking Spatial Dimensions of Vision Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 11936–11945. [Google Scholar]
  36. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 568–578. [Google Scholar]
  37. Zhou, D.; Yang, Z.; Yang, Y. Pyramid Diffusion Models for Low-light Image Enhancement. arXiv 2023, arXiv:2305.10028. [Google Scholar]
  38. Jiang, H.; Luo, A.; Fan, H.; Han, S.; Liu, S. Low-light Image Enhancement with Wavelet-based Diffusion Models. ACM Trans. Graph. (TOG) 2023, 42, 1–14. [Google Scholar] [CrossRef]
  39. Hai, J.; Xuan, Z.; Yang, R.; Hao, Y.; Zou, F.; Lin, F.; Han, S. R2rnet: Low-light Image Enhancement Via Real-low to Real-normal Network. J. Vis. Commun. Image Represent. 2023, 90, 103712. [Google Scholar] [CrossRef]
  40. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
  41. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. Gans Trained By a Two Time-scale Update Rule Converge to a Local Nash Equilibrium. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6626–6637. [Google Scholar]
  42. Ren, X.; Li, M.; Cheng, W.H.; Liu, J. Joint Enhancement and Denoising Method Via Sequential Decomposition. In Proceedings of the 2018 IEEE International Symposium on Circuits and Systems (ISCAS), Florence, Italy, 27–30 May 2018; pp. 1–5. [Google Scholar]
  43. Zhang, Y.; Zhang, J.; Guo, X. Kindling the Darkness: A Practical Low-light Image Enhancer. In Proceedings of the 27th ACM international conference on multimedia, Nice, France, 21–25 October 2019; pp. 1632–1640. [Google Scholar]
  44. Guo, X.; Li, Y.; Ling, H. LIME: Low-light Image Enhancement Via Illumination Map Estimation. IEEE Trans. Image Process. 2016, 26, 982–993. [Google Scholar] [CrossRef]
  45. Wei, C.; Wang, W.; Yang, W.; Liu, J. Deep Retinex Decomposition for Low-light Enhancement. arXiv 2018, arXiv:1808.04560. [Google Scholar]
  46. Wu, W.; Weng, J.; Zhang, P.; Wang, X.; Yang, W.; Jiang, J. Uretinex-net: Retinex-based Deep Unfolding Network for Low-light Image Enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5901–5910. [Google Scholar]
  47. Fu, Z.; Yang, Y.; Tu, X.; Huang, Y.; Ding, X.; Ma, K.K. Learning a Simple Low-Light Image Enhancer From Paired Low-Light Instances. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22252–22261. [Google Scholar]
  48. Xu, H.; Liu, X.; Zhang, H.; Wu, X.; Zuo, W. Degraded Structure and Hue Guided Auxiliary Learning for Low-light Image Enhancement. Knowl.-Based Syst. 2024, 295, 111779. [Google Scholar] [CrossRef]
Figure 1. Examples of image enhancement results in both brightness and texture details.
Figure 2. Overview of the Domain Alignment Network.
Figure 3. The architecture of the Convolution-Transformer Module (CTM).
Figure 4. Visual comparison of colour histograms with state-of-the-art methods on the LOLv1 dataset.
Figure 5. Visual comparison with state-of-the-art methods on the LOLv1 dataset.
Figure 6. Visual comparison with state-of-the-art methods on the LSRW-Nikon [39] dataset.
Figure 7. Visual comparison with state-of-the-art methods on the LSRW-Huawei dataset.
Figure 8. Visual comparison with state-of-the-art methods on the LOLv2-synthetic dataset.
Figure 9. Visual results for the ablation study on the LSRW-Huawei dataset. “W/o DANet” indicates the absence of the aligned encoder, using only a single encoder and decoder to reconstruct the image; “w/o CTM” indicates the removal of the CTM module. “RP con”, “RP tri”, and “RP cro” denote the results of using these three loss functions, respectively, as substitutes for the domain alignment loss function.
Table 1. Quantitative evaluation of different methods on the LSRW-Huawei and LSRW-Nikon test sets. Results are highlighted in bold for the best performance and underlined for the second-best performance. ‘↑’ indicates the higher the better. ‘↓’ means the lower the better.

| Methods | LSRW-Huawei PSNR↑ | SSIM↑ | MAE↓ | FID↓ | LSRW-Nikon PSNR↑ | SSIM↑ | MAE↓ | FID↓ |
|---|---|---|---|---|---|---|---|---|
| EnlightenGAN [5] | 17.45 | 0.488 | 0.138 | 40.95 | 14.63 | 0.402 | 0.149 | 47.01 |
| JED [42] | 15.11 | 0.512 | 0.252 | 93.98 | 14.79 | 0.441 | 0.190 | 60.91 |
| KinD [43] | 17.18 | 0.455 | 0.162 | 73.72 | 15.36 | 0.425 | 0.157 | 51.71 |
| LIME [44] | 18.45 | 0.444 | 0.134 | 37.43 | 14.43 | 0.369 | 0.149 | 55.25 |
| RetinexNet [45] | 16.81 | 0.385 | 0.15 | 84.03 | 13.49 | 0.294 | 0.178 | 85.68 |
| CSDGAN [20] | 18.48 | 0.488 | 0.139 | 43.87 | 15.99 | 0.425 | 0.147 | 35.77 |
| Zero [18] | 16.4 | 0.466 | 0.223 | 57.7 | 15.03 | 0.416 | 0.177 | 37.93 |
| UTVNet [19] | 18.27 | 0.552 | 0.153 | 44.18 | 13.707 | 0.428 | 0.214 | 50.69 |
| SCI [21] | 15.1 | 0.417 | 0.296 | 56.99 | 15.26 | 0.393 | 0.176 | 27.08 |
| URetinex [46] | 18.96 | 0.55 | 0.118 | 58.28 | 16.44 | 0.452 | 0.125 | 35.74 |
| SNR [6] | 20.66 | 0.603 | 0.111 | 67.92 | 16.59 | 0.477 | 0.132 | 87.59 |
| PairLIE [47] | 18.98 | 0.545 | 0.125 | 63.53 | 15.52 | 0.429 | 0.141 | 38.91 |
| DSHGAL [48] | 20.7162 | 0.7835 | 0.0743 | 36.346 | 19.3665 | 0.7642 | 0.09606 | 36.762 |
| Ours | 22.25 | 0.617 | 0.076 | 27.93 | 17.81 | 0.516 | 0.116 | 30.25 |
Table 2. Quantitative evaluation of different methods on the LOLv1 and LOLv2-synthetic test sets. Results are highlighted in bold for the best performance and underlined for the second-best performance. ‘↑’ indicates the higher the better. ‘↓’ means the lower the better.

| Methods | LOLv1 PSNR↑ | SSIM↑ | MAE↓ | FID↓ | LOLv2-Syn PSNR↑ | SSIM↑ | MAE↓ | FID↓ |
|---|---|---|---|---|---|---|---|---|
| EnlightenGAN [5] | 18.68 | 0.6796 | 0.135 | 108.29 | 16.585 | 0.772 | 0.138 | 76.44 |
| JED [42] | 13.68 | 0.6421 | 0.291 | 105.91 | 16.885 | 0.7367 | 0.164 | 80.04 |
| KinD [43] | 14.79 | 0.5431 | 0.217 | 118.14 | 17.51 | 0.772 | 0.146 | 66.83 |
| LIME [44] | 17.18 | 0.528 | 0.153 | 97.01 | 17.497 | 0.771 | 0.123 | 65.48 |
| RetinexNet [45] | 16.77 | 0.4616 | 0.151 | 75.07 | 17.14 | 0.758 | 0.13 | 85.6 |
| CSDGAN [20] | 17.54 | 0.672 | 0.165 | 96.87 | 14.95 | 0.763 | 0.163 | 57.82 |
| Zero [18] | 14.87 | 0.588 | 0.251 | 97.15 | 17.76 | 0.817 | 0.147 | 47.8 |
| UTVNet [19] | 15.53 | 0.715 | 0.223 | 101.55 | 14.61 | 0.681 | 0.216 | 82.16 |
| SCI [21] | 13.806 | 0.551 | 0.308 | 91.68 | 16.69 | 0.749 | 0.158 | 64.15 |
| URetinex [46] | 19.685 | 0.831 | 0.117 | 44.25 | 18.19 | 0.825 | 0.124 | 48.64 |
| SNR [6] | 24.61 | 0.850 | 0.067 | 36.89 | 24.14 | 0.928 | 0.074 | 35.57 |
| PairLIE [47] | 18.47 | 0.756 | 0.14 | 60.11 | 19.07 | 0.803 | 0.109 | 65.75 |
| DSHGAL [48] | 19.9698 | 0.7907 | 0.0979 | 45.521 | 23.643 | 0.8961 | 0.1270 | 53.582 |
| Ours | 24.47 | 0.862 | 0.06 | 32.58 | 24.96 | 0.935 | 0.057 | 19.48 |
Table 3. Effect of DANet, the CTM, and the domain alignment loss on the LSRW-Huawei dataset in terms of PSNR, SSIM, MAE, and FID. “W/o DANet” and “w/o CTM”, respectively, indicate removal of the aligned encoder and of the CTM module; “RP con”, “RP tri”, and “RP cro” denote replacing the domain alignment loss with contrastive, triplet, and cross-entropy loss, respectively. Results are highlighted in bold for the best performance and underlined for the second-best performance. ‘↑’ indicates the higher the better. ‘↓’ means the lower the better.

| Condition | PSNR↑ | FID↓ | SSIM↑ | MAE↓ |
|---|---|---|---|---|
| baseline | 19.74 | 49.26 | 0.572 | 0.111 |
| w/o DANet | 20.58 | 41.42 | 0.589 | 0.098 |
| w/o CTM | 19.84 | 45.55 | 0.582 | 0.110 |
| RP con | 18.36 | 90.17 | 0.517 | 0.218 |
| RP tri | 21.52 | 31.23 | 0.614 | 0.088 |
| RP cro | 21.79 | 29.01 | 0.611 | 0.085 |
| Ours | 22.25 | 27.93 | 0.617 | 0.076 |