3.1. Motivation
The Local Maximum Color Value (LMCV) algorithm was proposed by Dong et al. [3] in 2016. LMCV stands out among classical LLE methods because it introduces the LMCV prior, derived from statistics of well-exposed images. By combining the atmospheric scattering model (ASM) with the LMCV prior, the algorithm effectively enhances low-light images while adaptively adjusting the exposure of under-exposed regions. This adaptability to varying lighting conditions made the algorithm a significant contribution to the field of image processing.
Firstly, they introduce a new term, the LMCV map, in which each pixel's value is the local maximum color value within its neighborhood. Specifically, each pixel's value in the LMCV map is derived from the highest color value found in its local area, as represented by Equation (3):

$$J_{\mathrm{LMCV}}(x) = \max_{y \in \Omega(x)} \Big( \max_{c \in \{r, g, b\}} J^{c}(y) \Big) \quad (3)$$

where $x$ represents the pixel coordinates, $J$ represents the scene radiance, $c$ is the RGB channel, and $\Omega(x)$ represents the points within a certain region centered at $x$. In the LMCV algorithm, the region size is typically set to 15 × 15 pixels.
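To make Equation (3) concrete, the LMCV map can be computed as a channel-wise maximum followed by a sliding-window maximum. The sketch below is our illustration rather than the authors' code; it assumes a normalized PyTorch tensor and uses max pooling with stride 1 as the 15 × 15 local-maximum operator.

```python
import torch
import torch.nn.functional as F

def lmcv_map(img: torch.Tensor, region: int = 15) -> torch.Tensor:
    """Local Maximum Color Value (LMCV) map of Equation (3).

    img: float tensor [3, H, W], values normalized to [0, 1].
    Returns [1, H, W], where each pixel holds the maximum color value
    inside its region x region neighborhood.
    """
    channel_max = img.max(dim=0, keepdim=True).values        # max over R, G, B
    pad = region // 2                                         # keep spatial size
    local_max = F.max_pool2d(channel_max.unsqueeze(0),        # [1, 1, H, W]
                             kernel_size=region, stride=1, padding=pad)
    return local_max.squeeze(0)                               # [1, H, W]
```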
Moreover, through their statistical study, they observed that in an image with sufficient exposure, each pixel's value in the LMCV map tends to approximate the highest value of the pixel range (see Figure 1). This observation is formally denoted as the LMCV prior, represented by Equation (4):

$$J_{\mathrm{LMCV}}(x) \to 1 \quad (4)$$

(assuming pixel values normalized to [0, 1]).
Therefore, when we apply the LMCV map operation to both sides of the ASM Equation (1), we obtain the modified Equation (5):

$$I_{\mathrm{LMCV}}(x) = J_{\mathrm{LMCV}}(x)\, t(x) + A\,(1 - t(x)) \quad (5)$$

where $I_{\mathrm{LMCV}}$ represents the LMCV map of the source image, $J_{\mathrm{LMCV}}$ represents the LMCV map of the scene radiance, and $t$ represents the transmission map.
Based on the LMCV prior, we assume that the scene radiance $J$ represents the well-exposed image we desire; then, we can set its LMCV map as $J_{\mathrm{LMCV}}(x) = 1$. Subsequently, Equation (5) can be simplified to Equation (6):

$$I_{\mathrm{LMCV}}(x) = t(x) + A\,(1 - t(x)) \quad (6)$$
Consequently, as shown in Equation (7), the transmission map $t$ can be considered as composed of $I_{\mathrm{LMCV}}$ and the atmospheric light $A$:

$$t(x) = \frac{I_{\mathrm{LMCV}}(x) - A}{1 - A} \quad (7)$$
Once we obtain the transmission map $t$, we can solve for the scene radiance $J$ based on the ASM model, as shown in Equation (8):

$$J(x) = \frac{I(x) - A}{t(x)} + A \quad (8)$$
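Putting Equations (7) and (8) together, the classical recovery step amounts to a few tensor operations. This is a minimal sketch under our own assumptions (normalized inputs, a scalar atmospheric light A estimated elsewhere, and an eps guard we add for numerical stability), not the original implementation:

```python
import torch

def recover_scene_radiance(I: torch.Tensor, lmcv: torch.Tensor,
                           A: float, eps: float = 1e-6) -> torch.Tensor:
    """Classical recovery following Equations (7) and (8).

    I:    dark source image, [3, H, W] in [0, 1].
    lmcv: its LMCV map, [1, H, W] (broadcast over the channels of I).
    A:    atmospheric light, treated as a scalar here for simplicity.
    """
    t = (lmcv - A) / (1.0 - A + eps)   # Equation (7): transmission map
    t = t.clamp(min=eps)               # guard the division below
    J = (I - A) / t + A                # Equation (8): scene radiance
    return J.clamp(0.0, 1.0)
```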
In the original LMCV algorithm, a 15 × 15 kernel operation is employed to generate the LMCV map $I_{\mathrm{LMCV}}$, while complex matrix element lookups are used to determine the atmospheric light $A$. Additionally, the algorithm relies on the computationally intensive guided filter to mitigate the checkerboard artifacts caused by the 15 × 15 kernel operation (as shown in Figure 2, middle). These operations account for much of the LMCV algorithm's overall computational cost. To streamline the algorithm and improve efficiency, it is necessary to simplify these operations or replace them with more efficient deep learning techniques.
In this paper, we introduce a novel method for estimating the transmission map called Dark Image Increment Estimation (DIIE). Furthermore, we propose two deep learning modules: the Transmission Estimation Module (TEM) and the Correction Module (CM). These methods and modules are specifically designed to replace the computationally intensive functions present in the original LMCV algorithm. By leveraging DIIE and these modules, we aim to enhance efficiency while either maintaining or improving accuracy in low-light image enhancement tasks.
3.2. Dark Image Increment Estimation (DIIE) for Transmission Map
In Section 3.1, we derive the transmission map $t$ and scene radiance $J$ from the dark source image $I$, its LMCV map $I_{\mathrm{LMCV}}$, and the atmospheric light $A$, as shown in Equations (7) and (8). If we assume that any missing influence of the atmospheric light $A$ can be addressed through deep learning model adaptivity, then by ignoring $A$ in Equations (7) and (8) we can simplify them to Equations (9) and (10):

$$\hat{t}(x) = I_{\mathrm{LMCV}}(x) \quad (9)$$

$$\hat{J}(x) = \frac{I(x)}{\hat{t}(x)} \quad (10)$$

where $\hat{t}$ represents the simplified estimation of the transmission map and $\hat{J}$ refers to the scene radiance derived from our simplified LMCV algorithm.
Combining Equations (9) and (10) yields Equation (11), indicating that the scene radiance $\hat{J}$ depends solely on the dark source image $I$ and its LMCV map $I_{\mathrm{LMCV}}$:

$$\hat{J}(x) = \frac{I(x)}{I_{\mathrm{LMCV}}(x)} \quad (11)$$
Additionally, considering the definition of the LMCV map as having pixel values representing maximum values in local areas, $I_{\mathrm{LMCV}}$ must be greater than or equal to $I$ at any pixel point $x$, as per Equation (12):

$$I_{\mathrm{LMCV}}(x) \ge I(x) \quad (12)$$
Thus, we decompose the transmission map into the dark source image $I$ plus a positive increment $f$ in Equation (13):

$$t(x) = I(x) + f(x) \quad (13)$$

Notably, we redefine both the LMCV map $I_{\mathrm{LMCV}}$ and the transmission map $t$ as three-channel maps from Equation (13) onwards, which aligns better with deep learning models and preserves image details and color information.
Therefore, solving for the scene radiance $J$ reduces to finding the correct increment $f$, as shown in Equation (14), where $t$ is an intermediate variable for computing $J$:

$$J(x) = \frac{I(x)}{t(x)} = \frac{I(x)}{I(x) + f(x)} \quad (14)$$
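As a quick numerical illustration of Equation (14) (with values chosen by us for exposition, not taken from the paper), a dark pixel is strongly amplified while a bright pixel keeps its brightness:

$$I(x) = 0.1,\; f(x) = 0.15 \;\Rightarrow\; t(x) = 0.25,\; J(x) = 0.1 / 0.25 = 0.4$$

$$I(x) = 0.95,\; f(x) = 0 \;\Rightarrow\; t(x) = 0.95,\; J(x) = 0.95 / 0.95 = 1.0$$

This is exactly the adaptive exposure behavior the increment $f$ is meant to provide.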
To address the biases introduced by removing the atmospheric light $A$, we need to predict a correction $c$ to obtain the final scene radiance $E$ (the enhanced image), as shown in Equation (15).
In summary, our proposed DIIE (Dark Image Increment Estimation) method simplifies the LMCV algorithm into predicting the increment $f$ and performing the atmospheric light correction $c$. To achieve this, we introduce two deep learning modules, the Transmission Estimation Module (TEM) and the Correction Module (CM), to estimate them, respectively. Pseudocode Algorithm 1 provides an overview of our proposed model.
Algorithm 1 Overview of our proposed model
1: Input: I: dark image [C × H × W]
2: Output: E: enhanced image [C × H × W]
3: procedure Model(I)
4:     f ← TEM(I)        ▹ Estimates the increment f
5:     t ← I + f         ▹ Gets 3-channel transmission map t
6:     J ← I / t         ▹ Gets scene radiance image J
7:     E ← CM(I, J)      ▹ Generates the final enhanced image E
8:     return E
9: end procedure
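Read as tensor operations, Algorithm 1 maps onto a few lines of PyTorch. The sketch below is our paraphrase: tem and cm stand in for the trained modules of Section 3.3, and the CM call signature CM(I, J) and the eps guard are our assumptions.

```python
import torch
import torch.nn as nn

class DIIEModel(nn.Module):
    """Algorithm 1 as a forward pass: E = CM(I, I / (I + TEM(I)))."""
    def __init__(self, tem: nn.Module, cm: nn.Module, eps: float = 1e-6):
        super().__init__()
        self.tem, self.cm, self.eps = tem, cm, eps

    def forward(self, I: torch.Tensor) -> torch.Tensor:  # I: [B, 3, H, W]
        f = self.tem(I)             # line 4: estimate the increment f
        t = I + f                   # line 5: 3-channel transmission map t
        J = I / (t + self.eps)      # line 6: scene radiance image J
        E = self.cm(I, J)           # line 7: final enhanced image E
        return E
```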
3.3. Model Architecture: TEM and CM Module
We propose an LLE model based on our DIIE method, comprising approximately 5K learnable parameters, as illustrated in Figure 3. The Transmission Estimation Module (TEM) is structured as a U-Net [34] model, focusing on estimating the increment $f$ between the dark source image $I$ and the three-channel transmission map $t$ shown in Figure 2 (right). The Correction Module (CM), on the other hand, adopts the skip connection [35] structure and is utilized for color correction and noise reduction, compensating for the lack of the atmospheric light $A$. Notably, our model maintains a consistent three-channel matrix output for most layers without increasing the number of channels in the matrices. This design choice contributes significantly to the model's compactness and parameter efficiency.
3.3.1. Transmission Estimation Module
The Transmission Estimation Module (TEM) takes the dark source image $I$ as input and is responsible for estimating the increment $f$. The sum of the increment $f$ and the dark source image $I$ yields the transmission map $t$. To meet this requirement, we propose a U-Net structure for our TEM, comprising three parts: Encoder, Skip Block, and Decoder, as shown in Figure 4.
Encoder: contains no trainable parameters but includes three AvgPooling layers. These layers generate pooled images at 1/2, 1/4, and 1/8 scales of the original image, which are fed into the Decoder.
Skip Block: has the structure conv3×3 + ResBlock + ResBlock + ResBlock + conv3×3, with 3-channel inputs and outputs. It is primarily responsible for extracting global features from the smallest (1/8 scale) pooled image, which are fed into the first layer of the Decoder.
Decoder: contains most of our trainable parameters and consists of four T_UP layers. The structure of a T_UP layer is similar to the Skip Block, but with an added Deconvolution layer for upscaling, so that the pooled images from the Encoder can be fused at matching resolutions, as shown in the T_UP diagram in Figure 5 (see also the sketch after this list).
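Under these descriptions, a plausible PyTorch rendering of the TEM is sketched below. Figures 4 and 5 are not reproduced in this excerpt, so the fusion operator (we assume element-wise addition), the activations (ReLU), and the choice to run the last T_UP without upscaling so the resolutions line up are our assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    """Residual block kept at 3 channels for compactness."""
    def __init__(self, ch: int = 3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return F.relu(x + self.body(x))

class SkipBlock(nn.Module):
    """conv3x3 + ResBlock x 3 + conv3x3, 3-channel in and out."""
    def __init__(self, ch: int = 3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1),
            ResBlock(ch), ResBlock(ch), ResBlock(ch),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return self.body(x)

class TUp(nn.Module):
    """T_UP layer: a SkipBlock with an optional 2x deconvolution in front."""
    def __init__(self, ch: int = 3, upscale: bool = True):
        super().__init__()
        self.up = (nn.ConvTranspose2d(ch, ch, 2, stride=2)
                   if upscale else nn.Identity())
        self.body = SkipBlock(ch)

    def forward(self, x, skip=None):
        x = self.up(x)
        if skip is not None:
            x = x + skip  # fuse with the pooled image (additive fusion assumed)
        return self.body(x)

class TEM(nn.Module):
    """Transmission Estimation Module: dark image I -> increment f."""
    def __init__(self):
        super().__init__()
        self.skip_block = SkipBlock()
        self.t_up1, self.t_up2, self.t_up3 = TUp(), TUp(), TUp()
        self.t_up4 = TUp(upscale=False)  # refinement at full resolution

    def forward(self, I):  # I: [B, 3, H, W] with H, W divisible by 8
        # Encoder: parameter-free average pooling at 1/2, 1/4, 1/8 scales
        p2 = F.avg_pool2d(I, 2)
        p4 = F.avg_pool2d(p2, 2)
        p8 = F.avg_pool2d(p4, 2)
        x = self.skip_block(p8)      # global features from the 1/8 image
        x = self.t_up1(x, p4)        # 1/8 -> 1/4
        x = self.t_up2(x, p2)        # 1/4 -> 1/2
        x = self.t_up3(x, I)         # 1/2 -> full resolution
        return self.t_up4(x)         # estimated increment f
```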
Based on the outputs of the Skip Block and Decoder in Figure 5, the output of the T_UP 4 layer is the estimated increment $f$. At the position $x$ of the fluorescent tube lights in the image, the increment $f(x)$ tends to 0, while the color of the dark source image $I$ at that position tends to 1. According to Equation (14), the scene radiance at this position is $J(x) = I(x)/(I(x) + f(x)) \approx I(x)$, maintaining the original brightness, which is exactly the behavior we desire.
Compared to the original transmission map, our estimated 3-channel transmission map addresses the checkerboard artifact problem while also offering enhanced detail and color information, as depicted in Figure 2.
3.3.2. Correction Module
While the TEM can partially reduce the influence of atmospheric light thanks to the adaptability of deep learning models, the scene radiance image $J$ output by the TEM may still suffer from color biases and noise introduced by the enhancement. Therefore, we introduce the Correction Module (CM), which includes a color correction block and a denoising block, as shown in Figure 6.
Color Correction Block: To adjust color biases, we concatenate the dark source image $I$ and the scene radiance image $J$, forming a 6-channel feature matrix $[I, J]$. This combined feature matrix is then fed to the Color Correction block. The output of the Color Correction block is a 3-channel matrix $c$ in the range of 0–1, and the result of applying $c$ element-wise to $J$ represents the color-corrected result.
Denoise Block: Inspired by the Zero-Shot Noise2Noise model [36], we utilize two convolutional layers to estimate and subtract noise from the image, thereby producing a denoised image, which is the final enhanced image $E$.
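A minimal sketch of the CM follows, under the same caveats: the depth of the Color Correction block, the sigmoid used to bound $c$ to 0–1, and the multiplicative application of $c$ are our assumptions; only the 6-channel input, the 0–1 correction map, and the two-convolution noise-subtraction design are stated above.

```python
import torch
import torch.nn as nn

class CM(nn.Module):
    """Correction Module: color correction + denoising, (I, J) -> E."""
    def __init__(self):
        super().__init__()
        # Color Correction block: 6-channel input [I, J] -> 3-channel map c
        self.color = nn.Sequential(
            nn.Conv2d(6, 3, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(3, 3, 3, padding=1), nn.Sigmoid())  # c in (0, 1)
        # Denoise block: two conv layers estimating the noise residual
        self.noise = nn.Sequential(
            nn.Conv2d(3, 3, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(3, 3, 3, padding=1))

    def forward(self, I: torch.Tensor, J: torch.Tensor) -> torch.Tensor:
        c = self.color(torch.cat([I, J], dim=1))  # per-pixel color correction
        corrected = J * c                          # element-wise use of c (assumed)
        E = corrected - self.noise(corrected)      # subtract estimated noise
        return E.clamp(0.0, 1.0)
```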
In summary, our proposed model, consisting of the Transmission Estimation Module (TEM) and the Correction Module (CM), leverages deep learning techniques to estimate the increment $f$ for the transmission map based on our DIIE method, and then performs color correction and noise reduction to produce the final enhanced image $E$. This streamlined approach offers computational efficiency while maintaining high quality in low-light image enhancement tasks.
3.4. Joint Loss
To train the proposed model to achieve the best performance, we combine multiple loss functions, which include both full-reference and no-reference loss functions.
(1) Full-Reference Loss Functions
Mean Absolute (L1) Loss. L1 loss is a common loss function for various image-to-image tasks such as image reconstruction, image super-resolution, and image denoising. It is effective because it directly computes the mean absolute distance between the enhanced result and the ground truth. We adopt the L1 loss $\mathcal{L}_{L1}$ to compare the final enhanced image with the ground truth data:

$$\mathcal{L}_{L1} = \frac{1}{N} \sum_{x \in p} \left| E(x) - G(x) \right|$$

where $E$ denotes the final enhanced image, $G$ denotes the ground truth data, $x$ denotes the pixel coordinate in the pixel space $p$, and $N$ is the total number of pixels.
The L1 loss measures the pixel-level difference between the enhanced image and the ground truth, offering a highly accurate match to the ground truth data. However, its high precision can sometimes lead to overfitting issues. To address this, we introduce the Root Mean Squared Log Error (RMSLE) loss.
Root Mean Squared Log Error (RMSLE) loss. The RMSLE loss applies the logarithm function within the root mean squared error, which reduces the impact of a few large deviations from the ground truth on the overall error. Thus, the RMSLE loss $\mathcal{L}_{RMSLE}$ allows for small localized errors. We use it to measure the difference between the scene radiance $J$ and the ground truth: because the scene radiance $J$ is not the final enhanced image processed by the Correction Module, we tolerate some error in this comparison to counteract overfitting.
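The paper's exact formulation is not reproduced in this excerpt; a standard pixel-wise RMSLE consistent with the description, using the ground truth $G$ and pixel count $N$ defined above, would be:

$$\mathcal{L}_{\mathrm{RMSLE}} = \sqrt{\frac{1}{N} \sum_{x \in p} \big( \log(J(x) + 1) - \log(G(x) + 1) \big)^{2}}$$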
While both the L1 and RMSLE loss functions operate at the pixel level and aim to minimize differences between pixels, they do not account for whether the enhanced image aligns with human visual perception. To enhance the visual quality of the enhanced image, we introduce the Structural Similarity (SSIM) loss.
Structural Similarity (SSIM) loss. The SSIM loss compares the brightness, contrast, and structure of two images, providing a metric closer to human visual perception of image differences. We adopt the SSIM loss $\mathcal{L}_{SSIM}$ to measure the difference between the enhanced image and the ground truth data.
(2) No-Reference Loss Functions
Additionally, to suppress noise and mitigate color biases, we introduce two no-reference loss functions, denoted here as $\mathcal{L}_{noise}$ and $\mathcal{L}_{color}$.
The total loss in this paper can be represented as Equation (21):

$$\mathcal{L}_{total} = \mathcal{L}_{L1} + \mathcal{L}_{RMSLE} + \mathcal{L}_{SSIM} + \mathcal{L}_{noise} + \mathcal{L}_{color} \quad (21)$$

It is important to note that no special hyperparameters (term weights) need to be set for the loss functions in this equation; the outcome of each loss term depends solely on the enhanced image and the ground truth image. Within the total loss function, we utilize three full-reference loss functions ($\mathcal{L}_{L1}$, $\mathcal{L}_{RMSLE}$, $\mathcal{L}_{SSIM}$) and two no-reference loss functions ($\mathcal{L}_{noise}$, $\mathcal{L}_{color}$).
$\mathcal{L}_{L1}$ efficiently calculates the error of the enhanced image at the pixel level but may lead to overfitting due to its high precision. To address this, we introduce $\mathcal{L}_{RMSLE}$, which can tolerate locally large errors and helps mitigate overfitting. Furthermore, to ensure that the enhanced images align with human perception of high quality, we use $\mathcal{L}_{SSIM}$ to encourage brightness, contrast, and structural similarity between the enhanced image and the high-quality ground truth images. To suppress noise amplification and correct color biases after enhancement, we employ the $\mathcal{L}_{noise}$ and $\mathcal{L}_{color}$ functions in our final total loss. The combination of these five loss functions contributes significantly to producing visually high-quality enhanced images while ensuring robustness. The individual impact of each loss function can be observed in Figure 7.
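Finally, to make the hyperparameter-free combination concrete, here is a minimal PyTorch sketch of the joint loss. The ssim_loss, noise_loss, and color_loss callables are placeholders for the SSIM and no-reference terms whose exact formulations appear in the paper's equations; the unweighted summation mirrors Equation (21).

```python
import torch

def l1_loss(E: torch.Tensor, G: torch.Tensor) -> torch.Tensor:
    """Mean absolute error between enhanced image E and ground truth G."""
    return torch.mean(torch.abs(E - G))

def rmsle_loss(J: torch.Tensor, G: torch.Tensor) -> torch.Tensor:
    """Root mean squared log error between scene radiance J and ground truth G."""
    return torch.sqrt(torch.mean((torch.log1p(J) - torch.log1p(G)) ** 2))

def joint_loss(E, J, G, ssim_loss, noise_loss, color_loss):
    """Unweighted sum of the five terms, following Equation (21).

    ssim_loss(E, G), noise_loss(E), and color_loss(E) are placeholder
    callables for the SSIM and the two no-reference loss terms.
    """
    return (l1_loss(E, G) + rmsle_loss(J, G)
            + ssim_loss(E, G) + noise_loss(E) + color_loss(E))
```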