1. Introduction
Image fusion has been a crucial low-level image processing task for various applications, such as multi-spectrum image fusion [1,2], multi-focus image fusion [3], multi-modal image fusion [4], and multi-exposure image fusion [5]. Among these applications, thanks to the prevalence of smartphones with built-in cameras, multi-exposure image fusion is one of the most common. Since most natural scenes have a larger ratio of light to dark than a single camera shot can capture, a single-shot image usually cannot present details across a high dynamic range and thus contains under- or overexposed parts of the scene. When a camera captures an image, its sensor can only catch a limited luminance range during a specific exposure time, resulting in a so-called low-dynamic-range image. An image taken with a short exposure tends to be dark, while one taken with a long exposure tends to be bright, as shown in Figure 1a. Fusing differently exposed low-dynamic-range (LDR) images to obtain a high-dynamic-range (HDR) image requires extracting well-exposed (highlighted) regions from each LDR image to generate an excellent fused image, which has been very challenging.
Several research works have been performed on Multi-scale Exposure Fusion (MEF) [6,7,8]. In general, it is common to fuse LDR images using a weighted sum, where the weight associated with each input LDR image is determined in a pixel-wise fashion [6,7,8]. Mertens et al. [6] proposed fusing images in a multi-scale manner based on pixel contrast, saturation, and well-exposedness to ease content inconsistency issues in the fused results. However, this often yields halo artifacts in the fusion results. In [7,8], the authors addressed these artifacts by applying modified guided image filtering to the weight maps to eliminate halos around edges.
The abovementioned methods produce good results using a sequence of images exposed at small intervals of exposure value (EV). Thanks to advanced sensor technology, a camera with Binned Multiplexed Exposure High-Dynamic-Range (BME-HDR) or Spatially Multiplexed Exposure High-Dynamic-Range (SME-HDR) technology can simultaneously capture an image pair with short- and long-exposure image sensors. The captured pair has only a negligible difference, possibly caused by local motion blur between the two. Existing MEF methods may not work well with two exposure images, since neither input may have well-exposed contents. In addition, weighted-sum fusion based on well-exposedness may not be able to deal with highlighted regions of a short-exposure image that are darker than dark parts of a long-exposure image, resulting in the method ignoring contents of the short-exposure image. Yang et al. [9] proposed producing an intermediate virtual image with a medium exposure from an image pair with two exposures to help generate better fusion results. Nevertheless, it does not work in situations where the highlighted regions of both input LDR images are not well exposed.
In recent years, deep convolutional neural networks (CNNs) have achieved tremendous success in low-level image processing tasks. In MEF, CNN-based methods [10,11] can better learn features from input multiple-exposure images and fuse them into a pleasing image. However, the fused images often lack image details [12], since spatial information may be lost as features pass through deep layers. Xu et al. [13] proposed a unified unsupervised image fusion network trained based on the importance and information carried by the two input images to generate fusion results. However, these learning-based methods can only produce a fused image that is an interpolation of the two input images. They cannot deal with cases where neither of the input images has highlighted regions/contents.
This paper presents a two-exposure fusion framework that generates a more helpful intermediate virtual image for fusion using the proposed Optimized Adaptive Gamma Correction (OAGC). The virtual image has better contrast, saturation, and well-exposedness, and it is not restricted to being an interpolated version of the two input images. Fusing the input images with their virtual image processed by OAGC works well even when neither input has well-exposed contents or regions.
Figure 1b shows an example in which the proposed framework can still generate a good fusion result when both input images lack highlighted regions (Figure 1a). Our primary contributions are three-fold:
- Our image fusion framework adopting the proposed OAGC can produce better fusion results for two input images with various exposure ratios, even when both of the input images lack well-exposed regions. 
- The proposed framework with OAGC can also adapt to single-image enhancement. 
- We conduct an extensive experiment using a public multi-exposure dataset [14] to demonstrate that the proposed fusion framework performs favorably against state-of-the-art image fusion methods.
2. Related Work
MEF-based methods produce fusion results using a weighted combination of the input images based on each pixel's "well-exposedness". In [15], fusion weight maps were calculated based on the correlation-based match and salience measures of the input images. With the weight maps, one can fuse the input images into one by using the gradient pyramid.
Mertens et al. [6] constructed fusion weight maps based on the contrast, saturation, and exposedness of the input images. Differently from [15], the fusion was performed with the Gaussian and Laplacian pyramids. The problem was that using the smoothed weight maps in fusion often causes halo artifacts, especially around edges. The method proposed in [7] addressed this issue by applying an edge-preserving filter (weighted guided image filtering [16]) to the fusion weight maps. Kou et al. [8] further proposed an edge-preserving gradient-domain guided image filter (GGIF) to avoid generating halo artifacts in the fused image. To extract image details, Li et al. [7] proposed a weighted structure tensor to manipulate the details presented in a fused image. In general, MEF-based methods can generate decent fusion results.
General MEF algorithms [6,8] that require a sequence of images with different exposure ratios as inputs may not work with only two input images. Yang et al. [9] proposed using the MEF algorithm for two-exposure-ratio image fusion, where an intermediate virtual image with a medium exposure is generated to help produce a better fusion result. However, the virtual image's intensity and exposedness are bounded by the two input images, which often fails for cases where the two images are both underexposed or both overexposed. Yang's method [9] can only generate intermediate and fusion results with approximately medium exposure between its two input images. The problem is that the medium exposure between the inputs may still be under- or overexposed, so image fusion will not improve visual quality. We will discuss this issue further in the next section.
In the following paragraphs, we introduce the techniques adopted in the work of Yang et al., including the generation of the virtual image and fusion weights and the multi-scale image fusion. Before continuing, we define several notations used here. Let $\mathbf{I}$ be a color image. We denote $\mathbf{I}^c$ as the color channel $c$, where $c \in \{r, g, b\}$ stands for the red, green, and blue channels. $I^c(x, y)$ represents the pixel located at $(x, y)$, where $1 \le x \le M$ and $1 \le y \le N$. $M$ and $N$ are the image width and height. Let $\mathbf{Y}$ be the luminance component or the grayscale version of $\mathbf{I}$. Note that the values of images in this paper are normalized to $[0, 1]$.
2.1. Quality Measures and Fusion Weight Maps
In HDR imaging, an image taken at a certain exposure may contain underexposed or overexposed regions, which are less informative and should be assigned smaller weights in multi-exposure fusion. The input's contrast, saturation, and well-exposedness determine a pixel's weight at $(x, y)$ [6]. The contrast of a pixel, denoted by $C(x, y)$, is obtained by applying a $3 \times 3$ Laplacian filter to a grayscale version of the image. Let $\mathbf{C}$ be the map of the contrast of $\mathbf{I}$; therefore,

        $C(x, y) = \left| 4Y(x, y) - Y_l(x, y) - Y_r(x, y) - Y_u(x, y) - Y_d(x, y) \right|$,        (1)

where $Y_l$, $Y_r$, $Y_u$, and $Y_d$ are obtained from $\mathbf{Y}$ by shifting it one pixel left, right, up, and down, respectively. The saturation of the pixel, denoted by $S(x, y)$, is obtained by computing the standard deviation across the red, green, and blue channels:

        $S(x, y) = \sqrt{\tfrac{1}{3} \sum_{c \in \{r, g, b\}} \left( I^c(x, y) - \mu(x, y) \right)^2}$,        (2)

where

        $\mu(x, y) = \tfrac{1}{3} \sum_{c \in \{r, g, b\}} I^c(x, y)$.

The well-exposedness of the pixel, $E(x, y)$, is defined as:

        $E(x, y) = \prod_{c \in \{r, g, b\}} \exp\left( -\frac{\left( I^c(x, y) - \mu_e \right)^2}{2 \sigma_e^2} \right)$,        (3)

where $\mu_e = 0.5$ and $\sigma_e = 0.2$. Essentially, $E$ is a normal distribution centered at 0.5 with a standard deviation of 0.2. The maps of saturation and well-exposedness of $\mathbf{I}$ can, respectively, be represented as $\mathbf{S}$ and $\mathbf{E}$. Next, the weight of the pixel for fusion is computed using:

        $W(x, y) = C(x, y)^{\omega_C} \times S(x, y)^{\omega_S} \times E(x, y)^{\omega_E}$,        (4)

where $\omega_C$, $\omega_S$, and $\omega_E$ can be adjusted to emphasize or ignore one or more measures. Considering a set of $P$ images $\{\mathbf{I}_k\}_{k=1}^{P}$ for image fusion, the weight of this pixel in the $k$-th image is normalized by the sum of the weights across all the images at the same pixel:

        $\hat{W}_k(x, y) = W_k(x, y) \, / \, \sum_{j=1}^{P} W_j(x, y)$.        (5)

The weight map of the image $\mathbf{I}_k$ is represented as $\hat{\mathbf{W}}_k$.
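To make Equations (1)-(5) concrete, the following is a minimal NumPy/OpenCV sketch of the quality measures and the normalized fusion weights. The function name and the default exponents are ours; the small constant added before normalization only avoids division by zero.

```python
import numpy as np
import cv2  # only used for the 3x3 Laplacian; any convolution routine would do

def fusion_weights(images, w_c=1.0, w_s=1.0, w_e=1.0, mu_e=0.5, sigma_e=0.2):
    """Per-pixel fusion weights from contrast, saturation, and well-exposedness.

    `images` is a list of HxWx3 float arrays normalized to [0, 1]. The measures
    follow Equations (1)-(3) and the weights follow Equations (4)-(5); the
    default exponents are illustrative.
    """
    weights = []
    for img in images:
        gray = cv2.cvtColor(img.astype(np.float32), cv2.COLOR_RGB2GRAY)
        # (1) contrast: magnitude of the 3x3 Laplacian response
        contrast = np.abs(cv2.Laplacian(gray, cv2.CV_32F, ksize=1))
        # (2) saturation: per-pixel standard deviation over R, G, B
        saturation = img.std(axis=2)
        # (3) well-exposedness: Gaussian distance of every channel from mid-gray
        well_exposed = np.exp(-((img - mu_e) ** 2) / (2 * sigma_e ** 2)).prod(axis=2)
        # (4) combined per-pixel weight
        weights.append((contrast ** w_c) * (saturation ** w_s) * (well_exposed ** w_e))
    weights = np.stack(weights)                          # shape: P x H x W
    weights += 1e-12                                     # avoid division by zero
    return weights / weights.sum(axis=0, keepdims=True)  # (5) normalization
```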
2.2. Multi-Scale Fusion
In the MEF algorithm [6], a fusion image, $\mathbf{F}$, is obtained through multi-scale image fusion based on the standard Gaussian and Laplacian pyramids. For each input image $\mathbf{I}_k$ in the set $\{\mathbf{I}_k\}_{k=1}^{P}$, the Laplacian pyramid, $L\{\mathbf{I}_k\}^l$, and the Gaussian pyramid of its weight map, $G\{\hat{\mathbf{W}}_k\}^l$, in the $l$-th level are constructed by applying the Gaussian pyramid generation [17]. In this level, the overall Laplacian pyramid is obtained by performing weighted averaging on the Laplacian pyramids from all of the input images in the set:

        $L\{\mathbf{F}\}^l = \sum_{k=1}^{P} G\{\hat{\mathbf{W}}_k\}^l \odot L\{\mathbf{I}_k\}^l$,        (6)

where $\odot$ denotes element-wise multiplication. Finally, the fusion image, $\mathbf{F}$, is reconstructed by collapsing the Laplacian pyramids $L\{\mathbf{F}\}^l$.
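The multi-scale blending of Equation (6) can be sketched as follows, assuming inputs of equal size and the normalized weights from the previous sketch; note that the weight maps here are used unfiltered, whereas [9] first smooths them with the GGIF discussed next.

```python
import numpy as np
import cv2

def gaussian_pyramid(img, levels):
    pyr = [img]
    for _ in range(levels - 1):
        pyr.append(cv2.pyrDown(pyr[-1]))
    return pyr

def laplacian_pyramid(img, levels):
    gp = gaussian_pyramid(img, levels)
    lp = [gp[l] - cv2.pyrUp(gp[l + 1], dstsize=(gp[l].shape[1], gp[l].shape[0]))
          for l in range(levels - 1)]
    lp.append(gp[-1])  # the coarsest level keeps the Gaussian residual
    return lp

def multiscale_fusion(images, weights, levels=5):
    """Blend the Laplacian pyramids of the inputs with the Gaussian pyramids of
    their normalized weight maps (Equation (6)) and collapse the result."""
    fused = None
    for img, w in zip(images, weights):
        lp = laplacian_pyramid(img.astype(np.float32), levels)
        gw = gaussian_pyramid(w.astype(np.float32), levels)
        blended = [g[..., None] * l for g, l in zip(gw, lp)]
        fused = blended if fused is None else [f + b for f, b in zip(fused, blended)]
    out = fused[-1]  # collapse: start at the coarsest level and add details back
    for l in range(levels - 2, -1, -1):
        out = cv2.pyrUp(out, dstsize=(fused[l].shape[1], fused[l].shape[0])) + fused[l]
    return np.clip(out, 0.0, 1.0)
```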
Applying edge-preserving filtering to preserve edges in the weight maps before averaging the Laplacian pyramids in Equation (6) can reduce halo artifacts in fused images. In [9], the GGIF [18] was adopted to smooth the weight maps $G\{\hat{\mathbf{W}}_k\}^l$ while preserving significant changes as well. Let $\Omega_{r_1}(x, y)$ be the square local patch with a radius of $r_1$ centered at $(x, y)$, and let $(x', y')$ be a pixel in the patch. In $\Omega_{r_1}(x, y)$, the weight map in the $l$-th level of the $k$-th image, $G\{\hat{\mathbf{W}}_k\}^l$, is the linear transform of the luminance component, $\mathbf{Y}$:

        $G\{\hat{\mathbf{W}}_k\}^l(x', y') = a \, Y(x', y') + b, \quad (x', y') \in \Omega_{r_1}(x, y)$,        (7)

where $a$ and $b$ are the coefficients and are assumed to be constant in $\Omega_{r_1}(x, y)$. $a$ and $b$ can be obtained by minimizing the objective function:

        $E(a, b) = \sum_{(x', y') \in \Omega_{r_1}(x, y)} \left[ \left( a \, Y(x', y') + b - G\{\hat{\mathbf{W}}_k\}^l(x', y') \right)^2 + \lambda a^2 \right]$,        (8)

where $\lambda$ is a constant for regularization. The variance of the intensities within this local patch, $\sigma_{r_1}^2(x, y)$, is computed when solving for the coefficients in Equation (8).
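For reference, the following sketch implements the basic (non-weighted, non-gradient-domain) guided image filter in its box-filter form, which computes the patch-wise linear coefficients described by Equations (7) and (8); the GGIF discussed next additionally makes the regularization edge-aware and is not reproduced here. The function name and defaults are ours.

```python
import numpy as np
import cv2

def guided_filter(guide, src, radius=8, eps=1e-3):
    """Basic guided image filter in box-filter form: `src` (a weight map) is
    smoothed while following the edges of `guide` (the luminance). This is the
    plain GIF; the GGIF used in [9] makes the regularization edge-aware."""
    guide = guide.astype(np.float32)
    src = src.astype(np.float32)
    ksize = (2 * radius + 1, 2 * radius + 1)
    mean = lambda x: cv2.boxFilter(x, cv2.CV_32F, ksize)

    mean_g, mean_s = mean(guide), mean(src)
    var_g = mean(guide * guide) - mean_g * mean_g     # patch variance (Equation (8))
    cov_gs = mean(guide * src) - mean_g * mean_s

    a = cov_gs / (var_g + eps)                        # linear coefficients of
    b = mean_s - a * mean_g                           # Equation (7)
    return mean(a) * guide + mean(b)                  # coefficients averaged over patches
```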
In GGIF, a $3 \times 3$ local window, $\Omega_1(x', y')$, is applied to the pixels within $\Omega_{r_1}(x, y)$ for capturing the structure within $\Omega_1(x', y')$ by computing the variance within $\Omega_1(x', y')$, denoted as $\sigma_1^2(x', y')$ [18]. This local window makes GGIF a content-adaptive filter; thus, GGIF produces fewer halos and better preserves edges than the GIF. In GGIF, the regularization term is designed to yield:

        $E(a, b) = \sum_{(x', y') \in \Omega_{r_1}(x, y)} \left[ \left( a \, Y(x', y') + b - G\{\hat{\mathbf{W}}_k\}^l(x', y') \right)^2 + \frac{\lambda}{\Gamma(x', y')} \left( a - \gamma_0(x', y') \right)^2 \right]$,        (9)

where $\Gamma$ and $\gamma_0$ are computed according to the product of $\sigma_{r_1}$ and $\sigma_1$ (the standard deviations of the pixels within $\Omega_{r_1}(x, y)$ and $\Omega_1(x', y')$), and $\lambda$ is a constant for regularization. The filter coefficients $a$ and $b$ can be solved for by minimizing $E(a, b)$ in Equation (9).
The fused image $\mathbf{F}$ can be obtained by fusing the Laplacian pyramids of the input images taken at different exposures using the weight maps retrieved from the Gaussian pyramids, $G\{\hat{\mathbf{W}}_k\}^l$. Note that the weight maps are filtered using GGIF, as described in Equation (9), to preserve edges.
2.3. Virtual Image Generation
In [9], Yang et al. proposed the modification of two differently exposed images to have the same medium exposure using the intensity mapping function based on the cross-histogram between two images, called the comparagram (Ref. [19]), and fused them to produce an intermediate virtual image. Let $\mathbf{I}_1$ and $\mathbf{I}_2$ be the two input images, and let $f_{12}$ and $f_{21}$ be the intensity mapping functions (IMFs) that map $\mathbf{I}_1$ to $\mathbf{I}_2$ and $\mathbf{I}_2$ to $\mathbf{I}_1$. Based on [19], the IMFs that map the two images to the same exposure, denoted as $\Lambda_1$ and $\Lambda_2$, are computed as

        $\Lambda_1(z) = \tfrac{1}{2} \left( z + f_{12}(z) \right), \qquad \Lambda_2(z) = \tfrac{1}{2} \left( z + f_{21}(z) \right)$,        (10)

where $z$ is a pixel intensity. The two modified images with the same exposure are $\tilde{\mathbf{I}}_1 = \Lambda_1(\mathbf{I}_1)$ and $\tilde{\mathbf{I}}_2 = \Lambda_2(\mathbf{I}_2)$. The desired virtual image $\mathbf{I}_v$ is computed by fusing $\tilde{\mathbf{I}}_1$ and $\tilde{\mathbf{I}}_2$ using the weighting functions adopted in [9]. The two-exposure-fusion image in [9] is obtained by fusing $\mathbf{I}_1$, $\mathbf{I}_2$, and $\mathbf{I}_v$ based on the MEF algorithm [8].
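The IMFs themselves can be estimated directly from image statistics; one common approximation matches the cumulative histograms of the two exposures. The sketch below only illustrates this idea; it is not necessarily the exact comparagram-based estimation used in [9,19], and the function name and bin count are ours.

```python
import numpy as np

def intensity_mapping_function(src, dst, bins=256):
    """Estimate an IMF that maps intensities of `src` to those of `dst` by
    matching their cumulative histograms (a sketch of the idea behind [19];
    the comparagram-based estimation in [9] may differ in detail)."""
    src_hist, _ = np.histogram(src.ravel(), bins=bins, range=(0.0, 1.0))
    dst_hist, _ = np.histogram(dst.ravel(), bins=bins, range=(0.0, 1.0))
    src_cdf = np.cumsum(src_hist) / src.size
    dst_cdf = np.cumsum(dst_hist) / dst.size
    levels = np.linspace(0.0, 1.0, bins)
    # for each source level, pick the destination level with the same CDF value
    imf = np.interp(src_cdf, dst_cdf, levels)
    return levels, imf  # lookup table: z -> IMF(z), applied with np.interp
```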
As described previously, Yang's method often fails to produce a satisfying fusion result when the medium exposure between the inputs is still under- or overexposed. The proposed method addresses this issue by improving the contrast, saturation, and well-exposedness of the intermediate virtual image to generate better fusion results under different input conditions.
3. Proposed Method
The algorithm in [9] can work for two images with a large difference between their exposure ratios. In this case, the intermediate virtual image with medium exposure helps bridge the dynamic range gap between the two inputs and thus improves the quality of the fusion result. However, if both inputs are under- or overexposed, the generated virtual image does not help the fusion, and the quality of the fused image is not improved much.
For example, to fuse Figure 2a,b, both of which look overexposed, the virtual image $\mathbf{I}_v$ (Figure 2c) generated by [9] with medium exposure between the inputs is still overexposed and, thus, not helpful for the fusion result (Figure 2e). We propose Optimized Adaptive Gamma Correction (OAGC) to enhance the intermediate virtual image to have better contrast, saturation, and well-exposedness (Figure 2d) so that it can improve the fusion quality and produce a better result (Figure 2f).
In OAGC, we derive an optimal $\gamma$ based on the input's contrast, saturation, and well-exposedness by formulating an objective function based on these image quality metrics, and apply it to the input image using gamma correction. Let $Y(x, y)$ be the luminance of a pixel. One can gamma-correct the image $\mathbf{Y}$ to alter its luminance through the power function as follows:

      $\hat{Y}(x, y) = \alpha \, Y(x, y)^{\gamma}$,        (11)

where $\hat{\mathbf{Y}}$ is the corrected image, $\alpha$ and $\gamma$ are positive scalars, and $\alpha$ is usually set to 1 [20]. Here, the notation $\mathbf{Y}$ in bold represents the entire image, while $Y(x, y)$ stands for the pixel located at $(x, y)$. If $\gamma < 1$, it stretches the contrast of shadow regions (pixel intensities less than the mid-tone of $\mathbf{Y}$), and features in these regions become discernible, whereas if $\gamma > 1$, it stretches the contrast of bright regions (intensities larger than the mid-tone), and features in these regions become perceptible. For $\gamma = 1$, it is a linear mapping.
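A minimal sketch of the power-law correction of Equation (11), with the scaling factor defaulting to 1 as noted above:

```python
import numpy as np

def gamma_correct(luminance, gamma, alpha=1.0):
    """Power-law correction of Equation (11): gamma < 1 brightens shadow
    regions, gamma > 1 expands bright regions, and gamma = 1 is the identity.
    `luminance` is assumed to be normalized to [0, 1]."""
    return alpha * np.power(np.clip(luminance, 0.0, 1.0), gamma)
```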
To derive the optimal gamma, we design an objective function as follows:

      $J(\gamma) = \beta_C \left\| k_C \mathbf{1} - \mathbf{c}_\gamma \right\|_2^2 + \beta_S \left\| k_S \mathbf{1} - \mathbf{s}_\gamma \right\|_2^2 + \beta_E \left\| k_E \mathbf{1} - \mathbf{e}_\gamma \right\|_2^2 + \epsilon \left\| \hat{\mathbf{y}} - \mathbf{y} \right\|_2^2$,        (12)

where $\mathbf{c}_\gamma = \mathrm{vec}(\mathbf{C}_\gamma)$, $\mathbf{s}_\gamma = \mathrm{vec}(\mathbf{S}_\gamma)$, $\mathbf{e}_\gamma = \mathrm{vec}(\mathbf{E}_\gamma)$, and $\hat{\mathbf{y}} = \mathrm{vec}(\hat{\mathbf{Y}})$, where $\mathbf{C}_\gamma$, $\mathbf{S}_\gamma$, and $\mathbf{E}_\gamma$ are the maps of quality measures computed based on the gamma-corrected version of the input image, denoted as $\hat{\mathbf{Y}}$. Here, the virtual image $\mathbf{I}_v$ is used as the input; i.e., $\mathbf{Y}$ is the luminance of $\mathbf{I}_v$. We set $k_C$, $k_S$, and $k_E$ to the upper bounds of the corresponding quality measures: 4 for contrast, 1 for well-exposedness, and the saturation bound derived in Appendix A (refer to Appendix A for the derivations). The term with $\epsilon$ in the objective function prevents the corrected image from deviating from the input too much. Hence, minimizing the objective function $J(\gamma)$ is to maximize all three quality measures: the contrast, saturation, and well-exposedness. $\beta_C$, $\beta_S$, and $\beta_E$ are the weighting factors for the contributions from the different quality measures (independent from $\omega_C$, $\omega_S$, and $\omega_E$ in Equation (4)) and are all set to the same value. $\epsilon$ is a small, fixed scalar in the present study. $\mathbf{1}$ is the vector of 1s, $\mathrm{vec}(\cdot)$ is the vectorization of a matrix, and $\|\cdot\|_2$ represents the 2-norm of a vector. The regularization term is added to avoid possible color distortion caused by gamma correction.
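The sketch below shows one plausible way to evaluate the objective of Equation (12): the quality-measure maps of the gamma-corrected image are pushed toward their upper bounds while a fidelity term keeps the result close to the input. All parameter values and the helper `quality_maps` (which should return the C, S, and E maps of Section 2.1) are placeholders, not the paper's exact settings.

```python
import numpy as np

def oagc_objective(image, gamma, quality_maps, eps=1e-3,
                   beta_c=1.0, beta_s=1.0, beta_e=1.0,
                   k_c=4.0, k_s=0.5, k_e=1.0):
    """One plausible reading of the OAGC objective J(gamma) of Equation (12).

    `quality_maps(img)` must return the contrast, saturation, and
    well-exposedness maps of Section 2.1. The bounds k_c, k_s, k_e and the
    weights beta_*, eps are placeholders (k_s in particular is illustrative).
    """
    corrected = np.power(np.clip(image, 1e-6, 1.0), gamma)   # Equation (11), alpha = 1
    c, s, e = quality_maps(corrected)
    # mean of squares == squared 2-norm up to a constant factor; the scaling
    # does not change the minimizer but keeps gradient magnitudes manageable
    msq = lambda x: float(np.mean(np.square(x)))
    return (beta_c * msq(k_c - c) + beta_s * msq(k_s - s) + beta_e * msq(k_e - e)
            + eps * msq(corrected - image))
```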
The optimal gamma, $\gamma^*$, which aims to increase the contrast, saturation, and well-exposedness simultaneously, can be obtained by minimizing the objective function $J(\gamma)$:

      $\gamma^* = \arg\min_{\gamma} J(\gamma)$.        (13)

Since there is no closed-form solution for Equation (13), we apply gradient descent to iteratively approximate it:

      $\gamma^{(t+1)} = \gamma^{(t)} - \eta \left. \frac{\partial J(\gamma)}{\partial \gamma} \right|_{\gamma = \gamma^{(t)}}$,        (14)

where the gradient $\partial J(\gamma) / \partial \gamma$ is obtained via the chain rule from the derivatives of the vectorized quality-measure maps $\mathbf{c}_\gamma$, $\mathbf{s}_\gamma$, and $\mathbf{e}_\gamma$ and of $\hat{\mathbf{y}}$ with respect to $\gamma$ ($\oslash$ denotes element-wise division in these derivative terms), and $\eta$ is the adjustable learning rate.
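The update of Equation (14) can then be run directly; in the sketch below, the analytic derivative derived in the paper is replaced by a central finite-difference approximation, and the step size, iteration count, and initial gamma are placeholders. It reuses `oagc_objective` from the previous sketch.

```python
def optimize_gamma(image, quality_maps, gamma0=1.0, lr=0.05,
                   steps=50, delta=1e-3):
    """Gradient descent on J(gamma) as in Equation (14), with the derivative
    approximated by central finite differences instead of the closed-form
    expression used in the paper."""
    gamma = gamma0
    for _ in range(steps):
        j_plus = oagc_objective(image, gamma + delta, quality_maps)
        j_minus = oagc_objective(image, gamma - delta, quality_maps)
        grad = (j_plus - j_minus) / (2.0 * delta)
        gamma = max(gamma - lr * grad, 1e-3)   # keep the exponent positive
    return gamma
```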
Figure 3 shows the flowchart of the presented two-exposure image fusion framework, where the two inputs are taken of the same scene at different exposure ratios. The virtual image is first generated using the intensity mapping functions [9]. Next, we solve Equation (12) to find the optimal gamma value $\gamma^*$ for the virtual image, which enhances the contrast, saturation, and well-exposedness of $\mathbf{I}_v$. The final fused image, $\mathbf{F}$, is obtained by applying the MEF algorithm [8,9] to the fusion of the two input images and the gamma-corrected virtual image.
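Finally, the whole pipeline of Figure 3 can be composed from the earlier sketches. Everything below is a rough stand-in: `generate_virtual_image` only averages the two exposure-aligned inputs rather than using the weighting functions of [9], and the optimized gamma is applied per channel instead of to the luminance only.

```python
import numpy as np
import cv2

def quality_maps(img):
    """C, S, and E maps of Section 2.1 for a color image in [0, 1]."""
    gray = img.mean(axis=2).astype(np.float32)
    contrast = np.abs(cv2.Laplacian(gray, cv2.CV_32F, ksize=1))
    saturation = img.std(axis=2)
    well_exposed = np.exp(-((img - 0.5) ** 2) / (2 * 0.2 ** 2)).prod(axis=2)
    return contrast, saturation, well_exposed

def generate_virtual_image(img1, img2):
    """Crude stand-in for the virtual image of [9]: align each input halfway
    toward the other with the IMF sketch of Section 2.3 and average them."""
    virtual = np.zeros_like(img1, dtype=np.float32)
    for c in range(img1.shape[2]):
        lv, imf12 = intensity_mapping_function(img1[..., c], img2[..., c])
        _, imf21 = intensity_mapping_function(img2[..., c], img1[..., c])
        half1 = 0.5 * (img1[..., c] + np.interp(img1[..., c], lv, imf12))
        half2 = 0.5 * (img2[..., c] + np.interp(img2[..., c], lv, imf21))
        virtual[..., c] = 0.5 * (half1 + half2)
    return virtual

def fuse_two_exposures(img_short, img_long, levels=5):
    """End-to-end sketch of the framework shown in Figure 3."""
    virtual = generate_virtual_image(img_short, img_long)        # Section 2.3
    gamma = optimize_gamma(virtual, quality_maps)                 # OAGC, Eqs. (12)-(14)
    virtual = np.power(np.clip(virtual, 1e-6, 1.0), gamma)        # Equation (11)
    inputs = [img_short, img_long, virtual]
    weights = fusion_weights(inputs)                              # Section 2.1
    return multiscale_fusion(inputs, weights, levels=levels)      # Section 2.2
```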