1. Introduction
To increase the reliability of images for CCTV, robot vision, and autonomous driving, and to make full use of object recognition technology, error-free image information must be provided under any lighting or spatial conditions. A color camera uses an infrared cut-off filter, which blocks the infrared signal, to capture a visible light image. An influx of infrared light distorts color information; in particular, because the amount of daytime light is very intense, infrared light can easily saturate the image [1,2]. However, since the infrared image contains detailed information that is not expressed in the visible light image, the visibility of the image can be improved by information fusion. For example, in fog and haze, infrared rays penetrate particles better than visible rays, so more detailed information can be obtained from infrared images [3,4].
Visible and near-infrared (NIR) fusion algorithms have been developed with various methods, e.g., subspace-based methods, multi-scale transforms, and neural networks. Subspace-based methods project a high-dimensional input into a low-dimensional space or subspace; low-dimensional subspace representations can improve generalization. Principal component analysis (PCA) [5], non-negative matrix factorization (NMF) [6], and independent component analysis (ICA) [7] are mainly used for this approach. Multi-scale transform-based methods assume that images can be represented by layers at different levels of granularity. In these methods, the source image is decomposed into several levels, the corresponding layers are fused according to a specific rule, and the reconstructed image is then obtained. Decomposition and reconstruction generally rely on wavelet [8], pyramid [9], and curvelet [10] methods. Neural network-based methods imitate the way the human brain processes neural information and have the advantages of good adaptability and fault tolerance [2,11].
Among the subspace-based methods, Li et al. used low-rank representation to extract features and then reconstructed the fused image using the l1-norm and a max selection strategy [12]. Among the multi-scale transform-based methods, Vanmali et al. used Laplacian–Gaussian pyramids and local entropy to generate weight maps that control local contrast and visibility in the fusion result [13].
Figure 1 shows the image fusion results obtained from visible and NIR images with Vanmali's method. As shown in Figure 1c, NIR images play an important role in image improvement because they contain many edge components of objects, and objects covered by clouds or fog can easily be identified with NIR rays.
However, such a rule-based synthesis method has the problem that, as the image size increases, the amount of computation increases rapidly, resulting in a longer processing time. Therefore, it is not suitable for video processing for object detection and classification in autonomous driving, which requires high processing speed at high resolution.
With the rise of deep learning, various image synthesis methods based on deep learning have been proposed. A convolutional neural network (CNN) obtains image features by repeatedly applying filters to the entire input image and can synthesize images based on these features. A CNN-based image synthesis method does not become slower than the rule-based synthesis method even as the resolution of the input image increases, so it can remedy this drawback of the rule-based approach [14]. On the other hand, deep learning requires a training dataset containing a large number of captured images, and it is difficult to obtain enough visible and infrared image pairs of the same scene to train a fusion model with a high degree of improvement. This problem matters for deep learning-based synthesis methods because, if the training dataset is insufficient, training is not performed properly, which lowers the quality of the fused image.
To solve this problem, this paper presents a method for securing an effective training dataset and a classification method for obtaining a high-quality fused image. It also describes how to train and fuse with the proposed model, which is based on DenseFuse [15], a CNN-based image fusion method. We propose a novel feature map fusion method that improves synthesis performance by tuning the training network of DenseFuse. Moreover, we propose rapid color image synthesis that combines the proposed feature map fusion method with color space channel separation in the fusion phase. Finally, we conducted experiments to verify and evaluate the performance of the proposed method. The experimental results show that the proposed method is superior to several existing image fusion methods.
3. Proposed Methods
In this paper, we propose a method for securing scarce visible and NIR image datasets and a dataset selection method for effective training, as well as a training model and a fusion scheme that reconstruct an excellent fused image from the selected dataset. First, the target images used for learning are generated from the visible and NIR image datasets, the RGB-NIR scene dataset [21] from EPFL and the sensor multi-spectral image dataset (SSMID) [22], through Vanmali’s fusion method [13]. Then, the dataset used for training is selected by comparing the luminance and variance values of the visible and NIR images. By training the proposed model on the selected dataset, high-quality visible and NIR fused images are obtained.
3.1. Visible and NIR Image Pair Generation
From the visible and NIR image datasets, multiple 256 × 256 patches, which match the input image size of the proposed model's training step, are cropped at regular intervals, and only the luminance channel is kept.
Figure 4 shows the pseudocode of the proposed algorithm. In order to obtain as many images as possible, this pseudocode acquires crops with as little overlap as possible. Figure 5 shows how input images are obtained from a 1024 × 679 image through the proposed algorithm. In this way, we were able to augment the visible and NIR images from 971 to 9883 each. The proposed algorithm is also applied to the previously fused target images to obtain the target images used for training.
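For reference, the following is a minimal Python/NumPy sketch of this kind of tiling: cutting a luminance image into 256 × 256 patches so that the whole image is covered with as little overlap as possible. The exact stride rule of the pseudocode in Figure 4 may differ; this is only an illustration.

```python
import numpy as np

def crop_patches(luma, patch=256):
    """Tile a single-channel (luminance) image into patch x patch crops.

    Crops are placed on a regular grid; only the last row/column of crops
    may overlap, when the image size is not a multiple of the patch size.
    """
    h, w = luma.shape
    ys = list(range(0, h - patch + 1, patch))
    xs = list(range(0, w - patch + 1, patch))
    # Add a final position flush with the border if needed.
    if ys[-1] != h - patch:
        ys.append(h - patch)
    if xs[-1] != w - patch:
        xs.append(w - patch)
    return [luma[y:y + patch, x:x + patch] for y in ys for x in xs]

# Example: a 1024 x 679 image yields 4 x 3 = 12 patches.
patches = crop_patches(np.zeros((679, 1024), dtype=np.float32))
print(len(patches))
```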
3.2. Local Tone-Based Image Classification
The difference image between each visible and NIR image pair obtained by the proposed algorithm is computed using Equation (3). In Equation (3), the visible and infrared images are denoted by $I_{VIS}$ and $I_{NIR}$, respectively, and the difference image is denoted by $I_{diff}$. Using the $I_{diff}$ obtained from Equation (3), the average luminance of $I_{diff}$ is computed with Equation (4). In Equation (4), the average luminance value of $I_{diff}$ is denoted by $\mu_{diff}$, the size of $I_{diff}$ is $M \times N$, and $i$ and $j$ index the rows and columns of $I_{diff}$, respectively. Visible and infrared image pairs whose $\mu_{diff}$ is above a certain value are selected.
The variance grows with the square of the luminance deviations, so the larger the luminance difference between a pixel and its neighboring pixels, the larger the local variance; edges are therefore easy to detect in the variance image. Through the difference between the edges of each image, image pairs that are difficult to select from the tonal difference alone can be selected more clearly by the difference of their variance images. Equation (5) shows how the variance image is obtained with a block-based average operator. In Equation (5), $\mu$ and $\sigma^2$ represent the block mean and variance of the input image, respectively, and $k$ is the block size, which is fixed in the proposed method.
From the variance images obtained with Equation (5), the difference image of the variance images is computed through Equations (6) and (7). Similar to $\mu_{diff}$ in Equation (4), the average value $\mu_{var}$ is obtained using Equation (7), and the visible and infrared image pairs whose average variance-difference value is greater than or equal to a certain value are selected.
Figure 6 shows histograms of the variance difference and luminance difference of the augmented visible and NIR images. The difference values tended to concentrate around specific values, and we determined the classification criteria by visually analyzing the images around those values. In Figure 6, the blue stars indicate the values examined in this analysis, and the red stars mark the criterion values, at which the difference between the visible and NIR images is visually clear. For the luminance difference, it was relatively easy to identify the difference visually, and the value corresponding to the top 36.4% was set as the criterion. For the variance difference, the value corresponding to the top 94.1% was set as the criterion in order to retain as many of the images classified by the luminance difference as possible.
Through this classification, we selected a total of 3431 visible and NIR images for training, respectively. Additionally, among the images not used for training, 30 image pairs were used as the validation set, and 26 image pairs were used as the testing set.
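A minimal sketch of this selection rule is given below. It assumes the difference images are simple absolute differences and that the block-based variance of Equation (5) is computed with a small sliding window; the threshold values (the red-star values in Figure 6) and the function names are placeholders.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def block_variance(img, block=3):
    """Block-based variance: E[x^2] - (E[x])^2 over a local window."""
    mean = uniform_filter(img, size=block)
    mean_sq = uniform_filter(img * img, size=block)
    return mean_sq - mean * mean

def select_pair(vis_luma, nir_luma, lum_thresh, var_thresh, block=3):
    """Keep a visible/NIR pair only if both the mean luminance difference
    and the mean difference of the block-variance images exceed their
    respective criterion values (cf. Equations (3)-(7))."""
    lum_diff = np.abs(vis_luma - nir_luma).mean()
    var_diff = np.abs(block_variance(vis_luma, block)
                      - block_variance(nir_luma, block)).mean()
    return lum_diff >= lum_thresh and var_diff >= var_thresh
```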
3.3. Weighted Training
Figure 7 shows the learning structure of the proposed model. The proposed model splits the channels according to the inputs so that the synthesis structure and the learning structure match. Here, the numbers at the bottom of each convolutional layer represent the filter size and the numbers of input and output feature maps. The proposed model consists of an encoder, a fusion layer, and a decoder. In the encoder network, the inputs go into channel 1, composed of C11, DC11, DC21, and DC31, and channel 2, composed of C12, DC12, DC22, and DC32. To generate a feature map for each input, convolutional layers with the same structure are computed in parallel in each channel. DC11, DC21, and DC31 (or DC12, DC22, and DC32) have a cascade structure, so the useful information of each convolutional layer can be learned without much loss. The generated feature maps are multiplied by appropriate weights in the fusion layer and then fused; the weights are determined as the model is trained.
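As an illustration, the following PyTorch sketch shows one encoder channel with the cascade (dense) connections described above. The filter sizes and feature-map counts (3 × 3 kernels, 16 maps per layer, 64 at the output) follow the DenseFuse encoder and are assumptions here; the actual sizes are those printed in Figure 7.

```python
import torch
import torch.nn as nn

class EncoderChannel(nn.Module):
    """One encoder channel (e.g., C11 + DC11/DC21/DC31): each cascaded
    convolution receives the concatenation of all previous outputs, so
    earlier features are passed on without much loss."""
    def __init__(self, in_ch=1, growth=16):
        super().__init__()
        self.c1 = nn.Conv2d(in_ch, growth, 3, padding=1)
        self.dc1 = nn.Conv2d(growth, growth, 3, padding=1)
        self.dc2 = nn.Conv2d(2 * growth, growth, 3, padding=1)
        self.dc3 = nn.Conv2d(3 * growth, growth, 3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        f1 = self.act(self.c1(x))
        f2 = self.act(self.dc1(f1))
        f3 = self.act(self.dc2(torch.cat([f1, f2], dim=1)))
        f4 = self.act(self.dc3(torch.cat([f1, f2, f3], dim=1)))
        return torch.cat([f1, f2, f3, f4], dim=1)  # 64 feature maps
```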
The proposed method learns the weights to be multiplied by the feature maps for fast image synthesis and optimal synthesized image quality. The learned weights are multiplied by each feature map, and the results are combined by addition. Equation (8) and Figure 8 show the process of weighting and merging the generated feature maps. $w_1$ and $w_2$ denote the learned weights, $\phi_1^k$ and $\phi_2^k$ denote the feature maps generated from each channel, $f^k$ denotes the fused feature map, and $k$ is the index of the feature map.
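A minimal sketch of this weighted-addition fusion, in the spirit of Equation (8): two learnable scalar weights multiply the two encoder outputs before they are added. Using a single scalar per channel (rather than one weight per feature map) is an assumption of this sketch.

```python
import torch
import torch.nn as nn

class WeightedAdditionFusion(nn.Module):
    """Fuse the two encoder outputs as f = w1 * phi1 + w2 * phi2,
    where w1 and w2 are learned jointly with the rest of the network."""
    def __init__(self):
        super().__init__()
        self.w1 = nn.Parameter(torch.tensor(1.0))
        self.w2 = nn.Parameter(torch.tensor(1.0))

    def forward(self, phi1, phi2):
        return self.w1 * phi1 + self.w2 * phi2
```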
A final image is generated through a total of four convolutional layers in the decoder network. The encoder, fusion layer, and decoder are trained to minimize the loss function obtained by comparing the generated output image with the acquired target image. In Equation (9), the total loss function is denoted by $L$; in Equation (10), the pixel loss function is denoted by $L_p$; and in Equation (11), the structural similarity (SSIM) loss function is denoted by $L_{ssim}$, with weight $\lambda$ [14]. Here, $O$ and $T$ represent the output image and the target image, respectively. The pixel loss function is the Euclidean distance between the output image and the target image. $SSIM(\cdot)$ denotes the SSIM operator, which compares the SSIM between two images [23]. In the training phase, since $L_p$ takes values about three orders of magnitude larger than $L_{ssim}$, the weight of $L_{ssim}$ is increased by multiplying it by $\lambda$. In the proposed method, $\lambda$ is set to 1000, which reduces the time consumed in the learning phase [15].
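A sketch of this training loss, assuming the pixel loss is realized as MSE, the SSIM loss as 1 − SSIM, and the SSIM computation is provided by the pytorch_msssim package (the package choice is an assumption, not part of the paper):

```python
import torch.nn.functional as F
from pytorch_msssim import ssim  # assumed SSIM implementation

def fusion_loss(output, target, lam=1000.0):
    """L = L_pixel + lam * L_ssim, with L_ssim = 1 - SSIM(output, target).
    Images are assumed to be scaled to the [0, 1] range."""
    l_pixel = F.mse_loss(output, target)
    l_ssim = 1.0 - ssim(output, target, data_range=1.0)
    return l_pixel + lam * l_ssim
```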
3.4. Image Fusion
Figure 9 shows the fusion method of the proposed model. First, an image to be fused is divided into the luminance channel $l$ and the color channels $a$ and $b$ through CIELAB color space conversion. The CIELAB color space is based on the color perception characteristics of human vision, has excellent color separation, and is widely used in image tone mapping models to preserve and compensate color components [24]. The color channels $a$ and $b$ of the visible image are preserved and used as the color channels of the fused image. The separate luminance channels $l$ of the visible and NIR input images are fed into the corresponding channels of the trained encoder network. The feature maps output by each channel are fused at the ratio learned in the fusion layer, and the fused feature map enters the decoder network to obtain the fused luminance image $l_f$. Finally, $l_f$, $a$, and $b$ are merged into a single CIELAB image, and a color image is obtained through RGB color space conversion.
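A sketch of these color-handling steps using OpenCV's CIELAB conversion; `fuse_luminance` stands in for the trained encoder–fusion–decoder pipeline and is a placeholder, not part of the paper.

```python
import cv2
import numpy as np

def fuse_color(vis_bgr, nir_gray, fuse_luminance):
    """Fuse in CIELAB: only the L channel is fused, the a/b channels of
    the visible image are kept, then the result is converted back to a
    color image."""
    lab = cv2.cvtColor(vis_bgr, cv2.COLOR_BGR2LAB)
    l_vis, a, b = cv2.split(lab)
    l_fused = fuse_luminance(l_vis, nir_gray)          # trained network
    l_fused = np.clip(l_fused, 0, 255).astype(np.uint8)
    fused_lab = cv2.merge([l_fused, a, b])
    return cv2.cvtColor(fused_lab, cv2.COLOR_LAB2BGR)
```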
4. Experimental Results
To compare the training results of the proposed method, the proposed model was trained on datasets prepared under various conditions; various visible and NIR images were then fused, and the similarity between the target image and the resulting image was compared using SSIM values.
Table 1 shows the average of the SSIM values between the fused image and the target image after fusing 26 images from the model trained for each dataset. Lum_var_above refers to a dataset selected from images in which the average values of luminance and variance difference used in the proposed model are above each criterion value. Lum_above refers to a dataset selected by considering only the luminance difference value, and Lum_below refers to a dataset in which the luminance difference average value is less than the criterion value. Finally, Not_considered refers to a dataset that does not consider luminance and variance difference values. From the results, it can be seen that the model trained with Lum_var_above has the highest similarity to the target image.
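For reference, the average SSIM values used for this comparison can be computed as in the following sketch with scikit-image; the library choice is an assumption, as the paper does not state which SSIM implementation was used.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def mean_ssim(fused_images, target_images):
    """Average SSIM between fused results and their targets (grayscale)."""
    scores = [ssim(f, t, data_range=t.max() - t.min())
              for f, t in zip(fused_images, target_images)]
    return float(np.mean(scores))
```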
Additionally, in order to examine how the ratio at which the feature maps are fused in the fusion layer affects image fusion, images were obtained through the proposed model while varying the fusion ratio, and to check whether the performance of the proposed method is superior, the image fusion methods Lowrank [12], DenseFuse [15], and Vanmali [13] were compared using image quality metrics.
Table 2 shows the averages of the quality metric values over a total of 26 visible and NIR fused images for each fusion method. Weighted_addition1 is a model in which only the weight multiplied by the infrared feature map is learned in the fusion layer of the proposed method, and Weighted_addition2 is a model in which the weights multiplied by the infrared feature map and the visible feature map are both trained. There is also a model fused by the addition strategy without training the weights multiplied by the feature maps. Both LPC [25] and S3 [26] evaluate the sharpness of the image. FMIpixel [27] indicates how much information of the two input images is preserved. Qabf [28] evaluates image quality. The cross entropy [29] measures how similar the source image is to the fused image in terms of information content. The average gradient [30] reflects detail and texture in the fused image; a larger average gradient means that more gradient information is contained in the fused image. The edge intensity [31] represents the quality and clearness of the fused image. The spatial frequency [32] metric indicates how rich the edges and textures are as perceived by the human visual system.
Here, Weighted_addition2 has the best score for four quality metrics (LPC, FMIpixel, average gradient, and edge intensity) and the second-best value for the three other metrics (S3, Qabf, and spatial frequency). Thus, among the proposed variants, Weighted_addition2 gives slightly better fusion results. Overall, the proposed method shows only a small improvement of 1% to 2% over the existing methods in the LPC and FMIpixel metrics, but improvements of 5% to 22% in the S3 and Qabf metrics. Additionally, the proposed method shows a 39% improvement in cross entropy compared with the existing methods and improvements of 5% to 18% in average gradient, edge intensity, and spatial frequency. It can be seen that the proposed method acquires a clear, high-quality image with less distortion compared with the other fusion methods.
5. Discussion
For comparison, several images were fused and evaluated with the proposed model and with existing visible and infrared image fusion methods.
Figures 10–16 show the input visible and NIR images and the resulting images. It can be seen that Figure 10e and the target image, Figure 10d, contain the detailed information of the shaded area better than Figure 10b,c. In particular, in Figure 10b, the edges in the shaded area are hardly expressed. In Figure 11c, the detail of the distant mountains does not appear well. In particular, in Figure 11b, not only the detail of the mountains but also the overall quality of the image has deteriorated. In contrast, the result of the proposed method, Figure 11e, is a clear, high-quality image. In the fusion results of Figure 12b,c, the boundaries between the mountains, the trees, and the buildings are poorly expressed. In contrast, the proposed method, Figure 12e, learns the target image, Figure 12d, so the boundaries are well expressed and the visibility is excellent.
In Figure 13d,e, the boundaries between the trees are clear and the detail of the leaves is superior to Figure 13b,c. The result of the proposed method, Figure 14e, and the target image, Figure 14d, contain information on the distant mountains that is not visible in Figure 14b,c. In addition, the proposed method renders the details of the grass more clearly. In Figure 15b, the buildings beyond the glass, which are concealed in the visible image, can be seen clearly, but the detail of the trees, which is detectable in the visible image, is inferior. Figure 15c is blurry overall. However, in Figure 15e, not only the trees but also the buildings beyond the glass are clear. In Figure 16b,c, the boundary between the trees and the structure is blurred, so the structure cannot be clearly distinguished. In Figure 16d,e, the boundary is clear and the image quality is excellent, so the structures can be distinguished well.
Table 3 compares the image fusion processing time of the proposed method with that of the Vanmali fusion method, which is the target image fusion method of the proposed model. For each method, images were fused 10 times at each resolution, and the average processing times were compared. As the resolution increases, the deep learning-based fusion method has a significantly faster processing speed than the rule-based fusion method. Both methods were run on an NVIDIA RTX 2060 GPU with an i5-6500 CPU as a common PC and on an NVIDIA RTX 3090 GPU with an i9-10980XE CPU as a high-performance PC.
6. Conclusions
In this paper, we propose a deep learning-based visible light and near-infrared image synthesis method that reduces the processing time while preserving the excellent synthesis quality of the rule-based image synthesis method. The proposed method learns the excellent detail expression of the rule-based image synthesis method by means of the presented dataset acquisition method and the classification method for effective learning.
The proposed method has been compared with several existing synthesis methods through quantitative evaluation metrics, and the results have improved by 5% to 22% in the S3, Qabf, average gradient, edge intensity, and spatial frequency metrics. In particular, the proposed method shows a 39% improvement in cross entropy compared with the existing methods, and in the visual comparison of the resulting images, the proposed method not only produces excellent results but also achieves synthesized image quality equal to or higher than that of the target image. In addition, by using a deep learning model, the amount of computation is reduced, and the processing speed is three times faster than that of the target image synthesis method. This means that the proposed method is more suitable for video synthesis than existing synthesis methods.