1. Introduction
In outdoor scenes, images captured in adverse weather conditions (fog, haze, rain, etc.) are usually degraded by the opaque medium through which the light diffuses into the atmosphere. This degradation reduces the visibility and contrast of the captured images, because light is scattered and absorbed increasingly as the distance between the scene and the camera grows. These adverse effects hinder most autonomous systems (smart transportation systems, intelligent monitoring systems, self-driving vehicles, etc.), since they require clear input images to extract useful information and perform well. Consequently, designing an efficient haze removal technique is a valuable problem in computer vision and image processing and their applications (for instance, image classification and aerial imagery), and it has been widely studied recently.
As a first line of research, treating the haze phenomenon as a contrast or noise reduction problem, traditional image processing methods were used to attenuate haze in single hazy images. In [1,2,3], the authors attempted to remove haze from single hazy images by using histogram equalization to improve image contrast. Chen et al. [4] proposed homomorphic filtering for the image dehazing task; this method enhances high frequencies and attenuates low frequencies of the hazy image to make it visually understandable. Later, the Retinex theory was adopted for haze removal [5]; it is an illumination compensation method that improves the visual appearance of a hazy image under poor lighting conditions. All these dehazing methods rely on image quality improvements, such as contrast enhancement and edge visibility, and ignore the real degradation mechanism. Despite their simplicity and the enhancement they attain, some haze still remains in the recovered images. These methods are called image enhancement-based dehazing.
In parallel, polarization-based techniques [6,7,8] were introduced to recover the scene radiance of hazy images using multiple images taken with different degrees of polarization. In addition, the authors of [9,10] proposed a dehazing approach based on several images of the same scene captured under different atmospheric conditions. These methods perform well and recover haze-free images. However, their common drawback is that additional information is required, whether extra images or another kind of data, which is hard to provide. This limitation makes the dehazing operation more difficult and sometimes impossible.
Recently, the advent of the widely used physical hazy image degradation model [11] has prompted numerous researchers to exploit it for the image dehazing task. Thus, a Markov Random Field (MRF)-based physical model has been used to dehaze single images, built on the assumption that the local contrast of haze-free images is much higher than that of hazy images. In [12], a local contrast maximization method was proposed to restore images, but it tends to over-saturate the recovered results. Fattal [13], in turn, applied Independent Component Analysis (ICA) to separate the haze layer from the scene objects and then recover hazy images. However, this method fails on images with dense haze and incurs a high time complexity.
To address this problem, He et al. [14] proposed a prior-based dehazing method, the dark channel prior (DCP), to estimate the thickness of haze, grounded in the physical degradation model. Under the dark-pixels assumption, it is one of the most robust approaches in the image dehazing research area. Despite the success achieved by the DCP, it cannot remove haze from sky regions of hazy images and has poor edge consistency.
Many attempts have been made to overcome these deficiencies. In [15], Lin et al. built on the basic structure of the DCP approach to design an effective method for real-time image and video dehazing; they used the guided filter to refine the transmission map and the maximum of the dark channel as the atmospheric light value. Xu et al. [16] combined the dark channel prior with the fast bilateral filter in their dehazing method. In addition, Song et al. [17], Yuan et al. [18], and Hsieh et al. [19] adopted the original DCP in their proposed dehazing methods. However, despite the effectiveness and simplicity of these follow-up methods, their results still suffer from over-enhancement, caused by inaccurate atmospheric light estimation, and from halo effects on the edges. Qingsong et al. [20] proposed a simple dehazing method based on a linear model that recovers the scene depth. However, this method is not efficient enough, particularly for images with thick haze.
More recently, Convolutional Neural Networks (CNNs) have attained great success in several low-level computer vision and image processing tasks, e.g., image segmentation [21,22,23], object detection [24], and image denoising [25]. Many researchers [26,27,28,29] have applied CNNs to explore haze-relevant features deeply. Ren et al. [26] and Rashid et al. [27] proposed multi-scale CNN frameworks that compute the transmission medium directly from the input hazy image. Cai et al. [28] presented an end-to-end CNN framework that estimates the transmission of image pixels within small patches. To ease the learning process, Song et al. [29] added a new ranking layer to the classical CNN framework, which captures structural and statistical attributes jointly.
Despite the impressive achievements of learning-based dehazing techniques, their results still exhibit problems with saturation and naturalness of the recovered image, owing to the limited data available for training. In addition, redundant computations increase the computational complexity, as noted in the conclusion of Song et al. [29]. Moreover, most existing CNN-based dehazing approaches estimate only the transmission medium of hazy images and neglect the atmospheric light estimation step. This amounts to an incomplete treatment of the widely used hazy image degradation model (Equation (1)), which also requires the estimation of another important parameter, the atmospheric light A, absent from their dehazing models.
To tackle the limitations mentioned above, in this paper we propose a new efficient and robust dehazing system that comprises two main stages: an efficient atmospheric light (AL) estimation algorithm, "A-Est", and a cascaded CMTnet-based transmission map estimator composed of two subnetworks, the first estimating the rough transmission medium of the hazy image and the second refining it. These two subnetworks are generated jointly within the proposed CMTnet.
The main contributions of our work can be summarized as follows. First, we propose a new, accurate atmospheric light estimation algorithm, "A-Est", that avoids the over-saturation problem exhibited by many existing dehazing approaches. Second, we build a two-task cascaded MCNN-based transmission map estimator, which directly generates the rough transmission medium of the input hazy image through the first CMCNN subnetwork and then refines it with the second, inspired by Ren et al. [26] and Li et al. [30].
3. Proposed Dehazing Method
In this section, we present our proposed image dehazing system, which overcomes the inherent limitations of existing dehazing methods and aims to improve efficiency, the accuracy of both the transmission map and atmospheric light estimates, and the quality of the recovered results.
Figure 2 illustrates the general process of the proposed method.
The proposed approach has four key components: (1) atmospheric light estimation; (2) transmission map calculation; (3) transmission map refinement; and (4) image reconstruction via the degradation model in Equation (1). For the atmospheric light estimation, we propose a new algorithm based on image blurriness. The transmission map and its refined version are generated by a cascaded multi-scale CNN model as two subnetworks. Each step is explained in detail in the following subsections.
3.1. Atmospheric Light Estimation
In previous dehazing works, most researchers [20,26,28] set the atmospheric light value empirically (as a constant) or estimate it under the assumption of bright pixels within a local patch. However, the brightest pixel can belong to a white object, a gray object, or an extra light source in the scene, which leads to serious estimation errors. Despite the improvements made in several works [35,36], atmospheric light estimation still incurs significant errors in specific cases (e.g., hazy nighttime images).
To further increase the accuracy of the atmospheric light estimate, we propose a new effective algorithm, labeled "A-Est", that avoids the limitations of most previously proposed dehazing methods. The algorithm exploits image blur to estimate the atmospheric light value A accurately, inspired by the fact that haze is one of the main causes of image blur.
Generally, the atmospheric light value of a hazy image is selected from distant scene points with high intensity. Moreover, far scene points belong to the most blurred image region, because the amount of blur in a hazy image increases with scene depth. Hence, it is natural to take the blur amount into account in the atmospheric light estimation.
Algorithm 1 details the proposed procedure. We use a recursive quadtree decomposition (Figure 3 gives an overview), driven by the blur amount measured in each quadrant, to reduce the solution space of the estimated AL value. First, we locate the most blurred region of the hazy image using the computed blur map, and then pick a subset of pixels from this region, whose average gives the estimate. Note that Avg denotes the averaging operator.
Algorithm 1 A-Est
1: Input: Hazy image I, measured blur map B.
2: Output: Estimated atmospheric light A.
3: Function A-Est(I, B)
4:   R ← QuadTree(I, B);
5:   A ← Avg(pixels selected from R);
6:   Return A;
7: End Function
8: Function QuadTree(I, B)
9:   G ← rgb2gray(I);
10:  while the current region is larger than the minimum size do
11:    Divide it into four equal-sized patches;
12:    Calculate the blurriness map of each patch using Equation (3);
13:    Select the patch with the largest blur amount;
14:  end while
15:  Return the selected region;
16: End Function
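As a concrete illustration, the quadtree search of Algorithm 1 can be sketched in Python (a minimal sketch, assuming a precomputed per-pixel blur map and a minimum block size of our choosing; the function names are ours, not the paper's):

```python
import numpy as np

def quadtree_most_blurred(blur_map, min_size=8):
    """Recursively keep the quadrant with the largest mean blur.

    `blur_map` is a 2-D array of per-pixel blur amounts; the search
    stops once a quadrant's side length drops to `min_size`.
    """
    region = blur_map
    r0 = c0 = 0  # top-left corner of the current region in blur_map
    while min(region.shape) > min_size:
        h2, w2 = region.shape[0] // 2, region.shape[1] // 2
        quadrants = {
            (0, 0): region[:h2, :w2],
            (0, w2): region[:h2, w2:],
            (h2, 0): region[h2:, :w2],
            (h2, w2): region[h2:, w2:],
        }
        # pick the quadrant whose mean blur is largest
        (dr, dc), region = max(quadrants.items(), key=lambda kv: kv[1].mean())
        r0, c0 = r0 + dr, c0 + dc
    return r0, c0, region.shape

def estimate_atmospheric_light(image, blur_map, min_size=8):
    """Average the hazy image's pixels inside the most blurred block."""
    r0, c0, (h, w) = quadtree_most_blurred(blur_map, min_size)
    block = image[r0:r0 + h, c0:c0 + w]
    return block.reshape(-1, image.shape[-1]).mean(axis=0)
```

Here the whole most-blurred block is averaged for simplicity; the paper selects a subset of its pixels before averaging.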
Various techniques have been proposed in the literature to measure the amount of blur in an image. Usman et al. [37] discussed and evaluated more than thirty blur measure operators and showed that "Gradient Energy" is among the best in terms of accuracy and precision; we therefore choose it to estimate the initial blur map. The Gradient Energy-based blur map can be expressed as:
GE = Σ_{(x,y)∈Ω} [ I_x(x,y)² + I_y(x,y)² ],
where I_x(x,y) = I(x+1,y) − I(x,y) and I_y(x,y) = I(x,y+1) − I(x,y) are the horizontal and vertical image gradients, and Ω is the evaluation window. Note that the blur amount can be estimated for the whole hazy image, a sliding window, or single pixels.
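A minimal implementation of this blur measure might look as follows (the block size and blockwise aggregation are our assumptions, since the text leaves the window choice open; note that low gradient energy indicates a blurrier region):

```python
import numpy as np

def gradient_energy_map(gray, window=8):
    """Per-block Gradient Energy: sum of squared forward differences.

    Lower energy means fewer strong edges, i.e. a blurrier region.
    `gray` is a 2-D float array; `window` is the block size (assumed).
    """
    ix = np.diff(gray, axis=1)  # horizontal forward difference I_x
    iy = np.diff(gray, axis=0)  # vertical forward difference I_y
    energy = np.zeros_like(gray)
    energy[:, :-1] += ix ** 2
    energy[:-1, :] += iy ** 2
    h, w = gray.shape
    out = np.zeros((h // window, w // window))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            blk = energy[i*window:(i+1)*window, j*window:(j+1)*window]
            out[i, j] = blk.sum()
    return out
```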
For hazy images, the atmospheric light AL is defined as a mixture of haze and incident light, and its value A ranges between 0 and 1. According to Equation (1), by setting A = 0 and A = 1, the restored scene radiance can be deduced as:
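For reference, these two limiting cases follow from the standard degradation model I(x) = J(x) t(x) + A(1 − t(x)) (a hedged reconstruction; the notation matches the surrounding text):

```latex
J(x) = \frac{I(x) - A\bigl(1 - t(x)\bigr)}{t(x)},
\qquad
J(x)\big|_{A=0} = \frac{I(x)}{t(x)},
\qquad
J(x)\big|_{A=1} = \frac{I(x) - 1}{t(x)} + 1 .
```

Since t(x) ≤ 1, dividing by t(x) with A = 0 brightens the result, while A = 1 subtracts the full airlight term and darkens it, which is the behavior described next.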
Equation (4) and Figure 4 indicate that recovering a hazy image with a too-dark atmospheric light A produces an overly bright scene radiance (see Figure 6c), while a too-bright A yields the opposite result (as shown in Figure 6b). Thus, accurate atmospheric light estimation is an essential sub-task of the image dehazing operation. To assess the performance of the "A-Est" algorithm, we present a visual comparison with existing dehazing methods. Table 1 and Figure 5 summarize the AL estimation principle used in each of the compared methods. The visual comparison employs the physical degradation model in Equation (1), where the transmission map is generated using the conventional DCP approach [14].
As shown in Figure 6f, the proposed algorithm estimates the AL value accurately from a hazy image, achieving a result similar to the ground truth. Some color distortions appear in the dehazing results of Zhu's [20] and Salazar-Colores's [35] methods (as shown in Figure 6a,d, the white color of the window becomes blue), caused by inaccurate AL estimation. In addition, Cai's method [28] produces a dim scene radiance, because the estimated A is brighter than that of the ground truth (A = 1). Conversely, Sulami's method [38] produces a bright scene radiance, since the estimated A is darker than that of the ground truth. On the other hand, Haouassi's method provides an AL estimate somewhat similar to the ground truth.
3.2. Transmission Medium Estimation Based on Cascaded Multi-scale CNN
For single image dehazing, estimating the transmission medium is an important task, since it is a crucial component of the hazy image degradation model in Equation (1). Recently, a multitude of strategies has been developed to generate the transmission map directly from a hazy image. In [14], He et al. proposed a handcrafted haze-relevant feature called the Dark Channel, based on an observation about haze-free outdoor images: in most patches, at least one color channel has some pixels with intensity close to zero, except in sky regions. Based on this property, the initial dark channel can be expressed as follows:
J^dark(x) = min_{y∈Ω(x)} ( min_{c∈{r,g,b}} J^c(y) ),
where c represents a color channel of the RGB image and Ω(x) is a local patch centered at pixel x. This prior correlates strongly with the thickness of haze in the image, so the transmission medium can be calculated directly as:
t(x) = 1 − ω min_{y∈Ω(x)} ( min_{c} I^c(y) / A^c ),
where ω is a constant that retains a small amount of haze for natural-looking results.
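The two formulas above can be sketched directly (a simplified illustration; the patch size and the constant ω = 0.95 follow the common defaults of He et al.'s method and are assumptions here):

```python
import numpy as np

def dark_channel(image, patch=15):
    """Min over color channels, then a min-filter over a local patch."""
    mins = image.min(axis=2)                  # per-pixel channel minimum
    pad = patch // 2
    p = np.pad(mins, pad, mode="edge")
    h, w = mins.shape
    out = np.empty_like(mins)
    for i in range(h):
        for j in range(w):
            out[i, j] = p[i:i + patch, j:j + patch].min()
    return out

def transmission_dcp(image, A, omega=0.95, patch=15):
    """t(x) = 1 - omega * dark_channel(I / A); omega keeps a haze trace."""
    return 1.0 - omega * dark_channel(image / A, patch)
```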
Subsequently, to estimate the thickness of haze in a hazy image, Zhu et al. [20] built a new color attenuation model based on saturation and brightness, since saturation decreases as haze concentration increases. Based on this model, the haze concentration can be estimated as:
C(x) = V(x) − S(x),
where S and V are the Saturation and Value components of the image in HSV color space. As C(x) is related to the scene depth d, the transmission medium can be calculated directly using Equation (2).
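This haze-concentration estimate is straightforward to compute from the RGB maxima and minima (a sketch using the standard HSV definitions of S and V):

```python
import numpy as np

def haze_concentration(image):
    """C(x) = V(x) - S(x): brightness minus saturation in HSV.

    V = max(R,G,B); S = (max - min) / max. Dense haze raises V while
    washing out colors (low S), so C grows with haze thickness.
    """
    mx = image.max(axis=2)
    mn = image.min(axis=2)
    v = mx
    s = np.where(mx > 0, (mx - mn) / np.maximum(mx, 1e-8), 0.0)
    return v - s
```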
Despite the great success of these haze-relevant-feature-based methods for removing haze from single images, they fail in particular cases: He's method breaks down in the sky regions of hazy images, and Zhu's model cannot remove thick haze effectively. In contrast, inspired by the high representational power of CNNs, we propose a two-stage cascaded multi-scale CNN architecture to generate accurate transmission maps for hazy images.
Figure 7 illustrates the general mechanism of the proposed cascaded multi-scale CNN for generating transmission maps for hazy images. The proposed cascaded architecture consists of two CMCNN subnetworks, one producing the rough transmission map and the other the refined transmission map t.
3.2.1. Rough Transmission Map Subnetwork CMCNN
This subnetwork is composed of four sequentially cascaded MCNN D-units. Each D-unit is defined as follows:
D-Unit Structure: The overall design of a D-unit is shown in Figure 8 (left); it comprises three multi-scale convolutional layers, each followed by a Bilateral Rectified Linear Unit (BReLU) [28], except for the last convolutional layer. Each convolutional layer fuses convolutions with three different kernel sizes, hence the name multi-scale convolution. Regarding the number of filters, we use 16 filters for the first and second layers and eight filters for the last layer.
The idea of these D-units is inspired by the DenseNet [39] architecture, in which the layers of each D-unit are densely connected: the output feature maps of each layer are concatenated with those of all succeeding layers. Thus, the feature maps of the first layer are concatenated with those of the second and third layers, and likewise the feature maps of the second layer are concatenated with those of the third.
The proposed D-unit-based architecture has significant benefits: it avoids the vanishing-gradient problem of deep CNNs while requiring fewer parameters, and it maximizes the information flow without redundancy in the feature maps.
Multi-scale CNN: Human visual perception is well known to operate as a multi-scale neural system. However, using CNN models to solve low-level computer vision tasks always requires a careful choice of kernel size: a small filter window can capture only fine, local details and neglects larger-scale structures, whereas large kernels capture coarse structures but smooth away fine details. Moreover, stacking many layers of a single filter size produces a deeper CNN architecture, which complicates the computations and slows the training process. Therefore, the success of multi-scale CNN representations [26,28] motivates us to apply multi-scale CNN layers in our work to capture details at both scales. Each convolutional layer is defined as a concatenation of three convolutions with different filter sizes, as shown in Figure 8 (right), and can be expressed as:
F_i = Concat(f_i^(1), f_i^(2), f_i^(3)), i = 1, …, n,
where F_i represents the feature maps generated by the i-th layer, n is the layer depth (n = 3), and f_i^(1), f_i^(2), f_i^(3) are the output feature maps of the three convolutions in each multi-scale CNN layer.
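To illustrate only the concatenation idea, the multi-scale layer can be mimicked with fixed mean filters standing in for learned kernels (the kernel sizes 3/5/7 are an assumption, as the exact sizes are not legible in this copy):

```python
import numpy as np

def mean_filter(img, k):
    """'Same'-padded k-by-k mean filter (stand-in for a learned kernel)."""
    pad = k // 2
    p = np.pad(img, pad, mode="edge")
    out = np.empty_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = p[i:i + k, j:j + k].mean()
    return out

def multi_scale_layer(feature, kernel_sizes=(3, 5, 7)):
    """Concatenate same-sized responses computed at several scales."""
    return np.stack([mean_filter(feature, k) for k in kernel_sizes], axis=-1)
```

Each scale produces a map of the input's size, so the outputs stack cleanly along a channel axis, exactly as the Concat operation above requires.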
BReLU: Recently, most deep learning methods [26,40] have employed ReLU as the nonlinear transfer function to avoid the vanishing-gradient problem of the earlier sigmoid function [41], which makes learning convergence very slow. However, the ReLU function [42] was designed chiefly for classification problems. In this work, we adopt the BReLU function proposed by Cai et al. [28]: a sparse representation that benefits from the local linearity of ReLU while preserving the bilateral bounds of the sigmoid function, which suits the restoration problem. Figure 9 shows the difference between the conventional ReLU and the bilateral ReLU; note that the lower threshold is set to 0, so the output stays within the valid transmission range.
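BReLU itself reduces to a clipping operation; in the sketch below the thresholds default to 0 and 1 (the upper bound of 1 is our assumption, matching the valid transmission range):

```python
import numpy as np

def brelu(x, t_min=0.0, t_max=1.0):
    """Bilateral ReLU: identity inside [t_min, t_max], clipped outside.

    Keeps ReLU's local linearity while bounding the output, which suits
    restoration targets such as a transmission map in [0, 1].
    """
    return np.clip(x, t_min, t_max)
```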
Concatenation: After the first subnetwork generates the rough transmission map, the map is transferred to the second subnetwork as support information, where it is merged with the feature maps extracted by the first D-unit to explore new features. This concatenation improves the second subnetwork's ability to predict an accurate refined transmission map.
3.2.2. Refined Transmission Map via CMCNN Subnetwork
The second subnetwork is designed to remove the blocking artifacts that appear on the edges of the estimated rough transmission map. Its structure is similar to that of the first subnetwork, except for the number of units: as shown in Figure 7, it consists of three D-units that extract features from the input hazy image and are fed with the rough transmission map, which is combined with the image features produced by the first unit. Note that the internal structure of the convolutional layers is also similar to that of the first subnetwork, with a different number of filters (N = 8).
3.2.3. Training of CMT
Training Data: A well-performing network is obtained not only through a good structure and careful implementation but also through the training process. Deep learning networks are generally data-hungry models, yet it is difficult to gather massive training data, especially pairs of real-world hazy images and their true transmission maps. Based on the assumption that the transmission map of an image is constant within small patches, and employing the hazy image formation model (Equation (1), with A = 1), we synthesized a massive number of hazy/clear patch pairs (as shown in Figure 10). First, we collected from the Internet more than 1000 natural clear images of different scenes (mountains, herbs, water, clouds, and traffic). Then, we randomly picked 10 patches from each image, giving 10,000 training patches in total.
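The patch synthesis step can be sketched as follows (a minimal sketch of Equation (1) with A = 1; the range from which t is sampled is our assumption):

```python
import numpy as np

def synthesize_hazy_patch(clear_patch, t=None, A=1.0, rng=None):
    """I = J * t + A * (1 - t), with a constant t per small patch.

    `clear_patch` is a float array in [0, 1]; t is drawn uniformly
    when not given. Returns the hazy patch and its ground-truth t.
    """
    if rng is None:
        rng = np.random.default_rng()
    if t is None:
        t = rng.uniform(0.05, 0.95)
    hazy = clear_patch * t + A * (1.0 - t)
    return hazy, t
```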
To reduce overfitting and make the training process more robust, we added a GaussianNoise layer as the input layer. The amount of noise must be small: given that the input values lie in the range [0, 1], we added Gaussian noise with a mean of 0.0 and a standard deviation of 0.01, chosen arbitrarily.
Training Strategy and Network Loss: Predicting the transmission map requires learning the mapping between a color image and its corresponding transmission medium through supervised learning. The model is optimized by minimizing the loss between the predicted transmission of I(x) and its ground truth, estimating the optimal parameters (weights and biases). The network is fed the training image patches I(x) and updates its parameters iteratively until the loss reaches its minimum. The loss function is a crucial part of a learning model, because it measures how well the model predicts the desired result.
The most widely used loss function in image dehazing is the Mean Squared Error (MSE), which can be expressed as:
L_MSE = (1/M) Σ_{i=1}^{M} || t_i^pred − t_i^gt ||²,
where t_i^pred is the predicted transmission medium, t_i^gt is the ground-truth transmission medium, and M is the number of hazy image patches in the training data.
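The MSE loss is a one-liner in NumPy (a sketch over flat arrays of transmission values):

```python
import numpy as np

def mse_loss(t_pred, t_true):
    """Mean squared error between predicted and ground-truth transmissions."""
    t_pred = np.asarray(t_pred, dtype=float)
    t_true = np.asarray(t_true, dtype=float)
    return ((t_pred - t_true) ** 2).mean()
```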
Although the MSE loss maintains edges and details well during reconstruction, it cannot sufficiently preserve the background texture. Therefore, we also adopt the SSIM loss [43] in this work to assess texture and structure similarity. The structural similarity (SSIM) value at each pixel x can be calculated as:
SSIM(x) = [ (2 μ_x μ_y + C1)(2 σ_xy + C2) ] / [ (μ_x² + μ_y² + C1)(σ_x² + σ_y² + C2) ],
where C1 and C2 are regularization constants derived from the defaults 0.02 and 0.03, μ_x, μ_y, σ_x, and σ_y denote the means and standard deviations of the predicted patch x and the ground-truth patch y, respectively, and σ_xy is their covariance. Eventually, the loss L_SSIM is calculated as:
L_SSIM = 1 − (1/M) Σ_{i=1}^{M} SSIM(t_i^pred, t_i^gt).
For the proposed network CMTnet, the final loss is defined as the aggregation of L_MSE and L_SSIM as follows:
L = L_MSE + L_SSIM.
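The SSIM term and the aggregated loss can be sketched as follows (a global, single-window SSIM for brevity; the unweighted sum is our reading of "aggregation", and the constants follow the defaults quoted in the text):

```python
import numpy as np

def ssim(x, y, k1=0.02, k2=0.03, L=1.0):
    """Global SSIM between two patches (per-pixel windowing omitted).

    k1/k2 follow the constants quoted in the text; L is the dynamic range.
    """
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx**2 + my**2 + c1) * (vx + vy + c2))

def total_loss(t_pred, t_true):
    """Aggregate loss: MSE plus (1 - SSIM), the usual SSIM-loss form."""
    mse = ((t_pred - t_true) ** 2).mean()
    return mse + (1.0 - ssim(t_pred, t_true))
```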