Article

A Lightweight CNN Based on Axial Depthwise Convolution and Hybrid Attention for Remote Sensing Image Dehazing

1 School of Information Engineering, Tarim University, Alar 843300, China
2 Key Laboratory of Tarim Oasis Agriculture, Ministry of Education, Tarim University, Alar 843300, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(15), 2822; https://doi.org/10.3390/rs16152822
Submission received: 23 June 2024 / Revised: 26 July 2024 / Accepted: 30 July 2024 / Published: 31 July 2024
(This article belongs to the Special Issue Deep Learning for Remote Sensing Image Enhancement)

Abstract: Hazy weather reduces contrast, narrows the dynamic range, and blurs the details of remote sensing images. Additionally, color fidelity deteriorates, causing color shifts and image distortion, thereby impairing the utility of remote sensing data. In this paper, we propose a lightweight remote sensing-image-dehazing network, named LRSDN. The network comprises two tailored, lightweight modules arranged in cascade. The first module, the axial depthwise convolution and residual learning block (ADRB), is for feature extraction, efficiently expanding the convolutional receptive field with little computational overhead. The second is a feature-calibration module based on the hybrid attention block (HAB), which integrates a simplified, yet effective channel attention module and a pixel attention module embedded with an observational prior. This joint attention mechanism effectively enhances the representation of haze features. Furthermore, we introduce a novel method for remote sensing hazy image synthesis using Perlin noise, facilitating the creation of a large-scale, fine-grained remote sensing haze image dataset (RSHD). Finally, we conduct both quantitative and qualitative comparison experiments on multiple publicly available datasets. The results demonstrate that the LRSDN algorithm achieves superior dehazing performance with fewer than 0.1M parameters. We also validate the positive effects of the LRSDN in road extraction and land cover classification applications.

1. Introduction

Remote sensing has revolutionized our capacity to perceive and analyze the environment, providing invaluable insights into various phenomena from a distance. However, a significant challenge that often compromises the accuracy and clarity of remote sensing imagery is the presence of haze [1]. The detrimental effects of haze on remote sensing images primarily manifest in several ways. Firstly, haze diminishes image clarity, resulting in blurred object edges, which are difficult to discern. Secondly, haze reduces image contrast, leading to decreased differentiation between objects and making them harder to distinguish. Lastly, haze may induce color distortion in images, thereby compromising the accurate identification of land cover types. The reliability and robustness of numerous remote sensing applications depend on the availability of clear and accurate remote sensing data. Consequently, the presence of haze can severely undermine the efficacy of these applications. Remote sensing image dehazing is an image-restoration task [2], similar to de-raining [3], cloud removal [4], and image super-resolution [5], which aims to recover clear images from degraded images. However, different degradation sources make these tasks inherently specific, e.g., the different physical properties (size and type of particle, concentration, and optical properties, etc.) of rain, cloud, and haze make their solutions often not generalizable. Effective remote sensing image dehazing algorithms can significantly enhance the usability of remote sensing data and play a crucial role in fields such as geographic information extraction, environmental monitoring, urban planning, and disaster assessment.
Haze is a common atmospheric phenomenon, and understanding atmospheric scattering is crucial for accurately modeling and correcting atmospheric effects in remote sensing imagery. Scene objects emit radiance, which interacts with atmospheric particles as it traverses through the atmosphere before reaching imaging sensors. These complex interactions contribute to the formation of hazy images. The primary aspects include the absorption and reflection of radiation by haze particles, leading to attenuation of scene radiance, and atmospheric light scattered in the air, diffusing onto the path of scene radiance and causing a phenomenon known as airlight. The physical process described above can be mathematically expressed using an established atmospheric scattering model [6,7], which is essential for addressing image degradation caused by haze. This model necessitates that algorithms estimate unknown parameters such as atmospheric light and transmission and generate a haze-free image from a given hazy image.
As the atmospheric scattering model presents a challenge due to its inherently underdetermined nature, the estimation of atmospheric light and transmission through supplementary information has become indispensable in traditional image dehazing methods. This supplementary information usually derives from priors based on statistics, observations, and assumptions [8,9,10,11,12]. However, these artificially designed priors tend to exhibit insufficient robustness, leading to algorithm failures in some specific scenes. Data-driven algorithms [13,14,15,16,17,18] represent an alternative approach to tackling the problem of image dehazing. They attempt to leverage the strong representation capability of convolutional neural networks (CNNs) to directly model the latent mapping from hazy images to haze-free ones, but this strategy requires a significant amount of learning samples, and the number of model parameters and computational complexity are often substantial, which makes it challenging to deploy for real-world applications.
In this paper, we present a lightweight neural network model that achieves state-of-the-art dehazing performance with fewer than 0.1M parameters. This efficiency is primarily attributed to two meticulously designed modules within the network. Firstly, we propose a feature extraction module composed of axial depthwise convolutions and residual connections [19]. The residual connections enhance information propagation across different layers of the network, thereby improving the model’s representation capacity and facilitating the backpropagation of gradients, effectively mitigating gradient explosion and vanishing issues during training. Additionally, axial depthwise convolution, compared to conventional convolution operators, not only enlarges the receptive field, but also reduces computational complexity. By computing separately along the height and width dimensions of the input, this approach allows the model to adapt to diverse spatial shapes of image objects, which is particularly beneficial for remote sensing images, where features may vary significantly in size and aspect ratio. Secondly, we design a hybrid attention mechanism that combines channelwise attention and pixelwise attention. This module recalibrates the input feature maps, enhancing relevant information while suppressing redundancy. Furthermore, to address the scarcity of high-quality remote sensing haze image datasets, we propose a hazy image simulation algorithm with controllable haze density and a non-uniform distribution using Perlin noise [20]. This enables the construction of a large-scale and realistic remote sensing hazy image dataset, providing a valuable resource for model training and evaluation. The main contributions of this paper are summarized as follows:
  • We propose a lightweight convolutional neural network for remote sensing image dehazing, comprising specially designed feature extraction and recalibration modules. Experimental results on 11 test sets demonstrate its promising dehazing performance with limited parameters and computational overhead.
  • We introduce a lightweight feature extraction module that utilizes residual connections and a specific type of convolution, called axial depthwise convolution. Compared to traditional convolution operators, this module reduces the number of parameters while expanding the receptive field and effectively capturing the features of diverse ground objects with varying aspect ratios in remote sensing images.
  • We propose a hybrid attention structure that employs channelwise operations to adaptively enhance and suppress features from different channels through learnable weights. Additionally, we observe that the features captured by dehazing models progressively become more representative of haze-free images. Based on this observation, we introduce haze-aware pixel attention to enhance the network’s ability to restore dense haze regions in spatial dimensions, enabling the proposed LRSDN algorithm to effectively address the uneven distribution of haze in degraded images.
  • We present a novel haze synthesis algorithm and leverage it to build and release a large-scale remote sensing haze image dataset, which comprises 42,240 training pairs with a resolution of 1024 × 1024 and six test sets with distinct haze characteristics, each containing 510 samples. Additionally, the dataset is annotated with road-extraction masks, enabling quantitative evaluation for downstream high-level tasks.

2. Materials

In this section, we first introduce the widely used atmospheric scattering model in remote sensing image dehazing. Next, we give a brief overview of two types of existing haze-removal algorithms: prior-based and learning-based dehazing methods.

2.1. Atmospheric Scattering Model

The atmospheric scattering model [6,7] is a mathematical representation of the interaction between light and atmospheric particles as they traverse through the atmosphere. In the field of image dehazing, this model is utilized to describe the degradation process of images taken in hazy weather, providing a theoretical foundation for dehazing algorithm design. The model assumes a homogeneous distribution of media in the atmosphere with isotropic optical properties. It primarily considers two major factors that contribute to image quality degradation during the imaging process: one is the direct attenuation phenomenon due to the absorption and scattering of scene radiance by haze particles, and the other is the superposition of atmospheric ambient light scattered onto the scene irradiance along its propagation direction, known as the airlight or veiling light. Mathematically, the atmospheric scattering model can be expressed as:
I(x) = J(x) t(x) + A (1 − t(x))        (1)
where I(x) represents the observed image, i.e., the degraded image with haze, while J(x) denotes the original irradiance of the scene, i.e., the haze-free image to be restored. The variable x indicates the spatial location of the pixel. The atmospheric light A is generally treated as a global constant, representing the uniform and constant intensity of ambient light due to atmospheric scattering across the entire image. In traditional physics-based image dehazing algorithms [8,9,10,11,12], this parameter is estimated by statistical analysis of the brightest pixels or regions in the image. t(x) represents the ratio of the unscattered and unabsorbed part to the original scene irradiance as it traverses the atmospheric medium, termed the transmission. Under the assumption of a uniform atmosphere, the transmission can be expressed as
t(x) = e^(−β d(x))        (2)
where β is the atmospheric scattering coefficient, which is related to the properties of atmospheric particles and the wavelength of light. In practice, estimating the transmission often requires complex methods, typically involving the analysis of depth information d(x) in various regions of the image. Thus, J(x) t(x) describes the portion of light that reaches the observer from the scene point x.
Using the atmospheric scattering model, we can decompose the observed hazy image I ( x ) into two components: the direct attenuation of the original scene radiance J ( x ) and the superimposed effect of atmospheric light scattering. The goal of dehazing algorithms is typically to estimate the transmission map t ( x ) and atmospheric light A using certain assumptions, statistics, and priors, and then utilize them to recover the haze-free image J ( x ) . Additionally, the atmospheric scattering model can also serve as an important theoretical basis for synthesizing hazy images. By reasonably designing the atmospheric light and transmission map and incorporating them into the model, realistic haze effects can be achieved.
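To make the degradation model concrete, the following minimal NumPy sketch applies Equations (1) and (2) to a clean image; the function name, the synthetic depth map, and the parameter values are illustrative choices rather than settings used in this paper.

```python
import numpy as np

def synthesize_haze(J, d, beta=1.0, A=0.9):
    """Apply the atmospheric scattering model I = J*t + A*(1 - t).

    J    : clean image, float array in [0, 1], shape (H, W, 3)
    d    : relative scene depth map, shape (H, W)
    beta : scattering coefficient controlling haze density
    A    : global atmospheric light
    """
    t = np.exp(-beta * d)              # Equation (2): transmission from depth
    t = t[..., None]                   # broadcast over the color channels
    I = J * t + A * (1.0 - t)          # Equation (1): direct attenuation + airlight
    return np.clip(I, 0.0, 1.0), t.squeeze(-1)

# A constant depth map yields homogeneous haze over the whole scene.
J = np.random.rand(256, 256, 3)
I, t = synthesize_haze(J, d=np.ones((256, 256)), beta=1.0, A=0.9)
```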

2.2. Prior-Based Dehazing Methods

For an extended period, researchers have endeavored to leverage diverse assumptions, observations, and statistical insights as supplementary constraints to address the challenges of the ill-posed atmospheric scattering model, leading to significant advancements in the field. He et al. [8] observed that most local regions in outdoor fog-free images contain pixels with very low intensities in at least one color channel, based on which they proposed a dehazing algorithm using this dark channel prior. Bui and Kim [9] fit the color ellipsoid to haze pixel clusters in the RGB space and utilized the geometric properties of the color ellipsoid to calculate the transmission of the hazy image, thereby restoring the dehazed result without oversaturation. Zhu et al. [10] constructed a linear model to predict the scene depth of haze images, thus estimating transmission maps and haze-free images. Berman et al. [11] proposed a defogging algorithm based on global color clustering. They assumed that the colors of a haze-free image could be well approximated by hundreds of distinct colors forming tight clusters in the RGB space and connecting to lines passing through the atmospheric light, known as haze-lines, which are used to estimate the transmission map. Building upon the dark and bright channel prior, Han et al. [12] proposed a dehazing method based on the local minimal and maximal value prior, which can efficiently restore haze-free remote sensing images. Xu et al. [21] integrated the concept of “virtual depth” into an iterative computing framework, resulting in an effective algorithm for remote sensing image dehazing. Considering the non-uniformity of haze in remote sensing images, Li et al. [22] presented a haze-removal method based on homomorphic filtering and an improved dark channel prior using a sphere model. He et al. [23] assumed that each superpixel in remote sensing images has the same atmospheric light and transmission rate and postulated that the atmospheric light is globally inhomogeneous. They proposed a defogging algorithm for remote sensing images that effectively addresses non-uniform haze. Xie et al. [24] achieved adaptive removal of non-uniform haze for single remote sensing images based on the dark channel saturation prior. Similarly, Ning et al. [25] constructed an effective remote sensing image defogging algorithm through a novel bright–dark prior. These methods depend heavily on the robustness of various assumptions and priors. However, due to the complexity of haze properties, they may not always perform effectively, particularly when confronted with the intricate atmospheric environment of remote sensing imaging.

2.3. Learning-Based Dehazing Methods

With the rapid development of deep learning technology, data-driven dehazing algorithms have emerged continuously. Li et al. [13] introduced AOD-Net, an end-to-end CNN model that can directly reconstruct clear images from hazy ones without estimating transmission maps and atmospheric light. Li and Chen [15] designed a two-stage attention network, FCTFNet, which employs a coarse-to-fine strategy to remove unevenly distributed haze in remote sensing images. Chen et al. [16] introduced GCANet, which employs a smoothed dilation technique to eliminate the gridding artifacts produced by traditional dilated convolution. Liu et al. [17] incorporated a learnable preprocessing module and a multi-scale attention mechanism to enhance the dehazing effect in the GridDehazeNet algorithm. In [18], a temporal information injection network (TIIN) was proposed to improve dehazing performance by fusing temporal information. Jiang et al. [26] introduced a non-uniform remote sensing haze image-restoration method by combining three modules: the asymmetric size feature cascade module, the k-means pixel attention block, and the fast Fourier transform channel attention module. Additionally, Qin et al. [27] and Ma et al. [28] focused on extracting spectral information from remote sensing images and proposed dehazing algorithms tailored to multispectral and hyperspectral images, respectively. Furthermore, vision transformer-based dehazing methods [29,30,31] have been proposed successively. Although these algorithms have achieved significant progress in dehazing tasks, they either lack more realistic remote sensing haze images of diverse imaging scenes to enhance the generalization capability of the algorithms or face challenges in real-time performance and efficient deployment due to the large number of model parameters and excessive computational overhead.
Considering the reduction of the parameter scale, some lightweight dehazing networks have been proposed. Yang et al. [32] proposed a lightweight region-detection network, which models the mapping between the haze image and its transmittance in each image patch, and a novel cross-channel pool module, which can fuse haze-relevant features at multiple scales. Ullah et al. [14] proposed Light-DehazeNet, a lightweight model that jointly estimates the transmittance and ambient light, reconstructing hazy images using an atmospheric scattering model. Li et al. [33] proposed TGL-Net, a lightweight dehazing model based on the guidance by transmission, which employs the transmission extracted by the guided filter and dark channel prior from hazy images as auxiliary information embedded in the training stage of the model, resulting in a significant improvement in the dehazing effect. Li et al. [34] presented a lightweight progressive feedback optimization network that integrates a multi-stream dehazing module and a progressive feedback module for effective restoration of hazy images. Wen et al. [35] introduced RSHNet, an encoder-minimal and decoder-minimal network, which utilizes a novel intra-level transposed fusion module to capture comprehensive context-aware information for feature fusion, and the introduction of the multi-view progressive extraction block further improves the dehazing performance of the method.

2.4. Fundamental Technologies

In this subsection, we briefly describe a classical lightweight design, depthwise separable convolution [36] (DSConv), and the hybrid attention mechanism. DSConv consists of two operations, depthwise convolution (DWConv) and pointwise convolution (PWConv), and significantly reduces the number of parameters and the computation compared to standard convolution. As shown in Figure 1a, DSConv separates the input features along the channel dimension, independently performs a single-channel convolution on each feature channel, and finally concatenates the outputs and performs feature fusion and scaling across channels with a 1 × 1 PWConv.
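As an illustration, a minimal PyTorch sketch of DSConv is given below; the channel sizes and kernel size are arbitrary examples.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable convolution: per-channel DWConv followed by a 1x1 PWConv."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        # Depthwise step: one spatial filter per input channel (groups = in_ch)
        self.dw = nn.Conv2d(in_ch, in_ch, kernel_size,
                            padding=kernel_size // 2, groups=in_ch)
        # Pointwise step: 1x1 convolution fuses and rescales the channels
        self.pw = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pw(self.dw(x))

x = torch.randn(1, 32, 64, 64)
print(DepthwiseSeparableConv(32, 64)(x).shape)   # torch.Size([1, 64, 64, 64])
```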
The hybrid attention mechanism integrates different views of attention for feature enhancement and is widely used in various remote sensing image tasks [37,38,39]. The convolutional block attention module [40] (CBAM) is a typical hybrid attention module, which refines features by applying channel attention and spatial attention in series. As shown in Figure 1b, the channel attention block of CBAM employs max pooling and average pooling layers to extract the global information of each feature channel and then computes the weight of each channel through a shared multilayer perceptron (MLP); the weights are finally multiplied with the original features to obtain the recalibrated features. Thereafter, a similar operation is used in the spatial attention block, but it is performed along the spatial dimension instead of the channel dimension of the feature.
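The following PyTorch sketch illustrates the two CBAM-style attention blocks described above; the reduction ratio and kernel size are common defaults rather than values prescribed here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """CBAM-style channel attention: shared MLP over max- and average-pooled descriptors."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )

    def forward(self, x):
        w = self.mlp(F.adaptive_avg_pool2d(x, 1)) + self.mlp(F.adaptive_max_pool2d(x, 1))
        return x * torch.sigmoid(w)

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention from channelwise max and mean maps."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        mx, _ = x.max(dim=1, keepdim=True)
        avg = x.mean(dim=1, keepdim=True)
        return x * torch.sigmoid(self.conv(torch.cat([mx, avg], dim=1)))

x = torch.randn(1, 64, 32, 32)
y = SpatialAttention()(ChannelAttention(64)(x))   # serial channel then spatial attention
```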

3. Methods

In this section, we will explore the proposed Lightweight Remote Sensing Image Dehazing Network (LRSDN) in detail. Initially, we introduce the overall architecture and design philosophy of this neural network model. Subsequently, we discuss its two crucial modules: the feature extraction module, which comprises the axial depthwise convolution and residual learning block (ADRB), and the hybrid attention block (HAB), which integrates the channel and pixel attention structures in series. Finally, we present the loss function employed during the model’s training process.

3.1. The Lightweight Architecture of the Model

Given the limited hardware resources available in various remote sensing devices, the core idea of this paper is to design a lightweight neural network architecture. Instead of adopting the commonly used U-shaped structure in dehazing networks, we build the network using tailor-made and lightweight modules in series, enabling more efficient utilization of limited computational resources. This approach ensures that the network can be seamlessly deployed on terminal devices while maintaining excellent dehazing performance. In the model design, we deliberately avoided the use of large convolutional kernels and, instead, widely employed 1 × 1 convolutions, which contribute to reducing the number of parameters and computational complexity. Considering the limited depth and concise structure of the model, we meticulously crafted a basic block composed of two core modules, ensuring that the model can learn rich and effective representations of hazy images.
The overall architecture of the LRSDN model is depicted in Figure 2. The network takes a hazy remote sensing image as the input and first passes it through a conventional convolutional neural network structure, including a convolution layer, a batch normalization layer, and an activation operation. This shallow structure rapidly expands the number of channels in the input data, laying the foundation for subsequent feature extraction. Subsequently, the output feature maps from the shallow structure pass successively through several cascaded basic modules, marked as BasicBlock in Figure 2. These basic modules gradually extract the high-level semantic features of the haze image, enabling the model to deeply understand the latent pattern of the haze. Simultaneously, we introduce a hybrid attention mechanism to recalibrate the feature maps to further improve the quality and utilization efficiency of the features. Within the BasicBlock, the ADRB module focuses on extracting features and enhancing the representation capabilities of the model. The HAB module, on the other hand, is responsible for feature recalibration, which enables the model to better cope with complex haze distributions by refining the features. The specific implementations and principles of these two modules will be elaborated in Section 3.2 and Section 3.3, respectively. Finally, the model directly outputs the dehazed result through a 3 × 3 convolutional layer and a Tanh activation.
The computational process of the entire network can be mathematically expressed as follows:
y_1(x) = ConvBlock_S(I(x))        (3)
y_2(x) = BasicBlock_N(y_1(x))        (4)
Ĵ(x) = Tanh(Conv_{3×3}(y_2(x)))        (5)
where I(x) represents the input hazy image, y_1(x) and y_2(x) denote the output features of the shallow convolution block and the intermediate BasicBlock, respectively, and Ĵ(x) represents the final estimated haze-free image of the network. ConvBlock_S refers to the shallow conventional convolution module, consisting of a convolution layer, a batch normalization layer, and an activation operation, serving as one of the most fundamental components of a CNN model. BasicBlock_N denotes the core module proposed in this paper, where N is a hyperparameter related to the model scale, indicating the number of cascaded BasicBlocks. Here, the number of BasicBlocks was experimentally set to 5, and a detailed discussion of the parameter count and computational complexity of the models with different N will be presented in Section 6.4.
It is noteworthy that, except for the final output layer, the activation function adopted in the network is the h-swish function, which is a computationally efficient activation function proposed in MobileNetv3 [41]. The h-swish activation function has demonstrated superior performance compared to ReLU and swish in many computer vision tasks, and it enables faster convergence during network training. Its mathematical expression is as follows:
h-swish(x) = x · ReLU6(x + 3) / 6        (6)
where ReLU6(x) = min(ReLU(x), 6) and ReLU is the linear rectification function.
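For reference, the h-swish activation of Equation (6) can be written in PyTorch as follows; recent PyTorch versions also provide the same function directly as nn.Hardswish.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HSwish(nn.Module):
    """h-swish(x) = x * ReLU6(x + 3) / 6, as in Equation (6)."""
    def forward(self, x):
        return x * F.relu6(x + 3.0) / 6.0

x = torch.linspace(-4.0, 4.0, steps=9)
print(HSwish()(x))
print(nn.Hardswish()(x))   # equivalent built-in module
```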

3.2. Axial Depthwise Convolution and Residual Block

The primary structure of the LRSDN model proposed in this paper consists of serially stacked BasicBlocks. The BasicBlock comprises two core components, where the axial depthwise convolution and residual learning block (ADRB) is responsible for feature extraction, achieving the mapping from the remote sensing hazy image to a high-dimensional feature space. The design principle of the ADRB is to obtain an efficient representation capability with a limited number of parameters, inspired by a previous study [42]. Built upon the depthwise separable convolution, the ADRB significantly reduces the model’s parameter count while maintaining performance. Additionally, we introduce a multi-scale axial depthwise convolution operation (denoted as AxialDWConv) into the ADRB. Unlike the conventional convolution operation, which typically employs a rectangular convolution kernel to obtain a rectangular local receptive field, AxialDWConv combines two 1D convolutions in the horizontal and vertical directions, enabling a larger receptive field with the same number of parameters. Inspired by the vision permutator [43], AxialDWConv considers axial information in the feature maps, forming a larger cross-shaped receptive field, which is better suited for remote sensing images with a large spatial range of imaging, as shown in Figure 3b.
The structure of the ADRB is shown in Figure 3a. Overall, it employs a classic residual learning framework. Mathematically, it can be expressed as follows:
y = ConvBlock(F_in)        (7)
F_out = y + ADWConvBlock(y)        (8)
Here, ConvBlock still represents the standard convolutional block, as defined in Equation (3). F_in and F_out denote the input and output feature maps, respectively. ADWConvBlock signifies the core component of the ADRB, i.e., the multi-scale axial depthwise convolution structure enclosed by a dashed box in Figure 3. In the ADWConvBlock, we achieve multi-scale feature extraction by concatenating three AxialDWConv blocks of different sizes (3 × 3, 5 × 5, and 7 × 7), where the dilation parameters of these blocks are set to 1, 2, and 3, respectively. Each AxialDWConv block is composed of two 1D group convolutions applied in the vertical and horizontal directions.
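The sketch below shows one plausible PyTorch implementation of the AxialDWConv operator and the ADRB of Equations (7) and (8), assuming the three scales are applied in series; the exact layer arrangement, channel handling, and padding are our assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn

class AxialDWConv(nn.Module):
    """Axial depthwise convolution: a (k x 1) vertical and a (1 x k) horizontal
    depthwise convolution, producing a cross-shaped receptive field."""
    def __init__(self, channels, k, dilation=1):
        super().__init__()
        pad = (k // 2) * dilation
        self.vertical = nn.Conv2d(channels, channels, (k, 1), padding=(pad, 0),
                                  dilation=(dilation, 1), groups=channels)
        self.horizontal = nn.Conv2d(channels, channels, (1, k), padding=(0, pad),
                                    dilation=(1, dilation), groups=channels)

    def forward(self, x):
        return self.horizontal(self.vertical(x))

class ADRB(nn.Module):
    """ConvBlock followed by a multi-scale axial depthwise branch added back to its
    own output, i.e., y = ConvBlock(x); out = y + ADWConvBlock(y) (Equations (7)-(8))."""
    def __init__(self, channels):
        super().__init__()
        self.conv_block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.Hardswish(),
        )
        # Three axial scales (3x3, 5x5, 7x7) with dilations 1, 2, 3, cascaded here
        self.adw_conv_block = nn.Sequential(
            AxialDWConv(channels, 3, dilation=1),
            AxialDWConv(channels, 5, dilation=2),
            AxialDWConv(channels, 7, dilation=3),
        )

    def forward(self, x):
        y = self.conv_block(x)
        return y + self.adw_conv_block(y)

x = torch.randn(1, 16, 64, 64)
print(ADRB(16)(x).shape)   # torch.Size([1, 16, 64, 64])
```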

3.3. Hybrid Attention Block

Another core component in the BasicBlock is the hybrid attention block (HAB), a novel attention mechanism architecture that concurrently integrates a simplified channel attention and haze-aware pixel attention, which is depicted in Figure 4. Feature maps often exhibit information redundancy in the channel dimension, and the main idea of channel attention is to re-weight each feature channel based on its importance, thereby enhancing useful channels and suppressing redundant ones. To model the importance between channels, SENet [44] employs a squeeze-and-excitation operation. It squeezes feature information from each channel into a single value through the global average pooling, then models the weight of each channel from these values through two fully connected layers, and finally, multiplies the weights with the corresponding channels to excite the useful feature information. However, this strategy embeds a significant number of fully connected layers in the model, resulting in a dramatic increase in network parameters and computational overhead. Inspired by previous work [45], we introduce a simplified channel attention (SCA) that captures the channel attention using only an average pooling operation and a 1 × 1 pointwise convolution, as illustrated in the upper left part of Figure 4.
Following the SCA module, we introduced a simple pixel attention mechanism, derived from the observed prior of feature maps in dehazing networks. In neural network-based dehazing models, as the network deepens, the feature maps progressively approach a high-dimensional representation of haze-free images. We observed that the feature maps in the shallow layers typically capture low-level features of the input image, such as edges, structures, textures, and colors, which often degrade due to haze. Conversely, deep layers tend to represent semantic information and align more closely with the haze-free ground truth images. Thus, we hypothesize that, in neural networks, the deviation between the deep and shallow feature maps can indicate the location and concentration of the haze of the image. Based on this, we propose a haze-aware pixel attention structure, which recalibrates the features using the spatial weights derived from the deviation between the SCA module’s output and input features.
The mathematical expressions of HAB are presented in Equations (9) and (10), where the former describes SCA and the latter depicts pixel attention, as follows:
SCA(F_in) = F_in ∗ Conv_{1×1}(AvgPooling(F_in))        (9)
F_out = SCA(F_in) ⊗ BN(SCA(F_in) − Conv_{1×1}(F_in))        (10)
where F_in and F_out represent the input features and the recalibrated output features, respectively. SCA(·) denotes the simplified channel attention. ∗ and ⊗ represent the channelwise and elementwise multiplication operations, respectively. BN indicates batch normalization, and Conv_{1×1} denotes the pointwise convolution operator.
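A minimal PyTorch sketch of the HAB following Equations (9) and (10) is given below; interpreting the deviation term as a subtraction between the SCA output and a 1 × 1 projection of the input is our reading of the description, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HAB(nn.Module):
    """Simplified channel attention (SCA) followed by haze-aware pixel attention,
    following Equations (9) and (10)."""
    def __init__(self, channels):
        super().__init__()
        self.sca_conv = nn.Conv2d(channels, channels, 1)  # 1x1 conv on the pooled descriptor
        self.pix_conv = nn.Conv2d(channels, channels, 1)  # 1x1 projection of the input features
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        # SCA: per-channel weights from a globally average-pooled descriptor
        w = self.sca_conv(F.adaptive_avg_pool2d(x, 1))    # shape (B, C, 1, 1)
        sca = x * w                                       # channelwise reweighting
        # Pixel attention: spatial weights from the deviation between the SCA
        # output and a projection of the input (our reading of Equation (10))
        dev = self.bn(sca - self.pix_conv(x))
        return sca * dev                                  # elementwise recalibration

x = torch.randn(2, 16, 64, 64)
print(HAB(16)(x).shape)   # torch.Size([2, 16, 64, 64])
```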

3.4. Loss Function

Many dehazing networks use a combination of loss functions to train neural networks, such as perceptual loss, SSIM loss, MSE loss, and adversarial loss. This combined loss function strategy can enhance model performance, but requires careful fine-tuning, such as balancing the weights of different losses. Currently, there is no solid theoretical guidance for this process, and most studies rely on empirical experiments to find appropriate weights. In this paper, we do not extensively explore the effects of various loss functions, but instead, focus purely on the design of the network architecture. The L2 loss function has been proven to yield promising results in many low-level image processing tasks, so we simply adopted the L2 loss function in our training stage. Compared to the L1 loss function, L2 loss can optimize the network more stably. Moreover, hazy images often contain a significant amount of noise, and the smoothing property of the L2 loss can effectively address this issue. The mathematical expression of the L2 loss function is as follows:
L_2 = ‖Ĵ(x) − J(x)‖_2        (11)
where Ĵ(x) is the haze-free result estimated by the LRSDN model and J(x) is the corresponding ground truth.
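In practice, the objective in Equation (11) is commonly realized with the built-in mean-squared-error loss in PyTorch, for example:

```python
import torch
import torch.nn as nn

# L2 objective between the network prediction and the haze-free ground truth.
criterion = nn.MSELoss()
loss = criterion(torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256))  # (prediction, ground truth)
```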

4. Dataset Construction

In this section, we introduce a haze-image-simulation algorithm based on Perlin noise. Then, we elaborate on the construction of a remote sensing hazy image dataset using the proposed haze-image-simulation algorithm.

4.1. Haze Image Simulation Based on Perlin Noise

Perlin noise [20] is a stochastic noise based on interpolation and smoothing functions, commonly employed to simulate diverse natural textures such as water ripples, flames, turbulence, terrains, and clouds. The fundamental concept of Perlin noise entails creating a smooth and isotropic noise function through a systematic process involving gridwise sampling, gradient computation, and interpolation, thereby overcoming the unnaturalness and lack of controllability inherent in traditional noise functions for texture generation. The generation process of two-dimensional (2D) Perlin noise unfolds in three primary steps: (1) gridwise sampling: the division of the 2D plane into a uniform grid, with random values generated at each grid point; (2) Gradient Calculation: the determination of a gradient vector for each grid point indicating the noise variation direction around that point; and (3) interpolation: the computation of noise values for each pixel using interpolation techniques, with dependencies on surrounding grid points’ noise values and gradient vectors.
Given Perlin noise’s capability to effectively simulate natural textures, we incorporated it into synthesizing remote sensing haze images. Combining it with the atmospheric scattering model, we utilized Perlin noise to generate the scene’s transmission map and randomly generate atmospheric ambient light, to ultimately simulate hazy remote sensing images, as depicted in Figure 5. In this study, we employed the implementation of Perlin noise provided by the open-source project perlin-numpy (https://github.com/pvigier/perlin-numpy, accessed on 8 October 2023) and set the period parameter of the noise along each axis to 8 and the octave parameter to 5, with all other parameters following the default settings. The noise is first normalized to the range [0, 1], and then it is nonlinearly transformed according to the formula t(x) = e^(−n(x)β), where n(x) denotes the normalized noise and β is the parameter controlling the haze concentration; we set β to 0.5, 1, and 3 for generating thin, moderate, and dense haze, respectively. After that, we performed a “crop-and-resize” operation on t(x) to control haze uniformity via the cropping ratio. In Figure 6, the cropping ratios of the transmission maps from rows (2) to (6) are 1/25, 4/25, 9/25, 16/25, and 1 (i.e., no cropping), respectively, whereas the transmission of the homogeneous haze sample is set to the mean value of t(x). As for the global atmospheric light, we randomly generated it based on the haze concentration (ranging within [0.7, 0.8], [0.8, 0.9], and [0.9, 1] for thin, moderate, and dense haze, respectively).
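Under the settings described above, a hazy sample can be synthesized roughly as follows; the use of generate_fractal_noise_2d from the perlin-numpy project, OpenCV for resizing, and the choice of cropping the top-left corner are our assumptions for illustration.

```python
import numpy as np
import cv2                                            # used here only for resizing
from perlin_numpy import generate_fractal_noise_2d   # https://github.com/pvigier/perlin-numpy

def simulate_haze(J, beta=1.0, crop_ratio=4 / 25, a_range=(0.8, 0.9), size=1024):
    """Synthesize a hazy image from a clean one J (float RGB in [0, 1], size x size)."""
    # Fractal Perlin noise with a period of 8 per axis and 5 octaves; for this
    # implementation the image size must be a multiple of 8 * 2**(5 - 1) = 128.
    n = generate_fractal_noise_2d((size, size), (8, 8), octaves=5)
    n = (n - n.min()) / (n.max() - n.min())           # normalize to [0, 1]
    t = np.exp(-n * beta)                             # transmission map from the noise

    # Crop-and-resize controls how non-uniform the haze appears (ratio 1 = no cropping).
    crop = max(1, int(round(size * np.sqrt(crop_ratio))))
    t = cv2.resize(t[:crop, :crop], (size, size), interpolation=cv2.INTER_LINEAR)

    A = np.random.uniform(*a_range)                   # global atmospheric light
    t3 = t[..., None]
    return np.clip(J * t3 + A * (1.0 - t3), 0.0, 1.0)

hazy = simulate_haze(np.random.rand(1024, 1024, 3), beta=3.0, a_range=(0.9, 1.0))
```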
A significant advantage of our haze simulation method lies in its avoidance of extracting haze masks from real-world haze images, which is often challenging and labor-intensive. Instead, our method efficiently generates highly realistic haze masks solely using Perlin noise, seamlessly integrating them as transmission maps into the haze simulation process, i.e., the atmospheric scattering model. Furthermore, leveraging Perlin noise’s fine controllability, we can easily adjust its parameters to manipulate its frequency, amplitude, and level of detail, enabling the generation of haze with varying densities and distributions. Most importantly, we will release all source code of our haze-image-simulation algorithm on the project’s official website to advance the progress of the remote sensing image dehazing task.

4.2. Remote Sensing Haze Image Dataset

In natural scene image dehazing, deep learning algorithms have demonstrated superior performance. However, in the task of remote sensing image haze removal, supervised learning algorithms have been facing challenges due to the scarcity of training data. This is mainly due to the limitation of the satellite revisiting period, which makes it impractical to acquire both haze-free and hazy images of the same remote sensing imaging scene simultaneously. Therefore, we adopted the remote sensing haze-image-simulation algorithm based on Perlin noise proposed in the previous subsection to successfully construct a large-scale remote sensing haze image dataset (RSHD), which will be released on the official website of our project.
The construction of remote sensing haze image datasets typically involves collecting haze-free images first and then synthesizing the corresponding hazy versions using a certain haze image simulation technology to form paired data samples. In the supervised learning framework, the synthesized hazy images are used as the input during the model training stage, while the original haze-free images serve as supervision information to compute the loss in optimization and update the model weights. During the inference stage, the input of the model is still the hazy image, but the clear image is utilized as the ground truth to objectively evaluate the dehazing performance of the algorithm.
In this study, we did not collect clear remote sensing images manually. Instead, we directly utilized the publicly available DeepGlobeRoad dataset [46]. It was released in the Satellite Image Understanding Challenge Session at CVPR 2018, and it comprises 8570 RGB remote sensing images from 6 different countries. The images have high spatial resolutions of 0.3 m and 0.05 m, covering a total land area of 2220 square kilometers. Originally used as a benchmark for the remote sensing-image-road-extraction task, the dataset publicly provides road masks corresponding to all of 6226 training samples, while the annotated labels for the validation and test sets of 2344 images are not publicly released. To meet our research needs, we re-split the DeepGlobeRoad dataset and synthesized haze samples according to the following steps:
  • Test set preparation: We randomly selected 1530 clear remote sensing images and their corresponding road mask files from the training set of the DeepGlobeRoad dataset as synthetic materials for our RSHD test set, denoted as set A.
  • Test set construction: To ensure that the test set adequately covers various haze distribution scenarios, we synthesized two groups of hazy images using the clear remote sensing images from set A: one group consists of 1530 images with uniformly distributed haze, while the other group comprises 1530 images with inhomogeneous haze. Additionally, by adjusting the parameters of the haze image-synthesis algorithm, these images were further divided into categories of dense haze, moderate haze, and thin haze images. Thus, we successfully constructed 6 test sets. Each sample in the test set consists of a triplet of images containing a haze-free image, its corresponding hazy image, and a road mask file. This design not only facilitates a visual evaluation of the quality of haze-free images restored by dehazing algorithms, but also enables the quantitative assessment of the impact of dehazing algorithms on downstream high-level tasks. The synthesized samples are illustrated in Figure 6.
  • Training set construction: We merged the remaining 4696 haze-free remote sensing images from the DeepGlobeRoad training set with the 2344 samples from the test and validation sets to form a new collection, denoted as set B, containing 7040 haze-free images. Then, we synthesized six images with different degrees of haze distribution uniformity and varying haze concentrations for each sample in set B. Ultimately, we obtained a training set comprising 42,240 sample pairs.
Table 1 provides a comprehensive overview of the RSHD constructed in this paper. Additionally, we list several existing remote sensing haze image datasets in this table for comparison purposes. In the names of the 6 test sets of our RSHD, “H” and “IH” describe the haze distribution, representing homogeneous and inhomogeneous haze, respectively, while “T”, “M”, and “D” indicate the haze concentration, standing for thin, moderate, and dense haze. For example, “RSHD-IHD” denotes the test set with inhomogeneous dense haze, with samples depicted in rows (2) to (6) of column (c) of Figure 6. The “#” in column “Spatial Resolution” indicates that these data are not announced in the corresponding dataset. The column “Distribution Diversity” indicates whether the samples exhibit haze with different uniformity. The symbol “√” means that the haze of the samples has various distributions and the uniformity of the haze distribution is controllable, “⨂” implies the same situation, but the uniformity is uncontrollable, while “×” suggests that the samples only contain homogeneous haze. The column “Density Diversity” signifies whether the dataset is further divided based on the haze concentration, and the column “High-level Task” denotes whether the dataset supports quantitative evaluation for downstream tasks.
In comparison to other datasets, the RSHD demonstrates significant advantages in several aspects. Firstly, the RSHD has a higher spatial and pixel resolution, indicating superior image quality and richer details. Secondly, the RSHD contains a larger number of samples compared to existing datasets, providing broader data support for our research and enhancing the generalization capability and stability of dehazing algorithms. Furthermore, the RSHD’s samples exhibit a more diverse range of haze concentrations and distributions due to the finer control granularity of the haze synthesis method. This diversity not only increases the dataset’s challenge, but also makes the synthetic haze image more realistic. It is noteworthy that each sample in the RSHD test set includes a corresponding road mask, facilitating further quantitative assessment of the impact of dehazing algorithms on downstream tasks. To fully validate the effectiveness of the dehazing algorithm proposed in this paper, we conducted comprehensive and meticulous evaluations on all listed datasets.

5. Results

In this section, we first present the implementation details of the proposed LRSDN, followed by the comparison methods and experimental setup. Next, we introduce the evaluation metrics employed in the comparative experiments. Finally, we quantitatively and qualitatively compare the dehazing performance of our algorithm with state-of-the-art methods on multiple datasets.

5.1. Implementation Details

The proposed LRSDN is an end-to-end trainable network that requires no additional multi-stage training strategies. We trained the network from scratch for 150 epochs without pre-training on large image datasets. The initial learning rate was set to 0.0005, and a cosine annealing strategy was employed to adjust the learning rate, with a minimum learning rate of 0.000001. The model was trained using the Adam optimizer with its default settings (β1 = 0.9 and β2 = 0.999). During training, the input images were randomly cropped or resized to a resolution of 256 × 256, and no additional data augmentation techniques were applied. The batch size was 64. We implemented the network using Python 3.11 and PyTorch 2.1 and trained the model on an Nvidia 3090 GPU.
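The training configuration can be reproduced along the following lines; the stand-in model and random tensors below are placeholders so the loop runs end to end, and stepping the cosine annealing schedule once per epoch is our assumption.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in model and data; substitute the real LRSDN and a loader over the RSHD
# training pairs (256 x 256 crops, batch size 64, as reported above).
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.Hardswish(),
                      nn.Conv2d(16, 3, 3, padding=1), nn.Tanh())
loader = DataLoader(TensorDataset(torch.rand(8, 3, 256, 256), torch.rand(8, 3, 256, 256)),
                    batch_size=4, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=150, eta_min=1e-6)
criterion = nn.MSELoss()

for epoch in range(150):
    for hazy, clear in loader:
        optimizer.zero_grad()
        loss = criterion(model(hazy), clear)
        loss.backward()
        optimizer.step()
    scheduler.step()        # cosine annealing of the learning rate once per epoch
```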

5.2. Comparison Algorithms

To validate the effectiveness of the LRSDN model, we compared it with other state-of-the-art dehazing algorithms. These include prior-based image haze-removal methods such as CEP [9], HazeLine [11], EVPM [12], and IDeRs [21]. The prior theories of CEP and HazeLine derive from statistics and assumptions about natural images, while EVPM and IDeRs are specifically designed for remote sensing images. Additionally, we compared the LRSDN model with five deep neural network-based methods: AOD [13], FCTFNet [15], LDN [14], RSHNet [35], and AUNet [50]. These algorithms were selected because they are representative of both traditional prior-based approaches and advanced deep learning methods in the field. The source codes of all algorithms were provided by their respective authors and executed on the same computer to ensure fairness and consistency in the experiments.

5.3. Algorithm Evaluation Metrics

For evaluating dehazing algorithms, a set of standard metrics is typically used to quantify the performance of different methods, including both objective measures and subjective assessments. Visual evaluation, as a commonly used subjective evaluation method, ensures the practical usability of the dehazed images. Objective evaluation methods can be roughly categorized into no-reference methods and full-reference methods. The former is applied in scenarios where the original undegraded image is not available, while the latter is suitable for cases where the ground truth corresponding to the hazy image is known. Since our test datasets contain both hazy images and their reference ground truths, we employed some commonly used evaluation metrics for dehazing algorithms as follows.
Peak Signal-to-Noise Ratio (PSNR): The PSNR measures the ratio between the maximum possible value of a signal and the power of the corrupting noise that affects the fidelity of its representation, and it can be formulated as
PSNR = 20 · log_10 (MAX_I / √MSE)
MSE = (1/N) Σ_{i=1}^{N} (I_i − K_i)²
where MAX_I is the maximum possible pixel value of the image (e.g., 255 for an 8-bit image) and MSE is the mean-squared error between the original and dehazed images. Here, N denotes the total number of pixels in the image, and I_i and K_i represent the i-th pixel values of the original and restored images, respectively. A higher PSNR value indicates a better dehazing capability of the algorithm.
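The PSNR and MSE defined above can be computed directly, for example:

```python
import numpy as np

def psnr(original, restored, max_val=255.0):
    """PSNR in dB between an original image and a dehazed result."""
    diff = original.astype(np.float64) - restored.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")                        # identical images
    return 20.0 * np.log10(max_val / np.sqrt(mse))

print(psnr(np.full((64, 64), 128.0), np.full((64, 64), 130.0)))  # ~42.1 dB
```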
Structural Similarity Index (SSIM): The SSIM assesses the similarity between two images, considering changes in luminance, contrast, and structure, which can be expressed as
SSIM(x, y) = [(2 μ_x μ_y + C_1)(2 σ_xy + C_2)] / [(μ_x² + μ_y² + C_1)(σ_x² + σ_y² + C_2)]
where x and y are the original and dehazed images, μ_x and μ_y are their means, σ_x² and σ_y² are their variances, σ_xy is their covariance, and C_1 and C_2 are constants to stabilize the division. The value of the SSIM ranges from 0 to 1, with higher SSIM values indicating that the restored image is closer to the original image.
Learned Perceptual Image Patch Similarity (LPIPS) [51]: The LPIPS is a perceptual image similarity metric based on deep learning, designed to evaluate the perceptual difference between two images. It extracts feature representations of the images using a pre-trained deep neural network, and then calculates the distance between these features to assess image quality. By comparing image features rather than pixel values, LPIPS can better reflect human visual perception. Compared to traditional pixel-level evaluation metrics, LPIPS more effectively captures perceptual quality differences, so it is widely used in the evaluation of various image enhancement and restoration algorithms. The LPIPS values range from 0 to 1, where 0 indicates minimal perceptual difference between two images, and 1 represents the opposite.
Feature Similarity Index Measure (FSIM) [52]: The FSIM is a feature similarity-based image-quality-assessment metric that primarily uses phase congruency and the gradient magnitude to evaluate image quality. Phase congruency is a feature that reflects the local structural information of an image, while the gradient magnitude captures the edge information. The FSIM exhibits a high correlation with human perception of image quality. The formula of the FSIM is given as
FSIM = [Σ_{i∈Ω} PC_m(i) · S_PC(i) · S_GM(i)] / [Σ_{i∈Ω} PC_m(i)]
where Ω represents the set of all pixels in the image, PC_m(i) is the phase congruency value at pixel i, which reflects the significance of the feature at that location, and S_PC(i) and S_GM(i) are similarity measures based on phase congruency and the gradient magnitude at pixel i, respectively. Similar to the SSIM, the FSIM also ranges from 0 to 1, with higher values indicating better similarity between two images.

5.4. Dehazing on RSHD

We first conducted dehazing experiments on the proposed RSHD and compared the LRSDN with existing state-of-the-art methods using both visual evaluation and quantitative assessments. As described in Section 4.2, the RSHD is a fine-grained remote sensing hazy image dataset synthesized based on Perlin noise with varying haze concentrations and uniformities. The dataset is divided into two parts based on haze uniformity: RSHD-H for homogeneous haze and RSHD-IH for inhomogeneous haze. Each part is further subdivided into three subsets based on haze concentration, “dense”, “moderate”, and “thin”, resulting in six test subsets labeled with the suffixes -D, -M, and -T (i.e., RSHD-HD, RSHD-HM, RSHD-HT, RSHD-IHD, RSHD-IHM, and RSHD-IHT). As listed in Table 1, each of these six test sets contains 510 pairs of hazy and haze-free images with a resolution of 1024 × 1024.
All compared algorithms were executed on the same device for dehazing on these six test sets. The results for the six RSHD subsets are shown in Figure 7. For both homogeneous and inhomogeneous haze, physics model-based methods yielded noticeable color distortions and unnatural saturation. AOD and LDN performed well on homogeneous haze, but struggled with inhomogeneous haze, showing apparent haze residues. FCTFNet, RSHNet, and AUNet removed the haze better, but still lost some texture detail. In contrast, the proposed LRSDN algorithm could effectively restore remote sensing images under various haze conditions, producing satisfactory and clear results.
The quantitative evaluation results for these six test sets are listed in Table 2. Consistent with the visual comparisons, the quantitative evaluation scores of prior-based methods were generally poor, while the proposed LRSDN algorithm achieved the best PSNR, SSIM, and FSIM scores on all datasets. As for the LPIPS index, our model obtained the best scores on the two dense haze test sets, and the second-best scores on the remaining four test sets. The comprehensive subjective and objective evaluations validate the effectiveness of the proposed LRSDN algorithm for the dehazing challenge of remote sensing images.

5.5. Dehazing on Haze1K Dataset

Haze1K [47], a widely used dataset for remote sensing image dehazing, comprises synthetic hazy images, corresponding haze-free images, and SAR images. The original haze-free images are from the GF-2 satellite, while the SAR images are from the GF-3 satellite. Following the conventional practice in optical remote sensing image dehazing, we only utilized the RGB hazy and haze-free images, and did not include SAR images as supplementary information. In the haze image synthesis stage, haze masks extracted from real-world remote sensing hazy images are adopted as transmission maps to generate three data subsets: thick, moderate, and thin haze samples. Each subset contains 320 pairs of samples for training, 35 pairs for validation, and 45 pairs for testing. All images have a resolution of 512 × 512 .
The visual comparison of the dehazing results using various approaches on the Haze1K dataset is illustrated in Figure 8. For the thin haze sample (g), physics-based methods produce over-enhanced results, with over-saturation appearing in the outputs of CEP and HazeLine. Except for the proposed LRSDN model and RSHNet method, remaining haze is present in the bottom-left region of the images restored by other compared methods. For the moderate haze sample (h), all algorithms exhibit good dehazing performance, but with varying degrees of detail loss, as shown in the three magnified rectangular regions of the image. Additionally, the results of CEP and HazeLine are notably dim. For the thick haze image (i), physics-based algorithms fail to effectively remove the haze, leaving numerous haze patches, while neural network methods perform better. However, the results of AOD and LDN exhibit color distortions and noise due to failed texture reconstruction.
The quantitative comparison of different dehazing algorithms on the Haze1K dataset is listed in Table 3. For the PSNR and SSIM metrics, neural network-based algorithms significantly outperform physics-based methods, indicating that learning-based methods can better restore image details and structures. Specifically, our LRSDN algorithm achieves the best PSNR scores on all three test subsets and the best SSIM scores on the thick and thin haze subsets, as well as the third-best SSIM score on the moderate haze test set. As for the LPIPS and FSIM metrics, it consistently achieves the best results.

5.6. Dehazing on RICE Dataset

The RICE [48] dataset is specifically constructed for the removal of thin haze and clouds in remote sensing imagery, which comprises two parts. RICE1 contains 500 pairs of hazy and haze-free images, where 420 pairs form the training set and 80 pairs constitute the test set; RICE2 contains 450 triplets of images, comprising cloud-free images, cloudy images, and corresponding cloud masks, intended for cloud detection and removal tasks. In this study, we utilized only the RICE1 subset. The samples of the RICE1 dataset were sourced from Google Earth. The corresponding haze-covered and haze-free images were obtained by toggling the cloud layer visibility in Google Earth. Then, the obtained images were precisely cropped to a non-overlapping size of 512 × 512 pixels.
The results of the visual evaluation and full-reference quantitative evaluation are presented in Figure 9 and Table 4, respectively. The IDeRs algorithm significantly enhanced the brightness and saturation of the images, and its over-processing resulted in the worst LPIPS, SSIM, and FSIM scores. In contrast, the results of the CEP exhibited insufficient brightness, leading to a very low PSNR value. HazeLine and EVPM also suffered from excessive color and texture enhancement. Both FCTFNet and the LRSDN achieved satisfactory defogging results, and the LRSDN attained the highest PSNR, SSIM, and FSIM scores, where the PSNR score was 9% higher than the second-best score. The comparative experiments on the RICE dataset demonstrate the effectiveness of LRSDN for real-world remote sensing image haze removal.

5.7. Dehazing on RSID Dataset

Recently, researchers proposed the RSID [31] dataset, which comprises 1000 pairs of 256 × 256 -resolution samples, with a 9:1 split between the training and testing sets. The dehazing results of the compared algorithms on the RSID dataset are illustrated in Figure 10. Due to the significant non-uniform distribution of haze in sample (i), traditional algorithms based on the atmospheric scattering model left substantial residual haze, and severe color distortion occurred in the outputs of CEP and IDeRs methods. Moreover, AOD, FCTFNet, and LDN failed to remove the haze in the lower-right corner of the image. The LRSDN algorithm proposed in this paper effectively removed all haze on the degraded image, but introduced a small amount of noise. In sample (m), it achieved clear and natural dehazing results.
Similarly, we also conducted a quantitative comparison of the dehazing performance of various dehazing algorithms using four widely used full-reference assessment indexes, as shown in Table 5. Consistent with the observational results, neural network-based algorithms generally outperform prior-based methods. Notably, our LRSDN algorithm achieved the second-best scores in all four evaluations, while AUNet performed best on the RSID dataset. Both the LRSDN and AUNet worked well to remove the haze in samples of the RSID without residual mist. However, the LRSDN produced some noise in the area of the water surface of the restored image of sample (i).

6. Discussion

In this section, we further discuss the dehazing effectiveness of the LRSDN in comparison to other state-of-the-art methods. Unlike existing studies, we approached this issue from a new perspective, i.e., we measured the effectiveness of dehazing, which is usually a preprocessing operation of a vision system, in terms of the performance of downstream high-level tasks. Additionally, we conducted comparative experiments on the model’s parameter size, computational complexity, and execution time; the results demonstrate that the LRSDN achieves promising dehazing performance while maintaining a lightweight model, facilitating its deployment on edge devices such as drones.

6.1. Road Extraction

Haze removal is an image-enhancement and -restoration technique that typically serves as a preprocessing step for remote sensing image-application systems. Consequently, the outputs of dehazing algorithms are often used as the inputs for downstream tasks in these vision systems. However, existing research generally evaluates the effectiveness of dehazing algorithms based solely on human visual perception without considering the impact of dehazing on subsequent algorithms from the perspective of the entire application system. For instance, some algorithms may produce visually satisfying results, but simultaneously introduce noise that is tolerable to human eyes. However, these subtle disturbances could significantly reduce the robustness of downstream high-level tasks. Therefore, we believe that it is crucial to evaluate the dehazing performance of algorithms from the perspective of high-level tasks. As a result, we conduct a supplementary discussion and experiments in this subsection.
We employed a typical high-level task in remote sensing, road extraction, to test the impact of the restored results of different dehazing algorithms on the performance of downstream tasks. We conducted this experiment on the RSHD proposed in this paper, which was built on the DeepGlobeRoad dataset, a classic road-extraction dataset that provides road masks corresponding to the remote sensing images; thus, the RSHD naturally supports this meaningful exploration. The experimental steps are as follows: (1) Various dehazing algorithms are applied to the six subsets of the RSHD, each with a different haze concentration and distribution. (2) A typical road-extraction algorithm, CoANet [53], is employed to extract roads from the dehazed images obtained in step (1). (3) The average recall, precision, IoU, and F1-scores are calculated for each dehazing algorithm’s output images in the road-extraction task; these are the most commonly used evaluation metrics in road-extraction and image segmentation tasks, and a minimal sketch of this per-image computation is given below.
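The per-image scores in step (3) can be computed directly from the predicted binary road mask and the ground-truth mask. The following is a minimal NumPy sketch, assuming both masks are already binarized; the function name and the epsilon constant are illustrative and are not taken from the CoANet codebase.

```python
import numpy as np

def road_metrics(pred_mask, gt_mask, eps=1e-8):
    """Recall, precision, IoU, and F1 for a binary road mask (1 = road, 0 = background)."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)

    tp = np.logical_and(pred, gt).sum()    # road pixels correctly detected
    fp = np.logical_and(pred, ~gt).sum()   # background predicted as road
    fn = np.logical_and(~pred, gt).sum()   # road pixels that were missed

    recall = tp / (tp + fn + eps)
    precision = tp / (tp + fp + eps)
    iou = tp / (tp + fp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return recall, precision, iou, f1

# The reported scores are averages over all test images, e.g.:
# avg = np.mean([road_metrics(p, g) for p, g in zip(predictions, ground_truths)], axis=0)
```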
The dehazed images of the different algorithms and their corresponding road-extraction results are shown in Figure 11. Due to page limitations, Figure 11 only presents the result images of two test subsets, RSHD-HD and RSHD-IHD, while the quantitative evaluation of the dehazing results is listed in Table 6. We observed some noteworthy phenomena. In sample (I) of Figure 11, judged by human visual perception, the CEP, AUNet, and IDeRs algorithms restored much of the structural information of the hazy images, significantly enhancing their visibility. However, their road-extraction results were extremely poor, missing most of the roads and scoring low in all four quantitative evaluation metrics. This demonstrates the inconsistency between image quality as perceived by human eyes and the robustness of downstream algorithms in the vision system. In the dehazed results for sample (II), although AOD and LDN effectively removed the non-uniform dense haze, they exhibited severe color distortions, leading to suboptimal subsequent road-detection results.
This experiment provides valuable guidance for the development of remote sensing image dehazing tasks, highlighting the importance of considering both human visual quality and the robustness of high-level tasks in dehazing processes. The proposed LRSDN algorithm achieves minimal interference with downstream tasks while ensuring good visual quality, resulting in the best evaluation scores for road extraction.

6.2. Land Cover Classification

The monitoring of land use and land cover is an important application of remote sensing imagery, playing a key role in the management of natural resources and the guidance of governmental policy making. In this subsection, we discuss the impact of haze-degraded, clarity-limited remote sensing images on land cover classification applications and the positive effect of dehazing algorithms on this issue.
In this experiment, we used the DeepGlobe land cover dataset [46] (abbreviated as DeepGlobeLand) and the Perlin-based haze-image-synthesis algorithm proposed in this paper. DeepGlobeLand contains 803 remotely sensed images at a 2448 × 2448 resolution, each annotated with seven landscape classes: urban land, agriculture land, rangeland, forest, water, barren land, and unknown background.
Our experimental steps are as follows: (1) First, 100 images randomly selected from DeepGlobeLand were used as test set A, and the remaining 703 images were used as training set B. (2) The pre-trained DeepLabv3+ [54] model was fine-tuned on set B for land cover classification. (3) Haze was added to the 100 test images in set A using the proposed haze-image-simulation algorithm, and the synthesized hazy images were then restored using several state-of-the-art dehazing algorithms (a simplified sketch of this synthesis step is given below). (4) The DeepLabv3+ model trained in step (2) was used to classify land cover in the images dehazed by each algorithm in step (3), and the average classification scores were calculated separately to quantitatively assess the effect of the different dehazing methods on land classification.
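To illustrate step (3), the sketch below synthesizes a hazy image by combining a clear image with a spatially varying transmission map through the atmospheric scattering model I = J·t + A·(1 − t). For brevity, the multi-octave value noise used here is only a rough stand-in for the Perlin noise employed in our actual simulation algorithm, and all parameter values (octave count, transmission range, atmospheric light) are illustrative.

```python
import numpy as np
import cv2  # used only to upsample the coarse noise grids

def fractal_noise(h, w, octaves=4, seed=0):
    """Multi-octave value noise in [0, 1]; a rough stand-in for Perlin noise."""
    rng = np.random.default_rng(seed)
    noise, amp, total = np.zeros((h, w), np.float32), 1.0, 0.0
    for o in range(octaves):
        grid = rng.random((2 ** (o + 2), 2 ** (o + 2))).astype(np.float32)
        noise += amp * cv2.resize(grid, (w, h), interpolation=cv2.INTER_CUBIC)
        total += amp
        amp *= 0.5
    noise /= total
    return (noise - noise.min()) / (noise.max() - noise.min() + 1e-8)

def synthesize_hazy(clear_img, t_min=0.2, t_max=0.9, A=0.9, seed=0):
    """Apply I = J*t + A*(1 - t) with a noise-driven transmission map t."""
    J = clear_img.astype(np.float32) / 255.0
    t = t_min + (t_max - t_min) * fractal_noise(*J.shape[:2], seed=seed)
    I = J * t[..., None] + A * (1.0 - t[..., None])   # broadcast t over the RGB channels
    return (np.clip(I, 0.0, 1.0) * 255).astype(np.uint8)
```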
For evaluating land classification, we adopted the mean intersection over union (IoU), F1-score, recall, and precision as the metrics. The visual results and quantitative evaluations of the experiment are presented in Figure 12 and Table 7. We observed that the presence of haze greatly impacts the performance of the land classification algorithm. When hazy images are directly used as the input, our trained DeepLabv3+ model fails completely due to the substantial loss of information caused by haze, rendering the model incapable of identifying land types. After preprocessing with a dehazing algorithm, the land classification results for test sample (I) improved significantly. However, for sample (II), FCTFNet [15], LDN [14], RSHNet [35], and AUNet [50] demonstrated limited improvement, whereas the proposed LRSDN algorithm achieved notable gains. This finding aligns with the quantitative evaluation results listed in Table 7, where images dehazed by the LRSDN exhibit a 258% increase in the IoU score and a 209% boost in the F1-score compared to the unprocessed hazy images.
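The class-averaged scores in Table 7 can be accumulated from a pixel-level confusion matrix over the seven DeepGlobeLand classes. A minimal sketch is shown below; the two helper functions are illustrative rather than part of the DeepLabv3+ codebase.

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes=7):
    """Pixel-level confusion matrix (rows: ground truth, columns: prediction)."""
    valid = (gt >= 0) & (gt < num_classes)
    idx = num_classes * gt[valid].astype(int) + pred[valid].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def mean_scores(cm, eps=1e-8):
    """Class-averaged IoU, F1, recall, and precision from a confusion matrix."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    iou = tp / (tp + fp + fn + eps)
    recall = tp / (tp + fn + eps)
    precision = tp / (tp + fp + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return iou.mean(), f1.mean(), recall.mean(), precision.mean()
```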

6.3. Computational Complexity and Execution Time

Finally, we further discuss the execution efficiency and parameter scale of the proposed algorithm. In this experiment, all compared algorithms were executed on the same device equipped with an Intel(R) Core(TM) i7-10700 CPU, 64 GB RAM, a ZHITAI TiPro7000 2 TB solid-state drive (SSD), and the Windows 11 Pro operating system. The DCP [8], CEP [9], HazeLine [11], EVPM [12], IDeRs [21], CAP [10], and SMIDCP [22] algorithms were implemented in MATLAB R2023b, while the AOD [13], GCANet [16], GDN [17], FCTFNet [15], LDN [14], RSHNet [35], AUNet [50], and LRSDN algorithms were implemented in PyTorch and accelerated by an Nvidia GeForce RTX 3090 GPU. We executed each comparison algorithm to remove haze from 100 images with a resolution of 256 × 256 and recorded the average execution time per algorithm. For the deep neural network-based algorithms, we also report their parameter counts and multiply–accumulate operations (MACs). The results are presented in Table 8. It can be observed that our LRSDN algorithm has only 0.087M parameters due to our delicately designed lightweight network modules. Additionally, the LRSDN’s computational complexity is significantly lower than that of GCANet [16], GDN [17], AUNet [50], RSHNet [35], and FCTFNet [15], yet it achieves impressive dehazing performance.
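As a reference for how these statistics can be gathered for the PyTorch models, the sketch below counts parameters, estimates MACs, and times the forward pass on a 256 × 256 input. It assumes the third-party thop package for MAC counting and a CUDA device; the warm-up count and batch size are illustrative choices rather than the exact measurement protocol used here.

```python
import time
import torch
from thop import profile  # third-party package for MAC counting

def complexity_report(model, device="cuda", size=(1, 3, 256, 256), runs=100):
    """Return parameters (M), MACs (G), and average forward time (ms) for a model."""
    model = model.to(device).eval()
    dummy = torch.randn(size, device=device)

    params = sum(p.numel() for p in model.parameters()) / 1e6
    macs, _ = profile(model, inputs=(dummy,), verbose=False)

    with torch.no_grad():
        for _ in range(10):                # warm-up to exclude initialization overhead
            model(dummy)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(runs):
            model(dummy)
        torch.cuda.synchronize()
    avg_ms = (time.time() - start) / runs * 1000
    return params, macs / 1e9, avg_ms
```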

6.4. Ablation Study

To quantitatively analyze the effectiveness of each component of the LRSDN model, we conducted a comprehensive ablation study in terms of model size, hazy-image-restoration quality, and road-extraction performance. The core structure of the LRSDN is the BasicBlock, which consists of the ADRB and HAB. We therefore first replaced the BasicBlock of the LRSDN with a classical ResBlock [19] to obtain a baseline model, denoted as M1. We then added the ADRB to the baseline to build the variant model M2. The other key component of the BasicBlock is the HAB, which comprises simplified channel attention (SCA) and pixel attention (PA), so we added SCA and PA to M2 in turn to build the variant models M3 and M4. Additionally, to investigate the impact of the number of BasicBlocks on dehazing performance, we increased the number of BasicBlocks to 10, forming the variant model M5. The parameter counts, multiply–accumulate operations (MACs), and execution times of the variant models are shown in Table 9. Since both the ADRB and HAB are lightweight modules, they do not significantly increase the computational cost or parameter count. However, when the number of BasicBlocks increases from 5 to 10, the computational complexity and parameter count of the model increase sharply; correspondingly, the run time increases from 7.664 ms to 16.307 ms.
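The following PyTorch sketch illustrates how such ablation variants can be assembled from toggleable components. The module definitions follow only the high-level descriptions in this paper (axial depthwise convolutions with dilation, global-average-pooling channel attention, and a single-channel pixel attention map); layer sizes, names, and the exact internal wiring are simplified assumptions rather than the released LRSDN implementation.

```python
import torch.nn as nn

class AxialDWConv(nn.Module):
    """Cascaded depthwise 1xk and kx1 convolutions: a large receptive field at low cost."""
    def __init__(self, ch, k=5, dilation=2):
        super().__init__()
        pad = dilation * (k - 1) // 2
        self.h = nn.Conv2d(ch, ch, (1, k), padding=(0, pad), dilation=(1, dilation), groups=ch)
        self.v = nn.Conv2d(ch, ch, (k, 1), padding=(pad, 0), dilation=(dilation, 1), groups=ch)

    def forward(self, x):
        return self.v(self.h(x))

class SCA(nn.Module):
    """Simplified channel attention: global average pooling followed by a 1x1 convolution."""
    def __init__(self, ch):
        super().__init__()
        self.pool, self.fc = nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch, 1)

    def forward(self, x):
        return x * self.fc(self.pool(x))

class PA(nn.Module):
    """Pixel attention: a single-channel spatial map that re-weights dense-haze regions."""
    def __init__(self, ch):
        super().__init__()
        self.proj = nn.Sequential(nn.Conv2d(ch, 1, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.proj(x)

class BasicBlockVariant(nn.Module):
    """Toggleable block used to assemble the ablation variants M1-M5."""
    def __init__(self, ch=16, use_adrb=True, use_sca=True, use_pa=True):
        super().__init__()
        self.body = AxialDWConv(ch) if use_adrb else nn.Conv2d(ch, ch, 3, padding=1)
        self.sca = SCA(ch) if use_sca else nn.Identity()
        self.pa = PA(ch) if use_pa else nn.Identity()

    def forward(self, x):
        return x + self.pa(self.sca(self.body(x)))   # residual connection around the block

# e.g., M1: all toggles False (plain residual conv); M2: only use_adrb=True;
# M4: all True with five blocks; M5: all True with ten blocks.
```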
We first assessed the dehazing performance of the variant models in terms of haze-image-restoration quality. We applied these models to all samples from the six test subsets of the RSHD and then calculated their average full-reference evaluation scores, as shown in Figure 13. On top of the baseline, the sequentially embedded ADRB and HAB modules both markedly enhance the dehazing performance, yielding significant increases in the PSNR, SSIM, and FSIM of the restored images, while the LPIPS exhibits a less pronounced decreasing trend. This is because the LPIPS is a learning-based image-evaluation method trained on a large number of samples, which makes it less sensitive to subtle changes in image pixels. In addition, comparing M2, M3, and M4, we found that channel attention brought a significant improvement on haze image datasets of various concentrations and uniformity levels, while pixel attention had a limited effect on images with thin haze (shown as the blue and black dashed lines in Figure 13). This is because pixel attention is designed to capture regions of dense haze, which are not prominent in the thin-haze datasets. In particular, the enhancement of the dehazing effect is more remarkable on the dense haze datasets (RSHD-HD and RSHD-IHD), with an average increase of 20% in the PSNR score.
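For completeness, the sketch below shows how PSNR, SSIM, and LPIPS [51] can be computed for a restored/reference image pair using scikit-image and the lpips package; FSIM [52] is omitted because it has no equally standard Python implementation. The helper names are illustrative, and the exact evaluation settings used for Figure 13 may differ.

```python
import torch
import lpips                                   # perceptual metric of Zhang et al. [51]
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net='alex')             # commonly used AlexNet backbone

def full_reference_scores(restored, reference):
    """PSNR, SSIM, and LPIPS for uint8 RGB arrays of shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(reference, restored, data_range=255)
    ssim = structural_similarity(reference, restored, channel_axis=2, data_range=255)

    def to_tensor(img):                        # LPIPS expects NCHW tensors in [-1, 1]
        t = torch.from_numpy(img).permute(2, 0, 1).float().unsqueeze(0)
        return t / 127.5 - 1.0

    lp = lpips_fn(to_tensor(restored), to_tensor(reference)).item()
    return psnr, ssim, lp
```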
Furthermore, following the experimental setup described in Section 6.1, we evaluated how the dehazed results of each variant model affect the downstream road-extraction algorithm, as depicted in Figure 14. Consistent with the image-quality assessment shown in Figure 13, the models M2, M3, M4, and M5, derived from the incremental expansion of the baseline M1, progressively improve the dehazing performance, especially on the dense haze datasets (RSHD-HD and RSHD-IHD). Accordingly, the images restored by the variant models dramatically improve the performance of the downstream road-extraction algorithm, especially in the case of thick haze, regardless of whether the haze distribution is homogeneous or non-homogeneous. However, the precision of road extraction fluctuates. This is because road extraction can be viewed as a single-category segmentation task in which the number of pixels in the target category (road) is usually much smaller than the number of background pixels; this sample imbalance can make the precision metric unstable, as it is strongly affected by negative samples (background). The ablation experiments demonstrate the effectiveness of the LRSDN’s structural design, which exhibits exceptional dehazing performance while maintaining a lightweight scale and yields high-quality restoration results even for dense haze images. It is worth noting that, although increasing the number of BasicBlocks further improves the dehazing effect, we set the number of BasicBlocks of the LRSDN to five to balance dehazing quality against model size and computational complexity.

6.5. Visualization of the Feature Enhanced by Hybrid Attention Block

The hybrid attention block (HAB) is a core component of the LRSDN designed to enhance feature representation. To visually demonstrate the impact of the HAB, we employed a feature visualization technique to show the features before and after enhancement by the HAB, as shown in Figure 15. The LRSDN consists of five BasicBlocks, each built from an ADRB and an HAB in cascade, and we sequentially visualized the intermediate outputs of the ADRB and HAB within each BasicBlock. Within a single BasicBlock, the HAB captures the non-uniform haze distribution and enhances the feature representation in dense haze regions, which enables our algorithm to cope with non-uniformly distributed haze better than other state-of-the-art methods. From the perspective of the entire model, the shallow layers focus on the global haze distribution, while the deep layers concentrate on specific local regions. This occurs because, with increasing network depth, the details of the haze image are progressively restored, and the model tends to focus on recovering the remaining extreme regions.
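The feature maps in Figure 15 were obtained with a standard forward-hook procedure; a minimal sketch is given below. The dotted module names passed in layer_names are placeholders that depend on how the submodules are registered in the actual implementation, and channel-averaged heatmaps are only one of several common ways to render multi-channel features.

```python
import numpy as np
import torch
import matplotlib.pyplot as plt

def visualize_features(model, hazy_tensor, layer_names):
    """Capture intermediate outputs via forward hooks and plot channel-averaged heatmaps."""
    feats, handles = {}, []
    modules = dict(model.named_modules())
    for name in layer_names:
        handles.append(modules[name].register_forward_hook(
            lambda m, inp, out, n=name: feats.update({n: out.detach()})))

    with torch.no_grad():
        model(hazy_tensor)                    # one forward pass fills the feats dict
    for h in handles:
        h.remove()

    fig, axes = plt.subplots(1, len(layer_names), figsize=(3 * len(layer_names), 3))
    for ax, name in zip(np.atleast_1d(axes), layer_names):
        heat = feats[name][0].mean(dim=0).cpu().numpy()   # average over channels
        ax.imshow(heat, cmap='jet')
        ax.set_title(name)
        ax.axis('off')
    plt.show()
```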

6.6. Failure Cases and Future Work

Although the LRSDN algorithm demonstrates promising dehazing performance in most cases, it may fail in some specific scenes. As shown in Figure 16, the inputs of (a) and (b) are natural scene images; here, the LRSDN fails to remove the haze successfully and produces intolerable color spots in the sky region, which is due to the domain gap between natural scene images and remotely sensed images. The inputs of (c) and (d) are remote sensing images with large areas of monotonous color; here, the LRSDN suffers from over-enhancement and incorrectly restores colors in the dark parts of the water body. In future work, we will dedicate our efforts to addressing these issues. From the perspective of model design and deployment, it is a persistent challenge to further reduce the model’s computation and parameter counts while preserving the network’s dehazing performance. In addition, there is ample room for research on model deployment and application on edge devices with extreme resource constraints (e.g., micro UAVs). Lastly, the dataset is a crucial factor for research progress. We proposed a haze-image-simulation algorithm based on Perlin noise to construct paired hazy and clear data for supervised learning; however, the domain gap between simulated data and real-world images remains non-negligible. In natural image dehazing, researchers have attempted to use specialized haze machines to generate real haze [55] instead of synthesizing hazy images. However, this approach is difficult to apply in remote sensing scenarios because of the imaging differences between remotely sensed imagery and natural photographs (e.g., the range, distance, and timing of imaging). Multi-temporal, multi-spectral, and hyperspectral data from remote sensing satellites may offer insights into solving this problem.

7. Conclusions

In this paper, we propose a simple, yet effective remote sensing image dehazing algorithm, LRSDN, which achieves outstanding restoration performance with only 0.087M parameters. This remarkable efficiency is primarily attributed to two specifically designed lightweight modules. First, the axial depthwise and residual block (ADRB) leverages axial depthwise convolution and residual learning, significantly reducing the parameter count through depthwise convolutions while enabling large receptive fields via axial convolution kernels. The residual connections and multi-scale structure design further facilitate the extraction of robust feature information with strong representational capabilities. Second, the simplified channel attention and haze-aware pixel attention enhance useful information within the feature maps and suppress redundant data. The effective feature recalibration is another crucial factor contributing to the LRSDN’s superior defogging capability. The extensive visual evaluations and quantitative comparisons across multiple datasets demonstrate that the LRSDN outperforms existing state-of-the-art methods. Furthermore, we validated the contribution of our proposed LRSDN model in real-world remote sensing vision systems through its application in road extraction and land cover classification tasks. Compared to initial haze-affected images, remote sensing images dehazed by the LRSDN significantly improve the performance of subsequent high-level vision algorithms.

Author Contributions

Conceptualization, Y.H. and C.L.; methodology, Y.H. and C.L.; software, Y.H. and X.L.; validation, Y.H. and C.L.; data curation, Y.H. and X.L.; writing—original draft preparation, Y.H.; writing—review and editing, T.B. and C.L.; supervision, C.L. and T.B.; funding acquisition, C.L. and T.B. All authors have read and agreed to the published version of the manuscript.

Funding

This work was jointly supported by the National Natural Science Foundation of China (62261046); the Oasis Ecological Agriculture Corps Key Laboratory Open Project (202002); the Corps Science and Technology Program (2021DB001 and 2021BB023); the Innovation Team Project of Tarim University (TDZKCX202306 and TDZKCX202102); and the Joint Funds of Tarim University and China Agricultural University (ZNLH202103 and ZNLH202402).

Data Availability Statement

The public datasets used in the experiments are available at the following links: DeepGlobeRoad https://www.kaggle.com/datasets/balraj98/deepglobe-road-extraction-dataset (accessed on 10 May 2023), Haze1K https://www.dropbox.com/s/k2i3p7puuwl2g59/Haze1k.zip?dl=0 (accessed on 6 September 2023), RICE https://github.com/BUPTLdy/RICE_DATASET (accessed on 6 September 2023), and RSID https://drive.google.com/file/d/1FC7oSkGTthjHl2sKN-yGrKhssgV0QB4F/view?usp=sharing (accessed on 22 December 2023). The source code of the LRSDN and RSHD will be released at https://github.com/foreverfruit/LRSDN (accessed on 30 July 2024) as soon as possible.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Liu, J.; Wang, S.; Wang, X.; Ju, M.; Zhang, D. A review of remote sensing image dehazing. Sensors 2021, 21, 3926. [Google Scholar] [CrossRef] [PubMed]
  2. Qi, Y.; Yang, Z.; Sun, W.; Lou, M.; Lian, J.; Zhao, W.; Deng, X.; Ma, Y. A Comprehensive Overview of Image Enhancement Techniques. Arch. Comput. Methods Eng. State Art Rev. 2022, 29, 583–607. [Google Scholar] [CrossRef]
  3. Xiao, J.; Fu, X.; Liu, A.; Wu, F.; Zha, Z.J. Image De-Raining Transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12978–12995. [Google Scholar] [CrossRef] [PubMed]
  4. Liu, J.; Pan, B.; Shi, Z. Cascaded Memory Network for Optical Remote Sensing Imagery Cloud Removal. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5613611. [Google Scholar] [CrossRef]
  5. Chen, K.; Li, W.; Lei, S.; Chen, J.; Jiang, X.; Zou, Z.; Shi, Z. Continuous Remote Sensing Image Super-Resolution Based on Context Interaction in Implicit Function Space. IEEE Trans. Geosci. Remote Sens. 2023, 61, 3272473. [Google Scholar] [CrossRef]
  6. Narasimhan, S.G.; Nayar, S.K. Vision and the atmosphere. Int. J. Comput. Vis. 2002, 48, 233–254. [Google Scholar] [CrossRef]
  7. Narasimhan, S.G.; Nayar, S.K. Interactive (de) weathering of an image using physical models. In Proceedings of the IEEE Workshop on Color and Photometric Methods in Computer Vision, Nice, France, 12 October 2003; Volume 6, p. 1. [Google Scholar]
  8. He, K.; Sun, J.; Tang, X. Single image haze removal using dark channel prior. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 33, 2341–2353. [Google Scholar]
  9. Bui, T.M.; Kim, W. Single image dehazing using color ellipsoid prior. IEEE Trans. Image Process. 2017, 27, 999–1009. [Google Scholar] [CrossRef]
  10. Zhu, Q.; Mai, J.; Shao, L. A fast single image haze removal algorithm using color attenuation prior. IEEE Trans. Image Process. 2015, 24, 3522–3533. [Google Scholar]
  11. Berman, D.; Treibitz, T.; Avidan, S. Single image dehazing using haze-lines. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 42, 720–734. [Google Scholar] [CrossRef]
  12. Han, J.; Zhang, S.; Fan, N.; Ye, Z. Local patchwise minimal and maximal values prior for single optical remote sensing image dehazing. Inf. Sci. 2022, 606, 173–193. [Google Scholar] [CrossRef]
  13. Li, B.; Peng, X.; Wang, Z.; Xu, J.; Feng, D. AOD-Net: All-In-One Dehazing Network. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  14. Ullah, H.; Muhammad, K.; Irfan, M.; Anwar, S.; Sajjad, M.; Imran, A.S.; de Albuquerque, V.H.C. Light-DehazeNet: A novel lightweight CNN architecture for single image dehazing. IEEE Trans. Image Process. 2021, 30, 8968–8982. [Google Scholar] [CrossRef] [PubMed]
  15. Li, Y.; Chen, X. A Coarse-to-Fine Two-Stage Attentive Network for Haze Removal of Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2021, 18, 1751–1755. [Google Scholar] [CrossRef]
  16. Chen, D.; He, M.; Fan, Q.; Liao, J.; Zhang, L.; Hou, D.; Yuan, L.; Hua, G. Gated context aggregation network for image dehazing and deraining. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, 7–11 January 2019; pp. 1375–1383. [Google Scholar]
  17. Liu, X.; Ma, Y.; Shi, Z.; Chen, J. Griddehazenet: Attention-based multi-scale network for image dehazing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7314–7323. [Google Scholar]
  18. Ma, X.; Wang, Q.; Tong, X.; Atkinson, P.M. A deep learning model for incorporating temporal information in haze removal. Remote Sens. Environ. 2022, 274, 113012. [Google Scholar] [CrossRef]
  19. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  20. Perlin, K. An image synthesizer. ACM Siggraph Comput. Graph. 1985, 19, 287–296. [Google Scholar] [CrossRef]
  21. Xu, L.; Zhao, D.; Yan, Y.; Kwong, S.; Chen, J.; Duan, L.Y. IDeRs: Iterative dehazing method for single remote sensing image. Inf. Sci. 2019, 489, 50–62. [Google Scholar] [CrossRef]
  22. Li, J.; Hu, Q.; Ai, M. Haze and Thin Cloud Removal via Sphere Model Improved Dark Channel Prior. IEEE Geosci. Remote Sens. Lett. 2019, 16, 472–476. [Google Scholar] [CrossRef]
  23. He, Y.; Li, C.; Bai, T. Remote Sensing Image Haze Removal Based on Superpixel. Remote Sens. 2023, 15, 4680. [Google Scholar] [CrossRef]
  24. Xie, F.; Chen, J.; Pan, X.; Jiang, Z. Adaptive haze removal for single remote sensing image. IEEE Access 2018, 6, 67982–67991. [Google Scholar] [CrossRef]
  25. Ning, J.; Zhou, Y.; Liao, X.; Duo, B. Single Remote Sensing Image Dehazing Using Robust Light-Dark Prior. Remote Sens. 2023, 15, 938. [Google Scholar] [CrossRef]
  26. Jiang, B.; Wang, J.; Wu, Y.; Wang, S.; Zhang, J.; Chen, X.; Li, Y.; Li, X.; Wang, L. A Dehazing Method for Remote Sensing Image Under Nonuniform Hazy Weather Based on Deep Learning Network. IEEE Trans. Geosci. Remote Sens. 2023, 61, 3261545. [Google Scholar] [CrossRef]
  27. Qin, M.; Xie, F.; Li, W.; Shi, Z.; Zhang, H. Dehazing for Multispectral Remote Sensing Images Based on a Convolutional Neural Network With the Residual Architecture. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 1645–1655. [Google Scholar] [CrossRef]
  28. Ma, X.; Wang, Q.; Tong, X. A spectral grouping-based deep learning model for haze removal of hyperspectral images. ISPRS J. Photogramm. Remote Sens. 2022, 188, 177–189. [Google Scholar] [CrossRef]
  29. Song, Y.; He, Z.; Qian, H.; Du, X. Vision transformers for single image dehazing. IEEE Trans. Image Process. 2023, 32, 1927–1941. [Google Scholar] [CrossRef] [PubMed]
  30. Nie, J.; Xie, J.; Sun, H. Remote Sensing Image Dehazing via a Local Context-Enriched Transformer. Remote Sens. 2024, 16, 1422. [Google Scholar] [CrossRef]
  31. Chi, K.; Yuan, Y.; Wang, Q. Trinity-Net: Gradient-guided Swin transformer-based remote sensing image dehazing and beyond. IEEE Trans. Geosci. Remote Sens. 2023, 61, 3285228. [Google Scholar] [CrossRef]
  32. Yang, X.; Li, H.; Fan, Y.L.; Chen, R. Single image haze removal via region detection network. IEEE Trans. Multimed. 2019, 21, 2545–2560. [Google Scholar] [CrossRef]
  33. Li, Z.; Zhang, J.; Zhong, R.; Bhanu, B.; Chen, Y.; Zhang, Q.; Tang, H. Lightweight and efficient image dehazing network guided by transmission estimation from real-world hazy scenes. Sensors 2021, 21, 960. [Google Scholar] [CrossRef] [PubMed]
  34. Li, S.; Zhou, Y.; Ren, W.; Xiang, W. Pfonet: A progressive feedback optimization network for lightweight single image dehazing. IEEE Trans. Image Process. 2023, 32, 6558–6569. [Google Scholar] [CrossRef]
  35. Wen, Y.; Gao, T.; Li, Z.; Zhang, J.; Chen, T. Encoder-Minimal and Decoder-Minimal Framework for Remote Sensing Image Dehazing. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 36–40. [Google Scholar] [CrossRef]
  36. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
  37. Chen, K.; Zou, Z.; Shi, Z. Building Extraction from Remote Sensing Images with Sparse Token Transformers. Remote. Sens. 2021, 13, 4441. [Google Scholar] [CrossRef]
  38. Wang, J.; Wang, B.; Wang, X.; Zhao, Y.; Long, T. Hybrid Attention-Based U-Shaped Network for Remote Sensing Image Super-Resolution. IEEE Trans. Geosci. Remote Sens. 2023, 61, 3283769. [Google Scholar] [CrossRef]
  39. Chen, Y.; Dong, Q.; Wang, X.; Zhang, Q.; Kang, M.; Jiang, W.; Wang, M.; Xu, L.; Zhang, C. Hybrid Attention Fusion Embedded in Transformer for Remote Sensing Image Semantic Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 4421–4435. [Google Scholar] [CrossRef]
  40. Woo, S.; Park, J.; Lee, J.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Springer: Berlin/Heidelberg, Germany, 2018; Volume 11211, pp. 3–19. [Google Scholar] [CrossRef]
  41. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  42. Dinh, B.D.; Nguyen, T.T.; Tran, T.T.; Pham, V.T. 1M parameters are enough? A lightweight CNN-based model for medical image segmentation. In Proceedings of the 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Taipei, Taiwan, 31 October–3 November 2023; pp. 1279–1284. [Google Scholar] [CrossRef]
  43. Hou, Q.; Jiang, Z.; Yuan, L.; Cheng, M.M.; Yan, S.; Feng, J. Vision permutator: A permutable mlp-like architecture for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 1328–1334. [Google Scholar] [CrossRef]
  44. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  45. Chen, L.; Chu, X.; Zhang, X.; Sun, J. Simple baselines for image restoration. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 17–33. [Google Scholar]
  46. Demir, I.; Koperski, K.; Lindenbaum, D.; Pang, G.; Huang, J.; Basu, S.; Hughes, F.; Tuia, D.; Raskar, R. Deepglobe 2018: A challenge to parse the earth through satellite images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Salt Lake City, UT, USA, 18–23 June 2018; pp. 172–181. [Google Scholar]
  47. Huang, B.; Zhi, L.; Yang, C.; Sun, F.; Song, Y. Single satellite optical imagery dehazing using SAR image prior based on conditional generative adversarial networks. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 1806–1813. [Google Scholar]
  48. Lin, D.; Xu, G.; Wang, X.; Wang, Y.; Sun, X.; Fu, K. A remote sensing image dataset for cloud removal. arXiv 2019, arXiv:1901.00600. [Google Scholar]
  49. Zhang, L.; Wang, S. Dense haze removal based on dynamic collaborative inference learning for remote sensing images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 3207832. [Google Scholar] [CrossRef]
  50. Du, Y.; Li, J.; Sheng, Q.; Zhu, Y.; Wang, B.; Ling, X. Dehazing Network: Asymmetric Unet Based on Physical Model. IEEE Trans. Geosci. Remote Sens. 2024, 62, 3359217. [Google Scholar] [CrossRef]
  51. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 586–595. [Google Scholar]
  52. Zhang, L.; Zhang, L.; Mou, X.; Zhang, D. FSIM: A feature similarity index for image quality assessment. IEEE Trans. Image Process. 2011, 20, 2378–2386. [Google Scholar] [CrossRef]
  53. Mei, J.; Li, R.J.; Gao, W.; Cheng, M.M. CoANet: Connectivity attention network for road extraction from satellite imagery. IEEE Trans. Image Process. 2021, 30, 8540–8552. [Google Scholar] [CrossRef]
  54. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  55. Ancuti, C.O.; Ancuti, C.; Timofte, R. NH-HAZE: An image dehazing benchmark with non-homogeneous hazy and haze-free images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 444–445. [Google Scholar]
Figure 1. Illustration of DSConv and CBAM. The label “Conv” means the standard convolution, and “Conv1×1” indicates the convolution with the 1×1 kernel. “H”, “W”, and “C” denote the height, width, and channel. “MaxPool” and “AvgPool” denote the max pooling and average pooling, respectively.
Figure 2. Illustration of the proposed LRSDN model.
Figure 3. Illustration of the proposed ADRB and the receptive field of AxialDWConv. In (a), “C” means the concatenation operation, and “+” represents elementwise addition. The dashed box below depicts the structure of the ADWConvBlock, while the smaller dashed box above details an AxialDWConv operator. (b) compares the receptive fields of the typical 3×3 convolution and the 5×5 AxialDWConv (dilation = 2), marked in red and green, respectively. AxialDWConv offers a much larger receptive field than the standard convolution with the same computational parameters.
Figure 4. Illustration of the proposed HAB. “×” and “*” represent elementwise and channelwise multiplication, respectively. “-” denotes elementwise subtraction.
Figure 5. Illustration of the proposed haze-image-simulation method based on Perlin noise and the atmospheric scattering model.
Figure 6. The synthesized samples in the proposed RSHD. In the top row, the haze-free remote sensing image, randomly generated atmospheric light, and the corresponding road mask are displayed from left to right. Images in each row are synthesized hazy samples with varying uniformity in haze distribution. Columns (a–c) represent samples with varying haze concentrations, namely thin haze, moderate haze, and dense haze, respectively. In each pair of images, the grayscale image on the left represents the transmission, while the corresponding synthetic haze image is depicted on the right.
Figure 7. Visual comparisons of dehazed results by different methods on the proposed RSHD. “GT” indicates ground truth. (a–f) are samples from the RSHD-HT, RSHD-HM, RSHD-HD, RSHD-IHT, RSHD-IHM, and RSHD-IHD test sets, respectively.
Figure 8. Visual comparisons of dehazed results by different methods on the Haze1K dataset. (g–i) are samples from the thin, moderate, and thick haze subsets of the Haze1K test set, respectively.
Figure 9. Visual comparisons of dehazed results by different methods on the RICE dataset.
Figure 10. Visual comparisons of dehazed results by different methods on RSID dataset.
Figure 11. Visual comparisons of dehazed results and corresponding road-extraction results by different methods on the RSHD. The sample (I) is from the homogeneous and dense haze subset RSHD-HD, and the sample (II) is from the inhomogeneous and dense haze subset RSHD-IHD.
Figure 12. Visual comparisons of dehazed results and corresponding land cover-classification results by different deep learning-based methods. “GT” means the haze-free images and the labeled land cover masks.
Figure 13. Full-reference assessment comparisons of all variant models dehazing on 6 test sets of RSHD.
Figure 14. Quantitative results of road extraction on dehazed images by each variant model.
Figure 15. Visualization of the feature maps obtained by each BasicBlock of the LRSDN model. The caption of each subfigure identifies its derivation. The number indicates the order of the current BasicBlock: “A” and “B” stand for “after” and “before” (relative to the HAB). For example, “3-B” indicates that the feature is from the intermediate output before the HAB of the 3rd BasicBlock.
Figure 16. Failure cases of the proposed LRSDN. The top row shows hazy samples, and the bottom row shows the corresponding dehazed results by the LRSDN.
Table 1. The overview of existing datasets, where “RSHD” means the datasets constructed in this paper.
Name | Test Size | Training Size | Spatial Resolution | Pixel Resolution | Distribution Diversity | Density Diversity | High-Level Task
Haze1k-thin [47]453200.8 m512 × 512×
Haze1k-moderate [47]45320
Haze1k-thick [47]45320
RICE [48]842015m512 × 512××
RSID [31]100900#256 × 256××
DHID [49]50014,4900.13 m512 × 512××
LHID [49]100030,5170.2–153 m512 × 512×××
RSHD-HT51042,2400.3/0.05 m1024 × 1024
RSHD-HM510
RSHD-HD510
RSHD-IHT510
RSHD-IHM510
RSHD-IHD510
Table 2. Quantitative comparisons of different algorithms’ dehazing on the RSHD. The abbreviations “HD”, “HM”, “HT”, “IHD”, “IHM”, and “IHT” in the dataset name stand for the six subsets of the RSHD (referring to Table 1), respectively. ↑ indicates better performance with a higher value, while ↓ means the opposite case, and text in bold indicates the best results (all subsequent tables follow this convention).
Dataset | Metrics | CEP [9] | HazeLine [11] | EVPM [12] | IDeRs [21] | AOD [13] | FCTFNet [15] | LDN [14] | RSHNet [35] | AUNet [50] | LRSDN
HD | LPIPS↓ | 0.450 | 0.378 | 0.407 | 0.469 | 0.372 | 0.331 | 0.348 | 0.393 | 0.346 | 0.301
HD | PSNR↑ | 16.346 | 19.609 | 16.827 | 10.040 | 21.298 | 23.985 | 21.544 | 23.117 | 22.629 | 26.871
HD | SSIM↑ | 0.716 | 0.732 | 0.704 | 0.626 | 0.786 | 0.827 | 0.806 | 0.822 | 0.830 | 0.845
HD | FSIM↑ | 0.847 | 0.852 | 0.860 | 0.885 | 0.807 | 0.922 | 0.876 | 0.949 | 0.949 | 0.966
HM | LPIPS↓ | 0.386 | 0.283 | 0.284 | 0.399 | 0.216 | 0.165 | 0.185 | 0.187 | 0.177 | 0.172
HM | PSNR↑ | 14.582 | 19.826 | 19.628 | 16.458 | 19.758 | 28.284 | 23.940 | 28.736 | 28.737 | 29.856
HM | SSIM↑ | 0.696 | 0.769 | 0.737 | 0.608 | 0.848 | 0.898 | 0.879 | 0.896 | 0.901 | 0.903
HM | FSIM↑ | 0.818 | 0.850 | 0.890 | 0.790 | 0.892 | 0.970 | 0.932 | 0.980 | 0.986 | 0.986
HT | LPIPS↓ | 0.375 | 0.276 | 0.244 | 0.359 | 0.141 | 0.126 | 0.133 | 0.135 | 0.153 | 0.130
HT | PSNR↑ | 13.758 | 18.576 | 18.079 | 18.104 | 23.116 | 31.161 | 26.537 | 29.693 | 29.242 | 31.748
HT | SSIM↑ | 0.675 | 0.744 | 0.761 | 0.654 | 0.894 | 0.913 | 0.898 | 0.904 | 0.903 | 0.913
HT | FSIM↑ | 0.825 | 0.836 | 0.924 | 0.820 | 0.963 | 0.988 | 0.956 | 0.985 | 0.987 | 0.989
IHD | LPIPS↓ | 0.414 | 0.488 | 0.489 | 0.504 | 0.452 | 0.377 | 0.408 | 0.434 | 0.371 | 0.345
IHD | PSNR↑ | 16.680 | 12.486 | 15.435 | 9.914 | 17.878 | 22.930 | 20.093 | 22.529 | 23.047 | 25.544
IHD | SSIM↑ | 0.744 | 0.686 | 0.669 | 0.608 | 0.736 | 0.798 | 0.771 | 0.782 | 0.806 | 0.817
IHD | FSIM↑ | 0.814 | 0.730 | 0.744 | 0.720 | 0.759 | 0.890 | 0.821 | 0.882 | 0.912 | 0.929
IHM | LPIPS↓ | 0.356 | 0.331 | 0.310 | 0.382 | 0.254 | 0.173 | 0.196 | 0.203 | 0.190 | 0.182
IHM | PSNR↑ | 15.391 | 16.903 | 19.032 | 15.790 | 18.645 | 27.851 | 23.397 | 27.887 | 25.407 | 29.110
IHM | SSIM↑ | 0.731 | 0.743 | 0.743 | 0.643 | 0.831 | 0.895 | 0.874 | 0.891 | 0.890 | 0.899
IHM | FSIM↑ | 0.838 | 0.792 | 0.857 | 0.775 | 0.846 | 0.966 | 0.921 | 0.970 | 0.967 | 0.976
IHT | LPIPS↓ | 0.367 | 0.307 | 0.249 | 0.353 | 0.160 | 0.129 | 0.140 | 0.141 | 0.160 | 0.133
IHT | PSNR↑ | 13.977 | 17.190 | 18.042 | 17.976 | 22.142 | 29.950 | 25.775 | 28.630 | 28.264 | 31.317
IHT | SSIM↑ | 0.684 | 0.711 | 0.765 | 0.665 | 0.887 | 0.911 | 0.896 | 0.903 | 0.900 | 0.912
IHT | FSIM↑ | 0.831 | 0.812 | 0.916 | 0.822 | 0.945 | 0.983 | 0.949 | 0.981 | 0.979 | 0.987
Table 3. Quantitative comparisons of different algorithms’ dehazing on the Haze1K dataset.
Dataset | Metrics | CEP [9] | HazeLine [11] | EVPM [12] | IDeRs [21] | AOD [13] | FCTFNet [15] | LDN [14] | RSHNet [35] | AUNet [50] | LRSDN
Thick | LPIPS↓ | 0.222 | 0.186 | 0.210 | 0.311 | 0.238 | 0.207 | 0.232 | 0.220 | 0.190 | 0.157
Thick | PSNR↑ | 15.089 | 16.365 | 16.647 | 11.754 | 16.521 | 19.192 | 16.506 | 20.258 | 19.909 | 21.897
Thick | SSIM↑ | 0.759 | 0.790 | 0.787 | 0.702 | 0.774 | 0.814 | 0.777 | 0.835 | 0.837 | 0.847
Thick | FSIM↑ | 0.889 | 0.914 | 0.901 | 0.872 | 0.864 | 0.937 | 0.862 | 0.941 | 0.948 | 0.959
Moderate | LPIPS↓ | 0.274 | 0.198 | 0.104 | 0.320 | 0.175 | 0.081 | 0.116 | 0.061 | 0.104 | 0.088
Moderate | PSNR↑ | 13.083 | 15.454 | 20.656 | 14.763 | 20.078 | 23.582 | 20.970 | 24.880 | 24.327 | 25.241
Moderate | SSIM↑ | 0.746 | 0.798 | 0.918 | 0.785 | 0.906 | 0.937 | 0.921 | 0.941 | 0.929 | 0.934
Moderate | FSIM↑ | 0.854 | 0.895 | 0.942 | 0.899 | 0.917 | 0.969 | 0.936 | 0.966 | 0.956 | 0.970
Thin | LPIPS↓ | 0.287 | 0.183 | 0.088 | 0.279 | 0.098 | 0.091 | 0.095 | 0.083 | 0.071 | 0.070
Thin | PSNR↑ | 12.194 | 13.921 | 20.426 | 15.048 | 18.671 | 21.532 | 18.648 | 22.377 | 23.017 | 23.673
Thin | SSIM↑ | 0.701 | 0.760 | 0.891 | 0.772 | 0.870 | 0.898 | 0.873 | 0.903 | 0.906 | 0.913
Thin | FSIM↑ | 0.834 | 0.892 | 0.962 | 0.911 | 0.943 | 0.965 | 0.947 | 0.967 | 0.976 | 0.978
Table 4. Quantitative comparisons of different algorithms’ dehazing on the RICE dataset.
Metrics | CEP [9] | HazeLine [11] | EVPM [12] | IDeRs [21] | AOD [13] | FCTFNet [15] | LDN [14] | RSHNet [35] | AUNet [50] | LRSDN
LPIPS↓ | 0.341 | 0.288 | 0.293 | 0.363 | 0.274 | 0.072 | 0.183 | 0.106 | 0.075 | 0.077
PSNR↑ | 14.234 | 17.058 | 15.217 | 15.750 | 20.784 | 29.091 | 23.108 | 23.453 | 28.704 | 31.662
SSIM↑ | 0.713 | 0.723 | 0.742 | 0.611 | 0.834 | 0.949 | 0.873 | 0.919 | 0.946 | 0.953
FSIM↑ | 0.800 | 0.781 | 0.865 | 0.746 | 0.856 | 0.980 | 0.887 | 0.961 | 0.980 | 0.983
Table 5. Quantitative comparisons of different algorithms’ dehazing on RSID dataset.
Metrics | CEP [9] | HazeLine [11] | EVPM [12] | IDeRs [21] | AOD [13] | FCTFNet [15] | LDN [14] | RSHNet [35] | AUNet [50] | LRSDN
LPIPS↓ | 0.277 | 0.191 | 0.202 | 0.277 | 0.144 | 0.093 | 0.125 | 0.127 | 0.054 | 0.076
PSNR↑ | 13.091 | 16.498 | 16.418 | 14.254 | 19.052 | 22.469 | 19.026 | 20.640 | 25.457 | 24.878
SSIM↑ | 0.736 | 0.848 | 0.779 | 0.713 | 0.901 | 0.939 | 0.912 | 0.912 | 0.957 | 0.942
FSIM↑ | 0.796 | 0.873 | 0.826 | 0.777 | 0.911 | 0.944 | 0.925 | 0.929 | 0.962 | 0.947
Table 6. The average scores of the quantitative evaluation of the road-extraction results by different methods on the RSHD.
Metrics | CEP [9] | HazeLine [11] | EVPM [12] | IDeRs [21] | AOD [13] | FCTFNet [15] | LDN [14] | RSHNet [35] | AUNet [50] | LRSDN
Recall↑ | 0.546 | 0.668 | 0.681 | 0.397 | 0.508 | 0.709 | 0.634 | 0.726 | 0.775 | 0.783
Precision↑ | 0.718 | 0.742 | 0.733 | 0.696 | 0.741 | 0.780 | 0.789 | 0.771 | 0.760 | 0.780
IoU↑ | 0.450 | 0.541 | 0.544 | 0.342 | 0.429 | 0.589 | 0.528 | 0.594 | 0.623 | 0.641
F1↑ | 0.619 | 0.700 | 0.699 | 0.497 | 0.570 | 0.736 | 0.674 | 0.741 | 0.766 | 0.780
Table 7. The average scores of the quantitative evaluation of the land cover classification results by different methods.
Metrics | Hazy | CEP [9] | EVPM [12] | IDeRs [21] | AOD [13] | FCTFNet [15] | LDN [14] | RSHNet [35] | AUNet [50] | LRSDN
IoU↑ | 0.161 | 0.483 | 0.381 | 0.286 | 0.476 | 0.581 | 0.591 | 0.527 | 0.496 | 0.593
F1↑ | 0.229 | 0.616 | 0.479 | 0.403 | 0.596 | 0.699 | 0.704 | 0.639 | 0.609 | 0.709
Recall↑ | 0.201 | 0.576 | 0.458 | 0.331 | 0.553 | 0.655 | 0.664 | 0.602 | 0.572 | 0.658
Precision↑ | 0.293 | 0.675 | 0.509 | 0.595 | 0.685 | 0.774 | 0.782 | 0.698 | 0.679 | 0.805
Table 8. Computational complexity and run time of several state-of-the-art algorithms.
Methods | Params (M) | MACs (G) | Run Time (ms) | Methods | Run Time (ms)
RSHNet [35] | 1.136 | 9.850 | 14.130 | DCP [8] | 306.251
AOD [13] | 0.002 | 0.107 | 0.850 | CEP [9] | 9.182
GCANet [16] | 0.670 | 17.507 | 7.860 | HazeLine [11] | 1607.788
GDN [17] | 0.911 | 19.969 | 17.170 | EVPM [12] | 47.704
FCTFNet [15] | 0.156 | 9.357 | 8.416 | IDeRs [21] | 304.158
LDN [14] | 0.029 | 1.836 | 1.160 | SMIDCP [22] | 114.872
AUNet [50] | 7.417 | 48.312 | 30.230 | CAP [10] | 48.132
LRSDN (ours) | 0.087 | 5.209 | 7.664 | — | —
Table 9. Computational complexity and run time of several variants of LRSDN for ablation study. The label K represents the number of BasicBlocks in the model. ADRB and HAB indicate whether the BasicBlock contains the corresponding component or not. SCA and PA denote the simplified channel attention and pixel attention structures in the HAB, respectively. The BasicBlock of model M1 contains only the standard ResBlock [19] as the baseline. The model M4 is the proposed LRSDN.
Model | ADRB | HAB | K | Params (M) | MACs (G) | Run Time (ms)
M1 | × | × | 5 | 0.074 | 4.716 | 2.111
M2 | ✓ | × | 5 | 0.080 | 5.055 | 6.327
M3 | ✓ | SCA | 5 | 0.085 | 5.064 | 7.550
M4 | ✓ | SCA+PA | 5 | 0.087 | 5.209 | 7.664
M5 | ✓ | SCA+PA | 10 | 0.368 | 22.109 | 16.307
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
