Automatic Water Body Extraction from SAR Images Based on MADF-Net

Wang, Jing; Jia, Dongmei; Xue, Jiaxing; Wu, Zhongwu; Song, Wanying

doi:10.3390/rs16183419


by Jing Wang ¹,*, Dongmei Jia ¹, Jiaxing Xue ¹, Zhongwu Wu ² and Wanying Song ¹

¹ Xi’an Key Laboratory of Network Convergence Communication, College of Communication and Information Engineering, Xi’an University of Science and Technology, Xi’an 710054, China

² School of Electrical and Information Engineering, Changsha University of Science & Technology, Changsha 410114, China

* Author to whom correspondence should be addressed.

Remote Sens. 2024, 16(18), 3419; https://doi.org/10.3390/rs16183419 (registering DOI)

Submission received: 5 August 2024 / Revised: 31 August 2024 / Accepted: 12 September 2024 / Published: 14 September 2024

(This article belongs to the Special Issue Remote Sensing of Global Floods: Observing, Modelling, and Forecasting)


Abstract

Water extraction from synthetic aperture radar (SAR) images has important application value in wetland monitoring, flood monitoring, and related tasks. However, it still suffers from low generalization, weak extraction of detailed information, and weak suppression of background noise. Therefore, a new framework, the Multi-scale Attention Detailed Feature fusion Network (MADF-Net), is proposed in this paper. It comprises an encoder and a decoder. In the encoder, ResNet101 is used as the backbone network to capture four feature levels at different depths, and the proposed Deep Atrous Pyramid Pooling (DAPP) module then performs multi-scale pooling operations, which help capture key water features even in complex backgrounds. In the decoder, a Channel Spatial Attention Module (CSAM) is proposed, which focuses on the feature areas that are critical for identifying water edges by fusing attention weights in the channel and spatial dimensions. Finally, the high-level semantic information is fused with the low-level edge features to produce the final water detection results. In the experiment, Sentinel-1 SAR images of three scenes with water bodies of different characteristics and scales are used. The PA and IoU of water extraction by MADF-Net reach 92.77% and 89.03%, respectively, outperforming several other networks. MADF-Net extracts water with high precision from SAR images with different backgrounds, and could also be used for other segmentation and classification tasks on SAR images.

Keywords: SAR; water extraction; deep learning; attention mechanism

1. Introduction

As an important means of water resource management, flood prevention, and disaster mitigation, water body detection is of vital significance for maintaining the ecological environment and safeguarding people’s lives and property. Currently, the global climate is changing markedly, extreme weather events occur frequently, and natural disasters such as heavy rainfall and floods pose great challenges to human society. Therefore, strengthening the real-time monitoring and early warning of water bodies, as well as improving the accuracy and timeliness of water information, is vital for responding effectively to natural disasters and mitigating disaster losses. Synthetic aperture radar (SAR) is one of the most important technical means for water monitoring in floods [1]. SAR is not affected by weather and can be used for all-day, all-weather imaging. The backscattering coefficient of water in SAR images is very low (due to specular reflection), which allows SAR images to be used for continuous observation of the spatial distribution of rivers, lakes, and reservoirs and of their dynamic changes, supporting timely and effective flood prevention and mitigation.

Traditional water detection methods mainly include the threshold method and the classifier method. The threshold method is more often used on medium- and high-resolution images and is mainly based on the spectral features of ground objects, using spectral knowledge to construct various classification models and water indices for water extraction. Through an experiment on water detection in Tangshan, Wang et al. [2] constructed a spectral interrelationship and a ratio model and concluded that water is characterized by a TM4-to-TM2 ratio of less than 0.9. Li et al. [3] designed an improved maximum interclass variance thresholding method for water extraction, which integrated the spectral, textural, and spatial features of the images to extract water information. Experiments in the Dongting Lake region during the dry and flooding periods both produced good results. Duan et al. [4] extracted water bodies using the SVM method, the object-oriented method, and the water body index method, and the results showed that the SVM method had the highest extraction accuracy. Aung et al. [5] used Google Earth RGB images and combined the Sobel operator, the wavelet transform, and the SVM method to classify river areas and sandbar areas, reaching an overall detection accuracy of 94%. In recent decades, deep learning techniques have achieved remarkable results in diverse image analysis applications across various fields. Convolutional Neural Networks (CNNs) have gained popularity in semantic segmentation tasks due to their ability to implement nonlinear decision mechanisms and to learn features directly from raw image data through the integration of convolutional and pooling layers. Because CNNs learn image features automatically, they are often more accurate and robust than methods built on manually designed features.

CNNs can achieve better accuracy and efficiency when dealing with large image datasets, so more and more researchers have applied deep learning methods to water detection. Singh et al. [6] used the predefined U-Net architecture as a segmentation model for detecting and segmenting water in satellite images, which achieved good results with less data. Liu et al. [7] presented a water extraction network, R50A3-LWBENet, based on ResNet50 and three attention mechanisms. Experiments demonstrated that the network excelled at integrating global and local information, leading to sharper refinement of lake water body edges. Hou et al. [8] proposed an automatic and robust water identification architecture without manual labeling, and the experimental results showed that morphological processing was much more effective for the feature extraction of water. Chen et al. [9] introduced a technique based on feature pyramid enhancement and pixel pair matching, which could retain more spatial information and transmit it to the backbone network, thereby addressing the detail loss that often occurs in deep networks. Xu et al. [10] built an enhanced version of the traditional U-Net, known as the Information-Extended network, which incorporated rotation-equivariant convolution, a rotation-based channel-attention mechanism, and an optimized Batch Normalization layer. It improved the IoU of water extraction by 7%. Chen et al. [11] introduced a hybrid CNN–Transformer architecture, whose experimental results showed superior performance and efficiency, with state-of-the-art results on surface water and Tibetan Plateau lake datasets. Bahrami et al. [12] investigated advanced deep learning models, including SegNet, U-Net, and FCN32, for the automated segmentation of flood-affected zones. Their experiments underscored the potential of deep learning techniques for improving flood detection accuracy and response capabilities, making flood prediction systems more efficient and reliable. Lu et al. [13] introduced a regionalized approach for coastline extraction from high-resolution images, integrating Simple Linear Iterative Clustering (SLIC), Bayes’ Theorem, and the Metropolis–Hastings (M-H) algorithm. Their experimental findings indicated that the proposed methodology achieved a precise and complete extraction of coastlines. Pech-May et al. [14] presented a strategy for classifying flooded areas using satellite images obtained from synthetic aperture radar, together with the U-Net neural network and the ArcGIS platform. Experiments showed that the results were good. Jonnala et al. [15] proposed a spatial attention residual U-Net architecture to improve the effectiveness of water segmentation. The method used U-Net as the base network and reweighted the feature representation spatially to obtain the water element data, which performed better than the existing networks.

In SAR image analysis, DeepLabV3+ has demonstrated a powerful segmentation ability and has been increasingly used in water detection, land cover classification, and disaster monitoring [16]. Many researchers have improved DeepLabV3+ for water extraction from SAR images. In their latest study, Chen et al. [1] discussed a water detection method for SAR images based on interpretable artificial intelligence (AI) and focused on the attention mechanism of the model, providing a new perspective for understanding the model’s decision-making process. Wu et al. [17] proposed using the spatial pyramid pooling module to combine feature maps at different scales in DeepLabV3+. Chen et al. [18] proposed a multi-level feature attention fusion network (MFAF-Net), which realized the high-precision automatic detection of water from multi-frequency and multi-resolution SAR images. Zhang et al. [19] proposed a deep neural network for the automatic extraction of water and shadows from SAR images by integrating CNN, ResNet, DenseNet, global convolutional network (GCN), and convolutional long short-term memory (ConvLSTM), which was shown experimentally to be effective for water and shadow extraction. Chen et al. [20] introduced an interpretable deep neural network (DNN) architecture for detecting surface water, which integrated SAR domain expertise, DNN, and Explainable Artificial Intelligence (XAI). Experimental results demonstrated its ability to provide transparency into the DNN’s decision-making process during water detection, along with corresponding attribution analysis for a specific SAR image input. Cai et al. [21] proposed an automatic and fast extraction method for InSAR image stacking based on a multi-layer feature fusion attention mechanism, which improved the accuracy and speed of stacked information extraction. Chen et al. [22] designed a multi-scale deep neural network to achieve high-precision water recognition under complex terrain conditions, targeting water detection in mountainous SAR images.

At present, great progress has been made in the research of SAR image water detection. Many researchers have used attention mechanisms, multi-scale feature fusion, super-resolution reconstruction, and other methods to conduct research. However, most of the water detection algorithms have been designed and verified for SAR images in specific areas or under specific conditions, and their generalization ability is limited. For example, Hou et al. [8] used data from only one area of the Tibetan Plateau lake in their experiment. When applied to other regions or conditions, a large number of parameter adjustments and algorithm optimizations might be required, which would increase the complexity and cost of the algorithm’s application. Most of the research has been conducted on large areas of water; however, for small areas, there are still deficiencies in the accuracy of water detection and edge extraction. For example, Xu et al. [10] mentioned that the Aug-U-Net model was unsatisfactory for the extraction of small tributaries.

In summary, water detection still faces the problems of low generalization, weak extraction of detailed information and multi-scale features of water, and weak suppression of complex background noise. To address these problems, a Multi-scale Attention Detailed Feature fusion Network (MADF-Net) is proposed in this paper, which improves the performance of water detection for different types of water in SAR images by constructing a deep multi-scale feature extraction (DAPP) module and a dual-attention mechanism (CSAM) module. The contributions of this paper are as follows:

(1): MADF-Net is proposed, which can automatically detect water regions of different scales with high precision.
(2): A feature extraction module, namely DAPP, is proposed based on depthwise separable convolution, which can obtain rich semantic information of water bodies and enhance the network’s learning of the detailed features of water from SAR images.
(3): A detailed edge feature extraction module, CSAM, is presented, inspired by spatial attention and channel attention. It distributes information weights effectively among channels and improves feature expression and edge information extraction at the edges of water regions.

The rest of this paper is organized as follows. Section 2 gives detailed information on the SAR images used. Section 3 elaborates on the principle of MADF-Net proposed in this paper. In Section 4, an experiment is performed using Sentinel-1 SAR images for water detection, and several excellent networks are compared. Section 5 discusses several problems in this paper. Finally, the conclusion is given in Section 6.

2. Materials

In this paper, single-polarized intensity images acquired by Sentinel-1 in IW mode are used. Sentinel-1 works in the C-band and has a resolution of 5 m (range) × 20 m (azimuth) in IW mode. The Sentinel-1 SAR images were downloaded from the Scientific Data Hub of the European Space Agency (ESA; Paris, France). The images cover three regions and were acquired in June, July, September, and October 2018; the three regions span 96°~119°E, 32°~42°N; 113°41′~115°05′E, 29°58′~31°22′N; and 99°38′~100°45′E, 36°32′~37°12′N, respectively. Firstly, the downloaded data are preprocessed through orbit correction, thermal noise removal, radiometric calibration, filtering, and geometric correction using the Sentinel Application Platform (SNAP). Then, the dataset is produced and manually labeled with reference to the PASCAL VOC 2012 format, and the Labelme v5.3.1 software is used to label the water body areas as water and all other areas as background. The large-scale SAR images are sliced with a sliding window to generate 512 × 512-pixel images; 11,000 samples are finally generated, in which the ratio of the training set to the validation set is 8:2. The dataset folder contains two folders, training and testing. Within the training folder, ImageSets holds the lists of image file names for the training and validation sets, JPEGImages stores all the training and testing images, and SegmentationClass stores all the training and testing labels. The images selected for the dataset cover three cases: the combination of tributaries and small tributaries, the combination of large and small areas of water, and the combination of large areas of water and tributaries. Each SAR image corresponds to a red-and-black label image, where red marks water and black marks everything other than water. In addition, three water scenes with different characteristics are reserved for independent testing and are not used in the dataset.
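The slicing and the 8:2 split described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the stride of 512 pixels (non-overlapping windows), the random seed, and the example scene size are all hypothetical, since the text specifies only the 512 × 512 tile size, the 11,000 samples, and the 8:2 ratio.

```python
import random


def tile_origins(height, width, tile=512, stride=512):
    """Top-left (row, col) of every full tile that fits in a height x width image."""
    rows = range(0, height - tile + 1, stride)
    cols = range(0, width - tile + 1, stride)
    return [(r, c) for r in rows for c in cols]


def split_train_val(n_samples, train_fraction=0.8, seed=0):
    """Shuffle sample indices and split them into training and validation lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    cut = round(n_samples * train_fraction)
    return indices[:cut], indices[cut:]


# Hypothetical scene size in pixels; the real scenes are much larger.
origins = tile_origins(1024, 1536)
# 11,000 samples split 8:2, as stated in the text.
train_ids, val_ids = split_train_val(11000)
print(len(origins), len(train_ids), len(val_ids))
```

A stride smaller than the tile size would produce overlapping tiles; whether the original pipeline overlaps its windows is not stated.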

3. Methods

3.1. The Overall Framework

In this paper, a Multi-scale Attention Detailed Feature fusion Network (MADF-Net) is proposed, which is an encoder–decoder framework, and the overall network is shown in Figure 1. In the encoder, ResNet101 [23] is selected as the backbone network to extract richer features through deeper network structure, and then the extracted high-level features are input into the proposed DAPP module for further feature extraction, which can obtain rich and accurate semantic information to enhance the network’s learning ability of the detailed features of water. In the decoder, the extracted low-level features are passed through the constructed CSAM to obtain rich texture information, so as to strengthen the network’s learning capability for the detailed features of water. Then, the high-level semantic information and low-level texture features are fused to realize the full extraction of water features, thus giving the extraction results of water.

3.2. Residual Neural Network

Residual Neural Network (ResNet) [23] is a deep convolutional neural network that uses residual learning through residual blocks. These blocks ease the learning of identity mappings by adding the input directly to the output of the convolution layers. We chose ResNet101 with atrous convolution as our baseline for water detection. It starts by reducing the image size and then uses four residual blocks for feature extraction at different levels. The last block enhances features using concatenated dilated convolutions with different dilation rates, deepening the network and enriching the feature maps.
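The shortcut idea can be illustrated with a minimal, framework-free sketch. The function `residual_block` and its arguments are hypothetical names; `residual_fn` stands in for the stacked convolution layers, and a flat list stands in for a feature map.

```python
def residual_block(x, residual_fn):
    """Return F(x) + x, where residual_fn plays the role of the stacked layers F."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("F(x) must have the same shape as x for the identity shortcut")
    return [a + b for a, b in zip(x, fx)]


# If the layers learn F(x) = 0, the block reduces to an identity mapping.
identity_out = residual_block([1.0, -2.0, 3.0], lambda v: [0.0] * len(v))
print(identity_out)
```

Because the block only has to learn the residual F(x), passing the input through unchanged requires driving F toward zero rather than fitting an identity function with the convolutions themselves.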

3.3. Deep Atrous Pyramid Pooling (DAPP)

As a model grows more complex, both its computational load and its total number of parameters increase, which brings challenges to practical applications. In this paper, we introduce a module that leverages depthwise separable convolution. The fundamental premise of this approach is that the spatial and channel (depth) dimensions of a feature map within a convolutional neural network can be decoupled from each other. Standard convolutional operations map spatial and channel features jointly through weight matrices, which comes with the drawbacks of significant computational demands, memory requirements, and a large number of weight parameters [24,25]. Conceptually, depthwise separable convolution aims to minimize the number of weight coefficients while maintaining the kernel’s representation-learning capabilities. It achieves this by mapping the spatial and channel dimensions separately and then combining the results in a manner that preserves the integrity of the convolution operation. Depthwise separable convolution is divided into two parts. The first part is depthwise convolution, in which each channel is convolved separately using a given convolution kernel size and the results are combined. The second part is pointwise convolution [24], which applies a standard convolution with a 1 × 1 kernel to produce the output feature map.
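The parameter saving can be made concrete with a small count. The sketch below is illustrative only: the function names are hypothetical, bias terms are ignored, and the layer sizes (3 × 3 kernel, 256 input and output channels) are assumptions rather than values taken from the paper.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a k x k depthwise convolution plus a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes the channels
    return depthwise + pointwise


k, c_in, c_out = 3, 256, 256  # illustrative sizes, not taken from the paper
standard = standard_conv_params(k, c_in, c_out)
separable = depthwise_separable_params(k, c_in, c_out)
print(standard, separable, separable / standard)
```

The ratio equals 1/c_out + 1/k², so for a 3 × 3 kernel and a large number of output channels the separable form needs roughly one ninth of the weights.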

The DAPP module integrates atrous convolution with depthwise separable convolution. Atrous convolution, also known as dilated convolution [26], is formed by inserting a given number of zeros between the elements of the convolution kernel, which alters its element distribution and expands the range over which the convolution is computed. It enlarges the receptive field without sacrificing spatial resolution, capturing high-level semantic features without additional parameters. A larger receptive field aids in segmenting larger targets, while the preserved resolution facilitates precise target localization. Atrous convolution mitigates issues such as internal data structure distortion, loss of spatial hierarchical information from pooling, and the loss of small targets after downsampling, which benefits small-object recognition and segmentation in tasks such as target detection and semantic segmentation. Because water bodies contain small targets such as fine tributaries and narrow waters, using atrous convolution instead of downsampling followed by upsampling retains the spatial features of the image well and avoids the information loss caused by downsampling. The structure of the DAPP module is shown in Figure 2.
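The zero-insertion view of atrous convolution can be checked in one dimension. The following sketch uses hypothetical helper names and zero padding at the borders; it shows that convolving with a zero-filled kernel gives the same result as sampling the input at a spacing equal to the dilation rate, and that the output keeps the input length.

```python
def dilate_kernel(kernel, rate):
    """Insert rate - 1 zeros between neighbouring kernel taps."""
    out = []
    for i, w in enumerate(kernel):
        out.append(w)
        if i < len(kernel) - 1:
            out.extend([0.0] * (rate - 1))
    return out


def conv1d_same(signal, kernel):
    """Plain sliding-window correlation with zero padding (odd kernel length)."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(signal))]


def atrous_conv1d(signal, kernel, rate):
    """Sample the input `rate` positions apart instead of materialising zeros."""
    half = (len(kernel) // 2) * rate
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for j, k in enumerate(kernel):
            pos = i - half + j * rate
            if 0 <= pos < n:  # positions outside the signal count as zeros
                acc += k * signal[pos]
        out.append(acc)
    return out


signal = [1, 2, 3, 4, 5, 6]
kernel = [1.0, 2.0, 1.0]
print(atrous_conv1d(signal, kernel, rate=2))
print(conv1d_same(signal, dilate_kernel(kernel, 2)))
```

Both calls use only three weights, yet each output value depends on inputs spread over five positions, which is the receptive-field gain described above.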

The input is first subjected to depthwise convolution with a 3 × 3 convolution kernel. After passing through the BN and ReLU layers, atrous convolution with a configurable dilation rate is performed. Then, the features are processed by the BN and ReLU layers. Finally, the output feature map is obtained through pointwise convolution, a BN layer, and a ReLU layer. This module not only greatly reduces the number of convolution parameters, but also enlarges the receptive field, giving it a strong capability for extracting semantic information.

In this paper, the DAPP module is constructed using the DSCFE module as the base module, and the structure of the DAPP module is shown in Figure 2. First, a 1 × 1 convolutional layer is used for dimensionality reduction; then, a three-level pyramid structure is constructed using DSCFE modules with dilation rates of 1, 3, and 5, respectively, to extract features at different scales, and global average pooling is also performed. Finally, the outputs of all layers are stacked in order along the channel dimension, and the final result is obtained by passing the stacked output through a convolutional layer, a BN layer, a ReLU layer, and dropout, which reduce it to a given number of channels.
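To see what the three dilation rates mean in practice, the sketch below computes the span covered by a dilated kernel and the zero padding needed to keep the feature-map size unchanged at stride 1. The function names are hypothetical, and a 3 × 3 atrous kernel is assumed, since the text states the 3 × 3 size only for the depthwise convolution.

```python
def effective_kernel_size(kernel, rate):
    """Span covered by a kernel once rate - 1 gaps separate neighbouring taps."""
    return kernel + (kernel - 1) * (rate - 1)


def same_padding(kernel, rate):
    """Zero padding per side that keeps the output size unchanged at stride 1."""
    return (effective_kernel_size(kernel, rate) - 1) // 2


# Dilation rates of the three pyramid levels given in the text.
for rate in (1, 3, 5):
    print(rate, effective_kernel_size(3, rate), same_padding(3, rate))
```

Under this assumption the three levels see 3 × 3, 7 × 7, and 11 × 11 neighbourhoods while each still uses nine weights per channel, which is how the pyramid gathers context at several scales.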

3.4. Channel Spatial Attention Module (CSAM)

The incorporation of attention mechanisms in image processing aims to capture contextual information and enable the model to prioritize salient regions while discounting irrelevant details [27,28]. These mechanisms encompass channel attention and spatial attention modules, emphasizing the significance of channels and spatial regions in the image, respectively. Specifically, the channel attention module enhances model performance by scrutinizing the interdependencies between various channels and optimizing the distribution of feature maps. This approach facilitates the identification of feature channels that are pivotal for a given task [29,30]. The spatial attention module is primarily concerned with assessing the significance of various pixel regions within an image. This module helps to better understand local regions, which can accurately extract edge features, thus solving the problem that the extraction ability of fine tributaries is not strong enough. Inspired by the SE channel attention module [31,32] and spatial attention module, the CSAM is constructed in this paper, and is shown in Figure 3.

First, the input X is processed by F_{tr} to obtain the feature map U. F_{tr} can be regarded as a convolution operator, which is defined as follows:

U_{c} = F_{tr}(X) = \sum_{s=1}^{C^{'}} V_{c}^{s} * X^{s}

(1)

where X \in R^{H^{'} \times W^{'} \times C^{'}} and U \in R^{H \times W \times C} are the input feature map and output feature map, respectively, X^{s} is the s-th channel of X, and U_{c} is the c-th channel of U. V_{c}^{s} denotes a 2D spatial kernel, and * denotes a convolution operation. Then, global average pooling (F_{GAP}) is carried out, and the feature map containing the global information is directly compressed into a 1 × 1 × C feature vector. In this way, the generated channel-level statistics contain contextual information, alleviating the problem of channel dependency [18]. The definition is as follows:

F_{GAP}(U) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} U(i, j)

(2)

Then, two fully connected operations are performed. To reduce computational complexity, the first fully connected layer in the CSAM compresses the input from C channels to C/r channels, where r represents the compression ratio. Subsequently, a ReLU nonlinear activation layer is employed, followed by a second fully connected layer that restores the number of channels to C, thus maintaining the original dimensionality. A Sigmoid activation is then applied to obtain the weights s. The dimension of s is 1 × 1 × C, and s describes the weights of the channels of the feature map U. Finally, the scale operation weights the features of each channel by the previously obtained attention weights, and the result serves as the output of the channel attention module. Max pooling and average pooling are then applied to this output to obtain two feature maps, each with dimensions 1 × H × W. Next, these two feature maps are concatenated, and the concatenated feature map is fed into a 7 × 7 convolutional layer to generate a single-channel feature map. Subsequently, a Sigmoid activation function is applied to produce the spatial attention feature map. Finally, the spatial attention feature map is restored to the size C × H × W as the module’s output. The definition is as follows:

M_{s}(F) = \sigma(f^{7 \times 7}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)]))

(3)

The CSAM enables multi-scale feature extraction on the input feature map. This enhancement enables a granular understanding of the image content and improves the analysis and recognition capabilities of the model.
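The two attention stages can be traced in a small dependency-free sketch. This is not the authors’ implementation: the function names and the nested-list layout `u[c][i][j]` are hypothetical, the weights `w1`, `w2`, and `kernel` would normally be learned, and the last step assumes that the spatial map is applied by element-wise multiplication broadcast over the channels, which the text does not spell out.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def channel_weights(u, w1, w2):
    """SE-style weights s (one per channel) for a feature map u[c][i][j]."""
    h, w = len(u[0]), len(u[0][0])
    # Eq. (2): global average pooling squeezes each channel to one scalar.
    z = [sum(sum(row) for row in channel) / (h * w) for channel in u]
    # First fully connected layer (C -> C/r) followed by ReLU.
    hidden = [max(0.0, sum(a * b for a, b in zip(row, z))) for row in w1]
    # Second fully connected layer (C/r -> C) followed by sigmoid.
    return [sigmoid(sum(a * b for a, b in zip(row, hidden))) for row in w2]


def spatial_map(f, kernel):
    """Eq. (3): sigmoid(conv([AvgPool(f); MaxPool(f)])), pooling across channels."""
    c, h, w = len(f), len(f[0]), len(f[0][0])
    avg = [[sum(f[k][i][j] for k in range(c)) / c for j in range(w)] for i in range(h)]
    mx = [[max(f[k][i][j] for k in range(c)) for j in range(w)] for i in range(h)]
    pooled = (avg, mx)
    size = len(kernel[0])  # 7 for the 7 x 7 convolution in the text
    pad = size // 2        # zero padding keeps the H x W size
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for m in range(2):
                for di in range(size):
                    for dj in range(size):
                        y, x = i + di - pad, j + dj - pad
                        if 0 <= y < h and 0 <= x < w:
                            acc += kernel[m][di][dj] * pooled[m][y][x]
            out[i][j] = sigmoid(acc)
    return out


def csam(u, w1, w2, kernel):
    """Channel attention followed by spatial attention on u[c][i][j]."""
    s = channel_weights(u, w1, w2)
    f = [[[s[k] * v for v in row] for row in u[k]] for k in range(len(u))]
    m = spatial_map(f, kernel)
    # Assumption: the spatial map is broadcast over channels by multiplication.
    return [[[v * m[i][j] for j, v in enumerate(row)]
             for i, row in enumerate(channel)] for channel in f]
```

With C = 4 channels and r = 2, `w1` has shape 2 × 4, `w2` has shape 4 × 2, and `kernel` has shape 2 × 7 × 7; the output keeps the C × H × W shape of the input.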

3.5. Evaluation Metrics

In order to evaluate the performance of the water detection, Pixel Accuracy (PA) and Intersection over Union (IoU) [29] are used in this paper, which are defined as follows.

PA = \frac{TP + TN}{TP + FP + FN + TN}

(4)

IoU = \frac{TP}{TP + FP + FN}

(5)

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively. PA is the ratio of correctly classified pixels to all pixels. A higher PA means that more pixels, and hence more water areas, are detected correctly; because FP appears in the denominator of IoU, a higher IoU also reflects fewer false alarms.
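Equations (4) and (5) can be computed directly from a predicted mask and a ground-truth mask. The sketch below uses hypothetical function names and tiny hand-made binary masks (1 for water, 0 for background); it assumes that at least one pixel is water in either mask, so the IoU denominator is non-zero.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, FN, TN over two binary masks of equal shape."""
    tp = fp = fn = tn = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def pixel_accuracy(tp, fp, fn, tn):
    """Eq. (4): correctly classified pixels over all pixels."""
    return (tp + tn) / (tp + fp + fn + tn)


def iou(tp, fp, fn):
    """Eq. (5): intersection over union of the water class."""
    return tp / (tp + fp + fn)


pred = [[1, 1, 0], [0, 1, 0]]    # illustrative prediction
truth = [[1, 0, 0], [0, 1, 1]]   # illustrative ground truth
tp, fp, fn, tn = confusion_counts(pred, truth)
print(pixel_accuracy(tp, fp, fn, tn), iou(tp, fp, fn))
```

In this toy case one false alarm and one missed pixel leave PA at 4/6 but pull IoU down to 2/4, which shows why IoU is the stricter of the two measures when water covers only a small part of the image.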

4. Results

4.1. Training Parameter Settings

In this paper, the experimental software environment is PyTorch 1.7.1, CUDA 11.4, and Python 3.8; the hardware environment is a single NVIDIA RTX 3090 GPU with 22 GB of video memory. During training, the network uses ResNet as the backbone, the learning rate is set to 0.005, a weight decay of 0.0005 is applied, and SGD is used as the optimizer. The batch size is eight, and the network is trained for 100 iterations. The loss is compared after each iteration, and the model with the lowest loss over the 100 iterations is retained. The network is also validated after each round of training, and the model weights that perform best on the validation set are selected.
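The selection of the best weights can be sketched independently of any deep learning framework. In the snippet below, `CONFIG` simply restates the hyperparameters from the text, while `train_with_best_checkpoint`, `train_one_round`, and `validate` are hypothetical names; the two callables stand in for the real training and validation code, and the loss values in the usage example are made up.

```python
CONFIG = {
    "optimizer": "SGD",
    "learning_rate": 0.005,
    "weight_decay": 0.0005,
    "batch_size": 8,
    "rounds": 100,
}


def train_with_best_checkpoint(train_one_round, validate, rounds):
    """Run training rounds and keep the state with the lowest validation loss."""
    best_loss = float("inf")
    best_round = None
    best_state = None
    for r in range(1, rounds + 1):
        state = train_one_round(r)   # placeholder for one pass of SGD updates
        loss = validate(state)       # placeholder for the validation pass
        if loss < best_loss:
            best_loss, best_round, best_state = loss, r, state
    return best_round, best_loss, best_state


# Usage with made-up validation losses for five rounds.
fake_losses = [0.9, 0.5, 0.7, 0.4, 0.6]
result = train_with_best_checkpoint(
    lambda r: {"round": r},
    lambda state: fake_losses[state["round"] - 1],
    rounds=len(fake_losses),
)
print(result)
```

Selecting by validation loss is one possible criterion; the text says only that the optimal weights are chosen according to the validation results, so a validation metric such as IoU could be used in the same loop.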

4.2. Comparison Experiments

To verify the effectiveness of the proposed MADF-Net, three established semantic segmentation networks are compared in the experiment, namely, the DeepLabV3+ network [33], MFAF-Net [18], and GCN [34]. Three real Sentinel-1 images with different characteristics were selected for independent testing. In the result figures, blue represents the water area, red represents false alarms, and green represents missed detections. Scene I is mainly a small tributary, Scene II contains a large area of water and some small areas of water, and Scene III contains a large area of water and tributaries against a complex background. All three scenes contain small-scale water bodies, so they test whether the proposed model can solve the problems faced in water detection. In addition, because the three scenes are of different types, they help verify the generalization ability of the model, and their complexity makes them a demanding test of the model’s performance.
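The colour coding of the result figures can be reproduced from a predicted mask and a ground-truth mask. The sketch below is an assumption-laden illustration: the function name and RGB values are hypothetical, and correctly rejected background is drawn in black, which the text does not specify.

```python
BLUE, RED, GREEN, BLACK = (0, 0, 255), (255, 0, 0), (0, 255, 0), (0, 0, 0)


def error_map(pred, truth):
    """Colour each pixel by comparing a binary prediction with the ground truth."""
    colours = {
        (1, 1): BLUE,    # water detected correctly
        (1, 0): RED,     # false alarm
        (0, 1): GREEN,   # missed detection
        (0, 0): BLACK,   # background (colour assumed, not stated in the text)
    }
    return [[colours[(int(bool(p)), int(bool(t)))] for p, t in zip(p_row, t_row)]
            for p_row, t_row in zip(pred, truth)]


print(error_map([[1, 1], [0, 0]], [[1, 0], [1, 0]]))
```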

4.2.1. Scene I

According to Figure 4a, Scene I includes a tributary and a small section of tiny tributaries, and the width of the tributaries varies greatly. In addition, the many zigzags make it difficult to extract the boundaries of the water body accurately, which may lead to missed detections.

The experimental results of Scene I are shown in Figure 4. The results show that GCN and MFAF-Net have obvious false alarms and missed detections for water areas, which indicates that their capability for extracting detailed information of water is limited. In particular, GCN missed many water areas. DeepLabV3+ and MADF-Net give much better detection results for water. DeepLabV3+ can only roughly identify part of the water in the small tributaries, while the proposed MADF-Net has an obvious advantage over DeepLabV3+ in extracting the fine tributaries, which reflects its ability to extract detailed information. Overall, the detection results of the proposed MADF-Net are better than those of the other networks (see Table 1); although some fine water bodies are still omitted, the false alarms are greatly reduced.

4.2.2. Scene II

According to Figure 5, we know Scene II includes a large regional water body with multiple small regional water bodies adjacent to it and some scattered water areas. So, it contains water areas with different scales.

The experimental results show that MFAF-Net (Figure 5f) and GCN (Figure 5e) can both detect essentially all water areas, but with many false alarms, especially MFAF-Net, as can also be seen from Table 2. This reflects their weak semantic information extraction ability in difficult regions. Compared with GCN, DeepLabV3+ has roughly the same number of false alarms, but its detection precision is lower. The precision of the proposed MADF-Net is higher than that of DeepLabV3+. Although it is slightly lower than that of GCN, MADF-Net produces far fewer false alarms, which can be seen from Figure 5 and Table 2.

4.2.3. Scene III

From Figure 6a, we can see Scene III consists of many small areas of water, several elongated rivers, and one large regional water adjacent to multiple smaller regional areas of water.

The results of the experiments with different networks for Scene III are shown in Figure 6. The results show that MFAF-Net (Figure 6f) and GCN (Figure 6e) are not effective in detecting small water areas: there are many missed detections, and the extracted results are relatively incomplete. According to Figure 6c, DeepLabV3+ can roughly detect small and slender water areas, but it has obvious false alarms, reflecting inaccurate semantic information extraction. Figure 6c shows that MADF-Net can extract more detailed water areas with far fewer false alarms than the other networks, which indicates its strong extraction ability for the global semantic information and texture features of water areas, so the water boundaries are much clearer than those produced by the other networks. According to Table 3, MADF-Net also outperforms the other networks in water extraction, in both PA and IoU.

4.3. Analysis of Experimental Results

The comparison experiments on three regions with different characteristics show that the proposed MADF-Net is superior to the other three networks in both semantic information extraction and texture feature extraction. A likely reason is that the proposed DAPP module obtains deeper and more accurate semantic information: it greatly reduces the number of convolution parameters while maintaining detection accuracy, and at the same time enlarges the receptive field through atrous convolution to capture global information. The detection precision of DeepLabV3+ and MADF-Net in the large water regions is good in both cases, but DeepLabV3+ produces many false alarms, so its IoU is only 75.63%, which is much lower than that of MADF-Net (81.446%). The PAs of GCN and MADF-Net in Scene II are 92.7% and 90.4%, respectively. Although GCN can detect more water areas, it has many more false alarms than MADF-Net, which can be seen from their IoUs (83.052% and 86.78%). This shows the strong semantic information extraction capability of the DAPP module. In the results of Scene I in particular, GCN missed most water areas, which may indicate a poor ability to extract the detailed information of small and slender water areas. In contrast, the proposed CSAM helps MADF-Net obtain accurate texture features by introducing the attention mechanism into the low-level features, which improves its detailed feature extraction ability, makes the model pay more attention to texture feature extraction, and avoids information loss.

4.4. Ablation Experiment

To further verify the effect of the proposed MADF-Net, ablation experiments are performed using the DeepLabV3+ network as a baseline, and the average water detection indices over the three testing scenes are shown in Table 4. As the table shows, introducing the two modules, DAPP and CSAM, into DeepLabV3+ yields an obvious performance improvement. The DAPP module improves the detection of fine tributaries by extracting features through deep convolution, so both PA and IoU rise considerably. The CSAM improves edge extraction by extracting texture features from the low-level features; although PA improves only slightly, false alarms decrease markedly. When both modules are introduced into the network, higher water detection precision and fewer false alarms are achieved, which demonstrates the effectiveness of the proposed MADF-Net for water extraction.
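The per-module contributions discussed above can be read off Table 4 as gains over the baseline. The short sketch below only performs that subtraction; the configuration labels such as "baseline + DAPP" are naming choices made here for illustration, and the values are copied from Table 4.

```python
# Average PA and IoU (%) over the three test scenes, copied from Table 4.
ablation = {
    "baseline": (86.06, 79.57),
    "baseline + DAPP": (88.99, 82.38),
    "baseline + CSAM": (86.43, 81.38),
    "baseline + DAPP + CSAM": (89.71, 83.37),
}

base_pa, base_iou = ablation["baseline"]

# Gain of each configuration over the baseline, in percentage points.
gains = {
    name: (round(pa - base_pa, 2), round(iou - base_iou, 2))
    for name, (pa, iou) in ablation.items()
    if name != "baseline"
}

for name, (d_pa, d_iou) in gains.items():
    print(f"{name}: PA +{d_pa:.2f}, IoU +{d_iou:.2f}")
```

The gains make the pattern in the text explicit: CSAM alone adds little PA but a clear IoU gain, which is what a reduction in false alarms would look like.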

5. Discussion

In this paper, single-polarization intensity images acquired by Sentinel-1 in Interferometric Wide (IW) mode are used to construct a dataset, and the proposed MADF-Net is verified for automatic water extraction. DeepLabV3+ extracts low-level features from the middle layer of the backbone network, which contain more image detail, but this process is still limited: when dealing with images with complex boundaries or fine structures, the model may not be able to recover all the details accurately. The GCN model uses a residual-based boundary-refinement module to refine object boundaries, but this method may still fail to solve the boundary localization problem for blurred boundaries or complex scenes. In some extreme cases, the boundary-refinement module may not be effective at distinguishing between adjacent similar objects or handling occlusion. The MFAF-Net model adopts a multi-level feature extraction and attention fusion mechanism, which improves the accuracy of water extraction from multi-source SAR images but also increases the complexity and computational cost of the model. Compared with these three networks, the proposed MADF-Net clearly achieves better water extraction performance.

In terms of structure, the proposed network (MADF-Net) has low-level and high-level feature extraction layers. The low-level feature extraction layers contain a small number of convolutional and pooling layers that capture the basic features of the image. Convolutional layers use small convolutional kernels to extract local features, while pooling layers reduce the size of the feature maps while preserving important information. As the depth of the network increases, the convolutional layers use more kernels and more complex feature combinations to extract higher-level features. These features are more abstract and can represent complex objects or scene structures. Methodologically, the depthwise separable convolution significantly reduces the number of parameters of the model, which makes the network lighter and reduces its complexity and storage requirements. The pointwise convolutional layer then fuses the features of different channels to maintain good feature extraction and representation capabilities. Multi-scale pooling captures more diverse feature information, which helps to enhance the invariance of the model to scale changes in the input data and improves its robustness. The attention mechanism helps the model capture the key features of the input data more accurately, improving its performance. Therefore, MADF-Net not only enhances extraction accuracy but also shows clear advantages in capturing finer detail.
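The parameter saving from depthwise separable convolution can be checked with a simple weight count. The sketch below compares a standard k x k convolution with a depthwise convolution followed by a 1 x 1 pointwise convolution. The layer size (3 x 3 kernel, 256 input and output channels) is a hypothetical example, not a configuration taken from MADF-Net, and biases are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights of a k x k standard convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights of a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer size, chosen only for illustration.
k, c_in, c_out = 3, 256, 256
std = standard_conv_params(k, c_in, c_out)
sep = depthwise_separable_params(k, c_in, c_out)
ratio = sep / std  # equals 1 / c_out + 1 / k**2

print(f"standard: {std}, separable: {sep}, ratio: {ratio:.3f}")
```

For this example layer the separable form needs roughly 11.5% of the weights of the standard convolution, which illustrates why the text describes the network as lighter.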

In addition, our network achieved an IoU of 86.78% (Scene II, Table 2), more than 3 percentage points higher than the other benchmark networks. However, this gain in accuracy comes with higher computational complexity and storage requirements. Specifically, our network reaches 262 GFLOPs and 59 million parameters, which is slightly higher than some efficient networks. The improvement in accuracy may be attributed to the deeper and richer feature extraction and the more complex connection patterns used in our architecture. Although this leads to slightly higher computational complexity and storage requirements, we believe that such a design also yields stronger feature representation and better generalization, which may be of great value in practical applications. To address this problem, we are exploring model compression and optimization techniques, such as pruning, quantization, and model distillation, to reduce FLOPs and parameters while maintaining high accuracy. We believe that these efforts can lead to a more efficient, high-performance network model in the near future.
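To give a rough sense of the storage cost behind the 59 million parameters and of what quantization could save, the sketch below estimates weight storage at several numeric precisions. The bytes-per-parameter values are standard for these formats, but the precision actually used by the authors is not stated, so the figures are an assumption-based estimate rather than reported results.

```python
PARAMS = 59_000_000  # parameter count reported in the text

# Bytes per parameter for common storage formats (assumed, not from the paper).
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def model_size_mb(num_params, dtype):
    """Approximate weight storage in megabytes (1 MB = 10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


sizes = {dtype: model_size_mb(PARAMS, dtype) for dtype in BYTES_PER_PARAM}

for dtype, mb in sizes.items():
    print(f"{dtype}: {mb:.0f} MB")
```

The estimate covers weights only; activations, optimizer state, and file-format overhead are not included.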

Furthermore, the data currently used are all single-band, single-resolution, and single-source SAR images, which may limit the application of the method to SAR images of different frequencies and resolutions. In addition, in shallow waters, the penetrating ability of long-wavelength radar may cause scattered signals from the water body and the land surface to interfere with each other, affecting the accuracy of water extraction. Therefore, in further research, we will collect water scenes with different backgrounds and SAR images with different water depths and radar parameters, and fuse data from different sources to identify problems in water detection and improve the network. Ultimately, the network could be used to automatically detect large areas of water against different backgrounds with high accuracy.

6. Conclusions

In this paper, a multi-scale attention detailed feature fusion network, MADF-Net, is proposed to address the poor detection accuracy of fine tributaries and small water areas in SAR images. In MADF-Net, a deep multi-scale feature extraction module (DAPP) is constructed, which not only improves the overall accuracy of water detection but also reduces the number of parameters. It further enhances the network's ability to capture key water features in complex scenes by performing multi-scale pooling operations on the input features. The DAPP module extracts information at different scales, increasing sensitivity to small targets such as small streams and lake edges.
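How a pyramid of atrous (dilated) convolutions covers several scales can be illustrated with the standard effective-kernel-size relation k_eff = k + (k - 1)(r - 1), where k is the kernel size and r the dilation rate. The rates used below (1, 6, 12, 18) are those commonly associated with DeepLabV3+-style pyramids and are assumed here for illustration; the rates actually used in DAPP are not given in this section.

```python
def effective_kernel_size(k, rate):
    """Spatial extent covered by a k x k kernel with the given dilation rate."""
    return k + (k - 1) * (rate - 1)


# Assumed dilation rates (illustrative, not taken from the paper).
ASSUMED_RATES = (1, 6, 12, 18)

# Extent in pixels covered by a 3 x 3 kernel at each rate.
spans = {rate: effective_kernel_size(3, rate) for rate in ASSUMED_RATES}

for rate, span in spans.items():
    print(f"rate {rate:2d}: covers {span} x {span} pixels with 9 weights")
```

Each branch keeps the same nine weights per filter while covering a larger area, which is one way a module can gain sensitivity to both narrow streams and large water bodies without adding parameters.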

Furthermore, an edge feature extraction module (CSAM) is proposed based on the channel attention module and the spatial attention module. By fusing the attention weights of the channel and spatial dimensions, the module enables the network to automatically focus on the feature areas that are critical for identifying the edge of the water body. This attention mechanism greatly improves the network's ability to distinguish small objects from the background in complex scenes, ensuring that the extracted water body information is both accurate and detailed. In the decoding stage, we fuse the high-level semantic information extracted by the encoder with the low-level edge features generated by the decoder to achieve a comprehensive feature description from coarse to fine, and from the whole to the parts. This multi-level feature fusion strategy not only improves the adaptability of the model to specific scenarios (i.e., improves the accuracy of small target extraction), but also enhances its generalization ability. The low-level edge features provide a direct description of the shape and contour of an object, while the high-level semantic information provides an in-depth understanding of the object's class and attributes, so combining the two allows the model to maintain stable performance in different scenarios.
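The idea of fusing channel and spatial attention weights can be sketched on a tiny feature map. The code below is a parameter-free stand-in and not the CSAM implementation: a real module would pass the pooled statistics through learned layers before the sigmoid. The function name `channel_spatial_attention`, the pooling choices, and the multiplicative fusion are all assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_spatial_attention(feat):
    """Reweight a C x H x W feature map (nested lists) by channel and spatial weights.

    Parameter-free stand-in: no learned layers are applied to the pooled
    statistics, unlike in a trained attention module.
    """
    c, h, w = len(feat), len(feat[0]), len(feat[0][0])

    # Channel weights: sigmoid of each channel's global average.
    channel_w = [sigmoid(sum(map(sum, ch)) / (h * w)) for ch in feat]

    # Spatial weights: sigmoid of the mean over channels at each pixel.
    spatial_w = [
        [sigmoid(sum(feat[k][i][j] for k in range(c)) / c) for j in range(w)]
        for i in range(h)
    ]

    # Fuse both weights multiplicatively with the input features.
    return [
        [[feat[k][i][j] * channel_w[k] * spatial_w[i][j] for j in range(w)]
         for i in range(h)]
        for k in range(c)
    ]


# Toy 2-channel, 2 x 2 feature map with a strong response at pixel (0, 1).
features = [
    [[0.0, 4.0], [0.0, 0.0]],
    [[0.0, 2.0], [0.0, 0.0]],
]
attended = channel_spatial_attention(features)
print(attended[0][0][1], attended[1][0][1])
```

In this toy case the pixel with the strong response receives a larger spatial weight than the empty pixels, and the channel with the larger average response receives a larger channel weight.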

From the selection of ResNet101 to the introduction of DAPP and CSAM and the design of the feature fusion strategy, our method is consistently optimized around the goal of improving the accuracy of small target extraction. Together, these design elements make the approach well suited to small targets in complex scenarios. Experiments are conducted using three water scenes with different features, scales, and backgrounds from Sentinel-1 images. The results show that MADF-Net has an obvious performance improvement over DeepLabV3+, GCN, and MFAF-Net in its ability to extract detailed water information, with an average precision and IoU for water extraction of up to 92.77% and 89.03% for the three scenes. Therefore, MADF-Net can perform high-precision water extraction from SAR images for different types of water, and it could also be extended to segmentation tasks for other typical targets in SAR images. If a flood occurs, MADF-Net could be used to detect water so that the resulting damage can be evaluated precisely.

Author Contributions

Conceptualization, J.W. and D.J.; methodology, J.W.; software, D.J. and J.X.; validation, J.W., D.J. and Z.W.; formal analysis, W.S. and Z.W.; writing—original draft preparation, J.W. and D.J.; writing—review and editing, J.W., D.J. and J.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Natural Science Foundation of China (61901358), the Outstanding Youth Science Fund of Xi’an University of Science and Technology (2020YQ3-09), and the China Postdoctoral Science Foundation (2020M673347).

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

The authors are grateful to the European Space Agency (ESA) for providing the Sentinel-1 data through the Sentinels Scientific Data Hub (https://scihub.copernicus.eu). The authors would also like to thank the reviewers and the editor for their constructive comments.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Block diagram of the network structure.

Figure 2. DSCFE module structure.

Figure 3. The structure of CSAM.

Figure 4. The detection results of water for Scene I by different networks. (a) is the SAR image. (b) is the ground truth. (c–f) are the results of the DeepLabV3+, MADF-Net, GCN, and MFAF-Net, respectively. The blue color, green color, and red color denote correct water detection, missed detections, and false alarms for water, respectively.

Figure 5. The detection results of water for Scene II by different networks. (a) is the SAR image. (b) is the ground truth. (c–f) are the results of the DeepLabV3+, MADF-Net, GCN, and MFAF-Net, respectively. The blue color, green color, and red color denote correct water detection, missed detections, and false alarms for water, respectively.

Figure 6. The detection results of water for Scene III by different networks. (a) is the SAR image. (b) is the ground truth. (c–f) are the results of the DeepLabV3+, MADF-Net, GCN, and MFAF-Net, respectively. The blue color, green color, and red color denote correct water detection, missed detections, and false alarms for water, respectively.

Table 1. Water detection results of different networks for Scene I.

Networks	PA (%)	IoU (%)
DeepLabV3+	79.96	71.73
MFAF-Net	71.05	63.15
GCN	13.64	13.59
MADF-Net	80.88	76.16

Table 2. Water detection results of different networks for Scene II.

Networks	PA (%)	IoU (%)
DeepLabV3+	89.35	83.13
MFAF-Net	95.84	77.08
GCN	92.74	83.05
MADF-Net	90.40	86.78

Table 3. Water detection results of different networks for Scene III.

Networks	PA (%)	IoU (%)
DeepLabV3+	84.64	75.63
MFAF-Net	53.54	50.57
GCN	51.73	50.94
MADF-Net	86.77	81.45

Table 4. Evaluation of ablation experimental results.

Network	DAPP	CSAM	PA (%)	IoU (%)
Baseline	×	×	86.06	79.57
Baseline + DAPP	√	×	88.99	82.38
Baseline + CSAM	×	√	86.43	81.38
Baseline + DAPP + CSAM	√	√	89.71	83.37

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
