A Novel Deep Learning Network Model for Extracting Lake Water Bodies from Remote Sensing Images

Liu, Min; Liu, Jiangping; Hu, Hua

doi:10.3390/app14041344


by Min Liu ¹,², Jiangping Liu ¹,²,* and Hua Hu

¹ College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot 010018, China

² Inner Mongolia Autonomous Region Key Laboratory of Big Data Research and Application of Agriculture and Animal Husbandry, College of Computer and Information Engineering, Hohhot 010018, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2024, 14(4), 1344; https://doi.org/10.3390/app14041344

Submission received: 12 December 2023 / Revised: 14 January 2024 / Accepted: 3 February 2024 / Published: 6 February 2024


Abstract: Extracting lake water bodies from remote sensing images provides reliable data support for water resource management, environmental protection, natural disaster early warning, and scientific research, and helps to promote sustainable development and to protect the ecological environment and human health. With reference to the classical encoding-decoding semantic segmentation network, we propose R50A3-LWBENet, a network model for lake water body extraction from remote sensing images based on ResNet50 and three attention mechanisms. The R50A3-LWBENet model uses ResNet50 for feature extraction (encoding). A squeeze and excitation (SE) block is added to the residual module, which highlights the deeper features of the water body in the feature map during down-sampling and takes into account the importance of the feature map channels, so that the multiscale relationship between pixels is captured better. After feature extraction, the convolutional block attention module (CBAM) is added to give the model a global adaptive perception capability and to focus more attention on the water body part of the image. The feature map is then up-sampled using bilinear interpolation and the features at different levels are fused (decoding) to complete the extraction of the lake water body. Compared with U-Net, AU-Net, RU-Net, ARU-Net, SER34AUNet, and MU-Net, the R50A3-LWBENet model has the fastest convergence speed and the highest MIoU, 97.6%. It combines global and local information better, refines the edge contours of the lake water body, and has stronger feature extraction capability and segmentation performance.

Keywords: remote sensing; water extraction; lake; semantic segmentation; attention mechanism; ResNet

1. Introduction

Lakes play an important role in agricultural irrigation and regional ecological environment regulation [1,2,3,4]. With the impact of climate change and human social activities, the reduction of lake area and water body pollution have become urgent problems [5,6,7]. Obtaining information about lakes and monitoring changes in lake water bodies are important in preventing and controlling pollution, rationally developing water resources, and building an ecological civilization [8,9].

The traditional way of extracting water bodies, which relies on manual measurements and water body monitoring stations, is limited by environmental factors and has low timeliness. In recent years, high-resolution satellite images have become increasingly common, providing a reliable data source for the extraction of lake water bodies. For long distances, complex terrain and inaccessible areas, remote sensing satellites offer contactless measurement, short imaging cycles, and high work efficiency. Therefore, the rapid and accurate extraction of water body information from remote sensing satellite images has great application value. The key to water body extraction is to enhance the difference between the water body and the background and to locate the boundary between them, so that the two can be separated. The band threshold method was the earliest water body extraction method for remote sensing images, but its accuracy and stability may suffer under complex backgrounds, illumination changes and poor data quality [10,11,12]. In practical applications, the classifier method is commonly used to improve the accuracy and reliability of water body extraction [13,14,15]. Although the classifier method integrates remote sensing image information such as spectra, shape and texture, its dependence on sample selection, segmentation thresholds and classification criteria limits its generalization ability, so it cannot be adapted to water body extraction projects with large data volumes.
The fully convolutional neural network model based on deep learning carries out the whole process of lake water body extraction from remote sensing images without manually set parameters. It segments the image automatically based on each pixel and the regional characteristics of that pixel, and it performs well in both extraction speed and accuracy. Shelhamer E. et al. proposed fully convolutional networks (FCN), which accomplish end-to-end semantic segmentation and significantly improve target segmentation accuracy [16]. Ronneberger O. et al. constructed U-Net on top of FCN; its encoder extracts features by decreasing the height and width of the feature maps while increasing the number of channels, and its decoder restores the image resolution [17]. He K. et al. introduced residual learning into deep networks, which substantially improved the extraction accuracy of semantic segmentation [18]. Yang F. et al. used mask region-based convolutional neural networks (Mask R-CNN) built on ResNet50 and ResNet101 to automatically detect and extract water bodies in remote sensing images with good accuracy [19]. Guo H. et al. established a multi-scale water body extraction CNN to automatically extract water bodies in Gaofen-1 remote sensing images [20]. Wu P. et al. proposed using the spatial pyramid pooling module to combine feature maps at different scales in DeepLab V3+ [21]. Wang X. et al. improved the encoding-decoding U-Net model and proposed the SER34AUNet model, which achieved good results in water body extraction experiments [22]. Zhang Y. et al. proposed MU-Net, a new method for automatic extraction of water bodies that builds on U-Net to model the local spatial detail information and global contextual information of images [23].
Although the above methods have greatly improved the accuracy and efficiency of water body extraction, further gains in accuracy still face major challenges such as multi-scale features, noise interference and boundary blurring. The classical FCN encoder has few layers, cannot fully extract features, and loses some detail information when up-sampling to recover the resolution, which leads to unclear boundaries in the segmentation results, especially for small targets or fine structures. U-Net may produce fuzzy boundaries for the segmented objects, especially at target edges, so its segmentation results are not fine enough. In addition, U-Net is sensitive to the scale of the input image, which may degrade segmentation when dealing with images of mismatched scales. Mask R-CNN contains multiple components; this complexity leads to higher computational cost and training time, and its segmentation results may have blurred boundaries in the edge region of the target object. If noise introduces inconsistent textures or features, the DeepLab V3+ model may produce unstable predictions in noisy regions. To address these limitations, we propose a novel lake water body extraction model, R50A3-LWBENet, which uses a more complex ResNet50 network structure and introduces shortcut connections, attention mechanisms, and up-sampling to solve the problems of loss of fine information, insufficient semantic information, information transfer limitations, and boundary blurring. The combined application of these methods helps to improve the performance of the model in dealing with multi-scale features, noise interference, and boundary blurring.

The study includes the following parts:

(1) Encoding. ResNet50 serves as the backbone network and captures more abstract and richer feature representations to accurately recognize and segment different classes of objects and regions. The structure of the residual block in ResNet allows the network to reuse features at different layers, thus providing more multi-scale information to the decoder, which helps to solve the multi-scale feature fusion problem in semantic segmentation.

(2) SE block. A squeeze and excitation (SE) block is incorporated into the residual module of ResNet50, which enables the SE-ResNet network to take into account the importance of each feature map during feature extraction, strengthening useful information and suppressing useless information.

(3) CBAM. After feature extraction, the convolutional block attention module (CBAM) attends to the channel and spatial features of all feature maps at the same time to remove noise interference and improve the accuracy of the edges of the lake water body. The attention mechanism lets the network dynamically adjust the weights of the feature map according to the importance of the features, enhancing the features that help segmentation and reducing noise and unnecessary information. This improves the model's attention to key features and limits over-reliance on noise and other interfering information, which improves the accuracy and robustness of semantic segmentation.

(4) Decoding. The feature maps are up-sampled using bilinear interpolation to bring the low-resolution feature maps back to the original resolution. The up-sampled feature maps are fused with the high-resolution feature maps obtained through shortcut connections, a mechanism that enables the decoder to use both shallow and deep features from the encoder. Shallow features typically contain more local details, while deep features carry more advanced semantic information. This connection enables the decoder to apply different levels of features simultaneously and to perceive information at different scales of the input data, thus realizing multi-scale feature fusion. Multi-scale feature fusion enhances the network's ability to capture the boundary information of the lake water body and makes the segmentation results more accurate in the boundary region.

(5) Experiments. Lake water body extraction experiments are conducted for the R50A3-LWBENet model and the U-Net, AU-Net, RU-Net, ARU-Net, SER34AUNet, and MU-Net models, and the performance of the different models is evaluated and compared in three respects: visual observation, performance evaluation metrics, and model convergence.

2. Materials and Methods

2.1. Study Area and Datasets

In this paper, we use a visible-spectrum Google Earth remote sensing image dataset of lake waters on the Tibetan Plateau, which consists of 6773 RGB images, each of size 256 × 256 [24]. The dataset was randomly divided at a ratio of 8:2 into a training set of 5419 images and a test set of 1354 images. By type, the lakes of the Tibetan Plateau can be divided into saltwater lakes and freshwater lakes. Saltwater lakes have salt belts along their shores, which show a complex structure. Freshwater lakes appear white, light blue, dark blue or black. Black lakes have features similar to mountains and cloud shadows, while white lakes have features similar to snow. By state, there are frozen, semi-frozen and non-frozen lakes. The study area of lake water bodies is shown in Figure 1.
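As an illustration only, the 8:2 random split can be sketched as follows. The function name `split_indices`, the fixed seed, and the choice to round the training share up are assumptions made for this sketch; the text does not state how the split was implemented.

```python
import math
import random


def split_indices(n_images, train_ratio=0.8, seed=0):
    """Shuffle image indices and split them into train and test subsets."""
    indices = list(range(n_images))
    # A fixed seed (an assumption) keeps the split reproducible.
    random.Random(seed).shuffle(indices)
    # Rounding up is assumed here; it yields the 5419 / 1354 split reported above.
    n_train = math.ceil(n_images * train_ratio)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_indices(6773)
print(len(train_idx), len(test_idx))  # prints: 5419 1354
```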

2.2. R50A3-LWBENet Model Components

2.2.1. ResNet

ResNet is a deep convolutional neural network that has achieved remarkable results in several computer vision tasks such as image classification, target detection and semantic segmentation. The residual blocks introduced in this network alleviate the problems of gradient vanishing and gradient explosion during the training of deep neural networks, allowing deeper networks to be trained efficiently. By stacking residual blocks, the details of features and contextual information can be captured more fully, improving the expressive power of the network. The core idea of a residual block is to connect its original input directly to its output, forming a cross-layer connection. Each residual block contains a main path and a shortcut connection. The main path performs a series of convolutional operations and nonlinear activation functions, while the shortcut connection adds the original input directly to the output of the main path. The advantage is that when the main path has difficulty learning effective features, the network can avoid feature loss by passing information from the original input directly through the shortcut connection. ResNet comes in five main variants: ResNet18, ResNet34, ResNet50, ResNet101, and ResNet152. The variant used as the backbone can be chosen according to task requirements and computational resource constraints [25,26].
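The shortcut idea can be illustrated with a minimal NumPy sketch. The helper `residual_block` and the toy `main_path` are hypothetical and stand in for the real convolutional layers; the ReLU after the addition follows the usual ResNet layout.

```python
import numpy as np


def residual_block(x, main_path):
    """Add the input back onto the main-path output, then apply ReLU."""
    return np.maximum(main_path(x) + x, 0.0)


x = np.array([1.0, -2.0, 3.0])
# Toy main path that has learned nothing (all zeros): the shortcut still
# carries the input through, so the block outputs ReLU(x).
out = residual_block(x, lambda v: np.zeros_like(v))
print(out)
```

Even when the main path contributes nothing, the input information survives through the shortcut, which is the behaviour described above.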

2.2.2. SE-ResNet

The SE-ResNet used in our model fuses the SE block, a channel-based attention mechanism, into the residual module of ResNet. The resulting network has stronger feature extraction capability and deeper network layers, and it performs feature extraction in the channel dimension so that it can better learn the feature information of the water body in the image. This approach also improves the accuracy of tasks such as image classification and target detection to a certain extent [27,28]. The structure of the SE-ResNet network is shown in Figure 2.

The original ResNet residual module is shown in Figure 2a, and the SE-ResNet residual module is shown in Figure 2b. As can be seen in Figure 2b, the biggest difference between the two is that, after passing through the ordinary convolutional layers of the ResNet residual module, the feature maps are re-weighted channel by channel by the SENet layer. This emphasizes the channel features throughout the whole encoding process and strengthens the attention paid to the water body.
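The channel re-weighting performed by an SE block can be sketched in NumPy as below. This is a simplified illustration, not the authors' implementation: the weights are random, biases are omitted, and the reduction ratio `r = 4` and all variable names are assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def se_block(x, w_reduce, w_expand):
    """x: (C, H, W); w_reduce: (C//r, C); w_expand: (C, C//r)."""
    squeezed = x.mean(axis=(1, 2))                 # squeeze: one value per channel
    hidden = np.maximum(w_reduce @ squeezed, 0.0)  # reduce channels, then ReLU
    weights = sigmoid(w_expand @ hidden)           # excitation: weights in (0, 1)
    return x * weights[:, None, None], weights     # re-weight each channel


rng = np.random.default_rng(0)
C, r = 8, 4  # r is an assumed reduction ratio
x = rng.standard_normal((C, 5, 5))
w_reduce = rng.standard_normal((C // r, C)) * 0.1
w_expand = rng.standard_normal((C, C // r)) * 0.1
y, weights = se_block(x, w_reduce, w_expand)
print(y.shape, weights.round(3))
```

In an SE-ResNet residual module, `y` would then be added to the shortcut input.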

2.2.3. Convolutional Block Attention Module (CBAM)

CBAM merges the channel attention mechanism and the spatial attention mechanism into a sequential attention structure that runs from channel to space [29]. Channel attention handles the allocation of weights among the feature map channels, and distributing attention over more than one dimension strengthens the effect of the attention mechanism on model performance. The spatial attention mechanism allows the network model to pay more attention to the pixel regions that play a decisive role in image segmentation and to ignore irrelevant regions [30]. The feature map goes through the channel attention mechanism module and then the spatial attention mechanism module. The structure of the channel attention mechanism module is shown in Figure 3.

The input feature map F (C × H × W) passes through two parallel layers, global maximum pooling (MaxPool) and global average pooling (AvgPool), each of which produces a descriptor of size (C × 1 × 1). Both descriptors then pass through the shared MLP layer, which first compresses the number of channels to 1/r of the original (r = 16), then expands it back to the original number of channels, and applies the ReLU activation function, giving two activated feature matrices. The two output feature matrices are summed and passed through a Sigmoid activation function layer to obtain the attention weight M_C of each channel. Finally, M_C is multiplied with F to obtain the feature map M_C(F) (C × H × W) endowed with channel attention. The channel attention mechanism can be expressed as Equation (1).

M_{C}(F) = \sigma(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))) = \sigma(W_{1}(W_{0}(F_{avg}^{C})) + W_{1}(W_{0}(F_{\max}^{C})))

(1)

where σ is the Sigmoid activation function, W₀ is the weight of the fully connected hidden layer, and W₁ is the weight of the fully connected output layer. The feature map output from the channel attention mechanism module is then passed through the spatial attention mechanism module, whose structure is shown in Figure 4.
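A minimal NumPy sketch of Equation (1) is given below for illustration. The weights are random and biases are omitted. The ReLU is placed between W₀ and W₁, as in the original CBAM design; this placement and all variable names are assumptions of the sketch rather than details confirmed for R50A3-LWBENet.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def channel_attention(f, w0, w1):
    """f: (C, H, W); w0: (C//r, C); w1: (C, C//r). Returns weights of shape (C,)."""
    f_avg = f.mean(axis=(1, 2))  # global average pooling -> (C,)
    f_max = f.max(axis=(1, 2))   # global maximum pooling -> (C,)

    def mlp(v):
        # Shared MLP: compress to C//r channels, ReLU, expand back to C.
        return w1 @ np.maximum(w0 @ v, 0.0)

    return sigmoid(mlp(f_avg) + mlp(f_max))


rng = np.random.default_rng(0)
C, r = 32, 16  # r = 16 as stated in the text
f = rng.standard_normal((C, 8, 8))
w0 = rng.standard_normal((C // r, C)) * 0.1
w1 = rng.standard_normal((C, C // r)) * 0.1
m_c = channel_attention(f, w0, w1)
refined = f * m_c[:, None, None]  # channel-refined feature map
print(m_c.shape, refined.shape)
```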

First, the channel-refined feature, i.e., M_C(F), is passed through the MaxPool layer and the AvgPool layer along the channel dimension to obtain two feature maps of size (1 × H × W). Second, the two feature maps are concatenated and reduced to a 1-channel feature map by a convolutional layer with a kernel of size (7 × 7). The result is then passed through a Sigmoid activation function to obtain the spatial attention weight M_S for each pixel. Finally, M_S is multiplied by M_C(F) to obtain the feature map M_S(F) (C × H × W) endowed with spatial attention. The spatial attention mechanism can be expressed as Equation (2).

M_{S}(F) = \sigma(f^{7 \times 7}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])) = \sigma(f^{7 \times 7}([F_{avg}^{S}; F_{\max}^{S}]))

(2)

where σ is the Sigmoid activation function and f^{7 × 7} is a convolutional layer with a convolutional kernel size of (7 × 7).
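Equation (2) can likewise be sketched in NumPy. The explicit loop stands in for the (7 × 7) convolution; zero padding, the absence of a bias, the random kernel, and all names are assumptions of this illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def spatial_attention(f, kernel):
    """f: (C, H, W); kernel: (2, k, k). Returns weights of shape (H, W)."""
    # Pool over the channel axis and stack as [AvgPool; MaxPool] -> (2, H, W).
    pooled = np.stack([f.mean(axis=0), f.max(axis=0)])
    k = kernel.shape[-1]
    pad = k // 2
    # Zero padding (assumed) keeps the output at the input resolution.
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))
    h, w = f.shape[1:]
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return sigmoid(out)


rng = np.random.default_rng(0)
f = rng.standard_normal((4, 6, 6))
kernel = rng.standard_normal((2, 7, 7)) * 0.05
m_s = spatial_attention(f, kernel)
refined = f * m_s[None, :, :]  # every channel shares the same spatial weights
print(m_s.shape, refined.shape)
```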

The CBAM module is able to derive the attention map sequentially along the two independent dimensions of channel and space with a very small amount of computation, carry out adaptive feature refinement, and suppress the noise interference, so as to realize adaptive extraction of water body features [31]. The structure of the CBAM module is shown in Figure 5.

2.3. R50A3-LWBENet Holistic Network Structure

The R50A3-LWBENet model takes a single RGB image as input and combines the ResNet50 backbone with attention mechanisms to fuse global and local information as well as large-scale and small-scale information, and to refine the edge contours of lake water bodies. In the encoding part, we use the SE-ResNet50 network, which is composed of residual modules improved with the SE block. It attends to both the holistic and the salient features of the feature map channels while ensuring that deepening the network does not degrade model performance. After encoding, CBAM, which contains a channel attention module and a spatial attention module, is added. The channel attention module applies global average pooling and global maximum pooling to the feature map and passes the results through fully connected layers. It adaptively adjusts the weight of each channel, helping the network to attend to important feature information in specific channels and to suppress irrelevant features, so that the network uses the features of different channels more efficiently and captures the important information in the image better. The spatial attention module takes the maximum and the average over the channels at each pixel, stacks the two results along the channel dimension, reduces them to 1 channel with a convolution, and applies a sigmoid activation to obtain the weight of each pixel of the input feature map; this weight is finally multiplied by the original input feature map. Spatial attention is thus achieved by learning a weight for each spatial location. It adaptively adjusts the feature map according to the importance of different locations, enhancing the feature expression of important regions and suppressing the feature information of secondary regions.
Through the joint action of channel attention and spatial attention, the CBAM module fully considers the importance of each channel and each spatial location in the feature map, which helps the network to focus on the important regions in the image and extract more discriminative features. The decoding part uses bilinear interpolation for up-sampling. The up-sampled feature maps are fused, through shortcut connections, with the feature maps of the same size from the down-sampling process of the ResNet50 feature extraction network, yielding the final segmented image of the lake water body. The holistic network structure of R50A3-LWBENet is shown in Figure 6.
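One decoding step (bilinear up-sampling followed by fusion with a same-size encoder feature map) might look as follows in NumPy. The corner-aligned sampling grid, the fusion by channel concatenation, and the tensor sizes are assumptions chosen to keep the sketch small.

```python
import numpy as np


def bilinear_upsample(x, scale=2):
    """x: (C, H, W) -> (C, H*scale, W*scale), corner-aligned bilinear interpolation."""
    c, h, w = x.shape
    ys = np.linspace(0, h - 1, h * scale)  # source row coordinate of each output row
    xs = np.linspace(0, w - 1, w * scale)  # source column coordinate of each output column
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[None, :, None]          # vertical interpolation weights
    wx = (xs - x0)[None, None, :]          # horizontal interpolation weights
    top = x[:, y0][:, :, x0] * (1 - wx) + x[:, y0][:, :, x1] * wx
    bottom = x[:, y1][:, :, x0] * (1 - wx) + x[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bottom * wy


rng = np.random.default_rng(0)
deep = rng.standard_normal((4, 8, 8))    # low-resolution decoder input
skip = rng.standard_normal((2, 16, 16))  # encoder feature map of the target size
up = bilinear_upsample(deep)
fused = np.concatenate([up, skip], axis=0)  # fuse along the channel axis
print(up.shape, fused.shape)
```

In a full decoder, convolutional layers would follow the concatenation to mix the fused channels.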

2.4. Model Performance Evaluation Metrics

Semantic segmentation is pixel-level classification, and lake water body extraction falls within its scope. Commonly used performance evaluation metrics include pixel accuracy (PA), mean pixel accuracy (MPA), and mean intersection over union (MIoU). These metrics are calculated from the confusion matrix, which counts the number of samples assigned to the correct and incorrect categories [32,33]. The higher the values of PA, MPA and MIoU, the better the model extracts the predicted targets. The three metrics are calculated as follows.

2.4.1. Pixel Accuracy (PA)

PA is the number of correctly classified pixels divided by the total number of pixels in the observed labeled image, i.e., the percentage of correctly classified pixels in the image. It is calculated as shown in Equation (3).

\mathrm{PA} = \frac{\sum_{i=1}^{c} n_{ii}}{\sum_{i=1}^{c} \sum_{j=1}^{c} n_{ij}}

(3)

where n_ii is the number of pixels in which the observed category i is predicted to be category i; n_ij is the number of pixels in which the observed category i is predicted to be category j; and c is the number of categories.
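As a small worked example of Equation (3), the following sketch computes PA from a confusion matrix. The 2-class matrix (background, water) is hypothetical and serves only to make the arithmetic concrete.

```python
def pixel_accuracy(conf):
    """conf[i][j]: number of pixels of observed class i predicted as class j."""
    correct = sum(conf[i][i] for i in range(len(conf)))
    total = sum(sum(row) for row in conf)
    return correct / total


# Hypothetical confusion matrix: rows are observed classes, columns are predictions.
conf = [[50, 10],
        [5, 35]]
pa = pixel_accuracy(conf)
print(pa)  # (50 + 35) / 100 = 0.85
```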

2.4.2. Mean Pixel Accuracy (MPA)

MPA is calculated by first computing, for each class, the ratio of the number of correctly classified pixels to the total number of pixels of that class in the predicted labels, and then averaging these ratios over all classes. The calculation formula is shown in Equation (4).

\mathrm{MPA} = \frac{1}{c} \sum_{i=1}^{c} \frac{n_{ii}}{\sum_{j=1}^{c} n_{ji}}

(4)

where n_ji is the number of pixels in which the observed category j is predicted into category i. c, n_ii are defined as above.
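Equation (4) can be checked on a small example. The sketch follows the equation as written, dividing n_ii by the number of pixels predicted as class i; the 2-class confusion matrix is hypothetical.

```python
def mean_pixel_accuracy(conf):
    """conf[i][j]: number of pixels of observed class i predicted as class j."""
    c = len(conf)
    ratios = []
    for i in range(c):
        predicted_as_i = sum(conf[j][i] for j in range(c))  # sum over j of n_ji
        ratios.append(conf[i][i] / predicted_as_i)
    return sum(ratios) / c


# Hypothetical confusion matrix: rows are observed classes, columns are predictions.
conf = [[50, 10],
        [5, 35]]
mpa = mean_pixel_accuracy(conf)
print(round(mpa, 4))  # (50/55 + 35/45) / 2
```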

2.4.3. Mean Intersection over Union (MIoU)

MIoU focuses on the accuracy of the prediction of the target region and is a standard accuracy metric. IoU is first calculated for each class separately as the ratio of the intersection of the observed and predicted labels, i.e., the number of correctly predicted pixels of the class, to their union, i.e., the number of pixels assigned to the class in either the observed or the predicted labels. The IoU is then averaged across all classes to obtain MIoU, which is computed using Equation (5).

\mathrm{MIoU} = \frac{1}{c} \sum_{i=1}^{c} \frac{n_{ii}}{\sum_{j=1}^{c} n_{ij} + \sum_{j=1}^{c} n_{ji} - n_{ii}}

(5)

where c, n_ii, n_ij, and n_ji are defined as above.
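A short sketch of Equation (5) on a hypothetical 2-class confusion matrix shows how the union is formed from the row sum, the column sum, and the diagonal entry.

```python
def mean_iou(conf):
    """conf[i][j]: number of pixels of observed class i predicted as class j."""
    c = len(conf)
    ious = []
    for i in range(c):
        observed_i = sum(conf[i])                      # sum over j of n_ij
        predicted_i = sum(conf[j][i] for j in range(c))  # sum over j of n_ji
        union = observed_i + predicted_i - conf[i][i]
        ious.append(conf[i][i] / union)
    return sum(ious) / c


# Hypothetical confusion matrix: rows are observed classes, columns are predictions.
conf = [[50, 10],
        [5, 35]]
miou = mean_iou(conf)
print(round(miou, 4))  # (50/65 + 35/50) / 2
```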

2.5. Cross Entropy Loss Function

The cross entropy loss function is commonly used for classification problems and measures the difference between the observed probability distribution and the predicted probability distribution [34]. Its formula is shown in Equation (6).

\mathrm{Loss} = -\sum_{i=1}^{c} p(x_{i}) \log(q(x_{i}))

(6)

where p(x) is the observed distribution of the sample, q(x) is the predicted distribution, and c is defined as above. The smaller the value of Loss, the closer the predicted output is to the true sample label and the better the model predicts.
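For a single pixel, Equation (6) can be evaluated as in the sketch below. The natural logarithm, the clipping constant `eps`, and the example probabilities are assumptions made for illustration.

```python
import math


def cross_entropy(p, q, eps=1e-12):
    """p: observed distribution; q: predicted distribution over the same c classes."""
    # Clipping q avoids log(0); eps is an assumed safeguard, not part of Equation (6).
    return -sum(pi * math.log(max(qi, eps)) for pi, qi in zip(p, q))


p = [0.0, 1.0]                         # one-hot label: the pixel is water
good = cross_entropy(p, [0.1, 0.9])    # confident, correct prediction
bad = cross_entropy(p, [0.6, 0.4])     # prediction leaning towards background
print(round(good, 4), round(bad, 4))
```

The prediction closer to the label yields the smaller loss, in line with the interpretation given above.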

3. Results and Discussion

3.1. Experimental Environment Configuration

In this paper, model training and model testing are carried out alternately: after the training set completes one round of iteration, the model is tested on the test set images so that the performance evaluation metrics are obtained in a timely manner. This approach saves considerable training and testing time and makes the model parameters available promptly, so that the parameters can be adjusted at any time according to the test results in search of the optimal solution. The software and hardware configuration of the experimental environment is shown in Table 1.

3.2. ResNet Backbone Selection

In this paper, the SE block and CBAM are sequentially fused with ResNet18, ResNet34, ResNet50, ResNet101 and ResNet152 to implement the network performance test, and the test results are shown in Table 2.

As can be seen from Table 2, the running time increases steadily as the network deepens, but the feature information also becomes richer and the model more effective. For the three performance evaluation metrics PA, MPA, and MIoU, ResNet34 improved by 0.1%, 0.1%, and 0.2%, respectively, over ResNet18, and ResNet50 improved by 0.2%, 0.2%, and 0.5%, respectively, over ResNet34. Even when the networks are stacked to a considerable depth, the deeper networks do not become less effective than the shallower ones. Compared with ResNet50, however, ResNet101 and ResNet152 bring no further improvement. ResNet50 thus has the highest PA, MPA and MIoU values at a moderate time cost, and we select it as the backbone on the principle of achieving the highest accuracy with as few parameters as possible.

3.3. Ablation Study

To verify the effectiveness of the combination of SE block and CBAM in the ResNet50 network, we conduct a series of ablation experiments on the test set, and the results of the experiments are shown in Table 3.

From Table 3, the following results can be obtained. With the addition of each module, the model improves in all three performance evaluation metrics, PA, MPA, and MIoU, which demonstrates the contribution of each module and of their combination to the ResNet50 network structure. Compared to ResNet50, the addition of the SE block improves the three metrics by 0.2%, 0.35%, and 0.4%, respectively; the addition of CBAM improves them by 0.4%, 0.6%, and 0.8%, respectively; and the combination of the two modules improves them by 0.9%, 1%, and 0.88%, respectively, with a runtime increase of only 5 s. The full architecture therefore achieves the best model performance, and the removal of either module decreases it.

Based on the above experimental results, we propose the R50A3-LWBENet model for semantic segmentation of remote sensing images of lake water bodies (Figure 6). As can be seen in Figure 6, the raw input image of size (256 × 256 × 3) is first reduced to (128 × 128 × 64) by a convolutional layer with a kernel size of (7 × 7 × 64), then down-sampled to (64 × 64 × 64) by the MaxPool layer, and then compressed to (8 × 8 × 2048) by four down-sampling layers. The first layer has 3 residual modules with a total of 9 convolutional and 3 SENet attention layers; the second layer has 4 residual modules with a total of 12 convolutional and 4 SENet attention layers; the third layer has 6 residual modules with a total of 18 convolutional and 6 SENet attention layers; and the fourth layer has 3 residual modules with a total of 9 convolutional and 3 SENet attention layers. Then, the image size is gradually restored by bilinear interpolation, and feature fusion is performed by concatenating, layer by layer, with the feature maps output from the first 4 down-sampling layers. Finally, after feature reduction by four up-sampling layers, the water body segmentation image is output by a convolutional layer with a kernel size of (1 × 1 × 2).
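The encoder sizes quoted above can be traced with a few lines of arithmetic. The per-stage strides and the intermediate channel widths (256, 512, 1024) follow the standard ResNet50 layout and are assumptions here; the text itself only states the stem sizes and the final (8 × 8 × 2048) output.

```python
def trace_encoder(size=256):
    """Return (stage name, spatial size, channels) for each encoder stage."""
    stages = []
    size //= 2  # 7 x 7 convolution with stride 2
    stages.append(("conv1", size, 64))
    size //= 2  # max pooling with stride 2
    stages.append(("maxpool", size, 64))
    # (residual modules, output channels, stride) per stage, standard ResNet50 values.
    layout = [(3, 256, 1), (4, 512, 2), (6, 1024, 2), (3, 2048, 2)]
    for idx, (modules, channels, stride) in enumerate(layout, start=1):
        size //= stride
        stages.append((f"layer{idx} ({modules} modules)", size, channels))
    return stages


stages = trace_encoder(256)
for name, size, channels in stages:
    print(f"{name}: {size} x {size} x {channels}")
```

With three convolutions per residual module, the module counts 3, 4, 6, and 3 give the 9, 12, 18, and 9 convolutional layers listed above.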

3.4. Model Performance Comparison

We compare the performance of the R50A3-LWBENet model with U-Net, AU-Net, RU-Net, ARU-Net, SER34AUNet, and MU-Net for lake water body segmentation in three respects, namely visual observation, performance evaluation metrics, and model convergence, in order to assess the improvement achieved by the R50A3-LWBENet model. U-Net is a classical network model with symmetric encoding and decoding structures: the contextual information of the image is extracted during encoding, while during decoding the up-sampled features are fused with the features of different scales generated by the encoding process to obtain rich contextual information [17]. Attention U-Net (AU-Net) introduces an Attention Gate (AG) that automatically learns target structures of different shapes and sizes. A model trained with AGs suppresses irrelevant regions in the input image while highlighting salient features that are useful for a specific task; integrating AGs into the standard U-Net structure improves the sensitivity and prediction accuracy of the model [35]. Recurrent U-Net (RU-Net) incorporates Recurrent Convolutional Layers (RCLs) in each convolutional encoding and decoding unit of U-Net. Feature accumulation through the RCLs ensures a more accurate representation of the features required for the segmentation task [36]. Attention RU-Net (ARU-Net) applies both AGs and RCLs to U-Net, which enhances the network’s ability to model contextual dependencies during image encoding and decoding and focuses it better on image regions of interest. The SER34AUNet model is based on the U-Net semantic segmentation network and is improved by adding the DANet module and the SE module, which ultimately achieves high-precision water body segmentation [22]. MU-Net is a hybrid MixFormer architecture in which the MixFormer module is embedded into U-Net to model the local spatial detail information and global contextual information of an image, while an attention mechanism module is added to refine the features generated by the encoder [23].

3.4.1. Comparison of Model Visual Observations

In this paper, visual observations are made to analyze the performance of the R50A3-LWBENet model in dealing with three challenges, which are multi-scale features, noise interference and boundary blurring.

(1): Performance comparison for multi-scale features.

As the resolution of remote sensing satellites gradually increases, more and more details in remote sensing images become visible. The lake water bodies in Figure 7 vary in size and shape and exhibit multi-scale features: the lake water bodies in columns (a) and (b) differ greatly in size, and columns (c), (d) and (e) contain multiple smaller water bodies. The experimental results show that only R50A3-LWBENet and MU-Net accurately extracted the small water body regions while also completely segmenting the large ones. The extraction results of U-Net and RU-Net are more fragmented, AU-Net and ARU-Net both failed to segment the small water bodies to varying degrees, and SER34AUNet missed two small water bodies. The R50A3-LWBENet model takes into account different levels of contextual information, automatically learns multi-scale features, and extracts more accurate and complete contours for lakes of different types in complex geographic environments, showing strong generalization ability.

(2): Performance comparison for noise interference.

In Figure 8, the image in column (a) was taken in winter and contains a frozen lake and snow, whose spectral features are very similar. The lake water body in column (b) is dark green and differs little from the adjacent mountains and shadows. For these two columns, all models except R50A3-LWBENet failed to suppress the noise interference well and to extract the detailed boundaries of the lake water body, and all of them identified irrelevant objects such as snow, mountain ranges, and shadows as water. The lake water body in column (c) is light blue; the environment around the lake is complex, with a salt belt along the shore and rich texture, so segmentation easily produces noise. Only R50A3-LWBENet, ARU-Net, SER34AUNet, and MU-Net handled this complex area well and segmented the lake water body from the land completely and accurately, while the performance of U-Net, AU-Net, and RU-Net was unsatisfactory. The lake water body in column (d) has grayish-white frozen blocks at its edge and is covered by a large cloud layer. U-Net, AU-Net, RU-Net, ARU-Net, and SER34AUNet mis-extracted large areas because of cloud interference, while R50A3-LWBENet and MU-Net extracted the water body features adequately. The water body in column (e) appears green, and its spectral properties clearly differ from those of the background features, yet both U-Net and RU-Net mistook some snow and shadows for lakes. AU-Net, ARU-Net and SER34AUNet outperformed U-Net and RU-Net but still showed minor mis-segmentation. R50A3-LWBENet and MU-Net had the highest water body extraction accuracy, without any noise points. Overall, R50A3-LWBENet was robust to noise interference.

(3): Performance comparison for boundary blurring.

The edges of the lake water bodies in Figure 9 are irregular. For the images in columns (a), (b), and (c), R50A3-LWBENet extracts the water body boundaries more clearly than U-Net, AU-Net, RU-Net, ARU-Net, SER34AUNet, and MU-Net. The lake water bodies in columns (d) and (e) have fine boundary structures; U-Net, AU-Net, and RU-Net do not extract these fine boundary regions precisely enough, ARU-Net misjudges some water body regions, and SER34AUNet and MU-Net still show small errors in edge detail. R50A3-LWBENet has a strong ability to express the semantic features of water bodies, which allows it to refine the edge contours of the lake water bodies, extract their boundary information more accurately, and classify boundary pixels with a clear advantage.

These visual comparisons show that the R50A3-LWBENet model handles multi-scale features, noise interference, and boundary blurring better than the other six models, and that it is more robust and generalizes better.

3.4.2. Comparison of Model Performance Evaluation Metrics

For a more precise comparison, we used 1354 remote sensing images as the test set and tested the models 120 times. We built the confusion matrix between the observed and predicted labels and, based on it, calculated the best values of the three evaluation indexes PA, MPA, and MIoU for the R50A3-LWBENet model and for the U-Net, AU-Net, RU-Net, ARU-Net, SER34AUNet, and MU-Net models. The results are shown in Table 4.
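To illustrate how the three indexes follow from a confusion matrix, the following is a minimal Python sketch using the standard definitions of PA, MPA, and MIoU. It assumes a two-class (background, water) matrix whose rows are observed labels and whose columns are predicted labels; the function name `segmentation_metrics` and the example counts are hypothetical and are not taken from the experiments.

```python
def segmentation_metrics(cm):
    """Return (PA, MPA, MIoU) for a square confusion matrix.

    cm[i][j] is the number of pixels whose observed class is i and whose
    predicted class is j (an assumed layout). Every class is assumed to
    occur at least once, so no denominator is zero.
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    diag = [cm[i][i] for i in range(k)]                             # correctly classified pixels
    row_sums = [sum(cm[i]) for i in range(k)]                       # observed pixels per class
    col_sums = [sum(cm[i][j] for i in range(k)) for j in range(k)]  # predicted pixels per class

    pa = sum(diag) / total                                          # pixel accuracy
    mpa = sum(diag[i] / row_sums[i] for i in range(k)) / k          # mean per-class accuracy
    miou = sum(diag[i] / (row_sums[i] + col_sums[i] - diag[i])
               for i in range(k)) / k                               # mean intersection over union
    return pa, mpa, miou


# Hypothetical counts: rows = observed (background, water), columns = predicted.
example_cm = [[90, 10],
              [5, 95]]
pa, mpa, miou = segmentation_metrics(example_cm)
print(f"PA={pa:.4f}  MPA={mpa:.4f}  MIoU={miou:.4f}")
```

In this sketch MIoU penalizes both missed water pixels and false water pixels through the union in the denominator, which is why it is typically lower than PA and MPA for the same matrix.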

As can be seen from Table 4, the three fastest models, in ascending order of training time per epoch, are U-Net (50 s), SER34AUNet (80 s), and R50A3-LWBENet (82 s). Although the time performance of R50A3-LWBENet is not the best, it outperforms the other six models on all three evaluation metrics. Its PA, MPA, and MIoU are, respectively, 1.4, 1.35, and 2.8 percentage points higher than those of U-Net; 1.1, 1.2, and 2.3 percentage points higher than those of AU-Net; 4.9, 5.7, and 9.6 percentage points higher than those of RU-Net; 0.5, 0.45, and 1.1 percentage points higher than those of ARU-Net; 0.3, 0.25, and 0.6 percentage points higher than those of SER34AUNet; and 0.2, 0.15, and 0.4 percentage points higher than those of MU-Net. These metrics are consistent with the visual comparisons: R50A3-LWBENet handles multi-scale features, noise interference, and boundary blurring better, and it is more robust with better generalization ability.
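The percentage-point differences quoted above can be reproduced directly from Table 4. The short Python sketch below does so; the dictionary `table4` copies the PA, MPA, and MIoU values from the table, while the helper name `gains_over` is hypothetical and introduced only for illustration.

```python
# Values (in %) copied from Table 4, in the order (PA, MPA, MIoU).
table4 = {
    "U-Net": (97.4, 97.5, 94.8),
    "AU-Net": (97.7, 97.65, 95.3),
    "RU-Net": (93.9, 93.15, 88.0),
    "ARU-Net": (98.3, 98.4, 96.5),
    "SER34AUNet": (98.5, 98.6, 97.0),
    "MU-Net": (98.6, 98.7, 97.2),
    "R50A3-LWBENet": (98.8, 98.85, 97.6),
}


def gains_over(baseline, proposed="R50A3-LWBENet", table=table4):
    """Percentage-point gains of `proposed` over `baseline`, rounded to 2 decimals."""
    return tuple(round(p - b, 2) for p, b in zip(table[proposed], table[baseline]))


for name in table4:
    if name != "R50A3-LWBENet":
        print(name, gains_over(name))
```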

3.4.3. Comparison of Model Convergence

The convergence rate of a model can be judged by how quickly its loss function declines, which in turn reflects the efficiency of the model's learning. The loss curves of the seven models are shown in Figure 10.

As shown in Figure 10, the losses of R50A3-LWBENet, SER34AUNet, and MU-Net drop to approximately 0.1 within the first 100 iterations; in subsequent iterations the losses flatten out and the models saturate. Among these three, R50A3-LWBENet has the lowest loss value. In contrast, the losses of U-Net, AU-Net, RU-Net, and ARU-Net only drop to approximately 0.2, which is much higher. In addition, over the 500 iterations, the loss of R50A3-LWBENet decreases faster than the losses of the other models and reaches its lowest value sooner, which means that it converges faster than the other six models.
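One simple way to make such a convergence comparison quantitative is to record the first iteration at which a smoothed loss falls to a given level. The Python sketch below is an illustration under stated assumptions, not the procedure used in the experiments: the function name `iterations_to_threshold`, the moving-average window, and the two synthetic loss curves are all hypothetical and do not reproduce the values plotted in Figure 10.

```python
def iterations_to_threshold(losses, threshold, window=5):
    """Return the first 1-based iteration at which the mean of the last
    `window` loss values is at or below `threshold`, or None if it never is."""
    for end in range(window, len(losses) + 1):
        if sum(losses[end - window:end]) / window <= threshold:
            return end
    return None


# Synthetic, made-up loss curves over 500 iterations (not data from Figure 10).
fast = [0.1 + 0.9 * 0.95 ** t for t in range(500)]   # decays quickly toward 0.1
slow = [0.2 + 0.8 * 0.98 ** t for t in range(500)]   # decays slowly toward 0.2

fast_hit = iterations_to_threshold(fast, 0.15)
slow_hit = iterations_to_threshold(slow, 0.25)
print(fast_hit, slow_hit)
```

Smoothing with a moving average keeps a single noisy iteration from being counted as convergence; a model whose curve reaches the threshold in fewer iterations is the faster learner in this sense.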

In summary, compared with the U-Net, AU-Net, RU-Net, ARU-Net, SER34AUNet, and MU-Net models, our proposed R50A3-LWBENet model is able to exclude the influence of complex terrain and various surrounding objects, and it achieves higher accuracy, better generalization, and faster learning on remote sensing images of lake water bodies.

4. Conclusions

In this paper, we propose R50A3-LWBENet, a semantic segmentation model based on a convolutional neural network, to address the main challenges of multi-scale features, noise interference, and boundary blurring in lake remote sensing image segmentation. The model combines global and local information to refine the edge contours of lake water bodies and offers better feature extraction capability and segmentation performance. Its structure comprises an encoder, attention mechanism modules, and a decoder. A ResNet50 network with deeper convolutional layers is used for encoding in the feature extraction stage to avoid feature loss and to help solve the multi-scale feature fusion problem in semantic segmentation. The SE block, which is based on the channel attention mechanism, is added to the residual modules inside the network; it enhances the correlation among the channels of the remote sensing image, strengthens the local receptive field of the water body part, and fully extracts the water body features. After feature extraction is completed, CBAM is added to exploit both spatial and channel features so that the model focuses more on the water body pixel regions that play a decisive role in image segmentation and ignores irrelevant regions. The attention mechanisms enable the network to dynamically adjust the weights of the feature map, which improves the model's attention to key features while suppressing over-reliance on distracting information such as noise, thus improving the accuracy and robustness of semantic segmentation. Finally, the feature map is up-sampled using bilinear interpolation and the features from different levels are fused, i.e., decoded, to complete the extraction of the lake water body.
Compared with the U-Net, AU-Net, RU-Net, ARU-Net, SER34AUNet, and MU-Net models, the R50A3-LWBENet model has the highest PA, MPA, and MIoU values and the lowest loss, and it segments lake water bodies with significantly better completeness and accuracy than the other models. Although the model achieved better results on this dataset, there is still room to improve the network performance. In future work, we will continue to add remote sensing images of other lakes to increase the coverage of the dataset and further improve the generalization performance of the model for extracting lake water bodies in cold and arid areas. As the dataset expands, water body extraction is expected to become more accurate, and a complete water monitoring and evaluation network can gradually be formed to provide strong support for water ecological and environmental management.
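For readers unfamiliar with the bilinear up-sampling used in the decoding stage, the following pure-Python sketch shows the operation on a single small feature map. It is an illustration only: the function name `bilinear_upsample` is hypothetical, and the corner-aligned sampling convention is an assumption, since the convention used in the actual implementation is not stated here.

```python
def bilinear_upsample(grid, out_h, out_w):
    """Resize a 2-D list of floats to (out_h, out_w) by bilinear interpolation.

    The corner pixels of input and output are aligned (one common
    convention, assumed here for illustration).
    """
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for r in range(out_h):
        # Fractional source row for this output row.
        y = r * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for c in range(out_w):
            # Fractional source column for this output column.
            x = c * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            # Interpolate along x on the two neighbouring rows, then along y.
            top = grid[y0][x0] * (1 - wx) + grid[y0][x1] * wx
            bottom = grid[y1][x0] * (1 - wx) + grid[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


coarse = [[0.0, 1.0],
          [2.0, 3.0]]
fine = bilinear_upsample(coarse, 3, 3)
for line in fine:
    print(line)
```

Each new pixel is a distance-weighted average of its four nearest source pixels, so the up-sampled map varies smoothly and can then be fused with encoder features of the same spatial size.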

Author Contributions

Conceptualization, M.L. and J.L.; methodology, J.L.; software, M.L. and H.H.; validation, M.L., H.H. and J.L.; formal analysis, M.L. and J.L.; investigation, M.L.; resources, J.L. and H.H.; data curation, M.L. and H.H.; writing—original draft preparation, M.L.; writing—review and editing, M.L., J.L. and H.H.; visualization, M.L.; supervision, M.L. and J.L.; project administration, J.L. and H.H.; funding acquisition, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Inner Mongolia Autonomous Region of China, grant number 2022MS06026; the Key Technology Research Project of Inner Mongolia Autonomous Region, grant number 2020GG0169; the Program for Improving the Research Ability of Young Teachers in Colleges and Universities in Inner Mongolia Autonomous Region, grant number BR220116; and the National Natural Science Foundation of China, grant number 62373203.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available in "A dataset of semantic segmentation for the Qinghai-Tibet Plateau in 2020" at http://dx.doi.org/10.12072/ncdc.NIEER.db0112.2021, reference number [24].

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Lu, Q.; Si, W.; Wei, L.; Li, Z.; Xia, Z.; Ye, S.; Xia, Y. Retrieval of Water Quality from UAV-Borne Hyperspectral Imagery: A Comparative Study of Machine Learning Algorithms. Remote Sens. 2021, 13, 3928.
2. Liu, C.; Duan, P.; Zhang, F.; Jim, C.; Tan, M.; Chan, N. Feasibility of the Spatiotemporal Fusion Model in Monitoring Ebinur Lake’s Suspended Particulate Matter under the Missing-Data Scenario. Remote Sens. 2021, 13, 3952.
3. Faezeh, G.; Taher, R.; Mohammad, Z. Decision Tree Models in Predicting Water Quality Parameters of Dissolved Oxygen and Phosphorus in Lake Water. Sustain. Water Resour. Manag. 2023, 9, 1.
4. Du, Z.; Qi, J.; Wu, S.; Zhang, F.; Liu, R. A Spatially Weighted Neural Network Based Water Quality Assessment Method for Large-Scale Coastal Areas. Environ. Sci. Technol. 2021, 55, 2553–2563.
5. Quan, D.; Zhang, S.; Shi, X.; Sun, B.; Song, S.; Guo, Z. Impact of Water Environment Factors on Eutrophication Status of Lake Ulansuhai Based on Monitoring Data in 2013–2018. J. Lake Sci. 2020, 32, 1610–1619.
6. Song, S.; Li, C.; Shi, X.; Zhao, S.; Tian, W.; Li, Z.; Bai, Y.; Cao, X.; Wang, Q. Under-Ice Metabolism in a Shallow Lake in a Cold and Arid Climate. Freshw. Biol. 2019, 64, 1710–1720.
7. Dong, S.; He, H.; Fu, B.; Fan, D.; Wang, T. Remote Sensing Retrieval of Chlorophyll-A Concentration in the Coastal Waters of Hong Kong Based on Landsat-8 OLI and Sentinel-2 MSI Sensors. IOP Conf. Ser. Earth Environ. Sci. 2021, 671, 012033.
8. Wang, Y.; Li, S.; Lin, Y.; Wang, M. Lightweight Deep Neural Network Method for Water Body Extraction from High-Resolution Remote Sensing Images with Multisensors. Sensors 2021, 21, 7397.
9. Hu, H.; Fu, X.; Li, H.; Wang, F.; Duan, W.; Zhang, L.; Liu, M. Prediction of Lake Chlorophyll Concentration Using the BP Neural Network and Sentinel-2 Images Based on Time Features. Water Sci. Technol. 2023, 87, 539–554.
10. Jiang, J. Review of Geocomputation of High-Resolution Satellite Remote Sensing Imagery. Acta Geogr. Sin. 2009, 64, 2.
11. Bi, H.; Wang, S.; Zeng, J.; Zhao, Y.; Wang, H.; Yin, H. Comparison and Analysis of Several Common Water Extraction Methods Based on TM Image. Remote Sens. Inf. 2012, 27, 77–82.
12. Wang, R.; Liu, B.; Du, Y.; Zhang, H.; Yu, Z. Extraction Method and Accuracy Evaluation of Typical Lake Water Body in Hoh Xil Region Based on GF-6 WFV Data. Bull. Surv. Mapp. 2022, 05, 32–37.
13. Zhu, Y.; Sun, L.; Zhang, C. Summary of Water Body Extraction Methods Based on ZY-3 Satellite. IOP Conf. Ser. Earth Environ. Sci. 2017, 100, 012200.
14. Paul, A.; Tripathi, D.; Dutta, D. Application and Comparison of Advanced Supervised Classifiers in Extraction of Water Bodies from Remote Sensing Images. Sustain. Water Resour. Manag. 2018, 4, 905–919.
15. Zhang, D.; Yang, S.; Wang, Y.; Zheng, W. Refined Water Body Information Extraction of Three Gorges Reservoir by Using GF-1 Satellite Data. Yangtze River 2019, 50, 233–239.
16. Shelhamer, E.; Long, J.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 640–651.
17. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI 2015); Springer: Cham, Switzerland, 2015; pp. 234–241.
18. He, K.; Zhang, X.; Ren, S.; Sun, J. Identity Mappings in Deep Residual Networks. arXiv 2016.
19. Yang, F.; Feng, T.; Xu, G.; Cheng, Y. Applied Method for Water-Body Segmentation Based on Mask R-CNN. J. Appl. Remote Sens. 2020, 14, 014502.
20. Guo, H.; He, G.; Jiang, W.; Yin, R.; Yan, L.; Leng, W. A Multi-Scale Water Extraction Convolutional Neural Network (MWEN) Method for GaoFen-1 Remote Sensing Images. ISPRS Int. J. Geo-Inf. 2020, 9, 189.
21. Wu, P.; Fu, J.; Yi, X.; Wang, G.; Mo, L.; Maponde, B.T.; Liang, H.; Tao, C.; Ge, W.; Jiang, T.; et al. Research on Water Extraction from High Resolution Remote Sensing Images Based on Deep Learning. Front. Remote Sens. 2023, 4, 1283615.
22. Wang, X.; Fu, X.; Hu, H.; Li, H. Research on Water Extraction Method from Remote Sensing Images of Lakes in Cold and Arid Regions Based on Deep Learning. In Proceedings of the 3rd International Conference on Artificial Intelligence, Automation, and High-Performance Computing, Wuhan, China, 21 July 2023.
23. Zhang, Y.; Lu, H.; Ma, G.; Zhao, H.; Xie, D.; Geng, S.; Tian, W.; Sian, K.T.C.L.K. MU-Net: Embedding MixFormer into Unet to Extract Water Bodies from Remote Sensing Images. Remote Sens. 2023, 15, 3559.
24. Wang, Z.; Gao, X.; Zhang, Y.; Zhao, G. MSLWENet: A Novel Deep Learning Network for Lake Water Body Extraction of Google Remote Sensing Images. Remote Sens. 2020, 12, 4140.
25. Hasanah, A.S.; Pravitasari, A.A.; Abdullah, S.A.; Yulita, I.N.; Asnawi, M.H. A Deep Learning Review of ResNet Architecture for Lung Disease Identification in CXR Image. Appl. Sci. 2023, 13, 13111.
26. Shaheed, K.; Qureshi, I.; Abbas, F.; Jabbar, S.; Abbas, Q.; Ahmad, H.; Sajid, M.Z. EfficientRMT-Net: An Efficient ResNet-50 and Vision Transformers Approach for Classifying Potato Plant Leaf Diseases. Sensors 2023, 23, 9516.
27. Zheng, X.; Chen, J.; Wang, H.; Zheng, S.; Kong, Y. A Deep Learning-Based Approach for the Automated Surface Inspection of Copper Clad Laminate Images. Appl. Intell. 2020, 51, 1262–1279.
28. Jin, X.; Xie, Y.; Wei, X.; Zhao, B.; Chen, Z.; Tan, X. Delving Deep into Spatial Pooling for Squeeze-and-Excitation Networks. Pattern Recognit. 2022, 121, 108159.
29. Woo, S.; Park, J.; Lee, J.; Kweon, I. CBAM: Convolutional Block Attention Module. arXiv 2018, arXiv:1807.06521.
30. Hu, Y.; Tian, S.; Ge, J. Hybrid Convolutional Network Combining Multiscale 3D Depthwise Separable Convolution and CBAM Residual Dilated Convolution for Hyperspectral Image Classification. Remote Sens. 2023, 15, 4796.
31. Xie, W.; Ding, Y.; Rui, X.; Zou, Y.; Zhan, Y. Automatic Extraction Method of Aquaculture Sea Based on Improved SegNet Model. Water 2023, 15, 3610.
32. Liu, M.; Hu, H.; Zhang, L.; Zhang, Y.; Li, J. Construction of Air Quality Level Prediction Model Based on STEPDISC-PCA-BP. Appl. Sci. 2023, 13, 8506.
33. Liu, M.; Pan, X.; Liu, F.; Zhou, Y.; Jiang, K. Flame Target Detection Based on Stepwise Discrimination and BP Neural Network. Inner Mong. Agric. Univ. (Nat. Sci. Ed.) 2021, 42, 92–96.
34. Diao, Z.; Jiang, H.; Shi, T. A Unified Uncertainty Network for Tumor Segmentation Using Uncertainty Cross Entropy Loss and Prototype Similarity. Knowl. Based Syst. 2022, 246, 108739.
35. Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B. Attention U-Net: Learning Where to Look for the Pancreas. arXiv 2018, arXiv:1804.03999.
36. Alom, M.; Hasan, M.; Yakopcic, C.; Taha, T.M.; Asari, V.K. Recurrent Residual Convolutional Neural Network Based on U-Net (R2U-Net) for Medical Image Segmentation. J. Med. Imaging 2019, 6, 6–14.

Figure 1. Lake water body study area.

Figure 2. SE-ResNet network structure: (a) the original ResNet residual module; (b) the SE-ResNet residual module.

Figure 3. The structure of the channel attention mechanism module.

Figure 4. The structure of the spatial attention mechanism module.

Figure 5. The structure of the CBAM module.

Figure 6. R50A3-LWBENet holistic network structure.

Figure 7. Performance comparison of different models for multi-scale features. (a–e) represent 5 images of lake areas containing multi-scale features. The red circles represent key segmentation areas. R50A3-LWBENet is the water body segmentation model proposed in this paper.

Figure 8. Performance comparison of different models for noise interference. (a–e) represent 5 images of lake areas containing noise. The red circles represent key segmentation areas. R50A3-LWBENet is the water body segmentation model proposed in this paper.

Figure 9. Performance comparison of different models for boundary blurring. (a–e) represent 5 images of lake areas containing boundary blurring. The red circles represent key segmentation areas. R50A3-LWBENet is the water body segmentation model proposed in this paper.

Figure 10. Loss function values for seven models.

Table 1. Experimental environment configuration.

Experimental Environment	Platform	Configuration
Hardware	CPU	Intel(R) Xeon(R) Gold 6240 CPU @ 2.60 GHz
Hardware	Memory	32 GB
Hardware	GPU	NVIDIA Tesla V100 32 GB
Software	Operating system	Linux CentOS 7.6
Software	Programming language	Python 3.6

Table 2. ResNet backbone selection test results.

Backbone	PA (%)	MPA (%)	MIoU (%)	Training Time per Epoch (s)
ResNet18	98.5	98.55	96.9	28
ResNet34	98.6	98.65	97.1	80
ResNet50	98.8	98.85	97.6	82
ResNet101	98.8	98.85	97.6	95
ResNet152	98.8	98.85	97.6	114

Table 3. Results of ablation experiments.

SE	CBAM	PA (%)	MPA (%)	MIoU (%)	Training Time per Epoch (s)
−	−	97.9	97.85	95.8	77
+	−	98.1	98.2	96.2	81
−	+	98.3	98.45	96.6	78
+	+	98.8	98.85	97.6	82

Note: the + mark indicates that the corresponding module is retained, and the − mark indicates that the corresponding module is removed.

Table 4. Results of model performance evaluation metrics.

Model	PA (%)	MPA (%)	MIoU (%)	Training Time per Epoch (s)
U-Net	97.4	97.5	94.8	50
AU-Net	97.7	97.65	95.3	120
RU-Net	93.9	93.15	88	300
ARU-Net	98.3	98.4	96.5	280
SER34AUNet	98.5	98.6	97	80
MU-Net	98.6	98.7	97.2	145
R50A3-LWBENet	98.8	98.85	97.6	82

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

