Article

Water Body Extraction of the Weihe River Basin Based on MF-SegFormer Applied to Landsat8 OLI Data

1 Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of Artificial Intelligence, Xidian University, Xi’an 710071, China
2 Shaanxi Provincial Hydrology and Water Resources Survey Center, Xi’an 710071, China
* Author to whom correspondence should be addressed.
Remote Sens. 2023, 15(19), 4697; https://doi.org/10.3390/rs15194697
Submission received: 23 August 2023 / Revised: 20 September 2023 / Accepted: 22 September 2023 / Published: 25 September 2023

Abstract

In the era of big data, making full use of remote sensing images to automatically extract surface water bodies (WBs) in complex environments is extremely challenging. Due to the weak capability of existing algorithms in extracting small WBs and WB edge information from remote sensing images, we proposed a new method—Multiscale Fusion SegFormer (MF-SegFormer)—for WB extraction in the Weihe River Basin of China using Landsat 8 OLI images. The MF-SegFormer method adopts a cascading approach to fuse features output by the SegFormer encoder at multiple scales. A feature fusion (FF) module is proposed to enhance the extraction of WB edge information, while an Atrous Spatial Pyramid Pooling (ASPP) module is employed to enhance the extraction of small WBs. Furthermore, we analyzed the impact of four band combinations on WB extraction by the MF-SegFormer model: true color composite images, false color composite images, and both composites enhanced by Gaussian stretching. We also compared our proposed method with several different approaches. The results suggested that false color composite images enhanced by Gaussian stretching are beneficial for extracting WBs, and the MF-SegFormer model achieves the highest accuracy across the study area with a precision of 77.6%, recall of 84.4%, F1-score of 80.9%, and mean intersection over union (mIoU) of 83.9%. In addition, we used the determination coefficient (R2) and root-mean-square error (RMSE) to evaluate the performance of river width extraction. Our extraction results in an overall R2 of 0.946 and an RMSE of 28.21 m for the mainstream width in the “Xi’an-Xianyang” section of the Weihe River. The proposed MF-SegFormer method outperformed the other methods and was found to be more robust for WB extraction.

1. Introduction

Surface water bodies (WBs), such as rivers, lakes, and marshes, are significant for agriculture, industry, aquaculture, and aquatic and terrestrial ecosystems [1]. WB changes have a significant impact on biogeochemical cycles, ecosystem services, and other environmental changes [2,3]. Therefore, it is critical to quickly and accurately obtain the spatial distribution information of surface water for monitoring and management of water resources [4,5]. However, monitoring and understanding the spatiotemporal changes in large areas of water remains a major challenge. Surface waters exhibit highly dynamic characteristics in space and time due to the influence of climate, seasons, and human activities. Therefore, accurate and fast monitoring of WBs is critical for environmental research and management of terrestrial ecosystems [6].
With the benefits of macroscale, dynamic, continuous, and inexpensive monitoring of ground object patterns, satellite remote sensing provides a better understanding of spatial-temporal changes in ground objects [7,8,9,10]. Due to its broad observational capabilities, fast updating time, and rich information, remote sensing data have been widely used in the monitoring of changes in water resources [11,12].
Over the last few decades, numerous researchers have concentrated on surface water mapping methods using remote sensing images [13,14]. In general, WB extraction methods are broadly classified into three types: the single-band image threshold method, the spectral index identification method, and the image classification method. The most commonly used method for WB extraction is the single-band image threshold method [15,16,17]. However, due to the influence of many natural conditions on the reflection characteristics of the WB spectrum, the features reflected by single-band imagery are limited, resulting in poor performance of the single-band image threshold method.
The spectral index identification method can better reflect the geographic characteristics of the image and design the index according to the reflection characteristics of different bands of WBs [18,19,20], but there are situations where the threshold is difficult to determine.
In recent years, with the development of artificial intelligence and computer vision, many researchers have used image classification methods to segment remote sensing images to extract WBs. Landuyt et al. [21] used K-means to cluster SAR images and automatically determine thresholds for classification, achieving segmentation of dry land, permanent water, open flooding, and flooded vegetation. Feng et al. [22] used the random forest (RF) regression method to map the extent of WBs in Alaska from Visible Infrared Imaging Radiometer Suite (VIIRS) images. Li et al. [23] drew an annual WB distribution map of Huizhou, China, based on the Google Earth Engine (GEE) from 1986 to 2020 using the RF method and proposed a temporal consistency modification in surface water sequences and automatic updates of the training sample method to improve the precision of WB extraction. Duan et al. [24] extracted the WB distribution map of Wuhan, China, from GaoFen-1 data based on a lightweight convolutional neural network (CNN). Zhang et al. [25] used a cascaded fully convolutional network (CFCN) to detect WBs from SAR images. Zhong et al. [26] proposed a transformer-based deep learning neural network—NT-Net—for extracting lake WBs. However, due to the interference of complex backgrounds and shadows, the extraction performance for small WBs and WB edges still needs to be improved. Therefore, we need to build a WB extraction model with strong robustness and generalization ability. In addition, most previous studies applying artificial intelligence techniques to WB segmentation from remote sensing images have only utilized true color composite images [27,28], disregarding the richness of band combinations available in remote sensing images. Therefore, we need to choose band combination images that are more suitable for artificial intelligence algorithms.
In this study, a new method for extracting WBs from Landsat 8 OLI data in the Weihe River Basin based on the Multiscale Fusion SegFormer (MF-SegFormer) model is proposed. The MF-SegFormer method adopts a cascading approach to fuse features output by the SegFormer encoder at multiple scales. Specifically, to address the problem that spectral differences between WBs and other ground objects lead to poor extraction of WB edges, the FF module is designed, which fuses features from different levels to capture richer semantic information. To address the problem that small WBs may be covered by surrounding background interference or noise, we introduce the ASPP module, which improves the extraction accuracy of small WBs by aggregating contextual information at different scales through parallel branches to obtain denser data. Existing WB extraction algorithms are mainly improved CNN-based algorithms, whereas this study combines the SegFormer network with a multiscale fusion strategy. Compared with the local receptive fields relied on by traditional CNN networks, the SegFormer network uses positional coding to process sequence information, and this coding method can better retain absolute positional information. In addition, the multiscale feature fusion strategy allows the model to better understand the relationship between WBs and their surroundings, which enables global contextual modeling of the entire image.
In addition, the data source was evaluated by comparing true color composite images (RGB 432), true color composite images enhanced by Gaussian stretching, false color composite images (RGB 564), and false color composite images enhanced by Gaussian stretching. We compared our method with other popular deep learning methods and the composite spectral index method. The comparison shows that our proposed method is more feasible and effective for extracting WBs.

2. Study Area and Data

We chose the Weihe River Basin as the study area. It spans 103°57′~110°16′E and 33°42′~37°24′N across Shaanxi Province, Gansu Province, and the Ningxia Hui Autonomous Region and covers approximately 134,766 square kilometers in total (Figure 1). The Weihe River Basin has a diverse topography, with mountains, hills, terraces, and plains accounting for 15%, 49%, 19%, and 17% of the total area, respectively. In China, the Weihe River Basin is an important biodiversity conservation area and ecological barrier area [29].
The most recent record-warm year, 2016, coincided with a strong El Niño, which can significantly affect the water volume of inland WBs [30]. Therefore, in this study, we selected 2016 Landsat 8 OLI imagery as the data source, primarily using data from June and November. The study utilized images with a cloud cover of less than 2%, which were selected and preprocessed using the Google Earth Engine platform (https://earthengine.google.com/, accessed on 1 October 2022) [31]. In order to overcome the problem of mixed pixels, we pansharpened the research data and increased the image resolution to 15 m. Past research has shown that the Gram–Schmidt algorithm has the highest accuracy [27]; therefore, we used the Gram–Schmidt algorithm for image pansharpening. Figure 2a,b show the image before and after pansharpening, respectively.
In terms of image selection, with the support of the GEE platform, we traversed all the Landsat 8 OLI images of this area in 2016. We used the cloud amount extraction algorithm to calculate the cloud amount of each scene in the study period and selected the image with the least cloud amount in the study area to improve the reliability of WB extraction.
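A minimal sketch of this selection step using the GEE Python API is given below. It is illustrative only: the collection ID, the use of the CLOUD_COVER metadata property, and the bounding box of the basin are assumptions, not the authors' exact script, and the pansharpening step (which requires the panchromatic band) is not shown.

```python
import ee

ee.Initialize()

# Approximate Weihe River Basin bounds from Section 2 (illustrative).
weihe = ee.Geometry.Rectangle([103.95, 33.70, 110.27, 37.40])

# Landsat 8 scenes over the basin for 2016, keeping only scenes with
# less than 2% cloud cover and sorting so the least-cloudy comes first.
collection = (ee.ImageCollection('LANDSAT/LC08/C02/T1_L2')  # assumed collection ID
              .filterBounds(weihe)
              .filterDate('2016-01-01', '2016-12-31')
              .filter(ee.Filter.lt('CLOUD_COVER', 2))
              .sort('CLOUD_COVER'))

least_cloudy = collection.first()  # least-cloudy candidate scene
print(least_cloudy.get('system:index').getInfo())
```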

3. Methods

3.1. Proposed Method

Xie et al. [32] proposed the SegFormer network for semantic segmentation tasks in 2021. Our proposed method, Multiscale Fusion SegFormer (MF-SegFormer), is an improved SegFormer method.

3.1.1. SegFormer

SegFormer is a simple, lightweight, and effective semantic segmentation framework based on a Transformer. It consists of a lightweight multilayer perceptron (MLP) decoder and a hierarchical Transformer encoder (Figure 3). The hierarchical Transformer encoder outputs multiscale features without positional coding, thereby avoiding the interpolation of positional codes that degrades performance when the test resolution differs from the training resolution.
The SegFormer network encoder comprises three modules: overlap patch embedding (OPE), efficient multihead self-attention (EMSA), and mix feed-forward network (Mix-FFN).
Among them, the OPE module’s primary function is 2D convolution, which scales the feature map by modifying the patch size and stride to form the feature hierarchy. Given an input of size H × W × 3, the patch merging operation is used to obtain a multilevel feature map $F_i$ with a resolution of $\frac{H}{2^{i+1}} \times \frac{W}{2^{i+1}} \times C_i$, where the number of channels satisfies $C_{i+1} > C_i$. The overlapped patch merging operation transforms an N × N × 3 input patch into a 1 × 1 × C vector.
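The following is a minimal PyTorch sketch of overlapped patch merging, assuming the standard strided-convolution implementation; the patch size and stride values refer to the "Patch size" and "Strides" rows of Table 1.

```python
import torch.nn as nn

class OverlapPatchEmbed(nn.Module):
    """Overlapped patch merging: a strided 2D convolution followed by LayerNorm.

    Illustrative sketch; patch_size=7/stride=4 correspond to the first encoder
    block in Table 1, patch_size=3/stride=2 to the later blocks.
    """
    def __init__(self, in_ch=3, embed_dim=64, patch_size=7, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch_size,
                              stride=stride, padding=patch_size // 2)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                     # x: (B, C, H, W)
        x = self.proj(x)                      # (B, embed_dim, H/stride, W/stride)
        b, c, h, w = x.shape
        x = x.flatten(2).transpose(1, 2)      # (B, N, embed_dim), N = h * w
        return self.norm(x), h, w
```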
The EMSA module is comparable to the standard self-attention structure, but it uses the sequence reduction operation to reduce the computational complexity. The estimation of attention in the traditional multihead self-attention mechanism is as follows:
$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d_{head}}}\right)V$   (1)
where Q, K, and V are matrices of size N × C (N = H × W) that stand for query, key, and value, respectively. Using the dot product of K and Q, the similarity between feature maps can be calculated. This similarity is referred to as the attention score, and the attention score matrix is multiplied by the original feature map V to refine the features. Because Q and K each have N = H × W rows, computing the attention scores has a complexity of O(N2), which is extremely undesirable for large images, so the sequence reduction operation uses Equations (2) and (3) to shorten the sequence:
$\hat{K} = \mathrm{Reshape}\left(\frac{N}{R}, C \cdot R\right)(K)$   (2)
$K = \mathrm{Linear}(C \cdot R, C)(\hat{K})$   (3)
where R represents the sequence reduction ratio of each self-attention stage and N is the sequence length. After processing, the shape of the reduced matrix is (N/R) × C.
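The sketch below illustrates the EMSA operation in PyTorch. It is not the authors' implementation: following public SegFormer implementations, the sequence reduction is realized as a strided convolution over the spatial layout of the keys and values, which expresses the same reshape-and-project idea as Equations (2) and (3).

```python
import torch.nn as nn

class EfficientSelfAttention(nn.Module):
    """Sketch of EMSA (Equations (1)-(3)): multi-head self-attention whose
    key/value sequence is shortened before the attention product."""
    def __init__(self, dim, num_heads, sr_ratio):
        super().__init__()
        self.num_heads = num_heads
        self.sr_ratio = sr_ratio
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)
        if sr_ratio > 1:
            # Sequence reduction via a strided convolution (illustrative choice).
            self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
            self.norm = nn.LayerNorm(dim)

    def forward(self, x, h, w):                      # x: (B, N, C), N = h * w
        b, n, c = x.shape
        head_dim = c // self.num_heads
        q = self.q(x).reshape(b, n, self.num_heads, head_dim).transpose(1, 2)
        if self.sr_ratio > 1:                        # shorten the K/V sequence
            x_ = x.transpose(1, 2).reshape(b, c, h, w)
            x_ = self.sr(x_).reshape(b, c, -1).transpose(1, 2)
            x_ = self.norm(x_)
        else:
            x_ = x
        kv = self.kv(x_).reshape(b, -1, 2, self.num_heads, head_dim).permute(2, 0, 3, 1, 4)
        k, v = kv[0], kv[1]                          # each: (B, heads, N', head_dim)
        attn = (q @ k.transpose(-2, -1)) / head_dim ** 0.5   # Equation (1)
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(b, n, c)
        return self.proj(out)
```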
Since the self-attention mechanism does not introduce position information, the position embedding operation must be introduced to incorporate position information. After training the network, the position encoding (PE) associated with each position is fixed. In the SegFormer network, the Mix-FFN operation is used instead of the position embedding operation, and the issue that the zero-padding operation will lose some localization information is attenuated by directly applying a 3 × 3 convolution to the feed-forward network (FFN). The Mix-FFN operation is as follows:
$x_{out} = \mathrm{MLP}(\mathrm{GELU}(\mathrm{Conv}_{3 \times 3}(\mathrm{MLP}(x_{in})))) + x_{in}$   (4)
where xin is the output feature in the self-attention module. In particular, the SegFormer network employs depthwise convolutions to reduce the number of parameters to improve efficiency.
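A minimal PyTorch sketch of the Mix-FFN of Equation (4) follows; the depthwise 3 × 3 convolution supplies the positional mixing that replaces explicit position embeddings. It is illustrative rather than the authors' code.

```python
import torch.nn as nn

class MixFFN(nn.Module):
    """Mix-FFN (Equation (4)): MLP -> 3x3 depthwise conv -> GELU -> MLP,
    with a residual connection. Illustrative sketch only."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3,
                                padding=1, groups=hidden_dim)  # depthwise conv
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x, h, w):               # x: (B, N, C), N = h * w
        residual = x
        x = self.fc1(x)                        # (B, N, hidden)
        b, n, c = x.shape
        x = x.transpose(1, 2).reshape(b, c, h, w)
        x = self.dwconv(x)                     # positional mixing via 3x3 conv
        x = x.flatten(2).transpose(1, 2)       # back to (B, N, hidden)
        x = self.fc2(self.act(x))
        return x + residual                    # x_out = MLP(GELU(Conv(MLP(x_in)))) + x_in
```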
The SegFormer network decoder is composed of MLPs. The SegFormer network can utilize such a simple decoder because the hierarchical Transformer encoder has a larger effective perceptual domain than conventional CNN encoders. After feeding each set of features into the MLP layer, they are first projected to a fixed dimensionality via a linear layer following Equation (5). Then, the bilinear interpolation method is used to upsample them to the feature resolution of the first block using Equation (6). Afterward, the 4 sets of features output by the MLP layer are concatenated along the channel dimension, as in Equation (7), and finally the data are processed through a linear layer for semantic segmentation prediction via Equation (8):
$\hat{F}_i = \mathrm{Linear}(C_i, C)(F_i), \ \forall i$   (5)
$\hat{F}_i = \mathrm{Upsample}\left(\frac{W}{4} \times \frac{H}{4}\right)(\hat{F}_i), \ \forall i$   (6)
$F = \mathrm{Linear}(4C, C)(\mathrm{Concat}(\hat{F}_i)), \ \forall i$   (7)
$M = \mathrm{Linear}(C, N_{cls})(F)$   (8)
where Fi is the multilevel feature map extracted from the MixTransformer (MiT), M is the predicted mask, and Ncls is the number of categories.
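The all-MLP decoder of Equations (5)–(8) can be sketched in PyTorch as follows. The stage dimensions (64, 128, 320, 512) and the decoder hidden size 768 follow Table 1; num_classes = 2 assumes the water/non-water task of this paper. This is an illustrative sketch, not the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AllMLPDecoder(nn.Module):
    """SegFormer's lightweight decoder: project each stage to a common
    dimension (Eq. 5), upsample to 1/4 resolution (Eq. 6), concatenate and
    fuse (Eq. 7), and predict per-pixel class scores (Eq. 8)."""
    def __init__(self, in_dims=(64, 128, 320, 512), embed_dim=768, num_classes=2):
        super().__init__()
        self.linears = nn.ModuleList([nn.Linear(d, embed_dim) for d in in_dims])
        self.fuse = nn.Conv2d(4 * embed_dim, embed_dim, kernel_size=1)
        self.classifier = nn.Conv2d(embed_dim, num_classes, kernel_size=1)

    def forward(self, feats):                  # feats: list of (B, Ci, Hi, Wi)
        target_size = feats[0].shape[2:]       # 1/4 resolution of the input
        outs = []
        for f, lin in zip(feats, self.linears):
            b, c, h, w = f.shape
            f = lin(f.flatten(2).transpose(1, 2))                   # Eq. (5)
            f = f.transpose(1, 2).reshape(b, -1, h, w)
            f = F.interpolate(f, size=target_size,
                              mode='bilinear', align_corners=False)  # Eq. (6)
            outs.append(f)
        fused = self.fuse(torch.cat(outs, dim=1))                    # Eq. (7)
        return self.classifier(fused)                                # Eq. (8)
```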
SegFormer provides a series of MiT (B0–B5) backbone networks to extract multiscale features for the semantic segmentation task. Experimental results on multiple datasets indicate that MiT-B5 has the best performance, so the encoder of the model used in this study is MiT-B5 [26], with the main hyperparameters listed in Table 1, where “Depths” denotes the number of layers in each encoder block, “Hidden size” is the dimension of each encoder block, and “Num attention heads” is the number of heads in EMSA at different stages. “Patch size” represents the patch size before each encoder block, and “Strides” represents the stride before each encoder block; “MLP ratios” multiplied by “Embed dims” are the expanded channel dimensions in Mix-FFN, and “Sr ratios” represent the sequence reduction ratios in each encoder block. Due to its superior performance in image segmentation tasks, SegFormer has been applied in various fields [33].

3.1.2. MF-SegFormer

The MF-SegFormer model presented in this study consists of an MiT-B5 encoder and a multiscale fusion (MF) decoder network. The overall structure of the model is shown in Figure 4.
The MF network contains two modules: the feature fusion (FF) module and the Atrous Spatial Pyramid Pooling (ASPP) module. Due to the spectral and texture differences between WBs and other ground objects, the FF module can fuse these different features to enhance the extraction of WB edge information and reduce interference from other ground objects. The ASPP module is used to perform dilated convolution operations with different receptive fields on multiple scales to obtain more dense data, which can improve the extraction accuracy of small WBs.
First, three FF modules are used to fuse input features at multiple scales. Next, features of different scales are upsampled using bilinear interpolation, adjusted to 1/4 of the size of the input image, and refined using the ASPP module. Finally, the semantic segmentation decoding head predicts the semantic category of each pixel in the image.
This study references the gated-attention mechanism (GAM) module [34] and presents the FF module. The FF module can reduce the computational effort compared to the GAM module while maintaining accuracy. The FF module can effectively merge low-level detail texture information with high-level semantic information to achieve accurate WB edge segmentation. To have a larger perceptual field without using too many parameters, we used depthwise separable convolution (DSC).
As shown in Figure 4, the low-resolution input feature maps $X_{input}^{low\,2x} \in \left(\mathbb{R}^{\frac{H}{32} \times \frac{W}{32} \times C_4}, \mathbb{R}^{\frac{H}{16} \times \frac{W}{16} \times C_5}, \mathbb{R}^{\frac{H}{8} \times \frac{W}{8} \times C_5}\right)$ in the FF module are the features output by Transformer block 4, the FF1 module, and the FF2 module after double upsampling. The high-resolution input feature maps $X_{input}^{high} \in \left(\mathbb{R}^{\frac{H}{16} \times \frac{W}{16} \times C_3}, \mathbb{R}^{\frac{H}{8} \times \frac{W}{8} \times C_2}, \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times C_1}\right)$ are the output feature maps of Transformer block 3, Transformer block 2, and Transformer block 1, respectively. The specific equations are shown in Equations (9)–(12):
$X_{output}^{low} = \delta\{\beta[f_{3 \times 3}^{d=2}(X_{input}^{low\,2x})]\}$   (9)
$X_{output}^{high} = \delta\{\beta[f_{3 \times 3}(X_{input}^{high})]\}$   (10)
$X_{FF} = \delta\{\beta\{f_{5 \times 5}^{d=2}[\mathrm{Concat}(X_{output}^{low}, X_{output}^{high})]\}\}$   (11)
$X_{output}^{FF} = X_{FF} \cdot X_{output}^{low} + (1 - X_{FF}) \cdot X_{output}^{high}$   (12)
where $X_{input}^{low\,2x}$ represents the low-resolution input feature after double upsampling, $X_{input}^{high}$ represents the high-resolution input feature, Concat denotes a concatenation operation, $f_{3 \times 3}^{d=2}$ denotes the DSC with a convolution kernel size of 3 × 3 and a dilation rate of 2, $f_{3 \times 3}$ denotes the DSC with a convolution kernel size of 3 × 3, β denotes the batch normalization (BN) operation, and δ denotes the rectified linear unit (ReLU) activation function.
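A PyTorch sketch of the FF module, written directly from Equations (9)–(12), is shown below. The channel arguments are illustrative, and the gate follows Equation (11) literally (DSC + BN + ReLU); the authors' exact implementation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DSConv(nn.Module):
    """Depthwise separable convolution: depthwise conv + 1x1 pointwise conv."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, padding=pad,
                                   dilation=dilation, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class FFModule(nn.Module):
    """Feature fusion module sketched from Equations (9)-(12): refine the
    low- and high-resolution branches, compute a gate from their
    concatenation, and blend the two branches with the gate."""
    def __init__(self, low_ch, high_ch, out_ch):
        super().__init__()
        self.low_branch = nn.Sequential(DSConv(low_ch, out_ch, 3, dilation=2),
                                        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.high_branch = nn.Sequential(DSConv(high_ch, out_ch, 3),
                                         nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.gate = nn.Sequential(DSConv(2 * out_ch, out_ch, 5, dilation=2),
                                  nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x_low, x_high):
        # Double upsampling of the low-resolution feature to match x_high.
        x_low = F.interpolate(x_low, size=x_high.shape[2:], mode='bilinear',
                              align_corners=False)
        low = self.low_branch(x_low)                   # Eq. (9)
        high = self.high_branch(x_high)                # Eq. (10)
        g = self.gate(torch.cat([low, high], dim=1))   # Eq. (11)
        return g * low + (1.0 - g) * high              # Eq. (12)
```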
The ASPP module was applied in the semantic segmentation model DeepLabv2. It aggregates information at different scales through parallel branches, which helps in extracting WBs’ detailed information. Additionally, it uses dilated convolutions with different sampling rates to avoid information loss and retain high-resolution features. The ASPP module is composed of five parallel branches, including one branch of 1 × 1 convolutions, three branches of 3 × 3 dilated convolutions, and one pooling branch. The results of the five branches are concatenated and passed through a 1 × 1 convolutional module to obtain outputs.
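The ASPP structure described above can be sketched as follows. The dilation rates (6, 12, 18) are the common DeepLab defaults and are an assumption here, since the paper does not state them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Atrous Spatial Pyramid Pooling sketch: one 1x1 branch, three dilated
    3x3 branches, and a global-pooling branch, concatenated and fused."""
    def __init__(self, in_ch, out_ch, rates=(6, 12, 18)):
        super().__init__()
        def conv_bn_relu(k, d):
            pad = 0 if k == 1 else d
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, k, padding=pad, dilation=d, bias=False),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.branches = nn.ModuleList(
            [conv_bn_relu(1, 1)] + [conv_bn_relu(3, r) for r in rates])
        self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(in_ch, out_ch, 1, bias=False),
                                  nn.ReLU(inplace=True))
        self.project = nn.Sequential(
            nn.Conv2d((len(rates) + 2) * out_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        outs = [branch(x) for branch in self.branches]
        pooled = F.interpolate(self.pool(x), size=x.shape[2:],
                               mode='bilinear', align_corners=False)
        return self.project(torch.cat(outs + [pooled], dim=1))
```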
The source code for the algorithm is available at https://github.com/tiany-zhang/MF-SegFormer (accessed on 20 June 2023).

3.1.3. Innovations of the MF-SegFormer

  • In this study, we proposed the MF-SegFormer method, which improved the decoder performance of SegFormer. The FF module was employed to reduce the influence of other ground objects, and the ASPP module was used to enhance the extraction accuracy of small WBs.
  • We proposed the FF module to fuse features from different levels, thereby capturing richer semantic information. In addition, the FF module could integrate different features according to the spectral difference between WBs and other ground objects, enhancing the extraction of WB edge information and improving the recognition ability of WBs.
  • We introduced the ASPP module, which aggregated contextual information at different scales through parallel branches to obtain denser data, to improve the extraction accuracy of small WBs.

3.2. Contrastive Methods

3.2.1. U-Net

Ronneberger et al. [35] published U-Net in 2015 with the aim of resolving problems in biomedical images. Due to its excellent performance, U-Net has been widely utilized in various semantic segmentation applications. For example, Li et al. [36] used an enhanced U-Net for the semantic segmentation of buildings based on remote sensing images. Wang et al. [37] combined U-Net with feature pyramid networks (FPNs) to design a segmentation network that is sensitive to edge features and demonstrated the effectiveness of the model through a feature multiclassification task based on publicly available remote sensing image datasets.
U-Net is based on a fully convolutional network (FCN), which fuses high-level feature maps with low-level feature maps by combining high-resolution information in upsampling and low-resolution information in downsampling. In addition, U-Net fills in the underlying information with a fusion operation to improve segmentation accuracy, and the fully symmetric U-shaped structure makes the fusion of front and back features more comprehensive, so that the high-resolution information in the target image is augmented with the low-resolution information. In this study, a standard U-Net structure is employed, with the padding and stride of the convolutional layers set to 1, the stride of the deconvolutional layers set to 2, and the activation function set to ReLU.

3.2.2. Seg-Net

Seg-Net was proposed by Badrinarayanan et al. [38] in 2015 and applied as an image semantic segmentation deep network for autonomous driving. Seg-Net can identify the locations of objects in an image, such as roads, cars, and pedestrians. Due to its clear structure and rapid training speed in segmentation tasks, it is also widely applied to remote sensing images. For example, Abdollahi et al. [39] segmented aerial photographs of Massachusetts buildings using a combination of Seg-Net and U-Net. Wang et al. [40] used Seg-Net to semantically segment Savannah panoramic street images and perform an analysis of panoramic streetscape greenscape index features and their distribution.
The Seg-Net architecture utilized in this study is composed of an encoder network, a pixel-level classification layer, and a decoder network. In this study, a standard Seg-Net structure was utilized, and the hyperparameter settings are identical to those used in U-Net.

3.2.3. SETR

Zheng et al. [41] proposed the SEgmentation TRansformer (SETR) model by replacing the CNN encoder with the encoder of the Transformer block structure. The Transformer block consists of 24 layers, each containing a multiheaded self-attention (MSA) module and a two-layer MLP module. The SETR decoder has three structures, including naive upsampling, progressive upsampling (PUP), and multilevel feature aggregation (MLA). In this study, the PUP structure, which demonstrated the best performance in the original paper, was used as the decoder.

3.2.4. Segmenter

Strudel et al. [42] proposed a semantic segmentation model called Segmenter, which utilizes an encoder–decoder architecture based on the Vision Transformer (ViT) model. Segmenter contains two main modules: the Transformer encoder and the mask Transformer. The Transformer encoder module consists of L Transformer layers, where each Transformer layer has the same structure as the Transformer layer in the SETR network. The mask Transformer decoder is composed of M Transformer encoder layers and jointly processes image patches and class embeddings in the decoding stage, allowing the decoder to perform direct panoptic segmentation by replacing class embeddings with object embeddings.
The main workflow of Segmenter is as follows: First, the input image is divided into many small patches, each of which is then reshaped into a one-dimensional sequence. Next, patch embedding and position embedding are applied to each sequence to add patch and position encoding, respectively. Third, these sequences are sent to the Transformer encoder module for encoding, resulting in multiple sets of contextual encoding sequences containing rich semantic information. Finally, these sequences are sent to the mask Transformer together with class embedding information for the decoding operation, and after upsampling, Argmax is applied to give each pixel a subclass, and the final pixel segmentation map is output.

3.2.5. Composite Spectral Index Method

The main indices used to extract WBs with the composite spectral index method are the modified normalized difference water index (MNDWI) (Equation (13)) [43], normalized difference vegetation index (NDVI) (Equation (14)) [44], enhanced vegetation index (EVI) (Equation (15)) [45], and automated water extraction index (AWEI) [46], including AWEInsh (Equation (16)) and AWEIsh (Equation (17)). Feng et al. [47] used a multi-index combination approach to extract WBs (Equation (18)), using pixels where the MNDWI exceeds the EVI or NDVI to identify water signals and the AWEIsh and AWEInsh indices to further remove mixed pixels [2]. Pixels with water frequencies greater than or equal to 0.25 were classified as WBs [48,49]. The specific formulas are as follows:
$MNDWI = \frac{\rho_{Green} - \rho_{SWIR1}}{\rho_{Green} + \rho_{SWIR1}}$   (13)
$NDVI = \frac{\rho_{NIR} - \rho_{Red}}{\rho_{NIR} + \rho_{Red}}$   (14)
$EVI = 2.5 \times \frac{\rho_{NIR} - \rho_{Red}}{1 + \rho_{NIR} + 6\rho_{Red} - 7.5\rho_{Blue}}$   (15)
$AWEI_{nsh} = 4 \times (\rho_{Green} - \rho_{SWIR1}) - (0.25 \times \rho_{NIR} + 2.75 \times \rho_{SWIR2})$   (16)
$AWEI_{sh} = \rho_{Blue} + 2.5 \times \rho_{Green} - 1.5 \times (\rho_{NIR} + \rho_{SWIR1}) - 0.25 \times \rho_{SWIR2}$   (17)
$Water = (AWEI_{nsh} > 0.1 \ \mathrm{or} \ AWEI_{sh} > 0.1) \ \mathrm{and} \ (MNDWI > NDVI \ \mathrm{or} \ MNDWI > EVI)$   (18)
where ρBlue, ρGreen, ρRed, ρNIR, ρSWIR1, and ρSWIR2 are the surface reflectance values in the blue, green, red, near-infrared, shortwave infrared 1, and shortwave infrared 2 bands of Landsat, respectively.
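A NumPy sketch of this composite index rule follows. The "or" between the two AWEI conditions in Equation (18) is a reconstruction of the garbled source formula and should be treated as an assumption; the 0.25 frequency threshold applies to multi-date water frequency and is not shown here.

```python
import numpy as np

def water_mask(blue, green, red, nir, swir1, swir2, eps=1e-6):
    """Composite spectral index water mask, a sketch of Equations (13)-(18).

    Inputs are surface-reflectance arrays of identical shape; eps avoids
    division by zero in the normalized indices.
    """
    mndwi = (green - swir1) / (green + swir1 + eps)                        # Eq. (13)
    ndvi = (nir - red) / (nir + red + eps)                                 # Eq. (14)
    evi = 2.5 * (nir - red) / (1.0 + nir + 6.0 * red - 7.5 * blue + eps)   # Eq. (15)
    awei_nsh = 4.0 * (green - swir1) - (0.25 * nir + 2.75 * swir2)         # Eq. (16)
    awei_sh = blue + 2.5 * green - 1.5 * (nir + swir1) - 0.25 * swir2      # Eq. (17)

    # Eq. (18) as reconstructed here: an AWEI condition AND an MNDWI condition.
    return ((awei_nsh > 0.1) | (awei_sh > 0.1)) & ((mndwi > ndvi) | (mndwi > evi))
```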

3.3. Experimental Setup

In this study, we selected the 2016 Landsat 8 OLI image of the Weihe River Basin for dataset production. We used a visual interpretation technique to identify WBs in the entire image and obtained the binary classification label of water areas and non-water areas. Then, we cropped the entire image and the corresponding label to 256 × 256 pixels and obtained a total of 16,610 pairs of samples. The original dataset was divided into a training dataset, test dataset, and validation dataset in a 6:3:1 ratio. We randomly selected 9966 pairs of all samples as the training dataset, 1661 pairs of all samples as the validation dataset, and the remaining 4983 pairs as the test dataset to evaluate the accuracy of the model. We expanded the training dataset samples by rotating, mirroring, blurring, and increasing noise to improve the model’s robustness and prevent overfitting.
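A sketch of the tiling and augmentation steps described above is given below, assuming in-memory NumPy arrays; the augmentation probabilities are illustrative, while the 256 × 256 tile size and the rotation/mirror/blur/noise operations follow the text.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def tile_image(image, label, size=256):
    """Crop an image/label pair into non-overlapping size x size tiles."""
    h, w = label.shape
    tiles = []
    for r in range(0, h - size + 1, size):
        for c in range(0, w - size + 1, size):
            tiles.append((image[r:r + size, c:c + size],
                          label[r:r + size, c:c + size]))
    return tiles

def augment(img, lab, rng):
    """Random rotation, mirroring, blurring, and additive noise (Section 3.3)."""
    k = int(rng.integers(0, 4))
    img, lab = np.rot90(img, k), np.rot90(lab, k)                 # rotate
    if rng.random() < 0.5:
        img, lab = np.fliplr(img), np.fliplr(lab)                 # mirror
    if rng.random() < 0.3:
        sigma = (1.0, 1.0, 0.0) if img.ndim == 3 else 1.0
        img = gaussian_filter(img, sigma=sigma)                   # blur
    if rng.random() < 0.3:
        img = img + rng.normal(0.0, 0.01, img.shape)              # noise
    return img, lab

# Example: rng = np.random.default_rng(0); pairs = tile_image(image, label)
```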
To determine the influence of different band combinations and different methods for WB extraction, 11 experimental cases were designed (Table 2). Groups A and B were used to analyze the influence of band combinations and the effects of different methods on WB extraction. To speed up the MF-SegFormer training for Group A and select the best band combination, we set the epochs to 100, with the first 50 epochs being freezing training and the last 50 epochs being unfreezing training. Group B extracted WBs based on the best band combination images using the MF-SegFormer, SegFormer, U-Net, Seg-Net, SETR, Segmenter, and composite spectral index method.
In this experiment, we set the training batch size to 16 and the number of epochs to 100. The AdamW optimizer was used with an initial learning rate of 1 × 10−4. The hyperparameters of the contrastive methods are consistent with those of the MF-SegFormer, and all methods were trained and evaluated under identical hardware and software conditions.
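The training setup can be summarized with the hypothetical loop below; MFSegFormer, its encoder attribute, and train_loader are placeholders for the authors' implementation, and the freeze/unfreeze schedule is the one stated for Group A.

```python
import torch

model = MFSegFormer(num_classes=2).cuda()          # hypothetical model class
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(100):
    # Group A schedule: freeze the MiT-B5 encoder for the first 50 epochs,
    # then unfreeze it for the remaining 50 epochs.
    freeze = epoch < 50
    for p in model.encoder.parameters():            # hypothetical attribute
        p.requires_grad = not freeze

    for images, labels in train_loader:             # batch size 16 (placeholder loader)
        images, labels = images.cuda(), labels.cuda()
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```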
Finally, the performance measures of the F1-score, recall, precision, and mean intersection over union (mIoU) were employed to evaluate the accuracy of different extraction results, and the expressions of these indicators are presented in Equations (19), (20), (21) and (22), respectively:
$Precision = \frac{TP}{TP + FP}$   (19)
$Recall = \frac{TP}{TP + FN}$   (20)
$F1\text{-}score = \frac{2 \times Precision \times Recall}{Precision + Recall}$   (21)
$mIoU = \frac{1}{C} \sum_{i=1}^{C} \frac{TP}{TP + FP + FN}$   (22)
where TP is true positive, FP is false positive, FN is false negative, and C is the number of classes.
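The sketch below computes these four metrics from predicted and reference class maps (Equations (19)–(22)); class 1 is assumed to denote water.

```python
import numpy as np

def segmentation_metrics(pred, label, num_classes=2):
    """Precision, recall, F1-score, and mIoU for the water class (class 1)."""
    pred, label = pred.ravel(), label.ravel()
    tp = np.sum((pred == 1) & (label == 1))
    fp = np.sum((pred == 1) & (label == 0))
    fn = np.sum((pred == 0) & (label == 1))

    precision = tp / (tp + fp)                     # Eq. (19)
    recall = tp / (tp + fn)                        # Eq. (20)
    f1 = 2 * precision * recall / (precision + recall)   # Eq. (21)

    # mIoU averages the per-class IoU over all classes (Eq. (22)).
    ious = []
    for c in range(num_classes):
        tp_c = np.sum((pred == c) & (label == c))
        fp_c = np.sum((pred == c) & (label != c))
        fn_c = np.sum((pred != c) & (label == c))
        ious.append(tp_c / (tp_c + fp_c + fn_c))
    return precision, recall, f1, float(np.mean(ious))
```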
Furthermore, we used the determination coefficient (R2) (Equation (25)) and root-mean-square error (RMSE) (Equation (26)) to evaluate the performance of the river width extraction. Equations (23) and (24) provide the formulas for calculating river width:
$N_i = \mathrm{GETNUM}(Latitude[1] : Latitude[n]), \quad i \in (Longitude[1], Longitude[m])$   (23)
$Width_i = N_i \times Resolution$   (24)
$R^{2} = \frac{\sum_{i = Longitude[1]}^{Longitude[m]} (PreWidth_i - MeanWidth)^{2}}{\sum_{i = Longitude[1]}^{Longitude[m]} (Width_i - MeanWidth)^{2}}$   (25)
$RMSE = \sqrt{\frac{1}{m} \sum_{i = Longitude[1]}^{Longitude[m]} (Width_i - PreWidth_i)^{2}}$   (26)
where n and m represent the number of grid points in the meridional and latitudinal directions, respectively, of the study area, GETNUM indicates the number of meridional river grid points when the longitude is i, Resolution is the resolution of the remote sensing image, and PreWidthi and Widthi represent the predicted river width and the actual river width when the longitude is i, respectively.
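A NumPy sketch of the river-width evaluation follows. It assumes the water mask is a 2D array whose rows run in the meridional direction, uses the 15 m pansharpened resolution, and implements R² exactly as defined in Equation (25).

```python
import numpy as np

def river_width_per_longitude(water_mask, resolution=15.0):
    """River width at each longitude column (Equations (23)-(24)): count the
    water pixels along the meridional (row) direction and multiply by the
    pixel resolution."""
    return water_mask.sum(axis=0) * resolution     # one width per column

def width_r2_rmse(width_true, width_pred):
    """R^2 and RMSE of the extracted river width (Equations (25)-(26))."""
    mean_width = width_true.mean()
    r2 = np.sum((width_pred - mean_width) ** 2) / np.sum((width_true - mean_width) ** 2)
    rmse = np.sqrt(np.mean((width_true - width_pred) ** 2))
    return r2, rmse
```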

4. Results and Discussion

4.1. Data Selection

Model training is significantly influenced by the input data. Most previous studies based on Landsat 8 used true color composite images composed of bands 4, 3, and 2 (RGB 432). Meanwhile, considering that the false color composite image composed of bands 5, 6, and 4 (RGB 564) can effectively distinguish land and water, we compared it with the true color composite image. Because both original composites have low contrast, we also enhanced them with Gaussian stretching.
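The paper does not describe its Gaussian stretching implementation; the sketch below shows one common approach, mapping each band's empirical CDF through a normal inverse CDF so the output histogram is approximately Gaussian. The target mean and standard deviation are illustrative.

```python
import numpy as np
from scipy.stats import norm, rankdata

def gaussian_stretch(band, mean=0.5, std=0.15):
    """Gaussian contrast stretch of one band via rank-based histogram matching."""
    flat = band.ravel().astype(np.float64)
    cdf = rankdata(flat, method='average') / (flat.size + 1)  # empirical CDF in (0, 1)
    stretched = norm.ppf(cdf, loc=mean, scale=std)
    return np.clip(stretched.reshape(band.shape), 0.0, 1.0)

# Example: an enhanced false color composite (RGB 564) from bands 5, 6, and 4.
# rgb564 = np.dstack([gaussian_stretch(b5), gaussian_stretch(b6), gaussian_stretch(b4)])
```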
To determine the effects of different band combinations of the WB extraction model, we used the MF-SegFormer to establish training methods with Cases 1A–4A to extract WBs in the image, where Gaussian refers to performing a Gaussian stretch on the image. Figure 5 shows the validation loss of different band combination images by the MF-SegFormer in different cases. As the first 50 epochs are freezing training and the last 50 epochs are unfreezing training, the validation loss will increase when the epoch is 51. We can see that the increasing order of minimum validation loss for the experiments is as follows: Case 4A, Case 3A, Case 2A, and Case 1A. This suggests that false color composite images composed of RGB 564 are more suitable for WB extraction than true color composite images, and false color composite images enhanced by Gaussian stretching are beneficial for the model to better extract WBs.
We selected a representative area for analyzing the reasons behind our results, as shown in Figure 6. First, through qualitative analysis, we observed that the WBs in Figure 6b,d are more prominent than those in Figure 6a,c. Moreover, the WBs in Figure 6c,d are also clearer compared to those in Figure 6a,b, indicating that the Gaussian stretching and false color composite image composed of RGB 564 can enhance the distinguishability and identification of WBs.
Through quantitative analysis, we calculated the WB and non-WB variances in the region of interest and obtained interclass variances of 48.30 for Case 1A, 254.37 for Case 2A, 1256.78 for Case 3A, and 2265.18 for Case 4A. These results confirm that Gaussian stretching and false color composite images composed of RGB 564 can effectively boost the contrast between WB and non-WB regions, thereby improving the accuracy of WB recognition.

4.2. Comparison with Various Methods

To analyze the performance of different semantic segmentation network methods on WB extraction, we designed Cases 1B–6B for comparative experiments. Table 3 lists the confusion matrix values for the different cases on the test dataset, where TN represents a true negative, FN a false negative, FP a false positive, and TP a true positive. Through calculation, we found that Case 5B achieved the highest precision (81.2%), followed by Case 1B (78.4%), Case 2B (75.2%), Case 4B (74.8%), and Case 3B (65.3%), with Case 6B having the lowest precision of 60.5%. Case 2B achieved the highest recall (87.1%), followed by Case 1B (86.1%), Case 5B (80.1%), Case 3B (66.7%), and Case 4B (60.4%), with Case 6B having the lowest recall of 57.3%. Furthermore, Case 1B achieved the highest F1-score and mIoU values, 82.1% and 84.8%, respectively. Case 2B closely followed with 80.8% and 83.8%, respectively, while Case 5B, Case 4B, and Case 3B followed in sequence. Case 6B exhibited the lowest F1-score and mIoU, 58.8% and 70.8%, respectively. Overall, Case 1B exhibited the best extraction performance, followed by Case 2B, Case 5B, Case 4B, and Case 3B, with Case 6B exhibiting the worst extraction performance (Table 4).
In addition, we compared the overall accuracy over the study area for Cases 1B–7B (Table 5) and found that the experiments exhibited precision values in the following descending order: Case 5B, Case 1B, Case 2B, Case 4B, Case 3B, Case 7B, and Case 6B. The descending order of recall was as follows: Case 2B, Case 1B, Case 5B, Case 3B, Case 4B, Case 6B, and Case 7B. The F1-score and mIoU rankings were as follows: Case 1B, Case 2B, Case 5B, Case 4B, Case 3B, Case 6B, and Case 7B. These results demonstrate that the MF-SegFormer exhibited the best WB extraction performance.
Furthermore, the lower precision of Case 1B could be attributed to the MF-SegFormer dividing the riverbank into small WBs. Overall, the MF-SegFormer performed well in the WB extraction task due to the FF module filtering the interference in the extraction of detailed features and the ASPP module expanding the perceptual field without decreasing resolution. As a result, the MF-SegFormer outperformed comparative methods for segmenting WBs in remote sensing images, enabling more accurate and comprehensive extraction of WBs.

4.3. Typical Area Extraction Comparison

As shown in Figure 7, we selected several typical scenarios to compare and analyze the extraction results in different cases. Each column in Figure 7 represents a different scenario, and seven scenarios were selected. The first row displays the color image of the real scenario, the second row displays the labeled image of each scenario, the third row displays the predicted result of the MF-SegFormer model for each scenario, and the fourth to ninth rows show the prediction results of the comparison models for each scenario. Meanwhile, the yellow dotted line indicates that WBs were misclassified as other surface types, and the red dotted line indicates that other surface types were misclassified as WBs. The results show that each model exhibited some errors, with the MF-SegFormer achieving the best results, extracting the highest integrity of the WB. In these scenarios, Case 3B, Case 4B, Case 6B, and Case 7B were not effective in extracting small WBs, as shown in Figure 7a,d. Case 3B misclassified many other land types as WBs, as shown in Figure 7b,e,f. Additionally, Case 6B and Case 7B both exhibited the most serious misclassifications. Compared with Case 1B, Case 2B did not perform well in terms of details for scenario Figure 7d.
Overall, we can observe that Case 1B exhibited the best extraction performance, mainly attributed to the MF-SegFormer’s adoption of a multiscale fusion learning approach and the use of the FF module to retain fuzzy boundaries when merging high-level semantic information features. These capabilities enabled the model to better capture the details and contextual information of small WBs.

4.4. Comparison of the Weihe River’s Mainstream

To further study the extraction performance in the different cases, we extracted the mainstream of the Weihe River in the “Xi’an-Xianyang” section, which lies in the lower reaches of the Weihe River Basin, using the different methods. The continuous acceleration of urbanization has changed the river form, land use, and water flow of the Weihe River. Accurately extracting the WBs in the “Xi’an-Xianyang” section of the mainstream is therefore important for helping people understand the diversity of the basin’s natural, cultural, and ecological systems, avoiding the negative impacts of unreasonable development and utilization during urban expansion, and maintaining the dynamic balance of environmental factors so that the river can develop in a healthier direction.
The yellow rectangular dotted outline in Figure 8 depicts the focus area, which shows the spatial distribution of the mainstream in the “Xi’an-Xianyang” section of the Weihe River Basin. Figure 9 compares the WB extraction results for the Weihe River in the “Xi’an-Xianyang” section in the different cases. The spatial distribution of the extraction results shows that Case 1B performed best, with the extracted WBs aligning most consistently with the label. Cases 3B–6B suffered from over-extraction due to the limitations of the algorithms, while Case 7B produced numerous missed extractions due to the complex environment. The overall performance of Case 2B and Case 3B was better than that of the other cases but still showed instances of over-extraction and missed extraction.
Furthermore, we analyzed the river width extraction results. In Figure 10a, the extracted river width in different cases at different longitudes is presented, indicating that the extraction results of Case 6B and Case 7B show a larger deviation from the label, while Case 1B exhibits the smallest deviation from the label.
Figure 10b–h depict scatter plots of the labeled river width versus the extracted river width in different cases. We can see that the accuracy is listed in descending order as follows: Case 1B, Case 2B, Case 5B, Case 4B, Case 3B, Case 7B, and Case 6B. The R2 and RMSE values are 0.946, 0.941, 0.906, 0.903, 0.901, 0.752, and 0.735, and 28.206 m, 29.21 m, 34.984 m, 39.537 m, 40.547 m, 49.574 m, and 82.09 m, respectively. These results show that Case 1B offers the best performance in river WB extraction, indicating that the MF-SegFormer model can accurately extract the river.

5. Conclusions and Future Perspectives

This study proposed a new method for extracting WBs from Landsat 8 OLI images based on the MF-SegFormer, which can achieve high-precision semantic segmentation of small WBs and WB edges. The main improvement involved replacing the SegFormer decoder with the multiscale fusion (MF) network model combining the FF and ASPP modules. The FF module can fuse features from different levels, thereby capturing richer semantic information and enhancing the extraction of WB edge information. The ASPP module aggregated information at different scales through parallel branches to extract WBs’ detailed information. In this study, we built a WB extraction model using false color composite images (RGB 564) enhanced by Gaussian stretching. Labels were then used for accuracy verification, and precision, recall, F1-score, and mIoU were used for quantitative evaluation. Moreover, we used the R2 and RMSE values to evaluate the model’s performance in river width extraction.
The results show that when using the MF-SegFormer model to extract WBs, the accuracy of the cases, from high to low, is Case 4A, Case 3A, Case 2A, and Case 1A, with Case 4A achieving the lowest validation loss of 0.141. This means that false color composite images (RGB 564) are more favorable for WB extraction than true color composite images (RGB 432), and that images enhanced by Gaussian stretching are more suitable for WB extraction. The F1-score/mIoU values of Cases 1B–6B based on semantic segmentation network methods are 82.1%/84.8%, 80.8%/83.8%, 66.0%/74.6%, 66.8%/75.0%, 80.7%/83.8%, and 58.8%/70.8% on the test dataset, and the F1-score/mIoU values of Cases 1B–7B are 80.9%/83.9%, 79.7%/83.1%, 64.0%/73.5%, 64.6%/73.8%, 79.3%/82.8%, 55.5%/69.1%, and 34.7%/60.4% over the whole study area. Meanwhile, Case 1B also achieved the highest R2 (0.946) and the lowest RMSE (28.21 m) in river width extraction. Overall, the MF-SegFormer model achieved the highest accuracy and best performance, outperforming the other methods.
Overall, our study provides a new method for accurate and efficient extraction of WBs from remote sensing images in complex environments. However, this study only extracted the WBs in 2016. Due to the lack of time series extraction results, the research conclusions still have some limitations. In the future, we will use multisource remote sensing data, such as GaoFen series satellite and Sentinel series satellite images, to extract a longer WB time series and evaluate the accuracy of our model. In addition, we will also try to transform the MF-SegFormer into a semi-supervised model-based network, which can greatly reduce the training sample size.

Author Contributions

Conceptualization, T.Z., C.Q. and W.L.; methodology, T.Z. and C.Q.; validation, X.M. and L.Z.; formal analysis, T.Z. and C.Q.; investigation, W.L. and B.H.; resources, W.L., B.H. and L.J.; data curation, X.M. and L.Z.; writing—original draft preparation, T.Z. and C.Q.; writing—review and editing, T.Z. and W.L.; visualization, T.Z. and C.Q.; supervision, W.L., B.H. and L.J.; funding acquisition, W.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the 2022 Shaanxi Water Conservancy Development Foundation (2022SLKJ-17), the Ningxia Autonomous Region’s 2020 Key R&D Project (2020BFG02013), and the National Natural Science Foundation of China (62171347).

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

The authors would like to thank three anonymous reviewers for their helpful comments and suggestions, which significantly improved the quality of our paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Hou, X.; Feng, L.; Tang, J.; Song, X.; Liu, J.; Zhang, Y.; Wang, J.; Xu, Y.; Dai, Y.; Zheng, Y.; et al. Anthropogenic transformation of Yangtze Plain freshwater lakes: Patterns, drivers and impacts. Remote Sens. Environ. 2020, 248, 111998. [Google Scholar] [CrossRef]
  2. Zou, Z.; Xiao, X.; Dong, J.; Qin, Y.; Doughty, R.B.; Menarguez, M.A.; Zhang, G.; Wang, J. Divergent trends of open-surface water body area in the contiguous United States from 1984 to 2016. Proc. Natl. Acad. Sci. USA 2018, 115, 3810–3815. [Google Scholar] [CrossRef] [PubMed]
  3. Keller, P.S.; Catalan, N.; von Schiller, D.; Grossart, H.P.; Koschorreck, M.; Obrador, B.; Frassl, M.A.; Karakaya, N.; Barros, N.; Howitt, J.A.; et al. Global CO2 emissions from dry inland waters share common drivers across ecosystems. Nat. Commun. 2020, 11, 2126. [Google Scholar] [CrossRef]
  4. Huang, C.; Chen, Y.; Zhang, S.; Wu, J. Detecting, extracting, and monitoring surface water from space using optical sensors: A review. Rev. Geophys. 2018, 56, 333–360. [Google Scholar] [CrossRef]
  5. Vörösmarty, C.J.; McIntyre, P.B.; Gessner, M.O.; Dudgeon, D.; Prusevich, A.; Green, P.; Glidden, S.; Bunn, S.E.; Sullivan, C.A.; Reidy Liermann, C.; et al. Global threats to human water security and river biodiversity. Nature 2010, 467, 555–561. [Google Scholar] [CrossRef]
  6. Feng, M.; Sexton, J.O.; Channan, S.; Townshend, J.R. A global, high-resolution (30-m) inland water body dataset for 2000: First results of a topographic–spectral classification algorithm. Int. J. Digit. Earth 2016, 9, 113–133. [Google Scholar] [CrossRef]
  7. Su, H.; Ji, B.; Wang, Y. Sea ice extent detection in the Bohai Sea using Sentinel-3 OLCI data. Remote Sens. 2019, 11, 2436. [Google Scholar] [CrossRef]
  8. Su, H.; Wang, A.; Zhang, T.; Qin, T.; Du, X.; Yan, X.H. Super-resolution of subsurface temperature field from remote sensing observations based on machine learning. Int. J. Appl. Earth Obs. Geoinf. 2021, 102, 102440. [Google Scholar] [CrossRef]
  9. Zhang, T.; Su, H.; Yang, X.; Yan, X. Remote sensing prediction of global subsurface thermohaline and the impact of longitude and latitude based on LightGBM. J. Remote Sens. 2020, 24, 1255–1269. [Google Scholar] [CrossRef]
  10. Su, H.; Zhang, T.; Lin, M.; Lu, W.; Yan, X.H. Predicting subsurface thermohaline structure from remote sensing data based on long short-term memory neural networks. Remote Sens. Environ. 2021, 260, 112465. [Google Scholar] [CrossRef]
  11. Li, Y.; Niu, Z.; Xu, Z.; Yan, X. Construction of high spatial-temporal water body dataset in China based on Sentinel-1 archives and GEE. Remote Sens. 2020, 12, 2413. [Google Scholar] [CrossRef]
  12. Wei, X.; Xu, W.; Bao, K.; Hou, W.; Su, J.; Li, H.; Miao, Z. A water body extraction methods comparison based on FengYun Satellite data: A case study of Poyang Lake Region, China. Remote Sens. 2020, 12, 3875. [Google Scholar] [CrossRef]
  13. Tang, H.; Lu, S.; Ali Baig, M.H.; Li, M.; Fang, C.; Wang, Y. Large-scale surface water mapping based on landsat and sentinel-1 images. Water 2022, 14, 1454. [Google Scholar] [CrossRef]
  14. Wei, Z.; Jia, K.; Liu, P.; Jia, X.; Xie, Y.; Jiang, Z. Large-Scale River Mapping Using Contrastive Learning and Multi-Source Satellite Imagery. Remote Sens. 2021, 13, 2893. [Google Scholar] [CrossRef]
  15. Klein, I.; Dietz, A.J.; Gessner, U.; Galayeva, A.; Myrzakhmetov, A.; Kuenzer, C. Evaluation of seasonal water body extents in Central Asia over the past 27 years derived from medium-resolution remote sensing data. Int. J. Appl. Earth Obs. Geoinf. 2014, 26, 335–349. [Google Scholar] [CrossRef]
  16. Lu, S.; Ma, J.; Ma, X.; Tang, H.; Zhao, H.; Hasan Ali Baig, M. Time series of the Inland Surface Water Dataset in China (ISWDC) for 2000–2016 derived from MODIS archives. Earth Syst. Sci. Data 2019, 11, 1099–1108. [Google Scholar] [CrossRef]
  17. Teodoro, A.C.; Goncalves, H. A semi-automatic approach for the extraction of sandy bodies (sand spits) from IKONOS-2 data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2012, 5, 634–642. [Google Scholar] [CrossRef]
  18. Yang, X.; Qin, Q.; Yésou, H.; Ledauphin, T.; Koehl, M.; Grussenmeyer, P.; Zhu, Z. Monthly estimation of the surface water extent in France at a 10-m resolution using Sentinel-2 data. Remote Sens. Environ. 2020, 244, 111803. [Google Scholar] [CrossRef]
  19. Li, L.; Su, H.; Du, Q.; Wu, T. A novel surface water index using local background information for long term and large-scale Landsat images. ISPRS J. Photogramm. Remote Sens. 2021, 172, 59–78. [Google Scholar] [CrossRef]
  20. Barton, I.J.; Bathols, J.M. Monitoring floods with AVHRR. Remote Sens. Environ. 1989, 30, 89–94. [Google Scholar] [CrossRef]
  21. Landuyt, L.; Verhoest, N.E.; Van Coillie, F.M. Flood mapping in vegetated areas using an unsupervised clustering approach on sentinel-1 and-2 imagery. Remote Sens. 2020, 12, 3611. [Google Scholar] [CrossRef]
  22. Feng, W.; Jin, H. Mapping Surface Water Extent in Mainland Alaska Using VIIRS Surface Reflectance. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Virtual Conference, 12–16 July 2021; pp. 6120–6123. [Google Scholar]
  23. Li, K.; Xu, E. High-accuracy continuous mapping of surface water dynamics using automatic update of training samples and temporal consistency modification based on Google Earth Engine: A case study from Huizhou, China. ISPRS J. Photogramm. Remote Sens. 2021, 179, 66–80. [Google Scholar] [CrossRef]
  24. Duan, Y.; Zhang, W.; Huang, P.; He, G.; Guo, H. A New Lightweight Convolutional Neural Network for Multi-Scale Land Surface Water Extraction from GaoFen-1D Satellite Images. Remote Sens. 2021, 13, 4576. [Google Scholar] [CrossRef]
  25. Zhang, J.; Xing, M.; Sun, G.C.; Chen, J.; Li, M.; Hu, Y.; Bao, Z. Water body detection in high-resolution SAR images with cascaded fully-convolutional network and variable focal loss. IEEE Trans. Geosci. Remote Sens. 2020, 59, 316–332. [Google Scholar] [CrossRef]
  26. Zhong, H.F.; Sun, Q.; Sun, H.M.; Jia, R.S. NT-Net: A Semantic Segmentation Network for Extracting Lake Water Bodies from Optical Remote Sensing Images Based on Transformer. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13. [Google Scholar] [CrossRef]
  27. Su, H.; Wei, S.; Qiu, J.; Wu, W. RaftNet: A New Deep Neural Network for Coastal Raft Aquaculture Extraction from Landsat 8 OLI Data. Remote Sens. 2022, 14, 4587. [Google Scholar] [CrossRef]
  28. Tambe, R.G.; Talbar, S.N.; Chavan, S.S. Deep multi-feature learning architecture for water body segmentation from satellite images. J. Vis. Commun. Image Represent. 2021, 77, 103141. [Google Scholar] [CrossRef]
  29. Li, Y.; Chang, J.; Wang, Y.; Jin, W.; Guo, A. Spatiotemporal impacts of climate, land cover change and direct human activities on runoff variations in the Wei River Basin, China. Water 2016, 8, 220. [Google Scholar] [CrossRef]
  30. Lei, Y.; Zhu, Y.; Wang, B.; Yao, T.; Yang, K.; Zhang, X.; Zhai, J.; Ma, N. Extreme lake level changes on the Tibetan Plateau associated with the 2015/2016 El Niño. Geophys. Res. Lett. 2019, 46, 5889–5898. [Google Scholar] [CrossRef]
  31. Tamiminia, H.; Salehi, B.; Mahdianpari, M.; Quackenbush, L.; Adeli, S.; Brisco, B. Google Earth Engine for geo-big data applications: A meta-analysis and systematic review. ISPRS J. Photogramm. Remote Sens. 2020, 164, 152–170. [Google Scholar] [CrossRef]
  32. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
  33. Deng, J.; Lv, X.; Yang, L.; Zhao, B.; Zhou, C.; Yang, Z.; Jiang, J.; Ning, N.; Zhang, J.; Shi, J.; et al. Assessing Macro Disease Index of Wheat Stripe Rust Based on Segformer with Complex Background in the Field. Sensors 2022, 22, 5676. [Google Scholar] [CrossRef]
  34. Tian, X.W.; Wang, J.L.; Chen, M.; Du, S.Q. An Improved SegFormer Network based Method for Semantic Segmentation of Remote Sensing Images. Comput. Eng. Appl. 2023, 59, 217–226. [Google Scholar]
  35. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Volume 18, pp. 234–241. [Google Scholar]
  36. Li, Z.; Liu, Y.; Kuang, Y.; Wang, H.; Liu, C. A semantic segmentation method of buildings in remote sensing image based on improved UNet. In Proceedings of the 2nd International Conference on Signal Image Processing and Communication (ICSIPC 2022), Qingdao, China, 20–22 May 2022; Volume 12246, pp. 298–303. [Google Scholar]
  37. Wang, X.; Ming, Y.U.; Ren, H.E. Remote sensing image semantic segmentation combining UNET and FPN. Chin. J. Liq. Cryst. Disp. 2021, 36, 475–483. [Google Scholar] [CrossRef]
  38. Badrinarayanan, V.; Handa, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling. arXiv 2015, arXiv:1505.07293. [Google Scholar]
  39. Abdollahi, A.; Pradhan, B.; Alamri, A.M. An ensemble architecture of deep convolutional Segnet and Unet networks for building semantic segmentation from high-resolution aerial images. Geocarto Int. 2022, 37, 3355–3370. [Google Scholar] [CrossRef]
  40. Wang, J.; Liu, W.; Gou, A. Numerical characteristics and spatial distribution of panoramic Street Green View index based on SegNet semantic segmentation in Savannah. Urban For. Urban Green. 2022, 69, 127488. [Google Scholar] [CrossRef]
  41. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.S.; et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6881–6890. [Google Scholar]
  42. Strudel, R.; Garcia, R.; Laptev, I.; Schmid, C. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 7262–7272. [Google Scholar]
  43. Xu, H. Modification of normalised difference water index (NDWI) to enhance open water features in remotely sensed imagery. Int. J. Remote Sens. 2006, 27, 3025–3033. [Google Scholar] [CrossRef]
  44. Rouse, J.W.; Haas, R.H.; Schell, J.A.; Deering, D.W. Monitoring vegetation systems in the Great Plains with ERTS. NASA Spec. Publ. 1974, 351, 309. [Google Scholar]
  45. Huete, A.; Didan, K.; Miura, T.; Rodriguez, E.P.; Gao, X.; Ferreira, L.G. Overview of the radiometric and biophysical performance of the MODIS vegetation indices. Remote Sens. Environ. 2002, 83, 195–213. [Google Scholar] [CrossRef]
  46. Feyisa, G.L.; Meilby, H.; Fensholt, R.; Proud, S.R. Automated Water Extraction Index: A new technique for surface water mapping using Landsat imagery. Remote Sens. Environ. 2014, 140, 23–35. [Google Scholar] [CrossRef]
  47. Feng, S.; Liu, S.; Zhou, G. Long-term dense Landsat observations reveal detailed waterbody dynamics and temporal changes of the size-abundance relationship. J. Hydrol. Reg. Stud. 2022, 41, 101111. [Google Scholar] [CrossRef]
  48. Zou, Z.; Dong, J.; Menarguez, M.A.; Xiao, X.; Qin, Y. Continued decrease of open surface water body area in Oklahoma during 1984–2015. Sci. Total Environ. 2017, 595, 451–460. [Google Scholar] [CrossRef] [PubMed]
  49. Deng, Y.; Jiang, W.; Tang, Z.; Ling, Z.; Wu, Z. Long-term changes of open-surface water bodies in the Yangtze River basin based on the Google Earth Engine cloud platform. Remote Sens. 2019, 11, 2213. [Google Scholar] [CrossRef]
Figure 1. Spatial location and extent of the study area of the Weihe River Basin, People’s Republic of China: (a) The administrative division of China and (b) A true color Landsat 8 OLI image of the study area.
Figure 2. The image before pansharpening and after pansharpening, respectively: (a) The image before pansharpening, (b) The image after pansharpening.
Figure 3. SegFormer model structure.
Figure 4. MF-SegFormer model structure.
Figure 5. Validation loss of different band combination images by MF-SegFormer in different cases.
Figure 6. Spatial display of images in a representative area after band combination and Gaussian stretching: (a) Case 1A, (b) Case 2A, (c) Case 3A, and (d) Case 4A.
Figure 7. Classification results of WBs in different cases: (a) The scenario containing small WBs, (b,c) The scenario containing the reservoir, (d) The scenario containing a wide river channel, (e,f) The scenario containing the city, and (g) The scenario containing shadows of hills. The yellow dotted line indicates WBs misclassified as other land types. The red dotted line indicates pixels misclassified as WBs.
Figure 8. Spatial distribution of the mainstream in the “Xi’an-Xianyang” section of the Weihe River Basin.
Figure 9. Comparison of the WB extraction results of the Weihe River in the “Xi’an-Xianyang” section in different cases.
Figure 10. Quantitative evaluation of the WB extraction results of the Weihe River in the “Xi’an-Xianyang” section in different cases: (a) Comparison of the river width extracted by different methods at different longitudes and (bh) Scatter plots of the labeled river width versus extracted river width in different cases.
Table 1. MiT-B5 hyperparameters.

Name | Number
Embed dims | 64
Depths | [3, 6, 40, 3]
Hidden size | [64, 128, 320, 512]
Num attention heads | [1, 2, 5, 8]
Patch size | [7, 3, 3, 3]
Strides | [4, 2, 2, 2]
MLP ratios | [4, 4, 4, 4]
Sr ratios | [8, 4, 2, 1]
Decoder hidden size | 768
Table 2. Design of experiments.

Group | Case | Experimental Scheme
Group A | Case 1A | MF-SegFormer (Band 4, Band 3, Band 2)
Group A | Case 2A | MF-SegFormer (Gaussian (Band 4, Band 3, Band 2))
Group A | Case 3A | MF-SegFormer (Band 5, Band 6, Band 4)
Group A | Case 4A | MF-SegFormer (Gaussian (Band 5, Band 6, Band 4))
Group B | Case 1B | MF-SegFormer (Gaussian (Band 5, Band 6, Band 4))
Group B | Case 2B | SegFormer (Gaussian (Band 5, Band 6, Band 4))
Group B | Case 3B | U-Net (Gaussian (Band 5, Band 6, Band 4))
Group B | Case 4B | Seg-Net (Gaussian (Band 5, Band 6, Band 4))
Group B | Case 5B | SETR (Gaussian (Band 5, Band 6, Band 4))
Group B | Case 6B | Segmenter (Gaussian (Band 5, Band 6, Band 4))
Group B | Case 7B | Composite spectral index method
Table 3. Confusion matrix values for different cases on the test dataset.

Case | TN | FN | FP | TP
Case 1B | 325,710,948 | 95,883 | 163,867 | 595,190
Case 2B | 325,676,610 | 88,810 | 198,205 | 602,263
Case 3B | 325,630,121 | 229,929 | 244,694 | 461,144
Case 4B | 325,734,553 | 273,736 | 140,262 | 417,337
Case 5B | 325,746,904 | 137,425 | 127,911 | 553,648
Case 6B | 325,616,333 | 295,237 | 258,482 | 395,836
Table 4. Extraction accuracy evaluation results for different cases on the test dataset.

Case | Precision | Recall | F1-Score | mIoU
Case 1B | 78.4% | 86.1% | 82.1% | 84.8%
Case 2B | 75.2% | 87.1% | 80.8% | 83.8%
Case 3B | 65.3% | 66.7% | 66.0% | 74.6%
Case 4B | 74.8% | 60.4% | 66.8% | 75.0%
Case 5B | 81.2% | 80.1% | 80.7% | 83.8%
Case 6B | 60.5% | 57.3% | 58.8% | 70.8%
Table 5. Extraction accuracy evaluation results for different cases on the whole study area.

Case | Precision | Recall | F1-Score | mIoU
Case 1B | 77.6% | 84.4% | 80.9% | 83.9%
Case 2B | 74.6% | 85.4% | 79.7% | 83.1%
Case 3B | 63.9% | 64.1% | 64.0% | 73.5%
Case 4B | 73.9% | 57.4% | 64.6% | 73.8%
Case 5B | 80.0% | 78.7% | 79.3% | 82.8%
Case 6B | 57.8% | 53.4% | 55.5% | 69.1%
Case 7B | 58.0% | 24.8% | 34.7% | 60.4%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

