1. Introduction
Water is an indispensable substance for the survival of human beings and all living things and an extremely valuable natural resource for industrial and agricultural production, economic development, and environmental improvement [1,2]. The lakes of the Qinghai-Tibet Plateau are important carriers of water resources: they number more than 1500 [3] and account for 57.6% of the national total lake area, making the plateau the main lake distribution area [4]. The dynamic changes in lake area on the Qinghai-Tibet Plateau are closely related to global climate change. Thus, monitoring and evaluating these changes is of great significance for studying the climate change of the Qinghai-Tibet Plateau against the background of global warming [5].
Early research on the lakes of the Qinghai-Tibet Plateau relied mainly on field surveys. However, the harsh natural conditions made fieldwork time-consuming and labor-intensive, and experimental equipment was limited, so the content and results of such research were greatly constrained and could not meet researchers' requirements for multi-temporal, high-precision monitoring [6]. With the development and application of remote sensing technology, the accuracy of remote sensing data has improved, and acquiring ground object information from imagery has become routine. The absorption of water in the 0.4–2.5 μm electromagnetic bands is significantly higher than that of other feature types, which gives optical remote sensing a unique advantage in monitoring area changes of large lakes.
The Normalized Difference Water Index (NDWI) [7] threshold method is a simple water extraction method that exploits the differences in the spectral response of water across bands to distinguish water bodies from other features. Several researchers have also proposed improved water indices (WI) [8,9] to cope with different application scenarios. The WI's key disadvantage is that the threshold for distinguishing lake water bodies from other features depends heavily on the researchers' prior knowledge [10], and its classification accuracy is neither stable nor universally applicable across water classification tasks in different natural environments. However, the considerable overlap between the lake water bodies predicted by a WI and the true values can effectively reduce the manual work of delineating a water body: researchers need only correct the erroneous lake boundaries to obtain the true values and thus an accurate dataset [11,12].
Machine Learning (ML) is another common method for extracting lake water bodies from remote sensing imagery, including supervised classification [13] and unsupervised classification [14]. Supervised classification establishes judgment criteria from training samples and assigns categories across the study area. However, the time and labor cost of labeling a large number of samples is high, which is especially obvious in large-scale lake extraction tasks [15]. In addition, this method relies on the choices of human interpreters of remote sensing imagery, so it is strongly subjective and lacks objectivity. Unsupervised classification, also known as cluster analysis or group point analysis, requires no prior data and minimal initial human input. However, this method ignores the spatial information of the image: the computer automatically groups similar pixels into one category according to rules based on differences in the data themselves, which may lead to the phenomenon of "different objects with the same spectrum" [16], resulting in the misclassification of water bodies and other features.
As a branch of ML, Deep Learning (DL) has the following advantages: it requires no prior rules, it maps input data directly to output results, and it can fully exploit the deep features of the data. With the development of deep learning technology, many studies using DL methods to extract different objects have emerged [17,18,19]. When a Convolutional Neural Network (CNN) [20], a neural network specifically designed for processing data with a grid-like structure, is used for large lake extraction tasks, the limited size of its receptive field means that only local features can be extracted, while the spatial correlation of continuous water bodies is ignored. Additionally, a large number of noise points appear in the extraction results, making the classification effect poor. The Fully Convolutional Network (FCN) [21] is a semantic segmentation model that replaces the fully connected layers of the convolutional neural network with convolutional layers, which upgrades image-level classification to pixel-level classification [22]. However, the model does not take the global spatial relationships between pixels sufficiently into consideration [23], and the extraction results lack spatial consistency. The U-Net network [24], proposed on the basis of the FCN, increases the number of decoders to match the number of encoders and uses skip connections between the two to concatenate the extracted features. Although U-Net is more accurate than FCN, it relies only on convolutional operations and still does not solve the problem of insufficient consideration of the global spatial relationships between pixels. Moreover, the redundancy of cutting image patches leads to longer model training. In order to adapt to the extraction tasks of different lake water bodies, researchers have improved the classical deep learning models, including replacing the encoder or decoder with another model [25,26,27,28,29,30,31], increasing or decreasing the number of model layers [32,33,34,35,36], and adding attention mechanisms [37,38,39,40,41,42]. In order to meet the requirements of rapid mapping while ensuring a certain degree of accuracy, many automated mapping methods for extracting lake water bodies have also emerged [43,44,45,46].
Past deep learning studies on lake water extraction often used convolution kernels of different sizes as the core of the encoder for extracting water features. Limited by the receptive field, however, these studies often could not learn the spatial relationships between multiple pixels well, resulting in omission and misclassification of water bodies [47,48]. As the amount of data in the classification task increases, the degree of omission and misclassification increases further. In this paper, the ViTenc-UNet model is proposed. This model replaces the convolutional layers at the core of the encoder in the original U-Net model with the Vision Transformer, which enhances the model's ability to capture the spatial relationships of continuous water. In addition, to reduce the loss of water information in different bands of the images, we add the Convolutional Block Attention Module (CBAM), a hybrid attention mechanism consisting of a channel attention module and a spatial attention module, to the decoder to increase the confidence weights of the spectral and spatial information of water in the model. Without increasing the number of network layers, this model innovatively applies the ViT layer and the CBAM hybrid attention mechanism to the identification and extraction of lake water bodies on the Qinghai-Tibet Plateau, which effectively improves the efficiency of large-scale lake water classification, enhances the extraction of lake water bodies on the Qinghai-Tibet Plateau, and provides effective help in monitoring the dynamic changes of these lake water bodies.
3. Methods
3.1. Modification of Original UNet
The presentation of the lake water surface in remote sensing images is affected by the terrain of the lake basin, as well as by lake composition, cloud cover, recharge water sources, human activities, and other factors. The lake surface offers rich spatial and spectral details [53]. Although a convolutional layer can identify and extract features for each pixel, the size of its receptive field means that the convolutional layers in the original U-Net encoder cannot effectively mine the contextual details between the complex continuous water pixels of the Qinghai-Tibet Plateau lakes. The result is a lack of lake water continuity, which is particularly obvious along complex lakeshores, coastal tidal flats, and bare ground within lakes. These three types of terrain are intertwined with the water bodies, so the water bodies appear cut into fragmented shapes in remote sensing images, which makes it difficult to accurately interpret the water pixels and capture their spatial continuity. To restore the spatial continuity of water body pixels and interpret them accurately, we applied the ViT structure to the ViTenc-UNet model encoder. The self-attention mechanism of ViT pays more attention to the global spatial relationships between long-distance continuous water pixels and preserves this information. ViT draws on the Transformer's treatment of word and sentence structure in the field of Natural Language Processing (NLP): it divides the input image into several "word embeddings". The spatial relationships and spectral features of the images are then processed in parallel, based on the semantic information in the embedded words, to completely extract the water body information of lakes on the Qinghai-Tibet Plateau. A convolutional layer added after the ViT output maps and constrains the feature dimensions of the images. In addition, inserting the CBAM mechanism retains the key weighting and confidence information of the lake water when the decoder restores the image features, which accurately completes the prediction and extraction of the lake water surface and effectively improves the recognition of different lakes.
The structure of ViTenc-UNet, an improved model based on U-Net, is shown in Figure 2. The ViTenc-UNet model includes an Inc layer that initializes the input image into a multidimensional matrix, an Out layer that constrains the feature dimensions of the predicted output image, four downsampling layers, four upsampling layers, and four skip connections. The Inc layer initializes the input image to be predicted. The downsampling layers extract the deep semantic features of the image, with the number of channels increasing successively through 64, 128, 256, and 512 to 1024. The upsampling layers restore the features of the image. The skip connections fuse and stitch together the output of each upsampling layer and the downsampling output of the corresponding dimension, which reduces the loss of feature information during downsampling. The Out layer maps and outputs the prediction results.
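To make this layout concrete, the following is a minimal PyTorch sketch of the encoder–decoder skeleton described above (Inc layer, four downsampling layers, four upsampling layers with skip connections, and an Out layer). Plain convolutional blocks stand in for the ViT- and CBAM-based layers detailed in the next sections, and the input channel count and class names are illustrative assumptions rather than the published implementation.

```python
# Structural sketch only: Inc -> 4 downsampling layers -> 4 upsampling
# layers with skip connections -> Out, with channels 64 -> 1024.
import torch
import torch.nn as nn

class DoubleConv(nn.Sequential):
    """Two consecutive 3x3 convolutions, each followed by ReLU."""
    def __init__(self, c_in, c_out):
        super().__init__(
            nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))

class UNetSkeleton(nn.Module):
    def __init__(self, in_ch=4, n_classes=1):       # in_ch is an assumption
        super().__init__()
        chs = [64, 128, 256, 512, 1024]              # channel progression
        self.inc = DoubleConv(in_ch, chs[0])         # Inc layer
        self.downs = nn.ModuleList(                  # 4 downsampling layers
            nn.Sequential(nn.MaxPool2d(2), DoubleConv(chs[i], chs[i + 1]))
            for i in range(4))
        self.ups = nn.ModuleList(                    # 4 upsampling layers
            nn.ConvTranspose2d(chs[4 - i], chs[3 - i], 2, stride=2)
            for i in range(4))
        self.fuse = nn.ModuleList(                   # after skip concatenation
            DoubleConv(chs[4 - i], chs[3 - i]) for i in range(4))
        self.out = nn.Conv2d(chs[0], n_classes, 1)   # Out layer (1x1 conv)

    def forward(self, x):
        feats = [self.inc(x)]
        for down in self.downs:                      # encoder path
            feats.append(down(feats[-1]))
        x = feats[-1]
        for i, (up, fuse) in enumerate(zip(self.ups, self.fuse)):
            x = up(x)                                # restore resolution
            x = torch.cat([feats[3 - i], x], dim=1)  # skip connection
            x = fuse(x)                              # reduce feature loss
        return self.out(x)                           # map prediction results
```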
3.2. Encoder
The ViTenc-UNet model encoder includes an initialization layer and four successive downsampling layers with the same structure, each of which includes a pooling layer, a ViT layer, and a convolutional layer, with different numbers of input and output channels. The initialization layer consists of two consecutive 3 × 3 convolutional layers, each activated non-linearly using the ReLU function.
The pooling layer in the downsampling layer consists of a 2 × 2 maximum pooling layer, which segments the input patch into several parts of the pooling size to reduce the spatial dimension of the image. The ViT, which extracts the deep semantic features of images in the downsampling layer, is a transformer-based image segmentation method [54] that processes image data into fixed-size sequences with a structure similar to that of text (words, sentences, paragraphs, etc.) and feeds the sequences into the Transformer layer to learn their characteristics. For an existing image, the principle can be expressed using the following equation:

$x \in \mathbb{R}^{H \times W \times C} \rightarrow x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$

where $x_p$ is the split patch sequence, $x$ is the original input image, $H$ is the height of the original input image, $W$ is the width of the original input image, and $C$ is the number of channels of the original input image. Here, to embed the image into the encoder, we split it into several image patches. Each patch has the same size, and the same number of channels as the original input image. The principle is as follows:

$N = \frac{HW}{P^2}$
where $N$ is the number of image patches, $P$ is the size of the image patch, and $C$ is the number of channels of the image patch (consistent with the number of channels of the original input image). An image patch has a size of ($P$, $P$, $C$) and is flattened into a vector of length $P^2 \cdot C$. Next, a linear transformation layer with dimension $D$ is used to map the vector, constraining the size of the input vector to form a $D$-dimensional image patch, similar to a "word embedding" of the image patch. The mapped image patch does not carry position information, so a positional encoding vector must be embedded into the image patch to represent its position. The principle is as follows:

$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/D}}\right), \quad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/D}}\right)$
where $PE_{(pos,\,2i)}$ represents the position encoding of the even columns, and $PE_{(pos,\,2i+1)}$ represents the position encoding of the odd columns. Next, we concatenate and fuse the position encoding with the image patch to form an image patch with position encoding. Before being passed into the encoder's Multihead Self-Attention (MSA) mechanism, the image patch is subjected to Layer Normalization to reduce the covariate shift inside the image patch. The principle is as follows:

$\mathrm{LN}(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$

where $x$ is the image patch input from the previous layer, and $\mathrm{LN}(x)$ is the normalized image patch ($\mu$ and $\sigma^2$ are the mean and variance of the patch features, and $\gamma$ and $\beta$ are learnable parameters).
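A compact sketch of this tokenization pipeline (patch splitting, flattening, linear "word embedding", sine/cosine position encoding on even/odd columns, and Layer Normalization) is given below; all tensor sizes are illustrative, and the sinusoidal form is assumed from the even/odd-column description above.

```python
# Hedged sketch: image -> patches -> D-dimensional tokens + position
# encoding -> LayerNorm, with illustrative sizes.
import math
import torch
import torch.nn as nn

def sinusoidal_pe(n_pos, dim):
    """Sine on even columns, cosine on odd columns."""
    pos = torch.arange(n_pos, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / dim))
    pe = torch.zeros(n_pos, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

B, C, H, W, P, D = 2, 64, 32, 32, 8, 256
x = torch.randn(B, C, H, W)                   # feature map entering the ViT
N = (H // P) * (W // P)                       # N = HW / P^2 patches
patches = x.unfold(2, P, P).unfold(3, P, P)   # (B, C, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, N, C * P * P)
tokens = nn.Linear(C * P * P, D)(patches)     # "word embedding" of patches
tokens = tokens + sinusoidal_pe(N, D)         # add position information
tokens = nn.LayerNorm(D)(tokens)              # pre-attention normalization
print(tokens.shape)                           # torch.Size([2, 16, 256])
```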
In the MSA mechanism, each image patch computes a Query (hereafter Q), a Key (hereafter K), and a Value (hereafter V). The dot product of Q and the transpose of K is computed, and the softmax function maps these products into the (0, 1) interval to obtain the attention weight between each image patch and every other image patch; these weights are multiplied by the corresponding V and summed to obtain a new image patch. The principle is as follows:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$

where $Q$ is the vector information to be queried (Query), $K$ is the queried vector index (Key), $V$ is the value obtained by the query (Value), and $d_k$ is the degree of scaling. The ViTenc-UNet model uses a learnable scaling coefficient, which is continuously adjusted during model training. The ViT structure is shown in Figure 3 [54].
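The attention computation itself reduces to a few lines. The sketch below uses the standard fixed $1/\sqrt{d_k}$ scaling; the learnable scaling coefficient the paper describes would replace that constant with a trained parameter.

```python
# Hedged sketch of scaled dot-product attention over patch tokens.
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # patch-to-patch products
    weights = F.softmax(scores, dim=-1)            # mapped into (0, 1)
    return weights @ v                             # re-weight the values

q = k = v = torch.randn(2, 16, 256)   # tokens from the previous sketch
out = scaled_dot_product_attention(q, k, v)        # same shape as the input
```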
Image patches weighted by the MSA are passed into the Feedforward Neural Network (FNN) for further feature extraction. The Transformer layer then outputs a new sequence that contains a high-level feature representation of the input image. For the classification and extraction of water body images, we used the first image patch in the sequence to obtain the final classification probabilities of water bodies and other features. Finally, the downsampling layer connects a 3 × 3 convolutional layer, which maps the output image feature matrix.
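Putting these pieces together, one downsampling layer (2 × 2 max pooling, a ViT layer, and a 3 × 3 convolution) could be sketched as follows. torch's built-in TransformerEncoderLayer stands in for the ViT layer (MSA plus FNN), and the patch size, embedding width, and head count are assumptions, not the paper's settings.

```python
# Hedged sketch of a ViT-based downsampling layer: pool -> tokenize ->
# transformer -> fold back -> 3x3 conv that maps the feature matrix.
import torch
import torch.nn as nn

class ViTDown(nn.Module):
    def __init__(self, c_in, c_out, patch=2, dim=256, heads=8):
        super().__init__()
        self.pool = nn.MaxPool2d(2)                       # 2x2 max pooling
        self.patch = patch
        self.embed = nn.Linear(c_in * patch * patch, dim)
        self.vit = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)   # MSA + FNN
        self.unembed = nn.Linear(dim, c_in * patch * patch)
        self.conv = nn.Conv2d(c_in, c_out, 3, padding=1)  # map feature matrix

    def forward(self, x):
        x = self.pool(x)
        b, c, h, w = x.shape
        p = self.patch
        t = x.unfold(2, p, p).unfold(3, p, p)             # split into patches
        t = t.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        t = self.vit(self.embed(t))                       # transformer layer
        t = self.unembed(t).reshape(b, h // p, w // p, c, p, p)
        x = t.permute(0, 3, 1, 4, 2, 5).reshape(b, c, h, w)  # fold back
        return self.conv(x)

x = torch.randn(2, 64, 64, 64)
print(ViTDown(64, 128)(x).shape)   # torch.Size([2, 128, 32, 32])
```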
3.3. Decoder
The ViTenc-UNet model decoder includes four upsampling layers, which successively reduce the feature dimensions, followed by the Out layer. Each upsampling layer includes a transposed convolutional layer, a bilinear interpolation layer, a CBAM attention layer, and two consecutive 3 × 3 convolutional layers. In addition, skip connections are used in the module to connect the feature dimensions of the downsampling and upsampling layers. The core of the upsampling layer consists of two parallel upsampling methods: transposed convolution and bilinear interpolation.
The ViTenc-UNet model's upsampling methods are Bilinear Interpolation [55] and Transposed Convolution [56]. By upsampling the input feature map, we restore it to its initial size. Specifically, bilinear interpolation obtains the pixel value of an unknown point from the pixel values of its four adjacent points by performing linear interpolation in two directions. The corresponding formula is as follows:

$f(x, y) = w_{11} f(Q_{11}) + w_{12} f(Q_{12}) + w_{21} f(Q_{21}) + w_{22} f(Q_{22})$

where $f(x, y)$ is the pixel value of the unknown point; $f(Q_{11})$, $f(Q_{12})$, $f(Q_{21})$, and $f(Q_{22})$ are the pixel values of the four known adjacent points; and $w_{11}$, $w_{12}$, $w_{21}$, and $w_{22}$ are the weight values of the four adjacent points, respectively. The weights change dynamically with the distance to the adjacent image points.
In addition, we use the transposed convolution method alongside bilinear interpolation. Transposed convolution differs from ordinary convolution: ordinary convolution transforms the input features such that the resulting feature dimension is reduced or unchanged to varying degrees, while transposed convolution enlarges the input feature dimension. However, the transposed convolution process does not restore the corresponding original cell values. After the upsampling process is completed, two consecutive convolution operations are performed on the image's feature map to extract its feature dimensions.
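In PyTorch terms, the two parallel upsampling paths correspond roughly to the following calls; shapes and channel counts are illustrative.

```python
# Hedged sketch of the two upsampling paths: fixed-weight bilinear
# interpolation versus learnable transposed convolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(2, 128, 16, 16)

# Bilinear interpolation: each new pixel is a distance-weighted
# combination of its four known neighbors.
up_bilinear = F.interpolate(x, scale_factor=2, mode="bilinear",
                            align_corners=True)

# Transposed convolution: a learnable kernel enlarges the feature map
# but does not restore the original cell values.
up_transposed = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)(x)

print(up_bilinear.shape, up_transposed.shape)
# torch.Size([2, 128, 32, 32]) torch.Size([2, 64, 32, 32])
```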
To better preserve the spatial and channel features, we need to weight both of them. Therefore, we next introduce the CBAM hybrid attention mechanism [57], which contains a Channel Attention Module and a Spatial Attention Module. The structure of this mechanism is shown in Figure 4 [57].
The channel attention module of this structure performs maximum pooling and average pooling operations on the input patches and then passes these two results into a Multi-Layer Perceptron (MLP) with two convolutional layers. Next, the maximum pooling feature and the average pooling feature processed by the MLP are summed, and sigmoid activation is applied to the sum to obtain the channel attention weighting coefficient. The principle is as follows:

$F' = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) \otimes F$

where $F'$ is the channel-attention-weighted feature dimension, $F$ is the input feature dimension, and $\sigma$ denotes the sigmoid function.
The spatial attention module takes the channel-attention-weighted features as its input patches, performs maximum pooling and average pooling operations along the channel dimension, and concatenates the two pooling results. After a 2 × 2 convolutional layer operation reduces the result to one dimension, a sigmoid function is applied to obtain the spatial attention weights. The principle is as follows:

$M_s(F') = \sigma\big(f\big([\mathrm{AvgPool}(F'); \mathrm{MaxPool}(F')]\big)\big)$

where $F'$ is the channel-attention-weighted feature used as input, $M_s(F')$ is the spatial attention weighting coefficient, and $f$ denotes the convolutional layer operation.
We then multiply this weight by the initial input feature dimension to obtain the channel- and spatially weighted feature dimensions. Finally, two consecutive 3 × 3 convolutional layers are used to map the feature dimensions of the output image. The output layer consists of a 1 × 1 convolutional layer.
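A hedged sketch of the CBAM module as described above follows. The channel-reduction ratio in the MLP is an assumption, and the spatial convolution kernel is left as a parameter (the original CBAM paper uses 7 × 7, while the text above describes a 2 × 2 layer).

```python
# Hedged CBAM sketch: channel attention (max/avg pool -> shared MLP ->
# sigmoid) followed by spatial attention (channel-wise max/avg pool ->
# concat -> conv -> sigmoid).
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(                   # shared two-layer MLP
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1))
        self.spatial = nn.Conv2d(2, 1, kernel, padding=kernel // 2)

    def forward(self, f):
        # Channel attention: sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F)))
        avg = self.mlp(f.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(f.amax(dim=(2, 3), keepdim=True))
        f = f * torch.sigmoid(avg + mx)
        # Spatial attention: pool along channels, concatenate, convolve
        s = torch.cat([f.mean(dim=1, keepdim=True),
                       f.amax(dim=1, keepdim=True)], dim=1)
        return f * torch.sigmoid(self.spatial(s))

x = torch.randn(2, 64, 32, 32)
print(CBAM(64)(x).shape)           # torch.Size([2, 64, 32, 32])
```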
3.4. Comparison
Five models, FCN, U-Net, DeepLabv3+ [58], TransUNet [59], and Swin-Unet [60], are included in the comparative experiments.
The FCN model consists of five convolution operations and two fully connected operations. In the five convolution operations, the image is scaled at each step, and the third and fourth operations also retain their feature maps. After the five scaling operations, the features are transformed by the fully connected operations of the sixth and seventh layers. The obtained features are then upsampled, and the retained feature maps of the fourth and third convolutional layers are used sequentially to supplement details during the restoration of the image. The structure is shown in Figure 5 [21].
The structure of the U-Net model includes four downsampling layers and four upsampling layers. Convolutional layers are used before the input to the downsampling layers and after the output of the upsampling layers to initialize the input matrix and map the output results. Each downsampling layer consists of a maximum pooling layer and two consecutive convolutional layers that extract the feature dimensions of the image. Each upsampling layer includes bilinear interpolation and transposed convolution, and a skip connection is used to fuse the output features of the symmetric downsampling layer. The structure is shown in Figure 6 [24].
The encoder of DeepLabv3+ first selects a low-level feature and uses a 1 × 1 convolutional layer to compress the channels and reduce their proportion, then gradually obtains the segmentation result through 3 × 3 convolutional layers. The decoder directly upsamples the encoder output by a factor of four to make its resolution consistent with the low-level feature. After the two features are fused and stitched, a 3 × 3 convolution is performed, followed by upsampling to obtain the predicted output. The structure is shown in Figure 7 [58].
TransUNet is a model based on modifications to U-Net: the numbers of downsampling and upsampling layers are reduced to three, and a ViT structure is added between the third downsampling layer and the first upsampling layer. Its encoder combines a convolutional neural network with a Transformer structure, and its decoder is a cascaded upsampler, with skip connections between the encoder and the decoder. The corresponding structure is shown in Figure 8 [59].
The overall structure of Swin-Unet is similar to that of U-Net, consisting of an encoder, a bottleneck block, and a decoder, with skip connections between the encoder and the decoder. The whole structure is shaped like a "U"; the Swin Transformer (Shifted Window Transformer) replaces all convolutional layers in the encoder and decoder, and the bottleneck block is composed directly of two connected Swin Transformers. The structure is shown in Figure 9 [60].
4. Experiments
The ViTenc-UNet model is built and developed with the PyTorch 1.10.2 software package. Based on the binary classification nature of this experiment, the ViTenc-UNet model uses the Binary Cross-Entropy Loss function (BCEL) as its loss algorithm. Moreover, the Adam optimizer is used for weight updates, and Sigmoid is used as the activation function. The learning rate is 0.01; the default number of iterations is 300 epochs; and to optimize memory use, the batch size is set to two. The model is trained on an NVIDIA GeForce RTX 3070 Ti (8 GB) GPU. After each training iteration is completed, the model saves the weights, loss value, and other metrics to the local computer and zeroes out the confusion matrix and loss before the next iteration starts. The loss function quantifies how close the model's predicted values are to the true values: the smaller the value, the better the model's predictions.
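A minimal sketch of this training configuration follows, assuming `model` and `train_loader` (batch size two) are defined elsewhere; `BCEWithLogitsLoss` is used here to combine the sigmoid activation and binary cross-entropy loss in one numerically stable call, and the checkpoint naming is illustrative.

```python
# Hedged sketch of the reported training setup: BCE loss on sigmoid
# outputs, Adam with lr = 0.01, 300 epochs, per-epoch checkpointing.
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)                      # e.g., the ViTenc-UNet model
criterion = nn.BCEWithLogitsLoss()            # sigmoid + binary cross-entropy
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(300):
    model.train()
    running_loss = 0.0
    for images, masks in train_loader:        # batches of two image tiles
        images, masks = images.to(device), masks.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), masks.float())
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    # Save weights and metrics, then reset accumulators for the next epoch.
    torch.save(model.state_dict(), f"checkpoint_epoch{epoch:03d}.pt")
```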
Figure 10C–E show a comparison of the prediction effects in some areas. Here, the extraction effect of ViTenc-UNet is significantly better than that of the unimproved U-Net and the MNDWI-threshold method: ViTenc-UNet generally retains the continuity and integrity of lake water bodies, while the other two show obvious misclassification of bare surfaces within lakes in addition to the omission of small lake water bodies.
Figure 11A,B show the changes in the training-set and validation-set losses during the training of each model. The losses of the training set and validation set decrease as the number of rounds increases, oscillating within an interval. From the 11th round of training onward, the model's loss tended to decrease slowly. At epoch 299, the training loss reached a minimum of 0.0041, the lowest compared to the training losses of the other five models. At epoch 296, the validation loss reached a minimum of 0.0058, the lowest compared to the validation losses of the other five models.
In order to comprehensively compare the performance of the ViTenc-UNet model with the other models in the water classification task, we used five indicators: Overall Accuracy (OA), Intersection over Union (IoU), Precision (P), Recall (R), and F1 Score (F1). The classification results are divided into four categories: correctly extracted water pixels are True Positives (TP), incorrectly extracted non-water pixels are False Positives (FP), correctly extracted non-water pixels are True Negatives (TN), and incorrectly extracted water pixels are False Negatives (FN).
OA is the ratio of the number of correctly extracted water pixels and correctly extracted non-water pixels to the total number of pixels:

$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN}$

IoU is the ratio of the intersection of the predicted and true values to the union of the two:

$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$

P is the ratio of the number of correctly extracted water pixels to all pixels classified as water:

$P = \frac{TP}{TP + FP}$

R is the ratio of the number of correctly extracted water pixels to all pixels that should be water pixels:

$R = \frac{TP}{TP + FN}$

The F1 score is the harmonic mean of precision (P) and recall (R):

$F1 = \frac{2 \times P \times R}{P + R}$
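All five indicators follow directly from the four confusion-matrix counts, for example:

```python
# Computing the five indicators from pixel counts TP, FP, TN, FN.
def water_metrics(tp, fp, tn, fn):
    oa = (tp + tn) / (tp + tn + fp + fn)      # Overall Accuracy
    iou = tp / (tp + fp + fn)                 # Intersection over Union
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return oa, iou, precision, recall, f1

# Illustrative counts only:
print(water_metrics(tp=9000, fp=80, tn=900, fn=120))
```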
Table 2 reports the quantitative results of the ViTenc-UNet model compared to the other models and the MNDWI-threshold method. The evaluation indicators of the ViTenc-UNet model are better than those of the other methods. Moreover, compared to the current advanced DeepLabv3+, TransUNet, and Swin-Unet models, the present model still achieves a better overall classification effect for lake water bodies. Across the five selected indicators, the overall accuracy, intersection over union, precision, recall, and F1 score of the ViTenc-UNet model reached 99.04%, 98.68%, 99.08%, 98.59%, and 98.75%, respectively, improvements of 4.16%, 6.20%, 5.34%, 4.80%, and 5.34% over the original U-Net model. All indicators were also significantly improved compared to the other methods. This shows that the ViTenc-UNet model performs well in identifying and extracting lakes on the Qinghai-Tibet Plateau.
For comparison, given the large number of datasets and limited space, Figure 12 presents actual images of five typical regions, together with the labels and the predictions produced by the above models in these regions. Figure 12 shows that all seven methods identified and extracted water bodies in the different regions, but the extraction effect of the ViTenc-UNet model is better than that of the other six methods. For areas with simple lake water bodies and lakeshores, ViTenc-UNet performs similarly to the other six methods and achieves the expected extraction effect. For complex lakeshore areas and areas with exposed reefs in the lake, the FCN, U-Net, and MNDWI-threshold methods show obvious misclassification and omission. The DeepLabv3+, TransUNet, and Swin-Unet models perform slightly worse, mainly through insufficient spatial continuity of the lake water bodies, with part of the bare surface misclassified as lake water. The ViTenc-UNet model performs well, showing the distribution of lake water bodies more objectively and realistically. For areas where the alluvial fans of recharge rivers and lakes interleave, the FCN and U-Net models cannot accurately distinguish water bodies from other features. The DeepLabv3+ and Swin-Unet models can identify most water bodies, but the many fine details of some lake water bodies still prevent them from meeting the classification requirements. TransUNet's extraction is slightly better, but its prediction of the spatial relationships between water bodies is still insufficient. Although the MNDWI-threshold method can extract the correct direction of simple water body boundaries, Figure 12 shows that its pixel predictions are incomplete; in addition, in the prediction of complex water body boundaries, its misclassifications and omissions are very obvious. The ViTenc-UNet model is significantly better than the other six methods in these areas: its prediction of the lake water body edges is more accurate, and it successfully captures the spatial continuity and spectral information of the lake water bodies.
5. Discussion
Traditional semantic segmentation tasks that classify objects such as roads [61], buildings [62], and farmland [63] generally involve relatively regular polygonal shapes, and the classification is comparatively simple. The water bodies of lakes on the Qinghai-Tibet Plateau are affected by water levels and complex terrain, and remote sensing images of different years present various irregular shapes, which poses great challenges to the accuracy of the corresponding classification tasks. In the past, the image labeling of lake water datasets generally used labeling software such as "LabelMe" to visually and manually label the target lake water body. In the labeling process, owing to the different shapes of the lake water bodies, the complex twists and turns of the lakeshores, and the varying mineral content of the lakes, water pixels are inevitably confused with other feature pixels, which is especially obvious in runs of more than two consecutive water pixels and in lakeshore mudflats. These misclassifications can lead to false positives during model training, thereby affecting model performance. To obtain the true water boundaries of the lakes on the Qinghai-Tibet Plateau, we mainly used the water index threshold method, supplemented by manual correction of unreasonable boundaries, which effectively reduces the working time of manual visual interpretation, so that this time can be spent more effectively on correcting water boundaries in certain regional details. The size of the dataset also affects the performance of the ViTenc-UNet model, because the core of the model's encoder, the Vision Transformer layer, lacks the inductive bias provided by a convolutional structure, and this absence of prior rules leads to poor performance when training samples are insufficient. Thus, we selected Sentinel-2 remote sensing images from multiple scenes to expand the scale of the dataset.
The Transformer layer was originally designed as a self-attention mechanism for word and sentence translation tasks in the field of natural language processing [64]. Compared to text data represented by words and sentences, remote sensing image data have more dimensions and more complex spatial relationships. Thus, the ViT used in this article must split the image into several "image patches", and the size of the input image patches depends on the size of the input image. To determine a suitable input size, we tested a number of commonly used image sizes. For this test, the remote sensing images used in this paper were divided into five sizes: 128, 256, 480, 512, and 600 pixels. The amount of information remained unchanged, and the dataset at each size was tested using ViTenc-UNet.
The test results are shown in Table 3. As the size of the images in the dataset was reduced to 256 × 256, the measured indicators showed an overall upward trend, with the 256 × 256 size being optimal. The recall of the 600 × 600 dataset was significantly lower than that of the other four sizes, which indicates that if the input image is too large, the initialized "image patches" are too large and unsuitable for the Transformer layer, resulting in omission of the target features. The 128 × 128 dataset was too small, its spatial relationships were too fragmented, and its test indicators were slightly lower than the optimal results.
FCN, U-Net, and DeepLabv3+ do not contain a Transformer layer and still use convolutional layers to extract image features. However, these models are limited by the small receptive field of the convolutional layer and cannot accurately reflect the spatial continuity between multiple water bodies. The models with Transformer layers hold different advantages in this lake water classification task, which can be traced to the position, role, and number of Transformer layers in each model. As the core of the encoder group, the Vision Transformer layer in the ViTenc-UNet model sits in the middle of each encoder group, replacing the convolutional layer that learns image features in the original U-Net model so as to learn the deep features of the image. The retained 1 × 1 convolutional layer is then used to constrain the feature dimensions of the output image, which preserves the spatial continuity of the image and reduces misclassification and omission. The ViT layer in the TransUNet model sits between the downsampling layers and the upsampling layers and does not replace the convolutional layers of the original downsampling layers; the essence of that encoder is still consecutive 3 × 3 convolutional layers, and the Vision Transformer layer and convolutional layers jointly participate in learning image features, with the confidence weighting applied to the features extracted by the convolutional layers. The Swin Transformer layer in the Swin-Unet model differs from the Vision Transformer layer in that the Swin Transformer's MSA operates within each window and its receptive field grows as the layers deepen, giving an overall "pyramid"-shaped structure, whereas the Vision Transformer's MSA weights global features, its receptive field is fixed, and its overall structure is columnar. The Swin Transformer in this model completely replaces the convolutional layers in the original downsampling and upsampling layers. Compared to the other two models containing a Transformer layer, the output of each sampling layer in the Swin-Unet model has no direct connection path to the preceding Swin Transformer layer, and the gradient flow is blocked by the layer normalization module, which can lead to vanishing gradients [60]. Its classification effect is therefore not as good as that of the ViTenc-UNet and TransUNet models, which contain fewer Transformer layers. In addition, too many Swin Transformer layers greatly increase the running time of the model.
An ablation experiment was conducted in this paper to explore the influence of the CBAM attention mechanism and the ViT layer on the model's improvement. We modified the U-Net source model to obtain two models for the ablation experiments: in the first, the encoder is the same as the ViTenc-UNet encoder, but the decoder, like U-Net's, does not contain the CBAM attention mechanism; in the second, the decoder is the same as the ViTenc-UNet decoder, but the encoder is the unmodified U-Net encoder. The results of the ablation experiment are shown in Table 4.
When only the CBAM attention mechanism was added to the model's decoder, all indicators except the IoU, which rose slightly, decreased. When only the ViT layer was added to the model's encoder, all evaluation indicators improved significantly, though still slightly less than those of ViTenc-UNet. This experiment showed that the ViT layer can effectively improve the recognition and extraction ability of the model. When added to the model as an independent structure, the CBAM attention mechanism did not effectively improve the model's recognition and extraction ability; however, as a supplementary structure, it improved the retention of the spatial relationship information and spectral information extracted by the model.
To verify the reliability and stability of ViTenc-UNet, we also tested the model on a public lake semantic segmentation dataset [65]. The results of this additional experiment are shown in Table 5.
In this experiment, the ViTenc-UNet model continued its excellent performance and demonstrated stable and reliable water boundary recognition, while FCN performed poorly and could not complete the prediction task well. The five indicators of ViTenc-UNet are superior to those of the other models; in particular, its IoU and F1 score are significantly higher. To some extent, these experimental results demonstrate that the ViTenc-UNet model has significant potential for identifying and extracting lake water boundaries. Although ViTenc-UNet's time cost for training and testing ranks only second, its performance on the five indicators is the best of the six models, which helps improve the extraction of lake water bodies. In future improvements, we will also consider reducing the model parameters to optimize runtime.
6. Conclusions
With the diversification of remote sensing data sources and increasingly rich remote sensing image information, traditional lake water extraction methods can no longer meet the current requirements for efficient and accurate monitoring of lake water bodies. To better cope with the task of extracting large lake water bodies and realize the high-precision extraction of various lake water bodies from the complex terrain of the Qinghai-Tibet Plateau, we propose an advanced model that uses the Vision Transformer as the main structure for image feature extraction. We named this new model ViTenc-UNet. ViTenc-UNet replaces the convolutional layers used to extract image features in traditional semantic segmentation encoders with the Vision Transformer structure, which accurately preserves the spatial continuity of the lake water bodies in remote sensing images and reduces interference from noise points and the salt-and-pepper phenomenon on lake water information. We also added a spatial attention mechanism and a spectral attention mechanism to the decoder to fully mine the spatial and multispectral information of the water bodies in the images. To verify the performance of ViTenc-UNet, we collected data from multiple typical lake groups in the inflow and outflow areas of the Qinghai-Tibet Plateau and completed the experiments using these datasets. In these test areas, compared to other semantic segmentation models, the ViTenc-UNet model achieved advantages of varying degrees across multiple indicators, showing excellent performance in the semantic segmentation of lake water.
In future work, we will continue to improve the ViTenc-UNet model, including, but not limited to, the following: improving the model with respect to the position, number, and role of the Transformer structure; adjusting the number of layers and the skip connections between the encoder and decoder; and using a lightweight encoder and decoder to optimize the model structure and reduce the training time. We will also expand the scope of the dataset, increase the proportion of lake water bodies from different topographic and geomorphological environments, and extend the generalizability of the model to more lake water bodies.